[PATCH v2 00/25] TCP Extra-option framework for TCP-MD5 and SMC
by Christoph Paasch
Finally, the second version of this patch-set.
At netdev I talked about this with Eric, and he is looking forward to seeing
TCP-MD5 move out of the hot TCP data-path. He was already planning to avoid the
TCP-MD5 overhead with static jump-labels and to move TCP-MD5 into a separate file.
There are quite a few changes in this new set. I extended the extra-option
framework much further, making the option list per-socket in the
request-socket, tcp-socket, and time-wait-socket. This eliminates the need
for RCU-locking. See patch "tcp_extra_options: Make extra-option list per-socket".
I also added response_write/response_prepare callbacks that can be called from
tcp_v4_send_ack/tcp_v4_send_reset and their v6 counterparts. This is needed
because in these code-paths the skb and sk have entirely different semantics.
Finally, in the input-path I added a check callback and a post_process callback.
This is needed because, after parsing the options in tcp_parse_options(), we also
need to act on them. The actions can happen either before any other checks (e.g.,
as is the case for TCP-MD5, because we want to drop segments early) or after the
validity checks (e.g., as is needed for SMC).
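The input-path ordering described above (parse, then an early check that may drop the segment, then the regular validity checks, then post_process) can be sketched in plain C. The struct name, signatures, and return conventions below are invented for illustration and are not the actual kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-option callbacks; names and signatures are
 * illustrative only, not the in-kernel tcp_extra_options API. */
struct extra_opt_ops {
	int (*parse)(int opt);         /* called from option parsing */
	int (*check)(int opt);         /* runs before other checks, may drop */
	void (*post_process)(int opt); /* runs after the validity checks */
};

static int post_processed;

/* TCP-MD5-style: treat a nonzero option as "present" ... */
static int md5_parse(int opt) { return opt != 0; }
/* ... and drop the segment early unless the (fake) key matches. */
static int md5_check(int opt) { return opt == 42 ? 0 : -1; }
/* SMC-style: act on the option only after the segment is accepted. */
static void smc_post(int opt) { (void)opt; post_processed = 1; }

/* Returns 0 if the segment is accepted, -1 if dropped early. */
static int input_path(const struct extra_opt_ops *ops, int opt)
{
	if (ops->parse && !ops->parse(opt))
		return 0;               /* option absent: nothing to do */
	if (ops->check && ops->check(opt) < 0)
		return -1;              /* dropped before validity checks */
	/* ... regular TCP validity checks would run here ... */
	if (ops->post_process)
		ops->post_process(opt); /* after the checks, as for SMC */
	return 0;
}
```

With an ops table of { md5_parse, md5_check, smc_post }, a segment with a bad key is dropped before post_process ever runs, which is the ordering the check/post_process split is meant to provide.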
I also opted SMC into the new framework, and I made the TCP-MD5 input
code-path use it.
There are still a few things that I need to double-check, notably
sk_nocaps_add() in TCP_MD5.
And, I need to test ;-)
But I would like to have some early feedback already - especially on the patches
prefixed with "tcp_extra_options". If they look good, I can merge them into Mat's
patch, which will simplify my patch-juggling.
I want to get this patch-set out to netdev as an RFC as quickly as possible, so
that we can make some progress.
Christoph Paasch (24):
tcp: Write options after the header has been fully done
tcp: Pass sock to tcp_options_write instead of tcp_sock
tcp: Pass skb in tcp_options_write
tcp: Allow tcp_fast_parse_options to drop segments
tcp: Make smc_parse_options return 1 on success
tcp_md5: Don't pass along md5-key
tcp_md5: Detect key inside tcp_v4_send_ack instead of passing it as an
argument
tcp_md5: Detect key inside tcp_v6_send_response instead of passing it
as an argument
tcp_extra_options: Extend tcp_extra_options_write
tcp_extra_options: Make extra-option list per-socket
tcp_extra_options: Export module fixes
tcp_extra_options: Add response_prepare callback
tcp_extra_options: Check static branch before _parse
tcp_extra_options: Skip fast-path when extra-options are present
tcp_extra_options: Pass sk to tcp_parse_options instead of tp
tcp_extra_options: Add check-callback right after parsing the option
tcp_extra_options: Allow to parse experimental options
tcp_extra_options: Add post_process callback
tcp_smc: Make SMC use TCP extra-option framework
tcp_md5: Check for TCP_MD5 after TCP Timestamps in
tcp_established_options
tcp_md5: Move TCP-MD5 code out of TCP itself
tcp_md5: Use tcp_extra_options in output path
tcp_md5: Cleanup TCP-code
tcp_md5: Use TCP extra-options on the input path
Mat Martineau (1):
tcp: Register handlers for extra TCP options
drivers/infiniband/hw/cxgb4/cm.c | 2 +-
include/linux/inet_diag.h | 1 +
include/linux/tcp.h | 41 +-
include/linux/tcp_md5.h | 38 ++
include/net/inet_sock.h | 3 +-
include/net/tcp.h | 216 ++++---
net/ipv4/Makefile | 1 +
net/ipv4/syncookies.c | 6 +-
net/ipv4/tcp.c | 372 ++++++++---
net/ipv4/tcp_diag.c | 81 +--
net/ipv4/tcp_input.c | 134 ++--
net/ipv4/tcp_ipv4.c | 553 ++--------------
net/ipv4/tcp_md5.c | 1317 ++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_minisocks.c | 76 +--
net/ipv4/tcp_output.c | 183 +-----
net/ipv6/syncookies.c | 6 +-
net/ipv6/tcp_ipv6.c | 386 ++---------
net/smc/af_smc.c | 170 ++++-
18 files changed, 2148 insertions(+), 1438 deletions(-)
create mode 100644 include/linux/tcp_md5.h
create mode 100644 net/ipv4/tcp_md5.c
--
2.15.0
[PATCH] Revert tcp_skb_cb to its original size
by Shoaib Rao
Christoph,
Following is a patch based on branch mptcp_v0.91.
I have looked into the issue that you pointed out. It is the same as the partial-ACK issue: if there is no partial ACK, everything works, and this patch covers all the non-partial-ACK cases.
For the partial-ACK case the fix is very simple. We can send the packet out without any mapping, and the previous mapping should cover it. We can also update the mapping, i.e., create a sub-mapping.
The receiver code (detect_mapping()) is very implementation-specific. I will change it to accept two more mapping types: a sub-mapping that does not violate an existing mapping, and a super-mapping.
I will provide that patch later. Also note that the fields used in this patch are specific to this release.
Shoaib
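The sub-mapping/super-mapping acceptance rule described above can be modeled with plain sequence-number arithmetic. This is only an illustration of the rule (it ignores sequence-number wraparound); classify_mapping() and the enum are hypothetical names, not part of the MPTCP code:

```c
#include <assert.h>
#include <stdint.h>

enum map_kind { MAP_SAME, MAP_SUB, MAP_SUPER, MAP_VIOLATION };

/* A new mapping [seq, end) is acceptable against an existing mapping
 * [old_seq, old_end) if it is identical, lies entirely inside it
 * (sub-mapping), or entirely contains it (super-mapping).  Anything
 * partially overlapping violates the existing mapping. */
static enum map_kind classify_mapping(uint32_t old_seq, uint32_t old_end,
				      uint32_t seq, uint32_t end)
{
	if (seq == old_seq && end == old_end)
		return MAP_SAME;
	if (seq >= old_seq && end <= old_end)
		return MAP_SUB;
	if (seq <= old_seq && end >= old_end)
		return MAP_SUPER;
	return MAP_VIOLATION;
}
```

A receiver following this rule would accept the first three kinds and discard only the last, which matches the planned relaxation of detect_mapping() sketched in the mail above.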
---
include/linux/skbuff.h | 2 +-
include/net/tcp.h | 32 +++++++++-------
net/mptcp/mptcp_output.c | 97 ++++++++++++++++++++++++++++++------------------
3 files changed, 80 insertions(+), 51 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f66cd5e..ca2e26a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -540,7 +540,7 @@ struct sk_buff {
* want to keep them across layers you have to do a skb_clone()
* first. This is owned by whoever has the skb queued ATM.
*/
- char cb[56] __aligned(8);
+ char cb[48] __aligned(8);
unsigned long _skb_refdst;
void (*destructor)(struct sk_buff *skb);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 655ecd4..b476e86 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -842,14 +842,18 @@ struct tcp_skb_cb {
__u32 tcp_gso_segs;
};
-#ifdef CONFIG_MPTCP
- __u8 mptcp_flags; /* flags for the MPTCP layer */
- __u8 dss_off; /* Number of 4-byte words until
- * seq-number */
-#endif
__u8 tcp_flags; /* TCP header flags. (tcp[13]) */
- __u8 sacked; /* State flags for SACK/FACK. */
+ /* Below union is fine as the skb will use one or the other
+ * The SKB in the rtx queue will set sacked and does not care
+ * about dss_off.
+ * Eventually dss_off will not be needed.
+ */
+
+ union {
+ __u8 sacked; /* State flags for SACK/FACK. */
+ __u8 dss_off;
+ };
#define TCPCB_SACKED_ACKED 0x01 /* SKB ACK'd by a SACK block */
#define TCPCB_SACKED_RETRANS 0x02 /* SKB retransmitted */
#define TCPCB_LOST 0x04 /* SKB is lost */
@@ -860,8 +864,14 @@ struct tcp_skb_cb {
TCPCB_REPAIRED)
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
- /* 1 byte hole */
- __u32 ack_seq; /* Sequence number ACK'd */
+
+ __u8 mptcp_flags;
+ union {
+ __u32 ack_seq; /* Sequence number ACK'd */
+ __u32 mptcp_data_seq;
+ __u32 path_mask;
+
+ };
union {
union {
struct inet_skb_parm h4;
@@ -869,12 +879,6 @@ struct tcp_skb_cb {
struct inet6_skb_parm h6;
#endif
} header; /* For incoming frames */
-#ifdef CONFIG_MPTCP
- union { /* For MPTCP outgoing frames */
- __u32 path_mask; /* paths that tried to send this skb */
- __u32 dss[6]; /* DSS options */
- };
-#endif
};
};
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 691ef6f..5318459 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -57,34 +57,13 @@ EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
/* get the data-seq and end-data-seq and store them again in the
* tcp_skb_cb
*/
+/* Rao: May need work */
static bool mptcp_reconstruct_mapping(struct sk_buff *skb)
{
- const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
- u32 *p32;
- u16 *p16;
-
if (!mptcp_is_data_seq(skb))
return false;
- if (!mpdss->M)
- return false;
-
- /* Move the pointer to the data-seq */
- p32 = (u32 *)mpdss;
- p32++;
- if (mpdss->A) {
- p32++;
- if (mpdss->a)
- p32++;
- }
-
- TCP_SKB_CB(skb)->seq = ntohl(*p32);
-
- /* Get the data_len to calculate the end_data_seq */
- p32++;
- p32++;
- p16 = (u16 *)p32;
- TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
+ TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->mptcp_data_seq;
return true;
}
@@ -182,7 +161,7 @@ static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk
/* Segment goes back to the MPTCP-layer. So, we need to zero the
* path_mask/dss.
*/
- memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
+ TCP_SKB_CB(skb)->path_mask = 0;
/* We need to find out the path-mask from the meta-write-queue
* to properly select a subflow.
@@ -319,25 +298,30 @@ combine:
}
}
-static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
- __be32 *ptr)
+
+static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
+ const struct sk_buff *skb, __be32 *ptr)
{
const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
__be32 *start = ptr;
__u16 data_len;
- *ptr++ = htonl(tcb->seq); /* data_seq */
+ *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
/* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
if (mptcp_is_data_fin(skb) && skb->len == 0)
*ptr++ = 0; /* subseq */
else
- *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
+ *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
- if (tcb->mptcp_flags & MPTCPHDR_INF)
+ if (tcb->mptcp_flags & MPTCPHDR_INF) {
data_len = 0;
- else
+ } else {
data_len = tcb->end_seq - tcb->seq;
+ /* mptcp_entail_skb adds one for FIN */
+ if (tcb->tcp_flags & TCPHDR_FIN)
+ data_len -= 1;
+ }
if (tp->mpcb->dss_csum && data_len) {
__be16 *p16 = (__be16 *)ptr;
@@ -395,6 +379,47 @@ static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_b
* To avoid this we save the initial DSS mapping which allows us to
* send the same DSS mapping even for fragmented retransmits.
*/
+#if 0
+static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
+ __be32 *ptr)
+{
+ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+ __be32 *start = ptr;
+ __u16 data_len;
+
+ *ptr++ = htonl(tcb->seq); /* data_seq */
+
+ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
+ if (mptcp_is_data_fin(skb) && skb->len == 0)
+ *ptr++ = 0; /* subseq */
+ else
+ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
+
+ if (tcb->mptcp_flags & MPTCPHDR_INF)
+ data_len = 0;
+ else
+ data_len = tcb->end_seq - tcb->seq;
+
+ if (tp->mpcb->dss_csum && data_len) {
+ __be16 *p16 = (__be16 *)ptr;
+ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
+ __wsum csum;
+
+ *ptr = htonl(((data_len) << 16) |
+ (TCPOPT_EOL << 8) |
+ (TCPOPT_EOL));
+ csum = csum_partial(ptr - 2, 12, skb->csum);
+ p16++;
+ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
+ } else {
+ *ptr++ = htonl(((data_len) << 16) |
+ (TCPOPT_NOP << 8) |
+ (TCPOPT_NOP));
+ }
+
+ return ptr - start;
+}
+
static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
{
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
@@ -424,6 +449,7 @@ static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *s
return mptcp_dss_len/sizeof(*ptr);
}
+#endif
static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
{
@@ -469,8 +495,8 @@ static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
if (mptcp_is_data_fin(subskb))
mptcp_combine_dfin(subskb, meta_sk, sk);
- mptcp_save_dss_data_seq(tp, subskb);
-
+ TCP_SKB_CB(subskb)->mptcp_flags |= MPTCPHDR_SEQ;
+ tcb->mptcp_data_seq = tcb->seq;
tcb->seq = tp->write_seq;
/* Take into account seg len */
@@ -1197,10 +1223,9 @@ void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
}
if (OPTION_DATA_ACK & opts->mptcp_options) {
- if (!mptcp_is_data_seq(skb))
- ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
- else
- ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
+ if (mptcp_is_data_seq(skb))
+ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
}
if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
struct mp_prio *mpprio = (struct mp_prio *)ptr;
--
2.7.4
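The constraint driving this revert is that struct tcp_skb_cb must fit inside skb->cb[], which the patch shrinks back to 48 bytes. A userspace model of the reworked layout shows how the unions keep the MPTCP fields within that budget; the field sizes below mirror the patch, but the struct is a simplified stand-in, not the real tcp_skb_cb:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the patched tcp_skb_cb: by overlaying
 * sacked/dss_off in one union and ack_seq/mptcp_data_seq/path_mask in
 * another, the MPTCP state no longer costs extra bytes, so cb[] can
 * go back to 48.  The header blob stands in for the inet(6)_skb_parm
 * union; exact kernel sizes may differ. */
struct model_tcp_skb_cb {
	uint32_t seq;
	uint32_t end_seq;
	uint32_t tcp_gso_segs;
	uint8_t tcp_flags;
	union { uint8_t sacked; uint8_t dss_off; }; /* rtx queue vs. DSS */
	uint8_t ip_dsfield;
	uint8_t mptcp_flags;                        /* fills the old hole */
	union {
		uint32_t ack_seq;
		uint32_t mptcp_data_seq;
		uint32_t path_mask;
	};
	unsigned char header[24]; /* stand-in for incoming-frame state */
};
```

The point of the exercise is the size check: everything fits in the original 48-byte control buffer, so the skb->cb enlargement can be reverted.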
Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
by cpaasch@apple.com
On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> +Ilya and Liran
>
> Hi,
>
> > -----Original Message-----
> > From: cpaasch(a)apple.com [mailto:cpaasch@apple.com]
> > Sent: Thursday, November 09, 2017 13:13
> > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
> > <borisp(a)mellanox.com>
> > Cc: mptcp(a)lists.01.org
> > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> >
> > +Boris
> >
> > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > without significant impact on systemwide memory use. However, additional
> > > per-packet state is needed for some protocols, like Multipath TCP.
> > >
> > > An optional shared control buffer placed after the normal struct
> > > skb_shared_info can accommodate the necessary state without imposing
> > > extra memory usage or code changes on normal struct sk_buff
> > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > available, and cleared if it is not.
> > >
> > > pskb_expand_head will preserve the shared control buffer if it is present.
> > >
> > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > ---
> > > include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > net/core/skbuff.c | 56 ++++++++++++++++++++++++++++++++++++++-------
> > -----
> > > 2 files changed, 66 insertions(+), 14 deletions(-)
> >
> > Boris, below is the change I mentioned to you.
> >
> > It allows allocating 48 additional bytes on-demand after skb_shared_info.
> > As it is on-demand, it won't increase the size of the skb for other users.
> >
> > For example, TLS could start using it when it creates the skb that it
> > pushes down to the TCP-stack. That way you don't need to handle the
> > tls_record lists.
> >
>
> One small problem is that TLS doesn't create SKBs. As a ULP it calls the transport send
> functions (do_tcp_sendpages for TLS). This function receives a page and not an SKB.
yes, that's a good point. Mat has another patch as part of this series
that adds an skb-arg to sendpages
(https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
That should do the job for you.
> We decided not to create the SKB outside of the TCP layer to reduce the number of
> changes we made to TCP.
>
> It would be nice if we could use something like that. Did you talk to DaveM about
> upstreaming this?
No, we haven't yet talked to anyone outside of this list about it.
We were looking for a user of it outside of MPTCP but couldn't find one.
Now, it seems like we found one that would be interested :)
I think this infrastructure here would simplify your code quite a bit, no?
> We will definitely find it useful for the receive side, where we allocate the SKB in the driver.
Interesting! So even there you could use it. We were under the impression
that it would be of less interest for the receive-side.
Christoph
>
> > See below for the rest of the patch.
> >
> >
> > Christoph
> >
> >
> > >
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 03634ec2f918..873910c66df9 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct
> > sk_buff *skb,
> > > * the end of the header data, ie. at skb->end.
> > > */
> > > struct skb_shared_info {
> > > - __u8 __unused;
> > > + __u8 is_ext:1,
> > > + __unused:7;
> > > __u8 meta_len;
> > > __u8 nr_frags;
> > > __u8 tx_flags;
> > > @@ -530,6 +531,24 @@ struct skb_shared_info {
> > > #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
> > >
> > >
> > > +/* This is an extended version of skb_shared_info, also invariant across
> > > + * clones and living at the end of the header data.
> > > + */
> > > +struct skb_shared_info_ext {
> > > + /* skb_shared_info must be the first member */
> > > + struct skb_shared_info shinfo;
> > > +
> > > + /* This is the shared control buffer. It is similar to sk_buff's
> > > + * control buffer, but is shared across clones. It must not be
> > > + * modified when multiple sk_buffs are referencing this structure.
> > > + */
> > > + char shcb[48];
> > > +};
> > > +
> > > +#define SKB_SHINFO_EXT_OVERHEAD \
> > > + (SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> > > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> > > +
> > > enum {
> > > SKB_FCLONE_UNAVAILABLE, /* skb has no fclone (from head_cache)
> > */
> > > SKB_FCLONE_ORIG, /* orig skb (from fclone_cache) */
> > > @@ -856,6 +875,7 @@ struct sk_buff {
> > > #define SKB_ALLOC_FCLONE 0x01
> > > #define SKB_ALLOC_RX 0x02
> > > #define SKB_ALLOC_NAPI 0x04
> > > +#define SKB_ALLOC_SHINFO_EXT 0x08
> > >
> > > /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> > > static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> > > @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const
> > struct sk_buff *skb)
> > >
> > > /* Internal */
> > > #define skb_shinfo(SKB) ((struct skb_shared_info
> > *)(skb_end_pointer(SKB)))
> > > +#define skb_shinfo_ext(SKB) \
> > > + ((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
> > >
> > > static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff
> > *skb)
> > > {
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 40717501cbdd..397edd5c0613 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t
> > flags, int node,
> > > * instead of head cache and allocate a cloned (child) skb.
> > > * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
> > > * allocations in case the data is required for writeback
> > > + * If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> > > + * with an extended shared info struct.
> > > * @node: numa node to allocate memory on
> > > *
> > > * Allocate a new &sk_buff. The returned buffer has no headroom and a
> > > @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t
> > gfp_mask,
> > > int flags, int node)
> > > {
> > > struct kmem_cache *cache;
> > > - struct skb_shared_info *shinfo;
> > > struct sk_buff *skb;
> > > u8 *data;
> > > + unsigned int shinfo_size;
> > > bool pfmemalloc;
> > >
> > > cache = (flags & SKB_ALLOC_FCLONE)
> > > @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size,
> > gfp_t gfp_mask,
> > > /* We do our best to align skb_shared_info on a separate cache
> > > * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
> > > * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > > - * Both skb->head and skb_shared_info are cache line aligned.
> > > + * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> > > + * cache line aligned.
> > > */
> > > size = SKB_DATA_ALIGN(size);
> > > - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > - data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > + if (flags & SKB_ALLOC_SHINFO_EXT)
> > > + shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info_ext));
> > > + else
> > > + shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info));
> > > + data = kmalloc_reserve(size + shinfo_size, gfp_mask, node,
> > &pfmemalloc);
> > > if (!data)
> > > goto nodata;
> > > /* kmalloc(size) might give us more room than requested.
> > > * Put skb_shared_info exactly at the end of allocated zone,
> > > * to allow max possible filling before reallocation.
> > > */
> > > - size = SKB_WITH_OVERHEAD(ksize(data));
> > > + size = ksize(data) - shinfo_size;
> > > prefetchw(data + size);
> > >
> > > /*
> > > @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t
> > gfp_mask,
> > > */
> > > memset(skb, 0, offsetof(struct sk_buff, tail));
> > > /* Account for allocated memory : skb + skb->head */
> > > - skb->truesize = SKB_TRUESIZE(size);
> > > + if (flags & SKB_ALLOC_SHINFO_EXT)
> > > + skb->truesize = SKB_TRUESIZE(size) +
> > SKB_SHINFO_EXT_OVERHEAD;
> > > + else
> > > + skb->truesize = SKB_TRUESIZE(size);
> > > skb->pfmemalloc = pfmemalloc;
> > > refcount_set(&skb->users, 1);
> > > skb->head = data;
> > > @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size,
> > gfp_t gfp_mask,
> > > skb->transport_header = (typeof(skb->transport_header))~0U;
> > >
> > > /* make sure we initialize shinfo sequentially */
> > > - shinfo = skb_shinfo(skb);
> > > - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > - atomic_set(&shinfo->dataref, 1);
> > > - kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > + if (flags & SKB_ALLOC_SHINFO_EXT) {
> > > + struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> > > + shinfo_ext->shinfo.is_ext = 1;
> > > + memset(&shinfo_ext->shinfo.meta_len, 0,
> > > + offsetof(struct skb_shared_info, dataref) -
> > > + offsetof(struct skb_shared_info, meta_len));
> > > + atomic_set(&shinfo_ext->shinfo.dataref, 1);
> > > + kmemcheck_annotate_variable(shinfo_ext-
> > >shinfo.destructor_arg);
> > > + memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> > > + } else {
> > > + struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > + memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > + atomic_set(&shinfo->dataref, 1);
> > > + kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > + }
> > >
> > > if (flags & SKB_ALLOC_FCLONE) {
> > > struct sk_buff_fclones *fclones;
> > > @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > > {
> > > int i, osize = skb_end_offset(skb);
> > > int size = osize + nhead + ntail;
> > > + int shinfo_size;
> > > long off;
> > > u8 *data;
> > >
> > > @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > >
> > > if (skb_pfmemalloc(skb))
> > > gfp_mask |= __GFP_MEMALLOC;
> > > - data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)),
> > > - gfp_mask, NUMA_NO_NODE, NULL);
> > > + if (skb_shinfo(skb)->is_ext)
> > > + shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info_ext));
> > > + else
> > > + shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info));
> > > + data = kmalloc_reserve(size + shinfo_size, gfp_mask,
> > NUMA_NO_NODE, NULL);
> > > if (!data)
> > > goto nodata;
> > > - size = SKB_WITH_OVERHEAD(ksize(data));
> > > + size = ksize(data) - shinfo_size;
> > >
> > > /* Copy only real data... and, alas, header. This should be
> > > * optimized for the cases when header is void.
> > > @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > > memcpy((struct skb_shared_info *)(data + size),
> > > skb_shinfo(skb),
> > > offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> > > + if (skb_shinfo(skb)->is_ext) {
> > > + int offset = offsetof(struct skb_shared_info_ext, shcb);
> > > + memcpy((struct skb_shared_info_ext *)(data + size + offset),
> > > + &skb_shinfo_ext(skb)->shcb,
> > > + sizeof(skb_shinfo_ext(skb)->shcb));
> > > + }
> > >
> > > /*
> > > * if shinfo is shared we must drop the old head gracefully, but if it
> > > --
> > > 2.14.2
> > >
> > > _______________________________________________
> > > mptcp mailing list
> > > mptcp(a)lists.01.org
> > >
> > > https://lists.01.org/mailman/listinfo/mptcp
Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
by cpaasch@apple.com
On 09/11/17 - 07:31:40, Ilya Lesokhin wrote:
> One of the issues I see with TLS is that we need to update
> this control buffer when SKBs are split or merged.
> Have you given any thought to how it should be done?
>
> In any case, having a hint with a pointer to the record could help TLS
> even if the hint is not always accurate.
This would still incur the cost of maintaining and allocating the records in
the list. Do you have an estimate as to the cost of it in terms of CPU
cycles? (should be possible to have a rough measurement with perf)
In general, there seems to be a need for adding meta-data to skbs. (I just
looked at skb_shared_info->meta_len, which was also added recently for XDP.)
With all these use-cases, it might be worth having something clean to store
this meta-data.
Another idea we had was to store the meta-data somewhere in the linear
memory of the skb (e.g., between skb->head and skb->data or between
skb->end and skb->tail).
Christoph
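The headroom idea mentioned in that mail can be modeled in userspace: reserve a fixed meta-data area between head and data at allocation time, so the meta-data travels with the linear buffer. All names and the 48-byte size below are illustrative assumptions, not an actual skb API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of an skb's linear buffer: meta-data is carved out of the
 * headroom, i.e. the region [head, data), instead of growing
 * skb_shared_info at the tail. */
struct model_skb {
	unsigned char *head;
	unsigned char *data;
	unsigned char *end;
};

#define META_LEN 48 /* assumed meta-data budget, mirroring shcb[48] */

static struct model_skb *model_alloc(size_t size)
{
	struct model_skb *skb = malloc(sizeof(*skb));

	skb->head = malloc(META_LEN + size);
	skb->data = skb->head + META_LEN; /* meta-data lives in headroom */
	skb->end = skb->data + size;      /* payload area stays intact */
	return skb;
}

static void *model_meta(struct model_skb *skb)
{
	return skb->head; /* [head, data) holds the per-packet meta-data */
}
```

Compared to the skb_shared_info_ext approach, this keeps the tail layout untouched, at the price of consuming headroom that header-pushing code would otherwise use.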
>
> > -----Original Message-----
> > From: Boris Pismenny
> > Sent: Thursday, November 09, 2017 8:48 AM
> > To: cpaasch(a)apple.com
> > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>; mptcp(a)lists.01.org
> > Subject: RE: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> >
> >
> >
> > > -----Original Message-----
> > > From: cpaasch(a)apple.com [mailto:cpaasch@apple.com]
> > > Sent: Thursday, November 09, 2017 13:48
> > > To: Boris Pismenny <borisp(a)mellanox.com>
> > > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>;
> > > mptcp(a)lists.01.org
> > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > >
> > > On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> > > > +Ilya and Liran
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: cpaasch(a)apple.com [mailto:cpaasch@apple.com]
> > > > > Sent: Thursday, November 09, 2017 13:13
> > > > > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris
> > > Pismenny
> > > > > <borisp(a)mellanox.com>
> > > > > Cc: mptcp(a)lists.01.org
> > > > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > > > >
> > > > > +Boris
> > > > >
> > > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > > The sk_buff control buffer is of limited size, and cannot be
> > > > > > enlarged without significant impact on systemwide memory use.
> > > > > > However,
> > > additional
> > > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > > >
> > > > > > An optional shared control buffer placed after the normal struct
> > > > > > skb_shared_info can accommodate the necessary state without
> > > > > > imposing extra memory usage or code changes on normal struct
> > > > > > sk_buff users. __alloc_skb will now place a skb_shared_info_ext
> > > > > > structure at
> > > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of
> > > struct
> > > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure
> > > > > > is available, and cleared if it is not.
> > > > > >
> > > > > > pskb_expand_head will preserve the shared control buffer if it is
> > present.
> > > > > >
> > > > > > Signed-off-by: Mat Martineau
> > > > > > <mathew.j.martineau(a)linux.intel.com>
> > > > > > ---
> > > > > > include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > > > net/core/skbuff.c | 56
> > ++++++++++++++++++++++++++++++++++++++--
> > > -----
> > > > > -----
> > > > > > 2 files changed, 66 insertions(+), 14 deletions(-)
> > > > >
> > > > > Boris, below is the change I mentioned to you.
> > > > >
> > > > > It allows to allocate 48 additional bytes on-demand after skb_shared_info.
> > > > > As it is on-demand, it won't increase the size of the skb for other users.
> > > > >
> > > > > For example, TLS could start using it when it creates the skb that
> > > > > it pushes down to the TCP-stack. That way you don't need to handle
> > > > > the tls_record lists.
> > > > >
> > > >
> > > > One small problem is that TLS doesn't create SKBs. As a ULP it calls
> > > > the
> > > transport send
> > > > Functions (do_tcp_sendpages for TLS). This function receives a page
> > > > and not a
> > > SKB.
> > >
> > > yes, that's a good point. Mat has another patch as part of this
> > > series, that adds an skb-arg to sendpages
> > >
> > > (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
> > >
> > > That should do the job for you.
> > >
> > > > We decided not to create the SKB outside of the TCP layer to reduce
> > > > the
> > > number of
> > > > changes we made to TCP.
> > > >
> > > > It would be nice if we could use something like that. Did you talk
> > > > to DaveM
> > > about
> > > > upstreaming this?
> > >
> > > No, we didn't talk yet to anyone outside of this list here about it.
> > >
> > > We were looking for a user of it outside of MPTCP but couldn't find one.
> > > Now, it seems like we found one that would be interested :)
> > >
> > >
> > > I think this infrastructure here would simplify your code quite a bit, no?
> >
> > I'm not sure. I'll need to think about that.
> >
> > I'm worried that in some cases it won't work, i.e. the shared_info wouldn't
> > reach our driver and then again we would be forced to use the tls_get_record
> > function.
> > One example, is the tcp_mtu_probe function, which creates a new skb. Another
> > is tcp_collapse_retrans. We need to ensure all of these are covered when we
> > pass any offloaded TLS skb to the driver.
> >
> > >
> > > > We will definitely find it useful for the receive side, there we
> > > > allocate the SKB
> > > in the driver.
> > >
> > > Interesting! So even there you could use it. We were under the
> > > impression that it would be of less interest for the receive-side.
> > >
> > >
> > > Christoph
> > >
> > > >
> > > > > See below for the rest of the patch.
> > > > >
> > > > >
> > > > > Christoph
> > > > >
>
Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
by cpaasch@apple.com
On 09/11/17 - 06:48:09, Boris Pismenny wrote:
> > -----Original Message-----
> > From: cpaasch(a)apple.com [mailto:cpaasch@apple.com]
> > Sent: Thursday, November 09, 2017 13:48
> > To: Boris Pismenny <borisp(a)mellanox.com>
> > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>; mptcp(a)lists.01.org
> > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> >
> > On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> > > +Ilya and Liran
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: cpaasch(a)apple.com [mailto:cpaasch@apple.com]
> > > > Sent: Thursday, November 09, 2017 13:13
> > > > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris
> > Pismenny
> > > > <borisp(a)mellanox.com>
> > > > Cc: mptcp(a)lists.01.org
> > > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > > >
> > > > +Boris
> > > >
> > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > > > without significant impact on systemwide memory use. However,
> > additional
> > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > >
> > > > > An optional shared control buffer placed after the normal struct
> > > > > skb_shared_info can accommodate the necessary state without imposing
> > > > > extra memory usage or code changes on normal struct sk_buff
> > > > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of
> > struct
> > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > > > available, and cleared if it is not.
> > > > >
> > > > > pskb_expand_head will preserve the shared control buffer if it is present.
> > > > >
> > > > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > > > ---
> > > > > include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > > net/core/skbuff.c | 56 ++++++++++++++++++++++++++++++++++++++------------
> > > > > 2 files changed, 66 insertions(+), 14 deletions(-)
> > > >
> > > > Boris, below is the change I mentioned to you.
> > > >
> > > > It allows allocating 48 additional bytes on-demand after skb_shared_info.
> > > > As it is on-demand, it won't increase the size of the skb for other users.
> > > >
> > > > For example, TLS could start using it when it creates the skb that it
> > > > pushes down to the TCP-stack. That way you don't need to handle the
> > > > tls_record lists.
> > > >
> > >
> > > One small problem is that TLS doesn't create SKBs. As a ULP it calls the
> > > transport send functions (do_tcp_sendpages for TLS). This function
> > > receives a page and not an SKB.
> >
> > yes, that's a good point. Mat has another patch as part of this series,
> > that adds an skb-arg to sendpages
> > (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
> >
> > That should do the job for you.
> >
> > > We decided not to create the SKB outside of the TCP layer to reduce the
> > > number of changes we made to TCP.
> > >
> > > It would be nice if we could use something like that. Did you talk to DaveM
> > > about upstreaming this?
> >
> > No, we haven't yet talked to anyone outside of this list about it.
> >
> > We were looking for a user of it outside of MPTCP but couldn't find one.
> > Now, it seems like we found one that would be interested :)
> >
> >
> > I think this infrastructure here would simplify your code quite a bit, no?
>
> I'm not sure. I'll need to think about that.
>
> I'm worried that in some cases it won't work, i.e. the shared_info wouldn't reach our
> driver and then again we would be forced to use the tls_get_record function.
> One example is the tcp_mtu_probe function, which creates a new skb. Another
> is tcp_collapse_retrans. We need to ensure all of these are covered when we pass
> any offloaded TLS skb to the driver.
Yes, that's a good point. Making sure that all of these are covered will be
tricky.
There could even be other places further down the stack before the driver
where an skb is copied/modified.
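The copy paths under discussion (tcp_mtu_probe, tcp_collapse_retrans, and anything further down) share one failure mode, which a toy user-space model makes concrete. The struct names mirror the patch description, but the layout and the expand() helper here are purely illustrative, not kernel code:

```c
/* Toy user-space model -- NOT the kernel structs.  It only shows why a
 * tail-reallocating path that is unaware of the extension strips it. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct shinfo {                     /* stands in for skb_shared_info */
	uint8_t is_ext;
};

struct shinfo_ext {                 /* stands in for skb_shared_info_ext */
	struct shinfo shinfo;       /* must stay the first member */
	uint8_t shcb[48];           /* shared control buffer */
};

/* Modeled after what pskb_expand_head has to do: move the shared info
 * to a fresh buffer.  Only an extension-aware caller preserves shcb. */
static struct shinfo *expand(const struct shinfo *old, int aware)
{
	size_t n = (old->is_ext && aware) ? sizeof(struct shinfo_ext)
					  : sizeof(struct shinfo);
	struct shinfo *copy = calloc(1, n);

	memcpy(copy, old, n);
	if (!aware)
		copy->is_ext = 0;   /* extension silently lost */
	return copy;
}
```

Every path between the TLS layer and the driver would have to take the aware branch; one missed path and the driver is back to looking the record up out-of-band.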
Christoph
>
> >
> > > We will definitely find it useful for the receive side, where we allocate
> > > the SKB in the driver.
> >
> > Interesting! So even there you could use it. We were under the impression
> > that it would be of less interest for the receive-side.
> >
> >
> > Christoph
> >
> > >
> > > > See below for the rest of the patch.
> > > >
> > > >
> > > > Christoph
> > > >
>
[PATCH 0/2] Shared control buffer proposal
by Mat Martineau
Here's an implementation of the "shared control buffer" we recently
discussed on this list. This uses a SKB_ALLOC_SHINFO_EXT flag for __alloc_skb
to optionally allocate extra space following skb_shared_info. Regular
(non-extended) skbs are not changed at all. The new skbs with extended
shared info can be safely handled by existing code, although low-level
operations that copy or reallocate skb data and shared info may strip
the extended information if they are not aware of it.
The second patch allows users of do_tcp_sendpages to provide a skb with
an extended control buffer to use for the first packet. In MPTCP, this
would allow a DSS mapping for a large chunk of data to be sent in the
first packet, with the rest of the mapped data sent in subsequent
packets. It also conveniently ensures that the beginning of the provided
data is not coalesced into an existing skb, so the DSS mapping and the
beginning of the mapped data stay together.
I still need to verify how GSO and TSO handle these skbs. When I tested
this code by using SKB_ALLOC_SHINFO_EXT with every TCP allocation, the
extended information was available when the TCP headers were written.
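The allocation scheme described above can be sketched as a user-space model. The flag value, the stand-in struct names, and the 48-byte size are taken from the discussion or invented for illustration; this is not the actual __alloc_skb change:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKB_ALLOC_SHINFO_EXT 0x04   /* hypothetical flag value */

struct shinfo {                     /* stands in for skb_shared_info */
	uint8_t is_ext;             /* set when the extension follows */
};

struct shinfo_ext {                 /* stands in for skb_shared_info_ext */
	struct shinfo shinfo;
	uint8_t shcb[48];           /* the extra on-demand bytes */
};

/* Like __alloc_skb: the shared info sits at the end of the data buffer,
 * and only the new flag makes the allocation larger. */
static struct shinfo *alloc_shinfo(size_t payload, int flags)
{
	size_t tail = (flags & SKB_ALLOC_SHINFO_EXT) ?
		      sizeof(struct shinfo_ext) : sizeof(struct shinfo);
	uint8_t *buf = calloc(1, payload + tail);
	struct shinfo *si = (struct shinfo *)(buf + payload);

	si->is_ext = !!(flags & SKB_ALLOC_SHINFO_EXT);
	return si;
}
```

Regular callers never pass the flag and pay nothing extra; extended callers check is_ext before downcasting to the larger struct.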
Mat Martineau (2):
skbuff: Add shared control buffer
tcp: Let do_tcp_sendpages accept a pre-allocated initial skb
include/linux/skbuff.h | 24 +++++++++++++++++++++-
include/net/tcp.h | 2 +-
net/core/skbuff.c | 56 ++++++++++++++++++++++++++++++++++++++------------
net/ipv4/tcp.c | 12 ++++++++---
net/tls/tls_main.c | 3 ++-
5 files changed, 78 insertions(+), 19 deletions(-)
--
2.14.2
tls: Add generic NIC offload infrastructure
by Christoph Paasch
Take a look at https://patchwork.ozlabs.org/patch/835816/.
TLS also needs to maintain a per-record mapping on the skb, which is similar
to the DSS-mapping.
For each TLS record, they seem to allocate a data-structure and maintain a
list of them. tls_get_record() allows finding the record based on the
sequence-number.
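The bookkeeping that tls_get_record() implies looks roughly like this toy model; the structure and field names are invented for illustration, not the actual net/tls code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-record list; NOT the real net/tls structures. */
struct tls_rec {
	struct tls_rec *next;
	uint32_t start_seq;   /* first TCP sequence number of the record */
	uint32_t len;         /* record length in bytes */
};

/* Walk the list for the record covering `seq`.  This per-lookup linear
 * search is the kind of cost an skb_shared_info_ext could avoid by
 * carrying the mapping along with the skb itself. */
static struct tls_rec *get_record(struct tls_rec *head, uint32_t seq)
{
	for (struct tls_rec *r = head; r; r = r->next)
		if (seq - r->start_seq < r->len)   /* wrap-safe containment */
			return r;
	return NULL;
}
```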
They seem like a good candidate for an skb_shared_info_ext :-)
I will try to find the guy who presented it here at netdev and talk to him
about it.
Christoph