[p2p-hackers] MTU in the real world
David Barrett
dbarrett at quinthar.com
Tue May 31 21:46:18 UTC 2005
Ah, thanks -- this is precisely the kind of story I'm looking to hear.
But I agree its conditions are a bit unusual. Do you know of any
similar attempts at using big MTUs over a standard consumer internet
connection?
-david
On Tue, 31 May 2005 10:40 am, Serguei Osokine wrote:
> On Tuesday, May 31, 2005 David Barrett wrote:
>> With this in mind, have you tried using an MTU bigger than 1500 bytes
>> and been bitten by it?
>
> Yes. That was not your typical everyday situation, but I think
> some on this list might find it entertaining anyway:
>
> We tried to use UDP to transfer stuff over a gigabit LAN inside
> the cluster. Pretty soon we discovered that with small (~1500 byte)
> packets the CPU was the bottleneck, because you can send only so many
> packets per second, and the resulting throughput was nowhere close to
> a gigabit. (You have to send almost 100K such packets a second to
> achieve a gigabit throughput, and we were doing several times less
> on our 2-CPU 2.4GHz Win XP boxes.)
>
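A rough check of the packet-rate figure above, as a minimal Python sketch. It assumes a nominal 1 Gbit/s link and ignores Ethernet/IP/UDP header overhead; the function name is made up for illustration.

```python
LINK_BITS_PER_SECOND = 1_000_000_000  # nominal gigabit link (assumption)


def packets_per_second(payload_bytes, link_bps=LINK_BITS_PER_SECOND):
    """Packets needed per second to fill the link with payloads of this size."""
    return link_bps / 8 / payload_bytes


if __name__ == "__main__":
    for size in (1500, 64 * 1024):
        print(f"{size:>6} bytes -> {packets_per_second(size):,.0f} packets/s")
```

At 1500 bytes this comes to a bit over 83,000 packets per second, which is the "almost 100K" in the story; at 64 KB the sender only has to issue about 1,900 send calls per second, although the wire still carries 1500-byte fragments.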
> So then we tried to increase the UDP datagram size. The gigabit
> switch did not support jumbo frames, by the way, so we were fragmenting
> as soon as we exceeded 1500. The throughput went up, and was pretty
> decent with 64-KB datagrams (don't remember the exact numbers, but it
> was close to a gigabit and generally everything was peachy).
>
> Which is when the funny things started to happen. In the middle
> of a test, the communication channel would just shut down and nothing
> would be delivered over it for a minute or two (though both the sender
> and the receiver kept looking fine and no errors were returned by the
> socket calls - sender was sending data, but the receiver's recvfrom()
> call was not getting it); after that pause the channel would wake up
> as if nothing happened (except for several gigabytes of lost data),
> work normally for a few minutes, after which this shutdown would be
> repeated, and so on.
>
> Took us a while to figure out what was going on, but here is the
> scoop: the gigabit LAN had a fairly small, but nonetheless non-zero
> packet loss rate. When one 1500-byte frame from a 64-KB datagram is
> lost, the rest of the datagram frames (all 62 KB) have to be buffered
> somewhere in case the missing frame arrives and the datagram can be
> fully reassembled. This arrival will never happen, but the socket
> layer does not know that, so it has to keep the partial datagram for
> a while, discarding all its frames if the missing frame does not arrive
> before some timeout (RFC 1122 recommends this timeout value to be
> between 60 and 120 seconds, and this seems to be in line with what
> we saw).
>
> Now, the gigabit link sends quite a lot of data - 100MB+ per
> second, to be precise. Even with 0.01% loss rate, you're losing about
> 10,000 bytes per second. This is no big deal, but every 1500 bytes lost
> causes you to store 62 KB of partial datagrams, so with the loss rate
> above you have to store 400 KB of new data every second. If this data
> expires in 120 seconds, you need about 50 MB for the partial datagram
> storage in the socket layer - and proportionally more if your data loss
> rate is higher than 0.01%. And this amount of memory is something that
> the socket layer in Win XP simply does not have. So as soon as it runs
> out of memory for the partially assembled datagrams, it stops the data
> delivery and waits for the memory to be released. Apparently after it
> gets enough free memory, it switches the data delivery back on again.
>
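The memory estimate above can be reproduced with a short Python sketch. It assumes every lost frame lands in a different datagram and that each partial datagram is held for the full timeout; the function and parameter names are hypothetical, and the inputs are the figures quoted in the story.

```python
def reassembly_backlog_bytes(link_bytes_per_s, loss_rate, frame_bytes,
                             datagram_bytes, timeout_s):
    """Steady-state bytes held in partially reassembled datagrams."""
    # Frames lost per second at the given loss rate.
    lost_frames_per_s = link_bytes_per_s * loss_rate / frame_bytes
    # Each lost frame strands the remainder of its datagram in the buffer.
    stranded_bytes = datagram_bytes - frame_bytes
    # Stranded data accumulates until the reassembly timeout releases it.
    return lost_frames_per_s * stranded_bytes * timeout_s


if __name__ == "__main__":
    backlog = reassembly_backlog_bytes(
        link_bytes_per_s=100_000_000,  # "100MB+ per second"
        loss_rate=0.0001,              # 0.01%
        frame_bytes=1500,
        datagram_bytes=64 * 1024,
        timeout_s=120,                 # upper end of the RFC 1122 range
    )
    print(f"{backlog / 1e6:.1f} MB held in partial datagrams")
```

With these inputs the backlog is about 51 MB, in line with the "about 50 MB" figure, and it scales linearly with both the loss rate and the timeout.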
> This approach does seem funny, and I don't see any compelling
> reason for the socket layer to handle that situation in this "trigger"
> fashion - either it works normally, or shuts down the data delivery
> completely. Might have handled this a bit more gracefully, I'd think.
> But this was Windows, and there was no arguing with it. (We were stuck
> with Windows for unrelated reasons.)
>
> So the bottom line was, we had to go with TCP, because there was
> no way we could make the UDP transport that would be both fast enough
> and would work on our hardware/OS combination. And the part about
> "would work" was definitely related to an attempt to send datagrams
> that would exceed MTU. (Datagrams smaller than MTU sucked performance-wise
> when compared to TCP, but that is another story - gigabit cards
> tend to offload plenty of TCP functionality from the CPU, so it was
> not that the UDP was particularly bad, but rather that TCP performance
> was very good.)
>
> Best wishes -
> S.Osokine.
> 31 May 2005.
>
> -----Original Message-----
> From: p2p-hackers-bounces at zgp.org
> [mailto:p2p-hackers-bounces at zgp.org] On Behalf Of David Barrett
> Sent: Tuesday, May 31, 2005 3:11 AM
> To: Peer-to-peer development.
> Subject: [p2p-hackers] MTU in the real world
>
>
> I've read in multiple places that it's best to have a UDP MTU of under
> 1500 bytes. However, it sounds like most of this is based on
> theoretical analysis, and not on real-world experience.
>
> With this in mind, have you tried using an MTU bigger than 1500 bytes
> and been bitten by it? Basically, do you know of any empirical analysis
> (of any level of formality) of a real-world UDP application that
> supports or refutes the 1500 byte rule of thumb?
>
> Furthermore, I've read that if you "connect" your UDP socket to the
> remote side and then start sending large packets and backing off
> slowly, the socket layer will compute the "real" MTU between two
> endpoints, and you can obtain it through "getsockopt". Do you know of
> anyone who's tried this, and the results?
>
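For reference, the "connect, then getsockopt" idea looks roughly like this on Linux. This is a Linux-only sketch, not something reported in this thread: it relies on the IP_MTU_DISCOVER and IP_MTU socket options, falls back to the numeric values from <linux/in.h> when the Python build does not expose them, and path_mtu is a name made up here.

```python
import socket

# Linux values from <linux/in.h>, used only if the socket module lacks them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)


def path_mtu(host, port):
    """Return the kernel's current path MTU estimate for a connected UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Set the Don't Fragment bit so the kernel tracks the path MTU.
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        # IP_MTU is only valid on a connected socket.
        sock.connect((host, port))
        return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)


if __name__ == "__main__":
    print(path_mtu("127.0.0.1", 9))
```

Right after connect() the value is usually just the MTU of the outgoing interface. It should only shrink once oversized datagrams have been sent and ICMP "fragmentation needed" replies have come back, so an application would have to resend and re-read the option, and paths that filter ICMP may never report a smaller value.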
> -david
> _______________________________________________
> p2p-hackers mailing list
> p2p-hackers at zgp.org
> http://zgp.org/mailman/listinfo/p2p-hackers
> _______________________________________________
> Here is a web page listing P2P Conferences:
> http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences