What’s Inside your TCP Optimization Toolbox?
Note: this post delves fairly deep into the inner workings of TCP, so a quick TCP refresher beforehand might be helpful.
In Paddy’s previous post, he talked about how to load resources intelligently given the constraints of the last mile options (ISP/Cell Tower). Before content is delivered from the edge to the end user, it must first get to the edge from the original source by traversing the global IP infrastructure that we call the “middle mile” in industry jargon. Today’s post specifically talks about this phase of the content’s journey, which primarily focuses on TCP optimizations.
Most people think that TCP optimization means "set some values for some parameters and be done with it." For example, undue attention is paid to the initial congestion window parameter (initcwnd). We try to show here that a holistic approach, examining each detail of data transfer, is what’s needed for sustained and consistent TCP performance.
Since we are performance-obsessed at Instart Logic, let’s start by taking a look at the impact of bandwidth/latency on a given web page’s load time:
What this demonstrates is that content reduction techniques are important in low bandwidth contexts, while request reduction shines in high latency regimes. To demonstrate the same thing in a different way, consider the following breakdown of the Google home page:
At high speeds the “requests” become the most critical factor for performance, but at low speeds the “bytes” dictate what the end user experiences. Most synthetic performance measurement services such as Keynote, Gomez, and Catchpoint will be more sensitive to the number of requests due to their high speed connectivity, whereas Real User Monitoring (RUM) tools like NewRelic, SOASTA, WebPageTest (using throttling) will be more sensitive to the volume of content delivered. Be sure to test on both types of platforms to get a realistic view of the performance experienced by your end users.
Which layer to focus on: HTTP vs. TCP
Now let's say you want to optimize the number of requests, and you have read that the standard best practice for issuing fewer HTTP requests is to package multiple small resources into one bundle. For example, rather than sending three individual resources, the same content can be sent in one resource bundle. This way you expect to save two round trips, or at least this is the accepted wisdom.
However, depending on conditions, three separate connections each downloading n bytes can complete much faster than one connection downloading 3n bytes. This is because there is no one-to-one correspondence between an HTTP request and a TCP round trip. HTTP relies on TCP to segment each request and response into packets, and TCP sends a certain number of packets at a time in a "train." One big HTTP request therefore does not necessarily mean that the browser receives all of the bytes of the combined response any sooner.
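To make the "train" idea concrete, here is a rough sketch that counts flights of packets under idealized slow start. It assumes an initial window of 10 segments, a 1460-byte payload per segment, a window that doubles every round trip, no loss, and no handshake cost; the 40,000-byte resource size is a made-up number, and the function name is ours rather than anything from a real TCP stack.

```python
import math

def slow_start_round_trips(size_bytes, init_cwnd=10, mss=1460):
    """Count the flights needed to deliver size_bytes under idealized slow start.

    Assumptions (not from any real stack): the window starts at init_cwnd
    segments, doubles after every flight, and no packet is ever lost.
    """
    segments_left = math.ceil(size_bytes / mss)
    cwnd = init_cwnd
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window of segments
        cwnd *= 2              # every ACKed segment grows the window by one
        rounds += 1
    return rounds

n = 40_000  # hypothetical resource size in bytes
separate = slow_start_round_trips(n)     # three parallel connections, n bytes each
bundled = slow_start_round_trips(3 * n)  # one connection carrying 3n bytes
print(f"separate: {separate} round trips, bundled: {bundled} round trips")
```

Under these assumptions the three parallel downloads finish in 2 round trips while the bundle needs 4. Real connections share bottlenecks and pay for their handshakes, so treat this as an illustration of the mechanics rather than a prediction.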
The answer to what is the optimal bundling strategy lies in TCP mechanics, which dictate the delivery dynamics of any web resource.
Based on this information, we can re-interpret the same latency graphic above as “the reduction in the number of back and forth exchanges between transacting TCP peers.” This should help convince you to focus on optimizing TCP round trips rather than HTTP request reductions.
Why TCP Sucks for Long Distance Data Transfer
It turns out that most default TCP stacks aren't set up for today's WAN and satellite links, or even gigabit Ethernet: anything with high bandwidth, high delay, or both. To understand why, let’s work out how much TCP can transfer in 1 second using basic arithmetic (more complicated models exist, but the following is sufficient for understanding the concept):
Throughput < BufferSize/NetLatency ===> NetLatency < BufferSize/Throughput
This tells you that once the network latency exceeds about 5 ms, throughput is limited even with the maximum possible receive buffer (this figure is consistent with filling a 100Mbps link using the classic 64KB maximum window, that is, without window scaling). For example, a 100ms link with a 32KB receive buffer caps the throughput at 2.56Mbps regardless of the available capacity. This should convince you that something is broken with TCP for long-haul delivery.
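The arithmetic is easy to reproduce. The sketch below applies the bound in both directions; the function names are ours, 32KB is taken as 32,000 bytes to match the 2.56Mbps figure, and the 65,535-byte window on a 100Mbps link is our assumption about where a latency limit of roughly 5 ms would come from.

```python
def max_throughput_bps(window_bytes, latency_seconds):
    """Upper bound on throughput: at most one window of data per round trip."""
    return window_bytes * 8 / latency_seconds

def max_latency_seconds(window_bytes, target_bps):
    """Largest latency at which a window of this size can still sustain target_bps."""
    return window_bytes * 8 / target_bps

# A 100 ms link with a 32 KB (32,000-byte) receive buffer.
print(f"{max_throughput_bps(32_000, 0.100) / 1e6:.2f} Mbps")  # prints 2.56 Mbps

# Assumption: a 65,535-byte window (no window scaling) on a 100 Mbps link.
print(f"{max_latency_seconds(65_535, 100e6) * 1e3:.2f} ms")   # prints 5.24 ms
```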
Given the above situation, we employ the following five heuristics to specifically overcome this handicap of TCP for high bandwidth-delay paths. Please note that due to their interdependencies, you need all of these working in unison, rather than deploying any single one of these options.
The bandwidth-delay product for a 100Mbps link across the US is 90KB, which means you need to keep that much data in flight to fully utilize the link capacity. Given that the middle mile nodes have links faster than 1Gbps between them, and given their geographical dispersion, we want to set a minimum value (a floor) for the TCP congestion window and never fall below it, so as to ensure maximum network utilization. Even in slow start, and after a timeout, the congestion window remains at least at this value. When we set the floor to 30 segments or more, most HTML/JSON responses are sent in a single flight of packets even after slow start or packet loss. (30 x 1500 bytes = 45KB, which is larger than the HTML response of more than 90 percent of the Top 1000 sites.)
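A small sketch of what the floor buys you, using the 30-segment and 1500-byte figures from the paragraph above. The helper names are hypothetical, and a real stack applies the clamp inside the kernel rather than in application code.

```python
import math

CWND_FLOOR_SEGMENTS = 30  # floor discussed above
SEGMENT_BYTES = 1500      # packet size used in the 45KB estimate

def clamp_cwnd(proposed_cwnd, floor=CWND_FLOOR_SEGMENTS):
    """Never let the congestion window drop below the floor."""
    return max(proposed_cwnd, floor)

def fits_in_one_flight(response_bytes, cwnd=CWND_FLOOR_SEGMENTS,
                       segment_bytes=SEGMENT_BYTES):
    """True if the whole response can be sent without waiting for an ACK."""
    return math.ceil(response_bytes / segment_bytes) <= cwnd

# Standard TCP collapses the window to one segment after a timeout;
# with the floor in place the sender still gets 30 segments.
print(clamp_cwnd(1))               # prints 30
print(fits_in_one_flight(45_000))  # prints True
print(fits_in_one_flight(45_001))  # prints False
```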
Simply adjusting the congestion window floor won’t do, as we will still be limited by the number of acknowledgements we receive. When a TCP receiver uses delayed acknowledgment, it slows down the growth of the sender's congestion window and reduces the sender's throughput. Moreover, for HTTP-type request/response traffic, there is no hope of piggy-backing the ACK on the data anyway. So disabling delayed acknowledgement on our edge PoPs ensures that we can sustain the data transfer as fast as the sender can send it, without bogging it down.
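The effect of delayed ACKs on window growth can be sketched with a toy model. It assumes the sender grows its window by one segment per ACK received (no byte counting), that a delayed-ACK receiver acknowledges every second segment, and that a leftover odd segment is eventually acknowledged on its own. The function is illustrative only.

```python
import math

def round_trips_to_reach(target_cwnd, start_cwnd=1, segments_per_ack=1):
    """Round trips of slow start needed before the window reaches target_cwnd.

    segments_per_ack=1 models an ACK for every segment;
    segments_per_ack=2 models delayed ACKs (one ACK per two segments).
    """
    cwnd = start_cwnd
    rounds = 0
    while cwnd < target_cwnd:
        acks = math.ceil(cwnd / segments_per_ack)  # ACKs returned this round
        cwnd += acks                               # +1 segment per ACK
        rounds += 1
    return rounds

print(round_trips_to_reach(30, segments_per_ack=1))  # prints 5
print(round_trips_to_reach(30, segments_per_ack=2))  # prints 8
```

In this model the window roughly doubles per round trip with immediate ACKs, but only grows by about 1.5x with delayed ACKs, which is why the two tweaks work best together.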
Retransmission Timer Optimization
Once we have a floor on the window and remove delayed ACKs, we have the pump primed to send data at high throughput. However, TCP timeouts are unavoidable (full window loss, lost re-transmit), so we should try reducing the time spent waiting for a timeout. While it can visibly improve throughput, this solution should be viewed with caution because it also increases the probability of premature timeouts. So, estimating the right re-transmission timeout (RTO) value is important for achieving a timely response to packet losses, while avoiding premature timeouts.
A premature timeout has two negative effects:
- It leads to a spurious re-transmission;
- With every timeout, TCP enters the slow start mode – even though no packets are lost. Since there is no congestion, TCP thus would underestimate the link capacity and throughput would suffer.
TCP has a conservative minimum RTO (RTOmin) value to guard against spurious re-transmissions. The Linux TCP stack uses an RTOmin value of 200ms. Unfortunately, this value can be several times greater than the round-trip times of end-user connections (which are typically about 20-50ms). To fix this situation, the following approach may be employed:
- reduce RTOmin to 20ms;
- estimate the current RTO value as 3x the current smoothed RTT.
Because delayed acknowledgements are disabled, we don't need the minimum to be at 200ms. Our tests with mobile clients have shown that this strategy helps achieve a timely response to packet losses, while retaining a rather small risk of spurious re-transmissions in case of RTT spikes.
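The two rules above can be sketched as a small estimator. The 20 ms minimum and the 3x multiplier come from the list above; the smoothing gain of 1/8 and the 1-second initial RTO are borrowed from the standard RTO computation (RFC 6298) as assumptions, and the class itself is hypothetical rather than anyone's production code.

```python
RTO_MIN = 0.020     # seconds, reduced minimum from the list above
RTO_MULTIPLIER = 3  # RTO = 3 x smoothed RTT, from the list above
ALPHA = 0.125       # assumption: smoothing gain used by RFC 6298
INITIAL_RTO = 1.0   # assumption: RFC 6298 value before any RTT sample

class RtoEstimator:
    """Tracks a smoothed RTT and derives the retransmission timeout from it."""

    def __init__(self):
        self.srtt = None

    def observe(self, rtt_sample):
        """Fold a new RTT sample (in seconds) into the smoothed RTT."""
        if self.srtt is None:
            self.srtt = rtt_sample
        else:
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        return self.rto()

    def rto(self):
        """Current timeout: 3 x SRTT, but never below RTO_MIN."""
        if self.srtt is None:
            return INITIAL_RTO
        return max(RTO_MIN, RTO_MULTIPLIER * self.srtt)

est = RtoEstimator()
print(f"{est.observe(0.040) * 1e3:.0f} ms")  # prints 120 ms
print(f"{est.observe(0.080) * 1e3:.0f} ms")  # prints 135 ms
```

An RTT spike moves the timeout only gradually, which is where the residual risk of spurious re-transmissions comes from.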
While the above techniques optimize for a train of packets, the last packet in a train is not eligible for fast recovery: no later packets follow it to generate duplicate ACKs, so its loss would be detected only by a timeout in the classic sense. The only way to avoid a "classic" RTO re-transmit, and to trigger either the slow start or fast re-transmit mechanisms when the last-sent packet (or a bunch of last-sent packets) is lost, is to resend it if its ACK has not arrived after a time a bit longer than a single RTT. Two copies have a higher probability of arriving at their destination than one, so we resend the last packet in a train. The same tactic can be used for SYN and SYN/ACK packets to make connection establishment faster.
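A back-of-the-envelope sketch of why the duplicate helps. It assumes losses are independent (bursty loss would make the gain smaller), and the 25 percent slack over one RTT is a made-up value standing in for "a bit longer than a single RTT". Both helper names are ours.

```python
def tail_probe_delay(srtt, slack=0.25):
    """How long to wait for the last packet's ACK before resending it.

    slack is a hypothetical margin over one smoothed RTT.
    """
    return srtt * (1.0 + slack)

def delivery_probability(loss_rate, copies):
    """Chance that at least one of `copies` packets arrives, assuming
    each copy is lost independently with probability loss_rate."""
    return 1.0 - loss_rate ** copies

srtt = 0.040  # 40 ms smoothed RTT
print(f"probe after {tail_probe_delay(srtt) * 1e3:.0f} ms")  # prints probe after 50 ms
print(f"{delivery_probability(0.02, 1):.4f}")                # prints 0.9800
print(f"{delivery_probability(0.02, 2):.4f}")                # prints 0.9996
```

With a 40 ms RTT the probe fires at about 50 ms, well before a 3x smoothed RTT timeout of 120 ms would.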
As network speeds increase, there is a greater chance that packets won’t arrive in the same order we sent them. This occurs when the order of packets is inverted due to multi-path routing or parallelism at routers and communicating hosts. It can affect performance because:
- It causes unnecessary re-transmission: When the TCP receiver gets packets out of order, it sends duplicate ACKs, which trigger the fast re-transmit algorithm at the sender. These ACKs make the TCP sender infer that a packet has been lost and retransmit it. If the temporary sequence number gap is caused by reordering, then the duplicate ACKs and the fast re-transmission are unnecessary and a waste of bandwidth.
- It limits transmission speed: When fast re-transmission is triggered by duplicate ACKs, the TCP sender assumes it is an indication of network congestion. It reduces its congestion window to limit the transmission speed, and the window then has to grow back again. If reordering happens frequently, the congestion window stays small and can hardly grow. As a result, TCP has to transmit packets at a limited speed and cannot efficiently utilize the bandwidth.
Our measurements demonstrate a high prevalence of packet reordering relative to packet loss across high-speed backbone networks, with a degree of reordering of up to 90 packets. Investigations of real IP flows also show that most reordered packets arrive at the receiver with time lags of less than 10ms. To take this into account, the following strategy can be employed to block the impact of this phenomenon on performance:
- If the first dupACK is detected, the stack is blocked from any actions on this event for a certain time.
- If the actual packet reordering took place, this timeout is enough for self-recovery.
- If the packet loss took place, a "standard" fast re-transmit algorithm starts.
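The strategy above can be sketched as a tiny state machine. The 10 ms hold-off is a hypothetical value picked to match the observed reordering lag, and the class, method, and return-value names are ours. A real stack would also arm a timer so that it can act when the hold-off expires without further duplicate ACKs.

```python
class DupAckHoldOff:
    """Ignore duplicate ACKs for a short time to let reordering resolve itself."""

    def __init__(self, hold_off=0.010):  # hypothetical 10 ms hold-off
        self.hold_off = hold_off
        self.blocked_since = None        # time of the first dupACK, if any

    def on_dupack(self, now):
        """Called for every duplicate ACK; `now` is a timestamp in seconds."""
        if self.blocked_since is None:
            self.blocked_since = now     # first dupACK: start the hold-off
            return "ignore"
        if now - self.blocked_since < self.hold_off:
            return "ignore"              # still inside the hold-off window
        return "fast_retransmit"         # gap persisted: treat it as loss

    def on_ack_advance(self):
        """Called when a new cumulative ACK arrives and the gap is filled."""
        was_blocked = self.blocked_since is not None
        self.blocked_since = None
        return "reordering_recovered" if was_blocked else "normal"

f = DupAckHoldOff()
print(f.on_dupack(0.000))   # prints ignore
print(f.on_ack_advance())   # prints reordering_recovered
print(f.on_dupack(1.000))   # prints ignore
print(f.on_dupack(1.012))   # prints fast_retransmit
```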
As you can see, the benefits are material and significant. All of Instart Logic's customers have access to these TCP benefits by virtue of our Global Network Accelerator.
Now, let’s circle back to our original question – how should you package individual resources for high performance, end-to-end application delivery? The answer is to treat each resource like a packet and model it after TCP dynamics.
We have a lot more to say on this topic. Stay tuned to hear how this theory helps you better package and bundle your assets.
Robert T Morris gives you some magic numbers – like why TCP won't work if the packet loss climbs up to more than 2%, among other things.