Personal Miscellaneous TCP/IP GRID Quality of ServiceMulti-Cast  
Background high-speed tcp Transfer Tests Web 100TCP Tuning  

TCP Algorithms

There have been many implementations of TCP, most notably TCP Tahoe which introduced slow start, congestion avoidance and fast retransmit. TCP Reno also introduced fast recovery. Another development in TCP networking are Selective Acknowledgements (SACKS) and this shall also be introduced in this section. Discussions on the strength and weaknesses of these various implementations are discussed else-where~\cite{comparisons}.

In order to account for and adjust the data flow to minimise the effect of network problems and to maintain a steady data flow, modern implementations of TCP utilise the four mentioned algorithms. These algorithms are described in the following section.


Slow Start

Problems of just using the receiver and sender windows for calculation of data transfers are fine if both the sender and the receiver are on the same LAN. However, modern applications of network traffic often involve data moving from a fast LAN on to a slower WAN link. This causes implications upon how much data we can possibly shift without congesting the network.

The role of the 'slow start' algorithm allows a connection to quickly 'grow' into 'free' bandwidth whilst paying close attention to any possible problems that a connection may face. By implementing a variable called the congestion window (cwnd) we can effectively determine the amount of data that is permitted by the sender on the network at any instance. By implementing slow start, we push data onto the network at a rate such that as much of the network bandwidth is utilised as possible without negative consequences to other users on the network.

The algorithm to slow start is as follows,

  • Upon a fresh connection, the value of cwnd is set to one segment (usually 536 or 512 bytes - as advertised by the other end).
  • As an ACK is received signalling a positive data transfer, the value of cwnd is increased by one segment. As the connection progresses with more concurrent ACKS being received by the sender, the value of cwnd grows exponentially.
  • Upon each instance during the connection, the sender is only permitted to send up to the minimum of the cwnd and the receiver's window.

Should an error of any kind occur upon transmitting data, it is important that we should 'back-off' the tranfer by setting it to half the value of cwnd. This (combined with ssthresh - see later) sets the TCP transfer up such that as much of the network is utilised as possible.


Congestion Avoidance

Once the network tolerance is reached, routers on the link may hold packets in long queues due to overloading as routers get swamped. This leads to congestion as the routers can not shrink the queue size quickly and so may choose to drop packets. If we were to use 'slow start' after congestion, we would be trying to force data back onto the network - at the expense of both ours and other users'. Hence, we need a way of slowing finding and adapting to the network. Congestion Avoidance (CA) assumes that data may allow slightly little more data to be put onto the network, should this be false, then an error will be generated - either from a timeout (an unacknowledged packet) or a Duplicate ACK (DupACK) arising from packets arriving out-of-order (and hence returning the same ACK value). Over time, and with another variable ssthresh (discussed later), we can adapt to the problems of the link and maintain a constant transfer rate at the limits of the network.

When probing the network, we do not want to flood the network with too much data as this would have a negative impact on the rest of the internet community. Therefore, only a small increase in the throughput should be permitted. Likewise, we don't want to kill the transfer altogether when an error occurs, so we should go back to a value which we 'know' is maintainable by the network.

The CA algorithm assumes only packet loss - either occurring from a timeout or from a DupACK. It is assumed that packet loss from damaged packets is small and hence negligible. It is also closely tied in with the slow start algorithm, so much so that it utilises the same cwnd variable to determine how much data is allowed onto the network. It is summed up below,

  • During the data transfer, if a timeout occurs, set cwnd to half of the current window size.
  • On receiving a new ACK for data, increase cwnd by only segment size * segment size/cwnd. This ensures a slow increase to the value of cwnd and successful probing of the network.
  • When sending data through the network, the sender must consider the values of the cwnd and the receivers 'advertised' window size. In order not to waste bandwidth, the sender must only send the lower of the two values.


Window Adjustment Policy One reason for using 12 as a the decrease term, as opposed to the 78 in [JRC87], was the following hand-waving: When a packet is dropped, you're either starting (or restarting after a drop) or steady-state sending. If you're starting, you know that half the current window size `worked', i.e., that a window's worth of packets were exchanged with no drops (slowstart guarantees this). Thus on congestion you set the window to the largest size that you know works then slowly increase the size. If the connection is steady-state running and a packet is dropped, it's probably because a new connection started up and took some of your bandwidth. We usually run our nets with ae ^ 0:5 so it's probable that there are now exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half because the bandwidth available to you has been reduced to half. And, if there are more than two conversations sharing the bandwidth, halving your window is conservative -- and being conservative at high traffic intensities is probably wise.

Ssthresh

Slow start and congestion avoidance are the two essential algorithms in TCP transfers, they ensure that the network is fully utilised yet not detrimental to other users. In effect they are continuously fighting against each other, the slow start algorithm trying to grow the cwnd whilst congestion avoidance trying to maintain it such that it does not overload the network. Therefore, it is essential to maintain a balance between the two algorithms. This is achieved by implementing another variable called Slow Start Threshold (ssthresh).

By using this variable, we effectively determine a threshold at which the connection should do slow start and when to do congestion avoidance. In order to do this, ssthresh is continuously updated dynamically with the connection,

  • Upon a new connection, cwnd is set to one segment and ssthresh is set to 65535 bytes.
  • When congestion on the network occurs, ssthresh is set to one half of the value of window size.
  • If however, congestion occurs because of a timeout from a packet ACK overrunning its RTO, then we force slow start by setting cwnd to one segment.

By having ssthresh update as such, we can define that if cwnd is greater than ssthresh, TCP reverts to a congestion avoidance phase - ensuring that we don't overuse the network; and when cwnd is less than or equal to ssthresh, TCP enters a slow start phase ensuring that we don't waste our bandwidth.


Fast Retransmit

Generation of an immediate ACK (a DupACK) is often received when the remote machine receives an out-of-order segment. As TCP implementations sends the next wanted segment as an ACK, rather than the segment in which it had received, the arrival of DupACKs suggest either segment have been lost or reordered while in transit due to delayed packets.

One way that could suggest a lost segment is by the arrival of three or more DupACKs in a row. If this happens, TCP should retransmit what seems to be the missing segment without waiting for the retransmission timer to expire (as would be the normal case). This would speed up transfer as the sender would not have to wait until the packet times-out before sending the suspicious segment. This is known a fast re-transmit.

The TCP data sender only retransmits a packet after a retransmit timeour has occurred, or after three duplicate acks have arrived triggering the Fast Retransmit algorithm. A single retransmit timeout might result in the retransmission of several data packets, but each invocation of the Fast Retransmit algorithm leads to the retransmission of only a single data packet.

When an out-of-order segment is received, TCP should generate an
immediate acknowledgment (ACK). This ACK is to let other end know that a segment was received out of order. If three or more duplicate ACKs are received, it indicates that a segment has been lost and retransmission should occur immediately

 

Fast Recovery

The question after receiving a series of DupACKs is what should we do? If we were to receive three DupACKs, there is strong chance that the network is congested. So, if we were to force slow start we would abruptly reduce the data flow and cause wild oscillations in the throughput. As such, congestion avoidance should be performed.

In order to force TCP connection into congestion avoidance, we again use the ssthresh variable as described above,

Upon receipt of the third DupACK, ssthresh is set to half the value of the current cwnd, but no less than two segments. This ensures that the connection is not forced into slow start.

  • The missing segment is sent and the cwnd is set to the value of ssthresh plus three times the segment size. This ensures that the network is in a steady state as the three DupACKs have left the network.
  • Each time another DupACK arrives, we must account for the fact that another segment has left the network. So, cwnd is incremented by the segment size to account for this. If it is allowed by the new value of cwnd, a new packet is sent.
  • Once the receiver has received new data and corresponding ACKs sent, we should start congestion avoidance.

It is worth noting that the ACK that acknowledges new data should also acknowledge all intermediate segments that were confirmed with the DupACKS. By setting the cwnd to the value of ssthresh,the connection enters congestion avoidance.

When the third duplicate ACK is received, set ssthresh to one-half the
current congestion window but no less than two segments. Retransmit the missing segment. Each time another duplicate ACK arrives, increment congestion window by the segment size.

Selective ACKs

The method of ACKs was originally designed such that the receiver effectively tells the sender which segment it is next expecting. The problem with this method is that should a single segment be lost, but all following segments received, ACKs will only point to that of the one lost. This makes it very difficult for the senders to know which segments were successful and which were lost. A more effective way of acknowledging data would be to tell the sender which segments it has received\cite{SACKS}. This is the purpose of using Selective ACKs (SACKS).

Upon the loss or receipt of out-of-order segments, the receiver stores the segments temporally in its buffer. It then notifies the sender of the continuous segments (or blocks) of data that it has via the ACK signal. From this, the sender can then determine which segments are missing and hence which it should resend. By telling the sender precisely what it has, the sender should be able to send the missing segment quicker than simply waiting for a time-out to occur\footnote{The method of having to wait for a time-out to occur before resending the data is also called \emph{go-back-n}}.

To extend this scheme, DupACKs were also introduced which accounted for occurrences of a duplicate contiguous sequence of data received by the receiver in the most recent packet. Further details can be found else-where\cite{DSACKS}.

By implementing SACKs and DSACKs, TCP connections should be more robust in environments with many retransmissions, a large variance in RTT, losses in returning (ACK) paths and especially in instances where packets were received out-of-order.

However, in the absense of SACKS, there is little information availabel to the TCP sender in making retransmission decisions during fast recovery. An RFC (2582) describing TCP performing without SACKS is here.

TCP may experience poor performance when multiple packets are lost from one window of data because a TCP sender can only acknowledge a single lost packet per round trip time. A selective Acknowledgment (SACK) mechanism can help to overcome these limitations. With selective acknowledgements, the data receiver can inform the sender about all segments that have arrived successfully; sender need retransmit only the segments that have actually been lost. This option is an SACK enabling option that only sent in a SYN segment to indicate that the SACK option can be used once the connection is established. SACK can be used only if the two sides understand the option.

SACK option: This option is enabling only when Sack-permitted option is enabled. It is sent by a data receiver to inform the data sender of non-contiguous blocks of data that have been received and queued. Each contiguous block of data queued at the data receiver is defined in the SACK option by two 32-bit unsigned integers in network byte order: Left edge of block (the first sequence number of this block) and right edge of block (the sequence immediately following the last sequence number of this block. SACK allow for multiple losses in a transmission window to be recovered in one RTT, significantly lessening the time to recover when the RTT is large. Several TCP additional extensions for improving TCP performance have been introduced in the last decade.


 

Mon, 30 September, 2002 13:54 Next Page
 
 
    email me!
© 2001-2003, Yee-Ting Li, email: ytl@hep.ucl.ac.uk, Tel: +44 (0) 20 7679 1376, Fax: +44 (0) 20 7679 7145
Room D14, High Energy Particle Physics, Dept. of Physics & Astronomy, UCL, Gower St, London, WC1E 6BT