Transmission Control Protocol (TCP) is a transport layer protocol which provides the abstraction of a reliable network over an unreliable channel. It is coupled with Internet Protocol (IP), a datagram layer protocol which provides host-to-host routing, to form the backbone of the network stack. IP gateways connect networks of different bandwidths which can lead to congestion due to buffer overflows. This congestion can be caused by either sender overwhelming the receiver or by either side overwhelming the underlying network.
Flow control manages the rate of data transmission between two nodes to ensure that a fast sender doesn’t overwhelm a slow receiver. In this method, each node advertises its receive window (rwnd) which is the size of its available buffer space to store incoming data. This value can be dynamically adjusted by the node and communicated via the ACK packet to the other node.
Flow control handles the congestion at the nodes but either side has the potential to overwhelm the network. In 1988, Jacobson and Karels introduced a number of algorithms which tackled this problem: slow-start, congestion avoidance, fast retransmit and fast recovery. These algorithms were immediately incorporated in the TCP protocol and are known for preventing an internet meltdown.
The goal of these algorithms was to make sure that the flow in a TCP connection runs stably with a full window of data in transit.
Slow-start algorithm is designed to estimate the capacity of the network between a sender and a receiver. It gradually increases the amount of data in transit. Each node adds a congestion window (cwnd) to their per-connection state. This value is a limit on the amount of data a sender can have in-flight before receiving an acknowledgment from the client. The start value of cwnd is set to 1 MSS (Maximum Segment Size) i.e. the largest amount of data that can be received in a single TCP segment. As each ACK arrives, cwnd is increased by one segment i.e. two packets are generated per ACK: one for the ack (a packet has left the system and a new packet can take its place) and one because the ack opens up the congestion window by one packet. This leads to an exponential growth in the congestion window as it doubles after each RTT (Roundtrip time). The congestion window is coupled with receiver’s flow window (rwnd) and a sender only sends the minimum of (cwnd, rwnd) data in the network.
The paper in 1987 by Chiu and Jain on Analysis of Increase and Decrease Algorithms for Congestion Avoidance shows the general pattern of response time and throughput of a network as the network load increases.
We can see that at first as load increases, throughput increases rapidly and response time increases slowly. But after load reaches the network capacity, the queue starts building up and packets are dropped. Throughput suddenly decreases and response time increases rapidly beyond this point. Cliff describes this point after which throughput decreases quickly and delay approaches infinity. The knee is used to mark the point after which throughput increases very slowly but delay increases very rapidly.
Slow-start is used to reach knee quickly. It initializes the connection with a window and doubles the amount of data in flight for every round-trip until it exceeds a system-configured congestion threshold (ssthresh). This threshold is a guess about the location of the knee. After this point, congestion avoidance algorithm takes over which intends to slow down slow-start and stay left of the cliff.
Congestion avoidance increments congestion window by MSS * (MSS/cwnd) after receiving every ACK packet. This means that the congestion window is increased by one segment after a full window of segments have been acknowledged.
Retransmit Timeout Interval
Data segments and acknowledgments can get lost in the network. TCP sets a timeout period when it sends the data, and if the data is unacknowledged when the timeout expires, it retransmits the data again. The timeout is indicative of congestion in the network as packets are being dropped. After the timeout expires, TCP sets ssthresh to half of the congestion window and the congestion window is set to 1.
Fast Retransmit and Fast Recovery
A TCP receiver sends a duplicate ACK to the sender when it receives a data segment that is out of order. Instead of waiting for the timeout period to expire, fast retransmit algorithm assumes that if the sender receives three duplicate acknowledgments, then the data segment is lost and immediately retransmits the lost segment. This way TCP avoids waiting for a timeout in order for retransmission to begin.
After a fast retransmit, ssthresh is set to half of the congestion window and the congestion window is set to 1. Fast recovery algorithm observes that duplicate packets are indicative of network still being capable of delivering some segments and thus cuts the congestion window in half. Fast retransmit and recovery (FRR) is implemented by many TCP congestion algorithms such as TCP Reno. The aim is to prevent expensive timeouts and avoid slow-starting again.
Additive Increase and Multiplicative Decrease
The paper by Chiu and Jain also draws the tradeoffs between different feedback control algorithms to show which one converges to use equal amounts of a contended link in case of multiple flows. AIMD (Additive-increase/multiplicative-decrease) is the algorithm which converges multiple flows to stability and provides a fair cycle. It combines linear growth of the congestion window with an exponential reduction when a congestion takes place.
a > 0 where a = additive increase parameter 0 < b < 1 where b = multiplicative decrease parameter if congestion is not detected: cwnd = cwnd + a else: cwnd = cwnd * b
In TCP, after slow-start, congestion avoidance sets the additive increase factor a to one MSS per round-trip time and the multiplicative decrease factor b to 1/2.
Initially: // Set rwnd and MSS during TCP handshake cwnd = 1 MSS ssthresh = 64 kb duplicateACKCount = 0 When a data segment is sent: window = min(cwnd, rwnd) while next_packet < unacked_packet + window: transmit next packet When an ACK is received: duplicateACKCount = 0 if cwnd < ssthresh: // slow-start cwnd += 1 MSS else: // congestion avoidance cwnd += MSS * (MSS / cwnd) // update rwnd from ACK packet If re-transmit timeout occurs: ssthresh = cwnd / 2 cwnd = 1 duplicateACKCount = 0 When a duplicate ACK is received: duplicateACKCount += 1 if duplicateACKCount >= 3: ssthresh = cwnd / 2 cwnd = cwnd / 2