Congestion Avoidance and Control

V. Jacobson. 1988. Congestion avoidance and control. In Symposium proceedings on Communications architectures and protocols (SIGCOMM ’88), Vinton Cerf (Ed.). ACM, New York, NY, USA, 314-329

Transmission Control Protocol (TCP) is a transport-layer protocol that provides the abstraction of a reliable connection over an unreliable channel. It is layered on Internet Protocol (IP), a datagram protocol that provides host-to-host routing, and together they form the backbone of the network stack. IP gateways connect networks of different bandwidths, which can lead to congestion when buffers overflow. Congestion can arise either because the sender overwhelms the receiver or because either side overwhelms the underlying network.

Flow control manages the rate of data transmission between two nodes so that a fast sender does not overwhelm a slow receiver. Each node advertises its receive window (rwnd), the amount of buffer space it has available to store incoming data. This value can be adjusted dynamically and is communicated to the other node in every ACK packet.

Flow control handles congestion at the endpoints, but either side can still overwhelm the network itself. In 1988, Jacobson and Karels introduced a set of algorithms to tackle this problem: slow-start, congestion avoidance, fast retransmit, and fast recovery. These algorithms were quickly incorporated into TCP and are credited with preventing a congestion collapse of the Internet.

The goal of these algorithms was to make sure that the flow in a TCP connection runs stably with a full window of data in transit.

Slow-Start

The slow-start algorithm estimates the capacity of the network between a sender and a receiver by gradually increasing the amount of data in transit. Each sender adds a congestion window (cwnd) to its per-connection state. This value limits the amount of data the sender can have in flight before receiving an acknowledgment from the receiver. cwnd starts at 1 MSS (Maximum Segment Size), the largest amount of data that can be carried in a single TCP segment. Each arriving ACK increases cwnd by one segment, so every ACK effectively releases two packets: one because an acknowledged packet has left the network and a new packet can take its place, and one because the ACK opens the congestion window by one more segment. This produces exponential growth, with the congestion window doubling every RTT (round-trip time). The congestion window works together with the receiver's flow-control window (rwnd): a sender keeps at most min(cwnd, rwnd) of data in the network.
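
As a rough illustration (my own sketch, not code from the paper), the per-ACK growth rule makes the window double every round trip, assuming no losses and a sufficiently large rwnd:

# Minimal sketch of slow-start growth, assuming no losses and rwnd large enough
MSS = 1460                               # bytes; a typical Ethernet MSS

cwnd = 1 * MSS
for rtt in range(5):
    print(f"RTT {rtt}: cwnd = {cwnd // MSS} segments")
    acks_this_rtt = cwnd // MSS          # one ACK arrives per segment in flight
    cwnd += acks_this_rtt * MSS          # each ACK opens the window by one segment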

Congestion Avoidance

A 1987 paper by Chiu and Jain, Analysis of the Increase and Decrease Algorithms for Congestion Avoidance, shows the general pattern of response time and throughput of a network as the network load increases.


At first, as load increases, throughput rises rapidly while response time grows slowly. Once load reaches the network capacity, queues start building up and packets are dropped: throughput falls off and response time climbs steeply. The knee marks the point after which throughput increases only slowly while delay grows quickly; the cliff is the point beyond which throughput drops sharply and delay approaches infinity.

Slow-start is used to reach the knee quickly. It initializes the connection with a small window and doubles the amount of data in flight every round trip until the window exceeds a configured slow-start threshold (ssthresh). This threshold is a guess at the location of the knee. Beyond it, the congestion avoidance algorithm takes over; its job is to slow the growth down and stay to the left of the cliff.

Congestion avoidance increases the congestion window by MSS * (MSS / cwnd) on every ACK. This works out to roughly one additional segment per RTT, i.e. the window grows by one segment once a full window of segments has been acknowledged.

Retransmit Timeout Interval

Data segments and acknowledgments can get lost in the network. TCP sets a timeout when it sends data, and if the data is still unacknowledged when the timeout expires, it retransmits the data. A timeout is a strong signal of congestion, since packets are being dropped. When it fires, TCP sets ssthresh to half of the congestion window and resets the congestion window to one segment.

Fast Retransmit and Fast Recovery

A TCP receiver sends a duplicate ACK when it receives a data segment that is out of order. Instead of waiting for the timeout to expire, the fast retransmit algorithm assumes that three duplicate acknowledgments mean the segment was lost and retransmits it immediately. This way TCP avoids waiting for a timeout before retransmission begins.

After a fast retransmit alone, ssthresh would be set to half of the congestion window and the congestion window reset to one segment, forcing a new slow-start. The fast recovery algorithm observes that duplicate ACKs indicate the network is still delivering some segments, so it only cuts the congestion window in half and continues in congestion avoidance. Fast retransmit and recovery (FRR) is implemented by many TCP congestion control variants, such as TCP Reno. The aim is to avoid expensive timeouts and the cost of slow-starting again.

Additive Increase and Multiplicative Decrease

The paper by Chiu and Jain also compares the tradeoffs between different feedback control algorithms to show which one converges to an equal share of a contended link when multiple flows compete. AIMD (additive-increase/multiplicative-decrease) is the policy that converges to a fair and stable allocation. It combines linear growth of the congestion window with a multiplicative reduction when congestion is detected.

a > 0 where a = additive increase parameter
0 < b < 1 where b = multiplicative decrease parameter

if congestion is not detected:
    cwnd = cwnd + a
else:
    cwnd = cwnd * b

In TCP, after slow-start, congestion avoidance sets the additive increase factor a to one MSS per round-trip time and the multiplicative decrease factor b to 1/2.
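
To see the convergence property Chiu and Jain describe, here is a small simulation of my own (not from either paper) of two AIMD flows sharing a bottleneck of capacity C; both flows back off whenever their combined rate exceeds C, and their shares converge toward equality:

# Sketch: two AIMD flows sharing one bottleneck, assuming synchronized feedback
C = 100              # link capacity in arbitrary units
a, b = 1, 0.5        # additive-increase and multiplicative-decrease parameters
x1, x2 = 80.0, 10.0  # start far from the fair allocation

for step in range(200):
    if x1 + x2 > C:               # congestion detected: both flows cut multiplicatively
        x1, x2 = x1 * b, x2 * b
    else:                         # no congestion: both flows add a fixed increment
        x1, x2 = x1 + a, x2 + a

print(round(x1, 1), round(x2, 1))  # after many cycles the two rates are nearly equal

Each multiplicative decrease halves the gap between the two flows while additive increase leaves it unchanged, which is why AIMD converges to fairness.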

Pseudo-code

Initially:
    // rwnd and MSS are set during the TCP handshake
    cwnd = 1 MSS
    ssthresh = 64 KB
    duplicateACKCount = 0

When a data segment is sent:
    window = min(cwnd, rwnd)
    while next_packet < unacked_packet + window:
        transmit next packet

When an ACK is received:
    duplicateACKCount = 0
    if cwnd < ssthresh:
        // slow-start
        cwnd += 1 MSS
    else:
        // congestion avoidance
        cwnd += MSS * (MSS / cwnd)

    // update rwnd from ACK packet

If re-transmit timeout occurs:
    ssthresh = cwnd / 2
    cwnd = 1 MSS
    duplicateACKCount = 0

When a duplicate ACK is received:
    duplicateACKCount += 1
    if duplicateACKCount == 3:
        // fast retransmit: resend the missing segment immediately
        retransmit the missing segment
        // fast recovery: halve the window instead of dropping to 1 MSS
        ssthresh = cwnd / 2
        cwnd = ssthresh

Data Center TCP (DCTCP)

Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data Center TCP (DCTCP). In SIGCOMM '10.

DCTCP is a congestion control protocol designed specifically for the requirements of data center networks. Data centers host applications that generate a mix of short and long flows. Long flows are throughput sensitive, whereas short flows are delay sensitive and require high burst tolerance.

In a conventional TCP model, satisfying both low delay and high throughput is difficult. Low delay can be achieved by keeping switch queues small, but that increases packet drops, which reduces throughput because the sender shrinks its congestion window on every loss. Throughput can be improved with deeper queues, which reduce drops, but since TCP's congestion algorithm reacts only to drops, those queues fill up and the buffered packets add latency. So reducing delay decreases throughput and vice versa.

DCTCP strikes a balance between delay and throughput by letting the sender respond in proportion to the extent of congestion. It uses the Explicit Congestion Notification (ECN) feature available in most modern switches. ECN marking at the switch sets the Congestion Experienced (CE) codepoint on packets as soon as buffer occupancy exceeds a fixed, small threshold. A DCTCP sender then reduces its window by a factor that depends on the fraction of marked packets. This keeps queue occupancy (and hence delay) low while sustaining throughput and avoiding large numbers of drops. The tradeoff is a longer convergence time, the time taken for flows to acquire their fair share of the network: because DCTCP reduces the congestion window gradually, it takes a few RTTs longer than TCP to converge.
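
The sender-side reaction can be sketched as follows; the update rules are the ones described in the paper, while the function names and the exact value of the weight g are illustrative:

# Sketch of DCTCP's sender-side reaction to ECN marks (applied once per window of data)
g = 1 / 16            # weight of the moving average; a typical value used in the paper

def update_alpha(alpha, marked, total):
    """Estimate the fraction of marked packets with an exponential moving average."""
    F = marked / total                # fraction of packets marked CE in the last window
    return (1 - g) * alpha + g * F

def react_to_marks(cwnd, alpha):
    """Cut cwnd in proportion to the estimated extent of congestion."""
    return cwnd * (1 - alpha / 2)     # alpha near 0: tiny cut; alpha near 1: halve, like TCP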


Data centers also suffer performance impairments from the interaction between short and long flows in production clusters. Production switches are shared-memory switches with shallow packet buffers available to all ports. When short and long flows traverse the same switch queue, three problems arise. First, when many short flows converge on the same switch interface over a short period of time, the resulting incast leads to packet loss. Second, even without drops, short flows suffer queue buildup: their latency increases because they sit in a queue behind packets from large flows. Third, short flows experience buffer pressure from long-flow activity on other ports, since activity on different ports is coupled through the shared memory pool.

DCTCP mitigates queue buildup, buffer pressure, and incast. Because senders start reacting as soon as the queue crosses a threshold, queueing delay on congested switch ports stays low. This also relieves buffer pressure, since a congested port's queue never grows very large. Incast is generally created by synchronized small flows hitting the same queue; when each flow has several packets to transmit and windows build up over multiple RTTs, DCTCP's early marking moderates the burst size within one or two RTTs.
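
On the switch side, the marking rule is a simple threshold test on the instantaneous queue length; a rough sketch (the helper names are mine):

# Sketch of DCTCP's switch-side marking rule based on instantaneous queue occupancy
from types import SimpleNamespace

K = 65   # marking threshold in packets; the paper suggests a value around 65 for 10 GbE

def mark_on_arrival(packet, queue_length):
    """Set the CE codepoint instead of dropping when the queue exceeds K."""
    if queue_length > K:
        packet.ce = True              # the receiver echoes the mark back to the sender
    return packet

pkt = mark_on_arrival(SimpleNamespace(ce=False), queue_length=80)
print(pkt.ce)                         # True: the sender will count this packet as marked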

Overall, DCTCP provides low latency and high throughput for a mixed workload in a data center.

Arrakis: The Operating System is the Control Plane

Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The Operating System is the Control Plane. In OSDI '14.

In the last paper, we looked at IX, an operating system design which aimed to increase the I/O performance of server applications by reducing the complexity of a shared monolithic kernel. It divided the OS into a control plane (the kernel) and a data plane (a library-based OS responsible for network I/O).

Similarly, today we look at Arrakis, which also splits the kernel to speed up application I/O. The main difference between Arrakis and IX is how network I/O is handled: Arrakis protects the network stack from the application using control plane operations and hardware features, whereas IX relies on protection rings enforced by the virtualization hardware. Arrakis also adds optimizations for storage, whereas IX only optimizes network performance.

Arrakis uses hardware I/O virtualization to minimize kernel involvement. It focuses on hardware devices that expose an interface to create and destroy multiple virtual instances of themselves, which lets applications perform I/O directly against a virtual device instance. The authors describe two implementations: Arrakis/P, which keeps the existing POSIX interface (for backward compatibility), and Arrakis/N, which runs on a more efficient non-POSIX interface. Both implementations are evaluated to show how they improve on the network and storage overheads of Linux.

Networking Stack Overhead: The paper breaks down the packet-processing overhead into (1) cost of processing at various layers (hardware, IP, and UDP) of the network stack, (2) overhead of scheduling the process and doing a context switch, (3) crossing from kernel to user-space and back, and (4) copying of packet data from kernel to user buffer on receive and back on response.

Arrakis eliminates the scheduling and kernel-crossing costs entirely, since packets are delivered directly to user space. It also reduces the per-layer processing overhead because the kernel no longer has to demultiplex packets to identify the destination application (socket). Arrakis/N additionally supports zero-copy I/O, removing the cost of copying packet data between kernel and user space.
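
Conceptually, the receive path an Arrakis application sees looks something like the polling loop below. This is only a sketch with invented names (vnic_queue, poll, transmit), not Arrakis's actual API; the point is that no step crosses into the kernel:

# Hypothetical sketch of a user-level receive path over a mapped VNIC queue
def receive_loop(vnic_queue, app_handler):
    while True:
        descriptor = vnic_queue.poll()       # read the hardware ring directly, no syscall
        if descriptor is None:
            continue                         # nothing arrived yet; keep polling
        packet = descriptor.buffer           # the packet is already in application memory
        response = app_handler(packet)       # run application logic in the same process
        vnic_queue.transmit(response)        # post the reply on the send ring, no kernel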

Storage Overhead: The overheads of a storage stack can be broken down into (1) data copying between kernel and user space, (2) parameter and access control checks, (3) block and inode allocation, (4) virtualization, and (5) snapshot maintenance.

Arrakis Architecture:

Arrakis uses hardware virtualization to give applications direct access to I/O devices. The control plane manages the virtual device instances, and the data plane is provided as an I/O stack library. This library interfaces directly with the virtual device instance and provides both a POSIX sockets API and Arrakis's native API to the application.

Arrakis also introduces a hardware-independent layer that defines the set of features I/O adapters should provide for virtualization. These features must be implemented in hardware so that the device can take over the work of a traditional kernel.

Network Device Controller:

  1. Arrakis assumes the network device controller provides multiple virtual network interface cards (VNICs). Each VNIC can be mapped into a separate protection domain and assigned to an application.
  2. It should support multiplexing and de-multiplexing of packets to user-space queues. Each VNIC can contain multiple pairs of send and receive queues.
  3. It should provide the ability to create transmit and receive filters responsible for filtering packets to send and receive queues. These filters are based on network packet headers and can be assigned by the kernel control plane.
  4. It should also have support for creating rate limiters for inter-packet pacing of I/O to queues. This is also a privileged operation and can be setup by the kernel control plane.

Storage Device Controller:

  1. For the storage device controller, the Arrakis hardware model requires access to multiple virtual storage interface controllers (VSICs).
  2. It should provide independent storage command queues multiplexed by hardware.
  3. It should also expose functions to map virtual storage areas (VSAs) to extents of physical drives and associate them with VSICs.

Doorbells:

  1. Doorbells are the way of handling asynchronous events in Arrakis.
  2. Each queue inside a VNIC or VSIC has an associated doorbell, which notifies the application that an event such as an I/O completion or a packet arrival has occurred.
  3. Doorbells are delivered as a hardware-virtualized interrupt when an application is running or through the control plane when the application needs to be scheduled.


Overall, Arrakis splits the control plane and the data plane to remove the kernel from the critical I/O path. The control plane is invoked infrequently and is responsible for naming, access control, and resource limits. The data plane consists of the I/O devices themselves, which handle protection, multiplexing, and I/O scheduling. Applications link against shared library versions of the drivers; these libraries interface directly with the virtual function device driver and provide the API layer.

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In OSDI '14.

Today we have a paper which is inspired by the Dune model we looked at in the previous post. Here is a link to it for reference: Dune: Safe User-level Access to Privileged CPU Features.

IX uses the Dune architecture to separate the core functions of the kernel from network processing. The main motivation behind IX is to remove the tradeoff between high throughput and low latency while still providing strong protection and resource efficiency. This is needed to satisfy requirements such as microsecond-level responses to high rates of incoming requests in today's large-scale applications.

As we know from the Dune architecture, processes can use the virtualization hardware to get direct and safe access to privileged CPU features. IX uses this to isolate the kernel, which runs in VMX root mode, from the networking stack, which runs in VMX non-root mode. IX calls the core kernel the control plane and the library-based OS that does the network processing the dataplane.

Control Plane

The control plane consists of the Linux kernel and a user-level program, IXCP. IXCP monitors resource usage and dataplane performance, and manages the resource allocation policies across dataplanes.

DataPlane

Dataplanes are given direct access to Network Interface Card (NIC) queues through memory-mapped I/O. Each dataplane runs a single, multi-threaded application in a single address space on dedicated cores, so multiple IX dataplanes are run to support multiple applications such as httpd and memcached. A dataplane runs in two modes: kernel mode (VMX non-root ring 0) for packet processing and user mode (VMX non-root ring 3) for application logic. This isolates the networking stack from untrusted application code.

Run to completion and Adaptive Batching

IX adopts a run-to-completion policy: the networking stack and the application logic run sequentially, to completion, for each batch of packets. This improves both throughput and latency because the two stages access the same data, increasing locality. IX also uses adaptive batching: the dataplane never waits to accumulate a batch, but when packets have queued up under load, it processes them together as a batch. Besides improving packet rate, adaptive batching reduces head-of-line blocking and increases instruction cache locality.
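
A rough sketch of what this looks like in an elastic thread's main loop; the names and the batch bound are invented for illustration, not taken from the IX code:

# Hypothetical sketch of run-to-completion with adaptive batching in an IX-style thread
MAX_BATCH = 64                          # an upper bound keeps per-batch latency in check

def elastic_thread_loop(rx_ring, tcp_stack, app):
    while True:
        # Adaptive batching: take whatever has queued up, but never more than MAX_BATCH
        packets = rx_ring.poll(limit=MAX_BATCH)
        events = [tcp_stack.process(p) for p in packets]   # "kernel mode": protocol work
        responses = app.handle(events)                     # "user mode": application logic
        for r in responses:
            tcp_stack.transmit(r)                          # finish the batch before polling again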


IX uses multi-queue NICs to distribute network processing efficiently across the hardware threads run by dataplanes. The NIC applies a hash function to each packet's headers and maps it to a queue served by a single hardware thread. The control plane configures this mapping to balance traffic across threads.

Each dataplane supports two kinds of hardware threads: elastic threads, which interact with the dataplane to initiate and consume network I/O, and background threads for application-specific tasks such as garbage collection. The steps of a run-to-completion operation of an elastic thread in the IX dataplane are:

  1. The elastic thread polls the receive descriptor ring and posts new buffer descriptors for the NIC to use with future incoming packets.
  2. It processes a bounded number of packets through the TCP/IP stack (dataplane kernel mode) and generates event conditions.
  3. The elastic thread then switches to dataplane user mode, where the application consumes all event conditions. The buffer used for the incoming packet is passed to userspace read-only to allow zero-copy operation.
  4. The application processes the requests and responds with a batch of system calls.
  5. The thread switches back to kernel mode and processes all batched system calls.
  6. It runs all kernel timers in order to ensure compliant TCP behavior.
  7. It then places the outgoing Ethernet frames in the NIC's transmit descriptor ring.


In conclusion, IX shows an efficient and reliable way to implement the networking stack by leveraging hardware virtualization and CPU features such as privileged modes and memory-mapped I/O.


Dune: Safe User-level Access to Privileged CPU Features

Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. 2012. Dune: Safe User-level Access to Privileged CPU Features. In OSDI '12.

An operating system can be divided into two layers: userspace and the kernel. Applications such as the browser reside in userspace, while the kernel handles sensitive resources and maintains isolation between applications. The operating system prevents userspace applications from accessing resources reserved for the kernel.

Today's paper presents a new architecture, Dune, which exposes privileged kernel features directly to applications in a safe way. The hardware features normally reserved for the kernel have strong use cases at user level, such as speeding up garbage collection, privilege separation within a process, and sandboxing untrusted code.

Virtualization Hardware

Virtualization is the practice of running virtual machines on top of an existing operating system. The host machine runs a hypervisor, or Virtual Machine Monitor (VMM), which creates and manages virtual machines and runs them in guest mode. Guest mode provides access to privileged CPU features.

Dune uses virtualization hardware to provide this functionality. Instead of a VMM that creates a whole virtual machine (essentially a separate kernel and stack) in guest mode, Dune is a loadable kernel module that runs ordinary processes in guest mode. The Dune module sits in the host kernel; it configures the virtualization hardware and extends the process abstraction so processes can run in guest mode. Processes run in guest mode are called Dune processes and get restricted but direct access to privileged CPU features. The kernel can still run normal processes in host mode, whose CPU behavior is unchanged.

Intel VT-x is a virtualization extension to the x86 ISA which provides two operating modes of the CPU: VMX root mode, i.e. the host mode that runs the VMM, and VMX non-root mode, which runs the virtualized guest OS. Transitions between the two modes are handled by hardware: a VM entry switches from root to non-root mode and a VM exit switches back. A VM Control Structure (VMCS) stores guest state across these transitions.

The Dune module enables VT-x and runs the kernel in VMX root mode. A Dune-enabled ioctl (a device-specific system call) is provided to switch a normal process into a Dune process. A Dune process gets access to hardware features such as exceptions, privileged modes, and virtual memory. Dune also maintains a separate VMCS for each thread of a process, so a process can run threads in root and non-root mode simultaneously.

Dune also provides a library, libdune, for user programs: a set of utility functions that help applications manage these privileged hardware features, including exception and syscall handling, a page table manager, an ELF loader, and more.


Memory Management

Exceptions and privileged modes are explicitly available in Dune processes. But providing access to memory is a bit tricky.

Virtual memory maps the memory addresses of a program to physical memory. It acts as an abstraction so that processes do not have to manage shared memory directly and are protected from one another through memory isolation. An x86 page table, maintained by the kernel, is used to translate host-virtual addresses to host-physical addresses. This mechanism is called paging.

Ideally, non-root processes want direct access to page tables so that they can benefit from fast virtual memory operations. But granting that naively compromises security, since arbitrary access to physical memory must be prevented. So paging translation happens in two steps for Dune processes: a user-controlled page table maps guest-virtual to guest-physical addresses, and the Extended Page Table (EPT) provided by VT-x, controlled by the kernel, then translates guest-physical to host-physical addresses. Dune configures the EPT to reflect the full address space of the process.
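
The two-level lookup can be illustrated with a toy model; real page tables are multi-level radix trees walked by hardware, but the composition of the two translations is the same:

# Toy model of Dune's two-step address translation (dictionaries stand in for page tables)
guest_page_table = {0x1000: 0x5000}   # guest-virtual -> guest-physical, user-controlled
ept = {0x5000: 0x9000}                # guest-physical -> host-physical, kernel-controlled

def translate(guest_virtual):
    guest_physical = guest_page_table[guest_virtual]   # first walk: the process's own table
    host_physical = ept[guest_physical]                # second walk: the EPT set up by Dune
    return host_physical

print(hex(translate(0x1000)))          # -> 0x9000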


In a conventional OS, Translation Lookaside Buffers (TLBs) cache page table translations to reduce the time taken to access a memory location. The TLB is usually flushed on a process switch because the virtual-to-physical mapping changes. Newer Intel and AMD architectures support tagged TLBs: each entry carries a tag identifying the address space it belongs to, which avoids the flush.

Dune processes may perform VM entries and VM exits frequently to reach the kernel, which would normally flush the TLB. So Virtual Processor Identifiers (VPIDs) are used to tag linear translations and avoid flushing the TLB.

Syscalls and Privilege Modes

The operating system uses protection rings to guard data from untrusted code. x86 has four privilege levels, from 0 (most privileged) to 3 (least privileged). Level 0 runs the kernel and level 3 runs applications. A general protection fault is raised if less privileged code attempts a more privileged operation. System calls are how applications request services from the kernel.

VMX non-root mode maintains its own set of privilege rings. This lets a Dune process enforce hardware protection within itself, i.e. it can sandbox untrusted code. System calls made inside a Dune process trap into the process itself instead of into the kernel, so the process can prevent untrusted code from reaching the kernel directly. When the process does want to invoke the kernel, it uses VMCALL, the hypercall instruction: a hypercall is a trap from non-root mode to root mode, similar to how a syscall is a trap from an application into the kernel.
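
As a conceptual sketch of how a sandbox can use this, the Dune process installs a handler for trapped system calls and decides which ones to forward to the kernel as hypercalls. The handler and helper below are invented for illustration and are not libdune's real API:

# Conceptual sketch of syscall interposition in a Dune-style sandbox (names are invented)
ALLOWED = {"read", "write", "exit"}        # hypothetical whitelist for the sandboxed code

def on_trapped_syscall(name, args):
    """Runs when sandboxed code executes SYSCALL, which traps into the Dune process."""
    if name not in ALLOWED:
        raise PermissionError(f"syscall {name} blocked by sandbox")
    return forward_as_hypercall(name, args)   # only whitelisted calls reach the kernel

def forward_as_hypercall(name, args):
    # Stand-in for the VMCALL that exits to the host kernel; here we just simulate it
    return ("forwarded", name, args)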

Overall, Dune utilizes virtualization hardware to create lightweight virtual processes with access to privileged hardware features.

RFP: When RPC is Faster than Server-Bypass with RDMA

Maomeng Su, Mingxing Zhang, Kang Chen, Zhenyu Guo, and Yongwei Wu. 2017. RFP: When RPC is Faster than Server-Bypass with RDMA. In EuroSys '17.

This paper proposes a new RDMA-based RPC approach called the Remote Fetching Paradigm (RFP) which is intended to increase the performance and reduce the design cost in implementing RDMA for RPC.

Remote Procedure Call (RPC): the client sends a request to a remote server, and the server is responsible for sending a reply. The paper calls this approach server-reply. Drawback:

In-Bound vs Out-Bound asymmetry:

    In RDMA, the issuer of an operation (outbound RDMA) is responsible for ensuring that the operation is sent and processed, so it performs more work than the side receiving the request (inbound RDMA). Server-reply, i.e. RPC implemented on top of RDMA, is suboptimal because the server must send the reply back and is therefore bound by the limits of outbound RDMA.

Remote Direct Memory Access (RDMA): clients access the server's memory directly, bypassing the server's CPU and kernel (zero-copy networking). The paper calls this approach server-bypass. Drawbacks:

Bypass Access Asymmetry:

    When CPU processing is bypassed on the server, multiple clients need to coordinate their access to avoid conflicts. This increases the number of RDMA operation rounds.

Redesign Cost:

    New data structures need to be designed which increases the porting cost.

Terminology

An RNIC (RDMA Network Interface Controller) is a special adapter that provides support for RDMA; it must be present at both ends of the connection. RDMA also requires that the memory used be registered with the RNIC. To facilitate this, the RNIC has a local_buf and provides malloc_buf and free_buf functions to allocate and free buffers registered with it. RDMA_Read and RDMA_Write are the functions used to read from and write to buffers allocated in the RNIC.

Design Choices of RDMA-Based RPC

Client-polling: the server buffers the result in its local memory instead of sending it back to clients through outbound RDMA operations. That is, the server does not call RDMA_Write to place the response in the local_buf of the client's RNIC; instead, clients call RDMA_Read to fetch the result remotely into their own local_buf.
Since at least one side must issue the outbound RDMA operation, RFP chooses to leave the server with only inbound operations and has the client perform the outbound ones.

The client switches to server-reply when the number of remote-fetch retries exceeds a threshold R; beyond that point the server performs the RDMA_Write itself. A mode_flag on each RPC request designates whether remote fetching or server-reply is in use. This tuning reduces the number of unnecessary RDMA operations.

The client uses a default fetching size F when fetching a response from the server. When the result is no larger than F, only one RDMA operation is required. Otherwise, the client learns the size of the result from the response header during the first operation and performs an additional read to fetch the remaining data.

In RFP, the fetching size F and mode switch threshold R are adaptively set according to application characteristic and RNIC configuration.
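
Putting these pieces together, the client-side logic might look like the sketch below. The operation names RDMA_Read and mode_flag come from the paper's description, but the control flow, the header layout, and the connection object are my own illustration:

# Sketch of RFP-style remote fetching on the client (header layout and conn are invented)
import struct
from types import SimpleNamespace

F = 4096        # default fetching size in bytes, tuned per application and RNIC
R = 10          # retry threshold before falling back to server-reply mode

def parse_header(chunk):
    """Assume the first 8 bytes carry a ready flag and the total result size."""
    ready, size = struct.unpack_from("<II", chunk)
    return SimpleNamespace(ready=bool(ready), result_size=size)

def fetch_response(conn, offset):
    """The client issues the outbound RDMA reads; the server only sees inbound operations."""
    for _ in range(R):
        chunk = conn.RDMA_Read(offset, length=F)          # poll the server's result buffer
        header = parse_header(chunk)
        if header.ready:
            if header.result_size > F:                    # large result: one more read for the rest
                chunk += conn.RDMA_Read(offset + F, length=header.result_size - F)
            return chunk
    conn.mode_flag = "server-reply"                       # too many empty polls: fall back
    return conn.wait_for_server_reply()                   # the server will RDMA_Write the result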

Server push: the server remains responsible for processing incoming requests. This avoids the multiple RDMA rounds needed to resolve data conflicts under server-bypass, as well as the redesign of application-specific data structures. The server reads a request from the local_buf in its RNIC and puts the response back into its local_buf, from where clients fetch it.


In summary, RFP brings the strengths of RDMA to RPC while avoiding its weaknesses.