Tuesday, January 21, 2014

An Introduction to the InfiniBand Architecture

In 1965, Dr. Gordon Moore observed that the industry was able to double the transistor density on a manufactured die every year (Gordon E. Moore, "The Continuing Silicon Technology Evolution Inside the PC Platform"). This observation became popularly known as Moore's Law, and almost 40 years later it still holds as a fairly accurate estimate of the growth of transistor density (a more accurate estimate of the growth rate, which encompasses growth data from the past 45 years, is a doubling of the density every 18 months).
This fast-paced growth in transistor density translates into CPU-performance increases of a similar magnitude, making applications such as data mining, data warehousing, and e-business commonplace. To reap the benefits of this growth in computational power, however, the I/O subsystem must be able to deliver the data needed by the processor subsystem at the rate at which it is needed. In the past couple of years, it has become clear that the current shared-bus architecture will become the bottleneck of the servers that host these powerful but demanding applications.
Performance is just one dimension of the growing demands imposed on the I/O subsystem. As Wendy Vittori, general manager of Intel's I/O products division, said, "The growth of e-Commerce and e-Business means more data delivered to more and more users. This data needs to move faster, with higher reliability and quality of service than ever before." (Intel, I/O Building Blocks, Ultimate I/O Performance.) E-commerce and e-business applications need to be available 24/7 to process transactions at very high rates. This implies that the desired I/O architecture must provide enhanced reliability and availability in addition to raw performance.
In this article, I will review each of the main architectural features of InfiniBand as a solution to the corresponding limitation of the current I/O subsystem. The InfiniBand specification defines the architecture of the interconnect that will pull together the I/O subsystems of the next generation of servers and will ultimately even move to the powerful desktops of the future. The architecture is based on a serial, switched fabric that, in addition to defining link bandwidths between 2.5 and 30 Gbits/sec, resolves the scalability, expandability, and fault tolerance limitations of the shared bus architecture through the use of switches and routers in the construction of its fabric. But before we delve into that discussion, let's see who is behind this new I/O architecture and how it came about.

Who is working on it?

A few years ago, when the impending bottleneck at the I/O subsystem became a clear vision of the future, a number of industry leaders decided to take action and come up with the design for the I/O subsystem of the future. As is common in the computer industry, two competing efforts started: one called the Future I/O initiative, which included Compaq, IBM, and Hewlett-Packard; and the other called the Next-Generation I/O initiative, which included Dell, Hitachi, Intel, NEC, Siemens, and Sun Microsystems. As the first versions of those specifications started to become available, the two groups decided to unify their efforts by bringing together the best ideas of each of the two separate initiatives. So, in August of 1999, seven industry leaders, Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems, formed the InfiniBand Trade Association (ITA) (InfiniBand Trade Association, "What is the InfiniBand Trade Association.").
In addition to those seven steering committee members, the ITA consists of 11 sponsoring members, including among others 3Com, Cisco, and EMC, and more than 200 member companies. By joining their efforts, the ITA was able both to eliminate the confusion in the market that would have resulted from the coexistence of two competing standards and to release a first version of the specification very quickly. Version 1.0 of the InfiniBand Architecture Specification was released in October of 2000, and the 1.0a version, consisting mainly of minor changes to the 1.0 version, was released in June of 2001. The specification is available for download from the ITA Web site for free, even for non-members.

What are its main architectural features?

The Peripheral Component Interconnect (PCI) bus, first introduced in the early 90's, is the dominant bus used in both desktop and server machines for attaching I/O peripherals to the CPU/memory complex. The most common configuration of the PCI bus is a 32-bit, 33MHz version that provides a bandwidth of 133MB per second, although the 2.2 version of the specification allows for a 64-bit version at 33MHz for a bandwidth of 266MB per second, and even a 64-bit, 66MHz version for a bandwidth of 533MB per second. Even today's powerful desktop machines still have plenty of headroom on the PCI bus in its typical configuration, but server machines are starting to hit the upper limits of the shared bus architecture. The availability of multiport Gigabit Ethernet NICs, along with one or more Fibre Channel I/O controllers, can easily consume even the highest 64-bit, 66MHz version of the PCI bus.
To resolve this limitation on the bandwidth of the PCI bus, a number of interim solutions, such as PCI-X and PCI DDR, are becoming available in the market (Mellanox Technologies, "Understanding PCI Bus, 3GIO, and InfiniBand Architecture"). Both are backward-compatible upgrade paths from the current PCI bus. The PCI-X specification allows for a 64-bit version of the bus operating at a clock rate of 133 MHz, but this is achieved by easing some of the timing constraints. The shared-bus nature of PCI-X forces it to lower its fanout in order to achieve the high clock rate of 133 MHz.
A PCI-X system running at 133 MHz can have only one slot on the bus, two PCI-X slots would allow a maximum clock rate of 100 MHz, and a four-slot configuration would drop down to a clock rate of 66 MHz (Compaq Computer Corporation, "PCI-X: An Evolution of the PCI Bus," September 1999, TC990903TB.). So, despite the temporary relief from the PCI bandwidth limitation that these upgrade technologies provide, a long-term solution is needed that does not rely on a shared bus architecture.
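To put these numbers in perspective, here is a small C sketch (purely for illustration) showing how the shared-bus figures quoted above are derived: peak bandwidth is simply the bus width in bytes times the clock rate, and every device on the bus competes for that single number.

#include <stdio.h>

/* Peak bandwidth of a shared parallel bus: width in bytes times clock rate.
 * These are the theoretical figures quoted above; real throughput is lower,
 * and every device on the bus shares this one number. */
static double bus_peak_mb_per_s(int width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz;
}

int main(void)
{
    printf("PCI   32-bit/33 MHz : %4.0f MB/s\n", bus_peak_mb_per_s(32, 33.3));
    printf("PCI   64-bit/33 MHz : %4.0f MB/s\n", bus_peak_mb_per_s(64, 33.3));
    printf("PCI   64-bit/66 MHz : %4.0f MB/s\n", bus_peak_mb_per_s(64, 66.6));
    printf("PCI-X 64-bit/133 MHz: %4.0f MB/s\n", bus_peak_mb_per_s(64, 133.3));
    return 0;
}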
InfiniBand breaks through the bandwidth and fanout limitations of the PCI bus by migrating from the traditional shared bus architecture into a switched fabric architecture. Figure 1, below, shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device such as a server or an I/O device such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case or a collection of interconnected switches and routers. I will describe the difference between a switch and a router a little later but those of you with a networking background have probably guessed already what the difference is.
Figure 1. InfiniBand Basic Fabric Topology
Each connection between nodes, switches, and routers is a point-to-point, serial connection. This basic difference brings about a number of benefits:
  • Because it is a serial connection, it requires only four signal wires, as opposed to the wide parallel connection of the PCI bus.
  • The point-to-point nature of the connection provides the full capacity of the connection to the two endpoints because the link is dedicated to the two endpoints. This eliminates the contention for the bus as well as the resulting delays that emerge under heavy loading conditions in the shared bus architecture.
  • The InfiniBand channel is designed for connections between hosts and I/O devices within a data center. Due to the well-defined, relatively short length of the connections, much higher bandwidth can be achieved than in cases where much longer connection lengths must be supported.
The InfiniBand specification defines the raw bandwidth of the base 1x connection at 2.5Gb per second. It then specifies two additional bandwidths, referred to as 4x and 12x, as multipliers of the base link rate. At the time that I am writing this, there are already 1x and 4x adapters available in the market. So, InfiniBand will be able to achieve much higher data transfer rates than are physically possible with the shared bus architecture, without the fan-out limitations of the latter.
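The multipliers are straightforward, but a small sketch makes the comparison with the PCI figures above concrete. The rates below are the raw signalling rates from the specification; as a side note not drawn from the original article, InfiniBand links use 8b/10b encoding, so the usable data rate is 80 percent of the raw rate.

#include <stdio.h>

int main(void)
{
    const double base_gbps = 2.5;        /* raw signalling rate of a 1x link */
    const int widths[] = { 1, 4, 12 };   /* link widths defined by the spec  */

    for (int i = 0; i < 3; i++)
        printf("%2dx link: %4.1f Gbit/s raw, %4.1f Gbit/s usable data\n",
               widths[i], widths[i] * base_gbps,
               widths[i] * base_gbps * 0.8 /* 8b/10b encoding overhead */);
    return 0;
}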
Let's now dig a little deeper into the architecture of InfiniBand to explore some of its additional benefits. Figure 2, below, illustrates a system area network that utilizes the InfiniBand architecture. In the example shown in the figure, the fabric consists of three switches that connect the six end nodes together. Each node connects to the fabric through a channel adapter. The InfiniBand specification classifies channel adapters into two categories: Host Channel Adapters (HCA) and Target Channel Adapters (TCA).
HCAs are present in servers or even desktop machines and provide the interface used to integrate InfiniBand with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Each channel adapter may have one or more ports. As you can also see in the figure, a channel adapter with more than one port may be connected to multiple switch ports. This allows for multiple paths between a source and a destination, resulting in performance and reliability benefits.
By having multiple paths available in getting the data from one node to another, the fabric is able to achieve transfer rates at the full capacity of the channel, avoiding congestion issues that arise in the shared bus architecture. Furthermore, having alternative paths results in increased reliability and availability since another path is available for routing of the data in the case of failure of one of the links.
Two more unique features of the InfiniBand Architecture that become evident in Figure 2 are the ability to share storage devices across multiple servers and the ability to perform third-party I/O. Third-party I/O refers to the ability of two storage devices to complete an I/O transaction without the direct involvement of a host, other than in setting up the operation. This feature is extremely important from the performance perspective, since many such I/O operations between two storage devices can be completely off-loaded from the server, eliminating the CPU cycles that would otherwise be consumed.
Figure 2. System Area Network based on the InfiniBand Architecture
Host and Target Channel Adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host generates the packets that are then consumed by the storage device. In contrast to the channel adapters, switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one Subnet Manager that is responsible for the configuration and management of the subnet.
Routers are like switches in that they simply forward packets between their ports. The difference, however, is that a router is used to interconnect two or more subnets to form a larger, multi-domain system area network. Within a subnet, each port is assigned by the Subnet Manager a unique identifier called the Local ID, or LID. In addition to the LID, each port is assigned a globally unique identifier called the GID. Switches make use of the LIDs for routing packets from the source to the destination, whereas routers make use of the GIDs for routing packets across domains. More detailed information on LIDs, GIDs, and their assignment is available either in the specification or in William Futral's book (William T. Futral, InfiniBand Architecture: Development and Deployment. A Strategic Guide to Server I/O Solutions, Intel Press, 2001).
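To make the two identifiers a bit more concrete, here is a rough C sketch. The struct layout, field names, and forwarding-table shape are mine, not the specification's; the sizes, however, reflect the architecture: a LID is a 16-bit, subnet-scoped value that switches look up in their forwarding tables, while a GID is a 128-bit, IPv6-style value (subnet prefix plus port GUID) that routers use between subnets.

#include <stdint.h>

/* Illustrative only: the struct layout and field names are mine, not the
 * specification's.  A LID is a 16-bit identifier assigned by the Subnet
 * Manager and is meaningful only within one subnet; switches forward on it.
 * A GID is a 128-bit, IPv6-style identifier (64-bit subnet prefix plus
 * 64-bit port GUID) that routers use to forward packets between subnets. */
struct ib_port_address {
    uint16_t lid;       /* local, subnet-scoped identifier */
    uint8_t  gid[16];   /* global identifier               */
};

/* A toy switch forwarding step: look the destination LID up in a table that
 * maps LIDs to output ports (a sketch of the idea, not the real data path). */
static int output_port_for(const uint8_t lid_to_port[65536], uint16_t dest_lid)
{
    return lid_to_port[dest_lid];
}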
One more feature of the InfiniBand Architecture that is not available in the current shared-bus I/O architecture is the ability to restrict which ports within the fabric can communicate with one another. This is useful for partitioning the available storage across one or more servers for management and/or security reasons.


Before I discuss a few more important benefits of the InfiniBand Architecture, we need to dig a little deeper. Figure 3 illustrates the communications stack of the InfiniBand Architecture. Before I go through each of the terms that appear in the figure, we need to understand the drivers that brought about the structure of the InfiniBand network stack.
Figure 3. InfiniBand Communication Stack
In order to achieve better performance and scalability at a lower cost, system architects have come up with the concept of clustering, where two or more servers are connected together to form a single logical server. To get the most benefit from clustering multiple servers, the protocol used for communication between the physical servers must provide high bandwidth and low latency. Unfortunately, full-fledged network protocols such as TCP/IP, in order to achieve good performance across both LANs and WANs, have become so complex that they incur considerable latency and require many thousands of lines of code to implement.


To overcome these issues, Compaq, Intel, and Microsoft joined forces and came up with the Virtual Interface (VI) Architecture (VIA) specification, which was released in December of 1997. The VI Architecture is a server messaging protocol whose focus is to provide a very low latency link between the communicating servers. The specification defines four basic components: virtual interfaces, completion queues, VI Providers, and VI Consumers. The VIA specification describes each of these components in detail; I won't do the same here, and will only provide enough high-level information for you to understand how the low latency is achieved.
In transferring a block of data from one server to another, latency arises in the form of overhead and delays that are added to the time needed to transfer the actual data. If we were to break down the latency into its components, the major contributors would be: a) the overhead of executing network protocol code within the operating system, b) context switches to move in and out of kernel mode to receive and send out the data, and c) excessive copying of data between the user level buffers and the NIC memory.
Since VIA was only intended to be used for communication across the physical servers of a cluster (in other words across high-bandwidth links with very high reliability), the specification can eliminate much of the standard network protocol code that deals with special cases. Also, because of the well-defined environment of operation, the message exchange protocol was defined to avoid kernel mode interaction and allow for access to the NIC from user mode. Finally, because of the direct access to the NIC, unnecessary copying of the data into kernel buffers was also eliminated since the user is able to directly transfer data from user-space to the NIC. In addition to the standard send/receive operations that are typically available in a networking library, the VIA provides Remote Direct Memory Access operations where the initiator of the operation specifies both the source and destination of a data transfer, resulting in zero-copy data transfers with minimum involvement of the CPUs.
The reason I have spent so much time talking about the VIA in an article about the InfiniBand Architecture is that InfiniBand essentially uses the VIA primitives for its operation at the transport layer. Now we can return to Figure 3 and describe the terms that are shown there. In order for an application to communicate with another application over InfiniBand, it must first create a work queue that consists of a queue pair (QP). For the application to execute an operation, it must place a work queue element (WQE) in the work queue. From there, the operation is picked up for execution by the channel adapter. The work queue thus forms the communications medium between applications and the channel adapter, relieving the operating system of this responsibility.
Each process may create one or more QPs for communication with other applications. Instead of all processes having to arbitrate for the use of the single queue of the NIC, as in a typical operating system, each queue pair has its own associated context. Since both the protocol and the structures are very clearly defined, queue pairs can be implemented in hardware, thereby off-loading most of the work from the CPU. Once a WQE has been processed properly, a completion queue element (CQE) is created and placed in the completion queue. The advantage of using the completion queue for notifying the caller of completed WQEs is that it reduces the number of interrupts that would otherwise be generated.
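As a concrete illustration of the queue-pair model, here is a minimal C sketch written against the libibverbs API, which postdates this article but maps directly onto the concepts just described (QPs, WQEs, and CQEs). It is a sketch, not a complete program: error handling and the QP state transitions needed before traffic can flow are omitted.

#include <infiniband/verbs.h>

int sketch_queue_pair(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

    /* The completion queue: the channel adapter places a CQE here for every
     * WQE it finishes, so the application polls for completions instead of
     * taking an interrupt per operation. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* The queue pair: a send queue and a receive queue with their own
     * context, implemented by the channel adapter. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* Reliable Connection service type */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* (QP state transitions and connection setup omitted.)  Completed work
     * is later harvested from the completion queue: */
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);   /* n == 1 when a CQE is available */

    (void)qp; (void)n;
    ibv_free_device_list(devs);
    return 0;
}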
The list of operations supported by the InfiniBand architecture at the transport level for send queues is as follows:
  1. Send/Receive: supports the typical send/receive operation where one node submits a message and another node receives that message. One difference between the implementation of the send/receive operation under the InfiniBand architecture and traditional networking protocols is that the InfiniBand defines the send/receive operations as operating against queue pairs.
  2. RDMA-Write: this operation permits one node to write data directly into a memory buffer on a remote node. The remote node must of course have given appropriate access privileges to the node ahead of time and must have memory buffers already registered for remote access. (A short sketch of posting such an operation appears after the receive-queue operation below.)
  3. RDMA-Read: this operation permits one node to read data directly from the memory buffer of a remote node. The remote node must of course have given appropriate access privileges to the node ahead of time.
  4. RDMA Atomics: this name actually refers to two different operations that have the same effect but operate differently from one another. The Compare & Swap operation allows a node to read a memory location and, if its value is equal to a specified value, write a new value into that memory location. The Fetch & Add operation reads a value, returns it to the caller, adds a specified number to it, and stores the result back at the same address.
For receive queues, the only type of operation is:
  1. Post Receive Buffer: identifies a buffer into which a client may send or receive data through a Send, RDMA-Write, or RDMA-Read operation.
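As a rough sketch of how an RDMA-Write work request and a receive buffer are posted, here is the corresponding libibverbs-style code (again, a later API used only for illustration). The protection domain and queue pair come from the earlier sketch, and the remote buffer's address and key are assumed to have been exchanged out of band during connection setup.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch only: `pd` and `qp` come from the earlier example, the remote
 * address and rkey are assumed to have been exchanged out of band, and all
 * error checking is omitted. */
void sketch_post_work(struct ibv_pd *pd, struct ibv_qp *qp,
                      uint64_t remote_addr, uint32_t remote_rkey)
{
    static char buf[4096];

    /* A buffer must be registered with the channel adapter before it can be
     * the source or target of a transfer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_sge sge = { .addr   = (uintptr_t)buf,
                           .length = sizeof(buf),
                           .lkey   = mr->lkey };

    /* RDMA-Write: place the contents of buf directly into the remote node's
     * registered buffer, without involving the remote CPU. */
    struct ibv_send_wr wr = { 0 }, *bad_wr;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;
    ibv_post_send(qp, &wr, &bad_wr);

    /* Post Receive Buffer: hand the adapter a buffer into which an incoming
     * Send from the peer may be placed. */
    struct ibv_recv_wr rwr = { 0 }, *bad_rwr;
    rwr.sg_list = &sge;
    rwr.num_sge = 1;
    ibv_post_recv(qp, &rwr, &bad_rwr);
}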
When a QP is created, the caller may associate with the QP one of five different transport service types (a short sketch of how the type is selected follows the list below). A process may create and use more than one QP, each of a different transport service type. The InfiniBand transport service types are:
  • Reliable Connection (RC): reliable transfer of data between two entities.
  • Unreliable Connection (UC): unreliable transfer of data between two entities. As with RC, only two entities are involved in the data transfer, but messages may be lost.
  • Reliable Datagram (RD): the QP can send and receive messages from one or more QPs using a reliable datagram channel (RDC) between each pair of reliable datagram domains (RDDs).
  • Unreliable Datagram (UD): the QP can send and receive messages from one or more QPs; however, messages may get lost.
  • Raw Datagram: the raw datagram is a data link layer service which provides the QP with the ability to send and receive raw datagram messages that are not interpreted.
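Tying this back to the earlier queue-pair sketch: in the libibverbs mapping of these concepts (a later API, used here only for illustration), the transport service type is chosen when the QP is created. Only the three types that the library exposes directly are shown below.

#include <infiniband/verbs.h>

/* Illustration only: in libibverbs the service type is fixed when the QP is
 * created.  Only the three types the library exposes directly are shown. */
static struct ibv_qp *create_qp_of_type(struct ibv_pd *pd, struct ibv_cq *cq,
                                        enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = type,   /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    };
    return ibv_create_qp(pd, &attr);
}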
This article is not meant to be a complete coverage of the InfiniBand Architecture specification, and as such I have left many implementation details out. I do include references, though, that should provide additional reading for those interested in learning more about it.

What state is the InfiniBand Architecture in and what is Microsoft doing about it?

Due to the overwhelming support in the industry for this new I/O standard architecture, InfiniBand development has been able to move very quickly from specification to actual products appearing in the market. Hardware for putting together InfiniBand fabrics started to appear in the 2nd quarter of 2001, and beta OS support was expected to be available in most popular operating systems by the 4th quarter of 2001. Customer trials are expected after the 2nd quarter of 2002, with production systems starting to become available by the 3rd quarter of 2002.
At the Microsoft WinHEC 2001, Rob Haydt, Program Manager of the Windows Base OS Group, gave a presentation titled "Windows InfiniBand Support Roadmap" (Rob Haydt, "Windows InfiniBand Support Roadmap," WinHEC 2001.). In that presentation he indicated that InfiniBand will be supported in the first release of the Windows Whistler operating system. Second-generation IB development will take place between late 2002 and 2003, and significant commercial deployments will not begin until the 2003-2004 timeframe.
If you are eager to try out some of the concepts described in this article but you don't have access to InfiniBand hardware, the easiest way to do so at this point is through Winsock Direct. Winsock Direct is an alternative implementation of the Winsock library that transparently takes advantage of support in the hardware to implement RDMA and kernel-bypass optimizations. It initially became available in Windows 2000 Datacenter Server but is now also available in the Service Pack 2 release of Windows 2000 Advanced Server. The white paper by Jim Pinkerton that we have referenced describes in detail how Winsock Direct works and includes a good deal of numerical data from experiments the author conducted, so it makes for some interesting reading (Jim Pinkerton, "Winsock Direct: The Value of System Area Networks," Microsoft Corporation, May 2001). Other related developments include the definition of the SCSI RDMA Protocol (SRP) over InfiniBand, which is a work in progress, and the definition of the Sockets Direct Protocol (SDP), whose goal is to define a sockets-type API over InfiniBand.
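The word "transparently" is the key point: an application keeps using ordinary Winsock calls, and it is the Winsock Direct provider underneath that decides whether a connection is carried over the system area network with RDMA and kernel bypass. The sketch below is plain, unmodified Winsock client code (the address and port are placeholders), exactly what such an application would look like.

#include <winsock2.h>
#include <string.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    /* Ordinary Winsock calls; a Winsock Direct provider can carry this
     * connection over the system area network with RDMA and kernel bypass
     * without any change to the code below. */
    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family      = AF_INET;
    peer.sin_port        = htons(5001);                /* placeholder port    */
    peer.sin_addr.s_addr = inet_addr("10.0.0.2");      /* placeholder address */

    connect(s, (struct sockaddr *)&peer, sizeof(peer));
    send(s, "hello", 5, 0);

    closesocket(s);
    WSACleanup();
    return 0;
}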

O'Reilly & Associates recently released (January 2002) Windows 2000 Performance Guide.
Odysseas Pentakalos has been an independent consultant for 10 years in performance modeling and tuning of computer systems and in object-oriented design and development.

