Tuesday, July 7, 2015

Setting up a basic infiniband network

4. Setting up a basic infiniband network

This section describes how to set up a basic InfiniBand network and test its functionality.

4.1 Upgrade your Infiniband card and switch firmware

Before proceeding, make sure that the firmware in your switches and InfiniBand cards is at the latest release. Older firmware versions may cause interoperability and fabric stability issues. Do not assume that hardware fresh from the factory carries the latest firmware.
Follow your vendor's documentation for the firmware update procedure.

4.2 Physically Connect the network

Cable your hosts to your switches.

4.3 Choose a Subnet Manager

Each infiniband network requires a subnet manager. You can choose to run the OFED opensm subnet manager on one of the Linux clients, or you may choose to use an embedded subnet manager running on one of the switches in your fabric. Note that not all switches come with a subnet manager; check your switch documentation.

4.4 Load the kernel modules

InfiniBand kernel modules are not loaded automatically. You should add them to /etc/modules so that they are loaded at boot. You will need to include both the hardware-specific modules and the protocol modules.
/etc/modules:
# Hardware drivers
# Choose the appropriate modules from
# /lib/modules/<kernel-version>/updates/kernel/drivers/infiniband/hw
#
#mlx4_ib  # Mellanox ConnectX cards
#ib_mthca # some Mellanox cards
#iw_cxgb3 # Chelsio T3 cards
#iw_nes # NetEffect cards
#
# Protocol modules
# Common modules
rdma_ucm
ib_umad
ib_uverbs
# IP over IB
ib_ipoib
# SCSI over IB
ib_srp
# IB SDP protocol
ib_sdp
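
After a reboot it can be useful to confirm that the modules listed above were actually loaded. The following is a minimal sketch, not something shipped with OFED: missing_modules is a hypothetical helper that reads lsmod-style text on standard input and prints every requested module that does not appear in it.

```shell
# Hypothetical helper (not part of OFED): print each module named in "$@"
# that is absent from the lsmod-style listing read from standard input.
missing_modules() {
    # The first column of every line is a module name. The lsmod header
    # line contributes the word "Module", which never matches a real module.
    loaded=$(awk '{ print $1 }')
    for wanted in "$@"; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        printf '%s\n' "$loaded" | grep -qxF -- "$wanted" || printf '%s\n' "$wanted"
    done
}
```

It could be run as: lsmod | missing_modules rdma_ucm ib_umad ib_uverbs ib_ipoib. No output means every listed module is loaded.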

4.5 (optional) Start opensm

If you are going to use the opensm subnet manager, edit /etc/default/opensm and add the port GUIDs of the interfaces on which you wish to start opensm.
You can find the port GUIDs of your cards with the ibstat -p command:
# ibstat -p
0x0002c9030002fb05
0x0002c9030002fb06

/etc/default/opensm:
PORTS="0x0002c9030002fb05 0x0002c9030002fb06"

Note that if you want to start opensm on all ports, you can set PORTS="ALL" instead.
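
If you would rather not copy the GUIDs by hand, the PORTS line can be generated from the ibstat -p output. This is only a sketch: ports_line is a hypothetical helper name, and it assumes the one-GUID-per-line format shown above.

```shell
# Hypothetical helper: read one port GUID per line on standard input
# (the format printed by `ibstat -p`) and emit a PORTS="..." line
# in the form used by /etc/default/opensm above.
ports_line() {
    awk '
        NF { guids = guids sep $1; sep = " " }          # skip blank lines, join with spaces
        END { printf "PORTS=\"%s\"\n", guids }
    '
}
```

For example, ibstat -p | ports_line prints the line to paste into /etc/default/opensm. Review the result before using it, since it includes every port on the host.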
Start opensm:

# /etc/init.d/opensm start

If opensm has started correctly you should see SUBNET UP messages in the opensm logfile (/var/log/opensm.<PORTID>.log).

Mar 04 14:56:06 600685 [4580A960] 0x02 -> SUBNET UP
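
The log check can be scripted. The sketch below assumes only that the log contains the literal text SUBNET UP, as in the line above; last_subnet_up is a hypothetical helper name.

```shell
# Hypothetical helper: print the most recent "SUBNET UP" line from the
# opensm log file given as $1. Returns non-zero if there is no such line.
last_subnet_up() {
    line=$(grep 'SUBNET UP' "$1" | tail -n 1)
    [ -n "$line" ] && printf '%s\n' "$line"
}
```

Call it with the path of your own log file, that is /var/log/opensm.<PORTID>.log with the port ID filled in. A non-zero exit status means opensm has not yet reported the subnet as up.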

Note that you can start opensm on multiple nodes; one node will be the active subnet manager and the others will put themselves into standby.

4.6 Check network health

You can now check the status of the local IB link with the ibstat command. A connected port should report State "Active" and Physical state "LinkUp". The following output is from a dual-port card on which only port 1 is connected.

# ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Hardware version: a0
        Node GUID: 0x0002c9030002fb04
        System image GUID: 0x0002c9030002fb07
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030002fb05
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030002fb06
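
On hosts with several cards the full ibstat output gets long. As a rough sketch, the hypothetical port_states helper below condenses it to one line per port. It assumes the field labels shown in the output above ("State:" and "Physical state:") and has not been checked against other ibstat versions.

```shell
# Hypothetical helper: condense ibstat output (read from standard input)
# into one line per port: <CA name> <port number> <state> <physical state>.
port_states() {
    awk -v q="'" '
        # CA header line, e.g.  CA (quote)mlx4_0(quote)
        $1 == "CA" && index($2, q) == 1 { ca = $2; gsub(q, "", ca) }
        # Port header line, e.g.  Port 1:
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        # Logical link state, e.g.  State: Active
        $1 == "State:" { state = $2 }
        # Physical state follows the logical state, so print the record here.
        $1 == "Physical" && $2 == "state:" { print ca, port, state, $3 }
    '
}
```

Run it as ibstat | port_states. For the card above it would print "mlx4_0 1 Active LinkUp" and "mlx4_0 2 Down Polling".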

4.7 Check the extended network connectivity

Once the host is connected to the infiniband network you can check the health of all of the other network components with the ibhosts, ibswitches and iblinkinfo commands.
ibhosts displays all of the hosts visible on the network.

# ibhosts
Ca      : 0x0008f1040399d3d0 ports 2 "Voltaire HCA400Ex-D"
Ca      : 0x0008f1040399d370 ports 2 "Voltaire HCA400Ex-D"
Ca      : 0x0008f1040399d3fc ports 2 "Voltaire HCA400Ex-D"
Ca      : 0x0008f1040399d3f4 ports 2 "Voltaire HCA400Ex-D"
Ca      : 0x0002c9030002faf4 ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca      : 0x0002c9030002fc0c ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca      : 0x0002c9030002fc10 ports 2 "MT25408 ConnectX Mellanox Technologies"

ibswitches will display all of the switches in the network.
# ibswitches
Switch  : 0x0008f104004121fa ports 24 "ISR9024D-M Voltaire" enhanced port 0 lid 1 lmc 0

iblinkinfo will show the status and speed of all of the links in the network.
# iblinkinfo.pl
Switch 0x0008f104004121fa ISR9024D-M Voltaire:
      1    1[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       2    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    2[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      13    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    3[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       4    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    4[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      26    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    5[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      27    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    6[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      24    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    7[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      28    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    8[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      25    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1    9[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      31    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1   10[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      32    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1   11[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      33    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1   12[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      29    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
      1   13[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      30    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
          14[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
      1   15[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       3    1[  ] "Voltaire HCA400Ex-D" (  )
      1   16[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      10    1[  ] "Voltaire HCA400Ex-D" (  )
          17[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
          18[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
      1   19[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       7    2[  ] "Voltaire HCA400Ex-D" (  )
      1   20[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       6    2[  ] "Voltaire HCA400Ex-D" (  )
      1   21[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       5    2[  ] "Voltaire HCA400Ex-D" (  )
      1   22[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      21    1[  ] "Voltaire HCA400Ex-D" (  )
      1   23[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       9    2[  ] "Voltaire HCA400Ex-D" (  )
      1   24[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       8    1[  ] "Voltaire HCA400Ex-D" (  )
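
On a larger fabric it is easier to summarise this listing than to read it line by line. The sketch below counts links per state. It assumes the "==( width speed state / physical-state )==>" layout shown above, which may differ between iblinkinfo versions. link_state_counts is a hypothetical helper name.

```shell
# Hypothetical helper: count links per logical state (Active, Down, ...)
# in iblinkinfo output read from standard input.
link_state_counts() {
    awk '
        /==\(/ {
            s = $0
            sub(/^.*==\(/, "", s)        # drop everything up to and including "==("
            sub(/\)==>.*$/, "", s)       # drop ")==>" and the rest of the line
            split(s, part, "/")          # part[1]: width, speed, state
            n = split(part[1], w, " ")   # the state is the last word before "/"
            count[w[n]]++
        }
        END { for (st in count) print st, count[st] }
    ' | sort
}
```

Run it as iblinkinfo.pl | link_state_counts. An unexpected number of Down links usually points to a cabling problem or a powered-off host.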

4.8 Testing connectivity with ibping

ibping is an InfiniBand equivalent of the ICMP ping command. Choose a node on the fabric and run an ibping server:
# ibping -S

Choose another node on your network, and then ping the port GUID of the server (ibstat on the server will list the port GUID).

# ibping -G 0x0002c9030002fc1d
Pong from test.example.com (Lid 13): time 0.072 ms
Pong from test.example.com (Lid 13): time 0.043 ms
Pong from test.example.com (Lid 13): time 0.045 ms
Pong from test.example.com (Lid 13): time 0.045 ms
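
To compare nodes it helps to reduce the replies to a few numbers. The hypothetical ibping_summary helper below is a small sketch that assumes the "Pong from ... time <value> ms" format shown above and reports the reply count with the minimum, average and maximum round-trip times.

```shell
# Hypothetical helper: summarise ibping replies read from standard input.
# Assumes each reply line starts with "Pong" and ends with "<value> ms".
ibping_summary() {
    awk '
        $1 == "Pong" {
            t = $(NF - 1) + 0                  # second-to-last field is the time
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            sum += t
            n++
        }
        END {
            if (n > 0)
                printf "replies=%d min=%.3f avg=%.3f max=%.3f ms\n", n, min, sum / n, max
        }
    '
}
```

Pipe the client output into it, for example ibping -G 0x0002c9030002fc1d | ibping_summary. The summary is printed when ibping exits.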

4.9 Testing RDMA performance

You can test the latency and bandwidth of a link with the ib_rdma_lat and ib_rdma_bw commands.
To test the latency, start the server on a node:
# ib_rdma_lat
and then start a client on another node, giving it the hostname of the server:
# ib_rdma_lat hostname-of-server
   local address: LID 0x0d QPN 0x18004a PSN 0xca58c4 RKey 0xda002824 VAddr 0x00000000509001
  remote address: LID 0x02 QPN 0x7c004a PSN 0x4b4eba RKey 0x82002466 VAddr 0x00000000509001
Latency typical: 1.15193 usec
Latency best   : 1.13094 usec
Latency worst  : 5.48519 usec

To test the bandwidth, start the ib_rdma_bw server on a node:
# ib_rdma_bw
and then start a client on another node, giving it the hostname of the server:
# ib_rdma_bw hostname-of-server
855: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
855: Local address:  LID 0x0d, QPN 0x1c004a, PSN 0xbf60dd RKey 0xde002824 VAddr 0x002aea4092b000
855: Remote address: LID 0x02, QPN 0x004a, PSN 0xaad03c, RKey 0x86002466 VAddr 0x002b8a4e191000


855: Bandwidth peak (#0 to #955): 1486.85 MB/sec
855: Bandwidth average: 1486.47 MB/sec
855: Service Demand peak (#0 to #955): 1970 cycles/KB
855: Service Demand Avg  : 1971 cycles/KB
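
The benchmark reports MB/sec, while link rates are quoted in Gbit/s. The sketch below converts the average so that the two can be compared. It assumes that 1 MB means 10^6 bytes, which may not match how your perftest version counts, so treat the result as approximate. avg_bandwidth_gbit is a hypothetical helper name.

```shell
# Hypothetical helper: read ib_rdma_bw output on standard input and print
# the "Bandwidth average" figure converted to Gbit/s.
# Assumption: 1 MB = 10^6 bytes (not verified against perftest).
avg_bandwidth_gbit() {
    awk '
        /Bandwidth average:/ {
            mb_per_sec = $(NF - 1)                        # value before "MB/sec"
            printf "%.2f Gbit/s\n", mb_per_sec * 8 / 1000
        }
    '
}
```

Under that assumption, the 1486.47 MB/sec average above corresponds to roughly 11.89 Gbit/s on a link that ibstat reports with Rate 20.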

The perftest package contains a number of other similar benchmarking programs to test various aspects of your network.
