This chapter describes how to monitor and control the InfiniBand fabric.
It contains the following topics:
14.1 Monitoring the InfiniBand Fabric
This section contains the following topics:
14.1.1 Identifying All Switches in the Fabric
You can use the ibswitches command to identify the Sun Network QDR InfiniBand Gateway Switches in the InfiniBand fabric in your Exalogic machine. This command displays the Global Unique Identifier (GUID), name, Local Identifier (LID), and LID mask control (LMC) for each switch. The output of the command is a mapping of GUID to LID for switches in the fabric.
On any command-line interface (CLI), run the following command:
# ibswitches
The output is displayed, as in the following example:
Switch : 0x0021283a8389a0a0 ports 36 "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
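The GUID-to-LID mapping can also be extracted in a script. The following is a minimal sketch, assuming one switch per line in the format shown in the example. The function name switch_guid_to_lid is hypothetical and is not part of the InfiniBand tools.

```shell
# Hypothetical helper: reads ibswitches output on standard input and
# prints "GUID LID" for each switch line.
switch_guid_to_lid() {
    awk '$1 == "Switch" {
        guid = $3                       # fields are: Switch, ":", GUID, ...
        lid = ""
        for (i = 4; i < NF; i++)        # keep the last "lid <n>" pair on the line
            if ($i == "lid") lid = $(i + 1)
        print guid, lid
    }'
}
```

A possible invocation is `ibswitches | switch_guid_to_lid`, which prints one GUID and LID pair per switch.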
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.2 Identifying All HCAs in the Fabric
You can use the ibhosts command to display identity information about the host channel adapters (HCAs) in the InfiniBand fabric in a subnet. This command displays the GUID and name for each HCA.
On the command-line interface (CLI), run the following command:
# ibhosts
The output is displayed, as in the following example:
Ca : 0x0003ba000100e388 ports 2 "nsn33-43 HCA-1"
Ca : 0x5080020000911310 ports 1 "nsn32-20 HCA-1"
Ca : 0x50800200008e532c ports 1 "ib-71 HCA-1"
Ca : 0x50800200008e5328 ports 1 "ib-70 HCA-1"
Ca : 0x50800200008296a4 ports 2 "ib-90 HCA-1"
.
.
.
#
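To feed the HCA list into other scripts, the GUID and name can be separated from the rest of each line. This is a sketch only, assuming the one-HCA-per-line format shown in the example; hca_names is a hypothetical function name.

```shell
# Hypothetical helper: reads ibhosts output on standard input and prints
# "GUID<TAB>name" for each HCA line.
hca_names() {
    awk -F'"' '$1 ~ /^Ca/ {
        split($1, head, " ")        # head[1]="Ca", head[2]=":", head[3]=GUID
        print head[3] "\t" $2       # $2 is the text between the double quotes
    }'
}
```

For example, `ibhosts | hca_names` would list each HCA GUID next to its description.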
Note:
The output in the example is just a portion of the full output and varies for each InfiniBand topology.
14.1.3 Displaying the InfiniBand Fabric Topology
To understand the routing within your InfiniBand fabric, use the ibnetdiscover command, which displays the node-to-node connectivity. The amount of output depends on the size of your fabric. You can also use this command to display the LIDs of HCAs.
On the command-line interface (CLI), enter the following command:
# ibnetdiscover
The output is displayed, as in the following example:
# Topology file: generated on Sat Apr 13 22:28:55 2002
#
# Max of 1 hops discovered
# Initiated from node 0021283a8389a0a0 port 0021283a8389a0a0
vendid=0x2c9
devid=0xbd36
sysimgguid=0x21283a8389a0a3
switchguid=0x21283a8389a0a0(21283a8389a0a0)
Switch 36 "S-0021283a8389a0a0" # "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
[23] "H-0003ba000100e388"[2](3ba000100e38a) # "nsn33-43 HCA-1" lid 14 4xQDR
vendid=0x2c9
devid=0x673c
sysimgguid=0x3ba000100e38b
caguid=0x3ba000100e388
Ca 2 "H-0003ba000100e388" # "nsn33-43 HCA-1"
[2](3ba000100e38a) "S-0021283a8389a0a0"[23] # lid 14 lmc 0 "Sun DCS 36 QDR switch localhost" lid 15 4xQDR
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.4 Displaying a Route Through the Fabric
You sometimes need to know the route between two nodes in the InfiniBand fabric. The ibtracert command can provide that information by displaying the GUIDs, ports, and LIDs of the nodes.
On the command-line interface (CLI), run the following command:
# ibtracert slid dlid
where slid is the LID of the source node and dlid is the LID of the destination node in the fabric.
The output is displayed, as in the following example:
# ibtracert 15 14
#
From switch {0x0021283a8389a0a0} portnum 0 lid 15-15 "Sun DCS 36 QDR switch localhost"
[23] -> ca port {0x0003ba000100e38a}[2] lid 14-14 "nsn33-43 HCA-1"
To ca {0x0003ba000100e388} portnum 2 lid 14-14 "nsn33-43 HCA-1"
#
For this example:
The route starts at the switch with GUID 0x0021283a8389a0a0, using port 0. The switch is LID 15, and its description is Sun DCS 36 QDR switch localhost. The route exits the switch at port 23 and enters the HCA at port 2, whose port GUID is 0x0003ba000100e38a. The HCA is LID 14.
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.5 Displaying the Link Status of a Node
If you want to know the link status of a node in the InfiniBand fabric, run the ibportstate command to display the state, width, and speed of that node.
On the command-line interface (CLI), run the following command:
# ibportstate lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
The output is displayed, as in the following example:
# ibportstate 15 23
PortInfo:
# Port info: Lid 15 port 23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 15 DR path slid 15; dlid 65535; 0,23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
#
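When many ports must be checked, the fields of interest can be read out of the ibportstate output by a script. The following sketch assumes the "Name:....value" layout shown in the example, with one field per line. The function names ib_field and link_is_healthy are hypothetical.

```shell
# Hypothetical helpers; neither name is part of the InfiniBand tools.

# Print the value of the first "Name:.....value" line for the given field.
# Only the first match is used, so the local port is reported rather than
# the "Peer PortInfo" block that follows it.
ib_field() {
    awk -v name="$1" '
        !found && index($0, name ":") == 1 {
            value = substr($0, length(name) + 2)   # text after "Name:"
            sub(/^\.+/, "", value)                 # drop the dot leaders
            print value
            found = 1
        }'
}

# Succeed only if the port reports LinkState Active and PhysLinkState LinkUp.
# Reads ibportstate output on standard input.
link_is_healthy() {
    _state=$(cat)
    [ "$(printf '%s\n' "$_state" | ib_field LinkState)" = "Active" ] &&
        [ "$(printf '%s\n' "$_state" | ib_field PhysLinkState)" = "LinkUp" ]
}
```

For example, `ibportstate 15 23 | ib_field LinkSpeedActive` would print the active speed, and `ibportstate 15 23 | link_is_healthy` could guard a follow-up action in a script.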
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.6 Displaying Counters for a Node
To help ascertain the health of a node in the fabric, use the perfquery command to display the performance, error, and data counters for that node.
On the command-line interface (CLI), enter the following command:
# perfquery lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
Note:
If a port value of 255 is specified for a switch node, the counters are the total for all switch ports.
For example:
# perfquery 15 23
#
# Port counters: Lid 15 port 23
PortSelect:......................23
CounterSelect:...................0x1b01
SymbolErrors:....................0
.
.
.
VL15Dropped:.....................0
XmtData:.........................20232
RcvData:.........................20232
XmtPkts:.........................281
RcvPkts:.........................281
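Because most counters are normally zero, it can help to filter the perfquery output down to the counters that have incremented. The sketch below assumes the "Name:....value" layout shown in the example and treats every counter as an error counter except the selector and data counters named in that example. The function name nonzero_error_counters is hypothetical.

```shell
# Hypothetical helper: reads perfquery output on standard input and prints
# "name value" for each counter with a nonzero decimal value, skipping the
# selector fields and the data counters.
nonzero_error_counters() {
    awk -F':' '
        /^#/ { next }                       # skip prompt and header lines
        NF >= 2 {
            name = $1
            value = $NF
            gsub(/\./, "", value)           # strip the dot leaders
            if (name ~ /^(PortSelect|CounterSelect|XmtData|RcvData|XmtPkts|RcvPkts)$/)
                next
            if (value ~ /^[0-9]+$/ && value + 0 > 0)
                print name, value
        }'
}
```

For example, `perfquery 15 23 | nonzero_error_counters` prints nothing when no error counter has incremented.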
Note:
The output in the example is just a portion of the full output.
14.1.7 Displaying Data Counters for a Node
To list the data counters for a node in the fabric, use the ibdatacounts command.
On the command-line interface (CLI), enter the following command:
# ibdatacounts lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
For example:
# ibdatacounts 15 23
#
XmtData:.........................6048
RcvData:.........................6048
XmtPkts:.........................84
RcvPkts:.........................84
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.8 Displaying Low-Level Detailed Information for a Node
If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a node in the fabric.
On the command-line interface (CLI), enter the following command:
# smpquery switchinfo lid
where lid is the LID of the node in the fabric.
For example:
# smpquery switchinfo 15
#
# Switch info: Lid 15
LinearFdbCap:....................49152
RandomFdbCap:....................0
McastFdbCap:.....................4096
LinearFdbTop:....................16
DefPort:.........................0
DefMcastPrimPort:................255
DefMcastNotPrimPort:.............255
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................1
#
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
14.1.9 Displaying Low-Level Detailed Information for a Port
If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a port.
On the command-line interface (CLI), enter the following command:
# smpquery portinfo lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
For example:
# smpquery portinfo 15 23
#
Mkey:............................0x0000000000000000
GidPrefix:.......................0x0000000000000000
Lid:.............................0x0000
SMLid:...........................0x0000
CapMask:.........................0x0
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................0
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
.
.
.
SubnetTimeout:...................0
RespTimeVal:.....................0
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................85
RoundTrip:.......................16777215
#
Note:
The actual output for your InfiniBand fabric will differ from that in the example, which shows only a portion of the full output.
14.1.10 Mapping LIDs to GUIDs
In the InfiniBand fabric in Exalogic machines, as the administrator of the Subnet Manager and the subnet, you might want to assign subnet-specific LIDs to nodes in the fabric. Many InfiniBand commands require a LID to address a particular InfiniBand device.
Conversely, the output of a command might identify InfiniBand devices only by their LID. A file that maps node LIDs to node GUIDs can therefore help you administer your InfiniBand fabric.
Note:
Creation of the mapping file is not a requirement for InfiniBand administration.
The following procedure creates a file that lists the LID in hexadecimal, the GUID in hexadecimal, and the node description:
- Create an inventory file:
# osmtest -f c -i inventory.txt
The inventory.txt file can also be used for purposes other than this procedure.
- Create a mapping file:
# cat inventory.txt | grep -e '^lid' -e 'port_guid' -e 'desc' | sed 's/^lid/\nlid/' > mapping.txt
- Edit the latter half of the mapping.txt file to remove the nonessential information. The content of the mapping.txt file looks similar to the following:
lid 0x14
port_guid 0x0021283a8620b0a0
# node_desc Sun DCS 72 QDR switch 1.2(LC)
lid 0x15
port_guid 0x0021283a8620b0b0
# node_desc Sun DCS 72 QDR switch 1.2(LC)
lid 0x16
port_guid 0x0021283a8620b0c0
# node_desc Sun DCS 72 QDR switch 1.2(LC)
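The mapping file records LIDs in hexadecimal, while the command examples in this chapter pass LIDs in decimal (for example, 15 and 14). The following sketch converts each record to a single line with a decimal LID. It assumes that each of the lid, port_guid, and node_desc entries sits on its own line of the mapping file; the function name lid_map_decimal is hypothetical.

```shell
# Hypothetical helper: reads mapping-file records on standard input and
# prints "decimal-LID GUID description", one line per node.
lid_map_decimal() {
    while read -r key value rest; do
        case $key in
            lid)       lid=$(printf '%d' "$value") ;;   # 0x14 becomes 20
            port_guid) guid=$value ;;
            '#')
                # The description line closes the record.
                if [ "$value" = "node_desc" ]; then
                    printf '%s %s %s\n' "$lid" "$guid" "$rest"
                fi
                ;;
        esac
    done
}
```

For example, `lid_map_decimal < mapping.txt` would turn the first record of the sample content into a line that starts with 20, the decimal form of 0x14.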
Note:
The output in the example is just a portion of the entire file.
14.1.11 Performing Comprehensive Diagnostics for the Entire Fabric
If you require a full test of your InfiniBand fabric, you can use the ibdiagnet command to perform many tests with verbose results. The command is a useful tool for determining the overall health of the InfiniBand fabric.
On the command-line interface (CLI), run the following command:
# ibdiagnet -v -r
The ibdiagnet.log file contains the log of the testing.
14.1.12 Performing Comprehensive Diagnostics for a Route
You can use the ibdiagpath command to perform some of the same comprehensive tests for a particular route.
On the command-line interface (CLI), run the following command:
# ibdiagpath -v -l slid dlid
where slid is the LID of the source node in the fabric, and dlid is the LID of the destination node.
The ibdiagpath.log file contains the log of the testing.
14.1.13 Determining Changes to the InfiniBand Topology
If your fabric has a number of suspect nodes, the osmtest command enables you to take a snapshot (inventory file) of your fabric and, at a later time, compare that file to the present conditions.
Note:
Although this procedure is most useful after initializing the Subnet Manager, it can be performed at any time.
Complete the following steps:
- Ensure that Subnet Manager is initiated.
- On the command-line interface (CLI), run the following command to take a snapshot of the topology:
# osmtest -f c
For example:
# osmtest -f c
Command Line Arguments
Done with args
Flow = Create Inventory
Aug 13 19:44:53 601222 [B7D466C0] 0x7f -> Setting log level to: 0x03
Aug 13 19:44:53 601969 [B7D466C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
using default guid 0x21283a8620b0f0
Aug 13 19:44:53 612312 [B7D466C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
Aug 13 19:44:53 636876 [B7D466C0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x2602
cap_mask2:0x0
resp_time_val:0x10
-----------------------------
OSMTEST: TEST "Create Inventory" PASS
#
- After an event, compare the present topology to that saved in the inventory file, as in the following example:
# osmtest -f v
Command Line Arguments
Done with args
Flow = Validate Inventory
Aug 13 19:45:02 342143 [B7EF96C0] 0x7f -> Setting log level to: 0x03
Aug 13 19:45:02 342857 [B7EF96C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
using default guid 0x21283a8620b0f0
Aug 13 19:45:02 351555 [B7EF96C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
Aug 13 19:45:02 375997 [B7EF96C0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x2602
cap_mask2:0x0
resp_time_val:0x10
-----------------------------
Aug 13 19:45:02 378991 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0a0, LID 0x14
Aug 13 19:45:02 379172 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0b0, LID 0x15
.
.
.
Aug 13 19:45:02 480201 [B7EF96C0] 0x01 -> osmtest_validate_single_path_rec_guid_pair: Checking src 0x0021283a8620b0f0 to dest 0x0021283a8620b0f0
Aug 13 19:45:02 480588 [B7EF96C0] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x19 to DLID 0x19
Aug 13 19:45:02 480989 [B7EF96C0] 0x02 -> osmtest_run:
***************** ALL TESTS PASS *****************
OSMTEST: TEST "Validate Inventory" PASS
#
Note:
Depending on the size of your InfiniBand fabric, the output from the osmtest command could be tens of thousands of lines long.
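With output that long, it is usually enough for a script to look only at the final result line. The following sketch relies on the OSMTEST: TEST "..." PASS line shown in the examples; any other result wording is treated as a failure. The function name osmtest_passed is hypothetical.

```shell
# Hypothetical helper: reads osmtest output on standard input and succeeds
# only if the last "OSMTEST: TEST" line ends in PASS.
osmtest_passed() {
    awk '
        /^OSMTEST: TEST / { last = $0 }     # remember the most recent result line
        END {
            if (last ~ /PASS[ \t]*$/) exit 0
            exit 1                          # no result line, or not a PASS
        }'
}
```

For example, `osmtest -f v | osmtest_passed || echo "topology changed or validation failed"` would print a message only when the validation does not pass.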
14.1.14 Determining Which Links Are Experiencing Significant Errors
You can use the ibdiagnet command to inject test packets into the links and determine which links are experiencing symbol errors and recovery errors.
On the command-line interface (CLI), run the following command:
# ibdiagnet -c 100 -P all=1
In this instance of the ibdiagnet command, 100 test packets are injected into each link, and the -P all=1 option returns all counters that increment during the test.
In the output of the ibdiagnet command, search for the symbol_error_counter string. That line contains the symbol error count in hexadecimal. The preceding lines identify the node and port with the errors. Symbol errors are minor errors, and if there are relatively few during the diagnostic, they can be monitored.
Note:
Based on the bit error rate (BER) of 10^-12 in the InfiniBand specification, the maximum allowable symbol error rate is 120 errors per hour.
In addition, in the output of the ibdiagnet command, search for the link_error_recovery_counter string. That line contains the recovery error count in hexadecimal. The preceding lines identify the node and port with the errors. Recovery errors are major errors, and the respective links must be investigated for the cause of the rapid symbol error propagation.
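Because the counters are reported in hexadecimal, comparing a symbol error count with the limit of 120 errors per hour takes a small conversion. The following sketch does that arithmetic for a count observed over a known number of seconds. The function name symbol_errors_within_limit is hypothetical, and scaling a short observation up to an hourly rate is a simplifying assumption.

```shell
# Hypothetical helper: succeeds if a hexadecimal symbol error count, observed
# over the given number of seconds, stays within 120 errors per hour.
symbol_errors_within_limit() {
    # $1 = symbol error count in hexadecimal, for example 0x1f
    # $2 = observation time in seconds
    count=$(printf '%d' "$1")               # convert hexadecimal to decimal
    limit_per_hour=120
    # count / seconds <= 120 / 3600, rewritten without division
    [ $((count * 3600)) -le $((limit_per_hour * $2)) ]
}
```

For example, `symbol_errors_within_limit 0x78 3600` succeeds because 0x78 is 120 errors in one hour, while `symbol_errors_within_limit 0x79 3600` fails.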
Additionally, the ibdiagnet.log file contains the log of the testing.
14.1.15 Checking All Ports
To perform a quick check of all ports of all nodes in your InfiniBand fabric, you can use the ibcheckstate command.
On the command-line interface (CLI), run the following command:
# ibcheckstate -v
The output is displayed, as in the following example:
# Checking Switch: nodeguid 0x0021283a8389a0a0
Node check lid 15: OK
Port check lid 15 port 23: OK
Port check lid 15 port 19: OK
.
.
.
# Checking Ca: nodeguid 0x0003ba000100e388
Node check lid 14: OK
Port check lid 14 port 2: OK
## Summary: 5 nodes checked, 0 bad nodes found
## 10 ports checked, 0 ports with bad state found
#
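For unattended checks, only the two summary lines matter. The following sketch adds up the bad node and bad port counts, assuming the summary wording shown in the example. The function name bad_state_count is hypothetical.

```shell
# Hypothetical helper: reads ibcheckstate output on standard input and prints
# the total number of bad nodes plus ports with bad state from the summary.
bad_state_count() {
    awk '/^##/ {
        for (i = 1; i < NF; i++) {
            # "<n> bad nodes found"
            if ($(i + 1) == "bad" && $(i + 2) == "nodes") total += $i
            # "<n> ports with bad state found"
            if ($(i + 1) == "ports" && $(i + 2) == "with") total += $i
        }
    }
    END { print total + 0 }'
}
```

For example, `ibcheckstate | bad_state_count` prints 0 for a fabric in the state shown in the example.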
Note:
The ibcheckstate command requires time to complete, depending upon the size of your InfiniBand fabric. Without the -v option, the output contains only failed ports. The output in the example is only a small portion of the actual output.
14.2 Controlling the InfiniBand Fabric
This section contains the following topics:
14.2.1 Clearing Error Counters
If you are troubleshooting a port, the perfquery command provides counters of errors occurring at that port. To determine whether the problem has been resolved, you can reset all of the error counters to 0 with the ibclearerrors command and then check whether they increment again.
On the command-line interface (CLI), run the following command:
# ibclearerrors
The output is displayed, as in the following example:
## Summary: 5 nodes cleared 0 errors
#
14.2.2 Clearing Data Counters
When you are optimizing the InfiniBand fabric for performance, you might want to know how the throughput increases or decreases according to changes you are making to the fabric and Subnet Manager. The ibclearcounters command enables you to reset the data counters for all ports to 0.
On the command-line interface (CLI), run the following command:
# ibclearcounters
The output is displayed, as in the following example:
## Summary: 5 nodes cleared 0 errors
#
14.2.3 Resetting a Port
You might need to reset a port to determine its functionality.
On the command-line interface (CLI), run the following command:
# ibportstate lid port reset
where lid is the LID of the node in the fabric, and port is the port of the node.
For example:
# ibportstate 15 23 reset
Initial PortInfo:
# Port info: Lid 15 port 23
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
After PortInfo set:
# Port info: Lid 15 port 23
LinkState:.......................Down
PhysLinkState:...................Disabled
After PortInfo set:
# Port info: Lid 15 port 23
LinkState:.......................Down
PhysLinkState:...................PortConfigurationTraining
#
14.2.4 Setting Port Speed
You can manually set the speed of a single port to help determine the cause of symbol errors. The ibportstate command can set the speed to 2.5, 5.0, or 10.0 Gbps.
On the command-line interface (CLI), run the following command:
# ibportstate lid port speed <value>
where lid is the LID of the node in the fabric, port is the port of the node, and <value> is the speed you want to set.
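Note:
The speed value is a bit mask: 1 enables 2.5 Gbps, 2 enables 5.0 Gbps, and 4 enables 10.0 Gbps. Adding speed values enables each of the corresponding speeds. For example, speed 7 enables 2.5, 5.0, and 10.0 Gbps.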
For example:
# ibportstate 15 23 speed 1
Initial PortInfo:
# Port info: Lid 15 port 23
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
After PortInfo set:
# Port info: Lid 15 port 23
LinkSpeedEnabled:................2.5 Gbps
# ibportstate 15 23 speed 7
Initial PortInfo:
# Port info: Lid 15 port 23
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
After PortInfo set:
# Port info: Lid 15 port 23
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
#
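The speed value passed to ibportstate can be worked out by adding one number per speed. The example above confirms that 1 means 2.5 Gbps and 7 means all three speeds; the sketch below additionally assumes the usual bit assignment of 2 for 5.0 Gbps and 4 for 10.0 Gbps. The function name speed_mask is hypothetical.

```shell
# Hypothetical helper: prints the speed value for the listed speeds in Gbps.
# Assumed encoding: 2.5 -> 1, 5.0 -> 2, 10.0 -> 4, combined with bitwise OR.
speed_mask() {
    mask=0
    for speed in "$@"; do
        case $speed in
            2.5)  mask=$((mask | 1)) ;;
            5.0)  mask=$((mask | 2)) ;;
            10.0) mask=$((mask | 4)) ;;
            *)    echo "unknown speed: $speed" >&2; return 1 ;;
        esac
    done
    echo "$mask"
}
```

For example, `speed_mask 2.5 5.0 10.0` prints 7, matching the second command in the example, and `speed_mask 2.5` prints 1, matching the first.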
14.2.5 Disabling a Port
If a port is found to be problematic because of a bad cable connection or physical damage to the connectors, you can disable the port.
On the command-line interface (CLI), run the following command:
# disableswitchport [--reason=reason] connector|ibdev port
where reason is the reason for disabling the port (Blacklist or Partition), connector is the number of the QSFP connector (0A–15B), ibdev is the InfiniBand device name (Switch, Bridge-0-0, Bridge-0-1, Bridge-1-0, Bridge-1-1), and port is the number of the port (1–36).
This hardware command disables a QSFP connector and port on the switch chip or a port on the BridgeX chips. The command addresses either the connector or the port on the switch chip or the BridgeX port.
The --reason option enables you to use a passphrase to lock the state of the port:
- Blacklist: A connector and port pair are identified as being inaccessible because of unreliable operation.
- Partition: A connector and port pair are identified as being isolated from the InfiniBand fabric.
Both the Blacklist and Partition passphrases survive reboot. You unlock these passphrases using the enableswitchport command with the --reason option.
Note:
State changes made with the ibportstate command are not recognized by the disableswitchport, enableswitchport, or listlinkup commands.
The following example shows how to disable and blacklist connector 14A with the disableswitchport command:
# disableswitchport --reason=Blacklist 14A
Disable Switch port 7 reason: Blacklist
Initial PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 7
LinkState:.......................Down
PhysLinkState:...................Polling
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
After PortInfo set:
# Port info: DR path slid 65535; dlid 65535; 0 port 7
LinkState:.......................Down
PhysLinkState:...................Disabled
#
Note:
After fixing the cable connection or any connector problems, you should enable the port.
14.2.6 Enabling a Port
After fixing any connection or connector problem with a port, you should enable the port with the enableswitchport command.
On the command-line interface (CLI), run the following command:
# enableswitchport [--reason=reason] connector|ibdev port
where reason is the passphrase with which the port was disabled (Blacklist or Partition), connector is the number of the QSFP connector (0A–15B), ibdev is the InfiniBand device name (Switch, Bridge-0-0, Bridge-0-1, Bridge-1-0, Bridge-1-1), and port is the number of the port (1–36).
For example:
# enableswitchport --reason=Blacklist 14A
Enable Switch port 7
Initial PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 7
LinkState:.......................Down
PhysLinkState:...................Disabled
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
After PortInfo set:
# Port info: DR path slid 65535; dlid 65535; 0 port 7
LinkState:.......................Down
PhysLinkState:...................Polling
#
14.3 For More Information
For more information about Sun Network QDR InfiniBand Gateway Switches, see the product documentation at the following URL: