Troubleshooting

This document includes troubleshooting steps and addresses frequently asked questions related to Terragraph deployments.

Basic Network Debugging

This section describes the workflow for debugging general network issues.

Checking controller status

Verify that the e2e_controller service is active and running without any critical errors.

For Docker Swarm installations, use the docker service ls and docker service logs commands to show the service state and logs, respectively.
For legacy systemd installations, use the following commands:

# Print the systemd service status
$ systemctl status e2e_controller

# Stream the controller logs
$ journalctl -u e2e_controller -f

If something appears wrong, verify the controller configuration (see Deployment and Installation).

Checking network status

Run the following commands on the controller to find the current status of all nodes and links in the network:

# Print topology information, including node/link status
$ tg2 topology ls

# Print node IPv6 addresses and software versions
$ tg2 controller status

In a healthy network, the status of all DNs should be ONLINE_INITITATOR, and all CNs should be ONLINE. There should be a status report for every node in the network.

All nodes are offline

If all nodes are offline, including PoP nodes, check the routing from the PoP nodes to the controller.

Check that the PoP node has been configured correctly (see Deployment and Installation).
Verify that firewall rules are not blocking communication. Disabling firewalld is recommended to ensure no restrictions are in place. Otherwise, see Deployment and Installation for a list of ports used by the cloud services.

$ systemctl stop firewalld
$ systemctl disable firewalld

# If NMS is installed on the same host, restart Docker as well to re-add its iptables rules
$ systemctl restart docker

PoP nodes are online, and remaining nodes/links are offline

Check the status of a PoP node with the tg2 topology ls command:

ONLINE - The PoP node is unable to ignite the network due to GPS issues.
- Check the GPS lock on each PoP node using the command below. There should be a 3D fix and ideally 16 GPS satellites locked to each Terragraph device.

$ tg2 stats driver-if | grep -e "tgd.gpsStat.fixType" -e "tgd.gpsStat.fixNumSat"

ONLINE_INITIATOR - There is likely a routing issue from all other nodes.
- Check the route on the gateway or controller towards the Terragraph network for the Terragraph prefix.
- Check that BGP is working. Use the command below to print the default routes on the PoP node; there should be a default route towards nic2.

$ ip -6 route
default via fe80::66d1:54ff:feeb:a863 dev nic2 proto zebra metric 1024 pref medium

A specific node is offline

If the controller sees a node as OFFLINE, but it can be reached in-band, check the following items:

Verify that the node ID (MAC address) in the topology file matches the one on the node. This can be printed on the node using the following command:

$ get_hw_info NODE_ID

Verify that the controller URL on the node is correct:

$ cat /tmp/mynetworkinfo | grep e2eCtrlUrl
  "e2eCtrlUrl": "tcp://[2001::1]:7007",

Verify that the minion is running and attempting to report to the controller:

$ tail -f /var/log/e2e_minion/current
I0419 13:12:50.506603  3202 StatusApp.cpp:744] Reporting status to controller

Try to ping the controller from node. If the controller is pingable, check if TCP port 7007 is open (as described in an earlier section).
Verify that the default route (::/0) in the routing table points to the PoP node. Refer to the "Routing" section for additional troubleshooting steps.

$ breeze fib list
...
> ::/0
via fe80::250:43ff:fe59:a9c5@nic2

A specific link is offline

If a link is down and is not being ignited, check the following items:

Verify that "auto-ignition" on the controller is not disabled for the link or the entire network.

$ tg ignition state
14:19:23 [INFO]: IgnitionState(
    ...
    igParams=IgnitionParams(
        enable=True,
        ...
        linkAutoIgnite={}))

If there are continuously increasing ignition attempts (as seen in tg2 topology ls), check the failure reason or look for any error messages in the minion or firmware logs. Refer to the Firmware section for additional troubleshooting steps.

# See LinkDownCause in DriverMessage.thrift
$ tg2 stats driver-if | grep "linkDown.cause"

Terragraph unit is unresponsive or not powered

Verify that the unit has power.
- Unscrew the power cable gland.
- Verify that the power connector for the primary sector is securely inserted.
- Use a digital multimeter to verify that the Terragraph power terminal block has the correct DC voltage and polarity.
Power cycle the unit. Remove and reseat the power connector, or disconnect power to the power supply, then reconnect it.
If the problem persists, replace the unit.

Fiber is connected to Terragraph, but remote switch/router indicates no link integrity

Verify that the device at the other end of the fiber is powered on and configured correctly.
Verify that the fiber settings at both sides of the connection match, and are set as follows:
- Fiber type: Single or Multi-mode
- SFPs: Single or Multi-mode
- Speed: 1Gb or 10Gb
Verify that the fiber cable polarity is plugged in correctly at both ends.
- Near side SFP - RX/TX connects to far side SFP - TX/RX.
Using a fiber cable tester, verify that the fiber running to the Terragraph unit has received light power.
Plug the fiber cable into a known good SFP module.
Plug the known good SFP module into the Terragraph unit.
If the problem persists, replace the unit.

Ethernet is connected to Terragraph, but Ethernet device indicates no link integrity

Verify that the Ethernet device is powered on and configured correctly.
If the Ethernet device is powered by PoE, verify that it is connected to the unit.
Test the Ethernet cable from the Terragraph unit on a known good laptop.
Remove the Ethernet cable from the Terragraph unit and use a cable tester to verify the cable integrity.
If the problem persists, replace the unit.

Basic Node Debugging

This section lists steps for triaging node-level issues, mainly useful during development or initial hardware bring-up.

First Steps

The commands below provide a starting point for debugging and are not meant to be comprehensive.

Check the node configuration. Look for any modifications in the node configuration file and understand why they are there.

$ diff_node_config
$ cat /data/cfg/node_config.json

Check software/firmware versions. Are there matching or compatible versions running on all nodes?

$ tg2 version          # show versions for several components
$ cat /etc/tgversion   # show Terragraph build version
$ get_fw_version       # show firmware version number

Verify hardware information. In particular, check that MAC addresses are as expected (radio MAC addresses may depend on the node configuration field envParams.VPP_USE_EEPROM_MACS).

$ cat /tmp/node_info

Look for core dumps. If anything crashed, understand why.

$ ls -al /var/volatile/cores/
$ ls /data/kernel_crashes/vmcore.*

Use CLIs. Use CLI commands to check the state of important services.

# -- e2e_minion --
$ tg2 minion status
$ tg2 minion links
$ tg2 stats driver-if
$ tg2 stats system --dump
$ tg2 tech-support
# -- openr --
$ breeze lm links
$ breeze kvstore adj
$ breeze fib list
$ breeze tech-support
# -- vpp --
$ vppctl show int
$ vppctl show int addr
$ vppctl show ip6 fib
# -- exabgp --
$ exabgpcli show neighbor summary
$ exabgpcli show adj-rib in
$ exabgpcli show adj-rib out

Check service logs. Make sure important services are running properly. See Logs for descriptions of log files.

$ tail -F /var/log/e2e_minion/current
$ tail -F /var/log/vpp/current
$ tail -F /var/log/vpp/vnet.log
...

Check kernel logs. Terragraph driver logs can be seen in syslog files or dmesg.

$ dmesg -w
$ tail -F /var/log/kern.log

Check firmware logs. If possible, enable firmware logs using the node configuration field envParams.FW_LOGGING_ENABLED. See Firmware for troubleshooting steps and instructions on configuring the log verbosity.

$ tail -f /var/log/wil6210/wil6210_*_fw_*.txt

Manual Link Bring-Up

To establish a single Terragraph link manually using only the E2E minion, follow the short list of steps below.

Generate a new node configuration file using default values.

$ mv /data/cfg/node_config.json /data/cfg/node_config.json.bak
$ config_get_base /data/cfg/node_config.json

Only if a GPS fix is unavailable (or there is no GPS module present), disable the GPS sync check in firmware on the initiator node.

$ config_set -i radioParamsBase.fwParams.forceGpsDisable 1

Reboot the node.

$ reboot

Wait for the E2E minion to initialize and then show the radio MAC addresses.

$ tg2 minion status

Enable GPS sync.

$ tg2 minion gps_enable

Set the wireless channel (e.g. channel 2).

$ tg2 minion set_params -c 2

Associate the link from the initiator node using radio MAC addresses above.

$ tg2 minion assoc -i <initiator_mac> -m <responder_mac> -n dn

At this point, verify that all expected components are working:

tg2 minion links shows the link established at the driver/firmware layer.
breeze lm links shows the link established at the routing layer.
ping6 ff02::1%terraX (find "X" from output above) shows "DUP" responses confirming link-local connectivity.
vppctl show int shows incrementing "vpp-terraX" interface counters.

Routing

Triage

Run the command below to print debug information for an engineer to examine.

$ breeze tech-support

Background

Every link-state routing protocol incorporates the following steps and components:

Link discovery - Learn the underlying system's link information, including name, status, and addresses.
Neighbor discovery - Discover neighbors on the VLAN of each link.
Link-state database - Record the neighbor and prefix information for each node.
Route computation - Use the global link-state database to compute a route to each destination.
Route programming - Program routes onto the underlying system.

Routing can fail if a problem occurs in any of these stages. It is always recommended to troubleshoot issues serially.

Basic Checks

Verify that Open/R is running and not crashing. To do so, run the following command several times and check that the process ID has not changed:

$ ps -A | grep openr

If applicable, do the same for the fib_vpp process.

Link Discovery

Use the following command to list all links known to Open/R's LinkMonitor module:

$ breeze lm links

== Node Overload: NO  ==

Interface  Status  Overloaded  Metric Override  ifIndex  Addresses
---------  ------  ----------  ---------------  -------  ------------------------
nic1       Up                                         3  fe80::250:43ff:fe46:dbbb
nic2       Up                                         4  fe80::250:c2ff:fec9:9e9a

Any interface on the system must be seen by Open/R before neighbor discovery can be performed. Look for the following issues:

Make sure all expected links are listed with the correct status ("Up" or "Down") and address.
If a link status is "Down", check if a lower layer is down (L2 or driver issue).
If a link is not listed, then the link is not configured on the system (driver issue).
A node or interface should not be "overloaded" unless undergoing a drain operation. If the "overload" flag is incorrectly set, run the appropriate command to unset it:

$ breeze lm unset-node-overload
$ breeze lm unset-link-overload <interface>

If pinging any of the link-local addresses is unsuccessful, then either a problem exists in the underlying driver/lower layer or the other end of the link is not ignited correctly.

Neighbor Discovery

Use the following command to list the adjacencies that have propagated to Open/R's KvStore:

$ breeze kvstore adj

> node-01.02.03.04.05.06's adjacencies, version: 3, Node  Label: 1064, Overloaded?: False
Neighbor                Local Interface    Remote  Interface      Metric    Weight    Adj Label  NextHop-v4     NextHop-v6                   Uptime
node-00.00.00.10.0e.45  terra0              terra0                     1         1            0   0.0.0.0       fe80::200:ff:fe10:e45        8d1h
node-00.00.00.10.0e.47  nic2                nic2                       1         1            0   0.0.0.0       fe80::250:c2ff:fec9:9c5d     8d1h

If an expected neighbor is missing over a working link, then go through the following steps:

Verify that link-local multicast packets are received from the neighboring node.

$ tcpdump -i terra0 udp port 6666
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on terra0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:17:25.350335 IP6 fe80::200:ff:fe10:e45.6666 > ip6-allnodes.6666: UDP, length 189
15:17:26.189291 IP6 fe80::3a3a:21ff:feb0:29d.6666 > ip6-allnodes.6666: UDP, length  189

If the link-local multicast packets are not received, check if the MULTICAST configuration is set on the interface.

$ ip link show terra0
12: terra0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:00:00:10:0b:47 brd ff:ff:ff:ff:ff:ff

If packets are going through but no adjacencies are formed, verify that both ends of the link are using the same domain. Open/R will not form adjacencies between nodes with different domains. The domain is set in the Open/R configuration file ("domain" field in /var/run/openr_config.json).

Key-Value Store (`KvStore`)

KvStore provides a data bus for distributed computing, as it enables all nodes to have the same view of the network.

Peers

Every neighbor listed in the adjacencies database should be added as a KvStore peer. Use the following command to list all peers:

$ breeze kvstore peers

== node-01.02.03.04.05.06's peers  ==

> node-00.00.00.10.0e.45
cmd via tcp://[fe80::200:ff:fe10:e45%terra0]:60002

> node-00.00.00.10.0e.47
cmd via tcp://[fe80::250:c2ff:fec9:9c5d%nic2]:60002

If the same neighbor appears over multiple links, it is added only once into KvStore. This could happen when, for instance, two PoP sites have a wireless link between them along with a wired link (required for PoP nodes). Open/R will typically sync on the wired link, and the wireless link will never be used for any traffic. A Terragraph watchdog monitors all wireless links to check if any data is flowing, and will restart the E2E minion if no data is passed for a few minutes. This should be prevented in Terragraph by not allowing wireless links between PoP sites.

Consistency

In steady state and under stable network conditions, all nodes should report the same keys and hashes for each. This can be seen using the following command:

$ breeze kvstore keys

Link-State Database (`Lsdb`)

Every node in the network contributes two major pieces of information to KvStore with the following key format:

adj:<node-name> - The node's adjacency information.
prefix:<node-name> - The network prefix owned/proxied by the node.

Use the following command to dump all keys:

$ breeze kvstore keys

== Available keys in KvStore  ==

Key                            OriginatorId              Version  Hash
-----------------------------  ----------------------  ---------  ----------
adj:node-00.00.00.10.0e.42     node-00.00.00.10.0e.42          4  3845919268
allocprefix:188                node-00.00.00.10.0e.42          1  3527252845
prefix:node-00.00.00.10.0e.42  node-00.00.00.10.0e.42          9  520548462
...

Verify that the prefix and adj keys are present for all nodes. Otherwise, follow these steps to troubleshoot:

If prefixes are missing, check that the Open/R prefix allocator is enabled (i.e. either OPENR_ALLOC_PREFIX or OPENR_STATIC_PREFIX_ALLOC is set in the configuration file). The following command lists the prefixes advertised by all nodes:

$ breeze kvstore prefixes --nodes all

Route Computation

Open/R's Decision module consumes the global link-state database from KvStore and, on each node, computes the best path to all other prefixes and generates routing information. Along with the best paths, Loop-Free Alternates (LFAs) are also programmed.

Use the following command to request the routing table from Decision module:

$ breeze decision routes

== Routes for node-00.00.00.10.0b.47  ==

> 2001:a:b:c::/64
via fe80::200:ff:fe10:b4c@terra0 metric 2

...

> ::/0
via fe80::250:43ff:fe59:a9c5@nic2 metric 4
via fe80::250:43ff:fef2:130a@nic2 metric 5

When troubleshooting routes to a particular prefix, verify that a route has been computed and that the metric values are correct for the best next-hops (lower is more preferable).

Route Programming

Routes computed by the Decision module are then programmed by the Fib agent (ex. fib_vpp) via Thrift APIs.

Use the following command to print all routes on the Fib agent:

$ breeze fib list

== node-00.00.00.10.0b.47's FIB routes by client 786  ==

> 2001:a:b:c::/64
via fe80::200:ff:fe10:b4c@terra0

...

> ::/0
via fe80::250:43ff:fe59:a9c5@nic2

When troubleshooting routes to a particular prefix, verify that the intended next-hop is programmed. Otherwise, follow these steps to troubleshoot:

If a route is missing but was reported by the Decision module, then it is a sync error that may be resolved by restarting the openr service.
If applicable, check that routes are programmed correctly into the VPP FIB by running the command below.

$ vppctl show ip6 fib

Logs

This section describes important logs generated on Terragraph nodes.

Log Descriptions

The table below provides a non-exhaustive list of log files.

Note that the logs in /var are archived to flash on each reboot and also periodically, so a longer continuous log history is often available.

Log	Description
`/data/log/reboot_history`	Time and reason for each reboot. These are rotated (older file is suffixed ".2").
`/data/format_data.log`	Contains the timestamp when the /data partition required re-formatting and re-initialization. The reformat attempts to preserve some configuration, but not logs.
`/var/log/e2e_minion/current`	E2E minion logs.
`/var/log/e2e_minion/process_history`	E2E minion startup history, created the first time that the minion restarts.
`/var/log/e2e_controller/current`	E2E controller logs (when running the controller on a TG node).
`/var/log/stats_agent/current`	Stats agent logs.
`/var/log/fluent-bit/current`	Fluent Bit (log daemon) logs.
`/var/log/openr/current`	Open/R (routing daemon) logs.
`/var/log/fib_vpp/current`	Open/R VPP FIB agent logs.
`/var/log/exabgp/current`	ExaBGP (BGP daemon) logs.
`/var/log/gpsd/current`	gpsd (GPS daemon) logs.
`/var/log/vpp/current`	VPP startup logs.
`/var/log/vpp/vnet.log`	VPP and DPDK logs, including `wil6210` and `dpaa2` Poll Mode Driver (PMD) logs.
`/var/log/vpp-debug.log`	VPP CLI (`vppctl`) command history.
`/var/log/vpp_chaperone/current`	VPP configuration service logs.
`/var/log/wil6210/`	Directory containing firmware and microcode logs when using the node configuration field `envParams.FW_LOGGING_ENABLED`.
`/var/log/kern.log`	Kernel and TG driver logs.
`/var/log/dmesg`	Recent items in `kern.log`, also accessible via the `dmesg` command.
Linux crash (panic) logs	Panic logs and kernel context are saved in a dedicated flash partition. On Puma, the kernel message log is saved to `/data/kernel_crashes/vmcore.<date>`. Note that the node will reboot within 3 minutes of a kernel crash.

If the software watchdog is enabled, the following logs will also be generated:

Log	Description
`/var/log/wdog_repair_history`	Log of every watchdog-initiated repair since the last reboot. These are rotated (older file is suffixed ".2").
`/var/log/openr/openr_debug.log`	Periodic traceroute logs collected by the watchdog when no PoP node was reachable. The log also shows the times when PoP nodes were unknown. This log is rotated, and the older version has a suffix.

Automatically-Generated Logs

Certain log files are collected continuously on the temporary filesystem in the /var/log/ directory. These include logs for Terragraph software as well as system state. The files in this directory are archived and rotated to the flash in /data/log/logs.x.tar.gz (with x=1 being the latest), both periodically and before reboots. Note that some less frequently updated logs only exist on the flash within /data/log/ or /data/.

Logs Generated Manually

An extensive system dump archive can be created manually in /tmp/sysdump-<date>.tgz using the sys_dump command.

Firmware

This section contains frequently asked questions relating to firmware issues.

What are the signs of firmware crashes?

The node has no RF links.
A radio displays unknown status:

$ tg2 minion status

Radio MAC           Status           GPS Sync
------------------  ---------------  ---------
00:00:00:10:0b:47   N/A (crashed?)   false

There are firmware core dumps present.

$ ls -al /var/volatile/cores/wil6210_fw_core_*

The E2E minion logs indicate that a Wigig device is down:

$ tail -f /var/log/e2e_minion/current
I0305 00:06:12.570667  3342 StatusApp.cpp:1196] <00:00:00:10:0b:47> Device status: DOWN

What are the possible causes of firmware crashes?

Firmware-related error messages are printed out by the wil6210 Poll Mode Driver (PMD) in VPP logs which are found in /var/log/vpp/vnet.log. Causes include the following:

The initialization of the firmware failed. The VPP logs show messages such as:

Firmware not ready after x ms

A firmware assert occurred. Assert codes can be used to identify specific firmware errors. The VPP logs show messages such as:

Firmware error detected, assert codes FW <hex fw assert code>, UCODE <hex ucode assert code>

The node is not configured as an RF node. This fundamental setting is controlled by a variable read from EEPROM, tg_if2if, which must be set to 0 for RF nodes.

# Check if configured as an IF node
$ get_hw_info TG_IF2IF
1
# Switch between RF/IF node
$ db_eeprom=$(fdtget /sys/firmware/fdt /chosen eeprom)
$ fdtput -p "$db_eeprom" -t i /board tg-if2if <0|1>

What are signs that a node has rebooted unexpectedly?

Unexpected reboots are logged in the file /data/log/reboot_history. If the "dirty" tag appears by itself on any line, it indicates an abnormal reboot which was not initiated by the "reboot" command or the watchdog.

$ cat /data/log/reboot_history
up 1523315537 Mon, 09 Apr 2018 16:12:17 -0700 dirty

Linux crashes will create logs in a dedicated flash partition. On Puma, the kernel message log is saved to /data/kernel_crashes/vmcore.<date>.

Why can't a wireless link be established?

Verify that at least one end of the link is in the ONLINE_INITIATOR state. Typically, it should take less than a minute for a sector to transition from ONLINE to ONLINE_INITIATOR. If the node remains only ONLINE, then it has not acquired GPS timing and cannot initiate a wireless link. This can be verified via the firmware stat tgf.<mac_addr>.tsf.syncModeGps, which will be 0 (instead of 1).
- There are potential issues if the following firmware stats are increasing:
  - tgf.<mac_addr>.gps.numMissedSec, tgf.<mac_addr>.gps.numPpsErr
- Check the following stats to see why the GPS chip may not be able to estimate time (and position):
  - tgd.gpsStat.fixNumSat - Must be at least 1, or at least 4 if the site location is missing from the topology.
- Check the list of all satellites visible to the GPS chip:
  - tgd.gpsStat.<id>.used - Whether this satellite is in use.
  - tgd.gpsStat.<id>.snr - Usable satellites would typically have SNR greater than 25.
Verify that beamforming (BF) messages (Tx/Rx) are being successfully exchanged using the commands below. On the initiator node, the tgf.MAC.mgmtTx.bfTrainingReq and tgf.MAC.mgmtRx.bfTrainingRsp stats should be non-zero. On the responder node, the tgf.MAC.mgmtRx.bfTrainingReq and tgf.MAC.mgmtTx.bfTrainingRsp stats should be non-zero. If this is not the case, then check for misalignment between the nodes.

$ tg2 stats driver-if | grep -e "bfTrainingReq" -e "bfTrainingRsp"

The node configuration for over-the-air (OTA) encryption should match on both ends of the link. Verify that the wsecEnable config field is the same on both nodes, as well as 802.1X parameters if applicable.
Verify that the interface stats for "RX packets" and "TX packets" are increasing, which indicates that the link is active.

$ ifconfig terra0
terra0 Link encap:Ethernet HWaddr 02:08:02:00:01:00
inet6 addr: fe80::8:2ff:fe00:100/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:7800 Metric:1
RX packets:7260 errors:0 dropped:0 overruns:0 frame:0
 TX packets:7264 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
 RX bytes:1429692 (1.3 MiB) TX bytes:1428796 (1.3 MiB)

Initiate a ping on both nodes, either using the other end's IPv6 address as the destination or doing a multicast ping on the link's interface. If this ping succeeds, then the link is active.

# Ping the destination IPv6 address
$ ping6 fe80::8:2ff:fe00:103%terra0
# Ping the multicast address
$ ping6 ff02::1%terra0

Why is throughput lower than expected?

Check the firmware stats for PER (tgf.<mac_addr>.staPkt.perE6) and MCS (tgf.<mac_addr>.staPkt.mcs). The PER should be close to 0%. The MCS should be greater than 9 for high throughput.

$ tg2 stats driver-if | grep -e "staPkt.mcs" -e "staPkt.perE6"

Check the CPU usage on the Linux host and the firmware. The host CPU usage is tracked via the system stat cpu.util. The firmware CPU usage can be viewed via the stat tgf.<mac_addr>.miscSys.cpuLoadAvg, which captures the idle CPU as a percentage; if this stat drops to 3 or less, then the firmware is overloaded.

$ tg2 stats driver-if | grep "cpuLoadAvg"

Make sure that firmware info logs are disabled. Running the command below will enable only error logging:

$ tg2 minion fw_set_log_config -l error

Check that firmware is not in BF responder mode, which reserves the Rx BF slots and reduces maximum throughput by roughly 50%. If a sector is in responder mode, the following firmware stats will indicate the Rx BF slot count incrementing while the Tx BF slot count remains constant.

$ tg2 stats driver-if | grep -e "slot.numOfRxBfSlotsPgmrd" -e "slot.numOfTxBfSlotsPgmrd"

Change log levels of firmware modules

Show all supported firmware modules and logging levels:

$ tg2 minion fw_set_log_config --help

Change the logging level:

# set logging level to 'info' for all fw modules
$ tg2 minion fw_set_log_config -l info
# set logging level to 'debug' for a specific node and specific fw modules
$ tg2 minion fw_set_log_config -m la_tpc -m framer -m tpc -l debug

Watchdog

The Terragraph watchdog is a collection of monitors that observes different aspects of a node's health. These monitors perform various repair actions, including reboots when necessary.

Watchdog Logs

All watchdog actions are timestamped and logged to the following locations:

/var/log/wdog_repair_history - Log of every watchdog-initiated repair since the last reboot.
/data/log/reboot_history - Reason for each reboot, including reboots that were not initiated by the watchdog.

Both of the above logs are rotated, and the older rotations are renamed with a numeric suffix. Copies of wdog_repair_history are archived to flash in /data/log/ at least once before rebooting.

Watchdog Fault Tables

The tables below list all detected faults.
The /var/log/ files mentioned in the fault tables are also archived to flash in /data/log/ on reboot.

Fault Name	Description	Repair Action	Keyword in repair/reboot history	Fault-specific logs and comments
POP	No PoP node was reachable for 1 hour	Reboot	pop_unreachable	Traceroute logs are saved in `/var/log/openr` every few minutes when PoP is unreachable. The log also indicates the times when the PoP nodes were unknown. Use the `get_pop_ip` command to find the PoP nodes currently known to the unit.
UPG	On first boot of an upgrade image, E2E minion was unable to connect to the controller for several minutes	Revert to previous TG image and reboot	testcode-timed-out	n/a
LINK	No RF link formed for some minutes (default 15) on a baseband card that was healthy at least once since the last e2e_minion startup.	Reload FW on all baseband cards and restart E2E minion	prog-[macAddress]-nolink	The reported mac address is the first baseband card on which the fault was detected. The actual timeout is a determined by the `radioParamsBase.fwParams.noLinkTimeout` config parameter. `/var/log/fwdumps/RF/` (rotated log archives)
FW	Firmware has crashed, or the datapath from FW to Linux is dead on a baseband card that was healthy at least once since the last e2e_minion startup.	Reload FW on all baseband cards and restart E2E minion	prog-[macAddress]-timeout	The reported mac address is the first baseband card on which the fault was detected. `/var/log/fwdumps/RF/` (rotated log archives)
BBINIT	None of the baseband cards was initialized with a mac address for 1 minute after the last e2e_minion restart, or internal error.	Reload FW on all baseband cards and restart E2E minion	prog-init	`/var/log/fwdumps/RF/` (rotated log archives)
PEER	Some RF links are up, but none reached its peer for 1 minute	Reload FW on all baseband cards and restart E2E minion	link_monit.sh	`/var/log/fwdumps/RF/` (rotated log archives)
E2E	Failed to restart E2E minion	Reboot	e2e_minion_restart_failed_prog, e2e_minion_restart_failed_link	Restarting e2e minion also reloads the FW, reloads drivers, and initializes the FW. The suffix (`_prog` and `_link`) identifies the specific health monitor that failed to restart e2e minion.
GPS	GPS is in a bad or unlocked state continuously for 30 minutes (the timeout persists across reboots).	Log the fault without repairing it	gps	n/a
CFG1	A unit running with an new/unverified config did not connect to the controller for about 7 minutes.	Reboot and fall back to the previous, working configuration	config-fallback-timed-out	n/a
CFG2	A unit with a config change since startup, rebooted before connecting to the controller, and also before the deadline to do so expired.	Start up with the unverified new config	config-unverified-new-config	Some config changes only take effect after a re-start, so we allow one re-start with an unverified new config.
CFG3	A unit that booted up with an unverified new config, rebooted before connecting to the controller, and also before the deadline to do so expired.	Reboot and fall back to the previous, working configuration	config-fallback	n/a
DATA	/data flash partition is full	Reboot and clean up /data	fsys-data-full	The reboot history may be truncated. When `/data` is completely full, the "fsys-data-full" reboot reason may not show up in the reboot history. Only the `/data/log` directory is cleaned up.
TMP	/tmp is full	Reboot and also clean up /data if necessary	fsys-tmp-full	n/a
VPP	The vpp cli (`vppctl`) has deadlocked	Restart vpp	vpp_cli	n/a

Disabling the Watchdog

Some watchdogs can be temporarily disabled during troubleshooting to avoid undesirable repair actions and reboots.
The maximum watchdog disable period is one day.
The watchdogs that monitor the filesystem, revert upgrades, and revert config cannot be disabled.

The watchdog CLI provides the following commands for disabling watchdogs:

# Disable all non-critical watchdogs for 45 minutes (max 1440 minutes, or 1 day)
/etc/init.d/watchdog.sh dis 45

# Re-enable all watchdogs
/etc/init.d/watchdog.sh en

Manual Upgrades

Software upgrades on Terragraph nodes should normally be performed using the NMS UI or TG CLI (see Maintenance and Configuration). If this is not possible, then manual upgrades can be performed but should only be used as a last resort. The Terragraph upgrade image, tg-update-qoriq.bin, is a self-extracting binary file.

Manual Upgrade Steps

Sanity check the release/version of the TG upgrade binary. This can be done in a Bash shell before copying the image to the node, or afterwards on the node (following step 3).

$ ./tg-update-qoriq.bin -m
{
"version":"Facebook Terragraph Release RELEASE_M70-0-g1ad294cc0 michaelcallahan@devvm4933 2021-03-05T17:21:02",
"md5":"d0149fe85367780989ea642fe0bd480b",
"model":"NXP TG Board",
"hardwareBoardIds":["NXP_LS1048A_PUMA"]
}

Temporarily disable watchdogs to ensure that the node will not reboot during the upgrade.

# Disable watchdogs for 20 minutes
$ /etc/init.d/watchdog.sh dis 20

Copy the upgrade image to the /tmp/ directory on the node, replacing the IP address in the command below with the correct IP. Check that the executable permission was preserved (SCP should do this automatically).

$ scp tg-update-qoriq.bin root@[2001::1]:/tmp/

Write the new image to the flash, then boot it up.

$ /tmp/tg-update-qoriq.bin -wrt

Log back into the node after it has rebooted, and sanity check the new image version.

$ cat /etc/tgversion
Facebook Terragraph Release RELEASE_M70-0-g1ad294cc0 michaelcallahan@devvm4933 2021-03-05T17:21:02

Finalize the upgrade. Step 4 boots the node into a "test" state in which a reboot for any reason causes an automatic fallback to the old image; finalization cancels this "test" state.

# This command also prints info about the flash partitions,
# and it is safe to use when there is no new image to finalize.
$ testcode c

Note: If the new image is not finalized within 6 minutes, and the E2E minion is enabled, and it is unable to connect to the E2E controller, then the watchdog will trigger a reboot and revert to the previous image. The "test" state and the reversion are safety features that prevent bricking the node in all stages of the upgrade process. The "test" upgrade state can be skipped if absolutely necessary by issuing a different command in Step 4:

$ /tmp/tg-update-qoriq.bin -ur

Manual Reverts

It is sometimes convenient to manually revert to the inactive/secondary boot image. This can only be done when the unit is not in the middle of an upgrade, and not in the "test" state described in the "Manual Upgrades" section above.

Sanity check the TG version of the secondary boot image. Make sure that it is what you expect.

$ testcode v
TG version in secondary boot partition (/dev/mmcblk0p2)
Facebook Terragraph Release RELEASE_M70-0-g1ad294cc0 michaelcallahan@devvm4933 2021-03-05T17:21:02

Swap the roles of the primary and secondary boot partitions.

$ testcode x
Swapping the roles of the primary (/dev/mmcblk0p1) and
the secondary (/dev/mmcblk0p2) boot partitions.
Done
Reboot now to activate the new primary image!

Reboot to activate the newly designated primary image.

$ reboot

Note: You can undo the revert request between Steps 2 and 3 as follows:

$ testcode c

Troubleshooting

Basic Network Debugging​

Checking controller status​

Checking network status​

All nodes are offline​

PoP nodes are online, and remaining nodes/links are offline​

A specific node is offline​

A specific link is offline​

Terragraph unit is unresponsive or not powered​

Fiber is connected to Terragraph, but remote switch/router indicates no link integrity​

Ethernet is connected to Terragraph, but Ethernet device indicates no link integrity​

Basic Node Debugging​

First Steps​

Manual Link Bring-Up​

Routing​

Triage​

Background​

Basic Checks​

Link Discovery​

Neighbor Discovery​

Key-Value Store (KvStore)​

Peers​

Consistency​

Link-State Database (Lsdb)​

Route Computation​

Route Programming​

Logs​

Log Descriptions​

Automatically-Generated Logs​

Logs Generated Manually​

Firmware​

What are the signs of firmware crashes?​

What are the possible causes of firmware crashes?​

What are signs that a node has rebooted unexpectedly?​

Why can't a wireless link be established?​

Why is throughput lower than expected?​

Change log levels of firmware modules​

Watchdog​

Watchdog Logs​

Watchdog Fault Tables​

Disabling the Watchdog​

Manual Upgrades​

Manual Upgrade Steps​

Manual Reverts​