# Configuration Management

This document describes the configuration management system.

## Architecture​

Terragraph implements a centralized configuration management architecture. The E2E controller serves as the point of truth for all node configurations, and is responsible for keeping all nodes' local copies in sync with its own view. All config changes are applied using the controller.

Node configuration is a JSON-serialized Thrift structure (thrift::NodeConfig), and is generally manipulated using folly::dynamic objects. A node's full configuration is computed by applying several "override" layers on top of the base configuration for its software version. All configuration files are stored under the /data/cfg/ directory on disk for both the E2E controller and Terragraph node.

### Operations​

The controller's ConfigApp accepts the following user operations on the node configurations:

User OperationCommands
Get Node ConfigGET_CTRL_CONFIG_REQ
GET_CTRL_CONFIG_PATHS_REQ
Get Config LayersGET_CTRL_CONFIG_BASE_REQ
GET_CTRL_CONFIG_FIRMWARE_BASE_REQ
GET_CTRL_CONFIG_HARDWARE_BASE_REQ
GET_CTRL_CONFIG_AUTO_NODE_OVERRIDES_REQ
GET_CTRL_CONFIG_NETWORK_OVERRIDES_REQ
GET_CTRL_CONFIG_NODE_OVERRIDES_REQ
Set Config OverridesSET_CTRL_CONFIG_NETWORK_OVERRIDES_REQ
SET_CTRL_CONFIG_NODE_OVERRIDES_REQ
MODIFY_CTRL_CONFIG_NETWORK_OVERRIDES_REQ
MODIFY_CTRL_CONFIG_NODE_OVERRIDES_REQ
CLEAR_AUTO_NODE_OVERRIDES_CONFIG
Get Config MetadataGET_CTRL_CONFIG_METADATA_REQ
Get Config ActionsGET_CTRL_CONFIG_NETWORK_OVERRIDES_ACTIONS_REQ
GET_CTRL_CONFIG_NODE_OVERRIDES_ACTIONS_REQ
GET_CTRL_CONFIG_ACTIONS_RESULTS_REQ
Perform OptimizationsTRIGGER_POLARITY_OPTIMIZATION
TRIGGER_GOLAY_OPTIMIZATION
TRIGGER_CONTROL_SUPERFRAME_OPTIMIZATION
TRIGGER_CHANNEL_OPTIMIZATION

All SET operations on the controller are validated using the config metadata, which is a separate JSON file describing every configuration parameter. All GET operations are served locally except for getting post-config actions, which sends out GET_MINION_CONFIG_ACTIONS_REQ messages to minions to request the actions that they would take for a given configuration change; responses can then be polled from the controller. Additionally, the controller handles the EDIT_NODE command from TopologyApp to migrate a node's configuration upon a node name change.

The controller pushes the full node configuration to a minion's ConfigApp through the SET_MINION_CONFIG_REQ command. The minion updates its copy of the configuration on disk (/data/cfg/node_config.json) and takes any required post-config actions. These actions are determined by taking the JSON difference between the current and new node configurations, then applying all actions defined in the local metadata file.

Note that much of the functionality described above is implemented within helper classes:

• ConfigHelper (e2e-controller) manages reading, modifying, and writing to all separate configuration layers.
• NodeConfigWrapper (e2e-common) manages reading and writing the full node configuration.
• ConfigMetadata (e2e-common) manages reading the config metadata file, validating configuration, and extracting post-config actions.
• ConfigUtil (e2e-common) contains miscellaneous static utilities, such as version string parsing.

### Syncing Configuration​

To keep config in sync, nodes send an MD5 hash of their local configuration to the controller in their periodic status reports (STATUS_REPORT) from StatusApp. The controller's ConfigApp periodically iterates over all received status reports and looks for mismatches with its computed configuration. For each mismatch, the controller will push a new node configuration to the minion, as described above. There is a fixed interval between consecutive pushes to the same node, as a protective measure in case the new configuration causes a fatal error on the node; this delay is reset when the controller receives a configuration change.

The automatic config sync can be disabled by "un-managing" the network or specific nodes via a special boolean configuration field sysParams.managedConfig. This may be needed temporarily for testing purposes.

### Staged Rollout​

The controller's ConfigApp contains an optional scheduling algorithm, similar to that of UpgradeApp, to roll out configuration changes that affect multiple nodes in batches (e.g. when changing network overrides). This protects against network-wide service disruption and network isolation, but will delay the propagation of configuration changes.

To enable this staging algorithm, set the config_staged_rollout_enabled flag on the controller; by default, all config updates to nodes are sent simultaneously.

### Interoperability​

When the controller finds a node with an unrecognized hardware board ID and version, it will disable config updates for that node because unknown hardware-specific configs may exist.

Additionally, if the unknown_hw_queries_enabled flag is set (default on), the controller will send GET_MINION_BASE_CONFIG queries to these nodes (rate-limited per node and board ID). In response, each minion will return a MINION_BASE_CONFIG message containing its latest base configuration and metadata. The controller merges these minion structures into its existing config data (non-persistently), and can afterwards re-enable config updates for all nodes with the same hardware board ID. Any conflicting metadata entries are dropped, and unknown post-config actions are parsed but ignored.

## Layered Configuration Model​

Terragraph follows a layered configuration model, with a node's "full" configuration computed as the union of all layers in the following order:

• Base configuration - The default configuration, which is tied to a specific software version and is included as part of the image. The controller finds the closest match for a node's software version string, and falls back to the latest if no match was found.
/etc/e2e_config/base_versions/<version>.json
• Firmware-specific base configuration - The default configuration tied to a specific firmware version, which is also included as part of the image. Values are applied on top of the initial base configuration layer.
/etc/e2e_config/base_versions/fw_versions/<version>.json
• Hardware-specific base configuration - The default configuration tied to a specific hardware type, which is also included as part of the image. Each hardware type supplies configuration that changes with software versions. Values are applied on top of the firmware base configuration layer. The mapping between hardware types and hardware board IDs is defined in /etc/e2e_config/base_versions/hw_versions/hw_types.json.
/etc/e2e_config/base_versions/hw_versions/<hw_type>/<version>.json
• Automated node overrides - Contains any config parameters for specific nodes that were automatically set by the E2E controller.
/data/cfg/auto_node_config_overrides.json
• Network overrides - Contains any config parameters that should be uniformly overridden across the entire network. This takes precedence over the base configuration and automatic overrides.
/data/cfg/network_config_overrides.json
• Node overrides - Contains any config parameters that should be overridden only on specific nodes (e.g. PoP nodes). This takes precedence over the network overrides.
/data/cfg/node_config_overrides.json

The E2E controller manages and stores the separate config layers. Terragraph nodes have no knowledge of these layers, except the base configuration on the image. The nodes will copy the latest base version (via natural sort order) if the configuration file on disk is missing or corrupt.

A separate config metadata file (/etc/e2e_config/config_metadata.json) is used for per-config documentation, verification, and actions. Every configuration field has a corresponding entry in this file.

### Data Types​

Whereas node configuration is a JSON-serialized Thrift structure, the metadata is freeform JSON that is recursively parsed into C++ structures defined in ConfigMetadata.h.

Each root metadata entry is parsed into the CfgParamMetadata structure, which inherits the CfgRecursiveParam base type. A root entry is identified by the presence of its required properties ("desc", "type", "action"). Depending on the associated configuration field's data type, the root structure may or must also contain a type-specific sub-structure (e.g. required for maps and objects). The table below shows the association between all possible data types (defined in the Thrift enum thrift::CfgParamType), the name of their type-specific field, and the C++ structure for the field:

TypeFieldSub-Structure
INTEGERintValCfgIntegerParam
FLOATfloatValCfgFloatParam
STRINGstrValCfgStringParam
BOOLEANboolValCfgBooleanParam
MAPmapValCfgMapParam
OBJECTobjValCfgObjectParam

Note that these data types constitute all JSON types except arrays. Arrays are not supported in the configuration model due to ambiguity in layering. Also note that map keys must be strings, in accordance with JSON specifications.

### Post-Config Actions​

A post-config action is associated with each metadata entry, determining what the minion should do when the configuration field is changed. These are defined in the Thrift enum thrift::CfgAction:

ActionDescription
NO_ACTIONDo nothing
REBOOTReboot the node
RESTART_MINIONRestart e2e_minion
RESTART_STATS_AGENTRestart stats_agent
RESTART_LOGTAILRestart logtail
RESTART_ROUTINGRestart openr and pop_config
RESTART_SQUIRERestart squire
REDO_POP_CONFIGRestart pop_config and fib_nss
RESTART_FLUENTD_AGENTRestart fluent-bit
RELOAD_DNS_SERVERSReload nameservers listed in /etc/resolv.conf
RELOAD_NTP_CONFIGRestart NTP service with updated config
RELOAD_RSYSLOG_CONFIGRewrite rsyslog.conf and restart rsyslogd
RELOAD_SSHD_CA_KEYSRestart sshd with updated CA keys file
RESTART_KEARestart kea (dhcp)
SYNC_LINK_MONITORSync link metrics with openr LinkMonitor
INJECT_KVSTORE_KEYSSync keys with openr KvStore
UPDATE_FIREWALLRewrite firewall(ip6tables) rules
UPDATE_LINK_METRICSUpdate MCS-based link metric config for openr
UPDATE_GLOG_LEVELUpdate glog's VLOG level
RELOAD_FIRMWAREReload firmware, usually by restarting e2e_minion
SET_FW_PARAMSDynamically change this firmware parameter
SET_FW_PARAMS_SYNC_OR_RELOAD_FIRMWAREDynamically change this firmware parameter at a specific BWGD index, or reload firmware if not possible
SET_FW_STATS_CONFIGSet firmware stats config
SET_AIRTIME_PARAMSDynamically change link airtime allocation
RESTART_UDP_PING_SERVERRestart UDP ping server
RELOAD_SSHDRestart SSH daemon
UPDATE_CHANNEL_ASSIGNMENTReassign channels across topology
RESTART_SNMPRestart NET-SNMP and TG snmp agent daemons
RESTART_WEBUIRestart WebUI (HTTP server)
RELOAD_TUNNEL_CONFIGReload tunnel configuration
RELOAD_VPP_CONFIG_AND_MONITORRun vpp_chaperone to re-apply VPP config and restart monitor services
UPDATE_ZONERedo POP config if required or do nothing
RELOAD_TOPOLOGY_NAMESRestart stats_agent and fluent-bit services

It is possible that multiple actions will be triggered for a single parameter, e.g. if an object property defines an action different from the root entry. The object property action does not simply replace the root entry action. Note that SET_FW_PARAMS and related actions are a special case and will only take effect when set on an object property (not on a root entry).

The code handling each action is located in the minion's ConfigApp::performNodeActions() method. The minion is responsible for applying actions in a sensible order, e.g. rebooting or restarting minion last.

### Structure​

Each root metadata entry must include these fields:

• desc: A string description of the field
• action: The post-config action (thrift::CfgAction)
• type: The data type of the field (thrift::CfgParamType)

Type-specific constraints can be defined using the previously-listed sub-structures:

• intVal
• allowedRanges: An array of allowed ranges, where each range is a two-element integer array [min, max] (inclusive)
• allowedValues: An array of allowed integer values, separate from allowedRanges (e.g. for special values like -1)
• floatVal
• allowedRanges: An array of allowed ranges, where each range is a two-element floating-point array [min, max] (inclusive)
• allowedValues: An array of allowed floating-point values, separate from allowedRanges
• strVal
• regexMatches: A regular expression string which the value must match
• intRanges: An array of allowed integer ranges (like above); the string value is parsed as an integer
• floatRanges: An array of allowed floating-point ranges (like above); the string value is parsed as a floating-point number
• allowedValues: An array of allowed string values, separate from regexMatches
• boolVal
• n/a
• objVal
• properties: A map of string property names to objects with the following fields:
• desc: A string description of the object property
• action: An additional post-config action (thrift::CfgAction)
• required: Whether defining this property is required (default false)
• type: The data type of the property value (thrift::CfgParamType)
• intVal, floatVal, strVal, boolVal, objVal, mapVal
• mapVal
• type: The data type of the map values (enum, see above)
• intVal, floatVal, strVal, boolVal, objVal, mapVal

### Deprecation​

A configuration field can be marked as "deprecated" by setting the optional property "deprecated": true on any root or object-property metadata entry. Deprecated configuration fields will still be initially read from disk, but only allow GET operations during runtime.

A configuration field can be marked as "read-only" by setting the optional property "readOnly": true on any root or object-property metadata entry. Read-only configuration fields will still be initially read from disk, but only allow GET operations during runtime.

### Preprocessing​

The config metadata parser supports a simple form of macro expansion. Macro expansion blocks are denoted with a "copy-block" marker (__copy_block__), which is a normal JSON property that takes as a string value the absolute location of the target block to copy (with all properties delimited by a period). The target JSON value will be copied into the source location (i.e. where copy-block was used). Any other properties defined within the original object with copy-block will override those same properties in the copied target block.

In the example below, the "mcs" block under "radioParamsBase" will be copied into "linkParamsBase", but with the action "SET_FW_PARAMS" instead of "RESTART_MINION":

{  "linkParamsBase": {    "desc": "Link parameters",    "action": "NO_ACTION",    "type": "OBJECT",    "objVal": {      "properties": {        "fwParams": {          "desc": "Firmware parameters for links",          "action": "RESTART_MINION",          "type": "OBJECT",          "objVal": {            "properties": {              "mcs": {                "__copy_block__": "radioParamsBase.objVal.properties.fwParams.objVal.properties.mcs",                "action": "SET_FW_PARAMS"              }            }          }        }      }    }  },  "radioParamsBase": {    "desc": "Radio parameters",    "action": "NO_ACTION",    "type": "OBJECT",    "objVal": {      "properties": {        "fwParams": {          "desc": "Firmware parameters for radios",          "action": "RESTART_MINION",          "type": "OBJECT",          "objVal": {            "properties": {              "mcs": {                "desc": "MCS used by transmitter (1-12: Static MCS, 35: Joint LA-TPC)",                "type": "INTEGER",                "intVal": {                  "allowedRanges": [[1, 12]],                  "allowedValues": [35]                }              }            }          }        }      }    }  }}

The current implementation handles circular- and self-references (throw error), as well as dependencies (parse with no order requirement).

## E2E Configuration​

The controller and aggregator are configured using separate configuration files with a similar overall design as the node configuration, and are managed by the E2EConfigWrapper class. The configuration model does not have any layers (or "base" configuration), but the user operations on these configurations are otherwise the same as for nodes.

These E2E configuration files are shown in the table below.

thrift::ControllerConfig/data/cfg/controller_config.json/etc/e2e_config/controller_config_metadata.json
thrift::AggregatorConfig/data/cfg/aggregator_config.json/etc/stats_config/aggregator_config_metadata.json

The following actions are supported in the E2E configuration files:

ActionDescription
NO_ACTIONDo nothing
REBOOTSend SIGTERM to the process
RESTART_STATS_AGENTRestart stats_agent
UPDATE_GLOG_LEVELUpdate glog's VLOG level
UPDATE_GFLAGReload associated gflag value
UPDATE_SCAN_CONFIGUpdate scan configuration

The E2E configuration structures contain a special "flags" map, which defines command-line flags to be passed into the E2E controller and NMS aggregator on startup. The systemd service scripts will pass the flags, and utilize the config_print_flags script which reads a configuration file and returns the command-line flags as a formatted string.

Certain metadata properties are specific to E2E configuration, and are described in the sections below.

#### Tags​

Descriptive tags can be set on any root or object-property metadata entry via the optional "tag" string property. Tags are purely informational.

#### Sync​

Controller configuration is partially synced between peers under High Availability mode (refer to High Availability for further details). Configuration fields must be marked "syncable" by setting the optional property "sync": true on any root or object-property metadata entry. In general, only network-level configuration is synced, and host-level configuration is not synced.

## Software Version Strings​

Selection of the correct base configuration and hardware-specific base configuration is determined by each node's software version string, contained in the file /etc/tgversion.

Optionally, "major" and "minor" version numbers are extracted from version strings that contain one of the following substrings:

[...] RELEASE_M<major> [...][...] RELEASE_M<major>_<minor> [...]

For example, the string below represents major version 60, minor version 7:

Facebook Terragraph Release RELEASE_M60_7-0-g5bec98eb9 michaelcallahan@devvm1112 2020-12-04T15:49:48

If major and minor versions are present, they are used to compare different software versions. Otherwise, string-type comparisons are done on the full version string. Order is important when determining the latest known configuration, for example during node startup where the initial node configuration file is generated if absent (via /usr/sbin/config_read_env). There is an assumption that the "latest" base configuration file present on a node's filesystem is the one it should use locally.

The sections below describe the steps required to write new configuration.

### New Hardware​

Hardware-specific configuration is applied based on a hardware board ID, hardware type, and software version.

• The hardware board ID is defined by the value of HW_BOARD_ID in the node info file (/var/run/node_info). This file is generated by a startup script gen_node_info_file.sh which must be modified for new hardware platforms. See Service Scripts for more details.
• The hardware type is defined in hw_types.json (described above), which is a mapping between board IDs and subdirectory names (within ./base_versions/hw_versions/). One hardware type can represent multiple board IDs (e.g. for related platforms with identical configuration).
• The software version is used to match both the base configuration and hardware-specific configuration. If any hardware-specific configuration should be applied, then a file should exist for each supported software version in both ./base_versions/<version>.json and ./base_versions/hw_versions/<hw_type>/<version>.json.

For interoperability, vendors must define a new, unique hardware board ID for each product.

### New Configuration Fields​

All configuration fields must have the following parts:

• Definition - Within the appropriate Thrift structure (i.e. NodeConfig, ControllerConfig, or AggregatorConfig), as described above.
• Metadata - Within the corresponding metadata file (i.e. controller_metadata.json, controller_config_metadata.json, or aggregator_config_metadata.json).
• Default value - Within the base configuration JSON file. For hardware-specific fields, the default value should be placed in the hardware-specific base configuration file instead; in these cases, an empty value is usually added to the base configuration as well (for visibility).

For interoperability, vendors should follow these guidelines:

• Choose obviously non-conflicting key names for custom fields. For example, include the company name in every field name (for top-level structures).
• Pick indices for new Thrift fields within unique ranges to avoid serialization and/or merge conflicts. For example, start at 20000 (random number) instead of the next consecutive integer.

### New Command-Line Flags​

When adding new command-line flags to the E2E controller or NMS aggregator, it is recommended (but not required) to add an entry to corresponding metadata file. This provides better visibility (e.g. in the NMS), enables input validation, and allows post-config actions to be applied.

Flags for other services (e.g. E2E minion) are not supported here; consider using explicit configuration fields instead.

### Breaking Changes​

For configuration changes which are not backwards compatible (e.g. moving/renaming fields), it is recommended to support automatic migration between the old and new configuration versions to avoid the need for user intervention. This migration procedure is implemented in the script src/terragraph-e2e/lua/migrate_e2e_data.lua, which should be run automatically during software upgrades (e.g. when using the NMS installer) and will modify data files as needed. Follow instructions in this script to add additional migration functions.