Version: 1.4

System Health Metrics

Wayfinder provides a set of metrics that signal the health of the system and highlight issues you may need to investigate. These metrics are emitted in the standard Kubernetes way, so you can surface them using your observability tools.

Controller metrics

Labels

The following labels are used in controller metrics.

| Label | Description |
| --- | --- |
| controller | Name of the controller, for example controller="cluster". |
| name | Name of the object being reconciled, for example name="eks-dev". |
| namespace | Namespace of the object being reconciled, for example namespace="dev-team". |
| result | Result of the reconciliation: "error", "requeue", "requeue_after" or "success". |
| status | Value of the status field on the status subresource, for example "Pending", "Failed" or "Success". |
| severity | Error severity. |
| type | Error type. |

Available metrics

Wayfinder exposes controller metrics using a Prometheus-compatible metrics endpoint.

| Name | Description | Labels |
| --- | --- | --- |
| controller_runtime_reconcile_total | Total number of reconciliations per controller | controller, result |
| controller_runtime_reconcile_errors_total | Total number of reconciliation errors per controller (see Note) | controller |
| controller_runtime_reconcile_time_seconds | Length of time per reconciliation per controller | controller |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller | controller |
| controller_runtime_active_workers | Number of currently used workers per controller | controller |

Note: controller_runtime_reconcile_errors_total is the same as controller_runtime_reconcile_total{result="error"}.
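To make the controller metrics above concrete, the following Python sketch parses a scrape in the Prometheus text exposition format and computes the share of reconciliations that ended in error for one controller. The sample payload and its values are made up for illustration, the helper names `parse_metrics` and `error_ratio` are hypothetical, and the parser is deliberately simplified: it ignores comment lines and skips any sample line that carries an optional timestamp.

```python
import re

# Hypothetical scrape output; the numbers are illustrative only.
SAMPLE = """\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="cluster",result="success"} 90
controller_runtime_reconcile_total{controller="cluster",result="error"} 10
controller_runtime_reconcile_errors_total{controller="cluster"} 10
"""

# name{label="value",...} value  (labels are optional)
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # simplified parser: unsupported line shapes are skipped
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def error_ratio(samples, controller):
    """Fraction of reconciliations with result="error" for one controller."""
    total = 0.0
    errors = 0.0
    for name, labels, value in samples:
        if name != "controller_runtime_reconcile_total":
            continue
        if labels.get("controller") != controller:
            continue
        total += value
        if labels.get("result") == "error":
            errors += value
    return errors / total if total else 0.0
```

In practice you would usually express this as a query in your monitoring system rather than parse the endpoint yourself; the sketch only shows how the `result` label partitions the reconcile counter.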

Additional controller metrics coming soon

| Name | Description | Labels |
| --- | --- | --- |
| wf_controller_reconcile_total | Total number of reconciliations per controller and object | controller, namespace, name, result |
| wf_controller_reconcile_errors_total | Total number of reconciliation errors per controller and object | controller, namespace, name, severity, type |
| wf_controller_component_reconcile_total | Total number of reconciliations per controller, object and (status) component | controller, namespace, name, component, result, status |
| wf_controller_component_reconcile_errors_total | Total number of reconciliation errors per controller, object and (status) component | controller, namespace, name, component, severity, type |
| wf_controller_reconcile_time_seconds | Length of time per reconciliation per controller and object | controller, namespace, name |
| wf_controller_reconcile_interval_seconds | Length of time since the last reconciliation happened per controller and object | controller, namespace, name |
| wf_controller_sync_period_seconds | Default sync period for a controller | controller |

Example reconciliation loop with recorded metrics

| Step | Recorded metric |
| --- | --- |
| Controller starts | wf_controller_sync_period_seconds |
| Reconciliation starts for the dev-team/eks-dev cluster | wf_controller_reconcile_interval_seconds |
| Reconcile "Component A" | wf_controller_component_reconcile_total{component="Component A"}, with result set to one of error, requeue, requeue_after, success |
| Reconcile "Component B" | wf_controller_component_reconcile_total{component="Component B"}, with result set to one of error, requeue, requeue_after, success |
| Reconciliation finished | wf_controller_reconcile_total, with result set to one of error, requeue, requeue_after, success; wf_controller_reconcile_time_seconds |
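The sequence in the table can be sketched in Python as a small in-memory simulation. This is not Wayfinder's implementation: the `counters` and `gauges` stores, the helper names, and the rule that the first non-success component result becomes the overall result are all assumptions made for illustration. Durations are stored as a single last-observed value here, whereas a real exporter would more likely use a histogram or summary.

```python
import time
from collections import Counter

counters = Counter()  # counter metrics, keyed by (metric name, sorted label pairs)
gauges = {}           # last observed value for duration-style metrics


def key(metric, labels):
    """Build a hashable series key from a metric name and a label dict."""
    return (metric, tuple(sorted(labels.items())))


def start_controller(controller, sync_period_seconds):
    """Controller starts: record the default sync period once."""
    labels = {"controller": controller}
    gauges[key("wf_controller_sync_period_seconds", labels)] = sync_period_seconds


def reconcile(controller, namespace, name, components, last_finished,
              now=time.monotonic):
    """Run one reconciliation; components is a list of (name, callable) pairs."""
    base = {"controller": controller, "namespace": namespace, "name": name}
    started = now()

    # Reconciliation starts: time since the previous reconciliation of this object.
    if last_finished is not None:
        gauges[key("wf_controller_reconcile_interval_seconds", base)] = (
            started - last_finished
        )

    overall = "success"
    # One counter increment per component, labelled with that component's result.
    for component, step in components:
        result = step()
        labels = dict(base, component=component, result=result)
        counters[key("wf_controller_component_reconcile_total", labels)] += 1
        if result != "success" and overall == "success":
            overall = result  # assumption: the first non-success result wins

    # Reconciliation finished: overall counter and duration.
    counters[key("wf_controller_reconcile_total", dict(base, result=overall))] += 1
    finished = now()
    gauges[key("wf_controller_reconcile_time_seconds", base)] = finished - started
    return overall, finished
```

Calling `start_controller("cluster", 600.0)` once and then `reconcile("cluster", "dev-team", "eks-dev", ...)` on each pass records the metrics in the same order as the table; the `now` parameter exists only so that a fixed clock can be injected when experimenting.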

API metrics

The Wayfinder API produces a number of metrics about inbound requests and how they are handled.

| Name | Description | Recorded metric |
| --- | --- | --- |
| Policy Errors | The number of errors encountered trying to add a policy | policy_add_errors |
| Policy Engine Errors | A counter of the number of errors encountered in the policy engine | policy_errors |
| Policy Evaluation Summary | A summary of the policy evaluation time in seconds | policy_evaluation_seconds (summary) |
| Policy Evaluation Find Matching | A summary of the latency encountered when finding matching policies | policy_find_matches_seconds (summary) |
| Policies Out of Sync | A counter of the number of policies found out of sync | policy_out_of_sync |
| HTTP Request Average Summary | The average latency on requests to the apiserver | http_request_avg_sec, http_request_avg_sec_sum, http_request_avg_sec_count |
| HTTP Request Code Total | The total number of HTTP requests broken down by HTTP code | http_request_code_total |
| HTTP Request Error Total | The total number of HTTP requests that have not been successful | http_request_error_total |
| HTTP Total Number of Requests | The total number of HTTP requests to the apiserver | http_request_total |
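One common use of the request totals above is an error rate over a time window. The Python sketch below computes it from two scrapes, assuming that `http_request_total` and `http_request_error_total` behave as monotonically increasing counters that restart from zero when the process restarts. The function names and the dictionary shape are hypothetical.

```python
def counter_delta(previous, current):
    """Increase of a counter between two scrapes.

    If the value dropped, assume the process restarted and the counter
    began again from zero, so the whole current value is the increase.
    """
    if current < previous:
        return current
    return current - previous


def error_rate(prev_scrape, curr_scrape):
    """Share of unsuccessful HTTP requests between two scrapes (0.0 to 1.0)."""
    total = counter_delta(
        prev_scrape["http_request_total"], curr_scrape["http_request_total"]
    )
    errors = counter_delta(
        prev_scrape["http_request_error_total"],
        curr_scrape["http_request_error_total"],
    )
    if total == 0:
        return 0.0  # no traffic in the window
    return errors / total
```

Working from the difference between scrapes rather than the raw totals keeps the result focused on recent traffic instead of everything since the API server started.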

Database metrics

| Name | Description | Recorded metric |
| --- | --- | --- |
| Database Total Creation Counter | A counter of the create operations in the database | db_create_counter |
| Database Total Deletion Counter | A counter of the delete operations in the database | db_delete_counter |
| Database Latency Summary on deletions | The latency on delete operations to the database | db_delete_latency_sec, db_delete_latency_sec_sum, db_delete_latency_sec_count |
| Overall Database Total Errors | A counter of the number of errors encountered by the database | db_error_counter |
| Database Latency Summary on selecting records | The latency on get operations to the database | db_get_latency_sec, db_get_latency_sec_sum, db_get_latency_sec_count |
| Database Latency Summary on selects / listing | The latency on list operations to the database | db_list_latency_sec, db_list_latency_sec_sum, db_list_latency_sec_count |
| Database Latency Summary on updates / insert | The latency on set operations to the database | db_set_latency_sec, db_set_latency_sec_sum, db_set_latency_sec_count |
| Database Updates | A counter of the update and add operations in the database | db_update_counter |
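The `_sum` and `_count` series of each latency summary can be combined into a mean latency per operation. The Python sketch below does this for the get, list, set and delete summaries and reports which operation is slowest on average. It assumes the scrape has already been reduced to a flat dictionary of series name to value; that shape and the function names are hypothetical.

```python
def mean_latency(samples, metric):
    """Mean seconds per operation from a summary's _sum and _count series.

    Returns None when the summary has recorded no observations.
    """
    count = samples.get(metric + "_count", 0.0)
    if count <= 0:
        return None
    return samples.get(metric + "_sum", 0.0) / count


def slowest_operation(samples, operations=("get", "list", "set", "delete")):
    """Name of the operation with the highest mean latency, or None if no data."""
    means = {}
    for op in operations:
        mean = mean_latency(samples, "db_%s_latency_sec" % op)
        if mean is not None:
            means[op] = mean
    if not means:
        return None
    return max(means, key=means.get)
```

Note that a mean computed from the raw `_sum` and `_count` values covers the whole lifetime of the process; to look at a recent window, take the difference of each series between two scrapes first.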