System Health Metrics

Wayfinder provides a set of metrics that signal the health of the system and point to issues you may need to investigate. These metrics are exposed in the standard Kubernetes way, so you can surface them with your existing observability tools.

Controller metrics

Labels

The following labels are used in controller metrics.

| Label | Description |
| --- | --- |
| controller | Name of the controller. When omitted, controller="cluster". |
| name | Name of the object being reconciled. When omitted, name="eks-dev". |
| namespace | Namespace of the object being reconciled. When omitted, namespace="dev-team". |
| result | One of "error", "requeue", "requeue_after", "success". |
| status | Value of the status field on the status subresource, e.g. "Pending", "Failed", "Success". |
| severity | Error severity. |
| type | Error type. |

Available metrics

Wayfinder exposes controller metrics using a Prometheus-compatible metrics endpoint.

| Name | Description | Labels |
| --- | --- | --- |
| controller_runtime_reconcile_total | Total number of reconciliations per controller | controller, result |
| controller_runtime_reconcile_errors_total | Total number of reconciliation errors per controller (see Note) | controller |
| controller_runtime_reconcile_time_seconds | Length of time per reconciliation per controller | controller |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller | controller |
| controller_runtime_active_workers | Number of currently used workers per controller | controller |

Note: controller_runtime_reconcile_errors_total is the same as controller_runtime_reconcile_total{result="error"}.
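As an illustration of how these counters can be read, the sketch below parses a scrape in the Prometheus text format and derives the share of reconciliations that ended in an error for each controller. The sample scrape is made up, and the helper names (`parse_samples`, `error_ratio_by_controller`) are hypothetical; only the metric and label names come from the table above. The parser is deliberately minimal and is not a full implementation of the exposition format.

```python
import re
from collections import defaultdict

# One sample line of the text format: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric name, labels dict, float value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            yield match.group("name"), labels, float(match.group("value"))


def error_ratio_by_controller(text):
    """Return {controller: errors / total reconciliations}."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != "controller_runtime_reconcile_total":
            continue
        controller = labels.get("controller", "")
        totals[controller] += value
        if labels.get("result") == "error":
            errors[controller] += value
    return {c: errors[c] / totals[c] for c in totals if totals[c] > 0}


# Made-up scrape output, for illustration only.
SCRAPE = """\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="cluster",result="success"} 90
controller_runtime_reconcile_total{controller="cluster",result="error"} 10
controller_runtime_reconcile_total{controller="cluster",result="requeue"} 0
"""

print(error_ratio_by_controller(SCRAPE))
```

Because the counters are cumulative, this ratio covers the whole lifetime of the process. For alerting you would normally compare two scrapes, or let your monitoring system compute a rate.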

Additional controller metrics coming soon

| Name | Description | Labels |
| --- | --- | --- |
| wf_controller_reconcile_total | Total number of reconciliations per controller and object | controller, namespace, name, result |
| wf_controller_reconcile_errors_total | Total number of reconciliation errors per controller and object | controller, namespace, name, severity, type |
| wf_controller_component_reconcile_total | Total number of reconciliations per controller, object and (status) component | controller, namespace, name, component, result, status |
| wf_controller_component_reconcile_errors_total | Total number of reconciliation errors per controller, object and (status) component | controller, namespace, name, component, severity, type |
| wf_controller_reconcile_time_seconds | Length of time per reconciliation per controller and object | controller, namespace, name |
| wf_controller_reconcile_interval_seconds | Length of time since the last reconciliation per controller and object | controller, namespace, name |
| wf_controller_sync_period_seconds | Default sync period for a controller | controller |

Example reconciliation loop with recorded metrics

| Step | Recorded metric |
| --- | --- |
| Controller starts | wf_controller_sync_period_seconds |
| Reconciliation starts for dev-team/eks-dev cluster | wf_controller_reconcile_interval_seconds |
| Reconcile "Component A" | wf_controller_component_reconcile_total{component="Component A",result="error\|requeue\|requeue_after\|success"} |
| Reconcile "Component B" | wf_controller_component_reconcile_total{component="Component B",result="error\|requeue\|requeue_after\|success"} |
| Reconciliation finished | wf_controller_reconcile_total{result="error\|requeue\|requeue_after\|success"} |
| | wf_controller_reconcile_time_seconds |
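The sequence above can be sketched as a small simulation. This is not Wayfinder's implementation: `Recorder` and `reconcile` are hypothetical stand-ins, the timestamps and the 300-second sync period are made up, and the sketch assumes that a reconciliation stops at the first component that does not succeed and reports that component's result. The status label is left out for brevity, and durations are stored as plain last-seen values rather than as a histogram or summary.

```python
from collections import Counter


class Recorder:
    """In-memory stand-in for a metrics registry (hypothetical, for illustration)."""

    def __init__(self):
        self.counts = Counter()  # counter-style metrics
        self.values = {}         # last observed value of duration-style metrics

    @staticmethod
    def _key(metric, labels):
        return metric, tuple(sorted(labels.items()))

    def inc(self, metric, labels):
        self.counts[self._key(metric, labels)] += 1

    def observe(self, metric, labels, value):
        self.values[self._key(metric, labels)] = value


def reconcile(rec, namespace, name, components, previous_at, started_at, finished_at):
    """Record the metrics for one reconciliation of namespace/name."""
    base = {"controller": "cluster", "namespace": namespace, "name": name}

    # Reconciliation starts: time elapsed since the previous reconciliation.
    rec.observe("wf_controller_reconcile_interval_seconds", base, started_at - previous_at)

    overall = "success"
    for component, reconcile_component in components:
        result = reconcile_component()
        rec.inc(
            "wf_controller_component_reconcile_total",
            {**base, "component": component, "result": result},
        )
        if result != "success":  # assumption: stop at the first non-success
            overall = result
            break

    # Reconciliation finished: overall result and duration.
    rec.inc("wf_controller_reconcile_total", {**base, "result": overall})
    rec.observe("wf_controller_reconcile_time_seconds", base, finished_at - started_at)
    return overall


rec = Recorder()
# Controller starts: record the default sync period (made-up value).
rec.observe("wf_controller_sync_period_seconds", {"controller": "cluster"}, 300)

components = [("Component A", lambda: "success"), ("Component B", lambda: "requeue")]
result = reconcile(
    rec, "dev-team", "eks-dev", components,
    previous_at=100.0, started_at=400.0, finished_at=402.5,
)
print(result)
```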

API metrics

The Wayfinder API produces a number of metrics about inbound requests and how they are handled.

| Name | Description | Recorded metric |
| --- | --- | --- |
| Policy Errors | The number of errors encountered when trying to add a policy | policy_add_errors |
| Policy Engine Errors | A counter of the errors encountered in the policy engine | policy_errors |
| Policy Evaluation Summary | A summary of the policy evaluation time in seconds | policy_evaluation_seconds\|summary |
| Policy Evaluation Find Matching | A summary of the latency encountered when finding matching policies | policy_find_matches_seconds\|summary |
| Policies Out of Sync | A counter of the policies found out of sync | policy_out_of_sync |
| HTTP Request Average Summary | The average latency of requests to the apiserver | http_request_avg_sec, http_request_avg_sec_sum, http_request_avg_sec_count |
| HTTP Request Code Total | The total number of HTTP requests, broken down by HTTP status code | http_request_code_total |
| HTTP Request Error Total | The total number of HTTP requests that were not successful | http_request_error_total |
| HTTP Total Number of Requests | The total number of HTTP requests to the apiserver | http_request_total |
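One way to turn the request counters into a health signal is to compare two scrapes and compute the share of failed requests in between. The sketch below assumes that http_request_total and http_request_error_total are cumulative counters, as their names suggest; the sample values and the helper names are made up for illustration.

```python
def counter_delta(earlier, later):
    """Increase of a cumulative counter between two scrapes.

    A drop is taken to mean the process restarted and the counter began
    again from zero, so the later value is the whole increase.
    """
    return later if later < earlier else later - earlier


def error_ratio(earlier, later):
    """Share of unsuccessful HTTP requests between two scrapes."""
    total = counter_delta(earlier["http_request_total"], later["http_request_total"])
    if total == 0:
        return 0.0  # no traffic in the window
    errors = counter_delta(
        earlier["http_request_error_total"], later["http_request_error_total"]
    )
    return errors / total


# Made-up values from two scrapes taken some time apart.
EARLIER = {"http_request_total": 1000, "http_request_error_total": 20}
LATER = {"http_request_total": 1500, "http_request_error_total": 45}

print(f"{error_ratio(EARLIER, LATER):.1%} of requests failed in the window")
```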

Database metrics

| Name | Description | Recorded metric |
| --- | --- | --- |
| Database Total Creation Counter | A counter of the create operations in the database | db_create_counter |
| Database Total Deletion Counter | A counter of the delete operations in the database | db_delete_counter |
| Database Latency Summary on deletions | The latency of delete operations on the database | db_delete_latency_sec, db_delete_latency_sec_sum, db_delete_latency_sec_count |
| Overall Database Total Errors | A counter of the errors encountered by the database | db_error_counter |
| Database Latency Summary on selecting records | The latency of get operations on the database | db_get_latency_sec, db_get_latency_sec_sum, db_get_latency_sec_count |
| Database Latency Summary on selects / listing | The latency of list operations on the database | db_list_latency_sec, db_list_latency_sec_sum, db_list_latency_sec_count |
| Database Latency Summary on updates / insert | The latency of set operations on the database | db_set_latency_sec, db_set_latency_sec_sum, db_set_latency_sec_count |
| Database Updates | A counter of the update and add operations in the database | db_update_counter |
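The _sum and _count series of each latency summary can be combined into a mean latency per operation. A minimal sketch, assuming _sum holds the total observed seconds and _count the number of observations (the usual meaning for a Prometheus summary); the sample values and the helper name are hypothetical.

```python
OPERATIONS = ("get", "list", "set", "delete")


def mean_latency_seconds(samples, operation):
    """Mean latency of one database operation, or None if nothing was observed."""
    count = samples.get(f"db_{operation}_latency_sec_count", 0)
    if count == 0:
        return None
    return samples[f"db_{operation}_latency_sec_sum"] / count


# Made-up scrape values, for illustration only.
SAMPLES = {
    "db_get_latency_sec_sum": 12.0, "db_get_latency_sec_count": 4000,
    "db_list_latency_sec_sum": 9.0, "db_list_latency_sec_count": 300,
    "db_set_latency_sec_sum": 5.0, "db_set_latency_sec_count": 1000,
    "db_delete_latency_sec_sum": 0.0, "db_delete_latency_sec_count": 0,
}

for operation in OPERATIONS:
    mean = mean_latency_seconds(SAMPLES, operation)
    print(operation, "no data" if mean is None else f"{mean * 1000:.1f} ms")
```

This gives the mean over the lifetime of the process. To see recent behaviour, apply the same division to the differences between two scrapes.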