System Health Metrics
Wayfinder emits a set of metrics that signal the health of the system and highlight issues you may need to investigate. The metrics are exposed in the standard Kubernetes way, so you can surface them with your existing observability tools.
Controller metrics
Labels
The following labels are used in controller metrics.
| Label | Description |
|---|---|
| controller | Name of the controller. Where an example on this page omits it, assume controller="cluster". |
| name | Name of the object being reconciled. Where an example on this page omits it, assume name="eks-dev". |
| namespace | Namespace of the object being reconciled. Where an example on this page omits it, assume namespace="dev-team". |
| result | "error", "requeue", "requeue_after", "success" |
| status | Value of the status field on the status subresource, e.g. "Pending", "Failed", "Success" |
| severity | Error severity |
| type | Error type |
Available metrics
Wayfinder exposes controller metrics using a Prometheus-compatible metrics endpoint.
| Name | Description | Labels |
|---|---|---|
| controller_runtime_reconcile_total | Total number of reconciliations per controller | controller, result |
| controller_runtime_reconcile_errors_total | Total number of reconciliation errors per controller (see Note) | controller |
| controller_runtime_reconcile_time_seconds | Length of time per reconciliation per controller | controller |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller | controller |
| controller_runtime_active_workers | Number of currently used workers per controller | controller |
Note: Same as controller_runtime_reconcile_total{result="error"}
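As a rough illustration of how these counters can be consumed, the Python sketch below parses a few lines of Prometheus text exposition and derives a per-controller error ratio from controller_runtime_reconcile_total. The sample values and the helper names (`parse_samples`, `error_ratio`) are invented for the example, and the parser only handles simple `name{labels} value` lines rather than the full exposition format.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the values are made up for illustration.
SAMPLE = '''
controller_runtime_reconcile_total{controller="cluster",result="success"} 90
controller_runtime_reconcile_total{controller="cluster",result="error"} 10
controller_runtime_reconcile_total{controller="cluster",result="requeue"} 0
controller_runtime_reconcile_errors_total{controller="cluster"} 10
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple `name{labels} value` lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, raw_labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def error_ratio(samples):
    """Return errors / total reconciliations for each controller."""
    total = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in samples:
        if name != "controller_runtime_reconcile_total":
            continue
        total[labels["controller"]] += value
        if labels.get("result") == "error":
            errors[labels["controller"]] += value
    return {c: errors[c] / total[c] for c in total if total[c] > 0}


ratios = error_ratio(parse_samples(SAMPLE))
print(ratios)
```

In a real deployment the same ratio would normally be computed by your monitoring system over a time window, not from raw counter totals. The sketch is only meant to show how the `result` label relates to the error counter described in the Note.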
Additional controller metrics coming soon
| Name | Description | Labels |
|---|---|---|
| wf_controller_reconcile_total | Total number of reconciliations per controller and object | controller, namespace, name, result |
| wf_controller_reconcile_errors_total | Total number of reconciliation errors per controller and object | controller, namespace, name, severity, type |
| wf_controller_component_reconcile_total | Total number of reconciliations per controller, object and (status) component | controller, namespace, name, component, result, status |
| wf_controller_component_reconcile_errors_total | Total number of reconciliation errors per controller, object and (status) component | controller, namespace, name, component, severity, type |
| wf_controller_reconcile_time_seconds | Length of time per reconciliation per controller and object | controller, namespace, name |
| wf_controller_reconcile_interval_seconds | Length of time since the last reconciliation happened per controller and object | controller, namespace, name |
| wf_controller_sync_period_seconds | Default sync period for a controller | controller |
Example reconciliation loop with recorded metrics
| Step | Recorded metric |
|---|---|
| Controller starts | wf_controller_sync_period_seconds |
| Reconciliation starts for dev-team/eks-dev cluster | wf_controller_reconcile_interval_seconds |
| Reconcile "Component A" | wf_controller_component_reconcile_total{component="Component A",result="error\|requeue\|requeue_after\|success"} |
| Reconcile "Component B" | wf_controller_component_reconcile_total{component="Component B",result="error\|requeue\|requeue_after\|success"} |
| Reconciliation finished | wf_controller_reconcile_total{result="error\|requeue\|requeue_after\|success"}, wf_controller_reconcile_time_seconds |
API metrics
The Wayfinder API produces a number of metrics covering inbound requests and how they are handled.
| Name | Description | Recorded metric |
|---|---|---|
| Policy Errors | The number of errors encountered trying to add a policy | policy_add_errors |
| Policy Engine Errors | A counter of the number of errors encountered in the policy engine | policy_errors |
| Policy Evaluation Summary | A summary of the policy evaluation time in seconds | policy_evaluation_seconds (summary) |
| Policy Evaluation Find Matching | A summary of the latency encountered when finding matching policies | policy_find_matches_seconds (summary) |
| Policies Out of Sync | A counter of the number of policies found out of sync | policy_out_of_sync |
| HTTP Request Average Summary | The average latency on requests to the apiserver | http_request_avg_sec, http_request_avg_sec_sum, http_request_avg_sec_count |
| HTTP Request Code Total | The total number of http requests broken down by http code | http_request_code_total |
| HTTP Request Error Total | The total number of http requests that have not been successful | http_request_error_total |
| HTTP Total Number of Requests | The total number of http requests to the apiserver | http_request_total |
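To show how the HTTP metrics fit together, here is a minimal Python sketch that derives an error ratio and a mean request latency from a single scrape. It assumes that `http_request_avg_sec_sum` and `http_request_avg_sec_count` follow the usual Prometheus summary convention (sum of observed values and number of observations). The function name `http_overview` and the sample values are hypothetical.

```python
def http_overview(scrape):
    """Derive an error ratio and mean latency from one scrape of the API metrics."""
    total = scrape.get("http_request_total", 0.0)
    errors = scrape.get("http_request_error_total", 0.0)
    count = scrape.get("http_request_avg_sec_count", 0.0)
    latency_sum = scrape.get("http_request_avg_sec_sum", 0.0)
    return {
        # Share of requests that were not successful.
        "error_ratio": errors / total if total else 0.0,
        # Assumption: a summary's _sum divided by its _count is the mean observation.
        "mean_latency_sec": latency_sum / count if count else 0.0,
    }


# Hypothetical scrape values, keyed by metric name.
scrape = {
    "http_request_total": 200.0,
    "http_request_error_total": 5.0,
    "http_request_avg_sec_sum": 50.0,
    "http_request_avg_sec_count": 200.0,
}
overview = http_overview(scrape)
print(overview)
```

Both values are lifetime averages since the process started. For alerting you would typically compute them over a recent time window instead.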
Database metrics
| Name | Description | Recorded metric |
|---|---|---|
| Database Total Creation Counter | A counter of the create operations in the db | db_create_counter |
| Database Total Deletion Counter | A counter of the delete operations in the db | db_delete_counter |
| Database Latency Summary on deletions | The latency on delete operations to the db | db_delete_latency_sec, db_delete_latency_sec_sum, db_delete_latency_sec_count |
| Overall Database Total Errors | A counter of the number of errors encountered by the db | db_error_counter |
| Database Latency Summary on selecting records | The latency on get operations to the db | db_get_latency_sec, db_get_latency_sec_sum, db_get_latency_sec_count |
| Database Latency Summary on selects / listing | The latency on list operations to the db | db_list_latency_sec, db_list_latency_sec_sum, db_list_latency_sec_count |
| Database Latency Summary on updates / insert | The latency on set operations to the db | db_set_latency_sec, db_set_latency_sec_sum, db_set_latency_sec_count |
| Database Updates | A counter of the update and add operations in the db | db_update_counter |
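The latency summaries above are cumulative, so a recent average is usually taken from the difference between two scrapes. The Python sketch below illustrates that calculation for each database operation. It assumes the `_sum` and `_count` series follow the standard Prometheus summary convention; the function name `windowed_mean_latency` and all sample values are hypothetical.

```python
DB_OPERATIONS = ("get", "list", "set", "delete")


def windowed_mean_latency(earlier, later):
    """Mean latency per db operation for the window between two scrapes."""
    means = {}
    for op in DB_OPERATIONS:
        prefix = f"db_{op}_latency_sec"
        d_count = later.get(f"{prefix}_count", 0.0) - earlier.get(f"{prefix}_count", 0.0)
        d_sum = later.get(f"{prefix}_sum", 0.0) - earlier.get(f"{prefix}_sum", 0.0)
        # Skip operations with no new observations; a negative delta would
        # suggest the counters were reset (for example after a restart).
        if d_count > 0:
            means[op] = d_sum / d_count
    return means


# Hypothetical scrapes taken some time apart.
earlier = {
    "db_get_latency_sec_sum": 2.0, "db_get_latency_sec_count": 100.0,
    "db_list_latency_sec_sum": 10.0, "db_list_latency_sec_count": 20.0,
}
later = {
    "db_get_latency_sec_sum": 3.5, "db_get_latency_sec_count": 150.0,
    "db_list_latency_sec_sum": 10.0, "db_list_latency_sec_count": 20.0,
}
means = windowed_mean_latency(earlier, later)
print(means)
```

Operations with no new observations in the window are left out of the result, so a missing key means "no traffic", not "zero latency".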