System Health Metrics
Wayfinder provides a set of metrics that signal the health of the system and highlight issues you may need to investigate. These metrics are emitted in the standard Kubernetes way, so you can surface them using your existing observability tools.
Controller metrics
Labels
The following labels are used in controller metrics.
Label | Description |
---|---|
controller | Name of the controller. When omitted, controller="cluster". |
name | Name of the object being reconciled. When omitted, name="eks-dev". |
namespace | Namespace of the object being reconciled. When omitted, namespace="dev-team". |
result | Result of the reconciliation: one of "error", "requeue", "requeue_after", "success" |
status | Value of the status field on the status subresource, e.g. "Pending", "Failed", "Success" |
severity | Error severity |
type | Error type |
component | Name of the (status) component being reconciled |
Available metrics
Wayfinder exposes controller metrics using a Prometheus-compatible metrics endpoint.
Name | Description | Labels |
---|---|---|
controller_runtime_reconcile_total | Total number of reconciliations per controller | controller, result |
controller_runtime_reconcile_errors_total | Total number of reconciliation errors per controller (see Note) | controller |
controller_runtime_reconcile_time_seconds | Length of time per reconciliation per controller | controller |
controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller | controller |
controller_runtime_active_workers | Number of currently used workers per controller | controller |
Note: controller_runtime_reconcile_errors_total is the same as controller_runtime_reconcile_total{result="error"}.
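As an illustration of how these counters can be combined, the following is a minimal sketch that parses a Prometheus-style text payload and computes the share of reconciliations that ended in an error for each controller. The sample payload and its values are made up for the example, and the parser is a simplified assumption: it ignores optional timestamps and is not a full implementation of the exposition format.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the values are invented for illustration.
SAMPLE = """\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="cluster",result="success"} 90
controller_runtime_reconcile_total{controller="cluster",result="error"} 10
controller_runtime_reconcile_errors_total{controller="cluster"} 10
"""

# name{labels} value   (the label block is optional)
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def error_ratio_by_controller(text):
    """Return {controller: errors / total} from controller_runtime_reconcile_total."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != "controller_runtime_reconcile_total":
            continue
        controller = labels.get("controller", "")
        totals[controller] += value
        if labels.get("result") == "error":
            errors[controller] += value
    return {c: errors[c] / totals[c] for c in totals if totals[c] > 0}


print(error_ratio_by_controller(SAMPLE))
```

Because the counters are cumulative, a ratio computed from a single scrape covers the whole lifetime of the process. Comparing two scrapes gives the ratio for a specific window.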
Additional controller metrics coming soon
Name | Description | Labels |
---|---|---|
wf_controller_reconcile_total | Total number of reconciliations per controller and object | controller, namespace, name, result |
wf_controller_reconcile_errors_total | Total number of reconciliation errors per controller and object | controller, namespace, name, severity, type |
wf_controller_component_reconcile_total | Total number of reconciliations per controller, object and (status) component | controller, namespace, name, component, result, status |
wf_controller_component_reconcile_errors_total | Total number of reconciliation errors per controller, object and (status) component | controller, namespace, name, component, severity, type |
wf_controller_reconcile_time_seconds | Length of time per reconciliation per controller and object | controller, namespace, name |
wf_controller_reconcile_interval_seconds | Length of time since the last reconciliation happened per controller and object | controller, namespace, name |
wf_controller_sync_period_seconds | Default sync period for a controller | controller |
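One possible use of the interval and sync period metrics is to spot objects that have not been reconciled for longer than expected. The sketch below assumes the values have already been scraped into plain dictionaries. The `tolerance` factor, the function name and the `eks-prod` object are hypothetical, and treating an interval longer than the sync period as a problem is an assumption rather than documented Wayfinder behaviour.

```python
def stale_objects(intervals, sync_periods, tolerance=1.5):
    """Return objects whose reconcile interval exceeds sync period * tolerance.

    intervals:    {(controller, namespace, name): seconds since last reconcile}
                  (values of wf_controller_reconcile_interval_seconds)
    sync_periods: {controller: seconds}
                  (values of wf_controller_sync_period_seconds)
    """
    stale = []
    for (controller, namespace, name), interval in sorted(intervals.items()):
        period = sync_periods.get(controller)
        if period is not None and interval > period * tolerance:
            stale.append((controller, namespace, name))
    return stale


# Made-up values for illustration only.
intervals = {
    ("cluster", "dev-team", "eks-dev"): 1200.0,
    ("cluster", "dev-team", "eks-prod"): 300.0,
}
sync_periods = {"cluster": 600.0}

print(stale_objects(intervals, sync_periods))
```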
Example reconciliation loop with recorded metrics
Step | Recorded metric |
---|---|
Controller starts | wf_controller_sync_period_seconds |
Reconciliation starts for dev-team/eks-dev cluster | wf_controller_reconcile_interval_seconds |
Reconcile "Component A" | wf_controller_component_reconcile_total{component="Component A",result="error|requeue|requeue_after|success"} |
Reconcile "Component B" | wf_controller_component_reconcile_total{component="Component B",result="error|requeue|requeue_after|success"} |
Reconciliation finished | wf_controller_reconcile_total{result="error|requeue|requeue_after|success"}, wf_controller_reconcile_time_seconds |
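The steps above can be sketched as a small simulation. This is not Wayfinder's implementation: the in-memory `counters` and `gauges` stores, the helper names, and the rule that the first non-success component result becomes the overall result are all assumptions made for illustration. The `status` label is omitted for brevity.

```python
import time
from collections import Counter

counters = Counter()  # (metric name, sorted label pairs) -> count
gauges = {}           # (metric name, sorted label pairs) -> last observed value


def series(metric, **labels):
    """Build a hashable key identifying one labelled time series."""
    return (metric, tuple(sorted(labels.items())))


def start_controller(controller, sync_period_seconds):
    # Controller starts: record its default sync period.
    key = series("wf_controller_sync_period_seconds", controller=controller)
    gauges[key] = sync_period_seconds


def reconcile(controller, namespace, name, components, seconds_since_last, clock):
    obj = {"controller": controller, "namespace": namespace, "name": name}
    # Reconciliation starts: record how long ago the previous one happened.
    gauges[series("wf_controller_reconcile_interval_seconds", **obj)] = seconds_since_last
    started = clock()
    overall = "success"
    for component, step in components.items():
        result = step()  # one of "error", "requeue", "requeue_after", "success"
        counters[series("wf_controller_component_reconcile_total",
                        component=component, result=result, **obj)] += 1
        if result != "success" and overall == "success":
            overall = result  # assumption: the first non-success result wins
    # Reconciliation finished: record the overall result and the duration.
    counters[series("wf_controller_reconcile_total", result=overall, **obj)] += 1
    gauges[series("wf_controller_reconcile_time_seconds", **obj)] = clock() - started
    return overall


start_controller("cluster", 600)
outcome = reconcile(
    "cluster", "dev-team", "eks-dev",
    {"Component A": lambda: "success", "Component B": lambda: "error"},
    seconds_since_last=600.0,
    clock=time.monotonic,
)
print(outcome)
```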
API metrics
The Wayfinder API produces a number of metrics covering inbound requests and how they are handled.
Name | Description | Recorded metric |
---|---|---|
Policy Errors | The number of errors encountered trying to add a policy | policy_add_errors |
Policy Engine Errors | The number of errors encountered in the policy engine | policy_errors |
Policy Evaluation Summary | A summary of the policy evaluation time in seconds | policy_evaluation_seconds|summary |
Policy Evaluation Find Matching | A summary of the latency encountered when finding matching policies | policy_find_matches_seconds|summary |
Policies Out of Sync | A counter of the number of policies found out of sync | policy_out_of_sync |
HTTP Request Average Summary | The average latency on requests to the apiserver | http_request_avg_sec, http_request_avg_sec_sum, http_request_avg_sec_count |
HTTP Request Code Total | The total number of http requests broken down by http code | http_request_code_total |
HTTP Request Error Total | The total number of http requests that have not been successful | http_request_error_total |
HTTP Total Number of Requests | The total number of http requests to the apiserver | http_request_total |
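As a rough example of using the request counters, the sketch below derives an error rate for the window between two scrapes. It assumes `http_request_total` and `http_request_error_total` are cumulative counters and that each scrape has been reduced to a plain dictionary of values. The function names and the sample numbers are hypothetical.

```python
def counter_increase(previous, current):
    """Increase of a cumulative counter between two scrapes.

    A drop in value is treated as a counter reset (for example a process
    restart), in which case the current value is the increase since the reset.
    """
    return current if current < previous else current - previous


def http_error_rate(prev_scrape, curr_scrape):
    """Fraction of requests in the window that were unsuccessful, or None."""
    total = counter_increase(
        prev_scrape["http_request_total"], curr_scrape["http_request_total"]
    )
    errors = counter_increase(
        prev_scrape["http_request_error_total"], curr_scrape["http_request_error_total"]
    )
    if total == 0:
        return None  # no traffic in the window, so the rate is undefined
    return errors / total


# Made-up values: 200 requests and 10 errors between the two scrapes.
prev = {"http_request_total": 1000, "http_request_error_total": 20}
curr = {"http_request_total": 1200, "http_request_error_total": 30}
print(http_error_rate(prev, curr))
```

Working from the difference between scrapes, rather than the raw totals, keeps the rate focused on recent traffic instead of the whole lifetime of the process.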
Database metrics
Name | Description | Recorded metric |
---|---|---|
Database Total Creation Counter | A counter of the create operations in the db | db_create_counter |
Database Total Deletion Counter | A counter of the delete operations in the db | db_delete_counter |
Database Latency Summary on deletions | The latency on delete operations to the db | db_delete_latency_sec, db_delete_latency_sec_sum, db_delete_latency_sec_count |
Overall Database Total Errors | A counter of the number of errors encountered by the db | db_error_counter |
Database Latency Summary on selecting records | The latency on get operations to the db | db_get_latency_sec, db_get_latency_sec_sum, db_get_latency_sec_count |
Database Latency Summary on selects / listing | The latency on list operations to the db | db_list_latency_sec, db_list_latency_sec_sum, db_list_latency_sec_count |
Database Latency Summary on updates / insert | The latency on set operations to the db | db_set_latency_sec, db_set_latency_sec_sum, db_set_latency_sec_count |
Database Updates | A counter of the update and add operations in the db | db_update_counter |
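The `_sum` and `_count` series in the table above can be combined to estimate a mean latency per operation. The following is a minimal sketch, assuming these series follow the usual Prometheus summary convention (total observed seconds and number of observations) and that the scraped values are available as a dictionary. The helper name and the sample values are hypothetical.

```python
OPERATIONS = ("get", "list", "set", "delete")


def mean_latencies(samples):
    """Return {operation: mean seconds} for operations with at least one observation."""
    result = {}
    for op in OPERATIONS:
        total = samples.get(f"db_{op}_latency_sec_sum")
        count = samples.get(f"db_{op}_latency_sec_count")
        if total is None or not count:
            continue  # metric missing or no observations yet
        result[op] = total / count
    return result


# Made-up values for illustration only.
samples = {
    "db_get_latency_sec_sum": 1.5,
    "db_get_latency_sec_count": 3,
    "db_list_latency_sec_sum": 4.0,
    "db_list_latency_sec_count": 16,
    "db_set_latency_sec_sum": 0.0,
    "db_set_latency_sec_count": 0,
}

print(mean_latencies(samples))  # {'get': 0.5, 'list': 0.25}
```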