These essential CockroachDB metrics let you monitor your CockroachDB self-hosted cluster. Use them to build custom dashboards with the following tools:

- Datadog integration - The Datadog Integration Metric Name column lists the corresponding Datadog metric, which requires the crdb_dedicated. prefix.
- Metrics export

The Usage column explains why each metric is important to visualize and how to make both practical and actionable use of the metric in a production deployment.
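In addition to the Datadog integration and metrics export, each node also exposes these metrics in Prometheus format at its HTTP endpoint under /_status/vars, where metric names use underscores in place of dots and hyphens (for example, sys.uptime becomes sys_uptime). The following is a minimal sketch of reading a few of the metrics below directly from a node; the URL, port, and helper name are assumptions to adapt to your deployment (a secure cluster additionally requires TLS and authentication).

```python
# Minimal sketch: read a few essential metrics from one node's Prometheus endpoint.
# Assumptions: an insecure local node with its HTTP port at 8080; metric names follow
# the Prometheus naming convention (sys.uptime -> sys_uptime).
import urllib.request

NODE_URL = "http://localhost:8080/_status/vars"  # adjust host/port/TLS for your cluster
WATCHED = {"sys_uptime", "sys_rss", "sql_conns", "ranges_unavailable"}

def scrape(url: str = NODE_URL) -> dict[str, float]:
    """Return {metric_name: value} for the watched metrics on one node."""
    samples = {}
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            if not line or line.startswith("#"):   # skip blank and HELP/TYPE lines
                continue
            name, _, value = line.rpartition(" ")
            base = name.split("{", 1)[0]            # drop any label set
            if base in WATCHED:
                samples[base] = float(value)
    return samples

if __name__ == "__main__":
    for metric, value in scrape().items():
        print(f"{metric} = {value}")
```

The sketches that follow some of the tables below assume per-node samples collected in this way, keyed by the underscore-style metric names.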
Platform

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
sys.cpu.combined.percent-normalized | sys.cpu.combined.percent.normalized | Current user+system cpu percentage consumed by the CRDB process, normalized 0-1 by number of cores | This metric gives the CPU utilization percentage by the CockroachDB process. If it is equal to 1 (or 100%), then the CPU is overloaded. The CockroachDB process should not be running with over 80% utilization for extended periods of time (hours). This metric is used in the DB Console CPU Percent graph.
sys.cpu.host.combined.percent-normalized | NOT AVAILABLE | Current user+system cpu percentage across the whole machine, normalized 0-1 by number of cores | This metric gives the CPU utilization percentage of the underlying server, virtual machine, or container hosting the CockroachDB process. It includes CPU usage from both CockroachDB and non-CockroachDB processes. It also accounts for time spent processing hardware (irq) and software (softirq) interrupts, as well as nice time, which represents low-priority user-mode activity. A value of 1 (or 100%) indicates that the CPU is overloaded. Avoid running the CockroachDB process in an environment where the CPU remains overloaded for extended periods (e.g., multiple hours). This metric appears in the DB Console Host CPU Percent graph.
sys.cpu.sys.percent | sys.cpu.sys.percent | Current system cpu percentage consumed by the CRDB process | This metric gives the CPU usage percentage at the system (Linux kernel) level by the CockroachDB process only. This is similar to the Linux top command output. The metric value can be more than 1 (or 100%) on multi-core systems. It is best to combine user and system metrics.
sys.cpu.user.percent | sys.cpu.user.percent | Current user cpu percentage consumed by the CRDB process | This metric gives the CPU usage percentage at the user level by the CockroachDB process only. This is similar to the Linux top command output. The metric value can be more than 1 (or 100%) on multi-core systems. It is best to combine user and system metrics.
sys.host.disk.iopsinprogress | NOT AVAILABLE | IO operations currently in progress on this host (as reported by the OS) | This metric gives the average queue length of the storage device. It characterizes the storage device's performance capability. All I/O performance metrics are Linux counters and correspond to the avgqu-sz in the Linux iostat command output. View the device queue graph in the context of the actual read/write IOPS and MBPS metrics that show the actual device utilization. If the device is not keeping up, the queue will grow. Values over 10 are bad. Values around 5 mean the device is working hard trying to keep up. For internal (on chassis) NVMe devices, the queue values are typically 0. For network-connected devices, such as AWS EBS volumes, the normal operating range of values is 1 to 2. Spikes in values are OK. They indicate an I/O spike where the device fell behind and then caught up. End users may experience inconsistent response times, but there should be no cluster stability issues. If the queue is greater than 5 for an extended period of time and IOPS or MBPS are low, then the storage is most likely not provisioned per Cockroach Labs guidance. In AWS EBS, this is commonly an EBS type, such as gp2, that is not suitable as primary database storage. If I/O is low and the queue is low, the most likely scenario is that the CPU is lacking and not driving I/O. One such case is a cluster with nodes that have only 2 vCPUs, which is not a supported sizing for production deployments. There are quite a few background processes in the database that take CPU away from the workload, so the workload is simply not getting the CPU. Review storage and disk I/O.
sys.host.disk.read.bytes | NOT AVAILABLE | Bytes read from all disks since this process started (as reported by the OS) | This metric reports the effective storage device read throughput (MB/s) rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric.
sys.host.disk.read.count | NOT AVAILABLE | Disk read operations across all disks since this process started (as reported by the OS) | This metric reports the effective storage device read IOPS rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric.
sys.host.disk.write.bytes | NOT AVAILABLE | Bytes written to all disks since this process started (as reported by the OS) | This metric reports the effective storage device write throughput (MB/s) rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric.
sys.host.disk.write.count | NOT AVAILABLE | Disk write operations across all disks since this process started (as reported by the OS) | This metric reports the effective storage device write IOPS rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric.
sys.host.net.recv.bytes | sys.host.net.recv.bytes | Bytes received on all network interfaces since this process started (as reported by the OS) | This metric gives the node's ingress network transfer rate. Watch for flat sections, which may indicate insufficiently provisioned networking or high error rates. CockroachDB uses a reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect.
sys.host.net.send.bytes | sys.host.net.send.bytes | Bytes sent on all network interfaces since this process started (as reported by the OS) | This metric gives the node's egress network transfer rate. Watch for flat sections, which may indicate insufficiently provisioned networking or high error rates. CockroachDB uses a reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect.
sys.rss | sys.rss | Current process RSS | This metric gives the amount of RAM used by the CockroachDB process. Persistently low values over an extended period of time suggest there is underutilized memory that can be put to work with adjusted settings for --cache or --max_sql_memory or both. Conversely, high utilization, even if a temporary spike, indicates an increased risk of an out-of-memory (OOM) crash (particularly since swap is generally disabled).
sys.runnable.goroutines.per.cpu | NOT AVAILABLE | Average number of goroutines that are waiting to run, normalized by number of cores | If this metric has a value over 30, it indicates CPU overload. If the condition lasts a short period of time (a few seconds), database users are likely to experience inconsistent response times. If the condition persists for an extended period of time (tens of seconds, or minutes), the cluster may start developing stability issues. Review CPU planning.
sys.uptime | sys.uptime | Process uptime | This metric measures the length of time, in seconds, that the CockroachDB process has been running. Monitor this metric to detect events such as node restarts, which may require investigation or intervention.
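The disk and network metrics in the table above are cumulative counters reported by the OS, so dashboards typically chart their per-second rate, while the CPU, goroutine, and queue-depth metrics are gauges compared against the rules of thumb in the Usage column. The sketch below applies those thresholds to two successive scrapes; the sample dictionaries, their underscore-style keys, and the function name are illustrative assumptions, not CockroachDB tooling.

```python
# Sketch: derive per-second rates from two successive scrapes of the cumulative
# Platform counters, and apply the rule-of-thumb thresholds from the Usage column.
def platform_health(prev: dict, curr: dict, interval_s: float) -> list[str]:
    findings = []

    # Counters: bytes since process start -> per-second rates.
    for counter in ("sys_host_disk_read_bytes", "sys_host_disk_write_bytes",
                    "sys_host_net_recv_bytes", "sys_host_net_send_bytes"):
        rate = (curr[counter] - prev[counter]) / interval_s
        findings.append(f"{counter}: {rate / 1e6:.1f} MB/s")

    # Gauges: compare against the guidance in the table above.
    if curr["sys_cpu_combined_percent_normalized"] > 0.80:
        findings.append("CPU over 80% - investigate if sustained for hours")
    if curr["sys_runnable_goroutines_per_cpu"] > 30:
        findings.append("runnable goroutines per CPU over 30 - CPU overload")
    if curr["sys_host_disk_iopsinprogress"] > 10:
        findings.append("disk queue depth over 10 - storage likely underprovisioned")
    if curr["sys_uptime"] < prev["sys_uptime"]:
        findings.append("uptime decreased - node restarted since last scrape")
    return findings
```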
Storage

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
admission.io.overload | NOT AVAILABLE | 1-normalized float indicating whether IO admission control considers the store as overloaded with respect to compaction out of L0 (considers sub-level and file counts). | If the value of this metric exceeds 1, it indicates overload. You can also look at the metrics storage.l0-num-files, storage.l0-sublevels, or rocksdb.read-amplification directly. A healthy LSM shape is defined as "read-amp < 20" and "L0-files < 1000", per the cluster settings admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold, respectively.
capacity | capacity | Total storage capacity | This metric gives the total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space).
capacity.available | capacity.available | Available storage capacity | This metric gives the available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space).
capacity.used | capacity.used | Used storage capacity | This metric gives the used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space).
rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives the hits to the block cache, which is reserved memory. It is allocated upon the start of a node process by the --cache flag and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload.
rocksdb.block.cache.misses | rocksdb.block.cache.misses | Count of block cache misses | This metric gives the misses to the block cache, which is reserved memory. It is allocated upon the start of a node process by the --cache flag and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload.
rocksdb.compactions | rocksdb.compactions | Number of table compactions | This metric reports the number of a node's LSM compactions. If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues.
storage.wal.fsync.latency | NOT AVAILABLE | The write ahead log fsync latency | If this value is greater than 100ms, it is an indication of a disk stall. To mitigate the effects of disk stalls, consider deploying your cluster with WAL failover configured.
storage.write-stalls | NOT AVAILABLE | Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a practical guideline, one stall per minute is not likely to have a material impact on the workload beyond an occasional increase in response time. However, one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing.
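Two of the guidelines in this table combine metrics: the 60% utilization rule compares capacity against capacity.available, and the block cache hits and misses are most useful as a hit ratio. A minimal sketch of both calculations, assuming raw per-store samples like those returned by the earlier scrape helper:

```python
# Sketch: the 60% capacity rule and the block cache hit ratio from the table above.
# Inputs are raw gauge/counter readings for a single store; the function name and
# return shape are illustrative assumptions.
def storage_checks(capacity: float, capacity_available: float,
                   cache_hits: float, cache_misses: float) -> dict:
    used_fraction = (capacity - capacity_available) / capacity
    hit_ratio = cache_hits / max(cache_hits + cache_misses, 1)
    return {
        "store_used_fraction": used_fraction,
        "store_over_60pct": used_fraction > 0.60,   # guidance from the Usage column
        "block_cache_hit_ratio": hit_ratio,          # low values: consider a larger --cache
    }
```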
Health

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
admission.wait_durations.kv | NOT AVAILABLE | Wait time durations for requests that waited | This metric shows whether the CPU utilization-based admission control feature is working effectively or is potentially overaggressive. It is a latency histogram of how much delay was added to the workload due to throttling. If you observe waits of over 100ms for more than 5 seconds while there was excess capacity available, then admission control is overly aggressive.
admission.wait_durations.kv-stores | NOT AVAILABLE | Wait time durations for requests that waited | This metric shows whether the I/O utilization-based admission control feature is working effectively or is potentially overaggressive. It is a latency histogram of how much delay was added to the workload due to throttling. If you observe waits of over 100ms for more than 5 seconds while there was excess capacity available, then admission control is overly aggressive.
KV Distributed

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
distsender.errors.notleaseholder | distsender.errors.notleaseholder | Number of NotLeaseHolderErrors encountered from replica-addressed RPCs | Errors of this type are normal during elastic cluster topology changes when leaseholders are actively rebalancing. They are automatically retried. However, they may create occasional response time spikes. In that case, this metric may provide the explanation of the cause.
distsender.rpc.sent.nextreplicaerror | distsender.rpc.sent.nextreplicaerror | Number of replica-addressed RPCs sent due to per-replica errors | RPC errors do not necessarily indicate a problem. This metric tracks remote procedure calls that return a status value other than "success". A non-success status of an RPC should not be misconstrued as a network transport issue. It is database code logic executed on another cluster node. The non-success status is a result of an orderly execution of an RPC that reports a specific logical condition.
KV Replication

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of lease transfers is not a negative or positive signal; rather, it is a reflection of elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors, which are normal and expected during rebalancing. Observing this metric may provide confirmation of the cause of such errors.
liveness.heartbeatlatency | liveness.heartbeatlatency | Node liveness heartbeat latency | If this metric exceeds 1 second, it is a sign of cluster instability.
liveness.livenodes | liveness.livenodes | Number of live nodes in the cluster (will be 0 if this node is not itself live) | This is a critical metric that tracks the live nodes in the cluster.
queue.replicate.replacedecommissioningreplica.error | NOT AVAILABLE | Number of failed decommissioning replica replacements processed by the replicate queue | Refer to Decommission the node.
range.merges | NOT AVAILABLE | Number of range merges | This metric indicates how fast a workload is scaling down. Merges are a CockroachDB performance optimization. This metric indicates that there have been deletes in the workload.
range.splits | range.splits | Number of range splits | This metric indicates how fast a workload is scaling up. Spikes can indicate resource hotspots since the split heuristic is based on QPS. To understand whether hotspots are an issue and with which tables and indexes they are occurring, correlate this metric with other metrics such as CPU usage (for example, sys.cpu.combined.percent-normalized), or use the Hot Ranges page.
ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size.
ranges.unavailable | ranges.unavailable | Number of ranges with fewer live replicas than needed for quorum | This metric is an indicator of replication issues. It shows whether the cluster is unhealthy and can impact workload. If an entire range is unavailable, then it will be unable to process queries.
ranges.underreplicated | ranges.underreplicated | Number of ranges with fewer live replicas than the replication target | This metric is an indicator of replication issues. It shows whether the cluster has data that is not conforming to resilience goals. The next step is to determine the corresponding database object, such as the table or index, of these under-replicated ranges and whether the under-replication is temporarily expected. Use the statement SELECT table_name, index_name FROM [SHOW RANGES WITH INDEXES] WHERE range_id = {id of under-replicated range};
rebalancing.cpunanospersecond | NOT AVAILABLE | Average CPU nanoseconds spent on processing replica operations in the last 30 minutes. | A high value of this metric could indicate that one of the store's replicas is part of a hot range.
rebalancing.lease.transfers | NOT AVAILABLE | Number of lease transfers motivated by store-level load imbalances | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this metric, expressed as a rate, is high, it indicates that more rebalancing activity is taking place due to load imbalance between stores.
rebalancing.queriespersecond | NOT AVAILABLE | Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries-per-second (QPS) dimension. It provides insights into the ongoing rebalancing activities.
rebalancing.range.rebalances | NOT AVAILABLE | Number of range rebalance operations motivated by store-level load imbalances | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this metric, expressed as a rate, is high, it indicates that more rebalancing activity is taking place due to load imbalance between stores.
rebalancing.replicas.cpunanospersecond | NOT AVAILABLE | Histogram of average CPU nanoseconds spent on processing replica operations in the last 30 minutes. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also the non-histogram variant: rebalancing.cpunanospersecond.
rebalancing.replicas.queriespersecond | NOT AVAILABLE | Histogram of average kv-level requests received per second by replicas on the store in the last 30 minutes. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also: rebalancing.replicas.cpunanospersecond.
replicas | replicas | Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes.
replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes.
SQL

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
sql.conn.failures | NOT AVAILABLE | Number of SQL connection failures | This metric is incremented whenever a connection attempt fails for any reason, including timeouts.
sql.conn.latency | sql.conn.latency | Latency to establish and authenticate a SQL connection | These metrics characterize the database connection latency, which can affect application performance, for example, by causing slow startup times. Connection failures are not recorded in these metrics.
sql.conns | sql.conns | Number of open SQL connections | This metric shows the number of connections as well as the distribution, or balancing, of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded. Review Connection Pooling.
sql.ddl.count | sql.ddl.count | Number of SQL DDL statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort on the Execution Count column. To find problematic sessions, on the Sessions page, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
sql.delete.count (metrics endpoint: sql.count{query_type: delete}) | sql.delete.count | Number of SQL DELETE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort on the Execution Count column. To find problematic sessions, on the Sessions page, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
sql.distsql.contended_queries.count | sql.distsql.contended.queries.count | Number of SQL queries that experienced contention | This metric is incremented whenever there is a non-trivial amount of contention experienced by a statement, whether from read-write or write-write conflicts. Monitor this metric to correlate possible workload performance issues to contention conflicts.
sql.failure.count | sql.failure.count | Number of statements resulting in a planning or runtime error | This metric is a high-level indicator of workload and application degradation with query failures. Use the Insights page to find failed executions with their error code to troubleshoot, or use application-level logs, if instrumented, to determine the cause of error.
sql.full.scan.count | sql.full.scan.count | Number of full table or index scans | This metric is a high-level indicator of potentially suboptimal query plans in the workload that may require index tuning and maintenance. To identify the statements with a full table scan, use SHOW FULL TABLE SCANS or the SQL Activity Statements page with the corresponding metric time frame. The Statements page also includes explain plans and index recommendations. Not all full scans are necessarily bad, especially over smaller tables.
sql.insert.count (metrics endpoint: sql.count{query_type: insert}) | sql.insert.count | Number of SQL INSERT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort on the Execution Count column. To find problematic sessions, on the Sessions page, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
sql.mem.root.current | NOT AVAILABLE | Current sql statement memory usage for root | This metric shows how memory set aside for temporary materializations, such as hash tables and intermediary result sets, is utilized. Use this metric to optimize memory allocations based on long-term observations. The maximum amount is set with --max_sql_memory. If the utilization of SQL memory is persistently low, perhaps some portion of this memory allocation can be shifted to --cache.
sql.new_conns | sql.new_conns.count | Number of SQL connections created | The rate of this metric shows how frequently new connections are being established. This can be useful in determining whether a high rate of incoming new connections is causing additional load on the server due to a misconfigured application.
sql.select.count (metrics endpoint: sql.count{query_type: select}) | sql.select.count | Number of SQL SELECT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort on the Execution Count column. To find problematic sessions, on the Sessions page, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
sql.service.latency | sql.service.latency | Latency of SQL request execution | These high-level metrics reflect workload performance. Monitor these metrics to understand latency over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. The Statements page has P90 Latency and P99 Latency columns to enable correlation with this metric.
sql.statements.active | sql.statements.active | Number of currently active user SQL statements | This high-level metric reflects workload volume.
sql.txn.abort.count | sql.txn.abort.count | Number of SQL transaction abort errors | This high-level metric reflects workload performance. A persistently high number of SQL transaction abort errors may negatively impact the workload performance and needs to be investigated.
sql.txn.begin.count | sql.txn.begin.count | Number of SQL transaction BEGIN statements successfully executed | This metric reflects workload volume by counting explicit transactions. Use this metric to determine whether explicit transactions can be refactored as implicit transactions (individual statements).
sql.txn.commit.count | sql.txn.commit.count | Number of SQL transaction COMMIT statements successfully executed | This metric shows the number of transactions that completed successfully. This metric can be used as a proxy to measure the number of successful explicit transactions.
sql.txn.latency | sql.txn.latency | Latency of SQL transactions | These high-level metrics provide a latency histogram of all executed SQL transactions. These metrics provide an overview of the current SQL workload.
sql.txn.rollback.count | sql.txn.rollback.count | Number of SQL transaction ROLLBACK statements successfully executed | This metric shows the number of orderly transaction rollbacks. A persistently high number of rollbacks may negatively impact the workload performance and needs to be investigated.
sql.txns.open | sql.txns.open | Number of currently open user SQL transactions | This metric should roughly correspond to the number of cores * 4. If this metric is consistently larger, scale out the cluster.
sql.update.count (metrics endpoint: sql.count{query_type: update}) | sql.update.count | Number of SQL UPDATE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort on the Execution Count column. To find problematic sessions, on the Sessions page, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
txn.restarts.serializable | txn.restarts.serializable | Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated.
txn.restarts.txnaborted | NOT AVAILABLE | Number of restarts due to an abort by a concurrent transaction (usually due to deadlock) | The errors tracked by this metric are generally due to deadlocks. Deadlocks can often be prevented with a considered transaction design. Identify the conflicting transactions involved in the deadlocks, then, if possible, redesign the business logic implementation prone to deadlocks.
txn.restarts.txnpush | NOT AVAILABLE | Number of restarts due to a transaction push failure | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated.
txn.restarts.unknown | NOT AVAILABLE | Number of restarts due to unknown reasons | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated.
txn.restarts.writetooold | txn.restarts.writetooold | Number of restarts due to a concurrent writer committing first | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated.
txnwaitqueue.deadlocks_total | NOT AVAILABLE | Number of deadlocks detected by the txn wait queue | Alert on this metric if its value is greater than zero, especially if transaction throughput is lower than expected. Applications should be able to detect and recover from deadlock errors. However, transaction performance and throughput can be maximized if the application logic avoids deadlock conditions in the first place, for example, by keeping transactions as short as possible.
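Several cells in this table translate directly into simple arithmetic for dashboard alerts: sql.txns.open should stay roughly below 4 times the number of cores, the txn.restarts.* counters are most meaningful as a per-minute rate ("tens of restarts per minute" is high), and any growth in txnwaitqueue.deadlocks_total deserves attention. The sketch below is one way to wire those rules up; the sample dictionaries, their keys, and the vcpus_per_node input are assumptions about how you collect data, not CockroachDB APIs.

```python
# Sketch: alert conditions suggested by the Usage column above. txn.restarts.* are
# cumulative counters, so the per-minute rate is computed from two scrapes; the
# vCPU count per node is an input you supply for your hardware.
def sql_checks(prev: dict, curr: dict, interval_s: float, vcpus_per_node: int) -> list[str]:
    alerts = []

    if curr["sql_txns_open"] > 4 * vcpus_per_node:
        alerts.append("open transactions exceed ~4x cores - consider scaling out")

    restart_counters = ("txn_restarts_serializable", "txn_restarts_txnpush",
                        "txn_restarts_writetooold", "txn_restarts_unknown")
    restarts_per_min = sum(curr[c] - prev[c] for c in restart_counters) * 60 / interval_s
    if restarts_per_min >= 10:  # "tens of restarts per minute" is a high value
        alerts.append(f"{restarts_per_min:.0f} txn restarts/min - investigate contention")

    if curr["txnwaitqueue_deadlocks_total"] > prev["txnwaitqueue_deadlocks_total"]:
        alerts.append("new deadlocks detected by the txn wait queue")
    return alerts
```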
Table Statistics

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
jobs.auto_create_stats.currently_paused (metrics endpoint: jobs{name: auto_create_stats, status: currently_paused}) | jobs.auto_create_stats.currently_paused | Number of auto_create_stats jobs currently considered Paused | This metric is a high-level indicator that automatically generated statistics jobs are paused, which can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance.
jobs.auto_create_stats.currently_running (metrics endpoint: jobs{type: auto_create_stats, status: currently_running}) | jobs.auto_create_stats.currently_running | Number of auto_create_stats jobs currently running in Resume or OnFailOrCancel state | This metric tracks the number of active automatically generated statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.
jobs.auto_create_stats.resume_failed (metrics endpoint: jobs.resume{name: auto_create_stats, status: failed}) | jobs.auto_create_stats.resume_failed | Number of auto_create_stats jobs which failed with a non-retriable error | This metric is a high-level indicator that automatic creation of table statistics is failing. Failed statistic creation can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance.
jobs.create_stats.currently_running (metrics endpoint: jobs{type: create_stats, status: currently_running}) | jobs.create_stats.currently_running | Number of create_stats jobs currently running in Resume or OnFailOrCancel state | This metric tracks the number of active create statistics jobs that may be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.
Disaster Recovery

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
jobs.backup.currently_paused (metrics endpoint: jobs{name: backup, status: currently_paused}) | jobs.backup.currently_paused | Number of backup jobs currently considered Paused | Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a backup job in a paused state for an extended period of time. A paused job can hold resources, have concurrency impact, or have some other negative consequence. A paused backup may break the recovery point objective (RPO).
jobs.backup.currently_running (metrics endpoint: jobs{type: backup, status: currently_running}) | jobs.backup.currently_running | Number of backup jobs currently running in Resume or OnFailOrCancel state | See Description.
schedules.BACKUP.failed (metrics endpoint: schedules{name: BACKUP, status: failed}) | schedules.BACKUP.failed | Number of BACKUP jobs failed | Monitor this metric and investigate backup job failures.
schedules.BACKUP.last-completed-time | schedules.BACKUP.last-completed-time | The unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric | Monitor this metric to ensure that backups are meeting the recovery point objective (RPO). Each node exports the time that it last completed a backup on behalf of the schedule. If a node is restarted, it will report 0 until it completes a backup. If all nodes are restarted, max() is 0 until a node completes a backup. To make use of this metric, first, from each node, take the maximum over a rolling window equal to or greater than the backup frequency, and then take the maximum of those values across nodes. For example, with a backup frequency of 60 minutes, monitor time() - max_across_nodes(max_over_time(schedules_BACKUP_last_completed_time, 60min)).
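The last row above describes a two-step aggregation for RPO monitoring: per node, take the maximum of schedules.BACKUP.last-completed-time over a rolling window at least as long as the backup frequency (so the 0 reported after a restart is ignored), then take the maximum across nodes and compare its age against the RPO target. A minimal sketch of that calculation, assuming you have already collected the per-node samples for the window:

```python
# Sketch of the RPO check for schedules.BACKUP.last-completed-time described above.
# per_node_windows is an assumed data shape: node id -> list of last-completed-time
# samples (unix timestamps) gathered over the rolling window.
import time

def backup_rpo_lag_seconds(per_node_windows: dict[str, list[float]]) -> float:
    """Seconds since the most recent completed scheduled backup across all nodes."""
    last_completed = max(max(samples) for samples in per_node_windows.values())
    return time.time() - last_completed

# Example: with a 60-minute backup frequency, alert when the lag exceeds the RPO.
# lag = backup_rpo_lag_seconds(windows); alert if lag > 60 * 60
```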
Changefeeds

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
changefeed.commit_latency | changefeed.commit.latency | Event commit latency: the difference between the event MVCC timestamp and the time it was acknowledged by the downstream sink. If the sink batches events, then the difference between the oldest event in the batch and the acknowledgement is recorded. Excludes latency during backfill. | This metric provides useful context when assessing the state of changefeeds. It characterizes the end-to-end lag between a committed change and that change being applied at the destination.
changefeed.emitted_bytes | NOT AVAILABLE | Bytes emitted by all feeds | This metric provides useful context when assessing the state of changefeeds. It characterizes the throughput in bytes being streamed from the CockroachDB cluster.
changefeed.emitted_messages | changefeed.emitted.messages | Messages emitted by all feeds | This metric provides useful context when assessing the state of changefeeds. It characterizes the rate of changes being streamed from the CockroachDB cluster.
changefeed.error_retries | changefeed.error.retries | Total retryable errors encountered by all changefeeds | This metric tracks transient changefeed errors. Alert on "too many" errors, such as 50 retries in 15 minutes. For example, during a rolling upgrade this counter will increase because the changefeed jobs will restart following node restarts. There is an exponential backoff, up to 10 minutes. But if there is no rolling upgrade in process or other cluster maintenance, and the error rate is high, investigate the changefeed job.
changefeed.failures | changefeed.failures | Total number of changefeed jobs which have failed | This metric tracks the permanent changefeed job failures that the jobs system will not try to restart. Any increase in this counter should be investigated. An alert on this metric is recommended.
changefeed.running | changefeed.running | Number of currently running changefeeds, including sinkless | This metric tracks the total number of all running changefeeds.
jobs.changefeed.currently_paused (metrics endpoint: jobs{name: changefeed, status: currently_paused}) | NOT AVAILABLE | Number of changefeed jobs currently considered Paused | Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a changefeed job in a paused state for an extended period of time. Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection.
jobs.changefeed.protected_age_sec (metrics endpoint: jobs.protected_age_sec{type: changefeed}) | NOT AVAILABLE | The age of the oldest PTS record protected by changefeed jobs | Changefeeds use protected timestamps to protect the data from being garbage collected. Ensure the protected timestamp age does not significantly exceed the GC TTL zone configuration. Alert on this metric if the protected timestamp age is greater than 3 times the GC TTL.
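Two alert rules in this table lend themselves to a short sketch: the "50 retries in 15 minutes" guidance for changefeed.error_retries and the "3 times the GC TTL" guidance for the protected timestamp age. The sample dictionaries and their keys are assumptions about how you store scraped values, and gc_ttl_seconds is the GC TTL zone configuration value for the tables the changefeed watches.

```python
# Sketch: the changefeed alert rules called out above. error_retries is a cumulative
# counter, so the 15-minute delta comes from two scrapes taken 15 minutes apart.
def changefeed_alerts(prev: dict, curr: dict, gc_ttl_seconds: float) -> list[str]:
    alerts = []

    retries_15m = curr["changefeed_error_retries"] - prev["changefeed_error_retries"]
    if retries_15m >= 50:   # "too many" transient errors outside planned maintenance
        alerts.append(f"{retries_15m:.0f} changefeed retries in 15 min")

    if curr["changefeed_failures"] > prev["changefeed_failures"]:
        alerts.append("changefeed job failed permanently - investigate")

    # Assumed sample key mirroring jobs.changefeed.protected_age_sec.
    if curr["jobs_changefeed_protected_age_sec"] > 3 * gc_ttl_seconds:
        alerts.append("protected timestamp age exceeds 3x GC TTL - paused or lagging changefeed")
    return alerts
```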
Row-level TTL

CockroachDB Metric Name | Datadog Integration Metric Name (add crdb_dedicated. prefix) | Description | Usage
---|---|---|---
jobs.row_level_ttl.currently_paused (metrics endpoint: jobs{name: row_level_ttl, status: currently_paused}) | NOT AVAILABLE | Number of row_level_ttl jobs currently considered Paused | Monitor this metric to ensure the Row Level TTL job does not remain paused inadvertently for an extended period.
jobs.row_level_ttl.currently_running (metrics endpoint: jobs{type: row_level_ttl, status: currently_running}) | NOT AVAILABLE | Number of row_level_ttl jobs currently running in Resume or OnFailOrCancel state | Monitor this metric to ensure there are not too many Row Level TTL jobs running at the same time. Generally, this metric should be in the low single digits.
jobs.row_level_ttl.delete_duration | NOT AVAILABLE | Duration for delete requests during row level TTL. | See Description.
jobs.row_level_ttl.num_active_spans | NOT AVAILABLE | Number of active spans the TTL job is deleting from. | See Description.
jobs.row_level_ttl.resume_completed (metrics endpoint: jobs.resume{name: row_level_ttl, status: completed}) | NOT AVAILABLE | Number of row_level_ttl jobs which successfully resumed to completion | If Row Level TTL is enabled, this metric should be nonzero and correspond to the ttl_cron setting that was chosen. If this metric is zero, it means the job is not running.
jobs.row_level_ttl.resume_failed (metrics endpoint: jobs.resume{name: row_level_ttl, status: failed}) | NOT AVAILABLE | Number of row_level_ttl jobs which failed with a non-retriable error | This metric should remain at zero. Repeated errors mean the Row Level TTL job is not deleting data.
jobs.row_level_ttl.rows_deleted | NOT AVAILABLE | Number of rows deleted by the row level TTL job. | Correlate this metric with the metric jobs.row_level_ttl.rows_selected to ensure all the rows that should be deleted are actually getting deleted.
jobs.row_level_ttl.rows_selected | NOT AVAILABLE | Number of rows selected for deletion by the row level TTL job. | Correlate this metric with the metric jobs.row_level_ttl.rows_deleted to ensure all the rows that should be deleted are actually getting deleted.
jobs.row_level_ttl.select_duration | NOT AVAILABLE | Duration for select requests during row level TTL. | See Description.
jobs.row_level_ttl.span_total_duration | NOT AVAILABLE | Duration for processing a span during row level TTL. | See Description.
jobs.row_level_ttl.total_expired_rows | NOT AVAILABLE | Approximate number of rows that have expired the TTL on the TTL table. | See Description.
jobs.row_level_ttl.total_rows | NOT AVAILABLE | Approximate number of rows on the TTL table. | See Description.
schedules.scheduled-row-level-ttl-executor.failed (metrics endpoint: schedules{name: scheduled-row-level-ttl-executor, status: failed}) | NOT AVAILABLE | Number of scheduled-row-level-ttl-executor jobs failed | Monitor this metric to ensure the Row Level TTL job is running. If it is non-zero, it means the job could not be created.
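The rows_selected and rows_deleted counters above are meant to be read together. A minimal sketch of that correlation over a scrape interval, under the same sample-dictionary assumption as the earlier sketches:

```python
# Sketch: correlate the two row-level TTL counters as the table suggests. A persistent
# gap between rows selected and rows deleted over an interval means expired rows are
# being found but not removed.
def ttl_drift(prev: dict, curr: dict) -> dict:
    selected = curr["jobs_row_level_ttl_rows_selected"] - prev["jobs_row_level_ttl_rows_selected"]
    deleted = curr["jobs_row_level_ttl_rows_deleted"] - prev["jobs_row_level_ttl_rows_deleted"]
    return {
        "rows_selected": selected,
        "rows_deleted": deleted,
        "backlog_growing": selected > deleted,  # if persistent, check delete_duration and job errors
    }
```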