Observability
RIOT-X exposes several metrics over a Prometheus endpoint that can be useful for troubleshooting and performance tuning.
Getting Started
The riotx-dist repository includes a Docker Compose configuration that sets up Prometheus and Grafana.
git clone https://github.com/redis/riotx-dist.git
cd riotx-dist
docker compose up
Prometheus is configured to scrape the host every second.
You can access the Grafana dashboard at localhost:3000.
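That one-second scrape is driven by a Prometheus scrape configuration along the following lines. This is only a sketch: the job name and target are illustrative, and the actual file ships with the riotx-dist repository.
scrape_configs:
  - job_name: riotx
    scrape_interval: 1s                          # scrape the host every second
    static_configs:
      - targets: ["host.docker.internal:8080"]   # assumes the default RIOT-X metrics port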
Now start RIOT-X with the following command:
riotx replicate ... --metrics
This enables the Prometheus metrics exporter endpoint and populates the Grafana dashboard.
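To verify that metrics are being exposed, you can query the endpoint directly. This assumes the default metrics port 8080 and the standard /metrics path served by the Prometheus Java client:
curl -s localhost:8080/metrics | head -n 20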
Configuration
Use the --metrics* options to enable and configure metrics:
--metrics
Enable metrics
--metrics-jvm
Enable JVM and system metrics
--metrics-redis
Enable command latency metrics. See https://github.com/redis/lettuce/wiki/Command-Latency-Metrics#micrometer
--metrics-name=<name>
Application name tag that will be applied to all metrics
--metrics-port=<int>
Port that the Prometheus HTTP server should listen on (default: 8080)
--metrics-prop=<k=v>
Additional properties to pass to the Prometheus client. See https://prometheus.github.io/client_java/config/config/
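For example, the following illustrative invocation enables all metric groups, tags the metrics with an application name, and moves the exporter off the default port (the Redis URIs are placeholders):
riotx replicate redis://source:6379 redis://target:6379 \
  --metrics \
  --metrics-jvm \
  --metrics-redis \
  --metrics-name=riotx-replication \
  --metrics-port=9090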
Metrics
Below is a list of all metrics declared by RIOT-X.
Replication Metrics
| Name | Type | Description |
|---|---|---|
| | Counter | Number of bytes replicated (needs memory usage with …) |
| | Summary | Replication end-to-end latency |
| | Summary | Replication read latency |
| | Timer | Batch writing duration |
| | Timer | Item processing duration |
| | Timer | Item reading duration |
| | Timer | Active jobs |
| | Counter | Job launch count |
| | Gauge | Remaining capacity of the queue |
| | Gauge | Size (depth) of the queue |
| | Counter | Number of keys scanned |
| | Timer | Operation execution duration |
| | Gauge | Chunk size of the reader |
| | Gauge | Remaining capacity of the queue |
| | Gauge | Size (depth) of the queue |
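These metrics can be queried from Prometheus or Grafana with PromQL. The metric names below are hypothetical placeholders; check the /metrics endpoint for the exact names exposed by your RIOT-X version:
# Replication throughput in bytes per second (hypothetical metric name)
rate(riotx_replication_bytes_total[1m])

# Average end-to-end latency over the last minute, derived from the
# summary's _sum and _count series (hypothetical metric name)
rate(riotx_replication_latency_seconds_sum[1m]) / rate(riotx_replication_latency_seconds_count[1m])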
JVM Metrics
Use the --metrics-jvm option to enable the following additional metrics, which are the standard Micrometer JVM and system metrics:
| Name | Type | Description |
|---|---|---|
| jvm.buffer.count | Gauge | An estimate of the number of buffers in the pool |
| jvm.buffer.memory.used | Gauge | An estimate of the memory that the Java virtual machine is using for this buffer pool |
| jvm.buffer.total.capacity | Gauge | An estimate of the total capacity of the buffers in this pool |
| jvm.gc.concurrent.phase.time | Timer | Time spent in concurrent phase |
| jvm.gc.live.data.size | Gauge | Size of long-lived heap memory pool after reclamation |
| jvm.gc.max.data.size | Gauge | Max size of long-lived heap memory pool |
| jvm.gc.memory.allocated | Gauge | Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next |
| jvm.gc.memory.promoted | Counter | Count of positive increases in the size of the old generation memory pool before GC to after GC |
| jvm.gc.pause | Timer | Time spent in GC pause |
| jvm.memory.committed | Gauge | The amount of memory in bytes that is committed for the Java virtual machine to use |
| jvm.memory.max | Gauge | The maximum amount of memory in bytes that can be used for memory management |
| jvm.memory.used | Gauge | The amount of used memory |
| jvm.threads.daemon | Gauge | The current number of live daemon threads |
| jvm.threads.live | Gauge | The current number of live threads including both daemon and non-daemon threads |
| jvm.threads.peak | Gauge | The peak live thread count since the Java virtual machine started or peak was reset |
| jvm.threads.started | Counter | The total number of application threads started in the JVM |
| jvm.threads.states | Gauge | The current number of threads, per thread state |
| process.cpu.time | Counter | The "cpu time" used by the Java Virtual Machine process |
| process.cpu.usage | Gauge | The "recent cpu usage" for the Java Virtual Machine process |
| process.start.time | Gauge | Start time of the process since unix epoch |
| process.uptime | Gauge | The uptime of the Java virtual machine |
| system.cpu.count | Gauge | The number of processors available to the Java virtual machine |
| system.cpu.usage | Gauge | The "recent cpu usage" of the system the application is running in |
| system.load.average.1m | Gauge | The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time |
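Note that Micrometer's Prometheus registry replaces dots with underscores and appends the base unit, so jvm.memory.used is exposed as jvm_memory_used_bytes. A couple of PromQL sketches built on that convention:
# Current heap usage across all heap memory pools
sum(jvm_memory_used_bytes{area="heap"})

# Fraction of the maximum heap currently in use
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})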
Telegraf
RIOT-X logs can be collected and analyzed using Telegraf, an open-source server agent for collecting and sending metrics and logs.
Log Format
RIOT-X uses SLF4J Simple Logger with a configurable format.
Default format:
yyyy-MM-dd HH:mm:ss.SSS [LEVEL] logger.name - message
Example output:
2024-12-02 10:15:30.123 [INFO] com.redis.riot.FileImportCommand - Starting file import
2024-12-02 10:15:31.456 [WARN] com.redis.riot.ProgressStepExecutionListener - Slow processing detected
2024-12-02 10:15:32.789 [ERROR] com.redis.riot.ReplicateCommand - Connection failed
When --log-thread is enabled, the format includes thread information:
2024-12-02 10:15:30.123 [INFO] [main] com.redis.riot.FileImportCommand - Starting file import
Logging Options
Configure logging behavior with these options:
--log-file <file>
Write logs to a file
--log-level <level>
Set log level: ERROR, WARN, INFO, DEBUG, or TRACE (default: WARN)
--log-time-fmt <format>
Date/time format (default: yyyy-MM-dd HH:mm:ss.SSS)
--no-log-time
Hide timestamp in log messages
--log-thread
Show thread name in log messages
--log-name
Show logger instance name
-d, --debug
Enable debug logging
-i, --info
Enable info logging
-q, --quiet
Show errors only
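For example, an illustrative replication run with INFO-level, thread-annotated logs written to a file (the Redis URIs are placeholders):
riotx --log-file /var/log/riotx/riotx.log --log-level INFO --log-thread replicate redis://source:6379 redis://target:6379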
Telegraf Setup
1. Configure RIOT-X to write logs to a file:
riotx --log-file /var/log/riotx/riotx.log file-import data.csv
2. Download the complete Telegraf configuration: ../_attachments/telegraf-riotx.conf[telegraf-riotx.conf]
3. Alternatively, create a Telegraf configuration file (/etc/telegraf/telegraf.d/riotx.conf) with the following minimal setup:
[[inputs.tail]]
  files = ["/var/log/riotx/*.log"]
  from_beginning = false
  data_format = "grok"
  grok_patterns = [
    '%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] \[%{DATA:thread}\] %{DATA:logger} - %{GREEDYDATA:message}',
    '%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] %{DATA:logger} - %{GREEDYDATA:message}',
  ]
  name_override = "riotx_logs"
  grok_timezone = "Local"

[[processors.date]]
  field = "timestamp"
  field_key = "timestamp"
  date_format = ["2006-01-02 15:04:05.000"]

[[processors.enum]]
  [[processors.enum.mapping]]
    field = "level"
    dest = "severity"
    [processors.enum.mapping.value_mappings]
      TRACE = 1
      DEBUG = 2
      INFO = 3
      WARN = 4
      ERROR = 5

[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "myorg"
  bucket = "riotx_logs"
4. Start Telegraf:
sudo systemctl restart telegraf
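To check that the configuration parses and that existing log lines match the grok patterns, Telegraf can also be run once in test mode, which prints the parsed metrics to stdout instead of sending them to outputs:
telegraf --config /etc/telegraf/telegraf.d/riotx.conf --test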
Parsed Fields
The Telegraf configuration extracts these fields:
| Field | Type | Description |
|---|---|---|
| timestamp | timestamp | Log entry timestamp |
| level | string | Log level (TRACE, DEBUG, INFO, WARN, ERROR) |
| severity | integer | Numeric severity (1-5) |
| logger | string | Full logger name |
| thread | string | Thread name (if enabled) |
| message | string | Log message content |
Docker Deployment
For containerized deployments, use the Docker log input:
[[inputs.docker_log]]
endpoint = "unix:///var/run/docker.sock"
container_name_include = ["riotx*"]
data_format = "grok"
grok_patterns = [
'%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] %{DATA:logger} - %{GREEDYDATA:message}',
]
name_override = "riotx_logs"
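For the container_name_include filter above to match, the RIOT-X container name must start with riotx. An illustrative run (the image name is a placeholder, as are the Redis URIs):
docker run --name riotx-replicate <riotx-image> replicate redis://source:6379 redis://target:6379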
Querying Logs
Example InfluxDB Flux queries:
Get ERROR level logs:
from(bucket: "riotx_logs")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "riotx_logs")
|> filter(fn: (r) => r.level == "ERROR")
Count logs by level:
from(bucket: "riotx_logs")
|> range(start: -24h)
|> filter(fn: (r) => r._measurement == "riotx_logs")
|> group(columns: ["level"])
|> count()
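These queries can also be run from the command line with the influx CLI, using the organization configured in the Telegraf output above:
influx query --org myorg 'from(bucket: "riotx_logs") |> range(start: -1h) |> filter(fn: (r) => r._measurement == "riotx_logs") |> filter(fn: (r) => r.level == "ERROR")'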
For a complete setup guide including Docker, Kubernetes, Elasticsearch, and Prometheus configurations, see ../_attachments/TELEGRAF_SETUP.adoc[Telegraf Setup Guide].