Observability

RIOT-X exposes a range of metrics through a Prometheus endpoint that can be useful for troubleshooting and performance tuning.

Getting Started

The riotx-dist repository includes a Docker Compose configuration that sets up Prometheus and Grafana.

git clone https://github.com/redis/riotx-dist.git
cd riotx-dist
docker compose up

Prometheus is configured to scrape the RIOT-X metrics endpoint on the host every second.

You can access the Grafana dashboard at localhost:3000.

Now start RIOT-X with the following command:

riotx replicate ... --metrics

This enables the Prometheus metrics exporter endpoint and populates the Grafana dashboard.
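
To check that the endpoint is up, you can query it directly. This is a minimal sketch assuming the default port 8080 and the conventional /metrics path used by Prometheus exporters:

# Fetch the first few lines of the Prometheus exposition output
curl -s http://localhost:8080/metrics | head -n 20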

Configuration

Use the --metrics* options to enable and configure metrics (a combined example follows this list):

--metrics

Enable metrics

--metrics-jvm

Enable JVM and system metrics

--metrics-redis

Enable command latency metrics. See https://github.com/redis/lettuce/wiki/Command-Latency-Metrics#micrometer

--metrics-name=<name>

Application name tag that will be applied to all metrics

--metrics-port=<int>

Port that Prometheus HTTP server should listen on (default: 8080)

--metrics-prop=<k=v>

Additional properties to pass to the Prometheus client. See https://prometheus.github.io/client_java/config/config/
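
For example, the following sketch combines several of these flags to expose application and JVM metrics on port 9090 under a custom application name (the port and name are illustrative, and the elided arguments are whatever your replicate job normally uses):

riotx replicate ... --metrics --metrics-jvm --metrics-port 9090 --metrics-name riotx-prod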

Metrics

Below is a list of all metrics declared by RIOT-X.

[Grafana dashboard: replication]

Replication Metrics

Name                                          Type     Description
riotx_replication_bytes_total                 Counter  Number of bytes replicated (requires memory usage, enabled with --mem-limit)
riotx_replication_lag_seconds                 Summary  Replication end-to-end latency
riotx_replication_read_latency_seconds        Summary  Replication read latency
spring_batch_chunk_write_seconds              Timer    Batch writing duration
spring_batch_item_process_seconds             Timer    Item processing duration
spring_batch_item_read_seconds                Timer    Item reading duration
spring_batch_job_active_seconds               Timer    Active jobs
spring_batch_job_launch_count_total           Counter  Job launch count
spring_batch_redis_key_event_queue_capacity   Gauge    Remaining capacity of the key event queue
spring_batch_redis_key_event_queue_size       Gauge    Size (depth) of the key event queue
spring_batch_redis_key_scan_total             Counter  Number of keys scanned
spring_batch_redis_operation_seconds          Timer    Operation execution duration
spring_batch_redis_read_chunk                 Gauge    Chunk size of the reader
spring_batch_redis_read_queue_capacity        Gauge    Remaining capacity of the read queue
spring_batch_redis_read_queue_size            Gauge    Size (depth) of the read queue
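
Once Prometheus scrapes these metrics, you can query them over its HTTP API. A sketch, assuming the Prometheus from the compose setup listens on its default port 9090 (the _sum and _count series are the standard components of a Prometheus summary):

# Average end-to-end replication latency over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(riotx_replication_lag_seconds_sum[5m]) / rate(riotx_replication_lag_seconds_count[5m])'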

JVM Metrics

Use the --metrics-jvm option to enable the following additional metrics:

[Grafana dashboard: jvm]
Name                                  Type     Description
jvm_buffer_count_buffers              Gauge    An estimate of the number of buffers in the pool
jvm_buffer_memory_used_bytes          Gauge    An estimate of the memory that the Java virtual machine is using for this buffer pool
jvm_buffer_total_capacity_bytes       Gauge    An estimate of the total capacity of the buffers in this pool
jvm_gc_concurrent_phase_time_seconds  Timer    Time spent in concurrent phase
jvm_gc_live_data_size_bytes           Gauge    Size of long-lived heap memory pool after reclamation
jvm_gc_max_data_size_bytes            Gauge    Max size of long-lived heap memory pool
jvm_gc_memory_allocated_bytes_total   Counter  Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
jvm_gc_memory_promoted_bytes_total    Counter  Count of positive increases in the size of the old generation memory pool before GC to after GC
jvm_gc_pause_seconds                  Timer    Time spent in GC pause
jvm_memory_committed_bytes            Gauge    The amount of memory in bytes committed for the Java virtual machine to use
jvm_memory_max_bytes                  Gauge    The maximum amount of memory in bytes that can be used for memory management
jvm_memory_used_bytes                 Gauge    The amount of used memory
jvm_threads_daemon_threads            Gauge    The current number of live daemon threads
jvm_threads_live_threads              Gauge    The current number of live threads, both daemon and non-daemon
jvm_threads_peak_threads              Gauge    The peak live thread count since the Java virtual machine started or the peak was reset
jvm_threads_started_threads_total     Counter  The total number of application threads started in the JVM
jvm_threads_states_threads            Gauge    The current number of threads, by state
process_cpu_time_ns_total             Counter  The "cpu time" used by the Java virtual machine process
process_cpu_usage                     Gauge    The "recent cpu usage" for the Java virtual machine process
process_start_time_seconds            Gauge    Start time of the process since the Unix epoch
process_uptime_seconds                Gauge    The uptime of the Java virtual machine
system_cpu_count                      Gauge    The number of processors available to the Java virtual machine
system_cpu_usage                      Gauge    The "recent cpu usage" of the system the application is running in
system_load_average_1m                Gauge    The number of runnable entities queued to and running on available processors, averaged over a period of time
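
You can spot-check any of these without Grafana by filtering the scrape output directly (again assuming the default port 8080 and the /metrics path):

# Show current JVM memory usage samples, one per memory area and pool
curl -s http://localhost:8080/metrics | grep '^jvm_memory_used_bytes'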

Telegraf

RIOT-X logs can be collected and analyzed using Telegraf, an open-source server agent for collecting and sending metrics and logs.

Log Format

RIOT-X uses SLF4J Simple Logger with a configurable format.

Default format:

yyyy-MM-dd HH:mm:ss.SSS [LEVEL] logger.name - message

Example output:

2024-12-02 10:15:30.123 [INFO] com.redis.riot.FileImportCommand - Starting file import
2024-12-02 10:15:31.456 [WARN] com.redis.riot.ProgressStepExecutionListener - Slow processing detected
2024-12-02 10:15:32.789 [ERROR] com.redis.riot.ReplicateCommand - Connection failed

When --log-thread is enabled, the format includes thread information:

2024-12-02 10:15:30.123 [INFO] [main] com.redis.riot.FileImportCommand - Starting file import

Logging Options

Configure logging behavior with these options (a combined example follows this list):

--log-file <file>

Write logs to a file

--log-level <level>

Set log level: ERROR, WARN, INFO, DEBUG, or TRACE (default: WARN)

--log-time-fmt <format>

Date/time format (default: yyyy-MM-dd HH:mm:ss.SSS)

--no-log-time

Hide timestamp in log messages

--log-thread

Show thread name in log messages

--log-name

Show logger instance name

-d, --debug

Enable debug logging

-i, --info

Enable info logging

-q, --quiet

Show errors only
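
Putting these together, a typical invocation for log collection might look like the following sketch (the file path and import arguments are illustrative):

riotx --log-file /var/log/riotx/riotx.log --log-level INFO --log-thread file-import data.csv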

Telegraf Setup

  1. Configure RIOT-X to write logs to a file:

    riotx --log-file /var/log/riotx/riotx.log file-import data.csv

  2. Download the complete Telegraf configuration: telegraf-riotx.conf (../_attachments/telegraf-riotx.conf)

  3. Create a Telegraf configuration file (/etc/telegraf/telegraf.d/riotx.conf) with the following minimal setup:

    [[inputs.tail]]
      files = ["/var/log/riotx/*.log"]
      from_beginning = false
      data_format = "grok"
    
      grok_patterns = [
        '%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] \[%{DATA:thread}\] %{DATA:logger} - %{GREEDYDATA:message}',
        '%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] %{DATA:logger} - %{GREEDYDATA:message}',
      ]
    
      name_override = "riotx_logs"
      grok_timezone = "Local"
    
    [[processors.date]]
      field_key = "timestamp"
      date_format = "2006-01-02 15:04:05.000"
    
    [[processors.enum]]
      [[processors.enum.mapping]]
        field = "level"
        dest = "severity"
        [processors.enum.mapping.value_mappings]
          TRACE = 1
          DEBUG = 2
          INFO = 3
          WARN = 4
          ERROR = 5
    
    [[outputs.influxdb_v2]]
      urls = ["http://localhost:8086"]
      token = "$INFLUX_TOKEN"
      organization = "myorg"
      bucket = "riotx_logs"

  4. Start Telegraf:

    sudo systemctl restart telegraf
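
Before relying on the output stage, you can check that the grok patterns actually match your log lines by running Telegraf once in test mode, which parses the inputs and prints the resulting metrics to stdout instead of writing to InfluxDB:

# Parse once and print metrics to stdout; --test-wait gives the
# tail input a few seconds to pick up lines before exiting
telegraf --config /etc/telegraf/telegraf.d/riotx.conf --test --test-wait 10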

Parsed Fields

The Telegraf configuration extracts these fields:

Field      Type       Description
timestamp  timestamp  Log entry timestamp
level      string     Log level (TRACE, DEBUG, INFO, WARN, ERROR)
severity   integer    Numeric severity (1-5)
logger     string     Full logger name
thread     string     Thread name (if enabled)
message    string     Log message content

Docker Deployment

For containerized deployments, use the Docker log input:

[[inputs.docker_log]]
  endpoint = "unix:///var/run/docker.sock"
  container_name_include = ["riotx*"]
  data_format = "grok"
  grok_patterns = [
    '%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:level}\] %{DATA:logger} - %{GREEDYDATA:message}',
  ]
  name_override = "riotx_logs"
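
This input only picks up containers whose names match the include pattern, so name your RIOT-X container accordingly. A sketch follows; the image name is a placeholder, so substitute whichever image you actually deploy:

# Container name must match the "riotx*" pattern above
docker run --name riotx-replication <your-riotx-image> replicate ... --metrics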

Querying Logs

Example InfluxDB Flux queries:

Get ERROR level logs:

from(bucket: "riotx_logs")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "riotx_logs")
  |> filter(fn: (r) => r.level == "ERROR")

Count logs by level:

from(bucket: "riotx_logs")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "riotx_logs")
  |> group(columns: ["level"])
  |> count()
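
The same queries can be run from a shell with the InfluxDB v2 CLI, assuming the organization configured for the Telegraf output and a token supplied via the INFLUX_TOKEN environment variable:

# Count log entries by level over the last 24 hours
influx query --org myorg 'from(bucket: "riotx_logs")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "riotx_logs")
  |> group(columns: ["level"])
  |> count()'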

For a complete setup guide including Docker, Kubernetes, Elasticsearch, and Prometheus configurations, see the Telegraf Setup Guide (../_attachments/TELEGRAF_SETUP.adoc).