Prometheus Metrics
Rift exposes metrics in Prometheus format for monitoring and alerting.
Enabling Metrics
Default Configuration
Metrics are enabled by default on port 9090:
curl http://localhost:9090/metrics
Custom Port
# Environment variable
RIFT_METRICS_PORT=8090 rift-http-proxy
# Docker
docker run -e RIFT_METRICS_PORT=8090 -p 8090:8090 ghcr.io/etacassiopeia/rift-proxy:latest
Disable Metrics
RIFT_METRICS_ENABLED=false rift-http-proxy
Available Metrics
Request Metrics
# Total requests processed
rift_http_requests_total{method="GET", path="/api/users", status="200"} 1234
# Request duration histogram
rift_http_request_duration_seconds_bucket{le="0.001"} 100
rift_http_request_duration_seconds_bucket{le="0.01"} 500
rift_http_request_duration_seconds_bucket{le="0.1"} 900
rift_http_request_duration_seconds_bucket{le="1"} 1000
rift_http_request_duration_seconds_bucket{le="+Inf"} 1000
rift_http_request_duration_seconds_sum 45.67
rift_http_request_duration_seconds_count 1000
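These series are normally consumed with rate(); for example, the average request duration over the last five minutes can be derived from the _sum and _count series (the 5m window is just a common choice):
rate(rift_http_request_duration_seconds_sum[5m])
/
rate(rift_http_request_duration_seconds_count[5m])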
Fault Injection Metrics
# Faults injected
rift_faults_injected_total{type="latency", rule="api-latency"} 300
rift_faults_injected_total{type="error", rule="api-errors"} 50
# Injected latency histogram
rift_injected_latency_seconds_bucket{le="0.1"} 100
rift_injected_latency_seconds_bucket{le="0.5"} 250
rift_injected_latency_seconds_bucket{le="1"} 300
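To see how aggressively faults are being applied, the counter can be turned into a per-rule rate or compared against total request volume (the second query assumes each injected fault corresponds to one proxied request):
# Faults injected per second, by rule
sum(rate(rift_faults_injected_total[5m])) by (rule)
# Approximate share of requests receiving any fault
sum(rate(rift_faults_injected_total[5m]))
/
sum(rate(rift_http_requests_total[5m]))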
Imposter Metrics (Mountebank Mode)
# Number of imposters
rift_imposters_total 5
# Stubs per imposter
rift_stubs_total{port="4545"} 10
rift_stubs_total{port="4546"} 25
# Requests per imposter
rift_imposter_requests_total{port="4545", matched="true"} 500
rift_imposter_requests_total{port="4545", matched="false"} 12
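A useful derived signal is the fraction of imposter traffic that did not match any stub, per imposter port; for example:
sum(rate(rift_imposter_requests_total{matched="false"}[5m])) by (port)
/
sum(rate(rift_imposter_requests_total[5m])) by (port)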
Script Execution Metrics
# Script execution time
rift_script_execution_seconds_bucket{engine="rhai", le="0.001"} 950
rift_script_execution_seconds_bucket{engine="rhai", le="0.01"} 999
rift_script_execution_seconds_bucket{engine="rhai", le="+Inf"} 1000
rift_script_execution_seconds_sum{engine="rhai"} 2.5
rift_script_execution_seconds_count{engine="rhai"} 1000
# Script errors
rift_script_errors_total{engine="rhai"} 5
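Two straightforward derived views are the script error rate and a high percentile of script execution time (the engine label comes from the series above):
# Script errors per second, per engine
rate(rift_script_errors_total[5m])
# P95 script execution time, per engine
histogram_quantile(0.95, sum(rate(rift_script_execution_seconds_bucket[5m])) by (engine, le))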
Flow State Metrics
# Flow state operations
rift_flow_state_operations_total{operation="get"} 5000
rift_flow_state_operations_total{operation="set"} 2000
rift_flow_state_operations_total{operation="delete"} 100
# Flow state operation latency
rift_flow_state_operation_seconds_bucket{operation="get", le="0.0001"} 4900
rift_flow_state_operation_seconds_bucket{operation="get", le="0.001"} 5000
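Example queries: operation throughput by type, and a high percentile for a single operation (the histogram is labelled by operation, so keep that label when filtering or aggregating):
# Flow state operations per second, by operation type
sum(rate(rift_flow_state_operations_total[5m])) by (operation)
# P99 latency for get operations
histogram_quantile(0.99, sum(rate(rift_flow_state_operation_seconds_bucket{operation="get"}[5m])) by (le))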
Connection Metrics
# Active connections
rift_active_connections 25
# Connection pool stats
rift_connection_pool_size{upstream="backend"} 10
rift_connection_pool_available{upstream="backend"} 7
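Pool pressure can be expressed as the share of pooled connections currently in use (pool size minus available, divided by pool size), per upstream:
(rift_connection_pool_size - rift_connection_pool_available)
/
rift_connection_pool_size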
Prometheus Configuration
Basic Scrape Config
# prometheus.yml
scrape_configs:
  - job_name: 'rift'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
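Once the target is configured, the standard up series confirms that Prometheus can reach it (a value of 1 means the last scrape succeeded):
up{job="rift"}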
Kubernetes Service Discovery
scrape_configs:
  - job_name: 'rift'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: rift
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9090"
        action: keep
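For these relabel rules to match, the pod must carry the app: rift label and declare the metrics container port in its spec. A minimal sketch of the relevant pod-template fields (the container and port names are illustrative):
# Pod template excerpt (illustrative)
metadata:
  labels:
    app: rift                  # kept by the first relabel rule
spec:
  containers:
    - name: rift
      image: ghcr.io/etacassiopeia/rift-proxy:latest
      ports:
        - name: metrics
          containerPort: 9090  # kept by the port-number relabel rule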
Grafana Dashboard
Import Dashboard
- Go to Grafana → Dashboards → Import
- Upload the dashboard JSON or paste the ID
- Select your Prometheus data source
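Alternatively, dashboards can be created through Grafana's HTTP API instead of the UI; a minimal sketch, assuming an API token in GRAFANA_TOKEN and a file rift-dashboard.json whose top level wraps the dashboard as {"dashboard": ..., "overwrite": true} (the file and variable names are illustrative):
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @rift-dashboard.json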
Key Panels
Request Rate:
rate(rift_http_requests_total[5m])
Error Rate:
sum(rate(rift_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(rift_http_requests_total[5m])) * 100
P99 Latency:
histogram_quantile(0.99, rate(rift_http_request_duration_seconds_bucket[5m]))
Fault Injection Rate:
sum(rate(rift_faults_injected_total[5m])) by (type)
Sample Dashboard JSON
{
  "title": "Rift Metrics",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [{
        "expr": "sum(rate(rift_http_requests_total[5m])) by (status)"
      }]
    },
    {
      "title": "Latency Percentiles",
      "type": "graph",
      "targets": [
        { "expr": "histogram_quantile(0.50, rate(rift_http_request_duration_seconds_bucket[5m]))", "legendFormat": "p50" },
        { "expr": "histogram_quantile(0.95, rate(rift_http_request_duration_seconds_bucket[5m]))", "legendFormat": "p95" },
        { "expr": "histogram_quantile(0.99, rate(rift_http_request_duration_seconds_bucket[5m]))", "legendFormat": "p99" }
      ]
    },
    {
      "title": "Fault Injection",
      "type": "graph",
      "targets": [{
        "expr": "sum(rate(rift_faults_injected_total[5m])) by (type)"
      }]
    }
  ]
}
Alerting Rules
High Error Rate
groups:
  - name: rift
    rules:
      - alert: RiftHighErrorRate
        expr: |
          sum(rate(rift_http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(rift_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Rift"
          description: "Error rate is {{ $value | humanizePercentage }}"
High Latency
- alert: RiftHighLatency
  expr: |
    histogram_quantile(0.99, rate(rift_http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency in Rift"
    description: "P99 latency is {{ $value }}s"
Script Errors
- alert: RiftScriptErrors
  expr: rate(rift_script_errors_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Script errors in Rift"
    description: "Script engine has errors"
Best Practices
- Set reasonable scrape intervals - 15-30 seconds is typical
- Use recording rules - Pre-compute expensive queries (see the example after this list)
- Set up alerting - Monitor error rates and latency
- Retain metrics - Keep enough history for analysis
- Watch label cardinality - Avoid high-cardinality labels (such as request IDs)
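As an example of pre-computing expensive queries, the error-rate and P99 latency expressions used above could be captured as recording rules; a minimal sketch (the rule names follow the common level:metric:operation convention but are otherwise arbitrary):
# rift-recording-rules.yml
groups:
  - name: rift-recording
    rules:
      - record: job:rift_http_requests:rate5m
        expr: sum(rate(rift_http_requests_total[5m]))
      - record: job:rift_http_errors:ratio_rate5m
        expr: |
          sum(rate(rift_http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(rift_http_requests_total[5m]))
      - record: job:rift_http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(rift_http_request_duration_seconds_bucket[5m])) by (le))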