Building a Scalable Logging System with Loki and Grafana
#observability
#logging
#loki
#grafana
#kubernetes
Modern applications generate a constant stream of logs across services, containers, and infrastructure. Centralizing those logs is critical for debugging, analytics, and compliance, but traditional full-text indexing stacks can become expensive and complex at scale.
Grafana Loki takes a different approach. Instead of indexing the entire log line, Loki indexes labels and stores raw logs in object storage. This dramatically reduces cost while still enabling fast and flexible querying with LogQL. Paired with Grafana for visualization and Promtail or Grafana Agent for collection, you get a scalable, cost-efficient logging platform.
This guide walks through architecture, deployment options, best practices, and examples to help you build a robust logging system with Loki and Grafana.
Architecture Overview
Core components:
- Clients: Promtail or Grafana Agent collect logs and push them to Loki.
- Loki: ingests, stores, compacts, and serves queries. Its main internal components are:
  - Distributor: receives logs from clients, validates them, and forwards them to ingesters.
  - Ingester: batches logs into chunks and writes them to object storage.
  - Query frontend and queriers: split, parallelize, and execute queries.
  - Compactor: compacts the index and enforces retention on index and chunks.
- Storage: an object store for chunks (S3, GCS, Azure Blob, MinIO) and a lightweight index via boltdb-shipper.
- Grafana: visualization, exploration, and alerting.
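A simplified view of how these fit together:

Write path: Promtail / Grafana Agent -> distributor -> ingester -> object storage (chunks) + boltdb-shipper (index)
Read path:  Grafana -> query frontend -> queriers -> ingesters (recent data) and object storage (older data)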
Deployment modes:
- Single binary: Simple for small setups or dev.
- Microservices mode: Scales each Loki component independently for production.
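Both modes run the same binary; the -target flag selects which components a process runs. A quick sketch (the image tag and config path match the Compose example below):

docker run --rm -p 3100:3100 -v $(pwd)/loki:/etc/loki grafana/loki:2.9.0 -config.file=/etc/loki/config.yml -target=all

# In microservices mode, run one process per component, for example:
loki -config.file=/etc/loki/config.yml -target=distributor
loki -config.file=/etc/loki/config.yml -target=ingester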
When to Choose Loki
- You want lower cost than full-text indexing solutions for high-volume logs.
- You primarily filter by labels and perform time-bound searches.
- You are already using Grafana and Prometheus and want a familiar workflow.
- You need multi-tenancy and per-tenant limits.
Quick Start: Local Docker Compose
Use this for evaluation and development.
docker-compose.yml
version: "3.8"

services:
  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/config.yml
    volumes:
      - ./loki:/etc/loki
      - loki-data:/data
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:2.9.0
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail:/etc/promtail
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - loki

  grafana:
    image: grafana/grafana:11.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  loki-data:
  grafana-data:
loki/config.yml
auth_enabled: true

server:
  http_listen_port: 3100

common:
  path_prefix: /data
  storage:
    filesystem:
      chunks_directory: /data/chunks
      rules_directory: /data/rules
  replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/boltdb-cache
    shared_store: filesystem

compactor:
  working_directory: /data/compactor
  shared_store: filesystem
  compaction_interval: 5m
  retention_enabled: true

limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  per_stream_rate_limit: 3MB
  per_stream_rate_limit_burst: 6MB
  retention_period: 168h
promtail/config.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: dev
    external_labels:
      cluster: local

positions:
  filename: /tmp/positions.yaml

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log

  - job_name: docker
    pipeline_stages:
      - docker: {}
    static_configs:
      - targets:
          - localhost
        labels:
          job: containers
          __path__: /var/lib/docker/containers/*/*-json.log
Start:
- docker compose up -d
- Open Grafana at http://localhost:3000 (admin / admin), add a Loki data source pointing to http://loki:3100, and add an X-Scope-OrgID HTTP header set to dev so queries match the tenant_id Promtail pushes with (auth_enabled is true in this config).
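To confirm logs are flowing, run a label-only query in Explore over the last hour, for example:

{job="varlogs"}
{job="containers"} |= "error"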
Production on Kubernetes with Helm
Add chart repo:
- helm repo add grafana https://grafana.github.io/helm-charts
- helm repo update
Install Loki with object storage: values-loki.yaml
loki:
  auth_enabled: true
  commonConfig:
    replication_factor: 3
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  storageConfig:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/boltdb-cache
      shared_store: s3
    aws:
      s3forcepathstyle: true
      bucketnames: your-loki-bucket
      endpoint: s3.amazonaws.com
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  compactor:
    working_directory: /var/loki/compactor
    shared_store: s3
    compaction_interval: 5m
    retention_enabled: true
  limits_config:
    max_streams_matchers_per_query: 10000
    max_query_parallelism: 24
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
    per_stream_rate_limit: 3MB
    per_stream_rate_limit_burst: 6MB
    retention_period: 720h
  analytics:
    reporting_enabled: false

singleBinary:
  enabled: false

ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 50Gi

distributor:
  replicas: 3

querier:
  replicas: 3

queryFrontend:
  replicas: 2

compactor:
  replicas: 1
  persistence:
    enabled: true
    size: 20Gi
Install:
- kubectl create ns observability
- helm install loki grafana/loki -n observability -f values-loki.yaml
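Before installing the agents, check that the pods come up. Note that the ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} placeholders are only expanded if the Loki containers run with -config.expand-env=true and those variables are present in their environment; a pod IAM role is a common alternative to static keys.

kubectl -n observability get pods
kubectl -n observability get svc | grep loki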
Install Promtail as a DaemonSet: values-promtail.yaml
config:
  clients:
    - url: http://loki-gateway.observability.svc.cluster.local/loki/api/v1/push
      tenant_id: prod
      external_labels:
        cluster: prod-cluster
  snippets:
    pipelineStages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level: ""
      - timestamp:
          source: time
          format: RFC3339
          action_on_failure: fudge
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      pipeline_stages:
        - replace:
            expression: "(?i)(password|secret|token)"
            replace: "[REDACTED]"
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - action: replace
          source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - action: replace
          source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - action: replace
          source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
        - action: replace
          source_labels: [__meta_kubernetes_node_name]
          target_label: node
        # Drop the high-churn pod_template_hash label produced by the labelmap above
        - action: labeldrop
          regex: pod_template_hash
        # Skip system namespaces entirely
        - action: drop
          source_labels: [__meta_kubernetes_namespace]
          regex: kube-system|kube-public
        # Resolve the log file path for each container
        - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
          separator: /
          target_label: __path__
          replacement: /var/log/pods/*$1/*.log
Install:
- helm install promtail grafana/promtail -n observability -f values-promtail.yaml
Install Grafana:
- helm install grafana grafana/grafana -n observability
- Set the Loki data source in Grafana to the loki-gateway service.
Labeling and Cardinality Best Practices
- Use a small, stable set of labels: cluster, namespace, app, container, node, environment.
- Avoid labels with high cardinality or rapid churn: pod UID, request IDs, timestamps, user IDs.
- Normalize labels via relabel_configs before ingestion.
- Consolidate equivalent labels across teams to improve query reuse and caching.
- Use tenant IDs (the X-Scope-OrgID header) for multi-tenancy and apply per-tenant limits, as in the push example below.
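For reference, this is what a raw push with an explicit tenant header looks like against Loki's push API (timestamps are nanoseconds since the epoch; the labels and tenant here are illustrative):

curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: dev" \
  -d '{
        "streams": [
          {
            "stream": { "app": "api", "environment": "dev" },
            "values": [ [ "1700000000000000000", "level=info msg=\"hello from curl\"" ] ]
          }
        ]
      }'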
LogQL Query Examples
Basics:
- View all logs for a deployment:
  {namespace="prod", app="api"}
- Filter lines containing a string:
  {app="api"} |= "timeout"
- Exclude matches:
  {app="api"} != "debug"
Parsing and formatting:
- Parse JSON logs and extract fields:
  {app="api"} | json | level="error"
- Custom output:
  {app="api"} | json | line_format "{{.trace_id}} {{.msg}}"
Rates and aggregations:
- Errors per second by namespace:
  sum by (namespace) (rate({level="error"}[5m]))
- Top 10 noisy containers:
  topk(10, sum by (container) (rate({cluster="prod"}[5m])))
Latency or status code analysis:
- HTTP 500s over time:
  sum by (app) (rate({app="gateway"} |= " 500 " [5m]))
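If your logs carry a numeric field such as a request duration, you can also compute percentiles directly in LogQL (the duration_ms field name is an assumption about your log format):

quantile_over_time(0.95, {app="gateway"} | json | unwrap duration_ms [5m]) by (app)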
Correlation:
- Join logs and metrics in Grafana by using the same labels (namespace, app, pod). Use Explore’s split view for side-by-side analysis.
Alerting on Logs
You can alert in two ways:
- Loki ruler with LogQL based alerts.
- Grafana Alerting using a Loki data source.
Example Loki ruler group:
groups:
  - name: app-errors
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: sum by (app, namespace) (rate({app="api"} |= "error" [5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Error logs exceed 1/s for {{ $labels.app }} in {{ $labels.namespace }}
Store rule files in object storage or mount them via the Helm chart. Ruler-evaluated alerts are sent to Alertmanager, while Grafana-managed alerts use Grafana's own contact points.
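A minimal ruler block for the Loki configuration, matching the local rules directory used earlier (the Alertmanager URL is an assumption for your environment):

ruler:
  enable_api: true
  storage:
    type: local
    local:
      directory: /data/rules
  rule_path: /tmp/loki-rules
  alertmanager_url: http://alertmanager.observability.svc.cluster.local:9093
  ring:
    kvstore:
      store: inmemory

With local storage, rule files are read per tenant, e.g. /data/rules/<tenant_id>/app-errors.yaml.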
Retention, Compaction, and Storage
- Use boltdb-shipper for the index and object storage for chunks.
- Enable the compactor with retention_enabled to enforce retention.
- Global retention via limits_config.retention_period.
- Per-tenant overrides allow different retention by tenant.
- Use object storage lifecycle policies for additional cost control:
  - Transition older chunks to infrequent access after 30 days.
  - Expire chunks beyond compliance windows.
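Per-tenant limits, including retention, can live in a runtime overrides file that Loki reloads periodically; a sketch (the file path and tenant names are illustrative):

overrides.yaml
overrides:
  prod:
    retention_period: 2160h
    ingestion_rate_mb: 40
  dev:
    retention_period: 168h

Referenced from the main config:
runtime_config:
  file: /etc/loki/overrides.yaml
  period: 10s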
Security and Compliance
- Enable auth_enabled and enforce tenant headers through a gateway or ingress.
- Terminate TLS at the ingress or use mTLS between agents and Loki.
- Restrict egress so only the logging agents can push to Loki.
- Redact secrets in the pipeline (Promtail replace stage).
- Consider network policies in Kubernetes to isolate the stack.
- Audit access via Grafana and store dashboards as code.
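As one concrete isolation measure, a NetworkPolicy can restrict who may reach Loki; a sketch assuming the Helm charts' default pod labels:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-allow-agents
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: promtail
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 3100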
Performance and Scaling
- Horizontal scale:
  - Increase distributors and ingesters for higher write throughput.
  - Increase queriers and query frontends for read throughput and parallelism.
- Caching:
  - Use the results cache and index cache when deploying at scale.
- Chunk tuning:
  - chunk_target_size and max_chunk_age balance memory use against object size.
- Label strategy:
  - Shard streams to spread ingestion load, but avoid high cardinality.
- Query tips:
  - Narrow time ranges.
  - Filter early with labels before line filters.
  - Use parsers and aggregations sparingly on very large time windows.
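The corresponding knobs in the Loki configuration look like this (values are starting points, not recommendations):

ingester:
  chunk_target_size: 1572864   # ~1.5 MB target per compressed chunk
  max_chunk_age: 2h            # flush chunks after this age regardless of size
  chunk_idle_period: 30m       # flush streams that stop receiving logs
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500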
Cost Optimization
- Keep labels lean to minimize index growth.
- Prefer S3 or compatible object storage with lifecycle rules.
- Use retention per environment and team.
- Drop noisy, low-value logs or sample them at the edge.
- Compress logs at the source when appropriate and supported.
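Dropping at the edge can be done with Promtail's drop stage; a minimal sketch assuming logfmt-style level= output:

pipeline_stages:
  - drop:
      expression: "level=(debug|trace)"
      drop_counter_reason: low_value_logs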
Troubleshooting
- 429 Too Many Requests:
  - Increase per_stream_rate_limit or reduce client push concurrency.
- Missing logs:
  - Verify the Promtail positions file and path patterns.
  - Check tenant_id alignment between clients and queries.
- Slow queries:
  - Add or refine labels.
  - Reduce the time range or use query-frontend sharding.
- Out-of-order timestamps:
  - Ensure time parsing is correct and set reject_old_samples appropriately.
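A few checks that help while debugging (addresses assume port-forwarding or the local Compose setup; logcli is Loki's query CLI):

# Is Loki ready to accept traffic?
curl -s http://localhost:3100/ready

# Are samples being rejected, and why?
curl -s http://localhost:3100/metrics | grep loki_discarded_samples_total

# Query from the command line
logcli --addr=http://localhost:3100 --org-id=dev query '{job="varlogs"}' --since=1h --limit=50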
Migrating from ELK
- Start by mirroring a subset of logs into Loki.
- Keep label sets compatible with existing dashboards.
- Replace common Kibana searches with LogQL equivalents.
- Evaluate cost and performance, then expand usage.
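A couple of typical translations (the Kibana field names are illustrative):

# Kibana (KQL):  kubernetes.namespace : "prod" and message : *timeout*
# LogQL:         {namespace="prod"} |= "timeout"

# Kibana (KQL):  log.level : "error" and kubernetes.container.name : "api"
# LogQL:         {container="api"} | json | level="error"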
Provisioning Grafana Data Sources
grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.observability.svc.cluster.local
    basicAuth: false
    isDefault: true
    jsonData:
      maxLines: 1000
      derivedFields:
        - name: trace_id
          matcherRegex: "trace[=: ]([a-f0-9-]+)"
          url: "$${__value.raw}"
Final Checklist
- Storage
  - Object storage configured and reachable
  - Compactor enabled with retention
- Reliability
  - At least three ingesters and distributors
  - Persistent volumes for ingesters and the compactor
- Security
  - TLS at the ingress
  - Tenant isolation and per-tenant limits
- Efficiency
  - Label strategy reviewed
  - Lifecycle rules for object storage
- Usability
  - Grafana data source provisioned
  - Dashboards and alerts defined
By following these practices and configurations, you can run a scalable, cost-effective logging stack with Loki and Grafana that grows with your workloads while keeping operational complexity in check.