Building a Scalable Logging System with Loki and Grafana

#observability #logging #loki #grafana #kubernetes

Modern applications generate a constant stream of logs across services, containers, and infrastructure. Centralizing those logs is critical for debugging, analytics, and compliance, but traditional full-text indexing stacks can become expensive and complex at scale.

Grafana Loki takes a different approach. Instead of indexing the entire log line, Loki indexes labels and stores raw logs in object storage. This dramatically reduces cost while still enabling fast and flexible querying with LogQL. Paired with Grafana for visualization and Promtail or Grafana Agent for collection, you get a scalable, cost-efficient logging platform.

This guide walks through architecture, deployment options, best practices, and examples to help you build a robust logging system with Loki and Grafana.

Architecture Overview

Core components:

  • Clients: Promtail or Grafana Agent collect logs and push to Loki.
  • Loki: Ingests, stores, compacts, and serves queries.
    • Distributor: Receives logs from clients, validates, and forwards.
    • Ingester: Batches logs into chunks and writes to object storage.
    • Query frontend and queriers: Parallelize and execute queries.
    • Compactor: Compacts and enforces retention on index and chunks.
  • Storage: Object store for chunks (S3, GCS, Azure Blob, MinIO). Lightweight index via boltdb-shipper.
  • Grafana: Visualization, exploration, alerts.

Deployment modes:

  • Single binary: Simple for small setups or dev.
  • Microservices mode: Scales each Loki component independently for production.

When to Choose Loki

  • You want lower cost than full-text indexing solutions for high-volume logs.
  • You primarily filter by labels and perform time-bound searches.
  • You are already using Grafana and Prometheus and want a familiar workflow.
  • You need multi-tenancy and per-tenant limits.

Quick Start: Local Docker Compose

Use this for evaluation and development.

docker-compose.yml

version: "3.8"

services:
  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/config.yml
    volumes:
      - ./loki:/etc/loki
      - loki-data:/data
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:2.9.0
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail:/etc/promtail
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - loki

  grafana:
    image: grafana/grafana:11.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  loki-data:
  grafana-data:

loki/config.yml

auth_enabled: true  # clients must send an X-Scope-OrgID header (matches Promtail's tenant_id)

server:
  http_listen_port: 3100

common:
  path_prefix: /data
  storage:
    filesystem:
      chunks_directory: /data/chunks
      rules_directory: /data/rules
  replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/boltdb-cache
    shared_store: filesystem

compactor:
  working_directory: /data/compactor
  compaction_interval: 5m
  retention_enabled: true

limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  per_stream_rate_limit: 3MB
  per_stream_rate_limit_burst: 6MB
  retention_period: 168h

promtail/config.yml

server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: dev
    external_labels:
      cluster: local

positions:
  filename: /tmp/positions.yaml

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log

  - job_name: docker
    pipeline_stages:
      # json-file logs from the Docker daemon; use the cri stage instead on containerd
      - docker: {}
    static_configs:
      - targets:
          - localhost
        labels:
          job: containers
          __path__: /var/lib/docker/containers/*/*-json.log

Start:

  • docker compose up -d

Production on Kubernetes with Helm

Add the chart repo:

  • helm repo add grafana https://grafana.github.io/helm-charts
  • helm repo update

Install Loki with object storage: values-loki.yaml

loki:
  auth_enabled: true

  commonConfig:
    replication_factor: 3

  schemaConfig:
    configs:
      - from: 2024-01-01
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h

  storageConfig:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/boltdb-cache
      shared_store: s3
    aws:
      s3forcepathstyle: true
      bucketnames: your-loki-bucket
      endpoint: s3.amazonaws.com
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}

  compactor:
    working_directory: /var/loki/compactor
    compaction_interval: 5m
    retention_enabled: true

  limits_config:
    max_streams_matchers_per_query: 10000
    max_query_parallelism: 24
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
    per_stream_rate_limit: 3MB
    per_stream_rate_limit_burst: 6MB
    retention_period: 720h

  analytics:
    reporting_enabled: false

singleBinary:
  enabled: false

ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 50Gi

distributor:
  replicas: 3

querier:
  replicas: 3

queryFrontend:
  replicas: 2

compactor:
  replicas: 1
  persistence:
    enabled: true
    size: 20Gi

Install:

  • kubectl create ns observability
  • helm install loki grafana/loki -n observability -f values-loki.yaml

Install Promtail as a DaemonSet: values-promtail.yaml

config:
  clients:
    - url: http://loki-gateway.observability.svc.cluster.local/loki/api/v1/push
      tenant_id: prod
      external_labels:
        cluster: prod-cluster

  snippets:
    pipelineStages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level: ""
      - timestamp:
          source: time
          format: RFC3339
          # keep lines whose timestamp fails to parse instead of dropping them
          action_on_failure: fudge

  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      pipeline_stages:
        - replace:
            # the capture group marks what gets overwritten with the replacement
            expression: "(?i)(password|secret|token)"
            replace: "[REDACTED]"
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - action: replace
          source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - action: replace
          source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - action: replace
          source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
        - action: replace
          source_labels: [__meta_kubernetes_node_name]
          target_label: node
        - action: labeldrop
          regex: pod_template_hash
        # skip system namespaces
        - action: drop
          source_labels: [__meta_kubernetes_namespace]
          regex: kube-system|kube-public
        # map each discovered container to its log files on the node
        - action: replace
          source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
          separator: /
          target_label: __path__
          replacement: /var/log/pods/*$1/*.log

Install:

  • helm install promtail grafana/promtail -n observability -f values-promtail.yaml

Install Grafana:

  • helm install grafana grafana/grafana -n observability
  • Set the Loki data source in Grafana to the loki-gateway service.

Labeling and Cardinality Best Practices

  • Use a small, stable set of labels:
    • cluster, namespace, app, container, node, environment.
  • Avoid labels with high cardinality or rapid churn:
    • pod UID, request IDs, timestamps, user IDs.
  • Normalize labels via relabel_configs before ingestion.
  • Consolidate equivalent labels across teams to improve query reuse and caching.
  • Use tenant IDs (X-Scope-OrgID) for multi-tenancy and apply per-tenant limits; a minimal push example follows.
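
To see tenancy in action, here is a minimal manual push against the local quick-start stack. The dev tenant and localhost address come from the Compose setup above; date +%s%N assumes GNU date.

curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: dev" \
  --data-raw "{
    \"streams\": [{
      \"stream\": { \"app\": \"manual-test\", \"environment\": \"dev\" },
      \"values\": [[ \"$(date +%s%N)\", \"hello from curl\" ]]
    }]
  }"

Query it back with the same tenant header: with auth_enabled: true, requests that omit X-Scope-OrgID are rejected.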

LogQL Query Examples

Basics:

  • View all logs for a deployment:
    • {namespace="prod", app="api"}
  • Filter lines containing a string:
    • {app="api"} |= "timeout"
  • Exclude matches:
    • {app="api"} != "debug"

Parsing and formatting:

  • Parse JSON logs and extract fields:
    • {app="api"} | json | level="error"
  • Custom output:
    • {app="api"} | json | line_format "{{.trace_id}} {{.msg}}"

Rates and aggregations:

  • Errors per second by namespace:
    • sum by (namespace) (rate({level="error"}[5m]))
  • Top 10 noisy containers:
    • topk(10, sum by (container) (rate({cluster="prod"}[5m])))

Latency or status code analysis:

  • HTTP 500s over time:
    • sum by (app) (rate({app="gateway"} |= " 500 " [5m]))

Correlation:

  • Join logs and metrics in Grafana by using the same labels (namespace, app, pod). Use Explore’s split view for side-by-side analysis.

Alerting on Logs

You can alert in two ways:

  • Loki ruler with LogQL based alerts.
  • Grafana Alerting using a Loki data source.

Example Loki ruler group:

groups:
  - name: app-errors
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: sum by (app, namespace) (rate({app="api"} |= "error" [5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Error logs exceed 1/s for {{ $labels.app }} in {{ $labels.namespace }}

Store ruler files in object storage or mount with the Helm chart. Configure notification channels in Grafana.
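
A sketch of the corresponding ruler block in the Loki config, assuming a locally mounted rules directory and an in-cluster Alertmanager (both paths and the URL are assumptions):

ruler:
  storage:
    type: local
    local:
      directory: /data/rules
  rule_path: /data/rules-temp
  alertmanager_url: http://alertmanager.observability.svc.cluster.local:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

With multi-tenancy enabled, the ruler looks for rule files in a per-tenant subdirectory, for example /data/rules/dev/rules.yaml.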

Retention, Compaction, and Storage

  • Use boltdb-shipper for the index and object storage for chunks.
  • Enable the compactor with retention_enabled to enforce retention.
  • Global retention via limits_config.retention_period.
  • Per-tenant overrides allow different retention by tenant (see the sketch after this list).
  • Use object storage lifecycle policies for additional cost control:
    • Transition older chunks to infrequent access after 30 days.
    • Expire chunks beyond compliance windows.
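
A sketch of per-tenant retention using Loki's runtime overrides file; tenant names and periods below are placeholders.

# overrides.yaml, referenced from the main config:
#   runtime_config:
#     file: /etc/loki/overrides.yaml
overrides:
  dev:
    retention_period: 72h
  prod:
    retention_period: 720h
    retention_stream:
      # keep audit logs longer than the tenant default
      - selector: '{namespace="audit"}'
        priority: 1
        period: 2160h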

Security and Compliance

  • Enable auth_enabled and enforce tenant headers through a gateway or ingress.
  • Terminate TLS at the ingress or use mTLS between agents and Loki.
  • Restrict egress so only the logging agents can push to Loki.
  • Redact secrets in the pipeline (Promtail replace stage).
  • Consider network policies in Kubernetes to isolate the stack (example below).
  • Audit access via Grafana and store dashboards as code.
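
As an example, a minimal NetworkPolicy that restricts ingress to Loki; the namespace and pod labels are assumptions matching the Helm install above:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        # allow only the log agents and Grafana in this namespace
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: promtail
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 3100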

Performance and Scaling

  • Horizontal scale:
    • Increase distributors and ingesters for higher write throughput.
    • Increase queriers and query-frontends for read throughput and parallelism.
  • Caching:
    • Use results cache and index cache if deploying at scale.
  • Chunk tuning:
    • chunk_target_size and max_chunk_age balance memory vs. object size (see the sketch after this list).
  • Label strategy:
    • Shard streams to spread ingestion load but avoid high cardinality.
  • Query tips:
    • Narrow time ranges.
    • Filter early with labels before line filters.
    • Use parsers and aggregations sparingly on very large time windows.
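
A sketch of the relevant ingester settings in the Loki config; the values shown are close to the defaults and meant for orientation, not as recommendations:

ingester:
  chunk_target_size: 1572864   # ~1.5 MB target per compressed chunk
  max_chunk_age: 2h            # flush a chunk after 2h regardless of size
  chunk_idle_period: 30m       # flush streams that stop receiving logs
  chunk_encoding: snappy       # faster than the default gzip, slightly larger chunks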

Cost Optimization

  • Keep labels lean to minimize index growth.
  • Prefer S3 or compatible object storage with lifecycle rules.
  • Use retention per environment and team.
  • Drop noisy, low-value logs or sample them at the edge (example after this list).
  • Compress logs at the source when appropriate and supported.
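
As an illustration, two Promtail pipeline stages that shed low-value volume before it reaches Loki; the health-check pattern and the debug selector are assumptions to adapt:

pipeline_stages:
  # discard health-check noise entirely
  - drop:
      expression: "GET /healthz"
      drop_counter_reason: healthcheck
  # drop debug lines from a specific app
  - match:
      selector: '{app="api"} |= "debug"'
      action: drop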

Troubleshooting

  • 429 Too Many Requests:
    • Increase per_stream_rate_limit or reduce client push concurrency.
  • Missing logs:
    • Verify Promtail positions file and path patterns.
    • Check tenant_id alignment between clients and queries (the logcli check below helps).
  • Slow queries:
    • Add or refine labels.
    • Reduce time range or use query_frontend sharding.
  • Out-of-order timestamps:
    • Ensure time parsing is correct and set reject_old_samples appropriately.
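
When logs seem missing, querying Loki directly with logcli rules out Grafana and data source configuration; the address and tenant below match the quick-start values:

export LOKI_ADDR=http://localhost:3100
export LOKI_ORG_ID=dev

# confirm the expected label values exist
logcli labels job

# pull the last hour for a stream
logcli query --since=1h '{job="varlogs"}'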

Migrating from ELK

  • Start by mirroring a subset of logs into Loki.
  • Keep label sets compatible with existing dashboards.
  • Replace common Kibana searches with LogQL equivalents (see the examples below).
  • Evaluate cost and performance, then expand usage.
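
A few starting-point translations; field names depend on your parsing pipeline:

Kibana (Lucene)                      LogQL
app:api AND message:timeout          {app="api"} |= "timeout"
NOT message:debug                    {app="api"} != "debug"
level:error                          {app="api"} | json | level="error"
status:[500 TO 599]                  {app="gateway"} | json | status >= 500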

Provisioning Grafana Data Sources

grafana/provisioning/datasources/loki.yaml

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.observability.svc.cluster.local
    basicAuth: false
    isDefault: true
    jsonData:
      maxLines: 1000
      # required when auth_enabled is true, unless a gateway injects the tenant header
      httpHeaderName1: X-Scope-OrgID
      derivedFields:
        - name: trace_id
          matcherRegex: "trace[=: ]([a-f0-9-]+)"
          # $$ escapes Grafana's env-var expansion; point this at your tracing UI
          url: "$${__value.raw}"
    secureJsonData:
      httpHeaderValue1: prod

Final Checklist

  • Storage
    • Object storage configured and reachable
    • Compactor enabled with retention
  • Reliability
    • At least three ingesters and distributors
    • Persistent volumes for ingesters and compactor
  • Security
    • TLS at ingress
    • Tenant isolation and per-tenant limits
  • Efficiency
    • Label strategy reviewed
    • Lifecycle rules for object storage
  • Usability
    • Grafana data source provisioned
    • Dashboards and alerts defined

By following these practices and configurations, you can run a scalable, cost-effective logging stack with Loki and Grafana that grows with your workloads while keeping operational complexity in check.