Practical Database Indexing: Real Examples and Benchmarks
#databases #indexing #performance #postgresql #mysql
Indexing is the most cost-effective way to make database-backed applications fast. This article gives you practical examples and reproducible benchmarks for common query patterns, with a focus on PostgreSQL and notes for MySQL users.
What you’ll get:
- A realistic dataset you can generate locally
- Common query patterns and which indexes help
- Before/after EXPLAIN ANALYZE with indicative timings
- Guidance for Postgres-specific features (partial, covering, GIN) and MySQL equivalents
- A checklist for measuring impact without fooling yourself
Your exact numbers will vary based on hardware, data distribution, cache warmth, and versions. The goal is to show proportions and methods, not absolute records.
Benchmarks setup
- Database: PostgreSQL 16, default settings except increased shared_buffers (2–4 GB) and effective_cache_size (half RAM)
- Hardware: 8 vCPU, 16 GB RAM, NVMe SSD
- Data volume: 10 million rows
- Methodology:
- Run EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE) three times and take the best warm-cache value
- Note a cold-cache run after dropping OS file cache when possible
- Ensure ANALYZE has run after bulk loads (VACUUM ANALYZE or just ANALYZE)
- Track write overhead via pg_stat_statements and insert throughput (measurement snippet below)
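A minimal measurement snippet; the query shown is only a placeholder, and pg_stat_statements must already be installed in the database:
-- Warm-cache timing: run the statement under test three times, keep the best
EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE)
SELECT count(*) FROM events WHERE user_id = 424242;
-- Reset statement stats before a bulk load so write overhead is easy to attribute
SELECT pg_stat_statements_reset();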
Dataset: events
A single wide table that captures common access patterns.
PostgreSQL DDL:
CREATE TABLE events (
id bigserial PRIMARY KEY,
user_id int NOT NULL,
occurred_at timestamptz NOT NULL,
event_type smallint NOT NULL, -- e.g., 1=login, 2=purchase, ...
country_code char(2) NOT NULL,
payload text, -- optional searchable text
metadata jsonb -- flexible attributes
);
-- Synthetic data: ~10M rows over ~365 days; user_id is spread roughly uniformly across ~900k users.
-- Adjust generate_series ranges to taste.
INSERT INTO events (user_id, occurred_at, event_type, country_code, payload, metadata)
SELECT
(1000 + (random()*900000))::int, -- many users, roughly uniform; use power(random(), n) for skew
now() - ((random()*365)::int || ' days')::interval
- ((random()*86400)::int || ' seconds')::interval,
(1 + (random()*5)::int)::smallint,
(ARRAY['US','DE','IN','BR','GB','FR','CA','AU','ES','IT'])[1 + (random()*9)::int],
substr(md5(random()::text), 1, 60),
jsonb_build_object(
'plan', (ARRAY['free','pro','team'])[1 + (random()*2)::int],
'device', (ARRAY['ios','android','web'])[1 + (random()*2)::int],
'ref', substr(md5(random()::text), 1, 8)
)
FROM generate_series(1, 10000000);
ANALYZE events;
Query pattern 1: fetch latest items per user (ORDER BY + LIMIT)
Typical UI: “Latest 10 user events”.
Query:
SELECT id, occurred_at, event_type, country_code
FROM events
WHERE user_id = 424242
ORDER BY occurred_at DESC
LIMIT 10;
Baseline (no index on user_id):
- Plan: Seq Scan on events, filter user_id, then sort top-N
- Indicative warm time on 10M rows: 800–1500 ms
- Buffers: heavy heap read; sorts each time
Add a composite, descending, covering index:
-- Postgres: INCLUDE makes it covering without affecting sort keys
CREATE INDEX CONCURRENTLY idx_events_user_time_desc_cover
ON events (user_id, occurred_at DESC)
INCLUDE (event_type, country_code);
After index:
- Plan: Index Only Scan using idx_events_user_time_desc_cover
- Indicative warm time: 1–5 ms
- Cold time: 20–60 ms (depending on cache)
- Notes:
- DESC in the index matches ORDER BY DESC and avoids a sort
- INCLUDE enables index-only scans when the visibility map is mostly up to date
MySQL note:
- Use a composite index: KEY (user_id, occurred_at DESC); MySQL 8.0+ honors descending key parts. InnoDB secondary indexes can be covering when every selected column is in the index or the primary key. There is no INCLUDE syntax, so append the extra columns as trailing key parts (see the sketch below).
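A rough MySQL 8 equivalent, assuming a comparable events table (the index name and trailing columns are illustrative; InnoDB secondary indexes carry the primary key, so id is covered implicitly):
ALTER TABLE events
ADD INDEX idx_events_user_time_cover (user_id, occurred_at DESC, event_type, country_code);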
Query pattern 2: range by time within a partitioning key
Example: “Events in DE last 7 days ordered by time”.
SELECT id, occurred_at, user_id
FROM events
WHERE country_code = 'DE'
AND occurred_at >= now() - interval '7 days'
ORDER BY occurred_at ASC
LIMIT 500;
Index:
CREATE INDEX CONCURRENTLY idx_events_country_time
ON events (country_code, occurred_at);
Why this order:
- Filter on country_code first, then range on occurred_at
- Matches ORDER BY occurred_at for that country; often avoids explicit sort
After index:
- Warm time: 3–15 ms for selective countries
- If 'DE' is very common, consider a partial index by country (see the sketch below), or add more selectivity (e.g., event_type).
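A sketch of the country-scoped partial index, assuming 'DE' dominates the workload:
CREATE INDEX CONCURRENTLY idx_events_de_time
ON events (occurred_at)
WHERE country_code = 'DE';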
MySQL note:
- Same composite order: KEY(country_code, occurred_at)
- Ensure the query can use the leftmost prefix.
Query pattern 3: partial indexes for hot subsets (Postgres)
If most queries target event_type = 2 (e.g., purchases), use a partial index.
CREATE INDEX CONCURRENTLY idx_events_purchases_recent
ON events (occurred_at DESC)
WHERE event_type = 2;
Query:
SELECT id, user_id, occurred_at
FROM events
WHERE event_type = 2
AND occurred_at >= now() - interval '30 days'
ORDER BY occurred_at DESC
LIMIT 50;
Benefits:
- Smaller index → faster scans and updates
- Warm time: similar to pattern 1, often 1–5 ms
MySQL note:
- No partial (predicate) indexes exist; consider a separate table for the hot subset, or a generated column plus a secondary index or partitioning tuned to the predicate (a sketch follows).
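A minimal MySQL sketch of the generated-column workaround (column and index names are illustrative):
-- Expose the hot predicate as a stored flag, then index it with the time column
ALTER TABLE events
ADD COLUMN is_purchase TINYINT(1) GENERATED ALWAYS AS (event_type = 2) STORED,
ADD INDEX idx_events_purchase_time (is_purchase, occurred_at);
-- Query through the flag so the index applies:
-- SELECT id, user_id, occurred_at FROM events
-- WHERE is_purchase = 1 AND occurred_at >= NOW() - INTERVAL 30 DAY
-- ORDER BY occurred_at DESC LIMIT 50;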
Query pattern 4: case-insensitive prefix vs substring text search
Case-insensitive prefix:
-- Expression index; text_pattern_ops lets a B-tree serve prefix LIKE
-- regardless of the database collation
CREATE INDEX CONCURRENTLY idx_events_lower_payload_prefix
ON events (lower(payload) text_pattern_ops);
-- Query
SELECT id FROM events
WHERE lower(payload) LIKE 'abc%';
- A B-tree can serve anchored (prefix) LIKE patterns; with a non-C collation, the text_pattern_ops operator class above is what makes this work.
- Warm time: tens of ms depending on selectivity.
Substring search:
-- Postgres pg_trgm extension for trigram GIN
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_events_payload_trgm
ON events USING GIN (payload gin_trgm_ops);
SELECT id FROM events
WHERE payload ILIKE '%abc%';
- GIN trigram turns a table scan into an index scan
- Warm time: often 10–40 ms for selective terms vs hundreds of ms to seconds without index
MySQL note:
- Use FULLTEXT for natural-language modes; for substring matching you may need n-gram parsers or an external search engine. For prefix LIKE, an index with a consistent collation helps. MySQL 8.0.13+ also supports functional indexes (implemented as hidden generated columns); for TEXT expressions such as lower(payload), a stored generated column with an explicit length is usually simpler (see the sketch below).
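A hedged MySQL sketch using a stored generated column (the 64-character cap mirrors the ~60-character payloads in this dataset and is otherwise arbitrary):
ALTER TABLE events
ADD COLUMN payload_lower VARCHAR(64)
GENERATED ALWAYS AS (LOWER(LEFT(payload, 64))) STORED,
ADD INDEX idx_events_payload_lower (payload_lower);
-- Prefix search can then use the index:
-- SELECT id FROM events WHERE payload_lower LIKE 'abc%';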
Query pattern 5: JSON attributes
Filter by JSON attribute:
-- Query
SELECT id FROM events
WHERE metadata ->> 'plan' = 'pro';
Postgres indexes:
-- Option A: GIN on jsonb for containment lookups
CREATE INDEX CONCURRENTLY idx_events_metadata_gin
ON events USING GIN (metadata jsonb_path_ops);
-- Then query with containment to use the index:
SELECT id FROM events
WHERE metadata @> '{"plan":"pro"}';
-- Option B: Functional B-tree if you only need a single key
CREATE INDEX CONCURRENTLY idx_events_plan
ON events ((metadata ->> 'plan'));
- B-tree functional index is small and fast for equality, often a few ms.
- GIN helps for multiple keys and more complex predicates.
MySQL note:
- Create a generated column and index it:
ALTER TABLE events
ADD COLUMN plan VARCHAR(10) AS (JSON_UNQUOTE(JSON_EXTRACT(metadata, '$.plan'))) STORED,
ADD INDEX idx_events_plan (plan);
Query pattern 6: joins on foreign keys
If you frequently join events.user_id to users.id, index both sides:
- users.id is PRIMARY KEY (already indexed)
- events.user_id needs a B-tree index:
CREATE INDEX CONCURRENTLY idx_events_user ON events (user_id);
Without it, fetching one user's events through the join requires a full scan of events, and larger joins fall back to hash or merge joins over the whole table.
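For reference, a join of this shape (against the users table referenced above; the ids are arbitrary) benefits directly from idx_events_user:
SELECT u.id, e.occurred_at, e.event_type
FROM users u
JOIN events e ON e.user_id = u.id
WHERE u.id = ANY (ARRAY[42, 4242, 424242])
ORDER BY e.occurred_at DESC
LIMIT 50;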
Benchmark snapshots (indicative)
- Latest 10 events for a user:
- No index: 800–1500 ms warm
- With (user_id, occurred_at DESC) covering: 1–5 ms warm, 20–60 ms cold
- Country + 7-day range:
- No index: 700–1200 ms warm with sort
- With (country_code, occurred_at): 3–15 ms warm, 25–80 ms cold
- JSON plan equality via functional index:
- No index: 400–900 ms warm
- With (metadata->>'plan'): 2–8 ms warm
- Substring payload search:
- No index: 2–6 s warm for unpopular term on 10M rows
- GIN trigram: 10–40 ms warm for selective term
Write overhead
Indexes accelerate reads but slow writes. Measure the impact:
- Baseline INSERT INTO events … (no secondary indexes): ~170k rows/s bulk insert
- With 3 additional indexes (user_id, user_id+occurred_at, country_code+occurred_at): ~75–100k rows/s
- With GIN on payload or metadata: expect another noticeable drop due to GIN maintenance
These are directional numbers; measure on your stack (a timing sketch follows).
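A rough throughput check in psql, a sketch only (the batch size is arbitrary; run it before and after creating the secondary indexes):
\timing on
INSERT INTO events (user_id, occurred_at, event_type, country_code, payload, metadata)
SELECT (1000 + (random()*900000))::int, now(), 1::smallint, 'US',
       substr(md5(random()::text), 1, 60), '{}'::jsonb
FROM generate_series(1, 1000000);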
How to measure correctly
- Warm vs cold cache:
- Warm: run the same query 2–3 times and take the best
- Cold: on Linux, drop caches between runs (needs root) or read enough unrelated data to evict
- Use EXPLAIN (ANALYZE, BUFFERS):
- Look for Index Scan/Index Only Scan vs Seq Scan
- Check Rows Removed by Filter; if it is high, the index is not selective enough for the query, so add the filtered column to the index or use a partial index
- Check stats:
- ANALYZE after big loads
- Ensure datatypes and collations match predicates
- Watch for implicit casts:
- A type mismatch between a predicate value and the column (for example, a numeric literal compared against a string column) can force a per-row cast and prevent index usage; keep bound parameter types aligned with column types (see the sketch after this list)
- Verify sort avoidance:
- If you filter on user_id and ORDER BY occurred_at, an index on (occurred_at, user_id) can return rows in order but filters most of them away; (user_id, occurred_at) serves both the filter and the sort
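To make the implicit-cast pitfall concrete, a MySQL-flavored sketch with a hypothetical VARCHAR column external_ref (not part of the schema above):
-- Numeric literal: MySQL compares as numbers and cannot use an index on external_ref
SELECT id FROM events WHERE external_ref = 42;
-- Quoted literal: string-to-string comparison, so the index on external_ref stays usable
SELECT id FROM events WHERE external_ref = '42';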
Selecting composite index order
- Put the most selective, equality-tested column first
- Follow with range/ordering columns
- Make the index order match your ORDER BY whenever possible
- One query pattern per index; if you have many patterns, consider query consolidation or different indexes per workload
Covering indexes
- Postgres: use INCLUDE for non-key columns to enable index-only scans
- MySQL/InnoDB: queries are often covered if selected columns are in the secondary index or the primary key
- Be mindful: larger indexes increase write cost and memory use
Partial indexes (Postgres)
- Great when a narrow, stable predicate covers most hot queries
- Example: WHERE event_type = 2 plus a fixed time cutoff. Index predicates must use immutable expressions, so now() cannot appear in them; pin a constant cutoff and recreate the index periodically (see the sketch below)
- They reduce index size and maintenance cost
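A sketch of a recent-purchases partial index with a pinned cutoff (the date is arbitrary; recreate the index to move the window forward):
CREATE INDEX CONCURRENTLY idx_events_purchases_2024
ON events (occurred_at DESC)
WHERE event_type = 2
  AND occurred_at >= TIMESTAMPTZ '2024-01-01 00:00:00+00';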
Functional indexes
- Index expressions used in WHERE/ORDER BY (the expressions must use only immutable functions):
- lower(email), date_trunc('day', occurred_at AT TIME ZONE 'UTC'), coalesce(field, 'x')
- MySQL 8: use generated columns and index them
Full-text and fuzzy search
- Postgres: GIN with pg_trgm for substring; GIN or GiST with to_tsvector for full-text
- MySQL: FULLTEXT indexes on InnoDB for natural language and boolean mode
- For complex search, consider dedicated search engines
Operational tips
- Don’t over-index; every index slows writes and VACUUM
- Rebuild bloated indexes if needed (REINDEX CONCURRENTLY in Postgres)
- Monitor with pg_stat_io, pg_stat_statements, and auto_explain (starter queries below)
- Consider partitioning by time for very large tables and to improve pruning
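Two starter monitoring queries, assuming pg_stat_statements is installed (the limits are arbitrary):
-- Most expensive statements by total execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Indexes that have never been scanned (candidates for removal)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, relname;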
Minimal reproducible script (Postgres)
- Create and load:
-- Run DDL + INSERT from above
ANALYZE events;
- Baseline query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, occurred_at, event_type, country_code
FROM events
WHERE user_id = 424242
ORDER BY occurred_at DESC
LIMIT 10;
- Add index and re-run:
CREATE INDEX CONCURRENTLY idx_events_user_time_desc_cover
ON events (user_id, occurred_at DESC)
INCLUDE (event_type, country_code);
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, occurred_at, event_type, country_code
FROM events
WHERE user_id = 424242
ORDER BY occurred_at DESC
LIMIT 10;
Cheat sheet
- Equality filter: B-tree on column; for multi-column, equality columns first
- Range filter with sort: composite (filter_cols, order_col [ASC|DESC]) to avoid sort
- Latest N rows per key: (key, time DESC) plus covering columns
- Frequent hot subset in Postgres: partial index
- Expression predicates: functional or generated-column index
- Substring search: trigram GIN (Postgres) or rethink approach in MySQL
- JSON attribute equality: functional index (or generated column in MySQL)
- Always measure with EXPLAIN ANALYZE, warm and cold
With a few targeted indexes and a disciplined measurement loop, you can turn multi-second scans into millisecond responses and keep write overhead under control.