Getting Started with Astro: A Modern Web Framework
#data-modeling
#databases
#sql
#nosql
#graphs
How to Store Hierarchical Data (Trees, Graphs, Tags)
Hierarchical data shows up everywhere: categories and subcategories, organization charts, threaded comments, product catalogs, permissions, and knowledge graphs. Choosing the right storage pattern depends on your read/write workload, required queries (subtrees, ancestors, paths, shortest paths), and the database you are using.
This guide walks through the most common patterns with pros, cons, schemas, and queries across SQL, document stores, and graph databases. It ends with practical guidance to choose the right model for your use case.
Trees vs. Graphs vs. DAGs
- Tree: A connected acyclic graph with one parent per node (except root).
- DAG: Directed acyclic graph; nodes can have multiple parents but no cycles.
- General graph: Relationships can form cycles; nodes can have many parents.
Be explicit about whether you need a strict tree, a DAG, or a general graph. Many models designed for trees do not handle multiple parents or cycles without modification.
Core Patterns in Relational Databases
1) Adjacency List (Parent Pointer)
The simplest: each node stores a reference to its parent.
Schema (PostgreSQL):
CREATE TABLE node (
id BIGSERIAL PRIMARY KEY,
parent_id BIGINT REFERENCES node(id) ON DELETE SET NULL,
name TEXT NOT NULL
);
CREATE INDEX ON node (parent_id);
Pros:
- Easy to understand.
- Inserts, updates are cheap.
- Works with any RDBMS.
Cons:
- Subtree and path queries require recursion.
- Harder to prevent cycles without constraints or triggers.
Common queries:
- Get descendants (recursive CTE, PostgreSQL/MySQL 8/SQLite 3.8+):
WITH RECURSIVE descendants AS (
SELECT id, parent_id, name, 0 AS depth
FROM node
WHERE id = $1 -- start node
UNION ALL
SELECT n.id, n.parent_id, n.name, d.depth + 1
FROM node n
JOIN descendants d ON n.parent_id = d.id
)
SELECT * FROM descendants;
- Get ancestors (path to root):
WITH RECURSIVE ancestors AS (
SELECT id, parent_id, name, 0 AS depth
FROM node WHERE id = $1
UNION ALL
SELECT p.id, p.parent_id, p.name, a.depth + 1
FROM node p
JOIN ancestors a ON a.parent_id = p.id
)
SELECT * FROM ancestors;
Cycle prevention (simple approach): a trigger that denies setting parent_id to a descendant of the current node.
2) Materialized Path
Each node stores its full path from the root, commonly as a delimited string or using Postgres ltree.
Schema (generic string path):
CREATE TABLE node (
id BIGSERIAL PRIMARY KEY,
path TEXT NOT NULL, -- e.g., /electronics/phones/smartphones
name TEXT NOT NULL
);
-- Index to support prefix lookups
CREATE INDEX node_path_prefix ON node (path);
With Postgres ltree:
CREATE EXTENSION IF NOT EXISTS ltree;
CREATE TABLE node (
id BIGSERIAL PRIMARY KEY,
path LTREE NOT NULL, -- e.g., electronics.phones.smartphones
name TEXT NOT NULL
);
CREATE INDEX ON node USING GIST (path);
Pros:
- Subtree queries are fast (prefix scan).
- Sorting by hierarchy is easy.
- Good for read-heavy trees with infrequent moves.
Cons:
- Moving/renaming a node requires updating all descendants’ paths.
- More write amplification on reorganizations.
Queries:
- Get subtree by prefix:
-- Text path
SELECT * FROM node WHERE path LIKE '/electronics/%';
-- ltree
SELECT * FROM node WHERE path <@ 'electronics.phones.*';
- Get ancestors (ltree):
SELECT subpath(path, 0, n) AS ancestor
FROM node, LATERAL generate_subscripts(string_to_array(path::text, '.'), 1) AS n
WHERE id = $1;
Updating on move:
-- Move /electronics/phones to /devices/phones
WITH moved AS (
SELECT path AS old_path, '/devices' || substring(path FROM '/phones.*') AS new_path
FROM node WHERE path LIKE '/electronics/phones%'
)
UPDATE node n
SET path = m.new_path
FROM moved m
WHERE n.path = m.old_path;
3) Nested Sets (lft/rgt)
Each node gets left/right numbers by a depth-first traversal. A node’s subtree is the interval (lft, rgt).
Schema:
CREATE TABLE node (
id BIGSERIAL PRIMARY KEY,
lft BIGINT NOT NULL,
rgt BIGINT NOT NULL,
name TEXT NOT NULL,
CHECK (lft < rgt)
);
CREATE INDEX ON node (lft);
CREATE INDEX ON node (rgt);
Pros:
- Subtree queries are single-range scans.
- Aggregate queries over subtrees are efficient.
Cons:
- Inserts, moves require shifting ranges (O(N) updates).
- Concurrency is tricky; not ideal for frequent writes.
Queries:
- Subtree:
SELECT child.*
FROM node AS parent
JOIN node AS child
ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.id = $1
ORDER BY child.lft;
- Immediate children:
SELECT c.*
FROM node p
JOIN node c
ON c.lft > p.lft AND c.rgt < p.rgt
LEFT JOIN node x
ON x.lft > p.lft AND x.rgt < p.rgt
AND x.lft < c.lft AND x.rgt > c.rgt
WHERE p.id = $1 AND x.id IS NULL
ORDER BY c.lft;
Use when the tree is relatively static and read-heavy.
4) Closure Table
Store all ancestor–descendant pairs with depth.
Schema:
CREATE TABLE node (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE node_closure (
ancestor BIGINT NOT NULL REFERENCES node(id) ON DELETE CASCADE,
descendant BIGINT NOT NULL REFERENCES node(id) ON DELETE CASCADE,
depth INT NOT NULL, -- 0 for self
PRIMARY KEY (ancestor, descendant)
);
CREATE INDEX ON node_closure (descendant, depth);
Insert a node under parent P:
-- Add the node
INSERT INTO node (name) VALUES ($1) RETURNING id;
-- Add closure rows: self-link and inherited ancestors
INSERT INTO node_closure (ancestor, descendant, depth)
SELECT ancestor, $new_id, depth + 1
FROM node_closure
WHERE descendant = $parent_id
UNION ALL
SELECT $new_id, $new_id, 0;
Move a subtree: delete its existing ancestor links, then reinsert based on the new parent’s ancestors.
Pros:
- Fast subtree and ancestor queries.
- Balanced read/write performance; moves are manageable.
- Enforces reachability semantics.
Cons:
- Extra storage proportional to number of pairs.
- Need logic (triggers or app code) to maintain closure rows.
Queries:
- Get descendants:
SELECT n.*, c.depth
FROM node_closure c
JOIN node n ON n.id = c.descendant
WHERE c.ancestor = $1
ORDER BY c.depth, n.id;
- Get ancestors (path):
SELECT n.*, c.depth
FROM node_closure c
JOIN node n ON n.id = c.ancestor
WHERE c.descendant = $1
ORDER BY c.depth DESC;
Cycle prevention is natural: deny inserts that would create ancestor=descendant with negative or nonsensical depth.
Graph Databases (Property Graph)
When you need multiple parents, arbitrary relationships, or flexible traversal patterns (e.g., “friends-of-friends who like both X and Y”), a graph database shines.
Example (Neo4j Cypher):
Schema and data:
CREATE (r:Category {name:'electronics'}),
(p:Category {name:'phones'}),
(s:Category {name:'smartphones'});
CREATE (p)-[:CHILD_OF]->(r);
CREATE (s)-[:CHILD_OF]->(p);
Subtree traversal:
MATCH (n:Category {name: $name})<-[:CHILD_OF*0..]-(desc)
RETURN desc;
Path to root:
MATCH path = (n:Category {name: $name})-[:CHILD_OF*0..]->(root)
WHERE NOT (root)-[:CHILD_OF]->()
RETURN path ORDER BY length(path) DESC LIMIT 1;
Pros:
- Natural modelling for DAGs and general graphs.
- Powerful, flexible traversals and path queries.
- Good for complex relationship analytics.
Cons:
- Operational overhead vs. RDBMS.
- Joins back to relational data require integration patterns.
Document Stores (e.g., MongoDB)
Two common patterns:
- Embed children (good for small, bounded subtrees, and when you always load together).
- Reference parent and optionally store materialized path.
Adjacency with ancestors array:
// Document example
{
_id: ObjectId("..."),
name: "smartphones",
parentId: ObjectId("..."),
ancestors: [ObjectId("root"), ObjectId("phones")]
}
// Index for subtree prefix search
db.categories.createIndex({ ancestors: 1 });
/* Get subtree (all nodes that include the root in their ancestors) */
db.categories.find({ ancestors: ObjectId("root") });
Materialized path as string:
{
_id: "...",
name: "smartphones",
path: "/electronics/phones/smartphones"
}
// Index: db.categories.createIndex({ path: 1 })
// Query: db.categories.find({ path: /^\/electronics\/phones/ })
Pros:
- Simple to scale.
- Embedding can simplify reads.
Cons:
- Moving nodes can be expensive if many descendants need updates.
- Limited native recursive querying; you model it yourself via arrays or path strings.
Tags and Taxonomies
- Flat tags (non-hierarchical):
- content table, tags table, content_tags join table.
CREATE TABLE content (
id BIGSERIAL PRIMARY KEY,
title TEXT NOT NULL
);
CREATE TABLE tag (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE content_tag (
content_id BIGINT REFERENCES content(id) ON DELETE CASCADE,
tag_id BIGINT REFERENCES tag(id) ON DELETE CASCADE,
PRIMARY KEY (content_id, tag_id)
);
CREATE INDEX ON content_tag (tag_id);
- Hierarchical taxonomies:
- Use adjacency list, materialized path, or closure table for tags.
- If content should inherit from parent tags, closure table is a strong fit.
Example: hierarchical tags with closure table:
CREATE TABLE tag (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE tag_closure (
ancestor BIGINT NOT NULL REFERENCES tag(id) ON DELETE CASCADE,
descendant BIGINT NOT NULL REFERENCES tag(id) ON DELETE CASCADE,
depth INT NOT NULL,
PRIMARY KEY (ancestor, descendant)
);
CREATE TABLE content_tag (
content_id BIGINT REFERENCES content(id) ON DELETE CASCADE,
tag_id BIGINT REFERENCES tag(id) ON DELETE CASCADE,
PRIMARY KEY (content_id, tag_id)
);
/* Query: all content under a tag subtree */
SELECT DISTINCT c.*
FROM tag_closure tc
JOIN content_tag ct ON ct.tag_id = tc.descendant
JOIN content c ON c.id = ct.content_id
WHERE tc.ancestor = $tag_id;
Choosing the Right Model
-
Mostly flat with occasional parent-child, simple CRUD, and small trees:
- Adjacency list.
-
Read-heavy navigation of subtrees and breadcrumb paths; infrequent moves:
- Materialized path or nested sets.
- Prefer materialized path if you can tolerate path rewrites on moves.
-
Balanced read/write, frequent queries for both ancestors and descendants; easy moves:
- Closure table.
-
Many-to-many relationships, DAGs, complex traversals, shortest paths:
- Graph database (e.g., Neo4j).
-
Document-centric workloads; JSON-first; flexible schemas:
- Document store with ancestors array (materialized path) or embedded sub-documents.
Performance and Integrity Tips
-
Indexing:
- Adjacency: index parent_id; often add composite (parent_id, id).
- Materialized path: index path; with Postgres ltree use GIST/GIN on path.
- Nested sets: indexes on lft, rgt.
- Closure: composite indexes on (ancestor, depth), (descendant, depth).
- Graph DB: indexes on node labels and key properties.
-
Constraints:
- Use foreign keys with appropriate ON DELETE actions.
- Prevent cycles with triggers for adjacency lists, or by validating updates.
- For closure tables, ensure (ancestor, descendant) is unique and include self-links with depth=0.
-
Concurrency:
- Nested sets are sensitive to concurrent writes; consider advisory locks or choose closure/materialized path instead.
- Batch updates for materialized path moves to reduce write amplification.
-
Pagination and ordering:
- For trees, store a position/order field among siblings.
- Materialized path and nested sets naturally support hierarchical ordering.
-
Large datasets:
- Consider partitioning by root, caching computed paths, or precomputing commonly used closures.
Worked Example: Choosing for a Product Catalog
- Requirements: fast category page loads, frequent browsing of subcategories, occasional reorganizations.
- Good fit: materialized path in Postgres with ltree or text prefix; or closure table if moves happen regularly and you need both ancestors and descendants efficiently.
- If you must support products belonging to multiple categories and advanced relationship queries (e.g., compatibility graphs), keep categories in closure/materialized path and model cross-product relationships in a graph DB or a relational join table, depending on query complexity.
Summary
- Start with adjacency list for simplicity.
- Use materialized path when subtree reads dominate and moves are rare.
- Use nested sets for static trees with heavy hierarchical reads.
- Use closure tables for balanced read/write and simple moves.
- Use graph databases for DAGs/general graphs and complex traversals.
- In document stores, embed when small and cohesive; otherwise use ancestry arrays or path strings with indexes.
Pick the model that matches your workload and operational constraints, not just the one that seems most elegant.