keep your snapshots simple

philip mathew hern

dbt - This article is part of a series.

snapshots are one of those features that feel like a free win the first time you reach for them. dbt handles the merge, the warehouse handles the storage, and suddenly you have a tidy history of every change a source row has ever gone through. then you ship a few of them, leave them running for a while, and discover that the maintenance, the backups, and the recovery story are quietly the most expensive parts of your pipeline.

i still use snapshots, but the rule i hold myself to is short. use them only when i absolutely have to, never on top of another snapshot, and never as a substitute for logic i could recompute deterministically. the rest of this post explains why.

quick answer
#

dbt snapshots implement the type 2 slowly changing dimension pattern. they track every change to a source row by closing the previous version and inserting a new one with dbt_valid_from and dbt_valid_to columns that define when each version was current. they are powerful when the source overwrites history and you genuinely need point-in-time answers, but they are expensive to operate, brittle under source-schema or grain changes, and impossible to fully rebuild from scratch once you have lost the original change events. the practical rule is to use the smallest number of snapshots you can get away with, never let one snapshot read from another, and recompute everything else deterministically.

who this is for
#

  • analytics engineers and data engineers who already use dbt and are deciding when to reach for snapshots
  • teams that have inherited a snapshot-heavy project and are trying to reduce the operational tax
  • anyone who has felt the pain of a corrupted or backfilled snapshot and wants a more conservative pattern going forward

why this matters
#

snapshots are different from every other model in your dbt project. a regular model is a pure function of its inputs, and you can drop and rebuild it any time. a snapshot is stateful, meaning its output depends on every prior run, the order those runs happened in, and the source values that existed at the moment each run executed. once that state is wrong, no amount of dbt run --full-refresh will fix it for you, because the historical events that produced the original sequence of rows are gone.

that is the trade you are making when you adopt a snapshot. you are giving up reproducibility in exchange for history capture. that trade is worth it for some sources, but it is far less common than people assume. most of the time, what you actually need is either a deterministic transformation, a freeze calendar, or a properly modeled effective satellite. snapshots should be a small, deliberate slice of your warehouse, not a default.

type 2 scd: a quick refresher
#

before going further, here is a short reminder about what type 2 actually means. the dimension community talks about slowly changing dimensions in numbered types because the trade-offs are very different across them. the most common ones are:

  • type 0: never change the value, even if the source does
  • type 1: overwrite the value in place, no history kept
  • type 2: keep every version of the row across time, with start and end timestamps that mark when each version was current
  • type 3: keep a small number of prior values in additional columns on the same row (for example previous_status)

type 2 is the only one that gives you a complete record of how an attribute evolved. the trade-off is that the row count grows every time something changes, and every consumer has to know how to filter the table to get the version they want.

what type 2 looks like in a row
#

imagine a customer table where the status column changed twice. with type 2 history, the same customer_id shows up multiple times, with non-overlapping validity intervals.

customer_id | status   | valid_from          | valid_to
----------- | -------- | ------------------- | -------------------
1001        | prospect | 2026-01-01 00:00:00 | 2026-02-15 09:30:00
1001        | active   | 2026-02-15 09:30:00 | 2026-04-22 14:10:00
1001        | churned  | 2026-04-22 14:10:00 | 9999-12-31 23:59:59

a few things in that table do most of the work:

  • the business key (customer_id) is no longer unique on its own, so the grain is now customer_id plus the validity interval
  • valid_from is inclusive and valid_to is exclusive in most type 2 conventions, so the intervals tile cleanly without overlap
  • the current row uses a sentinel like 9999-12-31 23:59:59 instead of null, which makes downstream filters cleaner
  • to answer “what was the status at time t”, you filter where valid_from <= t and valid_to > t
  • to answer “what is the status now”, you filter where valid_to = '9999-12-31 23:59:59'
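the two filters above translate directly into sql. a sketch against the example table, assuming a hypothetical table name customer_history and the sentinel convention shown above:

```sql
-- status as of a specific moment: the interval that contains t
select customer_id, status
from customer_history
where valid_from <= '2026-03-01 00:00:00'
  and valid_to   >  '2026-03-01 00:00:00';

-- current status only: the open interval marked by the sentinel
select customer_id, status
from customer_history
where valid_to = '9999-12-31 23:59:59';
```

the first query returns the active row for 2026-03-01, because that moment falls inside its interval; the second always returns exactly one row per customer as long as the intervals tile correctly.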

if any of that feels familiar from data vault, that is because effective satellites are essentially the same idea expressed in vault vocabulary. i wrote about a related grain trap in left join an effective satellite without duplicating rows, and the same care applies here. the moment you have validity intervals, every join and every filter has to respect them.

what a dbt snapshot is
#

a dbt snapshot is dbt’s built-in implementation of type 2 history capture. you write a query that returns the current state of a source, and dbt takes responsibility for comparing that current state against the snapshot table on every run, closing rows that changed and inserting new versions. the columns it adds are:

  • dbt_valid_from: when this version became current
  • dbt_valid_to: when this version stopped being current (or the sentinel for current rows)
  • dbt_scd_id: a hash of the unique key plus dbt_valid_from for stable surrogate identity
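as a concrete anchor, here is a minimal snapshot definition in the classic jinja-block style. the source name, target schema, and updated_at column are assumptions for illustration, not prescriptions:

```sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- the query returns the *current* state of the source;
-- dbt compares it to the snapshot table and writes the history
select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```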

i wrote a longer companion piece on the practical migration story in dbt snapshots, moving from merges to native history. that post is the “how to do it well” view. this post is the “how to do less of it” view.

when a snapshot is the right call
#

reach for a snapshot when all of the following are true:

  • the source system overwrites the row in place and does not retain history of its own
  • you genuinely need point-in-time answers, not just “the current value”
  • you cannot reconstruct the historical state from a deterministic formula or from another system that already keeps history
  • the source has a stable grain and a reliable unique key
  • you can guarantee the snapshot will run on a cadence that catches every change you care about

if any one of those is false, a snapshot is probably the wrong tool. for example, if the source already publishes change events to a message broker, capture the events into a regular append-only table and model on top of that. if the value is a deterministic function of other inputs (for example, a derived score computed from frozen reference data), recompute it. if the source has a cycle_id or some other natural temporal key, join on that key directly instead of leaning on validity intervals.
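the event-capture alternative can be a plain append-only incremental model, which stays fully rebuildable from the raw events. a sketch, with hypothetical source and column names:

```sql
{{
    config(
      materialized='incremental',
      unique_key='event_id'
    )
}}

-- append new change events; the raw event stream, not this model,
-- is the system of record, so a full refresh is always safe
select event_id, customer_id, status, event_at
from {{ source('events', 'customer_status_changed') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```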

the cost: complexity, brittleness, maintenance, backups
#

this is the part most adoption guides skip over. snapshots look free in the demo and feel free for the first month. then the bills start to come in.

complexity in the dag
#

a snapshot is a node in your dag, but it does not behave like other nodes. unlike an incremental model, whose state can always be rebuilt with a full refresh, a snapshot carries hidden state from prior runs that cannot be reproduced, and it requires its own command (dbt snapshot) on its own schedule. a dbt project that contains snapshots has, in practice, two pipelines that have to stay in lockstep: the regular dbt build and the snapshot pipeline. when one of them lags or fails, downstream consumers see stale or partial history, and the symptoms can be subtle.

every snapshot also forces every downstream model that reads it to think about validity intervals. queries that used to be a simple select now need a current-row filter or a point-in-time predicate, and reviewers have to verify that filter on every change.

brittleness under change
#

snapshots are unusually sensitive to upstream changes. a few examples i have seen close up:

  • a source query is widened to include a new column, and the check strategy now flags every row as changed on the next run, doubling the table overnight
  • a source briefly drops keys (because of a partial backfill or a bad upstream join), and a hard_deletes = invalidate snapshot closes thousands of rows that are still valid
  • duplicate keys appear in the source for a single run, and the snapshot either fails or quietly bloats with overlapping intervals
  • the source schema changes type on a column, and the snapshot now refuses to merge because the staged data and the historical data disagree

each of these takes a careful, surgical fix. you cannot just rerun the snapshot from scratch, because the original sequence of source values is gone.

maintenance and backups
#

because snapshot output is not reproducible, you have to treat the snapshot table itself as production data, not a derived artifact. that means:

  • regular backups of the snapshot tables to a separate schema or storage location, with a retention policy you actually enforce
  • a documented recovery runbook for partial corruption (for example, restore from backup, replay only specific keys, validate intervals)
  • alerting on row-count deltas, row-count ratios per run, and anomalies in dbt_valid_to distributions, so a misconfigured run does not run for a week before anyone notices
  • change review for any edit to the snapshot definition, because changing check_cols, unique_key, or the source query can rewrite history in non-obvious ways

a regular dbt model needs none of that. a snapshot needs all of it, and the cost scales with the number of snapshots in your project.
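one cheap piece of that monitoring is an interval sanity check. a sketch that flags business keys with overlapping validity intervals, assuming a hypothetical customers_snapshot table and dbt's standard metadata columns (coalescing dbt_valid_to because dbt leaves it null on current rows):

```sql
-- any key returned here has two versions that were "current"
-- at the same moment, which should be impossible in a healthy snapshot
select a.customer_id
from customers_snapshot a
join customers_snapshot b
  on  a.customer_id = b.customer_id
  and a.dbt_scd_id <> b.dbt_scd_id
  and a.dbt_valid_from < coalesce(b.dbt_valid_to, '9999-12-31')
  and b.dbt_valid_from < coalesce(a.dbt_valid_to, '9999-12-31')
group by a.customer_id;
```

run as a scheduled test, this catches the duplicate-key and overlapping-interval failure modes above before they spread downstream.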

the worst pattern: snapshots on top of snapshots
#

if there is one rule i would carve into the wall, it is this: never build a snapshot on top of another snapshot. stacking two type 2 tables produces validity intervals on top of validity intervals, and the result is almost never what anyone wants.

why this is so dangerous
#

a single type 2 table is already a careful object. its grain is the business key plus the validity interval, and every consumer has to filter to a single moment in time before doing anything else. when you stack a second snapshot on top of it, you are now tracking history of a thing that was already historical, and the questions you can sensibly ask multiply in ugly ways.

think about what the row count of snapshot_b becomes when its source query reads from snapshot_a without point-in-time filtering. for any business key, you get the cartesian product of versions, which means changes in snapshot_b get attributed to the wrong intervals of snapshot_a. even if you do filter for current rows, the second snapshot will react to every change in the first, including changes that have nothing to do with the attributes you care about, so you end up with a much noisier history than you wanted.

even if you carefully filter the inner snapshot to its current row, you have lost something important. the outer snapshot now records history of a moving target. when you reread the outer snapshot at a past timestamp, the row you get back was generated against the inner snapshot’s current state at the time the outer run executed, not against the inner snapshot’s state at the same past timestamp. this is the validity-on-validity trap, and it is almost impossible to reason about by inspection.

what to do instead
#

if you find yourself wanting to build a second snapshot on top of a first, treat that as a signal that the design is wrong, not as a problem to solve in sql. a few healthier alternatives:

  • have one snapshot per source object that genuinely needs history, and read it directly in your information delivery layer
  • if you need a derived attribute that depends on a snapshot, compute that attribute in a regular view that filters the snapshot to a single point in time and is itself recomputable
  • if the derived attribute genuinely needs its own history, snapshot the source inputs independently and join them by point-in-time logic in a downstream view, instead of stacking the snapshots themselves
  • if your business has a natural temporal key (cycle, period, year), prefer joining by that key over inferring history from validity intervals

the goal is to push as much of the temporal logic as you can into deterministic transformations, and keep the snapshots themselves at the edges of the dag.
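as a sketch of what that point-in-time join looks like in a downstream view, with hypothetical orders and customers_snapshot tables:

```sql
-- downstream view: attach the customer status that was current
-- when each order happened, without snapshotting the result
select
    o.order_id,
    o.ordered_at,
    c.status as customer_status_at_order
from orders o
join customers_snapshot c
  on  c.customer_id = o.customer_id
  and c.dbt_valid_from <= o.ordered_at
  and coalesce(c.dbt_valid_to, '9999-12-31 23:59:59') > o.ordered_at
```

because this is a plain view, it is deterministic given the snapshot: you can drop and recreate it freely, and the only stateful object remains the snapshot itself at the edge of the dag.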

less is more, simple is better
#

most of the temporal questions you think need a snapshot do not. before adding one, run through this short checklist:

  1. is the value already historical somewhere upstream, in events, in a cycle table, or in another system, where i can read it without snapshotting myself?
  2. can i compute the value deterministically from current inputs, so any past answer is just a recomputation against frozen reference data?
  3. is the source overwriting history in place with no other record of the prior value?
  4. if i never built this snapshot, would consumers really lose information they care about, or just convenience?

if the answer to either of the first two is yes, do not add a snapshot. if the answer to the third is no, do not add a snapshot. if the answer to the fourth is “just convenience”, do not add a snapshot.

what you are aiming for is a warehouse that is mostly deterministic, with a small ring of carefully managed snapshots at the edges. the deterministic core is cheap to rebuild, easy to test, and forgiving to refactor. the snapshot ring is where the real cost lives, so you want it to be small enough that you can afford to back it up, monitor it, and recover it when something goes wrong.

simple beats clever here. one well-run snapshot you understand is worth ten clever snapshots that nobody can rebuild.

faq
#

when is a snapshot definitely worth it?
#

when the source overwrites in place, you have a real business need for point-in-time answers, and there is no upstream event log or cycle key to lean on instead. operational systems that mutate rows without retaining history are the canonical case.

what is the single biggest mistake people make with snapshots?
#

reading from a snapshot in another snapshot’s source query. the validity-on-validity trap is the worst class of bug to debug, because the symptom shows up far away from the cause and the table looks plausible at a glance.

how do i reduce the number of snapshots in an existing project?
#

start with the snapshots that are read by the smallest number of downstream models, and ask whether the consumers really need history or just the current row. if they only need current, replace the snapshot with a regular view. for the snapshots that genuinely need history, make sure each one is independent and that nothing else in the project reads a snapshot to feed another snapshot.

should i ever full-refresh a snapshot?
#

almost never in production. a full refresh wipes the historical rows that no longer match the current source, which is the entire reason you built the snapshot in the first place. treat the snapshot table like operational data, not a derived artifact.
