
Access archived data

Sometimes you need to access an archived dataset or snapshot and compare it with the current version. Archived steps are no longer kept as code in the repository. Instead, dag/archive/*.yml records each step that was once active, with a marker comment carrying the commit where it was last active. You recover a step by checking out that commit.

Git History

The simplest way to access an older dataset is by checking out the commit where the step was last active and running the ETL from there.

  1. Find the commit of interest:
     - Look up the step in dag/archive/*.yml. Its marker comment gives the recovery commit, e.g. # archived; last active in 4e6b5dfb9cb7 on 2026-05-11.
     - Alternatively, open the file in GitHub, click History, and copy the SHA of the desired commit.

  2. Check out the commit:

    git checkout <SHA>
    

  3. Re-run the ETL:

    make .venv
    etlr <dataset>
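
If you would rather not read the archive file by eye, the marker comments can be pulled out with a regular expression. The following is a minimal sketch that assumes every marker follows the format of the example above. `find_markers` and `SAMPLE` are hypothetical names introduced for illustration, and the YAML layout in `SAMPLE` is a guess; the real layout of dag/archive/*.yml may differ.

```python
import re

# Marker format taken from the example above:
#   "# archived; last active in 4e6b5dfb9cb7 on 2026-05-11"
MARKER = re.compile(
    r"#\s*archived;\s*last active in\s+(?P<sha>[0-9a-f]{7,40})"
    r"\s+on\s+(?P<date>\d{4}-\d{2}-\d{2})"
)


def find_markers(dag_text):
    """Return (line_number, sha, date) for every archive marker in the text."""
    found = []
    for lineno, line in enumerate(dag_text.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            found.append((lineno, match["sha"], match["date"]))
    return found


# Hypothetical excerpt; the real layout of dag/archive/*.yml may differ.
SAMPLE = """\
steps:
  data://garden/climate/latest/weekly_wildfires:  # archived; last active in 4e6b5dfb9cb7 on 2026-05-11
"""

markers = find_markers(SAMPLE)
```

The SHA returned for the step you are interested in is the value to pass to `git checkout`.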
    

Tip

Run this in a separate folder (e.g., etl2) to retain access to the current datasets. This setup allows you to compare datasets in a notebook.

Example comparison in Python

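from os.path import expanduser

from etl.dataset import Dataset

# Expand "~" explicitly, since Python does not expand it when opening a path

# Load current dataset
tb_current = Dataset(expanduser("~/projects/etl/data/garden/climate/latest/weekly_wildfires")).read_table("wildfires")

# Load dataset from a previous commit (checked out in the separate etl2 folder)
tb_old = Dataset(expanduser("~/projects/etl2/data/garden/climate/latest/weekly_wildfires")).read_table("wildfires")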
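
Once both tables are loaded, one way to compare them is an outer merge on the key columns. The sketch below uses two small pandas DataFrames as stand-ins for `tb_old` and `tb_current`. The column names (`country`, `year`, `area_ha`) and the helper `compare_tables` are hypothetical, and the code assumes the keys are regular columns (call `.reset_index()` first if they are in the index).

```python
import pandas as pd


def compare_tables(tb_old, tb_new, keys, value_col):
    """Count rows unique to each table and list rows whose value changed."""
    merged = tb_old.merge(
        tb_new, on=keys, how="outer", suffixes=("_old", "_new"), indicator=True
    )
    in_both = merged["_merge"] == "both"
    old_col, new_col = f"{value_col}_old", f"{value_col}_new"
    # Note: NaN != NaN, so rows missing in both versions would count as changed.
    changed = in_both & (merged[old_col] != merged[new_col])
    return {
        "only_old": int((merged["_merge"] == "left_only").sum()),
        "only_new": int((merged["_merge"] == "right_only").sum()),
        "changed": merged.loc[changed, keys + [old_col, new_col]],
    }


# Toy stand-ins for the two versions of the table (made-up values).
tb_old = pd.DataFrame(
    {"country": ["A", "A", "B"], "year": [2023, 2024, 2024], "area_ha": [10.0, 20.0, 5.0]}
)
tb_current = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2023, 2024, 2024, 2025],
        "area_ha": [10.0, 25.0, 5.0, 7.0],
    }
)

result = compare_tables(tb_old, tb_current, ["country", "year"], "area_ha")
```

Here `result["changed"]` holds the rows present in both versions whose value differs, which is usually the first thing to inspect in a notebook.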

Update MD5 for archived Snapshots

If the code hasn’t changed and only new snapshots have been created (e.g., for automatically updated datasets), you can modify the snapshot MD5 in the .dvc file to point to an older snapshot.

  1. Find the MD5 and size:
     - Locate the desired commit in GitHub.
     - Copy the MD5 and size from the relevant .dvc file (e.g., snapshots/climate/latest/weekly_wildfires.csv.dvc).

  2. Update the .dvc file locally:
     - Replace the MD5 and size in your local .dvc file.

  3. Re-run the ETL with the updated MD5:

    make .venv
    etlr <dataset>
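
The replacement in step 2 can also be scripted. This is a rough sketch that edits the file as text, assuming the .dvc file contains a single `md5:` entry and a single `size:` entry. The helper `replace_md5_and_size` and the contents of `SAMPLE` are hypothetical; check the result with `git diff` before re-running the ETL.

```python
import re


def replace_md5_and_size(dvc_text, md5, size):
    """Return dvc_text with the first md5 and size values replaced."""
    if not re.fullmatch(r"[0-9a-f]{32}", md5):
        raise ValueError(f"not a 32-character hex MD5: {md5!r}")
    # count=1 so that only the first `md5:` / `size:` entry is touched.
    text, n_md5 = re.subn(
        r"(?m)^([ \t]*(?:-[ \t]+)?md5:[ \t]*)\S+", rf"\g<1>{md5}", dvc_text, count=1
    )
    text, n_size = re.subn(
        r"(?m)^([ \t]*(?:-[ \t]+)?size:[ \t]*)\d+", rf"\g<1>{int(size)}", text, count=1
    )
    if n_md5 != 1 or n_size != 1:
        raise ValueError("expected one md5 entry and one size entry")
    return text


# Hypothetical .dvc excerpt with placeholder values.
SAMPLE = """\
outs:
  - md5: 00000000000000000000000000000000
    size: 1
    path: weekly_wildfires.csv
"""

updated = replace_md5_and_size(SAMPLE, "356177e363926b959f5af281443f0a35", 12548867)
```

Working on the text rather than re-serializing the YAML keeps comments and key order untouched, at the cost of relying on the one-entry assumption above.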
    

Tip

For chart comparisons, commit the updated .dvc file, open a PR, and use the chart diff tool. Enable "Show all charts" to view them side-by-side.


Comparing Snapshots

To directly compare snapshots, use the etl.snapshot module.

  1. Load the current snapshot:

    import pandas as pd

    from etl.snapshot import Snapshot
    
    snap = Snapshot("climate/latest/weekly_wildfires.csv")
    snap.pull()
    pd.read_csv(snap.path).shape
    

  2. Load an older snapshot:
     - Find its MD5 and size from a previous commit.
     - Update the MD5 and size in your script:

    import pandas as pd

    from etl.snapshot import Snapshot
    
    snap = Snapshot("climate/latest/weekly_wildfires.csv")
    # Point the in-memory metadata at the older file before pulling
    snap.metadata.outs[0]["md5"] = "356177e363926b959f5af281443f0a35"
    snap.metadata.outs[0]["size"] = 12548867
    snap.pull()
    pd.read_csv(snap.path).shape
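
After pulling, it can be worth confirming that the local file really is the older snapshot. The sketch below computes the MD5 and size of a file with the standard library only. It assumes the MD5 recorded in the .dvc file is the plain MD5 of the file's bytes, which may not hold for every DVC version. `md5_and_size` and `matches` are hypothetical helpers, and the demo runs on a throwaway file rather than a real snapshot.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_and_size(path, chunk_size=1 << 20):
    """Return (hex MD5, size in bytes) of a file, reading it in chunks."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches(path, expected_md5, expected_size):
    """True if the file has the expected MD5 and size."""
    return md5_and_size(path) == (expected_md5, expected_size)


# Demo on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_bytes(b"year,area_ha\n2024,25.0\n")
    demo_md5, demo_size = md5_and_size(demo)
    demo_ok = matches(demo, demo_md5, demo_size)
```

With a pulled snapshot, the equivalent check would be `matches(snap.path, "356177e363926b959f5af281443f0a35", 12548867)`.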