Data extraction and transformation design patterns

  • You will need to choose between full and incremental extracts
    • Always prefer full extracts and partitioned snapshots if the data is small enough
    • If this is not possible, then design an incremental approach
  • Extract data in an idempotent manner
    • E.g. extract data for a specific timeframe from the data source and verify that your data store ends up with the same data for that timeframe (e.g. same row count, same distinct values in key columns…)
    • E.g. perform incremental extracts using row IDs: take the set of IDs in the data source, compute the set difference against the IDs already in your data store, and ingest only the missing rows (see the first sketch after this list)
      • Otherwise, incrementally ingest using the created/modified timestamps, but only if they are reliable. After such an extraction, window functions are needed for deduplication (keeping the latest version of each row)
  • Use a staging table for the output of an ETL job
    • E.g. if you recompute an output table, first build it as a staging table, check that its values look good, and only then replace the actual output table (see the staging-table sketch after this list)
  • Partition the data and extract it one partition at a time
    • E.g. an ETL job pulls one hour's or one day's worth of data from the source per run and writes it into the matching output partition (see the partitioned-extraction sketch after this list)
  • Build a dashboard to visualise input data
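
A minimal sketch of the row-ID set-difference extract, using Python's built-in sqlite3 module as a stand-in for the real source and target systems; the `orders` table and its columns are hypothetical.

```python
import sqlite3

def incremental_extract(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Ingest only the rows whose IDs are missing from the target (set difference)."""
    source_ids = {row[0] for row in source.execute("SELECT id FROM orders")}
    target_ids = {row[0] for row in target.execute("SELECT id FROM orders")}
    missing = source_ids - target_ids
    if not missing:
        return 0  # nothing new to do: re-running the job is a no-op, i.e. idempotent

    # For very large ID sets you would batch this IN clause instead.
    placeholders = ",".join("?" for _ in missing)
    rows = source.execute(
        f"SELECT id, customer_id, amount, created_at FROM orders WHERE id IN ({placeholders})",
        tuple(missing),
    ).fetchall()
    target.executemany(
        "INSERT INTO orders (id, customer_id, amount, created_at) VALUES (?, ?, ?, ?)",
        rows,
    )
    target.commit()
    return len(rows)
```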
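
A sketch of the staging-table pattern under the same assumptions; the `daily_revenue` table and the 90% row-count check are only illustrative validations.

```python
import sqlite3

def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None

def rebuild_daily_revenue(conn: sqlite3.Connection) -> None:
    """Recompute a derived table into a staging table, validate it, then swap it in."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue_staging")
    conn.execute(
        """
        CREATE TABLE daily_revenue_staging AS
        SELECT date(created_at) AS day, SUM(amount) AS revenue
        FROM orders
        GROUP BY date(created_at)
        """
    )
    # Sanity checks before the swap: the staging table must not be empty
    # and must not have shrunk drastically compared to the published table.
    (new_rows,) = conn.execute("SELECT COUNT(*) FROM daily_revenue_staging").fetchone()
    old_rows = 0
    if table_exists(conn, "daily_revenue"):
        (old_rows,) = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()
    if new_rows == 0 or new_rows < 0.9 * old_rows:
        raise ValueError(f"staging table failed validation: {new_rows} rows vs {old_rows} before")
    # Only after validation does the staging table replace the published one.
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute("ALTER TABLE daily_revenue_staging RENAME TO daily_revenue")
    conn.commit()
```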
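
And a sketch of per-partition extraction: each run pulls exactly one day's window from the (again hypothetical) `orders` table and writes it to its own `dt=YYYY-MM-DD` directory, so backfills and reruns only ever touch one partition.

```python
import csv
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path

def extract_daily_partition(conn: sqlite3.Connection, day: datetime, out_root: Path) -> Path:
    """Extract exactly one day's worth of rows and write it to a per-day partition."""
    start, end = day, day + timedelta(days=1)
    rows = conn.execute(
        "SELECT id, customer_id, amount, created_at FROM orders "
        "WHERE created_at >= ? AND created_at < ?",
        (start.isoformat(), end.isoformat()),
    ).fetchall()
    partition = out_root / f"dt={day:%Y-%m-%d}" / "orders.csv"
    partition.parent.mkdir(parents=True, exist_ok=True)
    with partition.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "customer_id", "amount", "created_at"])
        writer.writerows(rows)
    # Re-running a day overwrites only that day's partition, keeping the job idempotent;
    # a backfill is just a loop over the missing days.
    return partition
```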

  • Reproducibility through immutable datasets
    • Snapshot your extracts and dimensions (see the snapshot sketch after this list)
    • Have a clear data lineage
  • Favor ELT over ETL
    • Push the transformation work into the database engine
    • Land incoming data in a dedicated ingestion staging area first (see the ELT sketch after this list)
  • Favor SQL over anything else for transformation
  • Add data quality tests and measures during every ingest
    • If the data quality of some source system falls below a threshold, throw that data away (see the quality-gate sketch after this list)
  • Create data ingestion SLAs
    • E.g. a script that checks the latest ingested timestamp for each dataset and alerts when it falls outside the agreed freshness window (see the SLA sketch after this list)
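
A minimal sketch of snapshotting a dimension into a dated, write-once table, again using sqlite3 with hypothetical names; downstream jobs that join against the snapshot instead of the live table can be replayed later with identical results.

```python
import sqlite3
from datetime import date

def snapshot_dimension(conn: sqlite3.Connection, table: str, as_of: date) -> str:
    """Copy a dimension table into a dated snapshot table that is never modified again."""
    snapshot_name = f"{table}_snapshot_{as_of:%Y%m%d}"
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (snapshot_name,)
    ).fetchone()
    if exists:
        return snapshot_name  # snapshots are immutable: never overwrite an existing one
    conn.execute(f"CREATE TABLE {snapshot_name} AS SELECT * FROM {table}")
    conn.commit()
    return snapshot_name

# E.g. a transformation that reads customers_snapshot_20240101 instead of the live
# customers table produces the same output no matter when it is re-run.
```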
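
A sketch of the ELT idea: the raw file is landed in a staging table untouched, and the casting and filtering happen in SQL inside the engine; `staging_orders` and `orders_clean` are illustrative names.

```python
import csv
import sqlite3
from pathlib import Path

def load_then_transform(conn: sqlite3.Connection, raw_file: Path) -> None:
    """ELT: load the file as-is into a staging area, then transform it with SQL in the engine."""
    # 1. Load: land the raw rows without cleaning them in Python.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_orders "
        "(id TEXT, customer_id TEXT, amount TEXT, created_at TEXT)"
    )
    conn.execute("DELETE FROM staging_orders")
    with raw_file.open(newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", list(reader))
    # 2. Transform: type casting and filtering happen in SQL, where the engine
    #    does the heavy lifting.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean "
        "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, created_at TEXT)"
    )
    conn.execute(
        """
        INSERT OR REPLACE INTO orders_clean
        SELECT CAST(id AS INTEGER), CAST(customer_id AS INTEGER),
               CAST(amount AS REAL), created_at
        FROM staging_orders
        WHERE amount IS NOT NULL AND amount != ''
        """
    )
    conn.commit()
```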
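
A sketch of a per-batch quality gate; the specific checks (row count, null ratio on an `amount` column, duplicate IDs) and the thresholds are assumptions to adapt to your own data.

```python
import sqlite3

def passes_quality_checks(
    conn: sqlite3.Connection,
    batch_table: str,
    min_rows: int = 1,
    max_null_ratio: float = 0.05,
) -> bool:
    """Return False if an ingested batch falls below the agreed quality thresholds."""
    (total,) = conn.execute(f"SELECT COUNT(*) FROM {batch_table}").fetchone()
    if total < min_rows:
        return False
    (null_amounts,) = conn.execute(
        f"SELECT COUNT(*) FROM {batch_table} WHERE amount IS NULL"
    ).fetchone()
    (duplicate_ids,) = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT id) FROM {batch_table}"
    ).fetchone()
    return null_amounts / total <= max_null_ratio and duplicate_ids == 0

# In the ingest job: if the checks fail, drop or quarantine the batch instead of
# letting it pollute the downstream tables.
```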
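
Finally, a sketch of a freshness check behind an ingestion SLA; it assumes a hypothetical `ingested_at` column stored as an ISO-8601 timestamp with a UTC offset.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def meets_freshness_sla(conn: sqlite3.Connection, table: str, max_lag: timedelta) -> bool:
    """Check that the newest ingested row is no older than the agreed SLA window."""
    (latest,) = conn.execute(f"SELECT MAX(ingested_at) FROM {table}").fetchone()
    if latest is None:
        print(f"SLA BREACH: {table} contains no data at all")
        return False
    lag = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
    if lag > max_lag:
        print(f"SLA BREACH: {table} is {lag} behind (allowed: {max_lag})")
        return False
    return True

# Run this from a scheduler (e.g. every 15 minutes) and wire the failure path into
# whatever alerting channel the team already uses.
```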