
General guidelines for designing batch jobs

  • Have a config-driven and easily extensible library for your ETL
  • Break the process down into components
    • Follow the UNIX philosophy: small steps, each doing one thing well
    • E.g. the first step is ingestion, the second takes the ingested data and filters it… (see the pipeline sketch after this list)
  • Idempotent and deterministic ETL tasks
    • Have no side effects
    • Use immutable storage
    • Usually target a single partition
    • Never UPDATE, INSERT, or DELETE (no mutations)
    • Limit the number of source partitions (no wide-scans)
    • Transform your data, don’t overwrite
    • All jobs must be idempotent, meaning that no matter how many times you run a job, the state after N runs is the same as after N + M runs (see the partition-write sketch after this list)
  • Use configuration files to define your ETL
    • Do not code to a specific data source, query, file format, location…
    • Keep the job's dependencies in an external file
      • e.g. inputs, outputs, output file format… (see the config sketch after this list)
  • Engineer to anticipate failures
    • Use retries (potentially with exponential backoff; see the retry sketch after this list)
      • Expect external connections to fail
      • Expect data not to be there
  • The workflow tool you use needs to have (see the example DAG after this list):
    • Basic dependency management
    • Clarity on status
    • Scheduling (eg. similar to crontab)
    • Easy access to logs
    • Parametrized retries
    • Notifications (errors, success)
    • Retries of jobs
  • Make sure your code base handles:
    • Testing - test the code, and especially have integration tests (see the test sketch after this list)
    • Logging - log everything (ideally ship the logs to a central place)
    • Packaging
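A few minimal sketches of the points above follow; all names, paths, schedules and addresses in them are invented for illustration. First, the "small composable steps" idea: each stage is a plain function that takes records in and returns records out, and the pipeline is just their composition.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def ingest(_: Iterable[Record]) -> Iterable[Record]:
    # First step: read raw records from the source (hard-coded here for illustration).
    return [{"user": "a", "amount": 10}, {"user": None, "amount": 3}]

def filter_invalid(records: Iterable[Record]) -> Iterable[Record]:
    # Second step: drop records that fail basic validation.
    return [r for r in records if r["user"] is not None]

def run_pipeline(steps: list[Step]) -> Iterable[Record]:
    # Compose the steps left to right; each step only sees the previous step's output.
    return reduce(lambda data, step: step(data), steps, [])

print(run_pipeline([ingest, filter_invalid]))  # [{'user': 'a', 'amount': 10}]
```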
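An idempotent, deterministic write: each run targets exactly one output partition and replaces it wholesale instead of appending or updating rows, so the state after N runs is the same as after N + M runs. The `ds=` path layout is just an assumption for the sketch.

```python
import json
import shutil
import tempfile
from pathlib import Path

def write_partition(records: list[dict], warehouse: Path, ds: str) -> Path:
    """(Re)write exactly one output partition for run date `ds`."""
    warehouse.mkdir(parents=True, exist_ok=True)
    target = warehouse / f"ds={ds}"            # a single partition is targeted per run
    staging = Path(tempfile.mkdtemp())         # build the new version somewhere else first
    (staging / "part-0000.json").write_text("\n".join(json.dumps(r) for r in records))
    if target.exists():
        shutil.rmtree(target)                  # drop the previous version of the partition
    shutil.move(str(staging), str(target))     # swap in the freshly built partition
    return target

# write_partition(rows, Path("/data/events_clean"), ds="2020-01-01")
```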
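Keeping the source, destination, format and schedule out of the code: the job only knows the shape of a configuration, and everything specific lives in a file. The field names and the example file are hypothetical; a real setup would likely use YAML and a richer schema.

```python
import json
from dataclasses import dataclass

@dataclass
class JobConfig:
    inputs: list[str]       # source tables/paths the job depends on
    output: str             # where the result goes
    output_format: str      # e.g. "parquet", "json"
    schedule: str           # cron-style expression

def load_config(path: str) -> JobConfig:
    # The code never hard-codes a data source or location; it only reads the config.
    with open(path) as f:
        return JobConfig(**json.load(f))

# Example contents of jobs/daily_clicks.json (hypothetical):
# {
#   "inputs": ["raw/clicks"],
#   "output": "clean/clicks",
#   "output_format": "parquet",
#   "schedule": "0 3 * * *"
# }
```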
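Anticipating failure: wrap calls to flaky dependencies (external connections, data that is not there yet) in retries with exponential backoff and jitter. A minimal decorator sketch:

```python
import logging
import random
import time
from functools import wraps

log = logging.getLogger(__name__)

def retry(max_attempts: int = 5, base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry a flaky call with exponential backoff and jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    delay += random.uniform(0, delay / 2)   # jitter to avoid thundering herds
                    log.warning("%s failed (%s), retry %d/%d in %.1fs",
                                fn.__name__, exc, attempt, max_attempts, delay)
                    time.sleep(delay)
        return wrapper
    return decorator

@retry(max_attempts=3)
def fetch_source_partition(path: str) -> bytes:
    ...  # e.g. an S3/HTTP read that may transiently fail
```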
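For the workflow-tool requirements, Apache Airflow is one widely used option that covers dependency management, crontab-style scheduling, parametrized retries, log access, and notifications. The DAG below is only an illustration; the task names, callables, e-mail address and schedule are invented.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):      # hypothetical step: pull raw data for the run date
    ...

def transform(**context):   # hypothetical step: build the clean partition
    ...

default_args = {
    "retries": 3,                                  # parametrized retries
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "email_on_failure": True,                      # notifications on errors
    "email": ["data-alerts@example.com"],
}

with DAG(
    dag_id="daily_clicks",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 3 * * *",                 # crontab-style scheduling
    default_args=default_args,
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_ingest >> t_transform                        # basic dependency management
```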
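Finally, a pytest-style integration test sketch: run a trivial stand-in transform end to end against a small fixture file and assert on the written output, rather than only unit-testing helpers (the `tmp_path` fixture is provided by pytest).

```python
import json
from pathlib import Path

def clean(records: list[dict]) -> list[dict]:
    # Trivial stand-in transform for the example.
    return [r for r in records if r.get("user") is not None]

def test_clean_drops_invalid_records(tmp_path: Path):
    # Integration-style: read a fixture file, transform it, write and check the output.
    fixture = tmp_path / "raw.json"
    fixture.write_text(json.dumps([{"user": "a"}, {"user": None}]))
    out = clean(json.loads(fixture.read_text()))
    (tmp_path / "clean.json").write_text(json.dumps(out))
    assert json.loads((tmp_path / "clean.json").read_text()) == [{"user": "a"}]
```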