Modern data engineering stack
Data storage
- Data lake: an object store + data stored in a columnar format (e.g. Parquet)
- examples:
- S3 + Athena (Presto); a minimal query sketch follows after this list
- PROs:
- cheaper than a data warehouse
- easier to store very complicated data (eg. arrays in a column) than in most data warehouses
- CONs
- worse performance than a data warehouse
- can become very messy without a good data catalog and understanding of schemas
- I would always choose a data warehousing solution over Parquet files in Object storage, though Apache Iceberg is a very promising solution in this arena…
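As a rough illustration of the S3 + Athena setup, here is a minimal sketch of querying Parquet files in S3 through Athena with boto3. The region, database, table and result bucket names are hypothetical placeholders.

```python
# Minimal sketch: run an Athena query over Parquet files in S3 with boto3.
# Region, database, table and result bucket are hypothetical placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # hypothetical region

execution = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS cnt FROM raw_events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics_lake"},               # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Athena is asynchronous, so poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row holds the column headers
        print([col.get("VarCharValue") for col in row["Data"]])
```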
- Data warehouse
- examples:
- Snowflake
- Google BigQuery
- Amazon Redshift
- ClickHouse (a great solution, but not ANSI SQL compliant and with disappointing window function support as of 2023!)
- PROs
- very fast and efficient
- CONs
- more expensive storage
- less flexible schemas
- Note
- prefer data warehouses where the storage is fundamentally decoupled from the compute (e.g. use Snowflake and not Redshift); a minimal Snowflake query sketch follows after this list
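To make the warehouse side concrete, here is a minimal sketch of querying Snowflake from Python with snowflake-connector-python; the account, credentials, warehouse and table names are hypothetical placeholders.

```python
# Minimal sketch: query Snowflake from Python with snowflake-connector-python.
# Account, credentials and object names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="analytics_user",
    password="***",
    warehouse="ANALYTICS_WH",  # compute is decoupled from storage, so the warehouse can be resized freely
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY event_type")
    for event_type, cnt in cur.fetchall():
        print(event_type, cnt)
finally:
    conn.close()
```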
- Things to avoid:
- using a traditional RDBMS (SQL Server, Oracle, PostgreSQL) for storing data on which only analytics will be done
- the traditional databases do not scale well (though for smaller data I would still entertain them) and are simply not optimised for this use case, no matter how many optimisations they now offer (parallel execution, materialized views, column store indexes, roll-ups using upserts, B+ tree indexes…)
Data ingestion
- ELT vs ETL
- Always prefer the ELT approach
- with an ELT approach you are more flexible, the loading process is faster and data warehouses are better at doing transformations (in a batch manner) compared to ETL tools (see the ELT sketch below)
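A minimal ELT sketch, assuming a Snowflake warehouse and hypothetical stage/table names: the raw files are loaded untouched first, and the transformation then runs inside the warehouse as plain SQL.

```python
# Minimal ELT sketch (assumes Snowflake; stage, table and credential names are hypothetical).
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="loader", password="***",
                                    warehouse="LOAD_WH", database="RAW", schema="PUBLIC")
cur = conn.cursor()

# L: land the raw JSON in the warehouse exactly as it arrived, into a single VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS raw.public.orders_raw (v VARIANT)")
cur.execute("""
    COPY INTO raw.public.orders_raw
    FROM @raw.public.landing_stage/orders/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# T: transform afterwards, in batch, using the warehouse's own engine.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.public.orders AS
    SELECT
        v:order_id::NUMBER          AS order_id,
        v:amount::NUMBER(10, 2)     AS amount,
        v:created_at::TIMESTAMP     AS created_at
    FROM raw.public.orders_raw
""")
conn.close()
```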
- Depending on the data source
- Is the data source from outside of your organisation and somewhat “popular” (e.g. Stripe, Adyen, Blaze…)?
- Easy solution: use companies that have built these connectors, e.g. Fivetran, Stitch… though the final decision depends on your budget
- On-premise, custom data source
- write your own connectors, probably simplest in Python (see the sketch after this list)
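A minimal sketch of such a hand-written connector: pull from the source's API with requests and land the raw payload in object storage with boto3, ready for the ELT step. The URL, token and bucket are hypothetical placeholders.

```python
# Minimal sketch of a custom connector: extract from an internal API, land raw JSON in S3.
# The API URL, token and bucket are hypothetical placeholders.
import datetime
import json

import boto3
import requests

API_URL = "https://internal-erp.example.com/api/orders"  # hypothetical internal API
S3_BUCKET = "company-data-lake"                           # hypothetical bucket


def extract() -> list[dict]:
    response = requests.get(API_URL, headers={"Authorization": "Bearer ***"}, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records: list[dict]) -> None:
    # Land the data untouched (ELT): transformations happen later, in the warehouse.
    key = f"raw/orders/{datetime.date.today().isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )


if __name__ == "__main__":
    load(extract())
```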
Tasks scheduling
- Ideally you could use an automated data movement platform (e.g. Fivetran, Stitch, Meltano…) for most of the data ingestion tasks
- For other custom tasks I like to use Airflow, ideally in a cloud setting (so that Airflow is managed by the cloud provider) and with the Kubernetes executor (which allows it to scale very easily); a minimal DAG sketch follows after this list
- For managing and executing SQL views and transformations use dbt. It’s a phenomenal tool.
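A minimal Airflow DAG sketch (Airflow 2.4+ TaskFlow style) tying this together: a custom ingestion task followed by a dbt run. The DAG id, schedule, my_connectors import and dbt project directory are hypothetical; with the Kubernetes executor each task would simply run in its own pod.

```python
# Minimal Airflow DAG sketch: ingest with a custom connector, then run dbt.
# DAG id, schedule, the my_connectors import and the dbt project dir are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def daily_ingestion():
    @task
    def ingest_orders():
        # Call the custom connector sketched earlier (hypothetical module path).
        from my_connectors.orders import extract, load
        load(extract())

    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --project-dir /opt/dbt/analytics",  # hypothetical project dir
    )

    ingest_orders() >> run_dbt


daily_ingestion()
```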