Figuring out the basics of Data Pipelines
A data-pipelining sandbox is how I would put it. Data pipelines can be built with several frameworks, each serving a different use case. This project makes it easy to try those frameworks out and understand the basics of each of them.
- Apache Superset - Data Visualization
- Apache Airflow - Workflow Orchestration
The current data source for testing these pipelines is the database directory, which does two things -
- Start a PostgreSQL database with a simple schema mimicking an online e-commerce store
- Start a data generator that populates the PostgreSQL database with artificial "orders"
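To make the idea concrete, here is a minimal sketch of what such a data generator could look like. The connection details, the orders table, and its columns are assumptions made purely for illustration; the real schema and generator live in the database directory.

```python
# Sketch of a data generator that inserts artificial "orders" into PostgreSQL.
# Connection details and the orders table/columns are hypothetical.
import random
import time

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="ecommerce",   # hypothetical database name
    user="postgres",
    password="postgres",
)

while True:
    with conn, conn.cursor() as cur:
        # Insert one artificial order per iteration; schema is illustrative only.
        cur.execute(
            """
            INSERT INTO orders (customer_id, product_id, quantity, amount)
            VALUES (%s, %s, %s, %s)
            """,
            (
                random.randint(1, 100),
                random.randint(1, 50),
                random.randint(1, 5),
                round(random.uniform(5.0, 500.0), 2),
            ),
        )
    time.sleep(1)  # roughly one order per second
```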
Each framework will be defined in its own directory. Each framework will also have a Dockerfile and a docker-compose.yaml that can be used for easy setup and quick testing.
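In practice, running something like `docker compose up --build` from within a framework's directory should bring that service up, though the exact invocation may differ per framework and compose file.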
[airflow] - Will start the Airflow service and contain a simple DAG for reporting (see the sketch after this list)
[superset] - Will start the Superset service that can be used for Data Visualization
[database] - The data source and data generator to test out the frameworks
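The sketch below shows what a simple reporting DAG could look like, assuming Airflow 2.x. The DAG id, schedule, and reporting logic are hypothetical placeholders; the actual DAG lives in the airflow directory.

```python
# Minimal sketch of a daily reporting DAG (Airflow 2.x style).
# DAG id, schedule, and task logic are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def generate_report():
    # Placeholder for real reporting logic, e.g. aggregating the
    # artificial orders from the PostgreSQL database.
    print("Generating daily orders report...")


with DAG(
    dag_id="orders_report",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    report = PythonOperator(
        task_id="generate_report",
        python_callable=generate_report,
    )
```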
Personally, I have two goals -
- Having an easy setup for myself to quickly test a framework that I come across, whether I read about it or hear someone talk about it.
- Improving my code quality and writing better code that can be reused. I plan on doing this by -
  - Writing complete tests for the simple pipelines that I write.
- Making it easily accessible by Dockerizing it
  - Writing good documentation (Pending)