Skip to the content.

pyDag

Check the article here: How to build a DAG based Task Scheduling tool for Multiprocessor systems using python

Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

pyDag’s Architecture for a Multiprocessor Machine.

An object of the pyDag class contains everything mentioned below, this is an whole overview of the architecture.

Let’s run an example

GCP - API credentials in JSON

GCP - BigQuery

GCP - Dataproc

LOCAL MACHINE

ramse@DESKTOP-K6K6E5A MINGW64 /c
$ git clone https://github.com/Wittline/pyDag.git

ramse@DESKTOP-K6K6E5A MINGW64 /c
$ cd pyDag

ramse@DESKTOP-K6K6E5A MINGW64 /c/pyDag
$ cd code

ramse@DESKTOP-K6K6E5A MINGW64 /c/pyDag/code
$ cd apps

@DESKTOP-K6K6E5A MINGW64 /c/pyDag/code/apps
$ docker-compose up

Let’s explain the example

There are many configurations for the DAG that could work for this example, the most appropriate and the shortest is the second approach shown in the image below, I discarded the first approach, both approaches achieve the same goal, but, with the second approach there is more chances to take advantage of the parallelism and improve the overall latency.

This example is just to demonstrate that this tool can reach various levels of granularity, the example can be built in fewer steps, in fact using a single query against BigQuery, but it is a very simple example to see how it works.

Checking Tables After the execution

Checking logs After the execution

Next steps

In order to have an acceptable product with the minimum needed features, I will be working on adding the following:

Conclusions

You can clearly observe that in all cases there are two tasks taking a long time to finish “startup_dataproc_1” and “initial_ingestion_1” both related with the use of Google DataProc, one way to avoid the use of tasks that create Clusters in DataProc is by keeping an already cluster created and keeping it turned on waiting for tasks, with horizontally scaling, this is highly recommended for companies that has a high workloads by submitting tasks where there will be no gaps of wasted and time and resources.

You can see the effect of the caching in the executions, short tasks are shorter in cases where the cache is turned on.

Although the parallelism in tasks execution can be confirmed, we could assign a fixed number of processors per DAG, which represents the max number of tasks that could be executed in parallel in a DAG or maximum degree of parallelism, but this implies that sometimes there are processors that are being wasted, one way to avoid this situation is by assigning a dynamic number of processors, that only adapts to the number of tasks that need to be executed at the moment, in this way multiple DAGS can be executed on one machine and take advantage of processors that are not being used by other DAGS. The only issue with the above chart is that these results coming from one execution for each case, multiple executions should be done for each case and take an average time on each case, but I don’t have the enough budget to be able to do this kind of tests, the code is still very informal, and it’s not ready for production, I’ll be working on these details in order to release a more stable version.

Contributing and Feedback

Any ideas or feedback about this repository?. Help me to improve it.

Authors

License

This project is licensed under the terms of the Apache License.