A repository showing how to run data-preprocessing pipelines on Google Cloud Dataflow using Apache Beam in Python.
- This repo contains two separate preprocessing procedures built as Apache Beam pipelines that run on Dataflow.
- Jobs can be submitted to Dataflow either directly from a notebook or through a Python script.
- The same Apache Beam pipelines can also be run locally by toggling a single runner parameter (see the sketch after this list).
- The notebooks referred to above are Google Cloud Vertex AI Workbench notebooks running in Google Cloud with an Apache Beam environment.
- The Python scripts are likewise run from the Vertex AI notebook environment.
- When specifying paths to files or directories in Google Cloud Storage buckets, always give the full path starting with gs://.
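As a rough illustration of the runner toggle, the sketch below builds pipeline options for either the local DirectRunner or the DataflowRunner. The flag name (`--dataflow-runner`), the region, and the GCS temp/staging paths are assumptions for this example, not the repo's actual code.

```python
# Minimal sketch of switching between DirectRunner and DataflowRunner.
# Flag names and option values here are illustrative only.
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True)
parser.add_argument("--bucket", required=True)
parser.add_argument("--dataflow-runner", action="store_true",
                    help="Submit the job to Dataflow instead of running locally")
args, beam_args = parser.parse_known_args()

if args.dataflow_runner:
    options = PipelineOptions(
        beam_args,
        runner="DataflowRunner",
        project=args.project,
        region="us-central1",                      # assumed region
        temp_location=f"gs://{args.bucket}/tmp",   # GCS paths must start with gs://
        staging_location=f"gs://{args.bucket}/staging",
    )
else:
    options = PipelineOptions(beam_args, runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "dataflow"])
     | "Print" >> beam.Map(print))
```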
pip install -r requirements.txt
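The exact dependencies are listed in requirements.txt; for running Beam on Dataflow, the SDK is typically installed with the GCP extras, for example:

```
pip install "apache-beam[gcp]"
```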
- Reads CSV data stored in BigQuery, groups it by a field, and writes details of each DataFrame created to the specified Google Cloud Storage output directory (a rough sketch follows the run commands below).
- The notebook can also read data from CSV files and push it into a BigQuery table.
- Execute the following commands after navigating to csv_reader/src/:
python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --direct-runner
python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --dataflow-runner
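The sketch below shows the general shape of such a pipeline: read rows from BigQuery, group them on a field, and write one summary per group under the gs:// output path. The grouping column name (`category`) and the output format are assumptions; the real logic and schema live in csv_reader/src/.

```python
# Illustrative sketch only; the actual pipeline is in csv_reader/src/.
import apache_beam as beam

def run(bq_table, output_path, options):
    with beam.Pipeline(options=options) as p:
        (p
         # Read rows from the BigQuery table as dictionaries.
         | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=bq_table)
         # Key each row by the grouping field ("category" is an assumed column name).
         | "KeyByField" >> beam.Map(lambda row: (row["category"], row))
         # Collect all rows that share the same key.
         | "GroupByField" >> beam.GroupByKey()
         # Summarise each group as a single CSV line: key and row count.
         | "Summarise" >> beam.Map(lambda kv: f"{kv[0]},{sum(1 for _ in kv[1])}")
         # Write the per-group summaries to the gs:// output directory.
         | "WriteToGCS" >> beam.io.WriteToText(f"{output_path}/groups",
                                               file_name_suffix=".csv"))
```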
- Reads a large input text file and applies multiple preprocessing steps to it (a rough sketch follows the run commands below).
- The output of this pipeline is a CSV file containing the cleaned text data and labels.
- Execute the following commands after navigating to text_parsing/src/:
python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input> --output-dir=<output-dir> --direct-runner
python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input> --output-dir=<output-dir> --dataflow-runner
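A minimal sketch of this kind of text-cleaning pipeline is shown below. The assumed input format (label and text separated by a tab) and the specific cleaning steps are illustrative assumptions; the actual preprocessing is defined in text_parsing/src/.

```python
# Illustrative sketch only; the real preprocessing steps are in text_parsing/src/.
import re
import apache_beam as beam

def clean_line(line):
    # Assumed input format: "<label>\t<text>" per line.
    label, text = line.split("\t", 1)
    text = text.lower()                       # normalise case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation/special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return f'"{text}",{label}'                # CSV row: cleaned text, label

def run(input_path, output_dir, options):
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadText" >> beam.io.ReadFromText(input_path)
         | "Clean" >> beam.Map(clean_line)
         | "WriteCSV" >> beam.io.WriteToText(f"{output_dir}/cleaned",
                                             file_name_suffix=".csv",
                                             header="text,label"))
```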