Google-Cloud-Dataflow-Pipelines

A repository showing how to use Google Cloud Dataflow pipelines for data preprocessing with Apache Beam in Python.

  • This repo contains two different procedures, both built as Apache Beam pipelines that run on Dataflow.
  • Jobs can be submitted to Dataflow either directly from a notebook or using a Python script.
  • The same Apache Beam pipelines can also be run locally by toggling a single pipeline parameter (the runner), as sketched below.
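
A minimal sketch of how such a runner toggle can be wired up with Beam's PipelineOptions (the argparse wiring and the project, region, and bucket values are illustrative assumptions, not the repo's actual code; the repo's scripts expose --direct-runner and --dataflow-runner flags):

import argparse
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument('--dataflow-runner', action='store_true')  # default: run locally
args, beam_args = parser.parse_known_args()

# The only thing that changes between a local run and a Dataflow run is the
# runner, plus the temp location Dataflow needs for staging.
options = PipelineOptions(
    beam_args,
    runner='DataflowRunner' if args.dataflow_runner else 'DirectRunner',
    project='my-project',                # hypothetical project id
    region='us-central1',                # hypothetical region
    temp_location='gs://my-bucket/tmp',  # required when running on Dataflow
)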

Important:

  • The notebooks mentioned above are Google Cloud Vertex AI Workbench notebooks, run on Google Cloud with an Apache Beam environment.
  • All the Python scripts are also run from the Vertex AI notebook environment.
  • When specifying paths of files or directories in Google Cloud Storage buckets, always use the full path starting with gs:// (for example, gs://my-bucket/data/input.csv).

Install Dependencies

pip install -r requirements.txt

CSV Reader

  • Reads CSV data stored in BigQuery, groups it by a field, and writes the details of each resulting DataFrame to the specified Google Cloud Storage output directory (see the sketch after the commands below).
  • The notebook can also read data from CSV files and push it into a BigQuery table.
  • Execute the following commands after navigating to csv_reader/src/.

Running locally

python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --direct-runner

Running on Dataflow

python3 runner.py --project=<project-id> --bucket=<bucket-name> --bigquery-table=<bigquery-table-id> --output-path=<output-dir> --dataflow-runner
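
Below is a minimal sketch of what such a read-group-write pipeline can look like in Apache Beam (the table id, the grouping column 'category', and the per-group summary written out are illustrative assumptions, not the repo's actual code):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(table, output_path, options):
    # Read rows from BigQuery as dicts, group them by one column, and write
    # a per-group summary line to the GCS output path.
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(table=table)
         | 'KeyByField' >> beam.Map(lambda row: (row['category'], row))
         | 'GroupByField' >> beam.GroupByKey()
         | 'SummarizeGroup' >> beam.Map(lambda kv: f'{kv[0]},{len(list(kv[1]))}')
         | 'WriteToGCS' >> beam.io.WriteToText(output_path, file_name_suffix='.csv'))

if __name__ == '__main__':
    run('my-project:my_dataset.my_table',   # hypothetical BigQuery table id
        'gs://my-bucket/output/groups',     # hypothetical GCS output prefix
        PipelineOptions(project='my-project',
                        temp_location='gs://my-bucket/tmp'))  # DirectRunner by default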

Text Parser

  • Reads a large input text file and applies multiple preprocessing steps to it.
  • The output of this pipeline is a CSV file containing the cleaned text data and labels (see the sketch after the commands below).
  • Execute the following commands after navigating to text_parsing/src/.

Running locally

python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input> --output-dir=<output-dir> --direct-runner

Running on Dataflow

python3 runner.py --project=<project-id> --bucket=<bucket-name> --input-path=<path-to-input> --output-dir=<output-dir> --dataflow-runner
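
A minimal sketch of the cleaning stage (the "label<TAB>text" input layout and the specific cleaning steps are illustrative assumptions, not the repo's actual preprocessing):

import re
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_line(line):
    # Split off the label, then lowercase and strip the text down to plain words.
    label, text = line.split('\t', 1)          # hypothetical input layout
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)   # drop punctuation and symbols
    text = re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace
    return f'{text},{label}'                   # one CSV row: cleaned text, label

def run(input_path, output_dir, options):
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'ReadText' >> beam.io.ReadFromText(input_path)
         | 'CleanText' >> beam.Map(clean_line)
         | 'WriteCsv' >> beam.io.WriteToText(
               output_dir + '/cleaned', file_name_suffix='.csv'))

if __name__ == '__main__':
    run('gs://my-bucket/input/data.txt',   # hypothetical input file
        'gs://my-bucket/output',           # hypothetical output directory
        PipelineOptions())                 # DirectRunner by default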
