
Dataproc Templates

Dataproc Templates aim to solve simple but large in-cloud data tasks, including data import, export, backup, and restore, as well as bulk API operations. Under the hood, these operations run on the serverless Spark functionality of Google Cloud's Dataproc.

Google provides this collection of pre-implemented Dataproc templates as a reference that developers can customize and extend. (Video Link)

Open in Cloud Shell

Dataproc Templates (Java - Spark)

Please refer to the Dataproc Templates (Java - Spark) README for more information.

Dataproc Templates (Python - PySpark)

Please refer to the Dataproc Templates (Python - PySpark) README for more information.

Dataproc Templates (Notebooks)

Please refer to the Dataproc Templates (Notebooks) README for more information.

Getting Started

  1. Clone this repository

     git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
    
  2. Obtain authentication credentials

    Create local credentials by running the following command and completing the OAuth 2.0 flow (read more about the command here).

     gcloud auth application-default login
    

    Or manually set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a service account key JSON file path.

    Learn more at Setting Up Authentication for Server to Server Production Applications.

Note: Application Default Credentials can find the credentials implicitly when the application runs on Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.

  3. Executing a Template

    Follow the README listed above that matches your use case.
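As a rough illustration of the credential options in step 2, the sketch below reports which local credential source would be found first: a key file named by GOOGLE_APPLICATION_CREDENTIALS, then the file written by `gcloud auth application-default login`. The helper `find_credential_source` is a hypothetical name introduced here, not part of this repository. It assumes the default gcloud configuration directory on Linux or macOS, and it only inspects the environment and filesystem without calling any Google API.

```python
import os
from pathlib import Path

# Default location written by `gcloud auth application-default login`
# on Linux and macOS (assumption: the gcloud config directory was not moved).
WELL_KNOWN_FILE = Path(".config") / "gcloud" / "application_default_credentials.json"


def find_credential_source(env, home):
    """Report which local credential source would be picked up first.

    Hypothetical helper for illustration only. `env` is a mapping of
    environment variables and `home` is the user's home directory.
    """
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        # An explicitly configured key file takes precedence.
        if Path(key_path).is_file():
            return "service account key file: " + key_path
        return "GOOGLE_APPLICATION_CREDENTIALS is set but the file is missing: " + key_path

    # Otherwise fall back to the user credentials created by gcloud.
    user_file = Path(home) / WELL_KNOWN_FILE
    if user_file.is_file():
        return "gcloud user credentials: " + str(user_file)

    # Nothing local; on Compute Engine and similar services the attached
    # service account would be used instead.
    return "no local credentials; relying on the attached service account, if any"


if __name__ == "__main__":
    print(find_credential_source(os.environ, Path.home()))
```

Running the script before submitting a template is a quick way to confirm that one of the two options from step 2 is in place on your machine.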

Flow diagram

The following flow diagram shows the execution flow for Dataproc Templates:

Dataproc templates flow diagram

Contributing

See the contributing instructions to get started.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

Contact

Share your feedback, ideas, and thoughts through the feedback-form.

Questions, issues, and comments should be directed to [email protected]
