Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support to read and write to Databricks delta tables #630

Merged
merged 35 commits into from
Oct 5, 2023

Conversation

pthalasta
Copy link
Contributor

Adds support for querying data from Databricks delta tables and write the results to delta tables. Resolves #611

@pthalasta pthalasta requested a review from a team as a code owner January 24, 2023 21:16
@ghost
Copy link

ghost commented Jan 24, 2023

Pull requests from external contributors require approval from a nv-morpheus organization member with write or admin permissions before CI can begin.

@efajardo-nv efajardo-nv added non-breaking Non-breaking change feature request New feature or request labels Jan 24, 2023
@raykallen
Copy link
Contributor

/ok to test

@raykallen
Copy link
Contributor

/ok to test

Copy link
Contributor

@mdemoret-nv mdemoret-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions are inline. Biggest issue is that there are no tests that exercise these new stages. To guarantee that they will continue to work, we need tests built that run these stages.

docker/Dockerfile Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
morpheus/stages/output/write_to_deltalake_stage.py Outdated Show resolved Hide resolved
morpheus/stages/output/write_to_deltalake_stage.py Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
morpheus/stages/output/write_to_deltalake_stage.py Outdated Show resolved Hide resolved
morpheus/stages/input/deltalake_source_stage.py Outdated Show resolved Hide resolved
@pthalasta
Copy link
Contributor Author

will take a look at the changes

@mdemoret-nv mdemoret-nv changed the base branch from branch-23.01 to branch-23.11 July 14, 2023 18:59
@BartleyR
Copy link
Contributor

@pthalasta Pinging here to resurface. This would be a good stage to add once issues are addressed.

@pthalasta
Copy link
Contributor Author

@mdemoret-nv i can resolve the comments except the usage of JDK as it is a requirement and regardless of which source we install it from there will be increase in container size. As this stage reads data from pyspark via databricks-connect, i don't think it will work without JDK. There are early stage python libs to read from delta without any involvement of databricks/spark, if JDK alternative is a must, i can probably test with early phase python libs, please let me know.

@gurpreets-nvidia
Copy link

@cwharris @pthalasta @mdemoret-nv : what are next steps with this MR? Are any discussions happening offline to find a way to resolve the JDK issue and close/merge this MR?

@cwharris
Copy link
Contributor

@gurpreets-nvidia we are indeed discussions options. It looks like adding the JRE, JDK, and Databricks Connect will introduce an additional ~300MB to the release container. We're currently in the process of determining what CVE's would be introduced. Once we understand the CVE implications we can determine whether including it is the right way to go or if we need to look for a workaround.

@pthalasta
Copy link
Contributor Author

@cwharris i see two options

  • Load delta table into memory using python based delta table libs and then filter data the data. This doesn’t require spark so no, JDK. However, there can be memory issues in certain cases if we try to load all the data without filtering
  • Pre-filter data using spark or distributed compute that can read the delta tables and share the resulting df as done in the approach above and use the result df as needed in downstream jobs.

Let me know how to proceed in closing on this MR

@pthalasta pthalasta closed this Aug 30, 2023
@pthalasta pthalasta deleted the branch-23.01 branch August 30, 2023 18:58
@pthalasta pthalasta restored the branch-23.01 branch August 30, 2023 22:24
@pthalasta pthalasta reopened this Aug 30, 2023
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 30, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pthalasta pthalasta reopened this Oct 2, 2023
@ajschmidt8
Copy link
Collaborator

/ok to test

@dagardner-nv
Copy link
Contributor

/ok to test

@dagardner-nv
Copy link
Contributor

/ok to test

@efajardo-nv
Copy link
Contributor

/ok to test

@cwharris
Copy link
Contributor

cwharris commented Oct 3, 2023

/ok to test

@efajardo-nv
Copy link
Contributor

/ok to test

@cwharris
Copy link
Contributor

cwharris commented Oct 3, 2023

/ok to test

@dagardner-nv
Copy link
Contributor

/ok to test

@cwharris
Copy link
Contributor

cwharris commented Oct 4, 2023

/ok to test

@pthalasta
Copy link
Contributor Author

@cwharris i think we have all tests and checks passing. There was no change to documentation part, so not sure if that is needed?

@pthalasta
Copy link
Contributor Author

@dagardner-nv i see that the documentation is failing due to import of stages by sphinx. Is this expected?

@dagardner-nv
Copy link
Contributor

/ok to test

@dagardner-nv
Copy link
Contributor

/ok to test

@pthalasta
Copy link
Contributor Author

@mdemoret-nv @cwharris can we merge if there are no other changes required and all looks good?

@mdemoret-nv
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 36ccb8e into nv-morpheus:branch-23.11 Oct 5, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA]: Support reading data from delta lake tables
10 participants