HyperAI integration: orchestrator and service connector #2372

Merged
140 commits merged into zenml-io:develop on Feb 6, 2024


christianversloot
Contributor

@christianversloot christianversloot commented Jan 29, 2024

Describe changes

HyperAI (hyperai.ai) has recently become one of our suppliers of GPU instances. Unlike the major public cloud providers, HyperAI does not yet offer an SDK comparable to the SageMaker SDK. It is nevertheless critical for us to keep using ZenML, as many of our pipelines are built with it. This PR implements a HyperAI integration consisting of an orchestrator and a service connector.

Service connector:

  • The service connector is effectively an SSH-based service connector. Using paramiko, it provides an authenticated SSHClient given the configured ip_address (with an optional instance_name serving as a nickname to distinguish multiple IP addresses), username, base64_ssh_key and, optionally, ssh_passphrase.
  • The ZenML CLI does not support multiline input. That is why the key must be provided in Base64-encoded form; the service connector decodes it before use.
  • It supports multiple key types: RSA, DSA (DSS), ECDSA and Ed25519.
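As a rough illustration of the Base64 round trip and client creation described above, here is a minimal sketch. It is not the code from this PR: the function names, the `key_class_name` parameter and the 30-second timeout are hypothetical, and it assumes a paramiko version that exposes the chosen key class (for example `RSAKey`, `ECDSAKey` or `Ed25519Key`).

```python
import base64
import io


def encode_key_for_cli(pem_text: str) -> str:
    """Turn a multi-line private key into a single-line Base64 string."""
    return base64.b64encode(pem_text.encode("utf-8")).decode("ascii")


def decode_key_from_config(base64_ssh_key: str) -> str:
    """Recover the original multi-line key text from its Base64 form."""
    return base64.b64decode(base64_ssh_key).decode("utf-8")


def make_ssh_client(
    hostname,
    username,
    base64_ssh_key,
    key_class_name="Ed25519Key",  # hypothetical parameter, not from the PR
    ssh_passphrase=None,
):
    """Return a connected paramiko SSHClient (illustrative only)."""
    # Imported lazily so the two helpers above work without paramiko.
    import paramiko

    # Look up the key class by name, e.g. paramiko.RSAKey.
    key_class = getattr(paramiko, key_class_name)
    key_file = io.StringIO(decode_key_from_config(base64_ssh_key))
    pkey = key_class.from_private_key(key_file, password=ssh_passphrase)

    client = paramiko.SSHClient()
    # Accepts unknown host keys automatically; this trades security
    # for convenience and should be a deliberate choice.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname=hostname, username=username, pkey=pkey, timeout=30
    )
    return client
```

A user would run something like `encode_key_for_cli(open(path).read())` once and pass the resulting single-line string as base64_ssh_key when registering the connector.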

The orchestrator:

  • Generates a Docker Compose file that uses the depends_on condition service_completed_successfully to guarantee the order of execution. Dependencies are derived from step.spec.upstream_steps, so more complex pipelines are supported as well.
  • Uses the provided SSH client to upload the file to the instance and then executes it. This assumes only a very lightweight setup on the instance, with Docker and Docker Compose installed.
  • Supports scheduled pipelines (which additionally assumes that a Cron daemon is running).
  • Has a built-in cleanup mechanism for non-scheduled pipeline runs: when a new pipeline run starts, runs older than 7 days are deleted automatically. Scheduled pipeline runs are left untouched, following the paradigm that users are responsible for cleaning those up themselves (per the ZenML docs).
  • Can optionally authenticate the instance to the stack's configured container registry by logging in. This is configurable and set to False by default because of the possible security implications. If users prefer not to enable it, they can leave it at False and must then ensure themselves that the instance is logged in.
  • Allows folders on the instance to be mounted into the container, so that data stored on these GPU instances is readily available to the pipeline run. Mounts are configured by the user in the component configuration and can be changed at any time.
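The ordering mechanism described above can be sketched as follows. This is an illustrative approximation, not the orchestrator code from this PR: `build_compose`, the entrypoint command and the shape of the `steps` and `mounts` arguments are assumptions made for the example. Only the depends_on condition, the volume mounts and the host network mode mirror what the PR describes.

```python
import json
from typing import Dict, List


def build_compose(
    image: str,
    steps: Dict[str, List[str]],  # step name -> names of upstream steps
    mounts: Dict[str, str],  # path on instance -> path in container
) -> dict:
    """Build a Compose document with one service per pipeline step."""
    services = {}
    for step_name, upstream_steps in steps.items():
        service = {
            "image": image,
            # Hypothetical entrypoint; the real command differs.
            "command": ["python", "-m", "my_entrypoint", "--step", step_name],
            "volumes": [f"{src}:{dst}" for src, dst in mounts.items()],
            "network_mode": "host",
        }
        if upstream_steps:
            # A step starts only after every upstream step's container
            # has exited successfully.
            service["depends_on"] = {
                name: {"condition": "service_completed_successfully"}
                for name in upstream_steps
            }
        services[step_name] = service
    return {"services": services}


if __name__ == "__main__":
    compose = build_compose(
        image="registry.example.com/pipeline:latest",
        steps={"load": [], "train": ["load"], "evaluate": ["train"]},
        mounts={"/data": "/mnt/data"},
    )
    # JSON is a subset of YAML, so this output can be read by YAML tooling.
    print(json.dumps(compose, indent=2))
```

Expressing dependencies per service keeps the generated file declarative: Compose works out the execution order itself, so the orchestrator only has to upload the file and start it.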

Logo:
The code assumes a HyperAI logo to be present in a seemingly public bucket on your end. I can ask the HyperAI team to provide a proper logo that can be put in this bucket so that it's visible within the ZenML dashboard.

This way:

  • Users have full flexibility: they can create one service connector/orchestrator combination per instance, reuse service connectors (and thus keys) across multiple instances, or even give individual users different data access patterns by providing them with different orchestrators with different mounts.
  • Users have full freedom to deploy any Docker-based pipeline using this orchestrator: as with SageMaker, the orchestrator fully respects the Docker configuration set by the user (in fact, I took inspiration from the local Docker orchestrator).

Testing:

  • I have added only limited tests: I could not find many existing tests to model them on, and I am not familiar with any testing infrastructure you may have.
  • I did test thoroughly via a local ZenML / ZenML dashboard setup, and it worked very well.

As discussed with @htahir1, we're eagerly waiting to use HyperAI (and so is the HyperAI team, who have been in touch as far as I know), so we'd prefer to start using this integration as soon as possible. Do note that this does not mean that we should merge it recklessly; we're just very excited :)

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Summary by CodeRabbit

  • New Features
    • Introduced the HyperAI Service Connector for enhanced authentication options, supporting various key types.
    • Added comprehensive documentation for configuring HyperAI Connectors and orchestrating pipelines on HyperAI instances, including support for Docker Compose and scheduled runs.
    • Implemented the ZenML HyperAI orchestrator for deploying machine learning pipelines on HyperAI instances, with support for GPU-backed hardware via CUDA.
  • Documentation
    • New guides and documentation for setting up and using the HyperAI Service Connector and orchestrator.
  • Tests
    • Added tests for the HyperAI orchestrator to ensure correct attribute settings.

Contributor

@schustmi schustmi left a comment

LGTM 🦭

socket-security bot commented Feb 5, 2024

New dependencies detected. Learn more about Socket for GitHub ↗︎

Package | New capabilities | Transitives | Size | Publisher
pypi/[email protected] | Transitive: environment, eval, filesystem, network, shell, unsafe | +32 | 318 MB | typeshed_bot

View full report↗︎

@stefannica stefannica merged commit 66b6d99 into zenml-io:develop Feb 6, 2024
54 checks passed
kabinja pushed a commit to kabinja/zenml that referenced this pull request Feb 6, 2024
* Add init for HyperAI integration

* WIP: HyperAI service connector

* WIP

* WIP: HyperAI Service Connector

* WIP: HyperAI Orchestrator

* Replace Docker compose write with temporary file and SCP

* Variable assignment error

* Set dependency

* Set basic values of the HyperAI settings and config

* Add config property

* Allow mounts to be made

* Remove newline

* Finish (untested) orchestrator

* Import HyperAI integration

* Import HyperAI service connector in service connector registry

* Rename resource type

* Rename auth method

* Force key to be base64

* Fixes to service connector

* Identify instance by name and IP address

* Strip IP address Python

* Strip IP address Python

* Return paramiko client

* WIP

* Mimic sagemaker integration

* Fixes to make HyperAI orchestrator visible

* Fixes to make orchestrator work

* Temp change default local ip for testing

* Environment fix

* Use upstream steps to determine dependencies

* Add support for scheduled pipelines

* Polish schedules

* Add configuration support for multiple Paramiko key types

* Add Base64 instructions

* Rename various vars

* Add instructions about possible cron

* Some docstring edits

* Add setting for CR autologin

* Add rudimentary Docker login

* Move value

* Add docstring

* Remove unused def

* Extract Paramiko key type given service connector configuration

* Add better warnings

* Check for None differently

* Automatic Docker login if configured

* Add HyperAI orchestrator flavor to docs

* Basic docs for HyperAI orchestrator

* Add HyperAI service connector to auth management docs

* Add HyperAI service connector to docs

* Set autologin to False by default

* Add test similar to Airflow orchestrator

* Formatting

* Revert changes needed to run successfully locally

* Add mount path validation

* Improve error handling and formatting

* Format mount paths differently

* Upgrade azureml-core to 1.54.0.post1

* Fix docstring

* Update src/zenml/integrations/hyperai/service_connectors/hyperai_service_connector.py

Co-authored-by: Michael Schuster <[email protected]>

* Rename def into _validate_mount_paht

* Update config docstring to default to False

* Move Settings, Config and Flavor to lavor folder

* Remove type from docstring

* Remove type from docstring

* Remove type check convered by pydantic

* Select container registry more efficiently

* Remove redundant type conversion

* Move Paramiko client creation into helper method

* Reformatting

* Fix imports

* Temp changes for local testing

* Fix imports

* Revert "Temp changes for local testing"

This reverts commit 76fdb29.

* Rename HYPERAI_RESOURCE_TYPE into hyperai-instance

* Rename ip_address into hostname

* Update src/zenml/integrations/hyperai/service_connectors/hyperai_service_connector.py

Co-authored-by: Stefan Nica <[email protected]>

* Raise AuthorizationException if client cannot be created

* Remove RuntimeError in two places because it will never arrive in that state anymore

* Remove try/catch statement

* Let exception fall through if applicable

* Remove raises

* Add warning hint about long-lived credentials

* Renames in docs based on changes

* Add missing io import

* Formatting

* Add automatic_cleanup_pipeline_files to HyperAIOrchestratorConfig

* Remove redundant variable assignment

* Clean only if users configure auto cleaning

* Update docs

* Work in progress: multi IP service connector

* Resources

* Append hostname instead

* Omit assigning value

* Rename config value

* Ensure that hostname is passed to Paramiko client

* Raise NotImplementedError instead of pass value

* Formatting

* Changes to _verify

* Reflect changes in service connector docs

* Fix connector value validation to allow arrays to be used with the CLI

* Reflect changes in orchestrator docs

* Fix connector verification to allow the multi-instance case

* Ensure that pipelines can run when scheduled by setting run ID dynamically

* Reformatting

* Add information about scheduled pipelines to docs

* Use service connector username to create Compose files on instance

* Add GPU reservation if configured that way

* Formatting

* Add instruction

* Add prerequisites for HyperAI instance

* Formatting and docstrings

* Fixed remaining linter errors

* Applied review suggestions

* Add paramiko to API docs mocks

* HyperAI orchestrator config tests; make additional assertions available and fix is_remote

* Remove GPU-based Dockerfile

* Ensure that shell commands are escaped when used

* Provide password to stdin differently

* Escape case where file cannot be written to HyperAI instance

* Escape inputs differently

* Use network mode host to avoid non-overlapping IPv4 network pool error

* Disable security check for paramiko auto-add-policy

* Changes to escaping

* Silenced remaining security issues and fixed remaining linter errors

---------

Co-authored-by: Michael Schuster <[email protected]>
Co-authored-by: Stefan Nica <[email protected]>
Co-authored-by: Alex Strick van Linschoten <[email protected]>
adtygan pushed a commit to adtygan/zenml that referenced this pull request Mar 21, 2024
Labels
enhancement New feature or request run-slow-ci
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants