HyperAI integration: orchestrator and service connector #2372

christianversloot · 2024-01-29T20:39:05Z

Describe changes

HyperAI (hyperai.ai) has recently become one of our suppliers of GPU instances. Unfortunately, unlike the major public cloud providers, HyperAI does not yet have an SDK like SageMaker's. Still, it is critical for us to keep using ZenML, as many of our pipelines are built on it. This PR implements a HyperAI integration consisting of an orchestrator and a service connector.

Service connector:

The service connector is effectively an SSH-based service connector. Using paramiko, it provides an authenticated SSHClient given the configured ip_address (with an optional instance_name serving as a nickname to distinguish multiple IP addresses), username, base64_ssh_key and, optionally, ssh_passphrase.
Unfortunately, the ZenML CLI does not support multiline input. That is why the key is provided in Base64-encoded form; the service connector decodes it before use.
It supports multiple key types: RSA, DSA (DSS), ECDSA and ED25519.
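
To make the Base64 handling concrete, here is a minimal sketch of how a decoded key could be handed to paramiko. It is not the code from this PR: the helper names `decode_base64_key` and `load_private_key` and the `KEY_CLASS_NAMES` mapping are made up for illustration, and the DSA entry assumes a paramiko version that still ships `DSSKey`.

```python
import base64
import io
from typing import Optional

# Hypothetical mapping from a configured key type to a paramiko class name.
# "DSSKey" only exists in paramiko releases that still support DSA keys.
KEY_CLASS_NAMES = {
    "rsa": "RSAKey",
    "dsa": "DSSKey",
    "ecdsa": "ECDSAKey",
    "ed25519": "Ed25519Key",
}


def decode_base64_key(base64_ssh_key: str) -> str:
    """Turn the single-line Base64 value back into the multiline key text."""
    return base64.b64decode(base64_ssh_key.strip()).decode("utf-8")


def load_private_key(
    key_type: str, base64_ssh_key: str, ssh_passphrase: Optional[str] = None
):
    """Build a paramiko key object from the Base64-encoded private key."""
    import paramiko  # third-party dependency, imported lazily

    key_class = getattr(paramiko, KEY_CLASS_NAMES[key_type.lower()])
    key_file = io.StringIO(decode_base64_key(base64_ssh_key))
    return key_class.from_private_key(key_file, password=ssh_passphrase)
```

The resulting key object can then be passed as `pkey` to `paramiko.SSHClient.connect`. On the user side, the single-line value can be produced with any Base64 tool, for example `base64 -w0` from GNU coreutils.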

The orchestrator:

Uses Docker Compose and its service_completed_successfully depends_on condition to generate a Compose file that guarantees the order of execution, including for more complex pipelines, by using step.spec.upstream_steps.
Uses the provided SSH client to upload the file to the instance and then executes it (and hence assumes a very lightweight setup at the user side with Docker and Docker Compose being installed).
Has support for scheduled pipelines (then also assumes Cron daemon is running).
Has a built-in cleanup mechanism for non-scheduled pipeline runs, which are automatically deleted once they are more than 7 days old whenever a new pipeline run starts. Following the paradigm where users are responsible for cleaning up their scheduled pipeline runs (per the ZenML docs), it leaves those untouched.
Optionally (via configuration, but set to False by default in recognition of possible security implications) can authenticate the instance to the stack's configured container registry by logging in. Once again, this is entirely optional: if, from a security perspective, the user does not want this, they are free to leave it set to False; in that case, they must ensure themselves that the instance is logged in.
Allows for mounts to be made between folders on the instance and the container, hence allowing data stored on these GPU instances to be readily available to the pipeline run. Mounts that are made can be configured by the user in component configuration and can be changed at any time.
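
As an illustration of the ordering and mount behaviour described above, the following is a simplified sketch rather than the orchestrator's actual code. The function name `build_compose`, the step names, the image name and the mount paths are all hypothetical; only the `depends_on` condition `service_completed_successfully` and the use of upstream steps come from the description.

```python
from typing import Dict, List, Optional


def build_compose(
    image: str,
    upstream_steps: Dict[str, List[str]],
    mounts: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a Compose definition with one service per pipeline step.

    upstream_steps maps each step name to the steps it must wait for;
    mounts maps a folder on the instance to a folder in the container.
    """
    services = {}
    for step_name, upstream in upstream_steps.items():
        service: dict = {"image": image}
        if mounts:
            service["volumes"] = [f"{src}:{dst}" for src, dst in mounts.items()]
        if upstream:
            # A step only starts after every upstream step exited successfully.
            service["depends_on"] = {
                name: {"condition": "service_completed_successfully"}
                for name in upstream
            }
        services[step_name] = service
    return {"services": services}


compose = build_compose(
    "registry.example.com/pipeline:latest",  # hypothetical image name
    {"load": [], "train": ["load"], "evaluate": ["train"]},
    mounts={"/mnt/data": "/data"},
)
```

Serialising this dictionary to YAML yields a file that `docker compose up` can execute on the instance, with `train` waiting for `load` and `evaluate` waiting for `train`.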

Logo:
The code assumes a HyperAI logo to be present in a seemingly public bucket on your end. I can ask the HyperAI team to provide a proper logo that can be put in this bucket so that it's visible within the ZenML dashboard.

This way:

Users have full flexibility as to whether they want to create one service connector/orchestrator combination per instance; reuse service connectors (and thus keys) with multiple instances; even provide single users with different data access patterns by giving them different orchestrators with different mounts.
Users have full freedom to deploy any Docker based pipeline using this orchestrator: as with SageMaker, the orchestrator fully respects the Docker configuration set by the user (in fact, I used the local Docker orchestrator to find inspiration).

Testing:

I have added only limited tests: I could not find many existing examples to follow, and I am not familiar with any testing infrastructure you may have.
I did test thoroughly via a local ZenML / ZenML dashboard setup and it worked very well.

As discussed with @htahir1, we're actively awaiting HyperAI usage (and so is the HyperAI team, as they've been in touch as far as I know), so we'd prefer to start using this integration as soon as possible. Do note that this does not mean that we should merge it recklessly; we're just very excited :)

Pre-requisites

Please ensure you have done the following:

I have read the CONTRIBUTING.md document.
If my change requires a change to docs, I have updated the documentation accordingly.
I have added tests to cover my changes.
I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Other (add details above)

Summary by CodeRabbit

New Features
- Introduced the HyperAI Service Connector for enhanced authentication options, supporting various key types.
- Added comprehensive documentation for configuring HyperAI Connectors and orchestrating pipelines on HyperAI instances, including support for Docker Compose and scheduled runs.
- Implemented the ZenML HyperAI orchestrator for deploying machine learning pipelines on HyperAI instances, with support for GPU-backed hardware via CUDA.
Documentation
- New guides and documentation for setting up and using the HyperAI Service Connector and orchestrator.
Tests
- Added tests for the HyperAI orchestrator to ensure correct attribute settings.

schustmi

LGTM 🦭

socket-security · 2024-02-05T08:00:46Z

New dependencies detected.

Package	New capabilities	Transitives	Size	Publisher
pypi/[email protected]	Transitive: environment, eval, filesystem, network, shell, unsafe	`+32`	318 MB	typeshed_bot

* Add init for HyperAI integration
* WIP: HyperAI service connector
* WIP
* WIP: HyperAI Service Connector
* WIP: HyperAI Orchestrator
* Replace Docker compose write with temporary file and SCP
* Variable assignment error
* Set dependency
* Set basic values of the HyperAI settings and config
* Add config property
* Allow mounts to be made
* Remove newline
* Finish (untested) orchestrator
* Import HyperAI integration
* Import HyperAI service connector in service connector registry
* Rename resource type
* Rename auth method
* Force key to be base64
* Fixes to service connector
* Identify instance by name and IP address
* Strip IP address Python
* Strip IP address Python
* Return paramiko client
* WIP
* Mimic sagemaker integration
* Fixes to make HyperAI orchestrator visible
* Fixes to make orchestrator work
* Temp change default local ip for testing
* Environment fix
* Use upstream steps to determine dependencies
* Add support for scheduled pipelines
* Polish schedules
* Add configuration support for multiple Paramiko key types
* Add Base64 instructions
* Rename various vars
* Add instructions about possible cron
* Some docstring edits
* Add setting for CR autologin
* Add rudimentary Docker login
* Move value
* Add docstring
* Remove unused def
* Extract Paramiko key type given service connector configuration
* Add better warnings
* Check for None differently
* Automatic Docker login if configured
* Add HyperAI orchestrator flavor to docs
* Basic docs for HyperAI orchestrator
* Add HyperAI service connector to auth management docs
* Add HyperAI service connector to docs
* Set autologin to False by default
* Add test similar to Airflow orchestrator
* Formatting
* Revert changes needed to run successfully locally
* Add mount path validation
* Improve error handling and formatting
* Format mount paths differently
* Upgrade azureml-core to 1.54.0.post1
* Fix docstring
* Update src/zenml/integrations/hyperai/service_connectors/hyperai_service_connector.py Co-authored-by: Michael Schuster <[email protected]>
* Rename def into _validate_mount_paht
* Update config docstring to default to False
* Move Settings, Config and Flavor to lavor folder
* Remove type from docstring
* Remove type from docstring
* Remove type check convered by pydantic
* Select container registry more efficiently
* Remove redundant type conversion
* Move Paramiko client creation into helper method
* Reformatting
* Fix imports
* Temp changes for local testing
* Fix imports
* Revert "Temp changes for local testing" This reverts commit 76fdb29.
* Rename HYPERAI_RESOURCE_TYPE into hyperai-instance
* Rename ip_address into hostname
* Update src/zenml/integrations/hyperai/service_connectors/hyperai_service_connector.py Co-authored-by: Stefan Nica <[email protected]>
* Raise AuthorizationException if client cannot be created
* Remove RuntimeError in two places because it will never arrive in that state anymore
* Remove try/catch statement
* Let exception fall through if applicable
* Remove raises
* Add warning hint about long-lived credentials
* Renames in docs based on changes
* Add missing io import
* Formatting
* Add automatic_cleanup_pipeline_files to HyperAIOrchestratorConfig
* Remove redundant variable assignment
* Clean only if users configure auto cleaning
* Update docs
* Work in progress: multi IP service connector
* Resources
* Append hostname instead
* Omit assigning value
* Rename config value
* Ensure that hostname is passed to Paramiko client
* Raise NotImplementedError instead of pass value
* Formatting
* Changes to _verify
* Reflect changes in service connector docs
* Fix connector value validation to allow arrays to be used with the CLI
* Reflect changes in orchestrator docs
* Fix connector verification to allow the multi-instance case
* Ensure that pipelines can run when scheduled by setting run ID dynamically
* Reformatting
* Add information about scheduled pipelines to docs
* Use service connector username to create Compose files on instance
* Add GPU reservation if configured that way
* Formatting
* Add instruction
* Add prerequisites for HyperAI instance
* Formatting and docstrings
* Fixed remaining linter errors
* Applied review suggestions
* Add paramiko to API docs mocks
* HyperAI orchestrator config tests; make additional assertions available and fix is_remote
* Remove GPU-based Dockerfile
* Ensure that shell commands are escaped when used
* Provide password to stdin differently
* Escape case where file cannot be written to HyperAI instance
* Escape inputs differently
* Use network mode host to avoid non-overlapping IPv4 network pool error
* Disable security check for paramiko auto-add-policy
* Changes to escaping
* Silenced remaining security issues and fixed remaining linter errors

---------

Co-authored-by: Michael Schuster <[email protected]>
Co-authored-by: Stefan Nica <[email protected]>
Co-authored-by: Alex Strick van Linschoten <[email protected]>

christianversloot and others added 30 commits January 23, 2024 16:35

Add init for HyperAI integration

d7c4ada

WIP: HyperAI service connector

38a0f75

WIP

069aa5a

WIP: HyperAI Service Connector

8555b06

WIP: HyperAI Orchestrator

1cf7b1e

Replace Docker compose write with temporary file and SCP

61aaa28

Variable assignment error

b1971ef

Set dependency

433ffb0

Set basic values of the HyperAI settings and config

f911c65

Add config property

25b4e6b

Allow mounts to be made

b353a18

Remove newline

ca9b525

Merge pull request #1 from zenml-io/develop

3a68dde

Develop

Finish (untested) orchestrator

d611a4b

Import HyperAI integration

b9bd84f

Import HyperAI service connector in service connector registry

a614d4d

Rename resource type

17d016f

Rename auth method

93b1dbe

Force key to be base64

c82c4c6

Fixes to service connector

ec4f48e

Identify instance by name and IP address

442d7f3

Strip IP address Python

17b4990

Strip IP address Python

99b15e2

Return paramiko client

15658a5

WIP

cb53440

Mimic sagemaker integration

00bbe76

Fixes to make HyperAI orchestrator visible

502908e

Fixes to make orchestrator work

52b2417

Temp change default local ip for testing

e5fd25b

Environment fix

0bbe217

Add instruction

e7d0575

schustmi approved these changes Feb 2, 2024

christianversloot added 2 commits February 2, 2024 14:51

Add prerequisites for HyperAI instance

f35882c

Merge branch 'develop' into hyperai-integration

e48632b

stefannica added the run-slow-ci label Feb 2, 2024

christianversloot and others added 4 commits February 3, 2024 00:50

Merge branch 'develop' into hyperai-integration

c20f9c5

Formatting and docstrings

a43baed

Merge branch 'hyperai-integration' of github.com:christianversloot/ze…

5c6c887

…nml into hyperai-integration

Fixed remaining linter errors

aa25e21

christianversloot commented Feb 5, 2024

stefannica and others added 16 commits February 5, 2024 09:12

Applied review suggestions

d75ece0

Add paramiko to API docs mocks

7a1bf4b

HyperAI orchestrator config tests; make additional assertions availab…

bc96c84

…le and fix is_remote

Remove GPU-based Dockerfile

d0be24c

Merge branch 'develop' into hyperai-integration

45cc0d4

Ensure that shell commands are escaped when used

8147575

Provide password to stdin differently

e1bc7d0

Escape case where file cannot be written to HyperAI instance

6e680bf

Escape inputs differently

fe026f4

Use network mode host to avoid non-overlapping IPv4 network pool error

cd16e76

Disable security check for paramiko auto-add-policy

b4278cb

Changes to escaping

12231d4

Merge branch 'hyperai-integration' of github.com:christianversloot/ze…

05debc3

…nml into hyperai-integration

Merge branch 'develop' into hyperai-integration

67ea70e

Silenced remaining security issues and fixed remaining linter errors

2a134aa

Merge branch 'hyperai-integration' of github.com:christianversloot/ze…

a7be14c

…nml into hyperai-integration

stefannica merged commit 66b6d99 into zenml-io:develop Feb 6, 2024
54 checks passed
