feat: generate an instanceID #4238

Open
wants to merge 2 commits into
base: main

frrist
Member

@frrist frrist commented Jul 16, 2024

  • The ID is based on the host running the bacalhau node, and is deterministic across re-installs and node initialization. There is no guarantee that setting the ID will succeed; it is best effort.

coderabbitai bot commented Jul 16, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@frrist frrist requested a review from wdbaruni July 16, 2024 20:08
@frrist frrist force-pushed the frrist/repo/metadata-instanceID branch from 8ed6d0f to 61f27f9 Compare July 16, 2024 20:20
@frrist frrist force-pushed the frrist/repo/metadata-instanceID branch 3 times, most recently from e0bf832 to c1ec55c Compare July 16, 2024 20:45
Member

@wdbaruni wdbaruni left a comment

The diff is mixed with metadata changes, but I reviewed the instanceID commit. Looks good overall, but I left a comment suggesting we fall back to a UUID.

Comment on lines +13 to +15
// GenerateInstanceID creates a unique, anonymous identifier for the instance of bacalhau.
// It combines the machine ID and MAC address to ensure uniqueness across different
// environments, including virtual machines and cloud instances.
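
To make the doc comment above concrete, here is a minimal Go sketch of combining a machine ID and a MAC address into one anonymous identifier. It is not the code from this PR: the function names, the use of SHA-256, the `|` separator, and reading `/etc/machine-id` (a Linux-only assumption) are all illustrative choices.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
)

// deriveInstanceID is a hypothetical helper. It hashes the machine ID and the
// MAC address together so that neither raw value is exposed by the result.
func deriveInstanceID(machineID, mac string) (string, error) {
	if machineID == "" || mac == "" {
		return "", errors.New("machine ID and MAC address are both required")
	}
	sum := sha256.Sum256([]byte(machineID + "|" + mac))
	return hex.EncodeToString(sum[:]), nil
}

// readMachineID assumes a Linux host that exposes /etc/machine-id.
func readMachineID() (string, error) {
	b, err := os.ReadFile("/etc/machine-id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// firstMAC returns the hardware address of the first non-loopback interface
// that reports one. Interface ordering is left to the operating system.
func firstMAC() (string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return "", err
	}
	for _, iface := range ifaces {
		if iface.Flags&net.FlagLoopback != 0 || len(iface.HardwareAddr) == 0 {
			continue
		}
		return iface.HardwareAddr.String(), nil
	}
	return "", errors.New("no interface with a hardware address found")
}

func main() {
	machineID, err := readMachineID()
	if err != nil {
		fmt.Println("could not read machine ID:", err)
		return
	}
	mac, err := firstMAC()
	if err != nil {
		fmt.Println("could not read MAC address:", err)
		return
	}
	id, err := deriveInstanceID(machineID, mac)
	if err != nil {
		fmt.Println("could not derive instance ID:", err)
		return
	}
	fmt.Println("instance ID:", id)
}
```

Because the output depends only on the two host values, the same machine should produce the same ID after a re-install, which is the deterministic property the PR description asks for.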
Member

Maybe worth clarifying in the comment:
We only attempt to generate an instance ID if no ID was found in the system_metadata, which can happen during new node initialization or if the DataDir was wiped. We attempt to generate a deterministic ID for analytics purposes and to track the usage of the instance.

--

Now, since it is best effort, rather than failing because of errors when trying to extract the machine ID or MAC address, why don't we fall back to hash(uuid)? This will be sufficient 99.999% of the time since we do persist it in the system_metadata.

My other question is, if the user is running a node or a client against different networks with different DataDir, we will assign them the same InstanceID. Maybe this is the desired behaviour, but worth thinking about
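
The hash(uuid) fallback suggested above could look roughly like the following Go sketch. It is an illustration only: `instanceIDWithFallback`, `randomUUID`, and the `derive` callback are hypothetical names, and the UUID is built by hand from `crypto/rand` because the Go standard library has no UUID package.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// randomUUID builds a version 4 UUID string from 16 random bytes.
func randomUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// hashID returns the hex-encoded SHA-256 digest of s.
func hashID(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// instanceIDWithFallback first tries the host-derived value supplied by the
// hypothetical derive callback. If that fails or is empty, it hashes a random
// UUID instead. The boolean reports whether the result is host-derived.
func instanceIDWithFallback(derive func() (string, error)) (string, bool, error) {
	if hostID, err := derive(); err == nil && hostID != "" {
		return hashID(hostID), true, nil
	}
	u, err := randomUUID()
	if err != nil {
		return "", false, err
	}
	return hashID(u), false, nil
}

func main() {
	// Simulate a host where the machine ID cannot be read.
	failing := func() (string, error) { return "", errors.New("no machine ID available") }
	id, deterministic, err := instanceIDWithFallback(failing)
	if err != nil {
		fmt.Println("could not generate an instance ID:", err)
		return
	}
	fmt.Println("instance ID:", id, "host-derived:", deterministic)
}
```

A fallback ID is only stable for as long as it stays persisted, which is why this approach leans on the system_metadata file.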

Member Author

There are several clarifying comments related to when we generate the instanceID, and how we persist it in the repo Open and Init methods. I think usage is well covered there. (It should be easier to grok now that I've fixed the mixed commits 😀)

> Now since it is best effort, rather than failing because of errors when trying to extract the machine id or mac address, why don't we fallback to hash(uuid)?

What value would including this random string (UUID) add?
The purpose of the instanceID is to identify (anonymously) unique instances of bacalhau. I'd much prefer an empty value, which implies a value couldn't be derived, to a random string with no relationship to the actual instance. My preference is to leave the field empty if it couldn't be derived.

> This will be sufficient 99.999% of the time since we do persist it in the system_metadata

I've made a small change since your last review: We now generate and persist an instanceID when the repo is first initialized and also each time the repo is opened. This ensures the instanceID is associated with the machine running bacalhau, even if the repo is copied to a different machine. In this sense we are treating the system_metadata.yaml file a bit like a cache. In work to follow this, we can read the instanceID from the system_metadata.yaml file and include it in the job metadata and/or HTTP headers when an instance interacts with the network.
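
The "treat the file like a cache" behavior described above can be sketched in Go as follows. This is a simplified illustration, not the repo code: `SystemMetadata`, `refreshInstanceID`, and the `derive` callback are hypothetical, file I/O is left out, and clearing the field when derivation fails is an assumption based on the stated preference for an empty value.

```go
package main

import (
	"errors"
	"fmt"
)

// SystemMetadata is a hypothetical stand-in for the contents of
// system_metadata.yaml; only the field relevant here is modeled.
type SystemMetadata struct {
	InstanceID string
}

// errUnavailable represents a host whose identifiers cannot be read.
var errUnavailable = errors.New("host identifiers unavailable")

// refreshInstanceID re-derives the ID on every call (init or open) and
// overwrites the cached value. It reports whether the metadata changed and
// therefore needs to be written back to disk. Assumption: when derivation
// fails, the field is cleared rather than kept.
func refreshInstanceID(meta *SystemMetadata, derive func() (string, error)) bool {
	id, err := derive()
	if err != nil {
		id = ""
	}
	if meta.InstanceID == id {
		return false
	}
	meta.InstanceID = id
	return true
}

func main() {
	// A repo that was initialized on machine A and then copied to machine B.
	meta := &SystemMetadata{InstanceID: "id-of-machine-a"}
	onMachineB := func() (string, error) { return "id-of-machine-b", nil }
	if refreshInstanceID(meta, onMachineB) {
		fmt.Println("instance ID changed, persist:", meta.InstanceID)
	}
}
```

With this policy the stored value always follows the machine that opened the repo, so a copied repo does not carry the previous machine's ID with it.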

> My other question is, if the user is running a node or a client against different networks with different DataDir, we will assign them the same InstanceID. Maybe this is the desired behaviour, but worth thinking about

Right, I think this behavior makes sense. The same instance of bacalhau is free to interact with many networks. A measure of this interaction could be derived by counting the number of different networks an instance interacts with.

Member

> What value would including this random string (UUID) add?
> The purpose of the instanceID is to identify (anonymously) unique instances of bacalhau. I'd much prefer an empty value - which implies a value couldn't be derived - than a random string with no relationship to the actual instance. My preference is to leave the field empty if it couldn't be derived.

How does a persisted UUID not provide a relationship to the instance? Our goal is to identify the instance that jobs are submitted from (client) or that executes the job (node). Assigning and persisting a UUID when you first initialize the client/node is more than enough and fulfils this purpose. What you are trying to do here by generating idempotent IDs is handle the edge case where the repo is deleted and we still want to tag jobs with the same instance, which is great, but let's treat it as best effort and fall back to a UUID instead of not tracking at all.

Keep in mind this means we only attempt to generate an ID if no ID was found in the system_metadata.yaml, and don't attempt to override the value
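
The policy described in this comment (keep a stored ID, otherwise derive one, otherwise fall back to a random value) can be sketched in Go like this. The function name `ensureInstanceID` and both callbacks are hypothetical, and loading and saving system_metadata.yaml is left to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoHostID represents a failed attempt to read host identifiers.
var errNoHostID = errors.New("could not derive a host-based ID")

// ensureInstanceID never overrides a stored value. When nothing is stored it
// tries the deterministic derive callback first and only then the fallback
// callback (for example, a hashed random UUID). Both callbacks are
// hypothetical stand-ins supplied by the caller.
func ensureInstanceID(stored string, derive, fallback func() (string, error)) (string, error) {
	if stored != "" {
		return stored, nil
	}
	if id, err := derive(); err == nil && id != "" {
		return id, nil
	}
	return fallback()
}

func main() {
	derive := func() (string, error) { return "", errNoHostID }
	fallback := func() (string, error) { return "random-uuid-placeholder", nil }

	// Fresh repo with no stored ID on a host where derivation fails.
	id, err := ensureInstanceID("", derive, fallback)
	if err != nil {
		fmt.Println("no instance ID available:", err)
		return
	}
	fmt.Println("instance ID to persist:", id)
}
```

Under this policy the instance is always tracked, and the deterministic derivation only matters when the repo has been deleted and re-created on the same host.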

Member

Can you address these comments so we can close the PR?

@frrist frrist force-pushed the frrist/repo/metadata branch 2 times, most recently from 82a1b40 to a8a86b1 Compare August 6, 2024 21:21
@frrist frrist force-pushed the frrist/repo/metadata-instanceID branch from c1ec55c to 6d31889 Compare August 6, 2024 21:22
@frrist frrist force-pushed the frrist/repo/metadata-instanceID branch from 6d31889 to 6bdfde4 Compare August 6, 2024 21:45
@frrist frrist requested a review from wdbaruni August 6, 2024 22:11
Base automatically changed from frrist/repo/metadata to main August 6, 2024 22:12
frrist added 2 commits August 6, 2024 15:13
- The ID is based on the host running the bacalhau node, and is
  deterministic across re-installs and node initialization. There is no
  guarantee that setting the ID will succeed; it is best effort.
@wdbaruni
Member

Let's avoid having PRs open for too long. It gets more difficult to review things again after weeks, and I am sure it gets difficult for you as well.

@frrist
Member Author

frrist commented Aug 20, 2024

I will close this and re-open when there is a strong need for it.

@frrist frrist closed this Aug 20, 2024
@wdbaruni wdbaruni reopened this Aug 21, 2024
@wdbaruni
Member

@frrist there is a strong need for a persisted InstanceID in the job spec, as we've removed the ClientID and no longer persist the InstallationID.


2 participants