Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add command for dumping an Elasticsearch instance in the PaaS #188

Merged
merged 9 commits into from
Jan 8, 2024

Conversation

Kuruyia
Copy link
Contributor

@Kuruyia Kuruyia commented Jun 9, 2023

What does this PR do?

This adds the paas:elasticsearch:dump command that allows PaaS clients to easily dump the indexes and documents from an Elasticsearch instance of a PaaS application. This PR follows the addition of the application/storage controller in the PaaS console.

This command asks for the PaaS environment name and application ID (a project name can also be specified), and a directory where to dump the Elasticsearch data.
The directory structure is automatically created.

First, the indexes are dumped using the application/storage:getIndexes action, and are stored in the <dump_dir>/indexes.json file.

Then, the documents are dumped usind the application/storage:dumpDocuments action, and are stored in the <dump_dir>/documents.jsonl file. This file is a Newline delimited JSON file.
Because there could be a huge amount of documents totaling a big size for the response, the PaaS console only allows to dump at most a certain amount of documents at a time. Kourou makes multiple requests to the PaaS console until there are no more documents to dump to get them all. The amount of documents dumped can be set with the batch-size flag, but keep in mind that the PaaS console may also limit the maximum amount of documents that can be returned at once.

Finally, the dump session is cleanly terminated and the user is informed where to find their dump files.

image

Other changes

Types for Node.js were updated to Node 18.

Copy link
Contributor

@rolljee rolljee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but there are some modifications

sort: string[];
};

class PaasInit extends PaasKommand {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Classname seems odd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's a copy-paste error! 😅

);

// Create the dump directory
await fs.mkdir(this.args.dumpDirectory, { recursive: true });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does option { recursive true } works as mkdirp ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works the same as mkdir -p

@Aschen
Copy link
Contributor

Aschen commented Jul 5, 2023

Why using multiple files instead of a single file in JSONL format?
We use this format with a stream for the index/collection dump to avoid high memory usage.

I see some potentials problems with the actual design:

  • having multiple file can end up with literally thousands of small files which is not efficient
  • 10 documents limit is really low and the overhead of a network request should be considered, for instance if you have 100000 documents then you will need 10 thousands requests. This limit should be configurable

Actually most of the issues you will encounter here were already discussed and solved for the index/collection export commands. You can have a look how it was done


// Finish the dump
try {
await this.finishDump(pitId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the finishDump method if the dump crash in between two pages of documents, should it be called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, yes it should totally be called.

@Kuruyia
Copy link
Contributor Author

Kuruyia commented Jul 5, 2023

Why using multiple files instead of a single file in JSONL format?

I didn't know about that file format, I will look at that 😄

  • 10 documents limit is really low and the overhead of a network request should be considered, for instance if you have 100000 documents then you will need 10 thousands requests. This limit should be configurable

You're right, I think I just forgot to make this limit configurable 😅

Actually most of the issues you will encounter here were already discussed and solved for the index/collection export commands. You can have a look how it was done

👍

@rolljee rolljee linked an issue Nov 23, 2023 that may be closed by this pull request
@rolljee rolljee merged commit a66bfe9 into 1-dev Jan 8, 2024
4 checks passed
@rolljee rolljee deleted the feat/paas-dump-es branch January 8, 2024 09:54
@rolljee rolljee mentioned this pull request Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implements dump command for paas elastic search
3 participants