
New Turbinia LLM analyzer, LLM lib interface and LLM lib implementation for VertexAI #1441

Merged: 21 commits into google:master on Feb 29, 2024

Conversation

@sa3eed3ed (Contributor) commented on Feb 21, 2024

New Turbinia LLM analyzer, LLM lib interface and LLM lib implementation for VertexAI

Please assign to @hacktobeer for review; he is aware of this work.

  • New LLM lib interface (a hedged sketch follows this list)
  • LLM lib for Vertex AI (using the Gemini 1.0 Pro model)
  • Interface can be extended or implemented for other LLM providers
  • New configs for Vertex AI
  • LLM_PROVIDER config value can be used to choose the LLM provider (currently only Vertex AI)
  • New Job to analyze history, log and config files using an LLM
  • New evidence type (ExportedFileArtifactLLM) for FileArtifactExtractionTask to avoid redundant processing of artifacts between the LLM analyzer and other analyzers using the same artifacts
  • Files to analyze are extracted using FileArtifactExtractionTask, i.e. all artifacts supported by image_exporter.py are supported
  • Tested end to end using evidence/artifact_disk.dd
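
To make the interface/provider split above concrete, here is a minimal sketch; the class and method names below are illustrative assumptions, not the PR's actual code (the real implementation lives in turbinia/lib/llm_libs/llm_lib_base.py and turbinia/lib/llm_libs/vertex_ai_lib.py, reviewed further down):

    # A sketch only: class and method names are assumptions, not the PR's code.
    from abc import ABC, abstractmethod

    class LLMLibBase(ABC):
      """Base interface each LLM provider library implements."""

      @abstractmethod
      def prompt(self, prompt_text: str) -> str:
        """Sends a prompt to the provider and returns the model response."""

    class VertexAILib(LLMLibBase):
      """Vertex AI implementation backed by the Gemini 1.0 Pro model."""

      def prompt(self, prompt_text: str) -> str:
        # Illustrative Vertex AI SDK usage; real code would first call
        # vertexai.init(project=..., location=...) and may tune parameters.
        from vertexai.generative_models import GenerativeModel
        model = GenerativeModel('gemini-1.0-pro')
        return model.generate_content(prompt_text).text

    def get_llm_lib(provider: str) -> LLMLibBase:
      """Maps the LLM_PROVIDER config value to a concrete library."""
      if provider == 'vertexai':
        return VertexAILib()
      raise ValueError(f'Unsupported LLM provider: {provider}')

The point of the design is that provider-specific SDK calls stay behind one interface, so supporting another provider means one new subclass plus a new LLM_PROVIDER value.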

@hacktobeer self-requested a review on February 21, 2024 at 17:53

@hacktobeer (Collaborator)

Excellent @sa3eed3ed - I have assigned myself and will review before EOW.

@hacktobeer (Collaborator) left a review

Initial pass. PTAL.

Review comments were left on:

  • pyproject.toml (outdated)
  • turbinia/jobs/llm_artifacts_analyzer.py
  • turbinia/lib/llm_libs/llm_lib_base.py (outdated)
  • turbinia/lib/llm_libs/vertex_ai_lib.py
@jleaniz (Collaborator) commented on Feb 23, 2024

Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.

@sa3eed3ed (Contributor, Author)

> Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.

Done, added a version. I thought even if google-api-core is removed from pyproject.toml, the poetry.lock file would still have all the deps needed by the vertexAI package.

@jleaniz (Collaborator) commented on Feb 23, 2024

> Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.

> Done, added a version. I thought even if google-api-core is removed from pyproject.toml, the poetry.lock file would still have all the deps needed by the vertexAI package.

Yes, it will have the dependencies. My point was just to add a version, nothing else is needed. :) The core lib is included in libcloudforensics' dependencies as well, which is already in the toml file.

@hacktobeer (Collaborator)

Thanks @sa3eed3ed. I have reviewed and tested the PR; it looks pretty cool, and I'm looking forward to getting more real-life results! I have no other review comments.
Example output for others following along:

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894370-9c15b072c9bd49f8b5e13fd04b4fbcad-FileArtifactExtractionTask/export/etc/redis/redis.conf): **Summary:** Redis configuration file contains default bind address of "0.0.0.0", allowing remote clients to connect without authentication.

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894292-8b20bc2016f14f16b5d5bbd8ee39b278-FileArtifactExtractionTask/export/home/dummyuser/.jupyter/jupyter_notebook_config.py): **Summary:** Jupyter Notebook server is exposed to the internet with weak security settings, allowing unauthorized access, remote code execution, and potential compromise of sensitive data.

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894416-d4f26a75a3124996bf90723c51c501a3-FileArtifactExtractionTask/export/etc/ssh/sshd_config): **SSH configuration allows weak ciphers, root login, password authentication, and empty passwords, posing a high security risk.**

@hacktobeer (Collaborator)

@aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

@hacktobeer (Collaborator) left a review

LGTM

@hacktobeer (Collaborator)

For future ideas regarding this analyser:

  • bundling output reports (this is more generic and applies to other analysers as well, e.g. in case we get disk images from GKE nodes with tons of containers)
  • adding/removing the analyser from any triage recipe depending on real-world output results
  • making module configuration parameters configurable in e.g. recipes

@aarontp (Member) left a review

Cool analysis task! I just left a drive-by comment about potentially consolidating at least the extraction tasks.

Review comments were left on:

  • turbinia/workers/analysis/llm_analyzer.py
  • turbinia/jobs/llm_artifacts_analyzer.py (outdated)
@aarontp (Member) commented on Feb 27, 2024

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, generally I would say it makes sense to include it anywhere we include the other analysis tasks. At the moment those are not in the triage recipes as defined by the triage-* recipes here: https://github.com/google/turbinia/tree/master/turbinia/config/recipes, but we do have them in the disk-related dftimewolf recipes, so we could include it in the Turbinia recipes used by those. (I can't remember if those disk-related dftimewolf recipes are currently just using the default recipe or if there is a dedicated recipe, but we do have a goal of making every dftimewolf recipe use a corresponding Turbinia recipe this year.)

@berggren (Collaborator) left a review

Drive-by comment, sorry for freelancing :)

Review comments were left on:

  • poetry.lock (outdated)
  • turbinia/lib/llm_libs/llm_client.py (outdated)
  • turbinia/workers/analysis/llm_analyzer.py (outdated)
  • turbinia/workers/analysis/llm_analyzer.py
@hacktobeer (Collaborator)

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

> Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, generally I would say it makes sense to include it anywhere we are including the other analysis tasks, which at the moment are not in the triage recipes as defined by the triage-* recipes ....

It's fast, faster than plaso. FileExtraction is fast as the artifact definitions are pretty specific, and VertexAI calling is fast as well. It will be done faster than the plaso task that is run in parallel.

@sa3eed3ed (Contributor, Author)

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

> Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, generally I would say it makes sense to include it anywhere we are including the other analysis tasks, which at the moment are not in the triage recipes as defined by the triage-* recipes ....

> It's fast, faster than plaso. FileExtraction is fast as the artifact definitions are pretty specific, and VertexAI calling is fast as well. It will be done faster than the plaso task that is run in parallel.

Removed from the triage recipes.

@hacktobeer self-requested a review on February 29, 2024 at 10:01

@hacktobeer (Collaborator)

Ran local tests and it looks good. One final nit:
Can you add the below to the configuration template, turbinia/config/turbinia_config_tmpl.py?

}, {
    'job': 'LLMAnalysisJob',
    'programs': [],
    'docker_image': None,
    'timeout': 600
}, {
    'job': 'LLMArtifactsExtractionJob',
    'programs': [],
    'docker_image': None,
    'timeout': 600

After that I'll do a final check if the e2e tests run fine and will approve/merge
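
For anyone wondering where that fragment slots in: Turbinia's configuration template keeps per-job settings in a DEPENDENCIES list, so the entries above become two more dicts in that list. A hedged sketch, with the surrounding entries elided:

    # Sketch of the relevant part of turbinia/config/turbinia_config_tmpl.py;
    # the existing job entries before and after are elided.
    DEPENDENCIES = [
        # ... existing job entries ...
        {
            'job': 'LLMAnalysisJob',
            'programs': [],
            'docker_image': None,
            'timeout': 600
        },
        {
            'job': 'LLMArtifactsExtractionJob',
            'programs': [],
            'docker_image': None,
            'timeout': 600
        },
    ]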

@sa3eed3ed (Contributor, Author)

> Ran local tests and it looks good. One final nit: can you add the below to the configuration template, turbinia/config/turbinia_config_tmpl.py? [...]

Done. I made the timeout 3600, matching the default:

timeout_default = 3600

I don't expect it to take an hour, and there seem to be many other jobs with longer timeouts, but if you think this might be problematic feel free to amend.

@hacktobeer (Collaborator)

The local e2e tests (with the API key added) ran fine. I am going to approve and merge; we can tune based on real-world usage results.
@sa3eed3ed Thank you very much for this awesome contribution. I am looking forward to tuning this based on the results!

@hacktobeer merged commit dbfe4cb into google:master on Feb 29, 2024
5 checks passed
jleaniz pushed a commit to jleaniz/turbinia that referenced this pull request on Mar 18, 2024:
…on for VertexAI (google#1441)