Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store scancode scans locally #235

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

TG1999
Copy link
Member

@TG1999 TG1999 commented Dec 13, 2023

This PR allows to store scans locally using a PURL hash-based distribution of scans in multiple directories to avoid stressing file systems with too many files in one directory.
The goal is also to store these in Git repositories

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! here is some review.



def get_cd_item_by_repo_name(cd_items):
cd_item_by_repo_name = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a defaultdict(list) to simplify the code in https://github.com/nexB/purldb/pull/235/files#diff-a1fd023bd42d73f56019d540f38be711255403547add15108540d70f9948dd40R106

Also since you are also computing the purl here, you could also keep it.

So you return two mappings instead:

  • cd_items_by_repo: {repo_name: [list of CDitem pk/ids,...]}
  • purl_by_cd_item_id: {CDitem pk/id: computed purl}

This is still manageable at scale: with 20M items, the first mapping may be would be under 2GB with and 100 bytes storage per item, and the second would be about 6GB with 300 bytes storage per item.

(I reckon that this second purl mapping may be premature optimization though)

Save and commit them in git repositories in work dir
"""
cd_items = CDitem.objects.filter(~Q(content=b''), path__contains="tool/scancode")
x = get_cd_item_by_repo_name(cd_items=cd_items)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call this cd_items_by_repo_name ?
Also you cannot store a whole list of actual objects, as I think you will exhaust memory with this approach. Instead, a way that will use the least possible amount of memory would be to keep a mapping of {repo_name: [list of CDitem pk/ids,...]} and create a second query set to iterate over in https://github.com/nexB/purldb/pull/235/files#diff-a1fd023bd42d73f56019d540f38be711255403547add15108540d70f9948dd40R75 with a filter(id__in=cd_item_ids)

"""
cd_items = CDitem.objects.filter(~Q(content=b''), path__contains="tool/scancode")
x = get_cd_item_by_repo_name(cd_items=cd_items)
print(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want keep this (but this may be way too big in practice to print), guard it with a TRACE flag

#TODO: this is a bug, we need to log it
continue
purl = coordinate.to_purl()
# temporary hack iterate on all hashes until we store
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This four lines may no longer be of use? By constructions we always have one repo name at this stage

with open(scancode_scan_path, "w") as f:
json.dump(scancode_scan,f,indent=2)
repo.index.add([scancode_scan_path])
repo.index.commit(message=f"Add scancode-toolkit scan for {purl}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small twist there. Since we may have multiple scans for the same single purl (each with a different version of ScanCode), we would need to ensure that we add scans in the correct sequence, from oldest to newest. And it would make sense to add the ScanCode toolkit version used in the commit message as in Add scancode-toolkit {sctk_version} scan for {purl}



def is_valid_coordinate(coordinate):
return coordinate.type and coordinate.name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have no version, that's also a problem and we should treat these as invalid.

Copy link
Contributor

@JonoYang JonoYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TG1999 Sorry for the delay, I've left some comments

clearcode/store_scans.py Show resolved Hide resolved
clearcode/store_scans.py Outdated Show resolved Hide resolved
clearcode/store_scans.py Outdated Show resolved Hide resolved
clearcode/store_scans.py Outdated Show resolved Hide resolved
clearcode/store_scans.py Outdated Show resolved Hide resolved
clearcode/store_scans.py Outdated Show resolved Hide resolved
@pombredanne
Copy link
Member

@TG1999 gentle ping.

Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
@TG1999 TG1999 force-pushed the store_scancode_scans_locally branch from a158efc to d8b3fc4 Compare July 15, 2024 15:51
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants