-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store scancode scans locally #235
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! here is some review.
clearcode/store_scans.py
Outdated
|
||
|
||
def get_cd_item_by_repo_name(cd_items): | ||
cd_item_by_repo_name = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a defaultdict(list) to simplify the code in https://github.com/nexB/purldb/pull/235/files#diff-a1fd023bd42d73f56019d540f38be711255403547add15108540d70f9948dd40R106
Also since you are also computing the purl here, you could also keep it.
So you return two mappings instead:
- cd_items_by_repo:
{repo_name: [list of CDitem pk/ids,...]}
- purl_by_cd_item_id:
{CDitem pk/id: computed purl}
This is still manageable at scale: with 20M items, the first mapping may be would be under 2GB with and 100 bytes storage per item, and the second would be about 6GB with 300 bytes storage per item.
(I reckon that this second purl mapping may be premature optimization though)
clearcode/store_scans.py
Outdated
Save and commit them in git repositories in work dir | ||
""" | ||
cd_items = CDitem.objects.filter(~Q(content=b''), path__contains="tool/scancode") | ||
x = get_cd_item_by_repo_name(cd_items=cd_items) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call this cd_items_by_repo_name ?
Also you cannot store a whole list of actual objects, as I think you will exhaust memory with this approach. Instead, a way that will use the least possible amount of memory would be to keep a mapping of {repo_name: [list of CDitem pk/ids,...]}
and create a second query set to iterate over in https://github.com/nexB/purldb/pull/235/files#diff-a1fd023bd42d73f56019d540f38be711255403547add15108540d70f9948dd40R75 with a filter(id__in=cd_item_ids)
clearcode/store_scans.py
Outdated
""" | ||
cd_items = CDitem.objects.filter(~Q(content=b''), path__contains="tool/scancode") | ||
x = get_cd_item_by_repo_name(cd_items=cd_items) | ||
print(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want keep this (but this may be way too big in practice to print), guard it with a TRACE flag
clearcode/store_scans.py
Outdated
#TODO: this is a bug, we need to log it | ||
continue | ||
purl = coordinate.to_purl() | ||
# temporary hack iterate on all hashes until we store |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This four lines may no longer be of use? By constructions we always have one repo name at this stage
clearcode/store_scans.py
Outdated
with open(scancode_scan_path, "w") as f: | ||
json.dump(scancode_scan,f,indent=2) | ||
repo.index.add([scancode_scan_path]) | ||
repo.index.commit(message=f"Add scancode-toolkit scan for {purl}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a small twist there. Since we may have multiple scans for the same single purl (each with a different version of ScanCode), we would need to ensure that we add scans in the correct sequence, from oldest to newest. And it would make sense to add the ScanCode toolkit version used in the commit message as in Add scancode-toolkit {sctk_version} scan for {purl}
clearcode/store_scans.py
Outdated
|
||
|
||
def is_valid_coordinate(coordinate): | ||
return coordinate.type and coordinate.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have no version, that's also a problem and we should treat these as invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TG1999 Sorry for the delay, I've left some comments
@TG1999 gentle ping. |
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
a158efc
to
d8b3fc4
Compare
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
Signed-off-by: Tushar Goel <[email protected]>
This PR allows to store scans locally using a PURL hash-based distribution of scans in multiple directories to avoid stressing file systems with too many files in one directory.
The goal is also to store these in Git repositories