SBOM from all-layers scope showing duplicate packages #32
Case: CentOS 7 with a Python RPM installed, plus a Python package called
A more common case: adding packages via a package manager causes the package DB itself to be duplicated across layers. However, this can be handled by setting the default scope to "squashed" instead of "all-layers".
What is the "correct" result? From a distro-package perspective, the version is the one identified by the package DB; from an application-package perspective, the version that other code would use to check that dependency is the app-level version. But we've seen cases where those versions don't align, due to distro backports that don't bump the version in the application package. Solutions I can think of:
Here is a prototype branch that proves out package deduplication (via package fingerprints): https://github.com/anchore/syft/compare/add-fingerprint (on #363)
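The fingerprint idea can be sketched roughly like this: hash the identity-bearing fields of a package while deliberately excluding location, so the same package found at different paths (or in different layers) yields the same fingerprint. The `Package` and `Fingerprint` names below are illustrative stand-ins, not Syft's actual types:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Package is an illustrative stand-in for a catalogued package;
// it is not Syft's actual type.
type Package struct {
	Name      string
	Version   string
	Type      string // e.g. "rpm", "python"
	Locations []string
}

// Fingerprint hashes the identity-bearing fields, deliberately
// excluding Locations, so the same package discovered at different
// paths (or in different layers) produces the same fingerprint.
func Fingerprint(p Package) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%s|%s", p.Name, p.Version, p.Type)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := Package{Name: "requests", Version: "2.25.1", Type: "python", Locations: []string{"/layer1/site-packages"}}
	b := Package{Name: "requests", Version: "2.25.1", Type: "python", Locations: []string{"/layer2/site-packages"}}
	fmt.Println(Fingerprint(a) == Fingerprint(b)) // true: same identity, different locations
}
```

Two packages that agree on everything but location collapse to one fingerprint, which is exactly the comparison the ID (with location baked in) cannot provide.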
This issue really only encapsulates the "duplicate" case described in #32 (comment) (duplicates arising from scope selection). If we find more cases involving package relationships, we can open issues accordingly.
PR #595 had code that could account for this functionality, but it was removed during review so as not to change existing package behavior while we made the impactful ID change. The specific behavior highlighted was the following: if we exclude location from the definition of a package and merge entries that are otherwise identical except for their location information (combining the location fields), then you may lose nuance in the set of distinct location sets. That is, if a package appears with two distinct location sets, the merged entry keeps only their union, and the original groupings are no longer recoverable.
This means the syft JSON output would be lossy in a way it hasn't been before, in terms of "how many times does this package appear in the source I'm analyzing?". To balance this view: without a way to easily compare the "sameness" of packages independent of location, there isn't a good way to determine this today either. For example, given two packages with the same name and version, how would you determine whether they are the same package in different locations? Comparing package IDs would not quite do the trick, since location is a component of the ID. You would instead need to compare all attributes except location across all packages with the same name and version. Assuming we want the de-duplication functionality, there are two high-level options:
My vote is option
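The lossiness trade-off described above can be pictured with a minimal merge sketch: entries that are identical except for location are collapsed into one entry whose location set is the union, after which the per-occurrence count is gone. Again, the types and function names here are illustrative, not Syft's:

```go
package main

import (
	"fmt"
	"sort"
)

// Package is an illustrative stand-in, not Syft's actual type.
type Package struct {
	Name      string
	Version   string
	Type      string
	Locations []string
}

// key identifies a package by everything except location.
func key(p Package) string {
	return p.Name + "|" + p.Version + "|" + p.Type
}

// Merge collapses packages that differ only in location into a
// single entry whose Locations field is the union. After merging,
// the number of original occurrences can no longer be recovered
// from the output — this is the lossiness being discussed.
func Merge(pkgs []Package) []Package {
	byKey := map[string]*Package{}
	var order []string
	for _, p := range pkgs {
		k := key(p)
		if existing, ok := byKey[k]; ok {
			existing.Locations = append(existing.Locations, p.Locations...)
			continue
		}
		cp := p
		byKey[k] = &cp
		order = append(order, k)
	}
	out := make([]Package, 0, len(order))
	for _, k := range order {
		sort.Strings(byKey[k].Locations)
		out = append(out, *byKey[k])
	}
	return out
}

func main() {
	pkgs := []Package{
		{Name: "openssl", Version: "1.1.1", Type: "rpm", Locations: []string{"/layer1/var/lib/rpm"}},
		{Name: "openssl", Version: "1.1.1", Type: "rpm", Locations: []string{"/layer2/var/lib/rpm"}},
	}
	merged := Merge(pkgs)
	fmt.Println(len(merged), merged[0].Locations) // one entry, two locations
}
```

Note that the merged entry records two locations but gives no way to tell whether they came from two separate occurrences or one occurrence that already listed both paths.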
There is one more case that needs to be handled: how will duplicate synthetic information (data derived from the underlying raw data) be handled? CPEs and pURLs are examples of this synthetic data. In theory we shouldn't see CPEs and pURLs that differ from package to package if the fields that make up those packages are the same. However, since packages can come in via the SPDX or CycloneDX decoding paths, it is very possible for other tooling to introduce duplicate packages whose underlying fields match but whose CPEs/pURLs differ (for what reason? unclear... but the execution path is now possible). Possible options:
Option
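One of the possible handlings of divergent synthetic data can be sketched as a simple union: when two otherwise-identical packages disagree on their CPE (or pURL) lists, keep both sets rather than letting one silently win. This is only an illustration of that option, not Syft's behavior:

```go
package main

import (
	"fmt"
	"sort"
)

// unionStrings merges two string slices, dropping duplicates, so
// differing CPE (or pURL) sets from otherwise-identical packages
// are both preserved in the deduplicated entry.
func unionStrings(a, b []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range append(append([]string{}, a...), b...) {
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Two decodings of the "same" package arriving with different CPEs,
	// e.g. via the SPDX and CycloneDX input paths (hypothetical values).
	fromSPDX := []string{"cpe:2.3:a:vendor:pkg:1.0:*:*:*:*:*:*:*"}
	fromCDX := []string{"cpe:2.3:a:project:pkg:1.0:*:*:*:*:*:*:*"}
	fmt.Println(unionStrings(fromSPDX, fromCDX))
}
```

The trade-off of this option is the mirror of the location merge: the output no longer records which input document contributed which identifier.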
From some offline conversation: it would be good to keep this centered on deduplication of the same paths across multiple layers; we can always expand it to cover different paths within the same layer (or different layers) in the future.
The set of analyzers will surface packages based on the small set of rules each analyzer is coded to enforce. This may surface multiple packages from the same underlying source (e.g. the Python egg-info analyzer picks up a package that was also picked up by dpkg).
Note: this behavior should be optional via configuration and CLI options, defaulting to not deduplicating packages.