
SBOM from all-layers scope showing duplicate packages #32

Closed
wagoodman opened this issue Jun 1, 2020 · 8 comments · Fixed by #930
Labels
bug Something isn't working

wagoodman commented Jun 1, 2020

The set of analyzers surfaces packages based on the small set of rules each analyzer is coded to enforce. This may surface multiple packages from the same underlying source (e.g. the Python egg-info analyzer picks up a package that was also picked up by dpkg).

Note: this behavior should be optional via configuration and CLI options, defaulting to not deduplicating packages.

wagoodman commented Jul 27, 2020

Case: CentOS 7 installs a Python RPM plus a Python package named "Python", and the two conflict (seen in anchore-engine).

wagoodman commented Aug 3, 2020

A more common case: adding packages via a package manager causes the package DB to be duplicated across layers. However, this can be handled by setting the default scope to "squashed" instead of "all-layers".

zhill commented Jan 20, 2021

What is the "correct" result? From a distro package perspective, the version is the one identified by the pkgdb; from an application package perspective, the version that other code would use to check that dependency is the app-level version. We've also seen cases where those versions don't align, due to distro backports that don't bump the version in the application package.

Solutions I can think of:

  1. Indicate one package manages another, allowing downstream users (like vuln scanner) to skip matches against managed packages.
  2. Always defer to the managing package, remove any listing that isn't independent.
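Option 1 could be sketched roughly as follows. This is a minimal illustration, not syft's actual data model; the `Package`, `Ownership`, and `scannable` names are all hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Package:
    name: str
    version: str
    type: str  # e.g. "rpm", "python"

@dataclass(frozen=True)
class Ownership:
    """Hypothetical record indicating one package manages (installed) another."""
    parent: Package
    child: Package

def scannable(packages, ownerships):
    # Packages a downstream vuln scanner should match against:
    # skip any package that is managed by another package (option 1).
    managed = {o.child for o in ownerships}
    return [p for p in packages if p not in managed]

# An RPM that installs a Python package; the scanner matches only the RPM.
rpm = Package("python-requests", "2.6.0-10.el7", "rpm")
pip = Package("requests", "2.6.0", "python")
result = scannable([rpm, pip], [Ownership(parent=rpm, child=pip)])
```

With this shape, downstream consumers keep full visibility into everything that was cataloged while still being able to skip matches against managed packages.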

wagoodman commented Jul 2, 2021

Here is a prototype branch that proves out package deduplication (via package fingerprints): https://github.com/anchore/syft/compare/add-fingerprint (on #363)
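The fingerprint idea can be sketched as hashing every package field except location, so the same package discovered at different paths or layers produces the same fingerprint. This is an illustrative sketch with dict-shaped packages, not the prototype's actual code:

```python
import hashlib

def package_fingerprint(pkg: dict) -> str:
    # Hash every field except "locations" (sorted for a stable ordering),
    # so location differences alone never change the fingerprint.
    material = "|".join(f"{k}={pkg[k]}" for k in sorted(pkg) if k != "locations")
    return hashlib.sha256(material.encode()).hexdigest()

# Same package found in two different layers: the fingerprints match.
a = {"name": "requests", "version": "2.6.0", "type": "python",
     "locations": ["/layer1/usr/lib/python2.7/site-packages"]}
b = {"name": "requests", "version": "2.6.0", "type": "python",
     "locations": ["/layer2/usr/lib/python2.7/site-packages"]}
same = package_fingerprint(a) == package_fingerprint(b)
```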

wagoodman commented Aug 17, 2021

This issue really only covers the "duplicate" case described in #32 (comment) (duplicates arising from scope selection). If we find more cases for package relationships, we can open issues accordingly.

wagoodman commented Mar 7, 2022

PR #595 had code that could account for this functionality, but it was removed during the review process so as not to change existing package behavior while we made the impactful ID change.

The specific behavior that was highlighted was the following: if we exclude location from the definition of a package and merge entries that are otherwise identical except for the location information (thus combining the location fields), then you may lose nuance in the set of distinct location sets. That is, if package 1's locations were [A, B, C] and package 2's locations were [C, D, E], then with the (flat) combined locations of [A, B, C, D, E] we would lose the nuance that:

  1. C is a shared location between both sets
  2. there were originally two ways this package was discovered; you can't tell if this package was discovered one time, two times, ten times, etc.

This means that the syft json output would be lossy in a way that it hasn't been before in terms of "how many times does this package appear in the source I'm analyzing". To balance this view: without a way to easily compare the "sameness" of packages independent of location, there isn't a good way to determine this today either. For example, given two packages with the same name and version, how would you determine if they are the same package in different locations? Comparing package IDs would not quite do the trick, since location is a component of the ID. This would mean you would need to compare all attributes except for location between all packages with the same name and version.

Assuming we want the de-duplication functionality, there are two high-level options:

  • a. merge the two location sets into a flat set and ignore the "count" use case for now. This can be enhanced later on an as-needed basis.
  • b. start designing for additionally capturing "count" information as part of this work (by adding an additional count field, or by breaking location into a list of location lists to indicate "count").

My vote is option a, for a few reasons: it is simple, I haven't seen anyone with the "count" use case yet, and we are not cornered; we could still enhance the syft output later to include "counts".
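Option a amounts to grouping packages on every field except location and unioning the location sets, so the [A, B, C] and [C, D, E] example above collapses to one package with [A, B, C, D, E]. A minimal sketch with dict-shaped packages (illustrative, not syft's implementation):

```python
def merge_duplicates(packages):
    # Group on every field except "locations" and union the location sets.
    # The "how many times was this discovered" nuance is intentionally dropped.
    merged = {}
    for pkg in packages:
        key = tuple(sorted((k, v) for k, v in pkg.items() if k != "locations"))
        if key in merged:
            merged[key]["locations"] = sorted(
                set(merged[key]["locations"]) | set(pkg["locations"])
            )
        else:
            merged[key] = {**pkg, "locations": sorted(set(pkg["locations"]))}
    return list(merged.values())

pkgs = [
    {"name": "x", "version": "1.0", "locations": ["A", "B", "C"]},
    {"name": "x", "version": "1.0", "locations": ["C", "D", "E"]},
]
result = merge_duplicates(pkgs)
```

Note that after the merge there is no way to recover that location C appeared in both original sets, which is exactly the lossiness described above.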

wagoodman commented Mar 7, 2022

There is one more case that needs to be handled: how will duplicate synthetic information (data derived from the underlying raw data) be handled?

CPEs and pURLs are examples of this synthetic data. In theory we shouldn't see CPEs and pURLs that differ from package to package if the fields that make up those packages are the same. However, since packages can also come in via the SPDX or CycloneDX decoding paths, it is very possible for other tooling to introduce duplicate packages whose underlying fields match but whose CPEs/pURLs differ (for what reason? unclear... but the execution path is now possible).

Possible options:

  • a. include synthetic information in the definition of a package ID: this has some complications, as it means catalogers would be responsible for assigning this synthetic information for it to be included in the package ID (the catalogers assign the ID). It would be possible to leave this empty; however, you could then get different IDs for the same package depending on the execution path (from a cataloger vs. from an SBOM decode operation).
  • b. still exclude synthetic information from the definition of a package ID, additionally merge the two field sets: CPEs and pURLs would be put into sets and combined onto a single definition of the package.

Option b here makes the most sense to me since we'd be consistently handling data that isn't considered in the ID system: combine it in a merge operation, regardless of whether the field in question is "location", "CPE", "pURL", or something else in the future.
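Under option b, CPEs and pURLs get the same treatment as locations during a merge: any field excluded from the ID is unioned, and everything else must already agree. A hypothetical sketch (field names illustrative):

```python
ID_EXCLUDED = ("locations", "cpes", "purls")  # synthetic / ID-excluded fields

def merge_package(a: dict, b: dict) -> dict:
    # The two packages must already agree on every ID-relevant field;
    # the excluded fields are simply unioned, regardless of which field it is.
    core = lambda p: {k: v for k, v in p.items() if k not in ID_EXCLUDED}
    assert core(a) == core(b), "packages differ on ID-relevant fields"
    merged = dict(a)
    for f in ID_EXCLUDED:
        merged[f] = sorted(set(a.get(f, [])) | set(b.get(f, [])))
    return merged

# Duplicate packages from different sources with differing CPEs:
a = {"name": "x", "version": "1.0", "cpes": ["cpe:2.3:a:x:x:1.0"],
     "purls": ["pkg:generic/x@1.0"], "locations": ["A"]}
b = {"name": "x", "version": "1.0", "cpes": ["cpe:2.3:a:vendor:x:1.0"],
     "purls": ["pkg:generic/x@1.0"], "locations": ["B"]}
merged = merge_package(a, b)
```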

@wagoodman

From some offline conversation: it would be good to keep this centered on deduplication of the same paths across multiple layers; we can always expand this to include different paths within the same layer (or different layers) in the future.

@wagoodman wagoodman changed the title Deduplicate package catalog SBOM from all-layers scope showing duplicate packages Mar 30, 2022
@wagoodman wagoodman added the bug Something isn't working label Mar 30, 2022