
[2020-resolver] Pip downloads lots of different versions of the same package #9922

Closed
ifokeev opened this issue Apr 29, 2021 · 9 comments
Labels
type: bug A confirmed bug or unintended behavior

Comments

@ifokeev

ifokeev commented Apr 29, 2021

Description

Sorry to reopen issue #8713.
After upgrading pip, the Docker image takes much longer to build, even with strictly pinned versions. This looks like a bug, not a feature. The workaround is to use --use-deprecated=legacy-resolver and downgrade the pip version.

Expected behavior

I'm expecting a lightweight version comparison, not an exhaustive comparison of every package against every other package; that can take a very long time.

pip version

20.3.1

Python version

3.8

OS

Mac OS 10.15.7

How to Reproduce

Install any of your existing projects using both resolvers, i.e. with and without --use-deprecated=legacy-resolver, and compare the build times.

Output

No response


ifokeev added the "S: needs triage" and "type: bug" labels on Apr 29, 2021
@aptalca

aptalca commented Apr 29, 2021

Yup, I can confirm:
https://ci.linuxserver.io/blue/organizations/jenkins/Docker-Pipeline-Builders%2Fdocker-sickchill/detail/master/430/pipeline/123/

requirements.txt: give me ANY version of pytz since 2012

Pip: here's EVERY version of pytz since 2012 and it will add another hour to your build time.

@aptalca

aptalca commented Apr 29, 2021

Here's the dockerfile for that build for reference: https://github.com/linuxserver/docker-sickchill/blob/master/Dockerfile

@aptalca
Copy link

aptalca commented Apr 30, 2021

Here's the dependency map of the package above (sickchill):
https://pastebin.com/VK3fiP28

A dependency (subliminal) lists pytz>=2012c, and pip therefore tries to download every pytz version since 2012c. Two other dependencies also list pytz but do not pin a version. In that situation the latest version of pytz would suffice, yet pip still downloads everything since 2012c. Is that intended behavior?

@pradyunsg
Member

There's an entire section in pip's documentation about this. Please read https://pip.pypa.io/en/stable/user_guide/#dependency-resolution-backtracking.

Since GitHub hid the closing notes on #8713, I'll link to them directly as well: #8713 (comment). You can scan the discussion above that, where we've described the specific tracking issues for the various things.

@aptalca

aptalca commented May 2, 2021

@pradyunsg thanks, but I think you misunderstood my message.

I did read the docs and that thread you linked, before I posted my last message here and they don't really explain the issue I brought up.

The issue is that I'm trying to install a single package, sickchill, and three of its dependencies list pytz as a dependency (as shown in the pastebin link above and also below):

  • sickchill depends on Js2Py<0.72,>=0.70 depends on tzlocal>=1.2 depends on pytz
  • sickchill depends on twilio>=6.55.0 depends on pytz
  • sickchill depends on subliminal>=2.1.0 depends on pytz>=2012c

With the above info, the listings pytz, pytz and pytz>=2012c, I would expect pip to download the highest-ranked candidate, which is the latest version of pytz. Upon download, pip would check the dependencies of the latest pytz, see that it has none (and therefore no possibility of conflicts), and install it as the correct version.
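As a purely illustrative sketch of that expectation (not how pip's resolver actually decides), the `packaging` library can confirm that the newest pytz release satisfies the combination of the three constraints above; the "latest" version number below is an example:

```python
# Hedged illustration: combine the three pytz constraints from the list above
# and check that a recent release satisfies them all. Version is an example.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

combined = SpecifierSet("") & SpecifierSet("") & SpecifierSet(">=2012c")
latest = Version("2021.1")   # hypothetical newest pytz release
print(latest in combined)    # True: the newest candidate already satisfies everything
```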

As you can see in the build logs above, pip instead downloads all versions of pytz since 2012 without even checking any of the versions for dependencies and potential conflicts.

Is that the intended behavior? If so, what is the reasoning? I can't come up with one; there is literally no reason I can think of for pip to download all those versions.

Also, if a package lists just pytz as a dependency, doesn't that equate to pytz>=0? Yet pip doesn't attempt to download every version of pytz since the dawn of time. How come it blindly downloads all versions that fit pytz>=2012c, when the latest version would easily suffice and pip would know that instantly upon downloading that version and checking its (non-existent) dependencies?

I'm not trying to be difficult. It's just that this new behavior causes many other issues. If you look at the build log I linked above, you'll see that the downloads from PyPI get slower and slower and eventually halt (perhaps throttled?). I cancelled the build I linked to after 4 hours, and the one before that had been running for 15 hours (I thought the builder had crashed and cancelled that one), whereas the same build completes with pip 21.0.1 in just 15 minutes on a slow arm32v7 device, including building py-cryptography with rust/cargo, as seen here: https://ci.linuxserver.io/blue/organizations/jenkins/Docker-Pipeline-Builders%2Fdocker-sickchill/detail/master/427/pipeline/124

I get that your group's official stance is this is the new way, deal with it. But honestly, how are we supposed to deal with this issue when we can no longer install the packages? We (at linuxserver.io) maintain hundreds of docker images, many of which rely on pip to install packages.

Thanks

@pradyunsg
Member

pradyunsg commented May 2, 2021

Gah. Apologies. I should've spent more time on my comment, to be a bit more elaborate:

  • The main reason I closed this is not because this is "not a real issue" but because this is titled in a way to become a dumpyard for "me too, help me" comments. We already have one of those open right now: New resolver takes a very long time to complete #9187. Those comments are not useful for working on something that's as nuanced as the resolver's behavior, and having multiple places for people to go write them is not useful.
  • Dependency resolution is an NP-complete problem. We're well aware that the resolver's not behaving perfectly, but it literally can't be "good at every case" unless someone makes significant progress on one of the Millennium Prize Problems. (update: this might read like a cop-out that the resolver is already great but the problem is hard -- no, no: the resolver is solving a hard problem and is also not doing a great job in pedantic cases of that hard problem)

I get that your group's official stance is this is the new way, deal with it.

That's not the "official stance". Quoting myself from the specific comment I linked to already:

we do want to improve the backtracking logic to be more efficient and also want to make broader improvements toward making dependency information available without making complete downloads. All of those however, have separate tracking issues for them that are linked in the above discussion as well.

The reality is that there are costs to having multiple issues that effectively serve as a blanket for all kinds of weird things the dependency resolver might do (that would result in aggressive backtracking). They're usually not used for anything except folks saying "me too!", especially once we've broken the discussion out to other places. It adds to the maintenance overhead for this issue tracker and is not really useful from my PoV.

Your report is excellent, and significantly clearer than many of the ones we've received in the past, and it does seem to have very clear instructions on how to reproduce it. I've added it to my pile of existing excellent reports for where the resolver is just being stupid, and will likely test against it when we make improvements to validate that they actually improve things.

There is literally no reason I can think of for pip to download all those versions.

It's trying to be exhaustive and, for some reason, the specific package structure that you have is making it pick a bad requirement to backtrack on.

Honestly, there are significantly better things pip's resolver could do, the easiest examples being "CDCL" or "Tree-Pruning". While the resolver is operating on incomplete information and all that, it is also not remembering some of the useful bits of information that it could infer. The reality of it is that, well, it could be smarter and isn't; pip's maintainers know that.
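A toy sketch of that "remember what you inferred" idea behind tree-pruning (not pip's code; every name and version below is hypothetical): once two concrete pins are known to conflict, later branches containing both can be rejected without downloading or checking anything again.

```python
# Toy illustration of conflict memoization / tree-pruning, not pip internals.
from collections import namedtuple

Pin = namedtuple("Pin", "name version")
learned_conflicts = set()   # frozensets of Pins known to be incompatible

def record_conflict(pin_a, pin_b):
    """Remember that these two concrete picks cannot coexist."""
    learned_conflicts.add(frozenset((pin_a, pin_b)))

def is_doomed(candidate, chosen_pins):
    """Prune a candidate that clashes with something already pinned."""
    return any(frozenset((candidate, pin)) in learned_conflicts
               for pin in chosen_pins)

# After learning once that these two pins conflict, the next branch can
# reject the second pin immediately instead of rediscovering the conflict.
record_conflict(Pin("twilio", "6.55.0"), Pin("pyjwt", "2.1.0"))
print(is_doomed(Pin("pyjwt", "2.1.0"), [Pin("twilio", "6.55.0")]))  # True
```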

OTOH, I'm gonna take 5 minutes now and write about the "costs" of reopening such blanket-scope issues: instead of sitting down and making progress toward actually fixing these shortfalls, I've spent well over an hour drafting this comment, to make sure that I'm not contradicting something we've said already, because you used the phrase "official stance" and now I feel the need to be careful about what I say. And this whole thing has already demotivated me enough that I won't be working on pip's resolver stuff this weekend; I'll likely go do something else with my free time instead.

@pfmoore
Member

pfmoore commented May 2, 2021

As a related point, I have a side project where I'm trying to write a program that, when given a set of requirements, generates a report of the dependency graph in a form that makes it easier to diagnose these issues. Of course, that involves reproducing a big chunk of pip's logic (luckily there are libraries for a reasonable proportion of this), and it's not guaranteed that it will be much quicker than just running pip itself. (It may even be slower, since to get the full graph I can't prune the tree at all.)

Another diagnostic tool I want to try to write is something that runs resolvelib (the core of pip's resolver) on a dependency tree, to do quicker "offline analysis" of problems like this.
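For readers curious what such an "offline analysis" could look like, here is a minimal, hedged sketch that feeds a hand-written dependency table to resolvelib. It is not pip's code, the package data is invented for illustration, and the provider method signatures may differ slightly between resolvelib versions:

```python
# Minimal offline resolution over a toy, hand-written index (illustrative data only).
from packaging.requirements import Requirement
from packaging.version import Version
import resolvelib

# name -> version -> list of requirement strings (invented for this example)
INDEX = {
    "twilio":   {"6.55.0": ["pyjwt==1.7.1", "pytz"]},
    "pygithub": {"1.55": ["pyjwt>=2.0"]},
    "pyjwt":    {"1.7.1": [], "2.1.0": []},
    "pytz":     {"2021.1": []},
}

class Candidate:
    def __init__(self, name, version, requires):
        self.name, self.version, self.requires = name, version, requires

class Provider(resolvelib.AbstractProvider):
    def identify(self, requirement_or_candidate):
        return requirement_or_candidate.name.lower()

    def get_preference(self, identifier, *args, **kwargs):
        return identifier  # naive ordering; real providers use smarter heuristics

    def find_matches(self, identifier, requirements, incompatibilities):
        reqs = list(requirements[identifier])
        banned = {c.version for c in incompatibilities[identifier]}
        matches = []
        for version, requires in sorted(INDEX.get(identifier, {}).items(),
                                        key=lambda kv: Version(kv[0]),
                                        reverse=True):
            v = Version(version)
            if v in banned:
                continue
            if all(v in r.specifier for r in reqs):
                matches.append(Candidate(identifier, v, requires))
        return matches

    def is_satisfied_by(self, requirement, candidate):
        return candidate.version in requirement.specifier

    def get_dependencies(self, candidate):
        return [Requirement(r) for r in candidate.requires]

resolver = resolvelib.Resolver(Provider(), resolvelib.BaseReporter())
try:
    resolver.resolve([Requirement("twilio>=6.55.0"), Requirement("pygithub")])
except resolvelib.ResolutionImpossible as exc:
    for cause in exc.causes:
        parent = cause.parent.name if cause.parent else "the user request"
        print(f"{cause.requirement} (required by {parent})")
```

On this tiny index the resolver should surface the pyjwt conflict almost immediately, which is the kind of quick offline triage described above.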

But both of those projects take time to develop, and I only have limited free time. So progress is slow.

In your case, you've provided that info, thank you for that. For many reports, we don't have that level of detail. One point to note, though - you have sections in that report saying things like

├── twilio>=6.55.0                      Twilio API client and TwiML generator
│   ├── PyJWT==1.7.1                    JSON Web Token implementation in Python
│   ├── pytz                            World timezone definitions, modern and historical
│   ├── requests>=2.0.0                 Python HTTP for Humans.
│   │   ├── certifi>=2017.4.17          Python package for providing Mozilla's CA Bundle.
│   │   ├── chardet<5,>=3.0.2           Universal encoding detector for Python 2 and 3
│   │   ├── idna<3,>=2.5                Internationalized Domain Names in Applications (IDNA)
│   │   └── urllib3<1.27,>=1.21.1       HTTP library with thread-safe connection pooling, file post, and more.
│   └── six                             Python 2 and 3 compatibility utilities

This is over-simplified, in that twilio 6.56.0 could have different dependencies than 6.55.0, and pip has to check that level of detail as well. Doing so is usually redundant (it is for twilio, I believe) but not always, and it's one reason that even a detailed analysis can miss a problem (suppose twilio 6.57.0 had a dependency bug causing a conflict - we could try a lot of options before concluding that 6.57.0 was a lost cause and backtracking to 6.56.0).

@pfmoore
Member

pfmoore commented May 2, 2021

FWIW, I did some more digging on this, and the problem is that twilio requires pyjwt==1.7.1 but PyGithub requires pyjwt>=2.0. So the requirements are incompatible. It's visible in the data you supplied, but hard to spot unless you check everything carefully (I didn't, I got resolvelib to do it for me).
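A hedged illustration of that conflict (not what pip runs internally): intersecting the two pyjwt constraints with the `packaging` library leaves no admissible version; the candidate versions below are examples.

```python
# The two pyjwt requirements above cannot both be satisfied: their
# intersection admits no version at all. Candidate versions are examples.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

combined = SpecifierSet("==1.7.1") & SpecifierSet(">=2.0")
candidates = [Version(v) for v in ("1.7.1", "2.0.1", "2.1.0")]
print([str(v) for v in candidates if v in combined])  # [] -- empty, so no resolution exists
```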

And yes, I don't know why specifically pip is downloading many copies of pytz. Desperation, probably 😉

The problem here is that if we have incompatible requirements, then pip will, given enough time, download every version of every package involved in the installation - simply to check that there isn't some combination that has different requirements which are resolvable. So even though you and I know that there's no chance that a different version of pytz will fix the problem, pip can't know that.

This is typical of most "pip doesn't finish" problems - there's a conflict that pip can't resolve, so how long do we keep trying before we give up? We can't give useful diagnostics if we don't try everything, so stopping quickly (which is something we're considering) will just mean we get more people complaining "pip didn't tell me what was wrong when it gave up".

@aptalca

aptalca commented May 10, 2021

FWIW, I did some more digging on this, and the problem is that twilio requires pyjwt==1.7.1 but PyGithub requires pyjwt>=2.0. So the requirements are incompatible. It's visible in the data you supplied, but hard to spot unless you check everything carefully (I didn't, I got resolvelib to do it for me).

Ah, that makes a lot of sense. I'll look into that. Thanks

github-actions bot locked as resolved and limited conversation to collaborators on Sep 27, 2021
pradyunsg removed the "S: needs triage" label on Mar 17, 2023