
Add --json flag to print download information #5400

Closed
wants to merge 1 commit

Conversation

nataliejameson

This adds a --json flag to pip download. This prints out some basic information about the packages resolved by pip download (acknowledging that it may not be 100% correct because pip), such as name, version, url and dependencies.

Also added a --log-stderr flag to force info-level messages to go to stderr, so that machine-readable output is not clobbered by human-readable text.

Added and updated unit tests to make sure it works. Sample run through jq here: https://gist.github.com/philipjameson/88580962e0fed1ad9ec75f1733c0a625

Fixes #5398
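
For illustration, a rough sketch of how a calling program might consume the proposed output. The --json and --log-stderr flags are the ones added by this PR; the exact keys (name/version/url/dependencies) are an assumption based on the description and the gist above, not a finalised schema.

    # Hypothetical usage sketch: the schema (name/version/dependencies) is
    # assumed from the PR description and gist, not a finalised format.
    import json
    import subprocess
    import tempfile

    with tempfile.TemporaryDirectory() as dest:
        proc = subprocess.run(
            ["pip", "download", "requests", "--json", "--log-stderr", "-d", dest],
            capture_output=True, text=True, check=True,
        )
        # Human-readable progress went to stderr; stdout holds only the JSON.
        for pkg in json.loads(proc.stdout):
            print(pkg["name"], pkg["version"], pkg.get("dependencies", []))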

@nataliejameson
Author

The JSON schema is of course entirely up for debate. I just went with what seemed reasonable.

@cooperlees
Contributor

The joys of different dep lists between Python 2 and Python 3 🐍

@nataliejameson nataliejameson force-pushed the add_json_flag branch 2 times, most recently from fd30e28 to 9c6eb9e on May 15, 2018 at 22:25
@pradyunsg pradyunsg added the S: needs triage Issues/PRs that need to be triaged label May 16, 2018
@cooperlees
Contributor

Ping. Would love to get feedback on this so I can gauge whether it's likely to be accepted, as I want to use it internally at Facebook. So does @philipjameson :) 👏

@@ -222,13 +232,39 @@ def run(self, options, args):
resolver.resolve(requirement_set)

        downloaded = ' '.join([
-           req.name for req in requirement_set.successfully_downloaded
+           req.name
+           for req in requirement_set.successfully_downloaded
Member

Unrelated change? I'm not too worried, but it's easier to review if you don't include unneeded cosmetic changes.

        if options.json:
            details = self._get_download_details(
                resolver, requirement_set, options.download_dir)
            print(json.dumps(details))
Member

I'm concerned about this - we never use raw "print" statements anywhere else in pip to generate output (see pip list for a concrete example, which uses logger.info). This PR should use the same mechanisms as the rest of pip does.

        action='store_true',
        default=False,
        help=("Output information about affected packages "
              "in an easily parseable json format"),
Member

"affected"? Surely "downloaded", as this is only for pip download? Also, describing JSON as "easily parseable" is unnecessary...

@pfmoore
Member

pfmoore commented May 18, 2018

I've added some review comments. I'm not keen on this PR as a whole - the point you make about "acknowledging that it may not be 100% correct because pip" concerns me, as it's not clear whether the average user would appreciate the limitations. At a minimum, there should probably be some documentation explaining where the data might be wrong and why, so that people know when they can rely on it.

I won't block this PR if one of the other pip committers feels it's worth merging, but I probably won't do anything further on it myself.

@cooperlees
Contributor

Thanks for the comments @pfmoore, and I agree with most of them. I'll leave the cleanup to @philipjameson.

print was used here on purpose: we want the JSON output to go to stdout and human-readable output and errors to go to stderr, keeping the two separate. This is based on how most UNIX utilities work. I feel this is the right thing to do here so that the output is parsable by a calling program without warnings and the like breaking the JSON formatting.

Documentation is a good point. I will work on getting some documentation up once this lands/is accepted. I will describe:

@pfmoore
Member

pfmoore commented May 18, 2018

I'm perfectly OK with writing to stdout but internally we don't use print for that in pip (as per my reference to pip list, we typically use logger.info to write to stdout). I think we should be consistent for maintainability.

@nataliejameson
Author

As for printing to stdout, the reason I wasn't using logger.info (and indeed added a flag to force info-level logging to go to stderr instead) is that there are a number of other places where we write human-readable text at info level (e.g. "Downloaded X (from Y)") that you don't want ending up on the same stream as data you're trying to parse. I'll see if I can find a decent way to set up the logging module to handle that split.
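
For what it's worth, the general shape of that split (a minimal sketch, not pip's actual logging setup) is to route all human-readable log records to stderr and reserve stdout for the machine-readable payload:

    # Minimal sketch of the stdout/stderr split under discussion; this is not
    # pip's internals, just an illustration of the general approach.
    import json
    import logging
    import sys

    logger = logging.getLogger("downloader")
    handler = logging.StreamHandler(sys.stderr)      # human-readable text -> stderr
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("Downloaded example-pkg (from https://example.invalid/...)")
    # stdout carries only the structured payload, so piping into jq stays safe.
    sys.stdout.write(json.dumps([{"name": "example-pkg", "version": "1.0"}]) + "\n")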

@BrownTruck
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master, then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label May 19, 2018
@pradyunsg
Member

I won't block this PR if one of the other pip committers feels it's worth merging, but I probably won't do anything further on it myself.

I am a little short on time and I have some concerns here:

  • exposure of resolver's internal data
  • there's logging-related changes
  • yet another method added to InstallRequirement

I understand that some of these are probably needed here but I don't really have the bandwidth to hash out the details here.

@cooperlees
Contributor

cooperlees commented May 20, 2018

@pradyunsg - I know you say you do not have time, but it's hard to act on your comments to make the PR more acceptable.

Can you or someone offer some suggestions? My questions/statements are:

  • What's the risk of exposing the resolver's internal data? What do you think we should not share? All I really need is name and version.
  • The logging changes make sense so that other data cannot bleed into the JSON output.
  • Is there some refactoring we can follow up with to reduce the burden of adding another method to InstallRequirement?

Also, happy to meet in IRC tomorrow to chat this out and see if we can address everyone's concerns! I'll keep it open from 9am Pacific time.

@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 21, 2018
@nataliejameson
Author

@dstufft Any chance you could take a look at this conceptually? Is there someone else who would be best to review this? (Sorry, I'm not really familiar with pypa's structure.) I'd rather not have to maintain a fork of pip if I can avoid it, but the information currently returned is just not sufficient for the project I'm working on, nor is it for @cooperlees afaik.

I'm more than open to getting things fixed up, throwing up whatever red flags / warnings / documentation you all want, etc., but I'm not sure there's anything super actionable here so far, and I don't even have an inkling of whether it'd be likely to be accepted.

@pfmoore
Member

pfmoore commented May 27, 2018

To clarify the organisational position, you need approval from at least one of the @pypa/pip-committers to get a PR merged. Both @pradyunsg and myself are pip committers and we've both expressed reservations. @dstufft is the only other committer who's active at the moment (we're all volunteers, so other commitments affect everyone's availability). So while pinging @dstufft is the best option for getting another opinion, you're basically hoping he'll be happy enough with this to override the reservations of the other committers...

I'll do my best to summarise the concerns raised by @pradyunsg and myself, and try to make them more actionable for you. But that doesn't constitute a commitment to merge if these are addressed - I remain unclear as to the actual real-world situation this is intended to address, and why this is the best option (or even why this needs to be in pip and can't be handled by a 3rd party program).

  1. (From me) The PR still seems to be "fighting" pip's standard logging - you've removed the use of raw print, but you're using a non-standard pip.__structured_output logger, and you require the user to pass --log-stderr to avoid corrupt output. To me, this indicates that the solution isn't properly designed, but simply patched on top of the existing code.
  2. (From me) "acknowledging that it may not be 100% correct because pip" - still not addressed. I don't like the idea of providing machine-parseable data that we know may be wrong. That seems like a recipe for bugs in people's scripts (that will inevitably be reported to us as "pip reported the wrong data").
  3. (From me) Documentation - needs to be added. You're saying you'll add documentation "once this is accepted", whereas I'm saying it shouldn't be accepted without documentation. Sorry, but if you want my approval, you'll need to do that work up front.
  4. (From @pradyunsg) exposure of resolver's internal data - The problem here is that when (not if!) we replace the existing resolver, the way we calculate dependencies may change significantly enough that we aren't even able to produce the data that this PR currently produces. We need at a minimum a transition plan for that situation. Dropping the feature just after it's been provided because we can no longer deliver it, isn't exactly helpful to our users. To put it another way, by providing this feature we're stating that the data it provides is part of pip's documented interface, and it will be subject to our deprecation process if it changes. It's not at all obvious from what's here, how that will limit our ability to change the resolver. Essentially, we've gone through a lot of grief recently enforcing the principle that pip's internals are not something that 3rd parties can rely on, as we can't support that usage. Giving access to the internals via the command line doesn't suddenly mean that we can.
  5. (From @pradyunsg) there's logging-related changes - I think this is the same as (1) above from me.
  6. (From @pradyunsg) yet another method added to InstallRequirement - again, I believe this relates to the planned internal move to a full resolver. The InstallRequirement class is key to that, as it is to a lot of pip's internals, and it's already very complex (and as such a maintainability problem). The new property you add (and the supporting code for it) exposes more of InstallRequirement's internal state, making it harder to refactor the class. Part of this is the same point as (4) - we don't want to expose (and hence freeze) more of the internal workings than is necessary - and part of it is the maintenance cost of tighter coupling and extra complexity in the internals.

So, for some concrete actions:

  1. Provide the real-world use case, and explain what other options you have considered, and why they are not acceptable. In particular, why can't you write this as a standalone script? We've done a lot of work to standardise the interfaces pip uses to packaging metadata, and if we're still in a position where users can't work with that data directly, and are blocked needing new features in pip, then the standards work hasn't achieved its goals - so we'd like to understand why.
  2. Add documentation to the PR.
  3. Clarify what you mean by the output possibly not being 100% correct, and clarify how you expect to make this workable for users (who will assume it is reliable) and the pip maintainers (who will have to deal with any confused users).
  4. Change the way the JSON is output to fit better with the existing UI. This may mean writing the JSON to a file rather than to stdout, or modifying (remembering deprecation policies) the way the standard download command uses stdout, or having the --json option force all user output onto stderr (that will likely break people who expect success to equate to "nothing on stderr"). The design is up to you, but needs to take into account the comments above.
  5. Write up (as a comment here) how you expect this PR to interact with the planned change in pip's resolver. When we make the change, how will we maintain the output of the --json flag? And yes, I know the new resolver isn't fully specified yet. That's sort of the key point we're making here about not exposing internals :-)

I appreciate that's a lot of work, and I'm perfectly OK if you feel that it's more than you're willing to do. There's probably a certain level of compromise that might be reached - if the use case is compelling enough and the risks of this blocking the new resolver work are low enough, we could consider getting something in now and refining it later, for example. But the starting point has to be an attempt to address at least some of the above concerns.

@pradyunsg
Member

Thanks @pfmoore for elaborating my comment correctly. :)

Additionally, about my first and third points:

  • Instead of exposing the entire internal dictionary from the Resolver, just pass the name into the method and get the candidates within the Resolver itself (this translates into a method on one of the classes in the new resolver's currently planned design); that would resolve my first concern.

  • Everything that @pfmoore said about InstallRequirement, plus populate_link setting an attribute other than link, which adds more "action at a distance" to that class -- this is a maintenance and design concern.

@nataliejameson
Author

Thanks both of you for responding. I'll dig through that reply in a bit, but to clarify the use case I'm working with: I'm looking at representing what pip resolves in a different polyglot build system that we use at FB (http://buckbuild.com/). I'd like a person to be able to say "I want to generate build files for the equivalent of pip install foo" and have those build files generated for them. In order for that to work, I need to know where to pull the source files from (so we can get hashes of the downloads for reproducible, hermetic builds), and the dependencies of each library, so that you can just use foo and automatically get its dependency bar. The versioning would be nice because in some cases we allow users to pick different major revs, and we need to make sure the transitive closures work as the user expects.

If there's already a clean interface to do this, I can wrap this up some other way, but I guess it wasn't super clear to me that any of this resolution information was exposed anywhere.

@pfmoore
Member

pfmoore commented May 27, 2018

At the risk of sounding glib, it seems to me that you're trying to write something that wants to resolve dependencies the same way pip does, but rather than writing it yourself you want to reuse pip's logic. While I can see the reason you'd want to do that, it very definitely is something we don't want to support in pip - we don't expose a Python-accessible API for pip precisely for this reason. Some possible alternatives for you:

  1. The "new resolver" (https://github.com/pradyunsg/zazo), when it becomes available, will have a public API. Maybe you could wait until that's available and use that?
  2. If all you want is to "do what pip does", why not use pip wheel foo -w wheelhouse to create a wheelhouse of all the files needed to install foo and its dependencies, and then pip install wheelhouse/*.whl to install them (a rough sketch follows below)? Your build system would presumably replace the pip install with its own unpack-and-move code to install the wheels.
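
A rough sketch of that wheelhouse workflow, driven from Python for concreteness ("foo" is a placeholder project name; this is just the two commands above wrapped in subprocess, not anything pip itself provides):

    # Sketch of the suggested workflow: build a wheelhouse, then install from it.
    # "foo" is a placeholder; swap in the real top-level requirement.
    import glob
    import subprocess
    import sys

    pip = [sys.executable, "-m", "pip"]
    subprocess.run(pip + ["wheel", "foo", "-w", "wheelhouse"], check=True)

    wheels = glob.glob("wheelhouse/*.whl")   # one wheel per project pip resolved
    # --no-index/--find-links keep the install restricted to the wheelhouse contents.
    subprocess.run(pip + ["install", "--no-index", "--find-links", "wheelhouse"] + wheels,
                   check=True)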

@nataliejameson
Author

Yes, I absolutely want to reuse pip's work, because it's a bit mercurial and otherwise I'd have to make sure I mirrored every change to an internal resolver in my code. What's the (rough) timeline on this new resolver?

As far as just running the pip commands you mentioned, I don't see any way to determine where the source came from other than trying to parse the 'https' URLs out of the output and hoping that's correct. I suppose I could do that and just fail if pip ever decides to change its output. This matters because just running pip wheel is /suuuupppper/ non-deterministic, since it's reaching out to the internet. With the commands you gave, I don't think there's any way to actually replace pip install without writing parsing logic to find out the actual dependencies.

At the end of the day, I can do a bunch of hacks if I have to. This seems like a cumbersome way to have to use pip, but sure, it can be done.

@pfmoore
Member

pfmoore commented May 27, 2018

I'd have to make sure I mirrored every change to an internal resolver in my code.

Yep, and that's why we don't want to expose the internal workings of the current resolver, because then we wouldn't be able to change them.

What's the (rough) timeline on this new resolver?

No idea really. We'd like to get it in in one of the next couple of releases, but volunteer resources make that at best a wild guess.

One thing that would help me better understand what you're trying to achieve here is if you could demonstrate (in pretty-printed form!) the output you'd expect to see for a project with reasonably complex dependencies - something like Jupyter, maybe. I could probably reverse engineer that myself, or apply your PR and look, but I'm not sure when I'll have time to do that. Maybe if I was clearer on what you want to see, it would alleviate some of my concerns.

@nataliejameson
Author

I'm a little confused by the resolver comment. If the data exposed to users is limited, I'm not quite sure where the difficulty in changing the backend would be, unless the thought is that at some point the resolver wouldn't resolve dependency information, or track where it got it from? I think I'm missing a piece here.

https://gist.github.com/philipjameson/1900c6b4c5b89c5b729ebc648c0bfa51 is an example for jupyter (it'd work the same for --no-binary, but jupyter was being a bit persnickety with one of its deps not working with setuptools properly o_O)

@pfmoore
Member

pfmoore commented May 27, 2018

OK, so:

  1. download_path doesn't tell you anything you can't work out yourself, surely (from the project name and the fact that you know where you told it to download to).
  2. URL isn't resolver information, you can get that just by querying PyPI.
  3. Version you can get from the downloaded filename.
  4. Dependencies come from the metadata in the file (for sdists you may need to run setup.py egg_info).

You may have to duplicate some of the work pip did (querying PyPI, extracting metadata) but I don't see anything there that you can't get by introspecting the directory you downloaded the files to (assuming that directory was empty beforehand).
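
For concreteness, a sketch of that directory-introspection route (standard library only; it assumes the downloads are wheels - sdists would additionally need setup.py egg_info to produce comparable metadata):

    # Read name, version and Requires-Dist straight out of downloaded wheels.
    import email.parser
    import os
    import zipfile

    def wheel_metadata(path):
        # Wheel filenames start with <name>-<version>-...
        name, version = os.path.basename(path).split("-")[:2]
        with zipfile.ZipFile(path) as whl:
            meta_name = next(n for n in whl.namelist()
                             if n.endswith(".dist-info/METADATA"))
            meta = email.parser.Parser().parsestr(whl.read(meta_name).decode("utf-8"))
        return {
            "name": meta["Name"] or name,
            "version": meta["Version"] or version,
            "dependencies": meta.get_all("Requires-Dist") or [],
        }

    download_dir = "downloads"               # the directory passed to `pip download -d`
    for fname in os.listdir(download_dir):
        if fname.endswith(".whl"):
            print(wheel_metadata(os.path.join(download_dir, fname)))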

Which doesn't mean that there's no possibility that it could be added to pip, just that the bar is higher if it's stuff you can get yourself already.

And I'm still not clear what you'll be doing with the downloaded files once you get that information. You're using the JSON to feed into a system that seems to re-download everything, so are you just throwing away those downloads? If so, then making this an option of the download command seems pretty inappropriate...

@nataliejameson
Author

nataliejameson commented May 27, 2018

download_path doesn't tell you anything you can't work out yourself, surely (from the project name and the fact that you know where you told it to download to).

Nope, but it just seemed kind of handy if we're downloading anyway. Not super necessary, but it also reduces the need for globbing and for making sure that I'm rewriting filenames correctly (e.g. jupyter-core gets rewritten to jupyter_core-.). I think @cooperlees also mentioned that he had run into some cases where the filenames weren't as expected.

URL isn't resolver information, you can get that just by querying PyPI.

No, but it reduces extra lookups and indirection. I hadn't seen the json API, so maybe this isn't quite as necessary.

Version you can get from the downloaded filename

Is this guaranteed to always be there for sdists? I could have sworn I had run into a case where it wasn't right, but I may be misremembering or running into the naming issue above.

Dependencies come from the metadata in the file (for sdists you may need to run setup.py egg_info).

This is kind of the big one for me. Other than the fact that it seems like there are several different ways to find the dependencies (looks like setup.py, setup.cfg, as well as requirements files? I'll have to go dig in the pip code some more), it also means that I have to go write a dependency resolver when a perfectly good one already exists (and a better one will exist at some undetermined point in the future :) ).

As far as downloading the files, part of it was to verify hashes and potentially do some introspection on them (e.g. finding binaries and exposing them as binary build rules within the build system), but it was mostly to force the dependency resolver into action. Honestly, a pip resolve command could also be kind of nice (though would it not also require downloading things, due to how people tend to specify dependencies?), though it would probably suffer from similar problems.

@pradyunsg
Member

it seems like there are several different ways to find the dependencies

Just setup.py egg_info and directly from wheel metadata.

[version on sdist]

If I'm remembering correctly, pip doesn't really try to guess the version of an sdist from the file name - it runs setup.py egg_info to get it. There's an open issue for this somewhere in this tracker that suggests it should.

@pradyunsg
Member

Paul makes a good point - PyPI's JSON API should be a good way to get almost all of this information.

The only thing that'll be missing is the dependency information for sdists, which can be computed by executing setup.py egg_info.
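
For reference, a minimal sketch of that JSON API route (standard library only); the /pypi/<name>/<version>/json endpoint returns the release's download URLs and, where the project uploaded it, requires_dist:

    # Query PyPI's JSON API for a pinned release's URLs and declared dependencies.
    import json
    import urllib.request

    def pypi_release_info(name, version):
        url = "https://pypi.org/pypi/{}/{}/json".format(name, version)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return {
            "urls": [(u["filename"], u["url"]) for u in data["urls"]],
            # May be empty for projects that only publish sdists.
            "requires_dist": data["info"].get("requires_dist") or [],
        }

    print(pypi_release_info("requests", "2.18.4"))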

@nataliejameson
Author

While some of this stuff can indeed be determined from the JSON API, the important part - the dependency resolver - is not available anywhere, afaict. Parsing files and writing this myself, rather than just having the thing that ran the resolver tell me the right answer, seems like extra work that shouldn't really have multiple implementations, unless there's some other file outside the .egg-info directory that I'm missing.

So right now the only way to do this, so far as I can tell, is (and I really hope I'm missing something):

  • Run pip download/wheel
  • Extract package names and versions out of the filenames
  • Pull metadata for urls from the pypi json api
  • Setup a virtual env
  • Run pip install on the downloaded files
  • Run pip show on each of the packages to pull the dependency information
    -- These last two can be replaced by parsing out the egg_info/requires.txt ini file, though even then I'd have to know about all of the meta information that can be in that ini file, like python_version etc.

Is that roughly the suggested pattern?

@pradyunsg
Member

Cross Referencing: #53.

@pfmoore
Member

pfmoore commented May 28, 2018

So right now the only way to do this

Stop after "Pull metadata for urls from the pypi json api", and use the content of the directory, which will be one wheel (or sdist, if you use pip download rather than pip wheel) per project, to get the list of all projects that will be installed.

You can then build the dependency tree by using the dependency data from the projects to decide what projects depend on each other, and the versions you have extracted to determine which precise versions pip decided to use.
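
A sketch of that tree-building step, operating on per-project metadata already pulled out of the download directory (for example with the wheel reader sketched a few comments up); the input data here is made up, and names are normalised roughly along PEP 503 lines:

    # Build a (name, version) -> [(dep_name, dep_version), ...] map from the
    # metadata of the projects pip downloaded.
    import re

    def normalise(name):
        return re.sub(r"[-_.]+", "-", name).lower()     # rough PEP 503 normalisation

    def dependency_tree(projects):
        # projects: dicts with "name", "version", "dependencies" (Requires-Dist strings)
        resolved = {normalise(p["name"]): p["version"] for p in projects}
        tree = {}
        for p in projects:
            deps = []
            for req in p["dependencies"]:
                dep = normalise(re.split(r"[ ;(<>=!~\[]", req, maxsplit=1)[0])
                if dep in resolved:                      # keep only projects pip chose
                    deps.append((dep, resolved[dep]))
            tree[(normalise(p["name"]), p["version"])] = deps
        return tree

    print(dependency_tree([
        {"name": "pkg-a", "version": "1.0", "dependencies": ["pkg-b (>=2.0)"]},
        {"name": "pkg-b", "version": "2.1", "dependencies": []},
    ]))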

Note: I believe there have also been requests for a pip install --dry-run option. (although I can't find an open issue for it) That sounds like a better fit for this requirement, were we to go ahead with this.

@pradyunsg
Member

Note: I believe there have also been requests for a pip install --dry-run option. (although I can't find an open issue for it) That sounds like a better fit for this requirement, were we to go ahead with this.

I think that's the issue I linked to above -- it doesn't directly mention --dry-run though.

@pfmoore
Member

pfmoore commented May 28, 2018

@pradyunsg Thanks, I'd seen the discussions on that issue had moved on to pip search, but hadn't noticed it had come back to pip install. I've seen people mention an interest in something like this elsewhere, too.

@pradyunsg
Member

Yep -- I could find #1631 (comment) and #924 (comment).

@BrownTruck
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master, then it will be eligible for code review and hopefully merging!

@pradyunsg
Member

Closing this since it's bitrotten. Please file a new PR if there's still interest in this.

@pradyunsg pradyunsg closed this May 23, 2019
@cooperlees
Contributor

cooperlees commented May 23, 2019

We had interest, but have gone other routes, as @pfmoore basically squashed this over a year ago.

If we could agree on a machine-parsable output format for download and other commands, we could look at this again for sure. I'll follow #6099, as that's interesting. cc: @philipjameson

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 22, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 22, 2019
@pradyunsg pradyunsg removed the S: needs triage Issues/PRs that need to be triaged label Feb 21, 2020