Build Logic Refactor #6607

Open · pradyunsg opened this issue Jun 14, 2019 · 22 comments
Labels: type: refactor (Refactoring code)

Comments

@pradyunsg (Member) commented Jun 14, 2019

This issue is for the broader design discussions around the overall build logic refactor work that I'll be doing.

Related: #5051
Tracking project board: https://github.com/pypa/pip/projects/3


@pypa/pip-committers I was thinking about this a bit, and IMO we should try to keep the "broad" discussion here, since that'd make it easier for me to look back and reference things later. I've also created a project board to better organize myself while working on this.

@pradyunsg (Member Author)

Just filed #6622 containing the next step - moving logic into SourceDistribution.

pradyunsg pinned this issue Jun 17, 2019

@cjerdonek (Member)

What is the relationship of this issue to the resolver issue (#6536)?

@pradyunsg (Member Author)

This cleans up things so that adapting pip to resolvelib's abstractions becomes easier.

Plus, I'll be less scared about having broken the general build process while swapping out the resolver logic. 🙃

pradyunsg unpinned this issue Jul 30, 2019

@chrahunt (Member) commented Sep 5, 2019

It's not really a design consideration, but the refactoring may be easier if the TODOs and FIXMEs in e.g. InstallRequirement were made into individual issues, so they could be addressed in turn.

@pradyunsg (Member Author) commented Sep 10, 2019

@cjerdonek I notice that my previous response is super terse, so I'll elaborate a bit.

I think this project is closely related to actually implementing the resolver.

We have to perform the fetch-build-query steps during our resolution process. Ideally, the resolver should not see that this is happening (i.e. it should operate in a more "abstract" space, like the resolvelib/zazo abstractions). This, however, also means that we need to make it easier to actually modify the resolution and build logic, which have been entangled in the codebase since I started working on pip. 🙃

In essence, what this project aims to do is make the build logic easier to reason about and follow, which in turn will make it easier to separate it from our resolution logic, and make life easier when we do go ahead with the new resolution logic -- #6536 is for figuring out the communication and overall approach for that.

Another thing I'll note: this was the main blocker/issue/tricky-thing-that-I-sunk-time-into when I was working on a weekend prototype (https://github.com/pradyunsg/pip/tree/resolver/make-it-inside). The build logic essentially had to be moved around a fair bit, and I realized that it'd be a worthwhile investment to decouple the two, and that refactoring/improving the build logic was a great way to do that. :)
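
For a sense of what that "abstract" space looks like: the surface resolvelib expects from the host tool is roughly the sketch below. The method names mirror resolvelib's AbstractProvider, but the signatures are simplified and the bodies are placeholders -- this is illustrative, not pip code.

# Schematic provider in the resolvelib style; method names follow
# resolvelib's AbstractProvider, everything else is a placeholder.
class PipProvider:
    def identify(self, requirement_or_candidate):
        # Canonical name for whatever the requirement/candidate refers to.
        return requirement_or_candidate.name

    def find_matches(self, requirements):
        # All candidates satisfying the given requirements. For pip, the
        # fetch/build/query work would be hidden behind this call.
        raise NotImplementedError

    def is_satisfied_by(self, requirement, candidate):
        # Version containment check, e.g. via packaging's SpecifierSet.
        return candidate.version in requirement.specifier

    def get_dependencies(self, candidate):
        # The requirements this candidate introduces (from its metadata).
        raise NotImplementedError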

@pradyunsg (Member Author)

@chrahunt and I had a couple of voice calls over the past few days, where we discussed this (pip's build logic). We discussed how the builds "flow" and what information we have available at each stage.

To that end, @chrahunt has a GitHub Gist containing both a mental model we came up with for how the build logic currently works, and a structured file that we could potentially generate insights from. All of this is roughly sketched out right now, and we're still figuring things out in greater detail.

Inputs are definitely welcome. :)

@pradyunsg (Member Author) commented Sep 10, 2019

I had an itch to scratch w/ graphviz and this problem lends itself perfectly to being visualized with a directed graph. @dreampuf helpfully built GraphvizOnline, which I used to prototype a graph that shows the code flow that we'd want to follow.

There are definitely a lot more ways to visualize this but my goal here was to have something I can look at later and make sense of, which this is. :)

Link to an online preview of the graph I came up with, since the actual image is too big to reasonably embed here.

Notes:

  • double border => an artifact that the user can provide as input.
  • use the dot engine (it's the best one for directed graphs).
  • use svg format, to be able to view the entire image.
  • it's only visualizing the flow of data, not the creation/state of metadata or available information.
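
If you want to render the sources locally instead of via GraphvizOnline, the stock Graphviz CLI works as well -- e.g. dot -Tsvg build-flow.dot -o build-flow.svg (the file names here are just examples).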

@pradyunsg (Member Author)

@techalchemy and I had a call recently, where we discussed some items related to this and to the resolver situation. The following are (some of?) the actionable notes I have from our conversation:

  • The most complicated part of this refactor will be directory handling within InstallRequirement. (we both, just, strongly agreed)
  • Useful cleaner abstractions (see the sketch at the end of this comment):
    • ModernMetadataGenerator(), LegacyMetadataGenerator() for the logic behind "prepare metadata" operations
    • WheelInstaller(), LegacyEditableInstaller(), LegacyInstaller() for the install mechanisms
  • Should rename InstallationCandidate to something that does not conflict with the Candidate from resolvelib's model (current: IndexCandidateInfo)

Broader notes related to the resolver:

  • "candidate, as state machine that only mutates on transitions" model makes sense -- @chrahunt and I came up with in our prior discussion where we modeled the build flows.
  • Should build a better abstraction model for RequirementSet (basically, don’t manipulate in-place and create a new “CandidateSet” for tracking chosen candidates)
  • In resolvelib model, we'd let ‘Provider’ populate & prepare requirements for installation, installation order, etc.
    • The current resolver's interface is basically the same as resolvelib's, except that it modifies the RequirementSet in-place. (see point above)
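
To make the "cleaner abstractions" bullet above concrete, here's a rough Python sketch of the shape we discussed. All of the naming is hypothetical -- it matches the bullet, not actual pip code:

from abc import ABC, abstractmethod

class MetadataGenerator(ABC):
    """One "prepare metadata" strategy per build style."""

    @abstractmethod
    def generate(self, source_dir: str) -> str:
        """Generate metadata for source_dir; return the metadata directory."""

class ModernMetadataGenerator(MetadataGenerator):
    def generate(self, source_dir: str) -> str:
        # Would call the PEP 517 prepare_metadata_for_build_wheel hook.
        raise NotImplementedError

class LegacyMetadataGenerator(MetadataGenerator):
    def generate(self, source_dir: str) -> str:
        # Would shell out to `setup.py egg_info`.
        raise NotImplementedError

# WheelInstaller, LegacyEditableInstaller and LegacyInstaller would
# similarly each wrap one install mechanism behind a shared interface.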

@pradyunsg (Member Author) commented Oct 25, 2019

Writing a summary/progress report, for folks who haven't been following this super keenly, and (mostly) for self-reference later.


The "generate metadata" logic has been almost completely moved into pip._internal.operations.generate_metadata. The relevant PRs are: #7239, #7115, #7111, #7081, #7063, #7051, #7041.

With these changes, the current "generate metadata" flow is:

  • SourceDistribution.prepare_distribution_metadata()
    • setup isolation (if needed)
    • InstallRequirement.prepare_metadata()
      • operations.generate_metadata.get_metadata_generator(self)
      • self.metadata_directory = operations.generate_metadata._generate_metadata[_legacy](self)
      • sanity checks and warning printing
    • InstallRequirement.assert_source_matches_version()
      • another sanity check

There are quite a few hoops to jump through here, with back-and-forth between the generate_metadata module and InstallRequirement.

My aim for this stage is to get rid of InstallRequirement.prepare_metadata() entirely, moving the sanity check logic into SourceDistribution as appropriate. That would change the flow to something like the following (there's a code sketch after the list):

  • SourceDistribution.prepare_distribution_metadata()
    • setup isolation if needed
    • req.metadata_directory = generate_metadata(req)
    • sanity checks and warning printing.
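
A minimal code sketch of that target flow, with the helper names being hypothetical stand-ins:

def generate_metadata(req):
    # Stand-in for the single entry point in
    # pip._internal.operations.generate_metadata.
    raise NotImplementedError

class SourceDistribution:
    def prepare_distribution_metadata(self, req, finder):
        self._setup_isolation(req, finder)  # only if build isolation is enabled
        req.metadata_directory = generate_metadata(req)
        self._validate_metadata(req)  # sanity checks and warning printing

    def _setup_isolation(self, req, finder):
        ...  # build-environment setup, elided

    def _validate_metadata(self, req):
        ...  # the checks currently living on InstallRequirement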

I was going to work on RequirementPreparer next. There was some discussion about this in #7049. However, @chrahunt has flagged that he has some ideas w.r.t. this bit of the codebase, and stated that he'll be working on some cleanup in that area. \o/

To avoid duplicated effort, I'll be skipping working on this part of the codebase and progressing to the next "stage", since we can always come back to this if needed.


I've filed #7259 for discussing the overall design of what the result of the refactor should look like. That'll make it easier to figure out where stuff should go, since after that discussion, we'd have (mostly) decided what should go to which part of the codebase.


Installation code hasn't been touched yet. I have some rough plans w.r.t. how we'd want to handle this. It'll likely be the same approach as metadata generation:

  • Clean up the existing code, to make it nicer to "move out"
  • Create an overall "do the install" function that'll just call these methods (see the sketch after this list)
  • Move out the code for the installation approaches, one-by-one, into the relevant module
  • Make the overall function the only way to do this operation
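
Here's that sketch -- a guess at what the overall function could look like once everything is moved out (hypothetical names, not pip's API):

def install_wheel(req): ...            # the wheel install mechanism
def install_editable_legacy(req): ...  # wraps `setup.py develop`
def install_legacy(req): ...           # wraps `setup.py install`

def install(req):
    # The single "do the install" entry point, dispatching on the
    # requirement's kind; eventually the only way to do this operation.
    if req.is_wheel:
        return install_wheel(req)
    if req.editable:
        return install_editable_legacy(req)
    return install_legacy(req)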

Metadata directory handling also got some much-needed attention (#7100, #7087, #7086, #7064). These changes should make metadata directory handling more consistent across the PEP 517 and "legacy" build logic. Other than that, some less ambiguous names are a nice improvement and make me feel a little better about this bit of the codebase.

There's still more work to do here, but I think we can only chip away at that after the installation code is also better decoupled.


Cleaning up install options + build options is something that I want to work on eventually, but I don't think I can get around to it at this time. I imagine this would take the form of converting the CLI options + flags into 2 objects, BuildOptions and InstallOptions (or something like that), which can then return the relevant values for the corresponding requirement. This could also mix-and-match with FormatControl so that all these mechanisms have a consistent API, which may reduce the number of "concepts" in the system.
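
To sketch the BuildOptions / InstallOptions idea (the shapes here are entirely hypothetical):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BuildOptions:
    global_options: List[str] = field(default_factory=list)
    per_package: Dict[str, List[str]] = field(default_factory=dict)

    def for_requirement(self, name: str) -> List[str]:
        # Options that apply to one requirement: the global ones plus
        # any package-specific extras.
        return self.global_options + self.per_package.get(name, [])

# InstallOptions would mirror this, and FormatControl could grow the
# same for_requirement-style lookup so all three share one API.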

One thing to consider here is that how we handle this directly affects the Scientific Python stack, since they're very closely tied to our build approaches.

@chrahunt (Member)

I would also mention that some progress is being made on breaking up our existing logic for initial creation and preparation of requirements (#7245, #7180, #7176, #7175, #7068, #7066, #7046, #7025). This lays some of the groundwork we'll need for implementing an InstallRequirement replacement as discussed in #7259.

@pradyunsg (Member Author) commented Nov 12, 2019

We've hit at least one milestone with this -- #7317 means that we're no longer blocked on #988 because of the build logic's "strong" coupling with the resolution process.

@sbidoul (Member) commented Jan 17, 2020

[Arriving from #7589 (comment)]

@pradyunsg @chrahunt The diagram from #6607 (comment) is very nice and extremely useful!

It would be nice to have it in a place where we can do PRs on it, and maintain the diagram across refactorings (for instance when the unpack arrow from local wheel is removed, or if/when the copy arrow from local non-editable directory is removed).

For instance, one improvement I would suggest is to remove "Local editable named VCS" and make the download arrow go directly from "Remote editable named VCS" to "Local editable directory", because for all installation and build purposes, pip does not care whether a local editable directory came from a VCS origin.

@pradyunsg (Member Author)

Modified as suggested. :)

@pradyunsg (Member Author) commented Feb 7, 2020

#7709 -- it seems like we might've made a mistake in the "how does pip install from an sdist" logic (in the graph above), because pip seems to have variable build behavior depending on how it got the sdist.

@chrahunt (Member) commented Feb 8, 2020

For future reference, that turned out not to be true. :)

@brainwane (Contributor)

I believe a bunch of the work this issue has been tracking is now done, but I'm leaving it open and moving it to the "far future" column in the Resolver Rollout project board, to leave room for future improvements.

@pradyunsg (Member Author)

Picking this back up, there's a bunch of new stuff that we'd want to do with our build logic now:

  • Switch to using venv as the environment backend for our build logic (see the sketch after this list).
  • Explore whether we can use import hooks to provide pip inside an isolated build, now that we're Python 3.7+.
  • Move away from using pep517's hooks directly, toward using build.
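
For the first bullet, a rough sketch of what a stdlib-venv-backed build environment could look like (illustrative only, not pip's implementation):

import tempfile
import venv

def create_build_env() -> str:
    """Create a throwaway virtual environment for one isolated build."""
    env_dir = tempfile.mkdtemp(prefix="pip-build-env-")
    # with_pip=False: the host pip installs the build requirements into
    # the environment, so there's no need to bootstrap pip inside it.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    return env_dir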

This issue is about refactoring though, so I think we should track those separately. But I'm noting them here, since we'd want to be mindful of how they'd fit in with this broader refactor.


I started looking at this again yesterday and @pfmoore also expressed interest in helping out with this.

Looking at this with somewhat fresh eyes now, it seems to me that we're still untangling the build logic from the rest of the codebase. Obligatory shoutout to @sbidoul for being accommodating with my comments in the PEP 660 PRs!

The most obvious next step here seems to be #7049. I think a good follow-up to that would be to remove/clean up the "prepare" verb throughout the codebase -- we're using prepare_metadata, for example, in places where we're only doing metadata generation (IIUC, prepare = fetch + unpack + generate_metadata).

After that, if I'm understanding correctly, we'd basically be ready to start moving to the immutable-objects-that-transition model. That still needs a bunch of thought, since that's where we'd be getting rid of the mutate-me-however-you-want InstallRequirement and replacing it with immutable dataclasses that implement the state machine we want (Python 3.7+ 🎉).
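
As a rough illustration of that model (the dataclasses and the transition below are hypothetical):

from dataclasses import dataclass

@dataclass(frozen=True)
class UnpackedSdist:
    source_dir: str

@dataclass(frozen=True)
class BuiltWheel:
    wheel_path: str

def build(state: UnpackedSdist) -> BuiltWheel:
    # A transition consumes one immutable state and returns the next;
    # nothing mutates in place, unlike today's InstallRequirement.
    raise NotImplementedError  # would invoke the build backend here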

@pradyunsg (Member Author)

Oh, and here's the new-and-updated graph for how our build system currently works.

  • Added the PEP 660 code path.
  • Added a dedicated "unpacked sdist" state, which more correctly reflects the future build-via-sdist approach (in grey to denote that it does not exist today).

(Image: the "pip-future" build-flow diagram, rendered from the sources below.)

Graphviz sources
digraph G {
  // All the nodes
  subgraph user_provided {
    node [shape=box, peripheries=2];

    subgraph remote {
      remote_sdist [label="Remote sdist"];
      remote_wheel [label="Remote wheel"];
      remote_unnamed_vcs [label="Remote unnamed VCS"];
      remote_named_vcs [label="Remote named VCS"];
      remote_archive [label="Remote archive"];
    }

    subgraph local_sources {
      local_wheel [label="Local wheel"];
      local_sdist [label="Local sdist"];
      local_directory [label="Local directory"];

      subgraph downloadables {
        local_named_vcs [label="Local named VCS"];
        local_unnamed_vcs [label="Local unnamed VCS"];
        local_archive [label="Local archive"];
      }
    }
  }

  subgraph intermediate {
    unpacked_sdist [label="Unpacked sdist"];
    unpacked_sources [label="Unpacked sources", color=gray, fontcolor=gray];
    project_legacy [label="setup.py-based project", color=brown, fontcolor=brown];
    project [label="pyproject.toml-based project", color=royalblue, fontcolor=royalblue];
    local_wheel_editable[label="Local editable wheel"];
  }

  // All the clusters
  subgraph cluster_remote {
    label = "Remote";
    style = dashed;

    subgraph cluster_remote_one {
      style = invis;
      label = "";

      remote_unnamed_vcs;
      remote_named_vcs;
      remote_archive;
    }
    subgraph cluster_remote_two {
      style = invis;
      label = "";

      remote_sdist;
      remote_wheel;
    }
  }

  subgraph cluster_installed {
    node [shape=box];
    label = "Installed";
    style = dashed;

    installed [label="Installed distribution"];
    installed_editable [label="Installed editable distribution"];
  }

  // All the edges
  subgraph connection_download {
    edge [label="download", color=green, fontcolor=green];

    remote_unnamed_vcs -> local_unnamed_vcs;
    remote_named_vcs -> local_named_vcs;
    remote_sdist -> local_sdist;
    remote_wheel -> local_wheel;
    remote_archive -> local_archive;
  }

  subgraph connection_unpack {
    edge [color=violet, fontcolor=violet];

    local_sdist -> unpacked_sdist [label = "unpack"];
    local_directory -> unpacked_sdist;
    local_archive -> unpacked_sdist;
    local_unnamed_vcs -> unpacked_sdist;
    local_named_vcs -> unpacked_sdist;
  }

  subgraph connection_future {
    edge [color=gray, fontcolor=gray, style=dashed];

    local_directory -> unpacked_sources;
    local_archive -> unpacked_sources;
    local_unnamed_vcs -> unpacked_sources;
    local_named_vcs -> unpacked_sources;

    unpacked_sources -> local_sdist [label="setup.py sdist"];
    unpacked_sources -> local_sdist [label="build_sdist"];
  }

  subgraph connection_install {
    edge [label="install", color=purple, fontcolor=purple];

    local_wheel -> installed;
    local_wheel_editable -> installed_editable;
  }

  subgraph connection_modern {
    edge [color=royalblue, fontcolor=royalblue];

    unpacked_sdist -> project;
    project -> local_wheel_editable [label="build_editable"];
    project -> local_wheel [label="build_wheel"];
  }

  subgraph connection_legacy {
    edge [color=brown, fontcolor=brown];

    unpacked_sdist -> project_legacy;
    project_legacy -> local_wheel [label="setup.py bdist_wheel"];
    project_legacy -> installed [label="setup.py install"];
    project_legacy -> installed_editable [label="setup.py develop"];
  }
}

@sbidoul (Member) commented Mar 28, 2022

@pradyunsg thanks for the update, this diagram remains very useful! A couple of questions pop up in my mind:

  • is there any practical difference between Local directory, Unpacked sources and Unpacked sdist?
  • is there any difference between Local archive and Local sdist?

@uranusjr (Member)

I’ve been thinking about how to implement PEP 658 (metadata-only resolution); I think it could be useful if Remote metadata were a state in the build logic.

@pfmoore (Member) commented Mar 28, 2022

One point I'd like to try to capture in the model is where we can reasonably apply the rule that all instances of package XXX version X.Y are equivalent. I think the only such cases are "local wheel", "remote wheel", and "installed distribution".

The motivation here is to know when it's OK to not build or reinstall, because we know we already have that version. So, for example, we don't need to install a local wheel if it's the same version as the installed distribution. But are we OK to skip building an unpacked sdist if we have a local wheel with the same version? I think if we do this we might want to split the decision making into 2 parts - standards (wheels having the same project name and version MUST be equivalent) and pip policy (pip won't rebuild a sdist if we have the wheel, but will rebuild a VCS checkout). The standards part should be straightforward (packaging standards have pretty much always assumed that name/version uniquely identifies an installable object, just never said so outright) but the pip policy side may be less so (reality gets in the way of theory 😉).

This may be too big a question to cram into this one diagram, but I think we have been working on unstated assumptions for too long, and it's something we should be explicit about.
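
Purely to illustrate the two-layer split I mean (the state names and the policy below are made up, not a proposal for actual behavior):

def standards_equivalent(state: str) -> bool:
    # Standards layer: for these states, equal name+version MUST mean
    # an equivalent artifact.
    return state in {"local_wheel", "remote_wheel", "installed_distribution"}

def pip_will_reuse(state: str) -> bool:
    # Policy layer: pip might skip rebuilding an sdist when an
    # equivalent wheel exists, but always rebuild a VCS checkout.
    return standards_equivalent(state) or state == "unpacked_sdist"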

@pradyunsg (Member Author)

is there any difference between Local archive and Local sdist?

Whether they contain sdist-related metadata (e.g. a PKG-INFO file at the root of the archive).

is there any practical difference between Local directory, Unpacked sources and Unpacked sdist?

Between the first two, no (hence the lack of a label on the arrow, although I think I messed up with the colors). The unpacked sdist would contain sdist-related metadata for the package?
