Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantEvalPort #5043

Merged
merged 1 commit into from
Nov 21, 2018
Merged

VariantEvalPort #5043

merged 1 commit into from
Nov 21, 2018

Conversation

bbimber
Copy link
Contributor

@bbimber bbimber commented Jul 20, 2018

This is not ready for review, but because this is almost certainly going to take some iteration, I wanted to make sure we're on the same page with what needs to occur. I tried to touch the GATK3 code as little as possible. I changed what was needed, and ported the GATK3 tests. I have these passing locally using the GATK3 test data. I'd like to preserve that as we work through this review (even though it will not get checked in), as a way to measure GATK3/4 parity. There are probably going to be some places where GATK3 behavior could get questioned too.

Because I tried to leave old code alone, that means there are plenty of existing issues, ranging from style/code-convention (i.e. arguments not 'final') to pieces of code that existed in GATK3, but might get questioned if put under the microscope. with that in mind, @droazen , @cmnbroad : do you have any general thoughts on the bar for porting this to GATK4? Can I leave the GATK3 code alone, or is every piece of code getting the level of review we just did for the core changes for this?

@magicDGS
Copy link
Contributor

This have all the commits from #4495 - could you rebase into the last master to keep the PR clean? I would like to review because I found some interesting changes here...

@bbimber
Copy link
Contributor Author

bbimber commented Jul 22, 2018

Yes, i will, but I'm hoping the GATK folks could at least talk through a few big picture points first. As I say above, I dont expect this is code-review ready.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 23, 2018

@bbimber Good questions all. Whenever we do these ports we have to strike a balance between minimizing the changes and updating to GATK4 standards. Generally, though, we want the code to conform to GATK4. This PR is large and likely to require some iteration so I'd be ok with starting with just the minimal "porting" changes to keep things simple, and then doing a code hygiene pass at the end. The "porting" changes should include things like updated javadoc, GATK4-style command line arguments, updating of outdated GATK3 terminology such as "ROD", Utils.nonNull assertions, etc. The finals and curly braces can wait for a separate pass (we will want to do those in this PR though). If you're not sure what to include or not just ask.

I like the idea of keeping the GATK3 tests working as we go along. We should make a clear distinction between the old and new tests though. Ideally the GATK3 tests would be in a separate commit that we can just delete at the end, but that can get unwieldy if the files in the commit need to change as we go along. Alternatively you could isolate them into a separate directory. They should either be disabled or made dependent on a test method (see the @Test annotation properties enabled and dependsOn) that is easily toggled so they can be run locally, but don't run on the CI server. Otherwise the CI server build will always fail.

In general, its really helpful to have the first commit in the PR contain the completely unmodified GATK3 source files. It makes it much easier for the reviewer to see what changed for the port.

I noticed that you have 2 new plugins included in this. I'm not sure if that was suggested by someone on the GATK team (I'm wondering if we want to go down that path...) but I can tell you that the existing plugins required an enormous amount of test development and review iteration. If we do decide to make them plugins, I think it would be a good idea to do so in a separate PR. Also, if we choose to make an AbstractPlugin base class, we may want that to live in the Barclay repo.

As @magicDGS points out, master already has your previous commits, so you should start by rebasing on that.

Ideally, the branch would have the following commits before we start the first review cycle:

  1. A single commit containing the unmodified GATK3 source (unmodified with the exception that if a file is renamed for GATK4, its helpful to rename the GATK3 version in this commit so it's easy to compare in the next commit). This commit doesn't have to compile or run - its just to make the review process easier for us, and will be deleted at some point. I can help with how to get this into your branch if you like.
  2. Your modified GATK3 tests in a single commit. This will also be removed before merge.
  3. A single commit with all of your "minimal" changes for the port, including the real, new tests. This should compile, and tests should pass on CI with pretty high code coverage. This is what we'll iterate on.

After that, its really helpful to have only a single new commit for each review iteration (you can create as many commits as you want as you work, but squash the new commits down before submitting). Just don't squash or rebase anything thats already been pushed up to the repo.

Also, note that most of the GATK engine team is out for a few weeks, so progress may be slow in the short term.

@bbimber
Copy link
Contributor Author

bbimber commented Jul 23, 2018

@cmnbroad, that's not wholly unreasonable, but i'd like to push back on a number of these points.

  1. First - would GATK consider simply letting us take over VariantEval and maintain as a GATK4-based tool in another repo? My understanding from GATK4 issues is that plan was to never migrate VariantEval (i think in favor of other picard/gatk QC tools). There is a bit of a conflict between keeping a lean core engine and having all these tools built off it. I would think there's an argument for keeping your core engine and the many tools built off it separated (GATK3 seemed to include some dead tools, for example). I appreciate we're the ones pushing this migration, but I hope on the other side you can appreciate the bar is pretty significant on our time.

  2. What new plugins are you talking about? VariantStratification and VariantEvaluator are part of GATK3's VariantEval? Yeah, I wrote a base PluginDescriptor class patterned on how ReadFilters work. It probably should exist in a more core position in code. While there's some good ideas in the argument-parsing/plugin code of GATK/Barlcay, frankly seems like much of it isnt fully developed yet, which is why I kept this separated at the moment.

  3. Be aware, the GATK3 tests depend on ~30GB of files. I dont know the limits of git lfs, but I did not currently have plans to check those in. I assumed I would convert these to use GATK4 chr20/21 data for a final commit, but felt there was a lot of value in using unaltered GATK3 data to confirm parity (and it was during the migration).

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 23, 2018

On the first question, we definitely appreciate how much work this will take. Often, porting the code is the easy part; developing new tests and test data can be a huge effort. I can try to find out if it would be possible for you to take the tool over - I know this kind of thing has come up before for other tools, but I'd have to ask around to find that out. @vdauwera do you have input on this ?

As for the plugins, currently in your branch VariantStratification and VariantEvaluator are modeled as Barclay command line plugin descriptors, and I was questioning whether thats necessary. Being a plugin is not necessarily required - ReadFilter and Annotation are both plugins, but they didn't have to be, and it takes quite a bit of work (again, mostly test development) to get a plugin right. Also, I'd consider the Barclay plugin framework to be pretty developed at this point, so I'd be curious to learn more about what issues you see.

And yes, definitely don't check any of the large GATK3 test files into the repo, even temporarily. Take a look at General guidelines for GATK4 developers if you haven't already. As you pointed out, new GATK4 tests that use smaller files would have to be developed. We'd want those to be included, and passing tests on the CI server, before we started reviewing the branch, so we know we're reviewing code that works and is covered by tests as much as possible. The second commit in my list above would have only your GATK3 java test files, etc (but not the big files, which you appear to have locally). The third commit would have your ported tool code, as well as the new test code, with the new tests enabled, as well as the smaller input files and expected results files. At the end we'd remove commit #2.

@bbimber
Copy link
Contributor Author

bbimber commented Jul 23, 2018

Let's hear what others say, but I think I would strongly prefer to simply take over VariantEval in another repo if this was something you'd consider. I'd likely do much of what you propose anyway (certainly WRT testing); however, perhaps not the microscope we went through with the core GATK changes earlier.

On plugins: I like what seems to be shaping up w/ Barclay. I carried over the Stratifier and Evaluator as plugins because it seems like it would make sense to allow tools to provide extensions (VariantEval, our tool, does). If I took this PR a step further, I would have migrated many arguments currently top-level on VariantEval into the plugins themselves (a good feature in Barclay). As an aside: I dont think VariantAnnotator is migrated yet, but we have many GATK3 plugins related to annotation, and hope that tool retains Annotator plugins when it get migrated.

My impressions of barclay are probably a little out of date. I agree the main argument parsing framework is pretty robust. Specifically on plugins, it seems a little less so, or at least there are not many tools I visibly see exercising that part of the code. For example, there really should be a default implmentation or base class between Barclay's plugins and ReadFilter plugins. I'm guessing if more tools in GATK4 were using plugins this would have happened. I created something like this for VariantEval, and without a ton of work that could probably get generalized; however, doing so would throw a lot higher bar on me and as noted above I'm trying to take on less, not more at the moment. If we do take over VariantEval, I'm certainly happy to try to contribute code and experiences to improve the core, through more targeted PRs.

Copy link
Contributor

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was interested in this PR due to the abstract plugin implementation, because it will make easier my life implementing plugins.

There are some parts that I don't agree with, and this is also in line with your discussion about the time which takes to develop a plugin - it requires lots of time and review rounds, because the interface is a bit difficult to understand.

If you are interested, I can draft an abstract plugin (either here or baclay).


protected List<String> pluginPackageNames;
protected Class<T> pluginBaseClass;
protected Class<T> defaultPluginBaseClass;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really generic for an abstract plugin descriptor: it assumes that the defaults are always extending a concrete classs - this does not happen with, for example, ReadFilter (where defaults are tool-specific)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I actually made this change locally as well. Again, this class as-is isnt really for prime-time as a product-wide Abstact class. I think we'd want to start by pulling the relevant parts out of ReadFilterPluginDescriptor. My implementation also doesnt deal with the bits related to plugins supplying command-line arguments.

List<T> ret = new ArrayList<>();
if (defaultPluginBaseClass != null){
for (T c : allDiscoveredPlugins.values()){
if (defaultPluginBaseClass.isAssignableFrom(c.getClass()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only true for the case of the plugins implemented in this PR, but not as a generic abstract plugin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's why this class lives in the VariantEval package, and wasnt intended elsewhere in the current form.

Nonetheless, the pattern of having some primary class of Plugins (say VariantAnnotation), and then marker classes to denote things like "Required" or "Default" exists in other places. I agree w/ your comment above about not requiring Class (since the marker class doesnt need to inherit from the primary plugin class); however, in what situation does this not fit?

@Override
public List<T> getResolvedInstances() {
List<T> ret = new ArrayList<>();
allDiscoveredPlugins.forEach((x,y) -> ret.add(y));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true - the getResolvedInstances() should return only the ones that are included by the user and the defaults. With this abstract plugin all the instances of a concrete class will be applied.


@Override
public Class<?> getClassForPluginHelp(String pluginName) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an abstract class, this could be implemented using allDiscoveredPlugins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes


@Override
public Set<String> getAllowedValuesForDescriptorHelp(String longArgName) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be implemented by sub-classes, otherwise the automatic doc generation won't work; shouldn't be the default behavior for the abstract plugin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also yes

}

@Override
public void validateAndResolvePlugins() throws CommandLineException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be empty by default, otherwise no plugin-resolution might be performed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@bbimber
Copy link
Contributor Author

bbimber commented Jul 24, 2018

@magicDGS I agree that a fully implemented AbstractPluginDescritor class is a good idea. Why not do this in a separate PR, and would this be most appropriate in Barclay? I would be happy to take a stab or let you. We should look at the usage pattern of ReadFilters, VariantEval's two plugins, VariantAnnotation and perhaps others I'm not thinking about.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 26, 2018

@magicDGS If you think an AbstractPluginDescriptor would be useful, I'd suggest initially creating a separate PR in GATK, since that would make it easy to see to how it simplifies the existing descriptors. Then once that converges, we can move the abstract part to Barclay.

@cmnbroad
Copy link
Collaborator

ping @vdauwera. Do you know what we need to do to respond to @bbimber's request to take over VariantEval in a separate repo ?

@bbimber
Copy link
Contributor Author

bbimber commented Aug 2, 2018

@cmnbroad @vdauwera: have you had a chance to look into whether you're willing to let us take over VariantEval?

@cmnbroad
Copy link
Collaborator

cmnbroad commented Aug 9, 2018

@vdauwera ping - Can you help us figure out how to proceed with making a determination on this, or who else we need to consult ? Thx.

@cmnbroad
Copy link
Collaborator

@bbimber I haven't heard anything from @vdauwera, but based on discussions with @droazen now that he is back, I think we're inclined to prefer to keep this as part of GATK. We can certainly work with you on making this less onerous than the last PR, which involved engine level changes. We will require tests that use data contained in the repo, though.

@bbimber
Copy link
Contributor Author

bbimber commented Aug 16, 2018

@cmnbroad : thanks for the reply. yes, obviously tests would need to be added/updated. there's no question it needs robust testing.

my main concern is that VariantEval is a fairly sprawling tool with all sorts of add-ons. The majority of the untouched code taken verbatim from GATK3 isnt going to pass muster based on the bar of our last PR without a lot of petty revision (and maybe some useful updates). There are certainly some code improvements one could make across VariantEval, but I'm just not that keen on combing through the whole thing if it can be avoided.

how about this: while tests need to be updated (as discussed above, they work on the GATK3 data, which isnt checked in), the code in this PR is functional. Would you be willing to review a couple classes, maybe VariantEval itself and a few ancillary classes to see what scope of work we're talking about?

@cmnbroad
Copy link
Collaborator

@bbimber I did start to take a look at this today. We'd probably take many of the peripheral classes from GATK3 as they are (though there will be some exceptions, i.e., VariantEvalUtils calls System.exit, which we wouldn't want to do). And in some cases we may want to make tickets for places where we want to do refactoring, like we did for MendelianViolations. The code will need to conform to GATK4 in some areas (use hyphen-separated argument names, updated javadoc where it's GATK3-specific, etc).

The easiest way to do this would be to add a commit with the raw GATK3 code as I mentioned above. We would then focus on reviewing the diffs, instead of line-by-line for all of the code. I can add that commit if you like - we'd just need to coordinate since I would have to push to your branch. Also, FYI, I'll be away all of next week so I won't be able to do any more on this until I return.

@bbimber
Copy link
Contributor Author

bbimber commented Aug 17, 2018

@cmnbroad OK. I have two grants in progress so time is a little tight. I can probably carve out some time though. If you dont mind doing the raw GATK3 commit I'll work off that. If we can take this plus my baseline here, then I will go through everything to clean up what I can pick out as obviously needing GATK4 updates. Thanks for working with us on this.

@cmnbroad
Copy link
Collaborator

@bbimber I have an updated version of this branch that contains the raw GATK3 sources as the first commit, but I don't want to clobber anything you may be in the process of pushing up, so I'll wait for you to tell me when to push it. One thing to note is that currently this branch doesn't compile (even without my commit). I just want to make sure github has all of your changes, since I know you had the old tests working with it. With my commit, the history looks like this:

image

@bbimber
Copy link
Contributor Author

bbimber commented Aug 27, 2018

thank you. i have no changes beyond what's there now. go for it.

@cmnbroad
Copy link
Collaborator

Done. I wasn't sure if I'd be able to push to your fork, but it seems to have worked. Also, I added one new unit test file that wasn't previously in your branch (SnpEffUtilsUnitTest.java), since it already exists. It should be a quick port.

@bbimber
Copy link
Contributor Author

bbimber commented Aug 27, 2018

thanks. if there a good intellij trick to inspect the diff between commits on a branch? I assume that's why you want the base GATK3 version in there?

@cmnbroad
Copy link
Collaborator

Yes, if you have Version Control integration enabled, you can select the branch, and then any commit, or group of commits, in order to see the diffs. So once its read, I'll be able to look at just the diffs between the initial GATK3 commit and your commit(s) right from within IntelliJ.

@cmnbroad
Copy link
Collaborator

@bbimber We just added full versions of the B37 and HG38 references to the repo a couple of days ago. You'll have to rebase this branch on current master to access them, but it might make it easier to port some of the GATK3 tests that use b37.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 17, 2018

good timing. as you might have seen, i found some time to work on this. I now have everything 100% backwards compatible w/ GATK3's behavior. I started to migrate tests. In many cases, it was actually using 20/21 anyway, so it's not a big deal. For any GATK3 tests files that were tiny, I just checked them in, and the tests should work as-is. However, there's 5ish "large" VCFs (meaning more than ~250kb), and a couple into the MBs. I could either:

  1. Rework the tests to use another larger VCF already checked into GATK4 (this is what I assumed I'd do)
  2. Check in the GATK3 VCFs to the large file repo
  3. Perhaps subset the GATK3 VCFs to chr20/21 if not already, and check in?

Do you have other opinions on how to approach that? You'll see I checked in a handful of small files into the resources dir. Is that going to be OK for merging?

@cmnbroad
Copy link
Collaborator

The 250k-ish ones are probably ok in large. For any of them, especially the MB+ ones, anything you can reasonably subset (or use a different vcf) without sacrificing the utility of the tests would be good though.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 18, 2018

Are there substantially stricter new compile requirements on this project? I may have missed it earlier when I first merged this code, but there's a lot of raw type / generic type errors I'm needing to deal with, taking straight from GATK3 code.

@cmnbroad
Copy link
Collaborator

GATK4 treats warnings as errors, if thats what you mean. I don't think GATK3 did. They're usually pretty easy to remedy though.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 18, 2018

Yes, that's it. Somehow I missed that until now. Most are easy to address, but there's a couple trickier GATK3 patterns I'll need to figure out. At least the tests might be close to working.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 19, 2018

One more question: I have everything passing locally. When tests run on travis, they fail with odd errors suggesting the VCF idx files are wrong. These were all created by IndexFactory.createDynamicIndex(), and work file for me. Have you ever seen a situation like that where indexes created on one machine doesnt transport? My machine is a windows laptop, if that's relevant.

@cmnbroad
Copy link
Collaborator

It looks like all the changes in my original commit with the raw GATK3 code (except for one file) got squashed out somehow, so I can't see just the changes from GATK3 anymore. I'll probably have to go back and re-commit those when you're ready to make this tractable to review. I'll wait to comment on #1 until that happens.

As for a default plugin descriptor, I'd prefer not to take one unless its fully implemented, with tests. Plus, although we could develop it here, it should really live in the Barclay repo if its truly generic.

More importantly, I'm not sure the plugins in this PR should be plugins at all. Historically, plugins have required a lot of test development and iteration because they have command line arguments (the plugin system is for extending the command line parser with discoverable, re-useable components that are shared across multiple tools, and need shared command line arguments). I haven't looked at the new ones closely, but I'm not sure they're a good fit.

As for the files, it look like about 400MB (?) Thats pretty big - you should try to squeeze them down or target some existing files if you can.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 22, 2018

@cmnbroad : first - would it be possible to kick off travis tests? i refactored this and dont seem to be able to do that.

Second, yes, I was trying to reorder and condense the commits but clearly didnt work. I think the problem was trying to put your GATK3 commit first (which would seem to make sense). in any case, I just recreated this, putting a pristine GATK3 first, following a consolidated set of my commits with 1) the limited core changes, 2) the meat of the VariantEval port, and 3) A separate commit with a port of GATK3 VariantEvalIntegrationTest which is useful for validation but should not be merged.

To your points:

  1. I substantially cut down the incoming large files, mostly by limiting the intervals of new large VCFs.

  2. On the plugin: this was discussed above, and I initially also pointed out this should ultimately go into Barclay. You are actually the one who proposed staging it in GATK. I am not entirely sure I understand the reticence on plugins; however, my goal is to get VariantEval ported by touching as little of it as possible. This is already sucking up a ton of time. I flipped VariantEvalUtils to gather a list of classes from the appropriate package instead of a full-on plugin. That should satisfy that concern?

@lbergelson
Copy link
Member

@bbimber There's some github service outage right now and it's preventing travis from running. I'm not sure how to work around it. Hopefully it will be resolved soon.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 22, 2018

OK, thanks. I wonder if that explains some of the oddness i saw last night too.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 23, 2018

@lbergelson - github is reporting everything OK, but the last commit didnt appear to kick of a build. would one of you guys be able to do that?

Assuming this passes (and it seems to locally), then I hope this is in a good state to review.

@cmnbroad
Copy link
Collaborator

It looks like the tests ran on Travis but there is a legitimate failure (testMultipleCompTracks), which fails for me locally as well.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 23, 2018

@cmnbroad - yes, missed one expected file after subsetting the data. tests passed.

is seeing the status something i could have seen by myself? in travis i only see the projects I own, with no clear way to look up this unless i'm linked from the PR.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Oct 23, 2018

I'm not sure if there is another way to see the reports on travis without going through the PR link. Looks like tests are passing now, though. There is quite a bot of code here - I'll start going through it in the next day or so and see where we stand.

Also, thx for reducing those files down...

@cmnbroad
Copy link
Collaborator

@bbimber I'm making way through this, and will focus on the diffs from GATK3, especially the tool itself. There are a few things that I mentioned above that will absolutely need to be done though; the VariantEval tool javadoc needs to be updated to reflect GATK4 syntax (i.e. references to GenomeAnalysisTK.jar need to be updated, etc. - see other tools for examples), and the tool command line arguments need to be changed to reflect GATK4 conventions (i.e. selectName -> select-name). Ideally those would be done before I go any further. Also, it would be much easier if we could start this review process/comments with just two commits - the initial GATK3 source code commit, and one commit with all the changes. Can you squash down all of your commits (except the initial one) into a single commit ?

@bbimber
Copy link
Contributor Author

bbimber commented Oct 26, 2018

Point taken on the squash, though I deliberately separated the copy of GATK3 integration tests using GATK3 data, since this would not get merged. It seems to make sense to keep that separate still?

i'll update examples, etc. i am honestly not that familiar updates to argument conventions for GATK4 - i'll give this a look but if you have more specifics that would help.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 26, 2018

just curious, what's the reasoning for changing all the arguments from camel case to snake case (which seems to be the standard now)? that seems like a big swap across tools, without a lot of benefit, that's going to cause a lot of people to need to update existing scripts if they try to migrate from GATK3 to 4?

@cmnbroad
Copy link
Collaborator

Yeah, good idea on keeping the GATK3 tests separate as well. The javadoc changes should be pretty easy, mostly removing the -T, GenomeAnalysisTK.jar -> `./gatk'. Command line args are all lower case, hyphen-separated, and for most args except the commone ones (-V, -I, -R) we don'e use a separate short name. Just take a stab at it and I'll review them as part of the rest. Take a look at SelectVariants (though it seems to have a lot of neccessary short names...) or ask questions.

@cmnbroad
Copy link
Collaborator

We wanted to standardize on something since before there was a mix of various styles. See #2596. We did this change for all of the GTAK4 tools before we released 4.0, so almost a year ago now.

@bbimber
Copy link
Contributor Author

bbimber commented Oct 26, 2018

@cmnbroad I think that does it. let me know if there are rebase issues or argument conventions that should be updated

@cmnbroad
Copy link
Collaborator

cmnbroad commented Nov 8, 2018

@bbimber Just an update that I'm making my way through this and expect to have a review back to you some time next week.

@bbimber
Copy link
Contributor Author

bbimber commented Nov 8, 2018

great, thanks for the update

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbimber Thanks for doing all the work to get this done - I didn't realize how complicated it was until I started reviewing it. I've focused on the diffs from GATK3 for this review, per our earlier agreement, and included comments inline.

There is one pretty significant issue, which in the interest of getting this merged into master, we will address separately. It looks like the tool needs to process variants grouped by start position that originate from several different types of input sources (eval, comps, etc.), but maintain knowledge of the origin of each variant (let me know if I'm misreading this). Since the pattern isn't well supported by the engine, the tool uses the engine traversal to find the group boundaries, but then discards the variants and re-queries all the sources to get them grouped by source. This results in the engine reading and maintaining duplicate in-memory cache copies for each of the inputs! The code is also more complicated than it would be if the engine more fully embraced the pattern. Rather than burden this already large and long-lived PR with that, I'll create a separate ticket. In the meantime, the tool should be marked with the @Beta annotation.

Otherwise, given the constraint that we weren't going to require GATK4 style changes, it looks pretty good. Hopefully the changes I requested will be pretty straight-forward.

@@ -68,6 +68,15 @@ public FeatureContext(final FeatureManager featureManager, final SimpleInterval
this.interval = interval;
}

/**
* Creates a new FeatureContext given a FeatureContext and a query interval. This will copy the FeatureManager
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to change the word "copy" to "reference" for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

}
}

PositionAggregator aggr = new PositionAggregator();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely be final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

private SampleDB sampleDB = null;

// maintain the mapping of FeatureInput to name used in output file
Map<FeatureInput<?>, String> inputToNameMap = new HashMap<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final. Also, I don't think there is a case where the type instantiation for this map's entries is something other than VariantContext, so the wildcard ? should be replaced with VariantContext.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, probably right


evals.forEach(x -> inputToNameMap.put(x, x.hasUserSuppliedName() ? x.getName() : "eval"));

//TODO: need to account below...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this code isn't required it should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// Set up set of additional knowns
for ( FeatureInput<VariantContext> compRod : comps ) {
//TODO: consider this
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this TODO refer to ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a relic from trying to reconcile all the quirks of GATK3 behavior with GATK4-style iteration. removed.

oneLineSummary = "General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)",
programGroup = VariantEvaluationProgramGroup.class
)
@DocumentedFeature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until we resolve the engine integration issue, we should add an @Beta annotation to this tool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume you mean @BetaFeature?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @BetaFeature.

private class PositionAggregator {
private String contig = null;
private int start;
private int end;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 variables should just be combined and stored as a SimpleInterval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but any concern about unnecessarily creating more SimpleInterval objects than needed if we keep expanding the interval? That's going to be comparatively rare, but could happen with indels or when running w/ multiple VCF inputs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - SimpleInterval should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you might have seen I already added that anyway. however, for my own edification what exactly does using SimpleInterval gain here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarity, simplicity, correctness, economy of expression.... The class is caching an interval, which is what SimpleInterval is for, and it was already using SimpleInterval in one place to update the read and feature contexts, but storing the individual components. Now, instead of two representations across four variables, the class has one representation and one variable. Even though the code is fairly simple to start with, using SimpleInterval is way, way better all around. In my opinion.

Also, I didn't notice this until I saw the diff for this change, but because the code was able to update contig, start, and end independently, they were getting out of sync before - end wasn't updated when the new interval exceeded the end of the old one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, good points.

@@ -68,6 +68,15 @@ public FeatureContext(final FeatureManager featureManager, final SimpleInterval
this.interval = interval;
}

/**
* Creates a new FeatureContext given a FeatureContext and a query interval. This will copy the FeatureManager
* from the original and user the supplied interval.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user->use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

*
*/
public ReadsContext (ReadsContext readsContext, SimpleInterval interval){
this(readsContext.dataSource, interval, null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should propagate the source ReadContext's read filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

/**
* List of rod tracks to be used for specifying "known" variants other than dbSNP.
*/
@Argument(shortName="known-name", doc="Name of ROD bindings containing variant sites that should be treated as known when splitting eval rods into known and novel subsets", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GATK4 doesn't use the term "ROD" anymore, so we should eliminate it anywhere it surfaces to the user (argument doc, javadoc, and the classnames such as CompROD and EvalRod). Since this will change the outputs, it might be a good idea to do this change as a separate commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a preferred term? CompVcf isnt always going to be accurate. CompFeatureInput?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompFeatureInput sounds good to me...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

@bbimber
Copy link
Contributor Author

bbimber commented Nov 20, 2018

@cmnbroad moving the iteration/aggregation into the core would be a useful thing, and thank you very much for recognizing it would be nice to close our this particular PR.

@bbimber
Copy link
Contributor Author

bbimber commented Nov 20, 2018

@cmnbroad one of the tests failed; however, it seems to just be a timeout:

https://api.travis-ci.org/v3/job/457478297/log.txt

are you able to restart them? again, I believe this addresses all concerns listed above except for the iteration (which will be addressed in the engine in a new PR), and the naming of CompRod and EvalRod classes. I'm fine changing these and associated outputs; however, I would appreciate suggestions on the best new names. CompInput, CompFeatureInput, CompSource, CompTrack, or something like that?

@cmnbroad
Copy link
Collaborator

I restarted the failed job.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still one comment that is incorrect, but otherwise I think this can go in now as @BetaFeature. Before we merge, we need to drop the commit with the GATK3 tests, which I can't do with the "squash&merge" UI. Can you drop that commit, and while you're at it squash the rest of the commits down to one ? Ping me if you need help with the drop part.

@bbimber
Copy link
Contributor Author

bbimber commented Nov 21, 2018

@cmnbroad just dropped the GATK3 tests and squashed. let's hope this is it!

@cmnbroad cmnbroad merged commit 9c4a27b into broadinstitute:master Nov 21, 2018
@cmnbroad
Copy link
Collaborator

@bbimber Thanks so much for all your work and perseverance on this!

@cmnbroad cmnbroad mentioned this pull request Nov 21, 2018
@bbimber
Copy link
Contributor Author

bbimber commented Nov 21, 2018

@cmnbroad Great - thanks for working with us on this.

One related question: our VariantQC tool is a separate project (gradle) that depends on GATK4. Is there a way to depend on some kind of dev version of GATK4 that will include the VariantEval code?

@cmnbroad
Copy link
Collaborator

cmnbroad commented Nov 21, 2018

The snapshot builds get published to an artifact repository, but I don't think those are accessible from outside of Broad. The build from this morning with your branch is here if you can access it.

Otherwise, for local development, you can do the following:

  • pull gatk master from today so it includes your commit
  • run git fetch --tags (this is optional but it will give your local build a more reasonable version tag)
  • run ./gradlew install printVersion to install the locally built gatk into your local machine's maven repository
  • change your VariantQC gradle project to include the maven gradle plugin if its not already there
  • add mavenLocal() to your projects' repositories closure
  • change your gatk dependency to the version number printed out by 'printVersion'
  • rebuild VariantQC

Having said all that, what code are you dependent on ? I expect the command line interface to VariantEval, and the VariantUtils and StratificationManager and friends classes all to undergo some refactoring and evolve a bit before the tool has the beta tag removed and the interfaces are stabilized. See #5439 and #5440.

@bbimber
Copy link
Contributor Author

bbimber commented Nov 21, 2018

VariantQC is a tool we made that is somewhat analogous to FastQC. Given an input VCF, it runs VariantEval to generate various summary tables of data, and then makes an HTML report (borrowing a lot from the tool MultiQC) summarizing that VCF.

I wrote this originally by forking GATK3 and wrote a new walker that internally called and run VariantEval. That was never the final plan. I dont know what this will need to look like in GATK4 yet. I'm fine with the expectation that GATK4 VariantEval will evolve and we'd need to update our code wrapping it.

@bbimber bbimber deleted the VariantEval branch November 21, 2018 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants