VariantEvalPort #5043
Conversation
This has all the commits from #4495. Could you rebase onto the latest master to keep the PR clean? I would like to review it, because I found some interesting changes here...
Yes, I will, but I'm hoping the GATK folks could at least talk through a few big-picture points first. As I say above, I don't expect this is code-review ready.
@bbimber Good questions all. Whenever we do these ports we have to strike a balance between minimizing the changes and updating to GATK4 standards. Generally, though, we want the code to conform to GATK4. This PR is large and likely to require some iteration, so I'd be OK with starting with just the minimal "porting" changes to keep things simple, and then doing a code-hygiene pass at the end. The "porting" changes should include things like updated javadoc, GATK4-style command line arguments, updating of outdated GATK3 terminology such as "ROD", Utils.nonNull assertions, etc. The finals and curly braces can wait for a separate pass (we will want to do those in this PR, though). If you're not sure what to include, just ask.

I like the idea of keeping the GATK3 tests working as we go along. We should make a clear distinction between the old and new tests, though. Ideally the GATK3 tests would be in a separate commit that we can just delete at the end, but that can get unwieldy if the files in the commit need to change as we go along. Alternatively, you could isolate them into a separate directory. They should either be disabled or made dependent on a test method (see the

In general, it's really helpful to have the first commit in the PR contain the completely unmodified GATK3 source files. It makes it much easier for the reviewer to see what changed for the port.

I noticed that you have 2 new plugins included in this. I'm not sure if that was suggested by someone on the GATK team (I'm wondering if we want to go down that path...), but I can tell you that the existing plugins required an enormous amount of test development and review iteration. If we do decide to make them plugins, I think it would be a good idea to do so in a separate PR. Also, if we choose to make an AbstractPlugin base class, we may want that to live in the Barclay repo.

As @magicDGS points out, master already has your previous commits, so you should start by rebasing on that.
Ideally, the branch would have the following commits before we start the first review cycle:
After that, it's really helpful to have only a single new commit for each review iteration (you can create as many commits as you want as you work, but squash the new commits down before submitting). Just don't squash or rebase anything that's already been pushed up to the repo. Also, note that most of the GATK engine team is out for a few weeks, so progress may be slow in the short term.
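The workflow described above (accumulate work-in-progress commits locally, then squash them into a single commit per review round before pushing) can be sketched with plain git. This is a throwaway sandbox with invented file and message names, not the actual branch:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "dev"

# Commit 1: the pristine sources, then two local work-in-progress commits.
echo v0 > Tool.java; git add Tool.java; git commit -qm "unmodified GATK3 sources"
echo v1 > Tool.java; git commit -aqm "wip: port walker"
echo v2 > Tool.java; git commit -aqm "wip: fix tests"

# Squash the two WIP commits into one review commit before pushing.
git reset --soft HEAD~2
git commit -qm "review round 1: port changes"

git rev-list --count HEAD   # two commits remain: baseline + review round
```

`git reset --soft` keeps the working tree and index intact, so the follow-up commit contains exactly the combined WIP changes.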
@cmnbroad, that's not wholly unreasonable, but I'd like to push back on a number of these points.
On the first question, we definitely appreciate how much work this will take. Often, porting the code is the easy part; developing new tests and test data can be a huge effort. I can try to find out if it would be possible for you to take the tool over - I know this kind of thing has come up before for other tools, but I'd have to ask around to find out. @vdauwera do you have input on this?

As for the plugins, currently in your branch

And yes, definitely don't check any of the large GATK3 test files into the repo, even temporarily. Take a look at General guidelines for GATK4 developers if you haven't already. As you pointed out, new GATK4 tests that use smaller files would have to be developed. We'd want those to be included, and passing on the CI server, before we started reviewing the branch, so we know we're reviewing code that works and is covered by tests as much as possible.

The second commit in my list above would have only your GATK3 java test files, etc. (but not the big files, which you appear to have locally). The third commit would have your ported tool code, as well as the new test code with the new tests enabled, plus the smaller input files and expected-results files. At the end we'd remove commit #2.
Let's hear what others say, but I think I would strongly prefer to simply take over VariantEval in another repo, if that's something you'd consider. I'd likely do much of what you propose anyway (certainly WRT testing); however, perhaps not with the microscope we went through with the core GATK changes earlier.

On plugins: I like what seems to be shaping up w/ Barclay. I carried over the Stratifier and Evaluator as plugins because it seems like it would make sense to allow tools to provide extensions (VariantEval, our tool, does). If I took this PR a step further, I would have migrated many arguments currently top-level on VariantEval into the plugins themselves (a good feature in Barclay). As an aside: I don't think VariantAnnotator is migrated yet, but we have many GATK3 plugins related to annotation, and hope that tool retains Annotator plugins when it gets migrated.

My impressions of Barclay are probably a little out of date. I agree the main argument parsing framework is pretty robust. Specifically on plugins, it seems a little less so, or at least there are not many tools I can see visibly exercising that part of the code. For example, there really should be a default implementation or base class between Barclay's plugins and ReadFilter plugins. I'm guessing if more tools in GATK4 were using plugins this would have happened. I created something like this for VariantEval, and without a ton of work it could probably get generalized; however, doing so would set a much higher bar for me, and as noted above I'm trying to take on less, not more, at the moment. If we do take over VariantEval, I'm certainly happy to try to contribute code and experiences to improve the core through more targeted PRs.
I was interested in this PR due to the abstract plugin implementation, because it will make my life easier when implementing plugins.
There are some parts that I don't agree with, and this is also in line with your discussion about the time it takes to develop a plugin - it requires lots of time and review rounds, because the interface is a bit difficult to understand.
If you are interested, I can draft an abstract plugin (either here or in Barclay).
protected List<String> pluginPackageNames;
protected Class<T> pluginBaseClass;
protected Class<T> defaultPluginBaseClass;
This is not really generic for an abstract plugin descriptor: it assumes that the defaults always extend a concrete class - this does not happen with, for example, ReadFilter (where defaults are tool-specific).
Yes, I actually made this change locally as well. Again, this class as-is isn't really ready for prime time as a product-wide abstract class. I think we'd want to start by pulling the relevant parts out of ReadFilterPluginDescriptor. My implementation also doesn't deal with the bits related to plugins supplying command-line arguments.
List<T> ret = new ArrayList<>();
if (defaultPluginBaseClass != null){
    for (T c : allDiscoveredPlugins.values()){
        if (defaultPluginBaseClass.isAssignableFrom(c.getClass()))
This is only true for the case of the plugins implemented in this PR, but not for a generic abstract plugin.
Well, that's why this class lives in the VariantEval package, and wasn't intended for use elsewhere in its current form.
Nonetheless, the pattern of having some primary class of plugins (say, VariantAnnotation) and then marker classes to denote things like "Required" or "Default" exists in other places. I agree w/ your comment above about not requiring Class (since the marker class doesn't need to inherit from the primary plugin class); however, in what situation does this not fit?
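The marker-class pattern described here can be sketched as follows. All names (Stratifier, DefaultStratifier, etc.) are hypothetical stand-ins, not GATK or Barclay classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A primary plugin type plus a marker interface that tags which
// implementations are enabled by default.
interface Stratifier {}              // the primary plugin class
interface DefaultStratifier {}       // marker: on by default; need not extend Stratifier

class BySample implements Stratifier, DefaultStratifier {}
class ByContig implements Stratifier {}

public class MarkerDemo {
    static List<Stratifier> defaultInstances(Map<String, Stratifier> discovered) {
        final List<Stratifier> defaults = new ArrayList<>();
        for (Stratifier s : discovered.values()) {
            // No Class<T> field is needed on the descriptor; a marker-interface
            // check is enough to pick out the defaults.
            if (s instanceof DefaultStratifier) {
                defaults.add(s);
            }
        }
        return defaults;
    }

    public static void main(String[] args) {
        final Map<String, Stratifier> discovered = new LinkedHashMap<>();
        discovered.put("BySample", new BySample());
        discovered.put("ByContig", new ByContig());
        System.out.println(defaultInstances(discovered).size());   // prints 1
    }
}
```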
@Override
public List<T> getResolvedInstances() {
    List<T> ret = new ArrayList<>();
    allDiscoveredPlugins.forEach((x,y) -> ret.add(y));
This is not true - getResolvedInstances() should return only the plugins that are included by the user, plus the defaults. With this abstract plugin, all the instances of a concrete class will be applied.
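A minimal sketch of the fix magicDGS is describing: getResolvedInstances() should filter the discovered plugins down to the user-enabled ones plus the defaults, instead of returning everything. All names and the string-based plumbing here are hypothetical simplifications, not the descriptor's real types:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ResolveDemo {
    // discovered: everything found on the classpath, keyed by plugin name.
    // Only plugins the user asked for, plus the defaults, should be resolved.
    static List<String> resolve(Map<String, String> discovered,
                                Set<String> userEnabled,
                                Set<String> defaults) {
        final List<String> resolved = new ArrayList<>();
        for (Map.Entry<String, String> e : discovered.entrySet()) {
            if (userEnabled.contains(e.getKey()) || defaults.contains(e.getKey())) {
                resolved.add(e.getValue());
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        final Map<String, String> discovered = new LinkedHashMap<>();
        discovered.put("a", "PluginA");
        discovered.put("b", "PluginB");
        discovered.put("c", "PluginC");
        // "a" is a default, the user enabled "b", and "c" is left out:
        System.out.println(resolve(discovered, Set.of("b"), Set.of("a")));
        // prints [PluginA, PluginB]
    }
}
```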
@Override
public Class<?> getClassForPluginHelp(String pluginName) {
    return null;
As an abstract class, this could be implemented using allDiscoveredPlugins.
yes
@Override
public Set<String> getAllowedValuesForDescriptorHelp(String longArgName) {
    return null;
This should be implemented by sub-classes; otherwise the automatic doc generation won't work. It shouldn't be the default behavior for the abstract plugin.
also yes
}

@Override
public void validateAndResolvePlugins() throws CommandLineException {
This shouldn't be empty by default; otherwise plugin resolution might not be performed.
yep
@magicDGS I agree that a fully implemented AbstractPluginDescriptor class is a good idea. Why not do this in a separate PR? And would this be most appropriate in Barclay? I'd be happy to take a stab at it, or let you. We should look at the usage patterns of ReadFilters, VariantEval's two plugins, VariantAnnotation, and perhaps others I'm not thinking of.
@magicDGS If you think an AbstractPluginDescriptor would be useful, I'd suggest initially creating a separate PR in GATK, since that would make it easy to see how it simplifies the existing descriptors. Then, once that converges, we can move the abstract part to Barclay.
@vdauwera ping - can you help us figure out how to proceed with making a determination on this, or who else we need to consult? Thx.
@bbimber I haven't heard anything from @vdauwera, but based on discussions with @droazen now that he is back, I think we're inclined to keep this as part of GATK. We can certainly work with you on making this less onerous than the last PR, which involved engine-level changes. We will require tests that use data contained in the repo, though.
@cmnbroad: thanks for the reply. Yes, obviously tests would need to be added/updated; there's no question it needs robust testing. My main concern is that VariantEval is a fairly sprawling tool with all sorts of add-ons. The majority of the untouched code taken verbatim from GATK3 isn't going to pass muster against the bar of our last PR without a lot of petty revision (and maybe some useful updates). There are certainly some code improvements one could make across VariantEval, but I'm just not that keen on combing through the whole thing if it can be avoided. How about this: while tests need to be updated (as discussed above, they work on the GATK3 data, which isn't checked in), the code in this PR is functional. Would you be willing to review a couple of classes, maybe VariantEval itself and a few ancillary classes, to see what scope of work we're talking about?
@bbimber I did start to take a look at this today. We'd probably take many of the peripheral classes from GATK3 as they are (though there will be some exceptions, i.e.,

The easiest way to do this would be to add a commit with the raw GATK3 code, as I mentioned above. We would then focus on reviewing the diffs, instead of line-by-line for all of the code. I can add that commit if you like - we'd just need to coordinate, since I would have to push to your branch. Also, FYI, I'll be away all of next week, so I won't be able to do any more on this until I return.
@cmnbroad OK. I have two grants in progress, so time is a little tight. I can probably carve out some time, though. If you don't mind doing the raw GATK3 commit, I'll work off that. If we can take this plus my baseline here, then I will go through everything to clean up what I can pick out as obviously needing GATK4 updates. Thanks for working with us on this.
@bbimber I have an updated version of this branch that contains the raw GATK3 sources as the first commit, but I don't want to clobber anything you may be in the process of pushing up, so I'll wait for you to tell me when to push it. One thing to note is that currently this branch doesn't compile (even without my commit). I just want to make sure github has all of your changes, since I know you had the old tests working with it. With my commit, the history looks like this:
Thank you. I have no changes beyond what's there now. Go for it.
Done. I wasn't sure if I'd be able to push to your fork, but it seems to have worked. Also, I added one new unit test file that wasn't previously in your branch (SnpEffUtilsUnitTest.java), since it already exists. It should be a quick port.
Thanks. Is there a good IntelliJ trick to inspect the diff between commits on a branch? I assume that's why you want the base GATK3 version in there?
Yes, if you have Version Control integration enabled, you can select the branch, and then any commit, or group of commits, in order to see the diffs. So once it's ready, I'll be able to look at just the diffs between the initial GATK3 commit and your commit(s) right from within IntelliJ.
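For reference, the same baseline-to-branch diff can be produced on the command line as well. A throwaway sandbox sketch with invented file and message names:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "dev"

printf 'old\n' > Walker.java
git add Walker.java
git commit -qm "raw GATK3 sources"
base=$(git rev-parse HEAD)   # remember the baseline GATK3 commit

printf 'new\n' > Walker.java
git commit -aqm "ported to GATK4"

# Show only what changed since the baseline commit:
git diff "$base"..HEAD -- Walker.java
```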
@bbimber We just added full versions of the B37 and HG38 references to the repo a couple of days ago. You'll have to rebase this branch on current master to access them, but it might make it easier to port some of the GATK3 tests that use b37.
Good timing. As you might have seen, I found some time to work on this. I now have everything 100% backwards compatible w/ GATK3's behavior. I started to migrate tests. In many cases, it was actually using 20/21 anyway, so it's not a big deal. For any GATK3 test files that were tiny, I just checked them in, and the tests should work as-is. However, there are ~5 "large" VCFs (meaning more than ~250kb), and a couple into the MBs. I could either:

Do you have other opinions on how to approach that? You'll see I checked in a handful of small files into the resources dir. Is that going to be OK for merging?
The 250k-ish ones are probably OK by and large. For any of them, especially the MB+ ones, anything you can reasonably subset (or replace with a different VCF) without sacrificing the utility of the tests would be good, though.
Are there substantially stricter compile requirements on this project now? I may have missed it earlier when I first merged this code, but there are a lot of raw-type/generic-type errors I'm needing to deal with, coming straight from GATK3 code.
GATK4 treats warnings as errors, if that's what you mean. I don't think GATK3 did. They're usually pretty easy to remedy, though.
Yes, that's it. Somehow I missed that until now. Most are easy to address, but there are a couple of trickier GATK3 patterns I'll need to figure out. At least the tests might be close to working.
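For anyone hitting the same thing: with warnings promoted to errors, GATK3-era raw types fail the build until parameterized. A toy illustration (not actual GATK code):

```java
import java.util.ArrayList;
import java.util.List;

public class RawTypeFix {
    // GATK3-era code often used raw types:
    //   List names = new ArrayList();   // "unchecked" warning; a hard error when warnings are errors
    // Parameterizing the type makes the same code compile cleanly:
    static List<String> names() {
        final List<String> names = new ArrayList<>();
        names.add("eval");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(names());   // prints [eval]
    }
}
```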
One more question: I have everything passing locally. When the tests run on travis, they fail with odd errors suggesting the VCF idx files are wrong. These were all created by IndexFactory.createDynamicIndex(), and work fine for me. Have you ever seen a situation where indexes created on one machine don't transport? My machine is a Windows laptop, if that's relevant.
It looks like all the changes in my original commit with the raw GATK3 code (except for one file) got squashed out somehow, so I can't see just the changes from GATK3 anymore. I'll probably have to go back and re-commit those when you're ready, to make this tractable to review. I'll wait to comment on #1 until that happens.

As for a default plugin descriptor, I'd prefer not to take one unless it's fully implemented, with tests. Plus, although we could develop it here, it should really live in the Barclay repo if it's truly generic. More importantly, I'm not sure the plugins in this PR should be plugins at all. Historically, plugins have required a lot of test development and iteration because they have command line arguments (the plugin system is for extending the command line parser with discoverable, re-usable components that are shared across multiple tools and need shared command line arguments). I haven't looked at the new ones closely, but I'm not sure they're a good fit.

As for the files, it looks like about 400MB(?). That's pretty big - you should try to squeeze them down or target some existing files if you can.
@cmnbroad: first - would it be possible to kick off the travis tests? I refactored this and don't seem to be able to do that. Second, yes, I was trying to reorder and condense the commits, but that clearly didn't work. I think the problem was trying to put your GATK3 commit first (which would seem to make sense). In any case, I just recreated this, putting a pristine GATK3 commit first, followed by a consolidated set of my commits with 1) the limited core changes, 2) the meat of the VariantEval port, and 3) a separate commit with a port of GATK3's VariantEvalIntegrationTest, which is useful for validation but should not be merged. To your points:
@bbimber There's some github service outage right now and it's preventing travis from running. I'm not sure how to work around it. Hopefully it will be resolved soon.
OK, thanks. I wonder if that explains some of the oddness I saw last night too.
@lbergelson - github is reporting everything OK, but the last commit didn't appear to kick off a build. Would one of you guys be able to do that? Assuming this passes (and it seems to locally), then I hope this is in a good state to review.
It looks like the tests ran on Travis, but there is a legitimate failure (testMultipleCompTracks), which fails for me locally as well.
@cmnbroad - yes, I missed one expected file after subsetting the data. Tests pass now. Is seeing the status something I could have checked myself? In travis I only see the projects I own, with no clear way to look this up unless I'm linked from the PR.
I'm not sure if there is another way to see the reports on travis without going through the PR link. Looks like the tests are passing now, though. There is quite a bit of code here - I'll start going through it in the next day or so and see where we stand. Also, thx for reducing those files down...
@bbimber I'm making my way through this, and will focus on the diffs from GATK3, especially the tool itself. There are a few things that I mentioned above that will absolutely need to be done, though; the VariantEval tool javadoc needs to be updated to reflect GATK4 syntax (i.e. references to GenomeAnalysisTK.jar need to be updated, etc. - see other tools for examples), and the tool command line arguments need to be changed to reflect GATK4 conventions (i.e.
Point taken on the squash, though I deliberately separated the copy of the GATK3 integration tests using GATK3 data, since this would not get merged. It seems to make sense to keep that separate still? I'll update the examples, etc. I am honestly not that familiar with the updates to argument conventions for GATK4 - I'll give this a look, but if you have more specifics, that would help.
Just curious: what's the reasoning for changing all the arguments from camel case to snake case (which seems to be the standard now)? That seems like a big swap across tools, without a lot of benefit, that's going to cause a lot of people to update existing scripts if they try to migrate from GATK3 to 4.
Yeah, good idea on keeping the GATK3 tests separate as well. The javadoc changes should be pretty easy, mostly removing the
We wanted to standardize on something, since before there was a mix of various styles. See #2596. We did this change for all of the GATK4 tools before we released 4.0, so almost a year ago now.
@cmnbroad I think that does it. Let me know if there are rebase issues or argument conventions that should be updated.
@bbimber Just an update that I'm making my way through this and expect to have a review back to you some time next week.
Great, thanks for the update.
@bbimber Thanks for doing all the work to get this done - I didn't realize how complicated it was until I started reviewing it. I've focused on the diffs from GATK3 for this review, per our earlier agreement, and included comments inline.
There is one pretty significant issue, which, in the interest of getting this merged into master, we will address separately. It looks like the tool needs to process variants grouped by start position that originate from several different types of input sources (eval, comps, etc.), but maintain knowledge of the origin of each variant (let me know if I'm misreading this). Since the pattern isn't well supported by the engine, the tool uses the engine traversal to find the group boundaries, but then discards the variants and re-queries all the sources to get them grouped by source. This results in the engine reading and maintaining duplicate in-memory cache copies for each of the inputs! The code is also more complicated than it would be if the engine more fully embraced the pattern. Rather than burden this already large and long-lived PR with that, I'll create a separate ticket. In the meantime, the tool should be marked with the @Beta annotation.
Otherwise, given the constraint that we weren't going to require GATK4 style changes, it looks pretty good. Hopefully the changes I requested will be pretty straight-forward.
@@ -68,6 +68,15 @@ public FeatureContext(final FeatureManager featureManager, final SimpleInterval
    this.interval = interval;
}

/**
 * Creates a new FeatureContext given a FeatureContext and a query interval. This will copy the FeatureManager
Probably best to change the word "copy" to "reference" for clarity.
ok
    }
}

PositionAggregator aggr = new PositionAggregator();
This should definitely be final.
ok
private SampleDB sampleDB = null;

// maintain the mapping of FeatureInput to name used in output file
Map<FeatureInput<?>, String> inputToNameMap = new HashMap<>();
final. Also, I don't think there is a case where the type instantiation for this map's entries is something other than VariantContext, so the wildcard ? should be replaced with VariantContext.
yes, probably right
evals.forEach(x -> inputToNameMap.put(x, x.hasUserSuppliedName() ? x.getName() : "eval"));

//TODO: need to account below...
If this code isn't required it should be removed.
done
// Set up set of additional knowns
for ( FeatureInput<VariantContext> compRod : comps ) {
    //TODO: consider this
What does this TODO refer to?
That's a relic from trying to reconcile all the quirks of GATK3 behavior with GATK4-style iteration. Removed.
oneLineSummary = "General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)",
programGroup = VariantEvaluationProgramGroup.class
)
@DocumentedFeature
Until we resolve the engine integration issue, we should add an @Beta annotation to this tool.
I assume you mean @BetaFeature?
Yes, @BetaFeature.
private class PositionAggregator {
    private String contig = null;
    private int start;
    private int end;
These 3 variables should just be combined and stored as a SimpleInterval.
Sure, but is there any concern about unnecessarily creating more SimpleInterval objects than needed if we keep expanding the interval? That's going to be comparatively rare, but could happen with indels or when running w/ multiple VCF inputs.
No - SimpleInterval should be fine.
As you might have seen, I already added that anyway. However, for my own edification, what exactly does using SimpleInterval gain here?
Clarity, simplicity, correctness, economy of expression.... The class is caching an interval, which is what SimpleInterval is for, and it was already using SimpleInterval in one place to update the read and feature contexts, but storing the individual components. Now, instead of two representations across four variables, the class has one representation and one variable. Even though the code is fairly simple to start with, using SimpleInterval is way, way better all around. In my opinion.
Also, I didn't notice this until I saw the diff for this change, but because the code was able to update contig, start, and end independently, they were getting out of sync before - end wasn't updated when the new interval exceeded the end of the old one.
Alright, good points.
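For illustration, a toy version of the single-interval PositionAggregator discussed in this thread. Interval is a stand-in for GATK's SimpleInterval, and the expansion logic is a simplified sketch rather than the actual tool code:

```java
// Stand-in for GATK's SimpleInterval: one immutable representation of the
// cached position, so contig/start/end can never drift out of sync.
final class Interval {
    final String contig;
    final int start;
    final int end;

    Interval(String contig, int start, int end) {
        this.contig = contig;
        this.start = start;
        this.end = end;
    }
}

public class AggregatorDemo {
    private Interval current;   // one variable instead of three separate fields

    Interval expandTo(Interval next) {
        if (current == null || !current.contig.equals(next.contig)) {
            current = next;   // a new group starts
        } else if (next.end > current.end) {
            // Grow the cached interval; the end is always updated together
            // with the rest of the interval, so it cannot fall out of sync.
            current = new Interval(current.contig, current.start, next.end);
        }
        return current;
    }

    public static void main(String[] args) {
        AggregatorDemo aggr = new AggregatorDemo();
        aggr.expandTo(new Interval("chr1", 100, 110));
        Interval merged = aggr.expandTo(new Interval("chr1", 100, 150));
        System.out.println(merged.start + "-" + merged.end);   // prints 100-150
    }
}
```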
@@ -68,6 +68,15 @@ public FeatureContext(final FeatureManager featureManager, final SimpleInterval
    this.interval = interval;
}

/**
 * Creates a new FeatureContext given a FeatureContext and a query interval. This will copy the FeatureManager
 * from the original and user the supplied interval.
user->use
ok
 *
 */
public ReadsContext (ReadsContext readsContext, SimpleInterval interval){
    this(readsContext.dataSource, interval, null);
This should propagate the source ReadsContext's read filter.
ok
/**
 * List of rod tracks to be used for specifying "known" variants other than dbSNP.
 */
@Argument(shortName="known-name", doc="Name of ROD bindings containing variant sites that should be treated as known when splitting eval rods into known and novel subsets", optional=true)
GATK4 doesn't use the term "ROD" anymore, so we should eliminate it anywhere it surfaces to the user (argument doc, javadoc, and class names such as CompROD and EvalRod). Since this will change the outputs, it might be a good idea to do this change as a separate commit.
Is there a preferred term? CompVcf isn't always going to be accurate. CompFeatureInput?
CompFeatureInput sounds good to me...
ok, done
@cmnbroad moving the iteration/aggregation into the core would be a useful thing, and thank you very much for recognizing it would be nice to close out this particular PR.
@cmnbroad one of the tests failed; however, it seems to just be a timeout: https://api.travis-ci.org/v3/job/457478297/log.txt Are you able to restart them? Again, I believe this addresses all the concerns listed above except for the iteration (which will be addressed in the engine in a new PR) and the naming of the CompRod and EvalRod classes. I'm fine with changing these and the associated outputs; however, I would appreciate suggestions on the best new names: CompInput, CompFeatureInput, CompSource, CompTrack, or something like that?
I restarted the failed job.
There is still one comment that is incorrect, but otherwise I think this can go in now as @BetaFeature. Before we merge, we need to drop the commit with the GATK3 tests, which I can't do with the "squash & merge" UI. Can you drop that commit, and while you're at it, squash the rest of the commits down to one? Ping me if you need help with the drop part.
@cmnbroad I just dropped the GATK3 tests and squashed. Let's hope this is it!
@bbimber Thanks so much for all your work and perseverance on this!
@cmnbroad Great - thanks for working with us on this. One related question: our VariantQC tool is a separate project (gradle) that depends on GATK4. Is there a way to depend on some kind of dev version of GATK4 that will include the VariantEval code?
The snapshot builds get published to an artifact repository, but I don't think those are accessible from outside of the Broad. The build from this morning with your branch is here, if you can access it. Otherwise, for local development, you can do the following:
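The concrete steps are elided above, so the following is only a guess at one common Gradle workflow, not the steps cmnbroad listed: publish the locally built GATK to the local Maven repository (assuming the GATK build applies a publishing plugin, e.g. via `./gradlew publishToMavenLocal`), then point the downstream project at it. The coordinates and version below are hypothetical placeholders:

```groovy
// build.gradle of the downstream project (VariantQC), after publishing
// the GATK branch locally
repositories {
    mavenLocal()   // resolve the locally published GATK snapshot first
}

dependencies {
    // hypothetical coordinates/version; check the GATK build for the real ones
    implementation "org.broadinstitute:gatk:4.0.12.0-SNAPSHOT"
}
```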
Having said all that, what code are you dependent on? I expect the command line interface to VariantEval, and the VariantUtils, StratificationManager, and friends classes all to undergo some refactoring and evolve a bit before the tool has the beta tag removed and the interfaces are stabilized. See #5439 and #5440.
VariantQC is a tool we made that is somewhat analogous to FastQC. Given an input VCF, it runs VariantEval to generate various summary tables of data, and then makes an HTML report (borrowing a lot from the tool MultiQC) summarizing that VCF. I wrote this originally by forking GATK3 and writing a new walker that internally called and ran VariantEval. That was never the final plan. I don't know what this will need to look like in GATK4 yet. I'm fine with the expectation that GATK4 VariantEval will evolve and we'd need to update our code wrapping it.
This is not ready for review, but because this is almost certainly going to take some iteration, I wanted to make sure we're on the same page with what needs to occur. I tried to touch the GATK3 code as little as possible. I changed what was needed, and ported the GATK3 tests. I have these passing locally using the GATK3 test data. I'd like to preserve that as we work through this review (even though it will not get checked in), as a way to measure GATK3/4 parity. There are probably going to be some places where GATK3 behavior could get questioned too.
Because I tried to leave old code alone, there are plenty of existing issues, ranging from style/code-convention (i.e. arguments not 'final') to pieces of code that existed in GATK3 but might get questioned if put under the microscope. With that in mind, @droazen, @cmnbroad: do you have any general thoughts on the bar for porting this to GATK4? Can I leave the GATK3 code alone, or is every piece of code getting the level of review we just did for the core changes?