Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilterIntervals now filters out any singleton intervals #6559

Merged
merged 2 commits into from
Jun 12, 2020

Conversation

asmirnov239
Copy link
Collaborator

@asmirnov239 asmirnov239 commented Apr 21, 2020

This implements suggestion #2 proposed by @samuelklee from issue #5852

@asmirnov239
Copy link
Collaborator Author

@samuelklee @mwalker174 Could you please review? Also does either of you know if ScatterIntervals task in CNVTasks WDL has a potential of producing singleton intervals?

@gatk-bot
Copy link

Travis reported job failures from build 30090
Failures in the following jobs:

Test Type JDK Job ID Logs
python openjdk8 30090.5 logs

@samuelklee
Copy link
Contributor

samuelklee commented Apr 21, 2020

@asmirnov239 looks like you are getting an NPE---remember that intervals are resolved after argument validation, so you need to do the check later.

Also, good point, I think you can get a singleton after scattering if you get unlucky with your shards.
Perhaps change the check to a filtering step in GermlineCNVCaller?

@asmirnov239
Copy link
Collaborator Author

asmirnov239 commented Apr 21, 2020

@samuelklee Thank you, that's fixed now.

Regarding ScatterIntervals issue - what do you think about changing IntervalListTools logic to not produce singletons instead of filtering them out in GermlineCNVCaller?

@samuelklee
Copy link
Contributor

samuelklee commented Apr 21, 2020

@asmirnov239 I think that we probably want all shard sizes to be as close to equal as possible, so that we avoid any possible issues arising from different model capacities across shards.

Actually, is this even an issue at the GermlineCNVCaller step? Maybe we just need to worry about it during ploidy/postprocessing?

@asmirnov239 asmirnov239 force-pushed the as_filter_single_interval_contigs branch from 151fed2 to 8ac73b5 Compare April 22, 2020 21:45
@asmirnov239
Copy link
Collaborator Author

@samuelklee You're right - I tested it and it's only an issue in postprocessing , so I'm only making a check there. Neither GermlineCNVCaller or DetermineContigPloidy code fails due to this issue.

It should be all set to be reviewed now.

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nitpicky stuff. Otherwise, thanks, looks good to me!

@@ -431,6 +432,17 @@ private SimpleIntervalCollection filterIntervals() {
countNumberPassing(mask), numIntersectedIntervals));
}

//finally, filter intervals that are solitary in their corresponding contigs
final Map<String, Long> contigToCountMap = IntStream.range(0, numIntersectedIntervals)
.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put mapToObj on its own line for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

final Map<String, Long> contigToCountMap = IntStream.range(0, numIntersectedIntervals)
.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i))
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()));
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break up filter and forEach onto their own lines.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i))
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()));
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> {
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just name this variable count. Perhaps even intervalCount, and rename the map to contigToIntervalCountMap.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and done

.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()));
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> {
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig());
if (value <= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit weird to have <= since it should never happen (and I can't imagine future code changes that would make it possible), but no biggie.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to ==

IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> {
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig());
if (value <= 1) {
mask[i] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a log message here indicating when a contig is dropped?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

concatenatedIntervalList.addAll(intervalCollections.get(i).getIntervals());
});

final Map<String, Long> contigToCountMap = concatenatedIntervalList.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make the variable names and code here consistent with the code in FilterIntervals.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I did that as much as possible

@@ -654,4 +653,25 @@ private static File getCopyNumberSegmentsFile(final File pythonSegmenterOutputPa
}
return unsortedIntervalCollectionsFromModels;
}

/**
* Validate that the union of shard's interval lists does not have singleton intervals, i.e. intervals that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate that the concatenation of the sharded interval lists does not have singleton intervals..., perhaps?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@@ -27,6 +27,7 @@
import org.testng.annotations.Test;

import java.io.File;
import java.lang.reflect.AnnotatedElement;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray import?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure how that got in there. Thanks!

@@ -71,6 +72,16 @@
new AnnotationMap(Arrays.asList(
Pair.of(CopyNumberAnnotations.GC_CONTENT, 0.95),
Pair.of(CopyNumberAnnotations.MAPPABILITY, 0.95),
Pair.of(CopyNumberAnnotations.SEGMENTAL_DUPLICATION_CONTENT, 0.5)))),
new AnnotatedInterval(new SimpleInterval("20", 51, 60),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

White space. Thanks for adding these tests!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the other ones had tabs as well.. fixed!

* other tools (e.g., {@link DetermineGermlineContigPloidy} and {@link GermlineCNVCaller}) to mask intervals from
* analysis.
* Annotation-based filters will be applied first, followed by count-based filters. In the end, any singleton intervals
* (i.e. those being by themselves on their corresponding contigs) found after applying other filters will be filtered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comma after i.e..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @asmirnov239, I have a few questions/comments.

IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> {
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig());
if (value <= 1) {
mask[i] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log a warning here, something like:

Filtered singleton interval on contig: "chrX"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -246,7 +243,7 @@
public void onStartup() {
super.onStartup();
/* check for successful import of gcnvkernel */
PythonScriptExecutor.checkPythonEnvironmentForPackage("gcnvkernel");
PythonScriptExecutor.checkPythonEnvironmentForPackage("numpy");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this was to get it to run in IDE. Thanks for catching this!

* Validate that the union of shard's interval lists does not have singleton intervals, i.e. intervals that
* are the only ones on their corresponding contigs.
*/
private void checkForSingletonIntervalAbsence(final List<SimpleIntervalCollection> intervalCollections){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue you can just call this checkForSingletonInterval

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

Comment on lines 662 to 675
final List<SimpleInterval> concatenatedIntervalList = new ArrayList<>();
IntStream.range(0, numShards).forEach(i -> {
concatenatedIntervalList.addAll(intervalCollections.get(i).getIntervals());
});

final Map<String, Long> contigToCountMap = concatenatedIntervalList.stream()
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()));
contigToCountMap.keySet().forEach(c -> {
if (contigToCountMap.get(c) == 1) {
throw new IllegalArgumentException(
String.format("Records contain a singleton interval on contig (%s)." +
" Please run FilterIntervals tool first.", c));
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified a bit:

        intervalCollections.stream()
                .flatMap(list -> list.getIntervals().stream())
                .collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()))
                .entrySet().stream()
                .forEach(entry -> { 
                    if (entry.getValue() == 1) { 
                        throw new IllegalArgumentException(
                                String.format("Records contain a singleton interval on contig (%s)." + 
                                        " Please run FilterIntervals tool first.", entry.getKey()));
            }
        });

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I used this

{intervalsWithExtraIntervalFile, Collections.emptyList(), annotatedIntervalsFile, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, Arrays.asList(2, 5)},
{intervalsWithExtraIntervalFile, Collections.singletonList("20:1-10"), annotatedIntervalsFile, 0., 1., 0., 1., 0., 1., Arrays.asList(1, 2, 3, 4, 5)},
{intervalsWithExtraIntervalFile, Arrays.asList("20:1-15", "20:35-45"), annotatedIntervalsFile, 0., 1., 0., 1., 0., 1., Arrays.asList(2, 5)},
{intervalsWithExtraIntervalFile, Collections.singletonList("20:25-50"), annotatedIntervalsFile, 0.1, 0.9, 0., 1., 0., 1., Arrays.asList(0, 1, 5)}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one of these test for the singleton case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of them actually do because I added an interval from contig "21" which should be filtered out in all of these.

@gatk-bot
Copy link

gatk-bot commented Jun 2, 2020

Travis reported job failures from build 30489
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30489.11 logs
integration openjdk8 30489.2 logs

@asmirnov239 asmirnov239 merged commit 189ea45 into master Jun 12, 2020
@asmirnov239 asmirnov239 deleted the as_filter_single_interval_contigs branch June 12, 2020 19:42
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
* FilterIntervals now filters out any singleton intervals, that have no other intervals on their contigs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants