Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First Version of a weight-based splitter #7643

Merged
merged 10 commits into from
Feb 11, 2022
Merged

First Version of a weight-based splitter #7643

merged 10 commits into from
Feb 11, 2022

Conversation

kcibul
Copy link
Contributor

@kcibul kcibul commented Jan 20, 2022

See Issue #7622 for more details

In addition to unit tests, here is how I validated:

##
# Split the WGS list
##
rm -rf test_split

./gatk --java-options "-Xmx4g $DEBUG" \
 WeightedSplitIntervals \
    --scatter-count 100 \
    --weight-bed-file gvs_vet_weights_1kb.bed \
    -R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
    --dont-mix-contigs true \
    -L wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list \
    --output test_split

##
# merge all the intervals lists back into one
##
IL=""
for f in test_split/*-scattered.interval_list; do
    IL="${IL} -I $f "
done
./gatk IntervalListTools --ACTION UNION $IL -O test_split/merged.interval_list

#
# compare it to the original
##
./gatk CompareIntervalLists \
-R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
-L wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list \
-L2 test_split/merged.interval_list

##
# A visual check to see that the ordering is the same, and that the only splits
# are across file boundaries
##
cat  test_split/*-scattered.interval_list | grep -v "@" | cut -f1-3 > test_split/combined.txt
cat  wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list | grep -v "@" | cut -f1-3 > test_split/orig.txt
diff -y test_split/orig.txt test_split/combined.txt

@kcibul
Copy link
Contributor Author

kcibul commented Jan 20, 2022

If we like this -- we also need to

  • build a jar
  • update the WDL to use this tool (and the Jar)
  • Put the BED files someplace public/widely accessible (likely just the 1kb version)
  • Run an E2E on QuickStart, merge the VCFs and compare (and see no differences)
  • If we want to validate evenness we need to run with a lot of shards and enough data that they are interesting. Maybe Stroke 10k
  • In parallel if we could turn some of the above script into integration tests that would be awesome

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kcibul Sorry for the extreme lateness, and the extreme number of comments. I ended up having a lot more to say than I expected. It really needs at least 1 integration test to show that it runs without exploding. If you're marking it as either a beta too or experimental we can get away with somewhat less responding to nitpicks and lower depth of testing, although if we're delivering things based on it we might at least want a test that asserts that a complex input interval list and weights file doesn't result in any LOST intervals anywhere.

Feel free to disregard what you think is over the top.

System.out.println("Total Weight:"+totalWeight + " scatterCount: " + scatterCount + " target:" + targetWeightPerScatter);

final int maxNumberOfPlaces = Math.max((int)Math.floor(Math.log10(scatterCount-1))+1, numDigits);
final String formatString = "%0" + maxNumberOfPlaces + "d";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, recursive format strings.

}
}
// write the final list
currentList.uniqued().sorted().write(new File(outputDir, prefix + String.format(formatString, scatterPiece++) + extension));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make this file name construction into a function since it's repeated. We could move it into the newly created OutputScatterArgumentCollection if we went that route.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, but this doesn't feel a whole lot cleaner 🤷

@kcibul kcibul merged commit b2b2a83 into ah_var_store Feb 11, 2022
@kcibul kcibul deleted the kc_splitter branch February 11, 2022 19:17
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants