Extract Performance Improvements #7686

Merged 14 commits on Apr 7, 2022

Contributor

@kcibul kcibul commented Feb 18, 2022

Three main performance optimizations:

  1. Avro Parsing: More efficient parsing and representation of primitive types in Avro-based records (ExtractCohortRecord, ReferenceRecord). We previously called toString() and then parseLong() on every field, even when the value was already the right datatype.

  2. Inferred State: We keep track of which samples have been seen so that we can later determine which samples have not been seen at each site. The previous data structures were slow with 100k samples and many variants, so this now uses a TreeSet and a BitSet.

  3. Reference Genotypes: Add reference genotypes in bulk (via ReferenceGenotypeInfo instead of a heavyweight VariantContext) rather than one at a time.
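
To illustrate the first optimization, here is a minimal sketch of the two access patterns. It assumes the record hands back values that are already boxed as `Long`, which is how Avro's generic Java API represents `long` fields. The `Map` is only a stand-in for a record, and the class, method, and field names are hypothetical, not taken from the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an Avro record: a map whose values are already
// boxed as Long. None of these names come from the PR.
final class PrimitiveFieldParsing {

    // Old pattern described above: stringify the value, then re-parse it.
    static long viaString(Map<String, Object> record, String field) {
        return Long.parseLong(record.get(field).toString());
    }

    // Direct pattern: the value is already a Long, so cast and unbox it.
    static long viaCast(Map<String, Object> record, String field) {
        return (Long) record.get(field);
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("location", 1000000012345L); // hypothetical field name and value

        // Both paths yield the same number; the cast skips the String round trip.
        System.out.println(viaString(record, "location") == viaCast(record, "location")); // true
    }
}
```

The cast avoids allocating a temporary `String` and re-parsing it for every field of every record, which is presumably where the savings come from at this scale.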

More details from profiling:

https://docs.google.com/spreadsheets/d/1aA7LKgPsaELiGurw95qVX1PwGt54I5rn1h_fAAhkPMo/edit#gid=0

@kcibul kcibul marked this pull request as ready for review April 6, 2022 19:22

@@ -15,7 +15,7 @@ workflow GvsExtractCallset {

 File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
 File interval_weights_bed = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed"
-File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/ah_var_store_20220406/gatk-package-4.2.0.0-480-gb62026a-SNAPSHOT-local.jar"
+File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"

Collaborator

Extra slashes?

Suggested change
-File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"
+File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"

Contributor Author

lol I love me some extra slashes!

throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported");
}

this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue());

Collaborator

Since you already have it in a local?

Suggested change
-this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue());
+this.sampleIdsToExtractBitSet = new BitSet(maxSampleId.intValue());

Contributor Author

👍

Comment on lines -509 to -511
case "u": // unknown GQ used for array data
unmergedCalls.add(createRefSiteVariantContext(sampleName, contig, currentPosition, refAllele));
break;

Collaborator

Are we sure we'll really never see a "u" anymore (especially given the explodey default)?

Contributor Author

Yeah, 'u' isn't a state we encode anywhere… it was for array support, which we removed ages ago.

Comment on lines 4 to 5
private String sampleName;
private int GQ;

Collaborator

my IntelliJ points out these could be final 🤷

Contributor Author

👍

throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported");
}

this.sampleIdsToExtractBitSet = new BitSet((int) maxSampleId);

Contributor Author

Maybe add +1… this is zero-based
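
For context on the `+1`: as far as the `java.util.BitSet` documentation goes, the constructor argument is an initial capacity covering indices `0` through `nbits - 1`, and the set grows on demand. The sketch below (class name, method name, and the example id are hypothetical) suggests that leaving out the `+1` costs a resize rather than causing a failure.

```java
import java.util.BitSet;

// Hypothetical helper illustrating BitSet sizing; not code from the PR.
final class BitSetSizing {

    // Allocate room for ids 0..maxSampleId inclusive, hence the +1.
    // Assumes maxSampleId < Integer.MAX_VALUE so the addition cannot overflow.
    static BitSet forMaxId(int maxSampleId) {
        return new BitSet(maxSampleId + 1);
    }

    public static void main(String[] args) {
        int maxSampleId = 64; // hypothetical largest sample id

        // Sized without +1: initially covers indices 0..63 only.
        BitSet tight = new BitSet(maxSampleId);
        tight.set(maxSampleId); // still legal: the BitSet grows to fit index 64
        System.out.println(tight.get(maxSampleId)); // true

        // Sized with +1: index 64 fits in the initial allocation.
        BitSet roomy = forMaxId(maxSampleId);
        roomy.set(maxSampleId);
        System.out.println(roomy.get(maxSampleId)); // true
    }
}
```

One caveat with the `+1`: if `maxSampleId` can equal `Integer.MAX_VALUE`, the addition overflows to a negative size, so the range check would need to reject that value as well.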

Collaborator

@gbggrant gbggrant left a comment

Looks good - thanks for the walk through

samplesNotEncountered.xor(sampleIdsToExtractBitSet);

// Iterate through the samples not encountered
for (int sampleId = samplesNotEncountered.nextSetBit(0); sampleId >= 0; sampleId = samplesNotEncountered.nextSetBit(sampleId+1)) {

Collaborator

I find this for loop kind of confusing. It might be clearer to write:
for (long sampleId : samplesNotEncountered.toLongArray()) {
(and then you wouldn't need the Long.valueOf(sampleId) on line 600).

But that might not end up scaling so well?

Contributor Author

Yeah, it is a little confusing, but I think what you're proposing would give back the array of longs that back the BitSet, and then you'd iterate through those words rather than the sample ids. I'm going to pretend a long is 8 bits for a minute: if you made a BitSet(8) and then set bits 0, 1, 2, you would get back a single long with a value of 7 (bits 11100000, written lowest bit first), not the ids 0, 1, 2.
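
A small sketch of the point above, using real 64-bit longs instead of the pretend 8-bit ones. The class and helper names are hypothetical; `BitSet.toLongArray()` is the standard library method under discussion.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical demo class; shows what toLongArray() actually returns.
final class ToLongArrayDemo {

    // Build a BitSet with the given bits set and return its backing words.
    static long[] words(int... setBits) {
        BitSet bits = new BitSet();
        for (int b : setBits) {
            bits.set(b);
        }
        return bits.toLongArray();
    }

    public static void main(String[] args) {
        // Bits 0, 1, 2 pack into one word: 1 + 2 + 4 = 7.
        System.out.println(Arrays.toString(words(0, 1, 2)));     // [7]

        // Bit 64 is the lowest bit of the second word.
        System.out.println(Arrays.toString(words(0, 1, 2, 64))); // [7, 1]
    }
}
```

Iterating over that array would therefore visit packed words such as `7`, not the sample ids `0`, `1`, `2`.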

Collaborator

Sorry, yeah, I completely misunderstood what that method does. By the name, I thought it returned [0L, 1L, 2L], which would be useful, I think.
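
For what it's worth, the standard library does offer something close to what was hoped for here: `BitSet.stream()` returns the indices of the set bits as an `IntStream`. The sketch below compares it with the `nextSetBit` loop from the diff. It assumes the "seen" set is a subset of the samples to extract, so that `xor` leaves exactly the unseen ids; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of finding samples that were not encountered at a site.
final class MissingSamples {

    // Ids in toExtract that are not in seen. Assumes seen is a subset of
    // toExtract, in which case xor behaves like a set difference.
    static BitSet notEncountered(BitSet seen, BitSet toExtract) {
        BitSet result = (BitSet) seen.clone();
        result.xor(toExtract);
        return result;
    }

    // The loop style used in the diff: walk the set bits one at a time.
    static List<Long> idsViaLoop(BitSet bits) {
        List<Long> ids = new ArrayList<>();
        for (int id = bits.nextSetBit(0); id >= 0; id = bits.nextSetBit(id + 1)) {
            ids.add((long) id);
        }
        return ids;
    }

    // Alternative: stream() yields the set-bit indices directly.
    static long[] idsViaStream(BitSet bits) {
        return bits.stream().asLongStream().toArray();
    }

    public static void main(String[] args) {
        BitSet toExtract = new BitSet();
        toExtract.set(1, 5); // hypothetical sample ids 1, 2, 3, 4
        BitSet seen = new BitSet();
        seen.set(1);
        seen.set(3);

        BitSet missing = notEncountered(seen, toExtract);
        System.out.println(idsViaLoop(missing));                            // [2, 4]
        System.out.println(java.util.Arrays.toString(idsViaStream(missing))); // [2, 4]
    }
}
```

Whether the stream version is fast enough for 100k samples per site would need profiling; the explicit loop avoids stream overhead, which may be why it was kept.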


int length = Math.toIntExact((Long) genericRecord.get("length"));
this.end = this.start + length - 1;
this.endLocation = this.location + + length - 1;

Collaborator

What is "+ +"??
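
To answer the question in general terms: Java parses `a + + b` as `a + (+b)`, where the second `+` is the unary plus operator, so the expression compiles and evaluates the same as `a + b`. A minimal sketch follows, assuming `location` is a `long` (the field's actual type is not shown in the snippet) and using hypothetical class and method names.

```java
// Hypothetical demo; field types are assumed, not taken from the PR.
final class UnaryPlusDemo {

    // Mirrors the line under review: the second '+' is a unary plus on length.
    static long endLocationAsWritten(long location, int length) {
        return location + + length - 1;
    }

    // The same computation with the stray operator removed.
    static long endLocationCleaned(long location, int length) {
        return location + length - 1;
    }

    public static void main(String[] args) {
        System.out.println(endLocationAsWritten(100L, 5)); // 104
        System.out.println(endLocationCleaned(100L, 5));   // 104
    }
}
```

So the doubled operator looks like a harmless typo rather than a behavior change, though removing it would make the line easier to read.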

@kcibul kcibul merged commit 1f490f0 into ah_var_store Apr 7, 2022
@kcibul kcibul deleted the kc_extract_perf branch April 7, 2022 20:37
This was referenced Mar 17, 2023