WIP extract for ranges #7640

Merged
merged 8 commits into from
Jan 19, 2022

Contributor

@kcibul kcibul commented Jan 18, 2022

Some notes from the 10k tieout:

Prepare Step

  • ~20 min per full ref_ranges table to insert
  • ~7 min per full vet table to insert
  • "bytes scanned" are the same as the data table size

Extract

Original Run - 293 min

  • 103 minutes pulling down data, scanning 237 GB

    • 43 min on 20m vet records (20:26 -> 21:09)
    • 60 min on 291m vet records (21:09 -> 22:10)
  • 190 minutes writing the VCF

Prepare Extract with minor tuning of sorting - 134 min

  • 25 minutes pulling down data (faster), scanning 10 GB (50x reduction)

    • 4 min on 20m vet records (02:43 -> 02:47) - NOTE 103s of that was sorting (44s) and spilling to disk (59s)
    • 21 min on 291m vet records (02:47 -> 03:08) - NOTE 9 min of that was sorting (6 min) and spilling to disk (3 min)
  • 109 minutes writing the VCF (this is the change to pre-sort the sample set merged to ah_var_store on 1/12/22)
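
As a quick sanity check on the figures above, the per-phase minutes can be summed and compared. This is only a sketch that takes the numbers quoted in these notes at face value; the class and method names (`TieoutTimings`, `total`, `ratio`) are made up for illustration and are not part of the PR.

```java
public class TieoutTimings {
    // Sum of the two phases (data pull + VCF write), in minutes.
    static int total(int pullMinutes, int writeMinutes) {
        return pullMinutes + writeMinutes;
    }

    // Ratio of a "before" figure to an "after" figure.
    static double ratio(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        int original = total(103, 190); // original run, from the notes
        int tuned = total(25, 109);     // prepare + sort tuning, from the notes

        System.out.printf("original=%d min, tuned=%d min%n", original, tuned);
        System.out.printf("end-to-end speedup: %.2fx%n", ratio(original, tuned));
        System.out.printf("data pull speedup:  %.2fx%n", ratio(103, 25));
        System.out.printf("VCF write speedup:  %.2fx%n", ratio(190, 109));
        System.out.printf("GB scanned ratio:   %.1fx%n", ratio(237, 10));
    }
}
```

The phase totals agree with the 293 and 134 minute headline figures. The scan ratio computed from the quoted 237 GB and 10 GB comes out near 24x rather than 50x, so one of those figures may be rounded or taken from a different measurement.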

Tieout is identical

kcibul@kc-specops-tiny:~/stroke_tieout$ md5sum gold.jointcallset_0.vcf.gz
496178eae4afe63c4391d8eba64a9947  gold.jointcallset_0.vcf.gz

kcibul@kc-specops-tiny:~/stroke_tieout$ md5sum trial.full.jointcallset_0.vcf.gz
496178eae4afe63c4391d8eba64a9947  trial.full.jointcallset_0.vcf.gz
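
The same byte-for-byte check can be scripted instead of run by hand. Below is a minimal sketch using only the JDK; the class name `Md5Tieout` and its helpers are hypothetical, and the two paths are whatever files you pass on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Tieout {
    // Streams the file through an MD5 digest and returns the lowercase hex
    // string, in the same format that md5sum prints.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading updates the digest as a side effect.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True when both files hash to the same MD5 value.
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return md5Hex(a).equals(md5Hex(b));
    }

    public static void main(String[] args) throws Exception {
        Path gold = Paths.get(args[0]);
        Path trial = Paths.get(args[1]);
        System.out.println(md5Hex(gold) + "  " + gold);
        System.out.println(md5Hex(trial) + "  " + trial);
        System.out.println(sameContent(gold, trial) ? "tieout identical" : "tieout DIFFERS");
    }
}
```

Note that this compares the compressed bytes, exactly as `md5sum` does above, so two VCFs with identical records but different compression settings would still be reported as different.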

@@ -321,7 +330,7 @@ public int compare( GenericRecord o1, GenericRecord o2 ) {

for (final GenericRecord queryRow : avroReader) {
long location = (Long) queryRow.get(SchemaUtils.LOCATION_FIELD_NAME);
- int length = Integer.parseInt(queryRow.get(SchemaUtils.LENGTH_FIELD_NAME).toString());
+ int length = ((Long) queryRow.get(SchemaUtils.LENGTH_FIELD_NAME)).intValue();
Contributor

(for my edu only) Is it better to not convert to a string in the first place?

Contributor Author

It's expensive… to convert a number to a string and then parse the string to get back another number. The result from get is already a Long, so we just have to cast it as such. BigQuery doesn't return int, but we know it is an int and want it as such, so we call intValue() on it.
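
To make the difference concrete, here is a small standalone sketch of the two conversions. It uses a plain `Map<String, Object>` as a stand-in for the Avro `GenericRecord`, and the class name `LengthConversion` and the `"length"` key are hypothetical; only the two conversion expressions mirror the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class LengthConversion {
    // Old approach: Object -> String -> int. Allocates a String and parses it
    // back for every row.
    static int lengthViaString(Map<String, Object> row) {
        return Integer.parseInt(row.get("length").toString());
    }

    // New approach: the value is already a boxed Long, so cast it and narrow.
    static int lengthViaCast(Map<String, Object> row) {
        return ((Long) row.get("length")).intValue();
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("length", 1500L); // stand-in for a 64-bit integer column value

        System.out.println(lengthViaString(row));
        System.out.println(lengthViaCast(row));
    }
}
```

One behavioral difference worth knowing: `Integer.parseInt` throws `NumberFormatException` for a value outside the `int` range, while `Long.intValue()` silently truncates. If an out-of-range value were ever possible, `Math.toIntExact(long)` would fail loudly instead.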

}

private SortingCollection<GenericRecord> createSortedReferenceRangeCollectionFromExtractTableBigQuery(final String projectID,
final String fqRefTable,
Contributor

nit: spacing looks odd here

@rsasch rsasch left a comment

LGTM 👍🏻

@kcibul kcibul merged commit 3aa74a5 into ah_var_store Jan 19, 2022
@kcibul kcibul deleted the kc_ranges_prepare branch January 19, 2022 15:25