
Ah spike writeapi #7530

Merged: 57 commits merged into ah_var_store from ah_spike_writeapi on Dec 8, 2021

Conversation

@ahaessly (Contributor) commented Oct 28, 2021

This PR merges the BQ Write API with the ref ranges table and refactors the Import WDL to be single-sample.
I pulled the table creation out into its own WDL; it doesn't make sense to attempt it for each individual sample, so for now it needs to be run separately. We are also not setting the is_loaded field in the sample_info table; this needs to be resolved when we decide how we want to track that info.

See https://docs.google.com/document/d/1_ox38x7YjSeQx1I-6K_6kB4TTlonaEah2LRGevN9GmM/edit# for details

@gatk-bot commented Oct 29, 2021

Travis reported job failures from build 36659
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36659.1 | logs |
| cloud | openjdk11 | 36659.14 | logs |
| integration | openjdk11 | 36659.12 | logs |
| integration | openjdk8 | 36659.2 | logs |

@kcibul (Contributor) left a comment:

I love this!

@gatk-bot commented Nov 2, 2021

Travis reported job failures from build 36699
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36699.1 | logs |
| cloud | openjdk11 | 36699.14 | logs |
| integration | openjdk11 | 36699.12 | logs |

@gatk-bot commented Nov 2, 2021

Travis reported job failures from build 36704
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36704.1 | logs |
| cloud | openjdk11 | 36704.14 | logs |
| integration | openjdk11 | 36704.12 | logs |
| integration | openjdk8 | 36704.2 | logs |

@gatk-bot commented Nov 3, 2021

Travis reported job failures from build 36712
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36712.1 | logs |
| cloud | openjdk11 | 36712.14 | logs |
| integration | openjdk11 | 36712.12 | logs |
| integration | openjdk8 | 36712.2 | logs |

@gatk-bot commented Nov 3, 2021

Travis reported job failures from build 36716
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36716.1 | logs |
| cloud | openjdk11 | 36716.14 | logs |
| integration | openjdk11 | 36716.12 | logs |
| integration | openjdk8 | 36716.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36725
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36725.12 | logs |
| integration | openjdk8 | 36725.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36729
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36729.12 | logs |
| integration | openjdk8 | 36729.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36733
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36733.12 | logs |
| integration | openjdk8 | 36733.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36735
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36735.12 | logs |
| integration | openjdk8 | 36735.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36739
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36739.12 | logs |
| integration | openjdk8 | 36739.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36737
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36737.12 | logs |
| integration | openjdk8 | 36737.2 | logs |

@ahaessly marked this pull request as ready for review on November 4, 2021, 21:44
@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36747
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36747.12 | logs |
| integration | openjdk8 | 36747.2 | logs |

@kcibul (Contributor) left a comment:

First part of review -- everything BUT the Java code :D


A sample set for the quickstart has already been created with 10 samples and paths to re-blocked gVCFs for each sample. Run the two import workflows against this sample set by selecting "sample_set" as the root entity type ("Step 1") and `gvs_demo-10` for the data ("Step 2"). If you are creating your own sample set, note that the sample table should have a column for the re-blocked gVCFs (`hg38_reblocked_gvcf` or `reblocked_gvcf_path`) and their index files need to be in the same location.
## 2. Reblock samples
Run the GvsAoUReblockGvcf workflow on the samples
Contributor:

Can this just be the WARP pipeline? Feels off to have people running an AoU-specific pipeline in our quickstart.

Contributor Author:

Good point. I intended this workspace to be the template for the Gvs AoU-specific workspace. I'll change the name of the markdown file for this use.

To optimize the internal queries, each sample must have a unique and consecutive integer ID assigned. Run the `GvsAssignIds` workflow, which will create an appropriate ID for each sample in the sample set and update the BigQuery dataset with the sample name to ID mapping info.

## 3. Load data
### 3.1 Assign Gvs IDs (suggest renaming to GvsPrepareSamples)
Contributor:

FWIW - I prefer the Assign IDs name since that's what this step is doing.

Contributor Author:

It also takes care of creating the tables in BQ so I wanted that part to be more obvious. I'm fine leaving the name as is for now.
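For context, the mapping this step produces can be spot-checked directly in BigQuery. This is only an illustrative sketch: the column names below are assumptions for the example, not something confirmed in this PR.

```bash
# Illustrative only: the sample_name / sample_id column names are assumed here.
bq --project_id="${PROJECT_ID}" query --use_legacy_sql=false \
  "SELECT sample_name, sample_id FROM \`${PROJECT_ID}.${DATASET_NAME}.sample_info\` ORDER BY sample_id"
```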


These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external\_sample_names | Sample ids from the data model that will be mapped to gvs ids |
| workspace_namespace | The namespace of the workspace |
Contributor:

This part feels funky -- maybe there's not a better way to do it, but this means you must run this workflow inside Terra (thinking of other users), and also, IF you make a mistake and put the wrong workspace name here… it could mess up your data in the other workspace.

Is there not a way to map these results back onto the data model using Terra features? If not, maybe we should rethink how IDs are allocated.

ToL: why do we need these IDs in the data model anyway? Feels like an implementation detail we shouldn't expose our users to…

Contributor Author:

Again, I meant for this documentation to be specific to the workspace we use for AoU. I think it is useful to have this information in the workspace (just given how many times I've had to get the mapping between the two for tracking - but maybe we won't have the same issues that I had to deal with previously). It can be removed if it really isn't helpful.

| output_directory | A unique GCS path to be used for loading; can be in the workspace bucket, e.g. `gs://fc-124-12-132-123-31/gvs/demo1` |

| external\_sample_name | The name of the sample from the data model |
| gvs\_sample_id | The gvs_id from the data model |
Contributor:

Can we just grab this from BigQuery? One less thing for the user to do/mess up :D

Contributor Author:

Since we are now running ingest per sample, I didn't want each VM to have to make a query to the sample table for the ID. Any ideas of other ways where we won't run into query limits?
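For reference, the per-sample lookup being suggested would presumably look something like the sketch below (the sample_info column names are assumptions for illustration); the concern above is that issuing this once per sample VM may bump into BigQuery query quotas.

```bash
# Hypothetical per-sample lookup inside the import task; column names are assumed, not from this PR.
bq --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
  "SELECT sample_id FROM \`~{project_id}.~{dataset_name}.sample_info\` WHERE sample_name = '~{external_sample_name}'"
```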


**NOTE**: if your workflow fails, you will need to manually remove a lockfile from the output directory. It is called LOCKFILE, and can be removed with `gsutil rm`
Contributor:

soooo nice to not have to deal with this or the loading directory!!!
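For anyone who hits the failure case described in the note above, clearing the lockfile is a single gsutil call; the bucket path below is just the example output_directory from earlier and should be replaced with your own.

```bash
# Example only: substitute the output_directory you passed to the import workflow.
gsutil rm gs://fc-124-12-132-123-31/gvs/demo1/LOCKFILE
```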

preemptible_tries = preemptible_tries
}

call ImportSample {
Contributor:

indenting seems off

# check the INFORMATION_SCHEMA.PARTITIONS table to see if any of input sample names/ids have data loaded into their partitions
# this returns the list of sample names that do already have data loaded
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
"SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) AND p.total_logical_bytes > 0 AND (table_name like 'pet_%' OR table_name like 'vet_%')" | \
Contributor:

this should also check the ref_ranges_% tables for data
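A sketch of what this comment is asking for: the duplicate-data query from the diff above with one extra LIKE clause so the ref_ranges_% partitions are covered as well (not the final merged code).

```bash
# Same check as above, extended to ref_ranges_% per the review comment.
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
  "SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) AND p.total_logical_bytes > 0 AND (table_name like 'pet_%' OR table_name like 'vet_%' OR table_name like 'ref_ranges_%')"
```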

String docker
}

Int disk_size = if defined(drop_state) then 50 else 75
Contributor:

Do we need this much disk now that we are streaming?

private PetTsvCreator petTsvCreator;
private VetTsvCreator vetTsvCreator;
private SampleInfoTsvCreator sampleInfoTsvCreator;
private PetCreator petCreator;
Contributor:

Love the refactoring! Maybe this could be "RefCreator" since it could be either Pet or ReferenceRanges?

@@ -41,11 +46,14 @@
private final SAMSequenceDictionary seqDictionary;
private final Set<GQStateEnum> gqStatesToIgnore = new HashSet<GQStateEnum>();
private GenomeLocSortedSet coverageLocSortedSet;
private final static String PET_FILETYPE_PREFIX = "pet_";
private static String PET_FILETYPE_PREFIX = "pet_";
Contributor:

no longer final?



String table_where_clause = vet_check + ref_check + pet_check
Contributor:

I'm not following the logic here… if I say to load vet and ref_ranges… won't these "false"es get into the WHERE clause when you just concatenate them?

}


task CheckForDuplicateData {
Contributor:

While running it, I noticed this step call-cached… Should it? Should it be volatile instead?

.dockstore.yml (outdated review comments, resolved)
@@ -315,7 +316,7 @@ task GetNumSamplesLoaded {

echo "project_id = ~{project_id}" > ~/.bigqueryrc
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
"SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
"SELECT COUNT(*) as num_rows FROM ~{fq_sample_table}" > num_rows.csv
Contributor:

I'm confused about why we don't need to check this anymore in the (default) case where the table is sample_info

Contributor Author:

this must have been a holdover from when I thought I was going to remove this field and forgot to put this back in. adding back.
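Concretely, restoring the filter presumably brings back the query from the removed line of the diff above:

```bash
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
  "SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
```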

scripts/variantstore/wdl/GvsImportSample.wdl (outdated review comments, resolved)
@ahaessly merged commit ac1dc9c into ah_var_store on Dec 8, 2021
@ahaessly deleted the ah_spike_writeapi branch on December 8, 2021 at 15:40