
Ah spike writeapi #7530

Merged: 57 commits merged into ah_var_store from ah_spike_writeapi on Dec 8, 2021

Conversation

@ahaessly (Contributor) commented Oct 28, 2021

This PR merges the BQ Write API with the ref ranges table and refactors the Import WDL to be single-sample.
I pulled the table creation out into its own WDL; it doesn't make sense to attempt it for each individual sample, so for now it needs to be run separately. We are also not setting the is_loaded field in the sample_info table; this needs to be resolved when we decide how we want to track that info.

See https://docs.google.com/document/d/1_ox38x7YjSeQx1I-6K_6kB4TTlonaEah2LRGevN9GmM/edit# for details

@gatk-bot commented Oct 29, 2021

Travis reported job failures from build 36659
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36659.1 | logs |
| cloud | openjdk11 | 36659.14 | logs |
| integration | openjdk11 | 36659.12 | logs |
| integration | openjdk8 | 36659.2 | logs |

@kcibul (Contributor) left a comment:

I love this!

@gatk-bot commented Nov 2, 2021

Travis reported job failures from build 36699
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36699.1 | logs |
| cloud | openjdk11 | 36699.14 | logs |
| integration | openjdk11 | 36699.12 | logs |

@gatk-bot commented Nov 2, 2021

Travis reported job failures from build 36704
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36704.1 | logs |
| cloud | openjdk11 | 36704.14 | logs |
| integration | openjdk11 | 36704.12 | logs |
| integration | openjdk8 | 36704.2 | logs |

@gatk-bot commented Nov 3, 2021

Travis reported job failures from build 36712
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36712.1 | logs |
| cloud | openjdk11 | 36712.14 | logs |
| integration | openjdk11 | 36712.12 | logs |
| integration | openjdk8 | 36712.2 | logs |

@gatk-bot commented Nov 3, 2021

Travis reported job failures from build 36716
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| cloud | openjdk8 | 36716.1 | logs |
| cloud | openjdk11 | 36716.14 | logs |
| integration | openjdk11 | 36716.12 | logs |
| integration | openjdk8 | 36716.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36725
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36725.12 | logs |
| integration | openjdk8 | 36725.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36729
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36729.12 | logs |
| integration | openjdk8 | 36729.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36733
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36733.12 | logs |
| integration | openjdk8 | 36733.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36735
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36735.12 | logs |
| integration | openjdk8 | 36735.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36739
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36739.12 | logs |
| integration | openjdk8 | 36739.2 | logs |

@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36737
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36737.12 | logs |
| integration | openjdk8 | 36737.2 | logs |

@ahaessly marked this pull request as ready for review on November 4, 2021, 21:44
@gatk-bot commented Nov 4, 2021

Travis reported job failures from build 36747
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --------- | --- | ------ | ---- |
| integration | openjdk11 | 36747.12 | logs |
| integration | openjdk8 | 36747.2 | logs |

@kcibul (Contributor) left a comment:

First part of review -- everything BUT the Java code :D


A sample set for the quickstart has already been created with 10 samples and paths to re-blocked gVCFs for each sample. Run the two import workflows against this sample set by selecting "sample_set" as the root entity type ("Step 1") and `gvs_demo-10` for the data ("Step 2"). If you are creating your own sample set, note that the sample table should have a column for the re-blocked gVCFs (`hg38_reblocked_gvcf` or `reblocked_gvcf_path`) and their index files need to be in the same location.
## 2. Reblock samples
Run the GvsAoUReblockGvcf workflow on the samples
Contributor:

Can this just be the WARP pipeline? Feels off to have people running an AoU-specific pipeline in our quickstart.

Contributor Author:

Good point. I intended this workspace to be the template for the Gvs AoU-specific workspace. I'll change the name of the markdown file for this use.

To optimize the internal queries, each sample must have a unique and consecutive integer ID assigned. Run the `GvsAssignIds` workflow, which will create an appropriate ID for each sample in the sample set and update the BigQuery dataset with the sample name to ID mapping info.

## 3. Load data
### 3.1 Assign Gvs IDs (suggest renaming to GvsPrepareSamples)
Contributor:

FWIW - I prefer the Assign IDs name since that's what this step is doing.

Contributor Author:

It also takes care of creating the tables in BQ so I wanted that part to be more obvious. I'm fine leaving the name as is for now.
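For context, the mapping this step produces can be spot-checked directly in BigQuery. This is only an illustrative sketch: the column names below are assumptions for the example, not something confirmed in this PR.

```bash
# Illustrative only: the sample_name / sample_id column names are assumed here.
bq --project_id="${PROJECT_ID}" query --use_legacy_sql=false \
  "SELECT sample_name, sample_id FROM \`${PROJECT_ID}.${DATASET_NAME}.sample_info\` ORDER BY sample_id"
```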


These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external\_sample_names | Sample ids from the data model that will be mapped to gvs ids |
| workspace_namespace | The namespace of the workspace |
Contributor:

This part feels funky -- maybe there's not a better way to do it, but this means you must run this workflow inside Terra (thinking of other users), and also, IF you make a mistake and put the wrong workspace name here… it could mess up your data in the other workspace.

Is there not a way to map these results back onto the data model using Terra features? If not, maybe we should rethink how IDs are allocated.

ToL: why do we need these IDs in the data model anyway? Feels like an implementation detail we shouldn't expose our users to…

Contributor Author:

Again, I meant for this documentation to be specific to the workspace we use for AoU. I think it is useful to have this information in the workspace (just given how many times I've had to get the mapping between the two for tracking - but maybe we won't have the same issues that I had to deal with previously). It can be removed if it really isn't helpful.

| output_directory | A unique GCS path to be used for loading; can be in the workspace bucket, e.g. `gs://fc-124-12-132-123-31/gvs/demo1` |

| external\_sample_name | The name of the sample from the data model |
| gvs\_sample_id | The gvs_id from the data model |
Contributor:

Can we just grab this from BigQuery? One less thing for the user to do/mess up :D

Contributor Author:

Since we are now running ingest per sample, I didn't want each VM to have to make a query to the sample table for the ID. Any ideas of other ways where we won't run into query limits?
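For reference, the per-sample lookup being suggested would presumably look something like the sketch below (the sample_info column names are assumptions for illustration); the concern above is that issuing this once per sample VM may bump into BigQuery query quotas.

```bash
# Hypothetical per-sample lookup inside the import task; column names are assumed, not from this PR.
bq --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
  "SELECT sample_id FROM \`~{project_id}.~{dataset_name}.sample_info\` WHERE sample_name = '~{external_sample_name}'"
```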


**NOTE**: if your workflow fails, you will need to manually remove a lockfile from the output directory. It is called LOCKFILE, and can be removed with `gsutil rm`
Contributor:

soooo nice to not have to deal with this or the loading directory!!!
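For anyone who hits the failure case described in the note above, clearing the lockfile is a single gsutil call; the bucket path below is just the example output_directory from earlier and should be replaced with your own.

```bash
# Example only: substitute the output_directory you passed to the import workflow.
gsutil rm gs://fc-124-12-132-123-31/gvs/demo1/LOCKFILE
```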

preemptible_tries = preemptible_tries
}

call ImportSample {
Contributor:

indenting seems off

# check the INFORMATION_SCHEMA.PARTITIONS table to see if any of input sample names/ids have data loaded into their partitions
# this returns the list of sample names that do already have data loaded
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
"SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) AND p.total_logical_bytes > 0 AND (table_name like 'pet_%' OR table_name like 'vet_%')" | \
Contributor:

this should also check the ref_ranges_% tables for data
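A sketch of what this comment is asking for: the duplicate-data query from the diff above with one extra LIKE clause so the ref_ranges_% partitions are covered as well (not the final merged code).

```bash
# Same check as above, extended to ref_ranges_% per the review comment.
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
  "SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) AND p.total_logical_bytes > 0 AND (table_name like 'pet_%' OR table_name like 'vet_%' OR table_name like 'ref_ranges_%')"
```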

String docker
}

Int disk_size = if defined(drop_state) then 50 else 75
Contributor:

Do we need this much disk now that we are streaming?

private PetTsvCreator petTsvCreator;
private VetTsvCreator vetTsvCreator;
private SampleInfoTsvCreator sampleInfoTsvCreator;
private PetCreator petCreator;
Contributor:

Love the refactoring! Maybe this could be "RefCreator" since it could be either Pet or ReferenceRanges?

@@ -41,11 +46,14 @@
private final SAMSequenceDictionary seqDictionary;
private final Set<GQStateEnum> gqStatesToIgnore = new HashSet<GQStateEnum>();
private GenomeLocSortedSet coverageLocSortedSet;
private final static String PET_FILETYPE_PREFIX = "pet_";
private static String PET_FILETYPE_PREFIX = "pet_";
Contributor:

no longer final?



String table_where_clause = vet_check + ref_check + pet_check
Contributor:

I'm not following the logic here… if I say to load vet and ref_ranges… won't these "false"es get into the WHERE clause when you just concatenate them?

}


task CheckForDuplicateData {
Contributor:

While running it, I noticed this step call-cached… Should it? Should it be volatile instead?

.dockstore.yml (outdated review comments, resolved)
@@ -315,7 +316,7 @@ task GetNumSamplesLoaded {

echo "project_id = ~{project_id}" > ~/.bigqueryrc
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
"SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
"SELECT COUNT(*) as num_rows FROM ~{fq_sample_table}" > num_rows.csv
Contributor:

I'm confused about why we don't need to check this anymore in the (default) case where the table is sample_info

Contributor Author:

this must have been a holdover from when I thought I was going to remove this field and forgot to put this back in. adding back.
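Concretely, restoring the filter presumably brings back the query from the removed line of the diff above:

```bash
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
  "SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
```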

scripts/variantstore/wdl/GvsImportSample.wdl (outdated review comments, resolved)
@ahaessly merged commit ac1dc9c into ah_var_store on Dec 8, 2021
@ahaessly deleted the ah_spike_writeapi branch on December 8, 2021 at 15:40