Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

313 Cleanup Extract Cohort params #7293

Merged
merged 26 commits into from
Jun 24, 2021

Conversation

RoriCremer
Copy link
Contributor

@RoriCremer RoriCremer commented Jun 7, 2021

There are params in Extract Cohort that can be tightened up

Extract Tool has two params that are not used by Extract Features and are already in Extract Cohort

The filter_set_name is used in the BQ filtering tables and looks like we can set it to be completely required for any type of filtering.

There are 3 BQ filter tables -- 2 are needed for filtering (no matter what?) and 1 (tranches) is needed for thresholding and sensitivity calculations?

Genotype level filtering is true by default, but this doesn't seem like it should effect things after this cleanup. Though technically it should only be true if a filter_set_name has been specified. I will add another comment in the body of the code, but I would like to add this safety gate explicitly

Disable gnarly doesn't need to be a passed in param---so we'll rip it out for now

SNP and INDEL truth sensitivity and SNP and INDEL Lod scores are cumbersome to have to worry about passing in, but I dont see a better alternative. Should there be additional validation on these (where if they are specified, but no filter_set_name is, then they throw an error?)

@RoriCremer RoriCremer marked this pull request as ready for review June 7, 2021 16:45
ExtractCohort already has both of these params (with different input names) accounted for and ExtractFeatures doesn't use these params at all
@RoriCremer RoriCremer force-pushed the rc-313-cleanup-extract-params branch from 14df450 to 9b24bf4 Compare June 17, 2021 15:46
Copy link
Contributor

@ahaessly ahaessly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one required change - rename onlyAnnotate

@@ -35,7 +35,9 @@
private static final Logger logger = LogManager.getLogger(ExtractCohort.class);
private ExtractCohortEngine engine;

@Argument(
public enum FilteringType { NONE, GENOTYPE, SITES};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does selecting GENOTYPE also imply that we are doing SITE level filtering? So maybe the enum should be {NONE, SITES_ONLY, FULL} (or instead of FULL maybe SITES_AND_GT)?

// set headers
extraHeaderLines.add(FilterSensitivityTools.getTruthSensitivityHeader(truthSensitivitySNPThreshold, vqsLodSNPThreshold, GATKVCFConstants.SNP));
extraHeaderLines.add(FilterSensitivityTools.getTruthSensitivityHeader(truthSensitivityINDELThreshold, vqsLodINDELThreshold, GATKVCFConstants.INDEL));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not for this PR, but I'm wondering if we should make FilterSensititvityTools an class where we create instances. We would pass it all the params on construction, it would do the validation. It could also add the headers and then we just ask it for the 2 thresholds so all the logic is encapsulated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true---it does seem like a lot of params to be passing around

if (!this.disableGnarlyGenotyper && genotypedVC == null) {
return;
}
final VariantContext genotypedVC = annotatedVC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could name the variable in the previous line genotypedVC and then get rid of this line

@@ -489,7 +474,7 @@ private VariantContext filterSiteByAlleleSpecificVQSLOD(VariantContext mergedVC,
builder.attribute(GATKVCFConstants.AS_VQS_LOD_KEY, relevantVqsLodMap.values().stream().map(val -> val.equals(Double.NaN) ? VCFConstants.EMPTY_INFO_FIELD : val.toString()).collect(Collectors.toList()));
builder.attribute(GATKVCFConstants.AS_YNG_STATUS_KEY, new ArrayList<>(relevantYngMap.values()));

if (!onlyAnnotate) {
if (filteringType.equals(ExtractCohort.FilteringType.SITES)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this finding the variable filteringType? the parameter passed in is onlyAnnotate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooof--nice catch, thank you!

optional = true
)
private boolean performGenotypeVQSLODFiltering = true;
private FilteringType filteringType = FilteringType.NONE; // TODO do we want to make sure this is always / required?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be FilteringType.GENOTYPE by default, and if they don't supply a filter_set we can either throw an error or change the value for this to be NONE (maybe the former)

@@ -460,7 +460,7 @@ private void finalizeCurrentVariant(final List<VariantContext> unmergedCalls,
}
}

private VariantContext filterSiteByAlleleSpecificVQSLOD(VariantContext mergedVC, HashMap<Allele, HashMap<Allele, Double>> vqsLodMap, HashMap<Allele, HashMap<Allele, String>> yngMap, boolean onlyAnnotate) {
private VariantContext filterSiteByAlleleSpecificVQSLOD(VariantContext mergedVC, HashMap<Allele, HashMap<Allele, Double>> vqsLodMap, HashMap<Allele, HashMap<Allele, String>> yngMap, ExtractCohort.FilteringType onlyAnnotate) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onlyAnnotate was somewhat confusing before -- I think now this could just be flteringType

@@ -160,7 +153,7 @@ public void traverse() {
}
final String rowRestrictionWithFilterSetName = rowRestriction + " AND " + SchemaUtils.FILTER_SET_NAME + " = '" + filterSetName + "'";

boolean noVqslodFilteringRequested = (filterSetInfoTableRef == null);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wasn't this stopping the NPEs before?
this.filterSetInfoTableRef = filterSetInfoTableName == null || "".equals(filterSetInfoTableName) ? null : new TableReference(filterSetInfoTableName, SchemaUtils.YNG_FIELDS);
so we shouldn't be throwing it unless we are somehow setting the filterSetInfoTableName? Does this ternary need parens?

SITES("sites"),
NONE("");

String value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this.. you don't need the value so the enum could be

public enum VQSLODFilteringType { GENOTYPE, SITES, NONE };

fullName = "vqslod-filter-genotypes",
doc = "Should VQSLOD filtering be applied at the genotype level",
fullName = "vqslod-filter-by-site",
doc = "If VQSLOD filtering is applied, it should be at a site level. Default is false",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice docs

Set<VCFHeaderLine> extraHeaderLines = new HashSet<>();

if (filterSetInfoTableName != null) { // filter using vqslod-- default to GENOTYPE unless SITES specifically selected
if (performSiteSpecificVQSLODFiltering) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be a ternary instead

vqslodfilteringType = performSiteSpecificVQSLODFiltering ?  VQSLODFilteringType.SITES : VQSLODFilteringType.GENOTYPE;

@@ -122,13 +118,13 @@ public ExtractCohortEngine(final String projectID,
this.traversalIntervals = traversalIntervals;
this.minLocation = minLocation;
this.maxLocation = maxLocation;
this.filterSetInfoTableRef = filterSetInfoTableName == null || "".equals(filterSetInfoTableName) ? null : new TableReference(filterSetInfoTableName, SchemaUtils.YNG_FIELDS);
this.filterSetSiteTableRef = filterSetSiteTableName == null || "".equals(filterSetSiteTableName) ? null : new TableReference(filterSetSiteTableName, SchemaUtils.FILTER_SET_SITE_FIELDS);
this.filterSetInfoTableRef = (filterSetInfoTableName == null || "".equals(filterSetInfoTableName)) ? null : new TableReference(filterSetInfoTableName, SchemaUtils.YNG_FIELDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just test the enum here?

@gatk-bot
Copy link

gatk-bot commented Jun 24, 2021

Travis reported job failures from build 34712
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 34712.13 logs
unit openjdk8 34712.3 logs

@gatk-bot
Copy link

gatk-bot commented Jun 24, 2021

Travis reported job failures from build 34718
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 34718.13 logs
unit openjdk8 34718.3 logs

@RoriCremer RoriCremer merged commit 2dc9d8f into ah_var_store Jun 24, 2021
@RoriCremer RoriCremer deleted the rc-313-cleanup-extract-params branch June 24, 2021 21:59
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants