Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature support: stripped-down replacement for the old GATK ROD system #224

Merged
merged 1 commit into from
Feb 26, 2015

Conversation

droazen
Copy link
Collaborator

@droazen droazen commented Feb 23, 2015

Replicates most of the functionality of the old ROD system in ~5% of the
code. The incomprehensible tangle of nested iterators, bindings, views, states,
tracks, trackers, builders etc., etc., is gone, replaced by about 4 core classes:
FeatureContext, FeatureDataSource, FeatureInput, and FeatureManager.

FeatureContext: This is tool-facing interface (replaces RefMetaDataTracker).
Allows particular sources of Features to be queried.

FeatureDataSource: Handles the low-level details of querying a source of Features.
Uses a caching scheme optimized for the use case of queries over
intervals with gradually increasing start/stop positions.

FeatureInput: This is used to declare Feature arguments in tools (replaces RodBinding).
The engine discovers all FeatureInput arguments declared in the tool's class
hierarchy, and initializes data sources for each one that was specified
on the command line.

FeatureManager: Manages the pool of data sources, as well as codec and file format
discovery and type checking.

-ReadWalker interface has changed: apply() now takes a FeatureContext argument
(will be null if there are no sources of Features).

-Included an example tool PrintReadsWithVariants to demonstrate use of the new
ReadWalker interface.

-Since Feature files must be indexed in order to query them, I have provided a
tool IndexFeatureFile that can index any Feature-containing file.

-Made required changes to the argument-parsing system. Feature argument discovery
is as de-coupled as possible from the main arg parser.

-Made required changes to BQSR, and eliminated the temporary HACKRefMetaDataTracker.

-Comprehensive tests

@lbergelson
Copy link
Member

yay!

On Mon, Feb 23, 2015 at 6:02 PM, droazen [email protected] wrote:

Replicates most of the functionality of the old ROD system in ~5% of the
code. The incomprehensible tangle of nested iterators, bindings, views,
states,
tracks, trackers, builders etc., etc., is gone, replaced by about 4 core
classes:
FeatureContext, FeatureDataSource, FeatureInput, and FeatureManager.

FeatureContext: This is tool-facing interface (replaces
RefMetaDataTracker).
Allows particular sources of Features to be queried.

FeatureDataSource: Handles the low-level details of querying a source of
Features.
Uses a caching scheme optimized for the use case of queries over
intervals with gradually increasing start/stop positions.

FeatureInput: This is used to declared Feature arguments in tools
(replaces RodBinding).
The engine discovers all FeatureInput arguments declared in the tool's
class
hierarchy, and initializes data sources for each one that was specified
on the command line.

FeatureManager: Manages the pool of data sources, as well as codec and
file format
discovery and type checking.

-ReadWalker interface has changed: apply() now takes a FeatureContext
argument
(will be null if there are no sources of Features).

-Included an example tool PrintReadsWithVariants to demonstrate use of the
new
ReadWalker interface.

-Since Feature files must be indexed in order to query them, I have
provided a
tool IndexFeatureFile that can index any Feature-containing file.

-Made required changes to the argument-parsing system. Feature argument
discovery
is as de-coupled as possible from the main arg parser.

-Made required changes to BQSR, and eliminated the temporary
HACKRefMetaDataTracker.

-Comprehensive tests

You can view, comment on, or merge this pull request online at:

#224
Commit Summary

  • Feature support: stripped-down replacement for the old GATK ROD
    system

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#224.

@droazen
Copy link
Collaborator Author

droazen commented Feb 23, 2015

There are currently a few known issues that reviewers should bear in mind (these will become tickets shortly):

-All codecs marked as implementing ReferenceDependentFeatureCodec are currently non-functional. I need to either refactor them to not require a GenomeLocParser or delete them entirely.

-The IndexFeatureFile tool is not currently working on block-compressed files -- will fix this soon.

-IndexFeatureFile needs integration tests (will work on this during code review)

-BQSR needs integration tests for the case of multiple simultaneous known sites files (now that it supports them!)

-All codecs should move to tribble, and must implement canDecode() correctly (this is a new requirement, since it's no longer possible to manually request a particular codec). Most canDecode() implementations can be file-extension-based; only things like VCF format detection need to examine file contents to determine file type.

-FeatureDataSource supports querying by interval, as well as full traversals, but not full traversal by a set of intervals (yet). This latter feature will be needed for the VariantWalker traversal.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.67%) to 74.54% when pulling d006568b0244cc85ea9092d0b96d2b7d29230bd5 on dr_feature_support into f93c4b6 on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.64%) to 74.58% when pulling d3acbf0f56c789f9ac98239e6f85001afbc65720 on dr_feature_support into f93c4b6 on master.

Set<Class<?>> concreteClassSet = new HashSet<>();

for ( Class<?> clazz : classes ) {
if ( ! Modifier.isAbstract(clazz.getModifiers()) && ! Modifier.isInterface(clazz.getModifiers()) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract a static method isConcrete(Class c) and use it here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.64%) to 74.58% when pulling 39b2902dae243d7dea480313ac1ab4b6d7bb15d9 on dr_feature_support into f93c4b6 on master.

// If the Collection's parameterized type is itself parameterized (eg., List<Foo<Bar>>),
// return the raw type of the outer parameter (Foo.class, in this example) to avoid a
// ClassCastException. Otherwise, return the Collection's type parameter directly as a Class.
return (Class<?>) (genericTypes[0] instanceof ParameterizedType ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment why the 0-th items in genericTypes is special

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not special -- it's simply the first (and only) type parameter (notice that a few lines earlier we've checked that the Collection has only a single type parameter). Will clarify in the comments.

* Gets all Features from the sources represented by the provided FeatureInput arguments that overlap
* this FeatureContext's query interval.
*
* Returned Features are not guaranteed to be in any particular order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they may also not be unique

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a note to that effect

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.64%) to 74.58% when pulling 11667a7fb7f7273d77bc25c34aa12a4862f71706 on dr_feature_support into f93c4b6 on master.

public String get(String columnName) {
int position = keys.indexOf(columnName);
if (position < 0) throw new IllegalArgumentException("We don't have a column named " + columnName);
return values.get(position);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should delegate to getValue. and the two methods should have unified names

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am hesitant to make changes to codecs that are not currently covered by tests. This kind of refactoring should wait until TableCodecUnitTest exists.

* @return true if all records overlapping the provided interval are already contained in our cache, otherwise false
*/
public boolean cacheHit( final GenomeLoc interval ) {
return cachedContig != null &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a method on GenomeLoc that implements this functionality?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use GenomeLoc here due to the sequence dictionary requirement (we can't assume the presence of a sequence dictionary in this part of the engine).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a todo or a note in #61 and we can fix it when we pull the overlap methods up out of GenomeLoc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean #100, but will do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that wast the one I meant..

/**
* True if the file backing this data source has an accompanying index file, false if it doesn't
*/
private boolean hasIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be made final. Maybe other fields too. It's always a good idea to make things final if possible. Makes reading code simpler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@akiezun
Copy link
Contributor

akiezun commented Feb 25, 2015

i like it. i want to get rid of many of those codecs though, let's make sure that only the useful ones survive

* Gets all Features from the source represented by the provided FeatureInput argument that overlap
* this FeatureContext's query interval.
*
* Returned Features are not guaranteed to be in any particular order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it should probably return a set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Features are not guaranteed to be unique, so a set would be inappropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiSet then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require a 3rd-party collections framework plus updating all client code that we port. Not worth it.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.62%) to 74.61% when pulling 5e7efc1c114c403a8157878dd1874016540902ae on dr_feature_support into 993ba00 on master.

@droazen
Copy link
Collaborator Author

droazen commented Feb 26, 2015

All comments addressed -- squashing and merging now.

Replicates most of the functionality of the old ROD system in ~5% of the
code. The incomprehensible tangle of nested iterators, bindings, views, states,
tracks, trackers, builders etc., etc., is gone, replaced by about 4 core classes:
FeatureContext, FeatureDataSource, FeatureInput, and FeatureManager.

FeatureContext: This is tool-facing interface (replaces RefMetaDataTracker).
                Allows particular sources of Features to be queried.

FeatureDataSource: Handles the low-level details of querying a source of Features.
                   Uses a caching scheme optimized for the use case of queries over
                   intervals with gradually increasing start/stop positions.

FeatureInput: This is used to declare Feature arguments in tools (replaces RodBinding).
              The engine discovers all FeatureInput arguments declared in the tool's class
              hierarchy, and initializes data sources for each one that was specified
              on the command line.

FeatureManager: Manages the pool of data sources, as well as codec and file format
                discovery and type checking.

-ReadWalker interface has changed: apply() now takes a FeatureContext argument
 (will be null if there are no sources of Features).

-Included an example tool PrintReadsWithVariants to demonstrate use of the new
 ReadWalker interface.

-Since Feature files must be indexed in order to query them, I have provided a
 tool IndexFeatureFile that can index any Feature-containing file.

-Made required changes to the argument-parsing system. Feature argument discovery
 is as de-coupled as possible from the main arg parser.

-Made required changes to BQSR, and eliminated the temporary HACKRefMetaDataTracker.

-Comprehensive tests
@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.62%) to 74.61% when pulling f7c1911 on dr_feature_support into 993ba00 on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-2.62%) to 74.61% when pulling f7c1911 on dr_feature_support into 993ba00 on master.

droazen added a commit that referenced this pull request Feb 26, 2015
Feature support: stripped-down replacement for the old GATK ROD system
@droazen droazen merged commit 69f84cd into master Feb 26, 2015
@droazen droazen deleted the dr_feature_support branch February 26, 2015 20:39
lbergelson pushed a commit that referenced this pull request May 31, 2017
…nts_in_separate_plots

Plot acnv segments in separate plots.  Closes #224 and #522.
schaluva added a commit that referenced this pull request Apr 2, 2021
* add bash code to chunk files at 15tb limit

* update dockstore.yml

* fix yml formatting

* fill in importLoadTable inputs json

* take out variable in comment

* remove unnecessary files in task

* printf for scientific notation

* scientific notation

* increase memory to check if memory issue

* remove gsutil copy log

* fix gsutil cp destination directory

* decrease memory after testing for increased memory fails

* testing bq job id

* test sed pattern matching

* take out mv command for testing purposes

* update move to done files command to within each set

* troubleshoot bq load command

* update project for inputs json

* revert parsing of job id

* test

* write load status to separate file

* sed file to get job ID

* test2

* fix bq wait status

* separate wait status command into pieces

* reorganize output block

* move wait command into main loop

* test taking out wait command entirely

* add task jsut for bq wait testing

* set project for bq wait

* put everything back now that wait works with project

* re commit with updated comments and cleaned up echo commands

* take out call to test task

* mv files from original dir to set dirs instead of copy

* trouble shoot wait_status

* change back to cp for testing in copying sets step

* take out output file for bq wait

* add back input file to wait statement

* add in more tmp files in output

* try to capture success in bq wait

* reorg output file for while loop

* fix sed command

* clean up for final test - mv files from original to sets

* update full import genomes wdl

* take out un-used inputs in load table task

* put back in the variables that control calls in order

* fixing lockfile exists comment

* add quotes to escape backtick in echo

* take out hard coded pet table in gsutil du

* single quote echo to block expression expansion in lockfile

* multi line echo comment for lockfile

* fix missing quotes

* set dockstore yml back to what it should be for ah_var_store branch

* remove testing wdl and json files

* set single file with all sum and sets

* fix sed variable format

* testing pattern matching

* fix reading wrong input file in loop

* adding output files

* test

* fix quotes on awk statement

* awk

* clean up comments and print statements

* edit comments
mmorgantaylor pushed a commit that referenced this pull request Apr 6, 2021
* add bash code to chunk files at 15tb limit

* update dockstore.yml

* fix yml formatting

* fill in importLoadTable inputs json

* take out variable in comment

* remove unnecessary files in task

* printf for scientific notation

* scientific notation

* increase memory to check if memory issue

* remove gsutil copy log

* fix gsutil cp destination directory

* decrease memory after testing for increased memory fails

* testing bq job id

* test sed pattern matching

* take out mv command for testing purposes

* update move to done files command to within each set

* troubleshoot bq load command

* update project for inputs json

* revert parsing of job id

* test

* write load status to separate file

* sed file to get job ID

* test2

* fix bq wait status

* separate wait status command into pieces

* reorganize output block

* move wait command into main loop

* test taking out wait command entirely

* add task jsut for bq wait testing

* set project for bq wait

* put everything back now that wait works with project

* re commit with updated comments and cleaned up echo commands

* take out call to test task

* mv files from original dir to set dirs instead of copy

* trouble shoot wait_status

* change back to cp for testing in copying sets step

* take out output file for bq wait

* add back input file to wait statement

* add in more tmp files in output

* try to capture success in bq wait

* reorg output file for while loop

* fix sed command

* clean up for final test - mv files from original to sets

* update full import genomes wdl

* take out un-used inputs in load table task

* put back in the variables that control calls in order

* fixing lockfile exists comment

* add quotes to escape backtick in echo

* take out hard coded pet table in gsutil du

* single quote echo to block expression expansion in lockfile

* multi line echo comment for lockfile

* fix missing quotes

* set dockstore yml back to what it should be for ah_var_store branch

* remove testing wdl and json files

* set single file with all sum and sets

* fix sed variable format

* testing pattern matching

* fix reading wrong input file in loop

* adding output files

* test

* fix quotes on awk statement

* awk

* clean up comments and print statements

* edit comments
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants