Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container & documentation #3

Open
Zethson opened this issue Oct 25, 2021 · 12 comments
Open

Container & documentation #3

Zethson opened this issue Oct 25, 2021 · 12 comments

Comments

@Zethson
Copy link

Zethson commented Oct 25, 2021

Hi,

would it be possible that you provide a container with a full environment to get the pipeline running, please? A small example dataset would also help a lot.
Moreover, the pipeline parameters and flags etc are not document yet.

Thank you very much!

@petervangalen
Copy link
Collaborator

Do the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:

maegtk bcall -i data/test_maester.bam -o test_maester -z

@Zethson
Copy link
Author

Zethson commented Dec 7, 2021

Currently doing other things. Will revisit this at a later point. Thank you for replying though!

@JZL
Copy link

JZL commented Dec 14, 2021

On a similar note, adding the R package dependencies to the FAQ would be helpful. I was moving fast so didn't realize it also needed SummarizedExperiment so it almost finished and then failed

For what it's worth, I didn't do a full container but used conda-forge to install the dependencies for only this project, and it was nice. You could even set up a maegatk conda(-forge) package and it would take care of installing all the binary dependencies for future users

@caleblareau
Copy link
Owner

added some notes on the R package install-- thanks for that note @JZL

I don't have any direct experience making a conda package but would welcome a pull request from anyone for this!

@JZL
Copy link

JZL commented Dec 17, 2021

Hi,

These are just some scattered thoughts getting it to run. I can make a more formal PR with these changes in the future

To get this running for future people. You also need to install samtools and run bwa index ***.fa first, else it fails.

Additionally:

  • pysam, bwa mem, and samtools sort support a threads argument which seemed to help speed it up a little. (Although I learned the hard way that if you do this within multi_file_manager it causes a thread bomb an crashes the machine!)
  • In
    if read_barcode in bc:
    within writePassingReads where it checks if read_barcode is in bc and then gets its index, when bc is a list it iteratively checks all the items for each call. Which is a lot of iterative checking for every read of big bams. I didn't profile the difference but I thought I might as well define bcDict = dict((x, i) for i, x in enumerate(bc)) and then use the dictionary to check if the read_barcode is in bc and to get the index in constant time

@alecpnkw
Copy link

Do the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:

maegtk bcall -i data/test_maester.bam -o test_maester -z

I am having a similar issue when trying to run this small test dataset as described. Is there a specific snakemake version that maegatk has been tested with? Trying to isolate what is going wrong for me. The pipeline seems to run fine for the 'splitting' part, but then fails to submit the snakemake job and perform the remaining analyses:

Thu Mar 28 22:36:19 EDT 2024: maegatk v0.2.0
Thu Mar 28 22:36:19 EDT 2024: Found bam file: tests/data/test_maester.bam for genotyping.
Thu Mar 28 22:36:19 EDT 2024: Will determine barcodes with at least: 100 mitochondrial reads.
Thu Mar 28 22:36:19 EDT 2024: User specified mitochondrial genome matches .bam file
Thu Mar 28 22:36:23 EDT 2024: Finished determining/splitting barcodes for genotyping.
Thu Mar 28 22:36:23 EDT 2024: Genotyping samples with 2 threads

where the above is the end of the log message. If I look in "final/" this is all I see:

[final]$ ls
barcodeQuants.tsv  chrM_refAllele.txt  passingBarcodes.tsv

My environment for running maegatk looks like this and is based on your description in the readme.

name: "maegatk"
channels: 
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - openjdk
  - python >= 3.8
  - snakemake == 6.15.5
  - bwa == 0.7.17
  - samtools >= 1.9
  - freebayes
  - r-base == 4.0.3 # dplyr, data.table, Matrix, GenomicRanges, SummarizedExperiment
  - pip:
    - maegatk == 0.2.0
    - ruamel.yaml <= 0.18.0

Thanks!

@noranekonobokkusu
Copy link
Collaborator

@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.

If nothing informative comes from logs/, try modifying snakecmd_scatter and snakecmd_gather commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, like

snakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="'  + y_s + '" --stats '+snake_stats + snake_log_out

If that won't work either, try this:
Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:

  1. Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py

  2. Modify oneSample_maegatk.py:
    change this
    from ruamel import yaml
    into this
    from ruamel.yaml import YAML
    then change this
    config = yaml.load(stream, Loader=yaml.Loader)
    into this

yaml = YAML(typ='rt')
config = yaml.load(stream)

@alecpnkw
Copy link

Thanks, knowing that a later snakemake versions runs in helpful. I'd like to try to keep the source the same as v0.2.0 for deployment reasons but happy to test if I can't get anything else to work. Based on the issues you linked I tweaked my environment to change the ruamel.yaml version to 0.17 and now am getting better results (!). It looks like the final output is there (final/maegatk.rds), and final/maegatk.depthTable.txt looks like this:

TGGTAGTGAACC-1  57.35
GGGTTCGCCTCC-1  66.34
TTCCTACGCAAT-1  55.25
CCCGTCGTGGTA-1  53.17
CCACAAAACATG-1  48.12
GTACCCACAGCC-1  43.3
GGCGCTAATGAA-1  67.39
ACAATTAGCACT-1  51.39
CTAACCCGGAAT-1  42.79
GACCGTGCATTT-1  52.64

Still, I am not getting any logs inside logs/rmdupslogs and getting dumped cores that I think correspond to the java jobs. Should I be seeing logs for rmdups in the test output?

@noranekonobokkusu
Copy link
Collaborator

It seems your test job has finished successfully.
When I suggested checking the logs, I was referring to .log files, like maegatk.snakemake_scatter.log or maegatk.snakemake_gather.log which are usually helpful. I am not the original developer of maegatk so my knowledge on this part is rather limited and I cannot comment on the expected content of rmdupslogs/, but my rmdupslogs/ is empty too (at least for the test dataset).

@alecpnkw
Copy link

Awesome, thanks. Is there someplace I can access a test final/maegatk.rds file to make sure I'm generating all the intended assay slots and coldata?

@Spirometra
Copy link

@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.

If nothing informative comes from logs/, try modifying snakecmd_scatter and snakecmd_gather commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, like

snakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="'  + y_s + '" --stats '+snake_stats + snake_log_out

If that won't work either, try this: Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:

  1. Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py
  2. Modify oneSample_maegatk.py:
    change this
    from ruamel import yaml
    into this
    from ruamel.yaml import YAML
    then change this
    config = yaml.load(stream, Loader=yaml.Loader)
    into this
yaml = YAML(typ='rt')
config = yaml.load(stream)

Hi there @noranekonobokkusu. I came across similar problems using the test data as you did in issue #5 . First I uninstalled ruamel.yaml (current version 0.18.6) and then installed ruamel.yaml==0.17 and still get the same issues, so back to 0.18.6. I've tried snakemake 6.6.1 and 7.18.2. I tried specifying snakemake path in the updated cli.py file, and the modification to the oneSample_maegatk.py as suggested. Still I get the same error output:

Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

base.maegatk.log does not show errors, however I do not see the .rds file in /final. but in maegatk.snakemake_gather.log I get an error about "Error in rule make_final_sparse_matrices"and in maegatk.snakemake_scatter.log some Java and samtools error.
I've tried version Java 1.7.0_261 and Java/1.8.0_144 (this is with 1.8.0_144).
See attached log files.

base.maegatk.log
maegatk.snakemake_gather (1).log
maegatk.snakemake_scatter (1).log

What would your recommendation be to get this to run? I'd really like to use this tool!

@noranekonobokkusu
Copy link
Collaborator

Hi @Spirometra sorry for the late reply.
I can see bwa is not available in your path (from .scatter.log):
sh: bwa: command not found
Try adding BWA to your environment and rerunning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants