
Releases: iquasere/reCOGnizer

Databases' names changed to CD-batch search options

12 Sep 15:26

The database names accepted by the --databases parameter have changed to accommodate the options available in CDD Batch Search. The new options are:

  • NCBI_Curated
  • Pfam
  • SMART
  • KOG
  • COG
  • PRK
  • TIGR

Domains now follow the lists in the PN files provided by NCBI

Domains related to the NCBI_Curated and PRK databases were not all being considered when building databases. This has been fixed, in accordance with the PN files provided with cdd.tar.gz.

Database construction reimplemented to use the PN files provided by CDD

If the PN files are not available, reCOGnizer will still build them itself, but the resulting lists will include additional domains.

This should fix #19, but let's see.

Also removed deprecated parameters

The --download-resources and --skip-downloaded parameters now result in an error when specified.

Fix on regex search of EC numbers

30 Dec 13:10

re.escape is required when a regex is built by concatenating strings, so that special characters in those strings are matched literally.

E.g., so that the ) is treated as a literal character when searching for (1.1.1.1) in the function in question.

This problem was caused by using the new r"regex" format.
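
A minimal sketch of the issue, not reCOGnizer's actual code: the helper names and the surrounding `\s` and `$` pieces are made up for illustration, and only the use of re.escape on the concatenated fragment reflects the fix described above.

```python
import re


def matches_literal(fragment, text):
    """Search for `fragment` at the end of `text`, treating it literally."""
    # re.escape turns "(1.1.1.1)" into "\(1\.1\.1\.1\)", so the parentheses
    # and dots are matched as plain characters.
    pattern = r"\s" + re.escape(fragment) + r"$"
    return re.search(pattern, text) is not None


def matches_unescaped(fragment, text):
    """Same search without escaping, shown only to illustrate the bug."""
    # Here "(1.1.1.1)" becomes a capture group in which each "." matches
    # any character; a lone ")" would raise re.error instead.
    pattern = r"\s" + fragment + r"$"
    return re.search(pattern, text) is not None
```

With the escaped version, "(1.1.1.1)" only matches that exact text. The unescaped version also accepts strings such as "1a1b1c1", and fails with re.error if the fragment contains an unbalanced parenthesis.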

Simpler download of databases and more robust COG2KO conversion

28 Dec 11:47

Much simpler download of databases

reCOGnizer relied on the --download-resources and --skip-downloaded parameters to set up its databases.

--download-resources instructed reCOGnizer to download the files required for its execution, and --skip-downloaded instructed it to skip files that had already been downloaded, e.g. when a single file had been removed by mistake.

Now, reCOGnizer relies on the recognizer_dwnl.timestamp file to check if databases have already been downloaded. If the file exists, it skips installation. If the file doesn't exist, reCOGnizer removes all existing files and downloads everything again.
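
The check can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the function names are hypothetical, and what reCOGnizer writes into the timestamp file is not stated here (the sketch stores the current date and time).

```python
import shutil
from datetime import datetime
from pathlib import Path

TIMESTAMP_NAME = "recognizer_dwnl.timestamp"


def prepare_resources_dir(resources_dir):
    """Return True if a fresh download is needed, False if it can be skipped.

    When a download is needed, the directory is emptied first so that no
    partially downloaded files are reused.
    """
    directory = Path(resources_dir)
    if (directory / TIMESTAMP_NAME).is_file():
        return False  # databases were already downloaded
    if directory.exists():
        for entry in directory.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    else:
        directory.mkdir(parents=True)
    return True


def mark_downloaded(resources_dir):
    """Write the timestamp file once every download has finished."""
    stamp = Path(resources_dir) / TIMESTAMP_NAME
    stamp.write_text(datetime.now().isoformat())
```

Writing the timestamp only after all downloads succeed means an interrupted download is retried from scratch on the next run.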

COG2KO conversion more reliable

Previously, reCOGnizer built the cog2ko conversion as a collection of all KOs available for each protein mapping to the specific COG.

Now, reCOGnizer uses an approach similar to the cog2ec conversion: it only assigns a KO to a COG when over half of the instances of that COG have that particular KO.

This yields a more reliable COG2KO conversion, while keeping KOs for a considerable number of COGs.
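
A rough sketch of the majority rule, assuming the input is one (COG, KO) pair per protein, with None where a protein has no KO. The function name and input layout are hypothetical, and the identifiers used when calling it would be whatever the source tables contain.

```python
from collections import Counter, defaultdict


def majority_cog2ko(pairs):
    """Map each COG to a KO only if over half of its proteins carry that KO.

    `pairs` is an iterable of (cog, ko) tuples, one per protein instance;
    `ko` may be None when the protein has no KO assigned.
    """
    ko_counts = defaultdict(Counter)  # cog -> Counter of KOs seen
    totals = Counter()                # cog -> number of protein instances
    for cog, ko in pairs:
        totals[cog] += 1
        if ko is not None:
            ko_counts[cog][ko] += 1

    cog2ko = {}
    for cog, counts in ko_counts.items():
        ko, n = counts.most_common(1)[0]
        if n > totals[cog] / 2:  # strict majority of all instances
            cog2ko[cog] = ko
    return cog2ko
```

A COG whose proteins are split evenly between two KOs gets no KO at all under this rule, which is the trade-off between reliability and coverage mentioned above.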

Also removes the intermediate ssv files outputted during construction of the cog2ko database.

New parameters --test-run and --output-rpsbproc-columns will usually not be needed

The --test-run parameter had to be implemented as a consequence of the simpler database download. When set, reCOGnizer runs in a non-standard mode, which is required for the tests on GitHub: it moves the cdd.tar.gz file available in the repo and uses it as a valid cdd.tar.gz file.

--output-rpsbproc-columns will output the Superfamilies, Sites and Motifs columns, which are usually empty for almost all annotations.

Removed some unnecessary files

recognizer.log was produced in the working directory. It only included rpsblast outputs, mainly for error assessment. Users can obtain that information by running reCOGnizer with the --debug parameter, and manually running the faulty commands.

taxonomy.rdf was obtained as part of building taxonomy.tsv. Now, reCOGnizer removes it once it is no longer needed.

Some fixes

reCOGnizer was not reporting the download of files when the --quiet flag was set, except when the files had already been downloaded, and it removed them.

Also updated regexes to the new r'regex' format.

Fixed KOG outputting

08 Nov 14:39

rpsbproc doesn't work with the KOG database.
reCOGnizer's KOG report is now made directly from BLAST 6.
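
As an illustration of reading BLAST tabular output (outfmt 6) directly, here is a small sketch. It assumes the default 12-column layout; the columns reCOGnizer actually requests are not stated here, and the function names are hypothetical.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def read_blast6(handle):
    """Parse tab-separated BLAST 6 lines into a list of typed dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip empty lines
        row = dict(zip(BLAST6_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def best_hit_per_query(rows):
    """Keep, for each query, the hit with the highest bitscore."""
    best = {}
    for row in rows:
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best
```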

Fix when only downloading resources

15 Sep 14:02

reCOGnizer wasn't properly checking whether the --file parameter had been provided. As a result, it still attempted to perform annotation and searched for annotation outputs when no --file argument was specified.

Now, it's working properly.

Custom databases workflow now multithreaded

29 Aug 14:27

Now works multithreaded

Removed -db parameter. Incorporated into -dbs.
--custom-database changed to --custom-databases to reflect this change.
Added input sanitization for custom/default databases. Either custom or default databases can be used, but not both at the same time.

Also some necessary changes on the tests

The latest miniconda image is not functional; pinned the version to 22.11.1.
Added test for custom-database-workflow.
Tests now run simultaneously, instead of one at a time.

Fixed several annoyances

20 Apr 12:10

No more need to confirm you don't want to gunzip downloaded resource files

If --skip-downloaded is set, reCOGnizer skips both the downloading and the gunzipping.

No more FutureWarning when trying to sum COGs

.sum(numeric_only=True) fixed that.

reCOGnizer is called without ".py"

18 Apr 10:18

Now called as "recognizer"

reCOGnizer was always called through the shell as recognizer.py. Now, it is called as recognizer.

Now removes intermediate folders

Unused directories (tmp, rpsbproc, etc.), whose files were already being removed, are now removed as well.

Also, several fixes

Fixed conversion COG2KO.
Fixed FutureWarning: replaced xlsx_report.save() with xlsx_report.close().

Updated documentation

Added a nice interactive krona plot.
Also corrected the parameters, and talked about the taxonomy thing.

Fix on outputting COG categories

03 Oct 16:36

Due to reformatting how reCOGnizer outputs information, its capacity for outputting COG categories was damaged.

It is fixed now.

Increase maximum SMPs per database

08 Aug 19:43

Set option -max_smp_vol 1000000 for the makeprofiledb command.

Context: the blast package had an update, and the makeprofiledb tool now outputs a separate database for every 1000 HMM profiles by default.
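
A sketch of how such a command could be assembled from Python. The helper name and the file names are hypothetical, and the exact set of options reCOGnizer passes besides -max_smp_vol is an assumption; -in, -title and -out are standard makeprofiledb options.

```python
def makeprofiledb_command(pn_file, out_prefix, max_smp_vol=1_000_000):
    """Build a makeprofiledb call that keeps all profiles in one volume."""
    return [
        "makeprofiledb",
        "-in", pn_file,          # PN file listing the SMP profiles
        "-title", out_prefix,
        "-out", out_prefix,
        # Raise the per-volume limit so the database is not split
        # into one volume per 1000 profiles.
        "-max_smp_vol", str(max_smp_vol),
    ]


# The list can be passed to subprocess.run(cmd, check=True) where
# makeprofiledb is installed; it is only built here, not executed.
cmd = makeprofiledb_command("example.pn", "example_db")
```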