
Contributing

Welcome and thank you for considering a contribution to AAanalysis! We are an open-source project focusing on interpretable protein prediction. Your involvement is invaluable to us. Contributions can be made in the following ways:

  • Filing bug reports or feature suggestions on our GitHub issue tracker.
  • Submitting improvements via Pull Requests.
  • Participating in project discussions.

Newcomers can start by tackling issues labeled good first issue. For further questions or suggestions, please email [email protected].

Objectives

  • Establish a comprehensive toolkit for interpretable, sequence-based protein prediction.
  • Enable robust learning from small and unbalanced datasets, common in life sciences.
  • Integrate seamlessly with machine learning and explainable AI libraries such as scikit-learn and SHAP.
  • Offer flexible interoperability with other Python packages like biopython.

Non-goals

  • Reimplementation of existing solutions.
  • Ignoring the biological context.
  • Reliance on opaque, black-box models.

Principles

  • Algorithms should be biologically inspired and combine empirical insights with cutting-edge computational methods.
  • We emphasize fair, accountable, and transparent machine learning, as detailed in Interpretable Machine Learning with Python.
  • We're committed to offering diverse evaluation metrics and interpretable visualizations, aiming to extend to other aspects of explainable AI such as causal inference.

Bug Reports

For effective bug reports, please include a Minimal Reproducible Example (MRE), as sketched after the list below:

  • Minimal: Include the least amount of code to demonstrate the issue.
  • Self-contained: Ensure all necessary data and imports are included.
  • Reproducible: Confirm the example reliably replicates the issue.

Further guidelines can be found here.
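
As a loose illustration, a self-contained MRE might look like the following sketch (the failing call is a placeholder; substitute the actual AAanalysis function and data that trigger your issue):

# Hypothetical MRE sketch: all imports and data are included, and only the
# single call that triggers the problem is shown (placeholder name).
import pandas as pd
import aaanalysis as aa

df_seq = pd.DataFrame({"entry": ["P1", "P2"],
                       "sequence": ["MKTLLLTLVV", "MKV"]})

result = aa.some_function(df_seq)  # placeholder: replace with the failing call
print(result)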

Latest Version

To install the latest development version using pip, execute the following:

pip install git+https://github.com/breimanntools/aaanalysis.git@master

Local Development Environment

Fork and Clone the Repository

  1. Fork the repository
  2. Clone your fork:
git clone https://github.com/YOUR_USERNAME/aaanalysis.git

Install Dependencies

Navigate to the project folder and set up the Python environment.

  1. Navigate to project folder:
cd aaanalysis

2a. Using conda for Environment Setup

Create and activate a new conda environment named 'venv', using Python 3.9:

conda create -n venv python=3.9
conda activate venv

2b. Using venv for Environment Setup

Alternatively, create and activate a virtual environment within the project folder using venv:

python -m venv venv
source venv/bin/activate  # Use `venv\Scripts\activate` on Windows

3a. Installing Dependencies with poetry

Install dependencies as defined in 'pyproject.toml' using poetry:

poetry install

3b. Installing Dependencies with pip

Alternatively, use pip to install dependencies from 'requirements.txt' and additional development requirements:

pip install -r requirements.txt
pip install -r docs/source/requirements_dev.txt

General Notes

  • Additional Requirement: Some non-Python utilities, such as Pandoc, might need to be installed separately.
  • Manage Dependencies: Ensure dependencies are updated as specified in 'pyproject.toml' or 'requirements.txt' after pulling updates from the repository.

Run Unit Tests

We use pytest for unit testing and hypothesis for property-based testing. To run tests for a specific directory or file:

pytest "Name of directory/file/code to be tested"

Running pytest on the tests/ directory executes all test cases. Check out our README on testing and see further useful commands in our Project Cheat Sheet.
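
For illustration, a minimal property-based test could look like the sketch below (the tested function is a stand-in, not part of AAanalysis):

# Minimal sketch of a pytest/hypothesis test for a stand-in function.
import pytest
from hypothesis import given, strategies as st

def normalize(values):
    """Stand-in for the function under test: scale values to sum to 1."""
    total = sum(values)
    return [v / total for v in values]

class TestNormalize:
    """Each test targets one aspect of the stand-in function."""

    @given(values=st.lists(st.floats(min_value=0.1, max_value=10), min_size=1))
    def test_values(self, values):
        result = normalize(values)
        assert len(result) == len(values)
        assert abs(sum(result) - 1.0) < 1e-6

    def test_invalid_input(self):
        with pytest.raises(TypeError):
            normalize(None)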

Pull Requests

For substantial changes, start by opening an issue for discussion. For minor changes like typos, submit a pull request directly.

Ensure your pull request:

  • Is focused and concise.
  • Has a descriptive and clear branch name like fix/data-loading-issue or doc/update-readme.
  • Is up-to-date with the master branch and passes all tests.

Preview Changes

To preview documentation changes in pull requests, follow the "docs/readthedocs.org" check link under "All checks have passed".

GitHub Push

Before pushing code changes to GitHub, test your changes and update any relevant documentation. It's recommended to work on a separate branch for your changes. Follow these steps for pushing to GitHub:

  1. Create a Branch: If not already done, create a new branch:

    git checkout -b your-branch-name
  2. Stage, Commit, and Push: Stage your changes, commit with a clear message, and push to the branch:

    git add .
    git commit -m "Describe your changes"
    git push origin your-branch-name
  3. Open a Pull Request: Visit the GitHub repository to create a pull request for your branch.

For more detailed instructions, see the official GitHub documentation.

Documentation is a crucial part of the project. If you make any modifications to the documentation, please ensure they render correctly.

Naming Conventions

We strive for consistency of our public interfaces with well-established libraries like scikit-learn, pandas, matplotlib, and seaborn.

Class Templates

We primarily use two class templates for organizing our codebase, as sketched below:

  • Wrapper: Designed to extend models from libraries like scikit-learn. These classes contain .fit and .eval methods for model training and evaluation, respectively.
  • Tool: Standalone classes that focus on specialized tasks, such as feature engineering for protein prediction. They feature .run and .eval methods to carry out the complete processing pipeline and generate various evaluation metrics.

The remaining classes fulfill two further purposes and are not implemented via direct class inheritance:

  • Data visualization: Supplementary plotting classes for Wrapper and Tool classes, named accordingly using a Plot suffix (e.g., 'CPPPlot'). These classes implement an .eval method to visualize the key evaluation measures.
  • Analysis support: Supportive pre-processing classes for Wrapper and Tool classes.
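
The following minimal sketch illustrates the shape of these templates (names and bodies are placeholders, not the actual AAanalysis classes):

# Placeholder sketch of the two class templates.
class ExampleWrapper:
    """Wrapper template: extends a model from, e.g., scikit-learn."""
    def fit(self, X, labels=None):
        # Train the underlying model on the feature matrix X
        return self

    def eval(self, X, labels=None):
        # Return evaluation metrics for the fitted model
        return {}

class ExampleTool:
    """Tool template: standalone class for a specialized task."""
    def run(self, df_seq):
        # Execute the complete processing pipeline
        return df_seq

    def eval(self, df_feat):
        # Return various evaluation metrics
        return {}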

Function and Method Naming

We closely, though not strictly, adhere to the naming conventions established by the aforementioned libraries. Functions and methods that process data values should correspond to the column names of our primary pd.DataFrame objects, as defined in aaanalysis/_utils/_utils_constants.py.
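
As a purely hypothetical example (the column name below is illustrative, not necessarily one of the actual constants), a function operating on a 'scale_id' column would expose a parameter of the same name:

# Hypothetical example: the parameter name mirrors a primary DataFrame column name.
import pandas as pd

def filter_scales(df_scales: pd.DataFrame, scale_id: str) -> pd.DataFrame:
    """Return the rows of 'df_scales' matching the given 'scale_id' (illustrative)."""
    return df_scales[df_scales["scale_id"] == scale_id]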

Code Philosophy

We aim for a modular, robust, and easily extendable codebase. Therefore, we adhere to flat class hierarchies (i.e., only inheriting from Wrapper or Tool is recommended) and functional programming principles, as outlined in A Philosophy of Software Design. Our goal is to provide a user-friendly public interface using concise descriptions and Python type hints (see PEP 484 or the Robust Python book). For the validation of user inputs, we use comprehensive checking functions with descriptive error messages.
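
For instance, a public function in this spirit might look like the sketch below (function and check names are illustrative, not taken from the codebase):

# Illustrative sketch of a type-hinted public function with input checking.
from typing import Optional
import pandas as pd

def check_df_seq(df_seq: pd.DataFrame) -> None:
    """Raise a descriptive error if the sequence DataFrame is invalid."""
    if not isinstance(df_seq, pd.DataFrame):
        raise ValueError(f"'df_seq' should be a DataFrame, not {type(df_seq).__name__}.")
    if "sequence" not in df_seq.columns:
        raise ValueError("'df_seq' must contain a 'sequence' column.")

def filter_seq(df_seq: pd.DataFrame, min_len: int = 1, max_len: Optional[int] = None) -> pd.DataFrame:
    """Filter sequences by length (illustrative example)."""
    check_df_seq(df_seq)
    mask = df_seq["sequence"].str.len() >= min_len
    if max_len is not None:
        mask &= df_seq["sequence"].str.len() <= max_len
    return df_seq[mask]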

Documentation Style

Documentation Layers

This project's documentation is organized across four distinct layers, each with a specific focus and level of detail:

  • Docstrings: Concise code description, with minimal usage examples and references to other layers (in 'See also').
  • Usage Principles: Bird's-eye view with background and key principles, reflected in selected code examples.
  • Tutorials: Close-up on the public interface, as a step-by-step guide to essential usage with medium detail.
  • Tables: Close-up on data or other tabular overviews, with detailed explanation of columns and critical values.

See the reference order diagram at docs/source/_artwork/diagrams/ref_order.png (exceptions prove the rule).

The API showcases Docstrings for our public objects and functions. Within these docstrings, scientific References may be mentioned in their extended sections. For additional links in docstrings, use the See Also section in this order: Usage Principles, Tables, Tutorials. Only include external library references when absolutely necessary. Note that the Usage Principles documentation may link directly to References, Tutorials, and Tables, which can also include links to References.
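
As a hypothetical illustration of this ordering (the function and link targets are placeholders), a docstring's See Also section could be structured like this:

# Placeholder docstring illustrating the See Also ordering.
def example_function(df_seq):
    """Concise one-line description of the function (placeholder).

    See Also
    --------
    * Usage Principles: link to the relevant usage principles page.
    * Tables: link to the relevant table overview.
    * Tutorials: link to the relevant tutorial.
    """
    return df_seq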

Building the Docs

To generate the documentation locally:

  • Go to the docs directory.
  • Run make html:

    cd docs
    make html
  • Open _build/html/index.html in a browser.

Building a New PyPI Package Version

To create a new version of the AAanalysis package for PyPI using Poetry, perform the following steps:

  1. Ensure Poetry is installed: Run pip install poetry if you haven't installed Poetry.
  2. Update Version: Update the version number (MAJOR.MINOR.PATCH) in the 'pyproject.toml' file, where:
    • MAJOR version is incremented for incompatible API changes,
    • MINOR version is incremented when functionality is added in a backward-compatible manner, and
    • PATCH version is incremented for backward-compatible bug fixes.
  3. Build the Package: Navigate to your project's root directory and execute poetry build to create the distribution files in the dist folder.
  4. Publish to PyPI: Upload the final version to PyPI with poetry publish. You will need to input your PyPI username and API token.
  5. Verify the Upload: Check that your package is correctly listed on PyPI, ensuring the information and files are accurate.

Unit Testing with ChatGPT

To optimize testing, use ChatGPT with the template below and fill in the blank spaces between START OF CODE and END OF CODE. Examples of testing templates can be found here: AAanalysis unit tests.

"
Generate test functions for a given TARGET FUNCTION using the style of the provided TESTING TEMPLATE. Please take your time to ensure thoroughness and accuracy.

Inputs:
TARGET FUNCTION:
- START OF CODE
-----------------------------------------
[your code here]
-----------------------------------------
- END OF CODE

TESTING TEMPLATE:
- START OF CODE
-----------------------------------------
[your code]
-----------------------------------------
- END OF CODE

**Key Directive**: For the Normal Cases Test Class, EACH function MUST test ONLY ONE individual parameter of the TARGET FUNCTION using Hypothesis for property-based testing. This is crucial.

Requirements:

1. Normal Cases Test Class:
- Name: 'Test[TARGET FUNCTION NAME]'.
- Objective: Test EACH parameter *INDIVIDUALLY*.
- Tests: Test EACH parameter, at least 10 positive and 10 negative tests for this class.

2. Complex Cases Test Class:
- Name: 'Test[TARGET FUNCTION NAME]Complex'.
- Objective: Test combinations of the TARGET FUNCTION parameters.
- Tests: At least 5 positive and 5 negative tests that intricately challenge the TARGET FUNCTION.

3. General Guidelines:
- Use Hypothesis for property-based testing, but test parameters individually for the Normal Cases Test Class.
- Tests should be clear, concise, and non-redundant.
- Code must be complete, without placeholders like 'TODO', 'Fill this', or 'Add ...'.
- Explain potential issues in the TARGET FUNCTION.

Output Expectations:
- Two test classes: one for normal cases (individual parameters) and one for complex cases (combinations).
- In Normal Cases, one function = one parameter tested.
- Aim for at least 30 unique tests, totaling 150+ lines of code.

Reminder: In Normal Cases, it's crucial to test parameters individually. Take your time and carefully create the Python code for all cases!
"

ChatGPT has a token limit, which may truncate responses. To continue, simply ask 'continue processing' or something similar. Repeat as necessary and compile the results.

We recommend the following workflow:

  1. Repeat the prompt in new ChatGPT sessions until most of the positive test cases are covered.
  2. Adjust the testing script manually such that all positive tests are passed.
  3. Continue in the same session, sharing the revised script, and request the creation of negative tests.
  4. Finally, provide the complete testing script, including positive and negative cases, and request the development of complex test cases.

Test Guided Development (TGD)

Leverage ChatGPT to generate testing scripts and refine your code's functionality and its interface. If ChatGPT struggles or produces erroneous tests, it often indicates ambiguities or complexities in your function's logic, variable naming, or documentation gaps, especially regarding edge cases. Address these insights to ensure intuitive and robust code design through the TGD approach.

Essential Strategies for Effective TGD:

  • Isolated Functionality Testing: Test one function or method at a time, adhering to unit testing principles. Provide the entire, well-documented function; the better the docstring, the more comprehensive the automatically generated tests will be.
  • Isolated Test Sessions: Start each test scenario in a new ChatGPT session to maintain clarity and prevent context overlap, ensuring focused and relevant test generation.
  • Consistent Template Usage: Align your test creation with existing templates for similar functionalities, utilizing them as a structured guide to maintain consistency in your test design.
  • Initial Test Baseline: Aim for an initial set of tests where about 25% pass, providing a foundational baseline that identifies primary areas for iterative improvement in both tests and code.
  • Iterative Refinement and Simplification: Use ChatGPT-generated tests to iteratively refine your code, especially if repeated test failures indicate areas needing clarification or simplification in your function's design.

Through an iterative TGD process, you can systematically uncover and address any subtleties or complexities in your code, paving the way for a more robust and user-friendly application.