GitXplorerGitXplorer
t

rnd-it-molbio-fidelity-umi

public
4 stars
3 forks
0 issues

Commits

List of commits on branch master.
Unverified
afad41a2bc70556282abc47a32123a7d38379020

Update readme, contributing.

KKarolisM committed 4 years ago
Unverified
4783e6113e68483091e70b578d83f3b7a7ca37a3

Add disclaimer.

KKarolisM committed 4 years ago
Unverified
b4658ad8b5ae124ae3cf75c2fc670da0acc28899

Add license.

KKarolisM committed 4 years ago
Unverified
1becc0a6150e741da0d59c8b032f4c2bbc2a9c1c

Merge branch 'master' of github.com:thermofisher/rnd-it-molbio-fidelity-umi

KKarolisM committed 4 years ago
Unverified
492b74fa62c781b42b8f450be57c75f739084ac3

Fix <README>: spell.

KKarolisM committed 4 years ago
Unverified
a856d11ad99cea086f74c6498c64d761a55efcc1

Fix <README>: spell.

KKarolisM committed 4 years ago

README

The README file for this repository.

MuA-based Molecular Indexing for Rare Mutation Detection by Next-Generation Sequencing

Disclaimer

This repository is intended to be used as a supplement for scholarly publication.
As such, this repository is made available “as-is”. Maintenance and support for this repository is not available at this time.

Description

This code was used to generate MuA-based Molecular Indexing for Rare Mutation Detection by Next-Generation Sequencing publication data.
Workflow is designed to evaluate and calculate true errors based on either on Unique Molecular Identifier (UMI) or mapping duplicates clusters.
This workflow is written in snakemake workflow manager.
Error calculation package is written in julia programming language.
Workflow also uses:

  1. umi-tools to extract and process UMI sequence information.

  2. bwa-mem algorithm for short pair end reads alignment to the reference.

  3. sambamba and samtools for aligned data sorting and filtering respectively.

DAG of a UMI-based errors calculation workflow rules: main workflow

Workflow setup

  1. Install conda:

    wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh &&
    chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh &&
    ./miniconda/Miniconda3-latest-Linux-x86_64.sh
  2. Add path of miniconda to .bashrc if not selected option to add automatically during installation:

    cd ~/ && pth=$(pwd) &&
    echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc
  3. Add channels to conda:

    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda config --add channels anaconda
  4. Install all by running:

    bash envs/install_all.sh -n <environment_name>
  5. To delete environment run:

    conda env remove -n <environment_name> -p <environment_path>

Usage

  1. To run custom analysis first configure config.yam file.

  2. Activate created environment:

    source activate <environment_name>
  3. Run snakemake:

    snakemake -j 24 --configfile config.yaml --use-conda

    You can provide rule names for snakemake command as follows: snakemake --configfile config.yaml --use-conda -j 24 trim download umi_fidelity

Testing

  1. To run all tests:

    bash tests/run_tests.sh --all --name test_env

    New conda env will be created and after tests removed.

  2. For further options check help:

    bash tests/run_tests.sh --help

Configuration file explained

Workflow checks if must have parameters were provided. All other if not provided will be used from defaul.yaml file.
Thus your config file must contain at least these parameters:

MAIN:
    INPUT_DIR:
    SAMPLES_ID:
    TMP_DIR:
    OUT_DIR:
    LANE_R:
    REFERENCE:

Configuration file in details

  1. Provide an existing path to your sequencing data.

    MAIN:
        INPUT_DIR: /some/dir/to/seq/data
  2. For workflow to be able to find your sample sequencig files provide prefix / sample name of sequencing file you want to analyse.

    MAIN:
        SAMPLES_ID: sample_1 sample_2  # sample_1_S1_L001_R1_001.fastq.gz

    Omit Illumina sequencing file naming suffix. Workflow searches for either *fastq.gz, *fq.gz or *.sfq files.

  3. Provide regex to be able to find files split by lanes:

    MAIN:
        LANE_RE: "L\\d\\d\\d_"  # This would translate to L001 lane number for example.

    Illumina sequencing might split data by separate lanes. It is usually 'L\d' or 'L\d\d\d' where '\d' is an integer.
    All files which do match sample id, lane, and read direction will be mergeg to generate single sequencing file which is not split by lanes.

  4. Default trimming program is bbduk. All parameter names listed in configuration file are exactly the same as it would be calling it from CLI.

  5. To extract UMI from raw reads data you need to provide UMI sequence structure:

    UMI_TOOLS:
        umi_structure: NNNNNNNNNNNNNNNN

    To extract UMI from the second paired-end read you need to pass --bc-pattern2 parameter for umi-tools

    UMI_TOOLS:
        additional_extract_params: " --bc-pattern2=NNNNNNNNNNNNNNNN "

    You can also pass other umi-tools supported options here as a tring separated by spaces.

  6. Available parameters for error calculations:

    FIDELITY:
    1. Filter reads by alignment quality:

      alignment_quality: 60
    2. Filter bases by base quality:

      base_quality: 30          # min quality for mutation to be considered in read. [30]
    3. Define minimum cluster size (cluster smaller in size will be discarded from the further analysis):

      cluster_size: 3

      If naive algorithm is selected then this filter option is omitted.
      Cluster in this case is defined either by umi sequences + mapping position (umi algorithm)
      or just mapping position (position algorithm)

    4. Setting allowed maximum number of errors detected in read:

      max_mutations_per_read: 3
    5. If true error is found between many clusters it might indicate that reference sequence is not correct.
      To omit such errors provide upper threshold:

      referece_mutation: 0.95   # how many same mutations at position to consider as reference mistake.

      If same error was found in 95% of umi cluster cases, consider it as a reference assembly mistake.

    6. These two options should not be chaged if using umi-tools for UMI sequences extraction and grouping:

      umi_split_string: "_"
      umi_split_position: 2     # position of umi after spliting readname by umi string. 1 based.
    7. You can translate nucleotide mutations by providing these parameters:

      genes_csv: Some_File.csv
      genetic_code: 11

      In that case you need to pass a csv file with 'Chromosome,Start,End,Gene,Strand'. Coordinates are one based and inclusive.
      With a third option you can tell which table to use for translation (BioSequences.ncbi_trans_table).

    8. You can also pass additional script parameters:

      additional_params: "--algorithm umi"

      Supported algorithms: naive, position, umi, paired. More can be check with CLI --help option.

    9. To report mutations provide:

      report_mutations: True

      All found mutations will be reported to a csv file.

    10. You can also apply additional filtering on mutations reporting:

      min_cluster_counts: 1     # Minimum count of clusters for the mutation to be reported to a single_mutations file.
      min_mutation_coverage: 1  # Minimum confident coverage for the mutation to be reported to a single_mutations file.

License

MIT