iDES: Software and Documentation


Overview

iDES (Integrated Digital Error Suppression) is a method for the suppression of stereotypical background artifacts in high-throughput sequencing data. This page provides Perl implementations and documentation for input file conversion, background database construction, background polishing, and quality control statistics.


Steps:

  1. Format input files (required)
  2. Create background database (required)
  3. Perform background polishing
  4. Background error report (optional)

Usage

1. Convert input BAM files to frequency (FREQ) file format.

Command

perl ides-bam2freq.pl [options] MySample.bam(s) genome.fa targets.bed


Input


  1. MySample.bam(s): Single position-sorted BAM file or directory of position-sorted BAM files consisting of paired-end reads (with or without de-duplication). BAM index (BAI) file(s) should be present in the same directory.

  2. genome.fa: Reference genome in FASTA format (e.g., hg19.fa).

  3. targets.bed: Standard 3-column BED file (chr, start, end) used to restrict FREQ file(s) to genomic regions of interest.
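
Because each BAM file needs its index in the same directory, a quick check before conversion can save a failed run. The following is a minimal shell sketch and is not part of iDES: the function name missing_bai is hypothetical, and it assumes an index is named either X.bam.bai (the name written by samtools index) or X.bai.

```shell
#!/bin/sh
# Hypothetical helper (not part of iDES): list BAM files in a directory
# that lack an index. Assumes the index is named X.bam.bai or X.bai.
missing_bai() {
    dir=$1
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue            # glob matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}

# Demo on empty placeholder files; real BAMs would be indexed with: samtools index X.bam
demo=$(mktemp -d)
touch "$demo/a.bam" "$demo/a.bam.bai" "$demo/b.bam" "$demo/b.bai" "$demo/c.bam"
missing_bai "$demo"                          # prints only the path ending in c.bam
rm -rf "$demo"
```

Any path printed by the function points to a BAM file that should be indexed before running ides-bam2freq.pl.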

Options (defaults in parentheses)

  • -o <dir>   Output directory (same as input BAM file directory).
  • -q <int>   Phred quality filter (30).
  • -t <int>   Number of CPUs to use when processing more than one input BAM (1).
  • -a         Disable requirement for properly paired reads.
  • -b         Disable mpileup base alignment quality adjustment.


    Output


  • MySample.[paired/allreads].Q(0-40+).freq.txt = Single FREQ file created for each input BAM file.

    Field          Description
    Chr            Chromosome
    POS            Genomic position
    DEPTH          Total depth/reads
    REF            Reference allele
    R(+/-)         Number of +/- strand reads supporting reference allele
    A/C/T/G(+/-)   Number of +/- strand reads supporting alternate alleles


    2. Create nucleotide substitution background database.

    Command

    perl ides-makedb.pl [options] dir


    Input


    1. dir: Directory of source FREQ files for the background database. Data without duplicate removal appear to provide a better substrate for learning background distributions, but de-duplicated data can also be modeled.

    Options (defaults in parentheses)

  • -o <dir>     Output directory (same as input FREQ file directory).
  • -m <0-100>   Maximum allele frequency for training database (100).
  • -d <int>     Minimum total depth required for each genomic position (20).
  • -n <int>     Minimum number of input samples required (6).
  • -a <str>     Add custom name to background database.
  • -p           Print Weibull parameter estimation errors.
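
Since database construction requires a minimum number of input samples (6 by default), it can help to count the FREQ files in the source directory first. The sketch below is illustrative only and is not part of iDES: the function name count_freq is hypothetical, and it assumes FREQ files are regular files whose names contain ".freq" and end in ".txt" (note that this pattern would also match previously polished ".rmbg.txt" files).

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of iDES): count candidate FREQ files.
# Assumes FREQ file names contain ".freq" and end in ".txt".
count_freq() {
    n=0
    for f in "$1"/*.freq*.txt; do
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

min_samples=6                                # documented default of -n
demo=$(mktemp -d)
for s in 1 2 3 4 5; do touch "$demo/S$s.freq.paired.Q30.txt"; done
n=$(count_freq "$demo")
if [ "$n" -lt "$min_samples" ]; then
    echo "only $n FREQ files; need at least $min_samples"
else
    echo "perl ides-makedb.pl -n $min_samples $demo"
fi
rm -rf "$demo"
```

With the five placeholder files above, the demo reports that the directory falls short of the default minimum rather than printing the ides-makedb.pl command.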


    Output


  • ides-bgdb.txt = Background database.

    Field             Description
    Chr               Chromosome
    Pos               Position
    Ref               Reference allele
    Var               Variant (alternate) allele
    NumPosSamples     Number of samples harboring a given variant allele
    TotalSamples      Total number of samples analyzed for a given variant allele
    FracSamples       Fraction of samples harboring a given variant allele
    FracBothStrands   Fraction of samples harboring a given variant allele with dual (+/-) strand support
    MeanReads         Mean number of reads supporting a given variant across all evaluable input samples
    MedianReads       Median number of reads supporting a given variant across all evaluable input samples
    StdReads          Standard deviation of reads supporting a given variant across all evaluable input samples
    MeanAF            Mean allele frequency of a given variant, calculated across all evaluable input samples
    MedianAF          Median allele frequency of a given variant, calculated across all evaluable input samples
    StdAF             Standard deviation of the allele frequency of a given variant, calculated across all evaluable input samples
    W_Shape           Estimated shape parameter of Weibull distribution
    W_Scale           Estimated scale parameter of Weibull distribution
    W_Corr            Correlation between Weibull distribution and observed non-zero allele fractions within a QQ-plot
    W_Pval            P-value corresponding to W_Corr


    3. Perform background polishing.

    Command

    perl ides-polishbg.pl [options] MySample.freq(s) ides-bgdb.txt


    Input


    1. MySample.freq(s): Single FREQ file or directory of FREQ files.

    2. ides-bgdb.txt: Background database.

    Options (defaults in parentheses)

  • -o <dir>     Output directory (same as input FREQ file directory).
  • -f <0-1>     Minimum fraction of non-zero background samples needed for polishing (0.2).
  • -n <int>     Minimum number of non-zero background samples needed for polishing (2).
  • -m <int>     Minimum number of total background samples (4).
  • -w <int>     Minimum number of non-zero samples needed for Weibull modeling (5).
  • -a <0-100>   Maximum allele frequency cutoff for polishing (5).
  • -r <int>     Maximum number of supporting reads cutoff for polishing (10).
  • -p <0-1>     Nominal p-value threshold for background polishing (0.05).
  • -t <int>     Number of CPUs to use when processing more than one input FREQ file (1).
  • -b           Do background polishing on previously polished FREQ file(s).
  • -d <dir>     Directory of duplex-supported FREQ files (note: file names may contain a ".duplex." tag, but must otherwise match input FREQ file names to ensure proper pairing).
  • -d <file>    Matching duplex-supported FREQ file.
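
To get a feel for how the sample-count options relate to the background database, the sketch below lists database entries that would pass thresholds resembling the -f, -n, and -m defaults. This is one plausible reading of those options, not code taken from ides-polishbg.pl. The function name eligible_rows is hypothetical, and the sketch assumes the database is tab-delimited with a header row that uses the documented field names (FracSamples, NumPosSamples, TotalSamples, and so on).

```shell
#!/bin/sh
# Hypothetical helper (not part of iDES): print database rows meeting
# sample-count thresholds modeled on the -f, -n and -m defaults.
# Assumes a tab-delimited file whose header row uses the documented field names.
# Usage: eligible_rows FILE [min_frac] [min_nonzero] [min_total]
eligible_rows() {
    awk -F '\t' -v minfrac="${2:-0.2}" -v minpos="${3:-2}" -v mintot="${4:-4}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # field name -> column index
            next
        }
        $col["FracSamples"] + 0 >= minfrac &&
        $col["NumPosSamples"] + 0 >= minpos &&
        $col["TotalSamples"] + 0 >= mintot {
            printf "%s\t%s\t%s>%s\n", $col["Chr"], $col["Pos"], $col["Ref"], $col["Var"]
        }
    ' "$1"
}

# Toy database with only the columns needed here (values are made up).
demo=$(mktemp)
printf 'Chr\tPos\tRef\tVar\tNumPosSamples\tTotalSamples\tFracSamples\n' > "$demo"
printf 'chr1\t100\tC\tT\t3\t10\t0.3\n' >> "$demo"
printf 'chr1\t200\tG\tA\t1\t10\t0.1\n' >> "$demo"
eligible_rows "$demo"        # only the chr1:100 C>T row passes
rm -f "$demo"
```

Looking columns up by header name rather than by position keeps the sketch usable if the column order differs from the order in the field list.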


    Output


  • MySample.[paired/allreads].Q(0-40+).freq.rmbg.txt = Background polished FREQ file(s).

    Field          Description
    Chr            Chromosome
    POS            Genomic position
    DEPTH          Total depth/reads
    REF            Reference allele
    R(+/-)         Number of +/- strand reads supporting reference allele
    A/C/T/G(+/-)   Number of +/- strand reads supporting alternate alleles


    4. Generate background error report.

    Command

    perl ides-bgreport.pl [options] MySample.freq


    Input


    1. MySample.freq: Single FREQ file (with or without background polishing).

    Output


  • STDOUT = Background statistics.

    Field                                Description
    No. positions                        Number of genomic positions analyzed
    No. positions without errors         Number of error-free genomic positions analyzed
    Percent positions without errors     Percentage of error-free genomic positions
    No. bases sequenced                  Total number of bases analyzed
    No. errors                           Total number of reads supporting non-reference alleles (i.e., errors) analyzed
    Percent errors                       Global error rate (errors per base)
    Subst., Positions, Errors, %Errors   Base substitution type; number of positions with that substitution; number of reads (i.e., errors) supporting that substitution; percentage of all errors due to that substitution
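
As a worked illustration of the "Percent errors" statistic, the sketch below recomputes a global error rate as 100 × (non-reference reads) / (bases sequenced). It is not the ides-bgreport.pl implementation. The function name error_rate is hypothetical, and the sketch assumes a tab-delimited FREQ file with one header row, columns in the order Chr, POS, DEPTH, REF, R+, R-, A+, A-, C+, C-, T+, T-, G+, G-, with reference-supporting reads counted only in R+/R- so that columns 7 through 14 hold non-reference reads.

```shell
#!/bin/sh
# Hypothetical re-computation (not the ides-bgreport.pl code) of a global error rate.
# Assumed layout: tab-delimited, one header row, columns
# Chr POS DEPTH REF R+ R- A+ A- C+ C- T+ T- G+ G-, with reference reads
# counted only in R+/R- (so columns 7-14 hold non-reference reads).
error_rate() {
    awk -F '\t' '
        NR > 1 {
            bases += $3                              # DEPTH summed over positions
            for (i = 7; i <= 14; i++) errors += $i   # non-reference reads
        }
        END {
            if (bases > 0) printf "%.4f\n", 100 * errors / bases
            else print "NA"
        }
    ' "$1"
}

# Toy data (made up): 1 non-reference read among 2000 sequenced bases.
demo=$(mktemp)
printf 'Chr\tPOS\tDEPTH\tREF\tR+\tR-\tA+\tA-\tC+\tC-\tT+\tT-\tG+\tG-\n' > "$demo"
printf 'chr1\t100\t1000\tC\t500\t499\t0\t0\t0\t0\t1\t0\t0\t0\n' >> "$demo"
printf 'chr1\t101\t1000\tA\t500\t500\t0\t0\t0\t0\t0\t0\t0\t0\n' >> "$demo"
error_rate "$demo"           # prints 0.0500 (percent)
rm -f "$demo"
```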

    Download

    Please note that you will need to accept the terms of the license in order to download iDES. The current release is version 1.1. After downloading, unzip the contents (iDES Perl code and license) into the same directory.

    Release Notes

  • 1.1: Latest release.

    Requirements

  • Unix operating system (Linux, Mac OS X, etc.)

  • Reference genome in FASTA format (e.g., hg38.fa). Genomes distributed in 2BIT format (e.g., hg19.2bit) must first be converted to FASTA.

    To extract a FASTA file from a 2BIT file, download the twoBitToFa utility for your operating system, run it without arguments, and follow the usage instructions.

  • R (tested with version 3+), with the following external dependency: fitdistrplus

    To install 'fitdistrplus' from R terminal:

    install.packages('fitdistrplus')

  • Perl 5, with the following external dependencies: Statistics::Descriptive, Statistics::R, Proc::Fork.

    To install from CPAN, issue the following command (e.g., Statistics::Descriptive):

    sudo cpan Statistics::Descriptive

    The other Perl dependencies are Perl 5 core modules and should already be installed: Getopt::Std, List::Util, Cwd, File::Basename, File::Spec, POSIX.

  • SAMtools 0.1.20+

    After compiling, either run 'make install' or locate the samtools executable and copy, link, or move it into a directory on your PATH (e.g., /usr/bin).
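
The requirements above can be checked in one pass before running any iDES script. The following shell sketch is not shipped with iDES; the helper names (have_cmd, have_perl_module, have_r_package) are hypothetical, and it assumes that perl, Rscript, and samtools are expected on the PATH.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with iDES).

# True if the named executable is found on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# True if the named Perl module loads.
have_perl_module() { perl -M"$1" -e 1 >/dev/null 2>&1; }

# True if the named R package loads.
have_r_package() { Rscript -e "library($1)" >/dev/null 2>&1; }

for c in perl Rscript samtools; do
    have_cmd "$c" || echo "missing executable: $c"
done
for m in Statistics::Descriptive Statistics::R Proc::Fork; do
    have_perl_module "$m" || echo "missing Perl module: $m"
done
have_r_package fitdistrplus || echo "missing R package: fitdistrplus"
```

The script prints one line per missing dependency and nothing when everything is found. It does not check the SAMtools version.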


    Example

    Perform background polishing on pre- and post-barcode-collapsed FREQ files.

    1. Download background database and example FREQ files from Newman et al. (2016).
    2. Move tarball (48MB) into working directory and unpack:

      tar -zxvf example.tar.gz

      Contents:

      1. ides-bgdb.txt
      2. Sample1.non-deduped.freq.paired.Q30.txt
      3. Sample1.barcode-deduped.freq.paired.Q30.txt
      4. Sample2.non-deduped.freq.paired.Q30.txt
      5. Sample2.barcode-deduped.freq.paired.Q30.txt

    3. Create input directory and move FREQ files.

      mkdir input
      mv *Q30.txt input

    4. Create output directory.

      mkdir output

    5. Run background polishing (make sure all dependencies are installed first).

      perl ides-polishbg.pl -o output -t 4 input ides-bgdb.txt

    6. Collect error statistics.

      for i in input/*Q30.txt; do echo "$i" && perl ides-bgreport.pl "$i"; done

      for i in output/*Q30.rmbg.txt; do echo "$i" && perl ides-bgreport.pl "$i"; done

    7. Expected output (MD5)

      File                      Error rate (%)   Error rate (%)
                                pre-polishing    post-polishing
      Sample1.non-deduped       0.025            0.009
      Sample2.non-deduped       0.024            0.0076
      Sample1.barcode-deduped   0.011            0.002
      Sample2.barcode-deduped   0.0097           0.0015

    FAQ

    1. Under construction

    Reference

    Newman AM*, Lovejoy AF*, Klass DM*, Kurtz DM, Chabon JJ, Scherer F, Stehr H, Liu CL, Bratman SV, Say C, Zhou L, Carter JN, West RB, Sledge GW, Shrager JB, Loo BW Jr, Neal JW, Wakelee HA, Diehn M#, Alizadeh AA# (2016) Integrated digital error suppression for improved detection of circulating tumor DNA. Nature Biotechnology 34:547-555.


    Funding

    This work was supported by grants from the Department of Defense (A.M.N., M.D., A.A.A.), the National Cancer Institute (A.M.N., 1K99CA187192-01A1; M.D., A.A.A., R01CA188298), the US National Institutes of Health Director’s New Innovator Award Program (M.D., 1-DP2-CA186569), the Ludwig Institute for Cancer Research (M.D., A.A.A.), the CRK Faculty Scholar Fund (M.D.), V-Foundation (A.A.A.), Damon Runyon Cancer Research Foundation (A.A.A.) and a grant from both the Siebel Stem Cell Institute and the Thomas and Stacey Siebel Foundation (A.M.N.).



    Questions/Comments

    Contact us