iDES: Software and Documentation

Overview

iDES (Integrated Digital Error Suppression) is a method for the suppression of stereotypical background artifacts in high-throughput sequencing data. This page provides Perl implementations and documentation for input file conversion, background database construction, background polishing, and quality control statistics.

Steps:

1. Format input files (required)
2. Create background database (required)
3. Perform background polishing
4. Background error report (optional)

Usage

1. Convert input BAM files to frequency (FREQ) file format.

Command

perl ides-bam2freq.pl [options] MySample.bam(s) genome.fa targets.bed

Input

MySample.bam(s): Single position-sorted BAM file or directory of position-sorted BAM files consisting of paired-end reads (with or without de-duplication). BAM index (BAI) file(s) should be present in the same directory.

genome.fa is a reference genome in FASTA format (e.g., hg19.fa).

targets.bed is a standard 3-column BED (chr start end) used to restrict FREQ file(s) to genomic regions of interest.

Options (defaults in parentheses)

-o <dir>	Output directory (same as input BAM file directory).
-q <int>	Phred quality filter (30).
-t <int>	Number of CPUs to use for processing >1 input BAM (1).
-a	Disable requirement for properly paired reads.
-b	Disable mpileup base alignment quality adjustment.

Output

MySample.[paired/allreads].Q(0-40+).freq.txt = Single FREQ file created for each input BAM file.

Field	Description
Chr	Chromosome
POS	Genomic position
DEPTH	Total depth/reads
REF	Reference allele
R(+/-)	Number of +/- strand reads supporting reference allele
A/C/T/G(+/-)	Number of +/- strand reads supporting alternate alleles
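The FREQ layout described above can be read with a few lines of Python. This is a minimal sketch, not part of iDES: it assumes tab-separated columns in the order Chr, POS, DEPTH, REF, R+, R-, A+, A-, C+, C-, T+, T-, G+, G- (the exact column order and header names are assumptions; check them against a FREQ file from your own run). The names `FreqRow`, `parse_freq_line`, and `alt_allele_percent` are hypothetical.

```python
from typing import Dict, List, NamedTuple

# ASSUMPTION: tab-separated columns in this order. Verify against the header
# line of a FREQ file produced by your own run before relying on it.
ASSUMED_COLUMNS: List[str] = [
    "Chr", "POS", "DEPTH", "REF",
    "R+", "R-", "A+", "A-", "C+", "C-", "T+", "T-", "G+", "G-",
]


class FreqRow(NamedTuple):
    chrom: str
    pos: int
    depth: int
    ref: str
    counts: Dict[str, int]  # column name (e.g. "C+") -> read count


def parse_freq_line(line: str) -> FreqRow:
    """Split one data line of a FREQ file into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ASSUMED_COLUMNS):
        raise ValueError(
            f"expected {len(ASSUMED_COLUMNS)} columns, got {len(fields)}"
        )
    rec = dict(zip(ASSUMED_COLUMNS, fields))
    counts = {name: int(rec[name]) for name in ASSUMED_COLUMNS[4:]}
    return FreqRow(rec["Chr"], int(rec["POS"]), int(rec["DEPTH"]), rec["REF"], counts)


def alt_allele_percent(row: FreqRow) -> Dict[str, float]:
    """Percent allele frequency of each non-reference base, both strands combined."""
    if row.depth == 0:
        return {}
    return {
        base: 100.0 * (row.counts[base + "+"] + row.counts[base + "-"]) / row.depth
        for base in "ACTG"
        if base != row.ref
    }


# Made-up example line, not taken from real data.
example = "chr1\t12345\t1000\tA\t600\t398\t0\t0\t1\t0\t0\t0\t1\t0"
row = parse_freq_line(example)
print(alt_allele_percent(row))
```

Keeping the per-strand counts separate preserves the +/- strand information that the FREQ format records for every allele.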

2. Create nucleotide substitution background database.

Command

perl ides-makedb.pl [options] dir

Input

dir: Directory of source FREQ files for background database. While pre-duplication removal data appear to provide a better substrate for learning background distributions, de-duplicated data can also be modeled.

Options (defaults in parentheses)

-o <dir>	Output directory (same as input FREQ file directory).
-m <0-100>	Maximum allele frequency for training database (100).
-d <int>	Minimum total depth required for each genomic position (20).
-n <int>	Minimum number of input samples required (6).
-a <str>	Add custom name to background database.
-p	Print Weibull parameter estimation errors.

Output

ides-bgdb.txt = background database.

Field	Description
Chr	Chromosome
Pos	Position
Ref	Reference allele
Var	Variant (alternate) allele
NumPosSamples	Number of samples harboring a given variant allele
TotalSamples	Total number of samples analyzed for a given variant allele
FracSamples	Fraction of samples harboring a given variant allele
FracBothStrands	Fraction of samples harboring a given variant allele with dual (+/-) strand support
MeanReads	Mean number of reads supporting a given variant across all evaluable input samples
MedianReads	Median number of reads supporting a given variant across all evaluable input samples
StdReads	Standard deviation of reads supporting a given variant across all evaluable input samples
MeanAF	Mean allele frequency of a given variant, calculated across all evaluable input samples
MedianAF	Median allele frequency of a given variant, calculated across all evaluable input samples
StdAF	Standard deviation allele frequency of a given variant, calculated across all evaluable input samples
W_Shape	Estimated shape parameter of Weibull distribution
W_Scale	Estimated scale parameter of Weibull distribution
W_Corr	Correlation between Weibull distribution and observed non-zero allele fractions within a QQ-plot
W_Pval	P-value corresponding to W_Corr
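To give a feel for what W_Shape and W_Scale describe, the sketch below evaluates the upper-tail probability of a two-parameter Weibull distribution, P(X >= x) = exp(-(x/scale)^shape). It illustrates the distribution only; whether ides-polishbg.pl computes its p-values in exactly this way is not stated here, and the parameter values and the function name `weibull_upper_tail` are hypothetical.

```python
import math


def weibull_upper_tail(x: float, shape: float, scale: float) -> float:
    """P(X >= x) for a Weibull(shape, scale) distribution: exp(-(x/scale)**shape)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    if x <= 0:
        return 1.0  # a Weibull variable is non-negative
    return math.exp(-((x / scale) ** shape))


# Hypothetical parameter values, chosen only for illustration. The allele
# fraction must be in the same units that were used to fit the parameters.
w_shape, w_scale = 1.5, 0.02
for af in (0.01, 0.02, 0.1):
    print(af, weibull_upper_tail(af, w_shape, w_scale))
```

The further an observed allele fraction lies above the fitted scale, the smaller the tail probability, i.e. the less the observation resembles the modeled background.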

3. Perform background polishing.

Command

perl ides-polishbg.pl [options] MySample.freq(s) ides-bgdb.txt

Input

MySample.freq(s): Single FREQ file or directory of FREQ files.

ides-bgdb.txt: Background database.

Options (defaults in parentheses)

-o <dir>	Output directory (same as input FREQ file directory).
-f <0-1>	Minimum fraction of non-zero background samples needed for polishing (0.2).
-n <int>	Minimum number of non-zero background samples needed for polishing (2).
-m <int>	Minimum number of total background samples (4).
-w <int>	Minimum number of non-zero samples needed for Weibull modeling (5).
-a <0-100>	Maximum allele frequency cutoff for polishing (5).
-r <int>	Maximum number of supporting reads cutoff for polishing (10).
-p <0-1>	Nominal p-value threshold for background polishing (0.05).
-t <int>	Number of CPUs to use for processing >1 input FREQ file (1).
-b	Do background polishing on previously polished FREQ file(s).
-d <dir>	Directory of duplex-supported FREQ files (note: file names may contain ".duplex." tag, but must otherwise match input FREQ file names to ensure proper pairing).
-d <file>	Matching duplex-supported FREQ file.
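The naming rule for duplex-supported FREQ files can be checked before a run. The following is a rough sketch, assuming the ".duplex." tag is removed by replacing it with a single "." and that the rest of the file name must then match exactly; the helper names and the example file names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Optional

DUPLEX_TAG = ".duplex."


def strip_duplex_tag(name: str) -> str:
    """Remove the first ".duplex." tag so the name can be compared with an input name."""
    return name.replace(DUPLEX_TAG, ".", 1)


def pair_duplex_files(
    input_names: List[str], duplex_names: List[str]
) -> Dict[str, Optional[str]]:
    """Map each input FREQ file to its duplex counterpart, or None if none matches."""
    by_key = {strip_duplex_tag(Path(d).name): d for d in duplex_names}
    return {i: by_key.get(Path(i).name) for i in input_names}


# Hypothetical file names for illustration.
inputs = ["input/Sample1.freq.paired.Q30.txt", "input/Sample2.freq.paired.Q30.txt"]
duplex = ["duplex/Sample1.duplex.freq.paired.Q30.txt"]
print(pair_duplex_files(inputs, duplex))
```

Any input file mapped to None has no duplex partner under this reading of the rule and is worth renaming before polishing.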

Output

MySample.[paired/allreads].Q(0-40+).freq.rmbg.txt = Background polished FREQ file(s).

Field	Description
Chr	Chromosome
POS	Genomic position
DEPTH	Total depth/reads
REF	Reference allele
R(+/-)	Number of +/- strand reads supporting reference allele
A/C/T/G(+/-)	Number of +/- strand reads supporting alternate alleles
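One possible reading of the step 3 thresholds (-f, -n, -m, -a, -r) is sketched below. This is not the iDES implementation: it assumes that all thresholds are combined with a logical AND and that the cutoffs are inclusive, neither of which is stated in the option table. It also assumes that -f, -n, and -m are compared with the FracSamples, NumPosSamples, and TotalSamples fields of the background database. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolishThresholds:
    """Defaults copied from the step 3 option table."""
    min_frac_nonzero: float = 0.2   # -f
    min_num_nonzero: int = 2        # -n
    min_total_samples: int = 4      # -m
    max_af_percent: float = 5.0     # -a
    max_reads: int = 10             # -r


def is_polishing_candidate(
    frac_samples: float,
    num_pos_samples: int,
    total_samples: int,
    af_percent: float,
    reads: int,
    t: PolishThresholds = PolishThresholds(),
) -> bool:
    """ASSUMED logic: every threshold must hold for an allele to be considered."""
    background_ok = (
        total_samples >= t.min_total_samples
        and num_pos_samples >= t.min_num_nonzero
        and frac_samples >= t.min_frac_nonzero
    )
    sample_ok = af_percent <= t.max_af_percent and reads <= t.max_reads
    return background_ok and sample_ok


# Made-up values: seen in 3 of 6 background samples, 0.1% AF, 2 supporting reads.
print(is_polishing_candidate(0.5, 3, 6, 0.1, 2))
```

Under these assumptions, a variant at high allele frequency or with many supporting reads is left untouched even when the background database contains it.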

4. Generate background error report.

Command

perl ides-bgreport.pl [options] input.freq

Input

input.freq: Single FREQ file (with or without background polishing).

Output

STDOUT = background statistics.

Field	Description
No. positions	Number of genomic positions analyzed
No. positions without errors	Number of error-free genomic positions analyzed
Percent positions without errors	Percentage of error-free genomic positions
No. bases sequenced	Total number of bases analyzed
No. errors	Total number of reads supporting non-reference alleles (i.e., errors) analyzed
Percent errors	Global error rate (errors per base)
Subst., Positions, Errors, %Errors	Base substitution type, No. positions with that base substitution, No. reads (i.e., errors) supporting that base substitution, Percentage of all errors due to that base substitution
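The relationship between the reported fields can be sketched as follows. This is an illustration rather than the ides-bgreport.pl code: it assumes that "No. bases sequenced" is the sum of per-position depth and that "Percent errors" is 100 x errors / bases. The function name and the input data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def background_summary(positions: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (depth, non_reference_reads) pairs, one pair per genomic position."""
    n_pos = n_clean = bases = errors = 0
    for depth, nonref in positions:
        n_pos += 1
        bases += depth
        errors += nonref
        if nonref == 0:
            n_clean += 1
    return {
        "No. positions": n_pos,
        "No. positions without errors": n_clean,
        "Percent positions without errors": 100.0 * n_clean / n_pos if n_pos else 0.0,
        "No. bases sequenced": bases,
        "No. errors": errors,
        "Percent errors": 100.0 * errors / bases if bases else 0.0,
    }


# Made-up data: four positions with their depth and non-reference read counts.
toy = [(1000, 0), (2000, 1), (1000, 0), (1000, 3)]
for field, value in background_summary(toy).items():
    print(f"{field}\t{value}")
```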

Download

Please note that you must accept the terms of the license in order to download iDES. After downloading version 1.1, unzip the contents (iDES Perl code and license) into the same directory.

Release Notes

1.1

Latest release. Download.

Requirements

Unix operating system (Linux, Mac OS X, etc.)

Reference genome in FASTA format (e.g., hg38.fa). A 2BIT file (e.g., hg19.2bit) must first be converted to FASTA.

To extract a FASTA file from a 2BIT file, download twoBitToFa from the appropriate system folder, then run without arguments and follow usage instructions.

R (tested with version 3+), with the following external dependency: fitdistrplus

To install 'fitdistrplus' from R terminal:

install.packages('fitdistrplus')

Perl 5, with the following external dependencies: Statistics::Descriptive, Statistics::R, Proc::Fork.

To install from CPAN, issue the following command (e.g., Statistics::Descriptive):

sudo cpan Statistics::Descriptive

Other Perl dependencies are included in the Perl 5 Core Modules and should already be installed: Getopt::Std, List::Util qw(max min), Cwd 'abs_path', File::Basename, File::Spec, POSIX.

SAMtools 0.1.20+

After compiling, either run 'make install' or locate the samtools executable and copy, link, or move it to a directory on your PATH (e.g., /usr/bin).
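Before running the pipeline, it can help to confirm that the required executables are visible on PATH. The sketch below uses Python's standard `shutil.which`; it only checks that the programs can be found and does not verify versions, Perl modules, or the fitdistrplus R package. The function name `check_tools` is hypothetical.

```python
import shutil
from typing import Dict, Iterable


def check_tools(tools: Iterable[str]) -> Dict[str, bool]:
    """Return {tool: True/False} for whether each executable is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_tools(["perl", "R", "samtools"])
for tool, found in status.items():
    print(f"{tool}\t{'found' if found else 'MISSING'}")
```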

Example

Perform background polishing on FREQ files generated before and after barcode collapsing (non-deduped and barcode-deduped).

Download background database and example FREQ files from Newman et al. (2016).

Move tarball (48MB) into working directory and unpack:

tar -zxvf example.tar.gz

Contents:

ides-bgdb.txt
Sample1.non-deduped.freq.paired.Q30.txt
Sample1.barcode-deduped.freq.paired.Q30.txt
Sample2.non-deduped.freq.paired.Q30.txt
Sample2.barcode-deduped.freq.paired.Q30.txt

Create input directory and move FREQ files.

mkdir input
mv *Q30.txt input

Create output directory.

mkdir output

Run background polishing (make sure all dependencies are installed first).

perl ides-polishbg.pl -o output -t 4 input ides-bgdb.txt

Collect error statistics.

for i in input/*Q30.txt; do echo $i && perl ides-bgreport.pl $i; done;

for i in output/*Q30.rmbg.txt; do echo $i && perl ides-bgreport.pl $i; done;

Expected output (MD5)

File	Error rate (%) pre-polishing	Error rate (%) post-polishing
Sample1.non-deduped	0.025	0.009
Sample2.non-deduped	0.024	0.0076
Sample1.barcode-deduped	0.011	0.002
Sample2.barcode-deduped	0.0097	0.0015
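To compare your own results with the expected values, the error rates in the table can be turned into a fold reduction (pre-polishing rate divided by post-polishing rate). The short sketch below does this arithmetic with the numbers copied from the table; the function name `fold_reduction` is hypothetical.

```python
from typing import Dict, Tuple

# Error rates (%) copied from the expected-output table:
# sample -> (pre-polishing, post-polishing).
expected: Dict[str, Tuple[float, float]] = {
    "Sample1.non-deduped": (0.025, 0.009),
    "Sample2.non-deduped": (0.024, 0.0076),
    "Sample1.barcode-deduped": (0.011, 0.002),
    "Sample2.barcode-deduped": (0.0097, 0.0015),
}


def fold_reduction(pre: float, post: float) -> float:
    """How many times lower the error rate is after polishing."""
    if post <= 0:
        raise ValueError("post-polishing error rate must be positive")
    return pre / post


for name, (pre, post) in expected.items():
    print(f"{name}\t{fold_reduction(pre, post):.1f}-fold")
```

Substituting the "Percent errors" values reported by ides-bgreport.pl for your own runs gives a quick check against the expected figures.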

FAQ

Under construction

Reference

Newman AM*, Lovejoy AJ*, Klass DM*, Kurtz DM, Chabon JJ, Scherer FM, Stehr H, Liu CL, Bratman SV, Say C, Zhou L, Carter JN, West RB, Sledge GW, Shrager JB, Loo, Jr BW, Neal JW, Wakelee HA, Diehn M^# and AA Alizadeh^# (2016) Integrated digital error suppression for improved detection of circulating tumor DNA.

Funding

This work was supported by grants from the Department of Defense (A.M.N., M.D., A.A.A.), the National Cancer Institute (A.M.N., 1K99CA187192-01A1; M.D., A.A.A., R01CA188298), the US National Institutes of Health Director’s New Innovator Award Program (M.D., 1-DP2-CA186569), the Ludwig Institute for Cancer Research (M.D., A.A.A.), the CRK Faculty Scholar Fund (M.D.), V-Foundation (A.A.A.), Damon Runyon Cancer Research Foundation (A.A.A.) and a grant from both the Siebel Stem Cell Institute and the Thomas and Stacey Siebel Foundation (A.M.N.).
