iDES: Software and Documentation
Overview
iDES (Integrated Digital Error Suppression) is a method for the suppression of stereotypical background artifacts in high-throughput sequencing data. This page provides Perl implementations and documentation for input file conversion, background database construction, background polishing, and quality control statistics.
Steps:
- Format input files (required)
- Create background database (required)
- Perform background polishing
- Background error report (optional)
Usage
1. Convert input BAM files to frequency (FREQ) file format.
perl ides-bam2freq.pl [options] MySample.bam(s) genome.fa targets.bed
Input
- MySample.bam(s): Single position-sorted BAM file or directory of position-sorted BAM files consisting of paired-end reads (with or without de-duplication). BAM index (BAI) file(s) should be present in the same directory.
- genome.fa: Reference genome in FASTA format (e.g., hg19.fa).
- targets.bed: Standard 3-column BED file (chr start end) used to restrict FREQ file(s) to genomic regions of interest.
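Because each BAM file needs its BAI index in the same directory, a quick pre-flight check can catch missing indexes before conversion starts. The following is a minimal sketch, not part of iDES. The function name `bams_missing_index` is hypothetical, and it assumes the index is named either `sample.bam.bai` or `sample.bai`.

```python
from pathlib import Path


def bams_missing_index(path):
    """Return BAM files under `path` that have no BAI index beside them.

    `path` may be a single BAM file or a directory of BAM files.
    Assumption: an index is named either <name>.bam.bai or <name>.bai.
    """
    p = Path(path)
    bams = sorted(p.glob("*.bam")) if p.is_dir() else [p]
    missing = []
    for bam in bams:
        candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
        if not any(c.exists() for c in candidates):
            missing.append(bam)
    return missing
```

An empty list means every BAM file found has an index next to it.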
Options (defaults in parentheses)
- Output directory (same as input BAM file directory).
- Phred quality filter (30).
- Number of CPUs to use for processing >1 input BAM (1).
- Disable requirement for properly paired reads.
- Disable mpileup base alignment quality adjustment.
Output
| Field | Description |
| Chr | Chromosome |
| POS | Genomic position |
| DEPTH | Total depth/reads |
| REF | Reference allele |
| R(+/-) | Number of +/- strand reads supporting reference allele |
| A/C/T/G(+/-) | Number of +/- strand reads supporting alternate alleles |
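To make the FREQ fields concrete, the sketch below parses one row and computes the fraction of total depth supporting each non-reference base. The column order in `COLUMNS` is an assumption made for illustration (tab-delimited, with separate + and - strand counts); check it against the header of a real FREQ file before relying on it.

```python
# ASSUMED column order (hypothetical); verify against a real FREQ file header.
COLUMNS = ["CHR", "POS", "DEPTH", "REF", "R+", "R-",
           "A+", "A-", "C+", "C-", "T+", "T-", "G+", "G-"]


def parse_freq_line(line):
    """Split one tab-delimited FREQ row into a dict with integer counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for key in COLUMNS:
        if key not in ("CHR", "REF"):
            row[key] = int(row[key])
    return row


def allele_fractions(row):
    """Fraction of total depth supporting each non-reference base."""
    depth = row["DEPTH"]
    fractions = {}
    for base in "ACTG":
        if base == row["REF"].upper():
            continue  # skip the reference base itself
        reads = row[base + "+"] + row[base + "-"]
        fractions[base] = reads / depth if depth else 0.0
    return fractions
```

For example, a position with depth 1000, reference A, and 3 plus-strand and 2 minus-strand C reads gives a C fraction of 0.005.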
2. Create nucleotide substitution background database.
perl ides-makedb.pl [options] dir
Input
- dir: Directory of source FREQ files for the background database. Data generated before duplicate removal appear to provide a better substrate for learning background distributions, but de-duplicated data can also be modeled.
Options (defaults in parentheses)
- Output directory (same as input FREQ file directory).
- Maximum allele frequency for training database (100).
- Minimum total depth required for each genomic position (20).
- Minimum number of input samples required (6).
- Add custom name to background database.
- Print Weibull parameter estimation errors.
Output
| Field | Description |
| Chr | Chromosome |
| Pos | Position |
| Ref | Reference allele |
| Var | Variant (alternate) allele |
| NumPosSamples | Number of samples harboring a given variant allele |
| TotalSamples | Total number of samples analyzed for a given variant allele |
| FracSamples | Fraction of samples harboring a given variant allele |
| FracBothStrands | Fraction of samples harboring a given variant allele with dual (+/-) strand support |
| MeanReads | Mean number of reads supporting a given variant across all evaluable input samples |
| MedianReads | Median number of reads supporting a given variant across all evaluable input samples |
| StdReads | Standard deviation of reads supporting a given variant across all evaluable input samples |
| MeanAF | Mean allele frequency of a given variant, calculated across all evaluable input samples |
| MedianAF | Median allele frequency of a given variant, calculated across all evaluable input samples |
| StdAF | Standard deviation of the allele frequency of a given variant, calculated across all evaluable input samples |
| W_Shape | Estimated shape parameter of Weibull distribution |
| W_Scale | Estimated scale parameter of Weibull distribution |
| W_Corr | Correlation between Weibull distribution and observed non-zero allele fractions within a QQ-plot |
| W_Pval | P-value corresponding to W_Corr |
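The W_Corr field can be read as a QQ-plot correlation: sort the observed non-zero allele fractions, pair them with the Weibull quantiles implied by W_Shape and W_Scale, and correlate the two. The sketch below illustrates that idea only. The plotting positions `(i + 0.5) / n` and the use of Pearson correlation are assumptions, and the actual iDES computation may differ.

```python
import math


def weibull_quantile(p, shape, scale):
    """Inverse CDF of a two-parameter Weibull distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weibull_qq_correlation(observed, shape, scale):
    """Correlate sorted observations with theoretical Weibull quantiles.

    Assumption: plotting positions are (i + 0.5) / n for i = 0..n-1.
    """
    obs = sorted(observed)
    n = len(obs)
    if n < 2:
        raise ValueError("need at least two observations")
    probs = [(i + 0.5) / n for i in range(n)]
    theoretical = [weibull_quantile(p, shape, scale) for p in probs]
    return pearson(theoretical, obs)
```

A value near 1 indicates that the observed allele fractions lie close to a straight line against the fitted Weibull quantiles.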
3. Perform background polishing.
perl ides-polishbg.pl [options] MySample.freq(s) ides-bgdb.txt
Input
- MySample.freq(s): Single FREQ file or directory of FREQ files.
- ides-bgdb.txt: Background database.
Options (defaults in parentheses)
- Output directory (same as input FREQ file directory).
- Minimum fraction of non-zero background samples needed for polishing (0.2).
- Minimum number of non-zero background samples needed for polishing (2).
- Minimum number of total background samples (4).
- Minimum number of non-zero samples needed for Weibull modeling (5).
- Maximum allele frequency cutoff for polishing (5).
- Maximum number of supporting reads cutoff for polishing (10).
- Nominal p-value threshold for background polishing (0.05).
- Number of CPUs to use for processing >1 input FREQ file (1).
- Do background polishing on previously polished FREQ file(s).
- Directory of duplex-supported FREQ files.
- Matching duplex-supported FREQ file.
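As a rough illustration of how several of these cutoffs could gate a candidate variant, the sketch below treats each default as a condition that must hold. This is one possible reading, not the logic of ides-polishbg.pl. The names `PolishThresholds` and `eligible_for_polishing` are hypothetical, the allele frequency cutoff is assumed to be a percentage, and the real script may combine the cutoffs differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolishThresholds:
    """Defaults copied from the option list; field names are hypothetical."""
    min_frac_nonzero: float = 0.2   # min fraction of non-zero background samples
    min_num_nonzero: int = 2        # min number of non-zero background samples
    min_total_samples: int = 4      # min number of total background samples
    max_af_percent: float = 5.0     # ASSUMED to be a percentage
    max_reads: int = 10             # max supporting reads in the sample


def eligible_for_polishing(num_pos_samples, total_samples,
                           af_percent, reads, t=PolishThresholds()):
    """Return True if every cutoff is satisfied (an assumed AND combination)."""
    if total_samples < t.min_total_samples:
        return False
    if num_pos_samples < t.min_num_nonzero:
        return False
    if num_pos_samples / total_samples < t.min_frac_nonzero:
        return False
    if af_percent > t.max_af_percent:
        return False
    if reads > t.max_reads:
        return False
    return True
```

Here `num_pos_samples` and `total_samples` correspond to the NumPosSamples and TotalSamples fields of the background database.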
Output
| Field | Description |
| Chr | Chromosome |
| POS | Genomic position |
| DEPTH | Total depth/reads |
| REF | Reference allele |
| R(+/-) | Number of +/- strand reads supporting reference allele |
| A/C/T/G(+/-) | Number of +/- strand reads supporting alternate alleles |
4. Generate background error report.
perl ides-bgreport.pl [options] MySample.freq
Input
- MySample.freq: Single FREQ file (with or without background polishing).
Output
| Field | Description |
| No. positions | Number of genomic positions analyzed |
| No. positions without errors | Number of error-free genomic positions analyzed |
| Percent positions without errors | Percentage of error-free genomic positions |
| No. bases sequenced | Total number of bases analyzed |
| No. errors | Total number of reads supporting non-reference alleles (i.e., errors) analyzed |
| Percent errors | Global error rate (errors per base) |
| Subst., Positions, Errors, %Errors | Base substitution type, No. positions with that base substitution, No. reads (i.e., errors) supporting that base substitution, Percentage of all errors due to that base substitution |
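The global statistics in this report follow directly from per-position depth and non-reference read counts. The sketch below recomputes them from a list of `(depth, non_reference_reads)` pairs. The function name and input layout are hypothetical, and the per-substitution breakdown is omitted.

```python
def error_summary(positions):
    """Summarize errors from (depth, non_reference_reads) pairs.

    Each pair describes one genomic position. Keys mirror the report fields.
    """
    positions = list(positions)
    n = len(positions)
    clean = sum(1 for _, errors in positions if errors == 0)
    bases = sum(depth for depth, _ in positions)
    errors = sum(err for _, err in positions)
    return {
        "No. positions": n,
        "No. positions without errors": clean,
        "Percent positions without errors": 100.0 * clean / n if n else 0.0,
        "No. bases sequenced": bases,
        "No. errors": errors,
        "Percent errors": 100.0 * errors / bases if bases else 0.0,
    }
```

For instance, four positions with depths 1000, 2000, 1000, 1000 and non-reference read counts 0, 1, 3, 0 give 50% error-free positions and a global error rate of 0.08%.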
Download
Requirements
To extract a FASTA file from a 2BIT file, download twoBitToFa from the appropriate system folder, then run without arguments and follow usage instructions.
To install 'fitdistrplus' from R terminal:
install.packages('fitdistrplus')
To install a Perl module from CPAN (e.g., Statistics::Descriptive), issue the following command:
sudo cpan Statistics::Descriptive
Other Perl dependencies are included in the Perl 5 Core Modules and should already be installed: Getopt::Std, List::Util qw(max min), Cwd 'abs_path', File::Basename, File::Spec, POSIX.
After compiling samtools, either run 'make install' or locate the samtools executable and copy, link, or move it to a directory on your PATH (e.g., /usr/bin).
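Before running the scripts, it can help to confirm that the external programs are reachable on PATH. The sketch below is a hedged convenience check, not part of iDES. The default tool names (`perl`, `samtools`, `R`) are assumptions about what the executables are called on your system.

```python
import shutil


def missing_tools(tools=("perl", "samtools", "R")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that the executables exist. It does not verify versions or that the R package and Perl modules are installed.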
Example
Perform background polishing on FREQ files generated before and after barcode collapsing.
1. Download background database and example FREQ files from Newman et al. (2016). Move tarball (48MB) into working directory and unpack:
tar -zxvf example.tar.gz
Contents:
- ides-bgdb.txt
- Sample1.non-deduped.freq.paired.Q30.txt
- Sample1.barcode-deduped.freq.paired.Q30.txt
- Sample2.non-deduped.freq.paired.Q30.txt
- Sample2.barcode-deduped.freq.paired.Q30.txt
2. Create input directory and move FREQ files.
mkdir input
mv *Q30.txt input
3. Create output directory.
mkdir output
4. Run background polishing (make sure all dependencies are installed first).
perl ides-polishbg.pl -o output -t 4 input ides-bgdb.txt
5. Collect error statistics.
for i in input/*Q30.txt; do echo "$i" && perl ides-bgreport.pl "$i"; done
for i in output/*Q30.rmbg.txt; do echo "$i" && perl ides-bgreport.pl "$i"; done
6. Expected output (MD5)
| File | Error rate (%) pre-polishing | Error rate (%) post-polishing |
| Sample1.non-deduped | 0.025 | 0.009 |
| Sample2.non-deduped | 0.024 | 0.0076 |
| Sample1.barcode-deduped | 0.011 | 0.002 |
| Sample2.barcode-deduped | 0.0097 | 0.0015 |
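The effect of polishing is easier to compare across samples as a fold reduction in error rate. The short sketch below computes it from the values in the table above. The helper name `fold_reduction` is introduced here for illustration.

```python
# Error rates (%) copied from the table: (pre-polishing, post-polishing)
rates = {
    "Sample1.non-deduped": (0.025, 0.009),
    "Sample2.non-deduped": (0.024, 0.0076),
    "Sample1.barcode-deduped": (0.011, 0.002),
    "Sample2.barcode-deduped": (0.0097, 0.0015),
}


def fold_reduction(pre, post):
    """Ratio of the pre-polishing to the post-polishing error rate."""
    return pre / post


for name, (pre, post) in rates.items():
    print(f"{name}: {fold_reduction(pre, post):.1f}-fold lower error rate")
```

With these values, polishing lowers the error rate by roughly 2.8-fold and 3.2-fold in the non-deduped samples, and by 5.5-fold and 6.5-fold in the barcode-deduped samples.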
FAQ
- Under construction
Reference
Newman AM*, Lovejoy AJ*, Klass DM*, Kurtz DM, Chabon JJ, Scherer FM, Stehr H, Liu CL, Bratman SV, Say C, Zhou L, Carter JN, West RB, Sledge GW, Shrager JB, Loo BW Jr, Neal JW, Wakelee HA, Diehn M#, Alizadeh AA# (2016) Integrated digital error suppression for improved detection of circulating tumor DNA.
Funding
This work was supported by grants from the Department of Defense (A.M.N., M.D., A.A.A.), the National Cancer Institute (A.M.N., 1K99CA187192-01A1; M.D., A.A.A., R01CA188298), the US National Institutes of Health Director’s New Innovator Award Program (M.D., 1-DP2-CA186569), the Ludwig Institute for Cancer Research (M.D., A.A.A.), the CRK Faculty Scholar Fund (M.D.), V-Foundation (A.A.A.), Damon Runyon Cancer Research Foundation (A.A.A.) and a grant from both the Siebel Stem Cell Institute and the Thomas and Stacey Siebel Foundation (A.M.N.).