ORIGINAL RESEARCH

The impact of sequencing depth on accuracy of single nucleotide variant calls

About authors

Genotek Inc., Moscow

Correspondence should be addressed: Valery Ilyinsky
per. Nastavnichesky, d. 17, str. 1, Moscow, Russia, 105120; valery@genotek.ru

About paper

Acknowledgements: the authors thank Anna Davydova of Genotek for her helpful comments.

All authors contributed equally to this work: selection and analysis of literature, planning of the manuscript's structure, data interpretation, drafting and editing of the manuscript.

Received: 2017-06-22 Accepted: 2017-06-27 Published online: 2017-07-19

Although protein-coding regions make up only ~1 % of the human genome, they harbor 85 % of all disease-associated mutations [1]. In this light, the clinical use of whole-exome sequencing and other sequencing methods that employ specially designed enrichment panels targeting potentially mutated exonic regions is encouraged [2].

However, there are some challenges to the clinical application of whole-exome sequencing, one of them being the choice of an appropriate depth of coverage, i.e. the number of times each nucleotide is sequenced, usually designated as 10x, 20x, 50x, etc. [3]. Good coverage ensures better identification of sequencing errors and thus increases sequencing accuracy. Two factors determine the choice of coverage depth. The first is time and cost, which are directly proportional to the number of reads performed. The second is the minimal number of reads needed to achieve the desired error tolerance. No consensus has yet been reached regarding the second factor.

Using the short-read sequencing technology by Illumina, Bentley et al. discovered in 2008 that almost every homozygous single nucleotide variant (SNV) can be detected at 15x coverage, while accurate heterozygous SNV calling requires 33x coverage [4]. Subsequently, 33x sequencing depth was adopted as the standard coverage for SNV detection [5, 6]. In 2011 Ajay et al. reported that accurate detection of 95 % of SNVs, as well as short insertions and deletions, required 50x coverage; however, further experiments that employed new, improved reagents and software for data processing produced the same yield at 35x [7]. In 2014 Fang et al. published an article demonstrating that 60x coverage is needed for accurate detection of 95 % of insertions and deletions [8].

Such discrepancy indicates that the recommended sequencing depth is not easily determined: the number of reads per region needed for accurate variant detection depends on read quality, which, in turn, depends on the technique applied, the reagents used and the quality of sample preparation. For example, amplification of GC-rich regions during the polymerase chain reaction (PCR) can be problematic, resulting in poor sequencing quality and forcing the researcher to increase the number of reads. Currently, there are PCR reagent kits that can improve reaction quality and thereby the quality of sequencing. In 2013 Meynert et al. found that, depending on the reagents used, 95 % of SNPs can be detected at either 20x or 46x coverage [9]. In 2014 the same authors reported 14x coverage as sufficient for accurate detection of 95 % of SNPs [10]. Besides, Li et al. demonstrated that the required coverage depth also depends on the number of individual samples to be sequenced [11]. For example, for the detection of mutations with a frequency <0.2 %, 4x sequencing of 3,000 samples yields the same result as 30x sequencing of 2,000 samples. To sum up, there are more factors affecting sequencing quality than it might seem, and the number of reads can be reduced efficiently once the contribution of each factor has been estimated or in line with the study goal.

In this work we show that the Genotek01 enrichment panel allows the depth of coverage to be reduced to 12x while maintaining accurate calling of SNVs and heterozygous variants, with only a 0.5 % difference between NGS and Sanger sequencing outcomes.

METHODS

DNA extraction, preparation and sequencing of DNA libraries

DNA was extracted from whole venous blood of patients with inherited diseases using the QIAamp DNA Mini Kit (Qiagen, Germany). The quality of genomic DNA was checked by agarose gel electrophoresis; critical purity indicators included the absence of DNA degradation and RNA contamination. The concentration of the obtained DNA was measured on a Qubit 3.0 Fluorometer (Life Technologies, USA). DNA libraries were prepared with the NEBNext Ultra DNA Library Prep Kit for Illumina (New England Biolabs, USA) and adaptor sequences for Illumina sequencing according to the manufacturer’s protocol. Dual-indexed libraries were obtained using NEBNext Multiplex Oligos for Illumina (Dual Index Primers Set 1) by the same manufacturer. Quality control of the obtained DNA libraries was performed on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Target enrichment of the coding regions was carried out with MYbaits (MYcroarray, USA). For 100 bp paired-end sequencing, the HiSeq 2500 System (Illumina, USA) with the HiSeq Rapid PE Cluster Kit v2 and HiSeq Rapid SBS Kit v2 (Illumina) was used following the manufacturer’s protocol.

Sanger sequencing

Amplicons were fluorescently labeled using the BigDye Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific, USA) according to the manufacturer’s protocol. Sanger sequencing was performed on an ABI PRISM 3500 Genetic Analyzer (Applied Biosystems, USA) according to the manufacturer’s protocol.

Bioinformatic analysis

The obtained reads were aligned to the reference genome hg19 using BWA. PCR duplicates were removed with SAMtools rmdup, and variant calling was performed with the Genome Analysis Toolkit (GATK). We detected 89 mutations: 10 homozygous or hemizygous and 79 heterozygous; of these, 80 were point mutations (SNPs) and 9 were short insertions and deletions (indels). We also genotyped the 200 nt-long regions to the left and right of the detected mutations. All positions in those regions were analyzed and then validated by Sanger sequencing, the gold standard for detecting short mutations. Chromatography data were processed uniformly. Mutation calling was done using Genotek's in-house software based on BioPython and the R packages sangerseqR, seqinR, Biostrings and Rsubread. Genotypes obtained through NGS were validated by Sanger sequencing, and sensitivity and specificity were then calculated.
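For illustration, a minimal sketch of such an alignment, deduplication and calling pipeline is given below; the file names, the read-group string and the exact GATK invocation are assumptions made for clarity rather than the commands actually used.

# Minimal sketch of the alignment/deduplication/calling steps described above.
# File names and the GATK command are illustrative assumptions, not the
# authors' actual configuration.
import subprocess

def run(cmd):
    """Run a shell command and fail loudly on a non-zero exit code."""
    subprocess.run(cmd, shell=True, check=True)

sample = "sample01"            # hypothetical sample name
ref = "hg19.fa"                # reference genome (hg19), BWA-indexed

# 1. Align paired-end reads to hg19 with BWA (read group added for GATK)
#    and sort the result.
run(f"bwa mem -R '@RG\\tID:{sample}\\tSM:{sample}' {ref} "
    f"{sample}_R1.fastq.gz {sample}_R2.fastq.gz | "
    f"samtools sort -o {sample}.sorted.bam -")

# 2. Remove PCR duplicates with SAMtools rmdup (paired-end mode) and index.
run(f"samtools rmdup {sample}.sorted.bam {sample}.dedup.bam")
run(f"samtools index {sample}.dedup.bam")

# 3. Call variants with GATK (HaplotypeCaller shown as an example).
run(f"gatk HaplotypeCaller -R {ref} -I {sample}.dedup.bam -O {sample}.vcf.gz")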

RESULTS

Validation of mutations by Sanger sequencing

Sanger sequencing did not confirm 15 of the 89 mutations detected by variant calling, meaning that they either had a different genotype (compared to the genotype identified by NGS) or were absent. Eight of the 15 unconfirmed mutations were identified by NGS as heterozygous, but Sanger validation classified them as homozygous. Of note, the NGS-detected heterozygosity was supported by only one read carrying the reference allele (see fig. 1, the cluster of mutations in the lower right corner).

Simulation of various coverage depths

To determine the minimum depth of coverage, we ran a series of simulations in which the number of reads covering each mutation and its adjacent regions was repeatedly downsampled (bootstrapping) and mutation calling was performed on the reduced read sets. To estimate the error rate of the calls, we used Sanger-confirmed homozygous positions matching the reference.
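A minimal sketch of such a downsampling experiment at a single position is shown below; the genotyping rule is a simplified stand-in for the actual caller, and the pileup, thresholds and iteration count are illustrative assumptions.

# Sketch of coverage downsampling (bootstrap) at a single position.
# The genotype rule is a simplified stand-in for the real caller;
# thresholds and inputs are illustrative assumptions.
import random

def call_genotype(bases, ref, alt, min_alt_reads=2, hom_alt_fraction=0.9):
    """Naive genotype call from a list of base calls at one position."""
    alt_count = sum(b == alt for b in bases)
    if alt_count < min_alt_reads:
        return "hom_ref"
    if alt_count / len(bases) >= hom_alt_fraction:
        return "hom_alt"
    return "het"

def downsample_accuracy(bases, ref, alt, truth, depth, n_iter=1000):
    """Fraction of bootstrap replicates at a given depth that recover
    the Sanger-confirmed genotype ('truth')."""
    hits = 0
    for _ in range(n_iter):
        subset = random.choices(bases, k=depth)   # sample reads with replacement
        hits += call_genotype(subset, ref, alt) == truth
    return hits / n_iter

# Example: a confirmed heterozygous position covered by 40 reads (hypothetical pileup).
reads = ["A"] * 22 + ["G"] * 18
for depth in (2, 5, 8, 12, 20, 50):
    print(depth, downsample_accuracy(reads, "A", "G", "het", depth))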

Sequencing quality can be assessed using the Phred quality score (Q score) generated by the sequencer for each nucleotide [12]. However, this metric only reflects the predicted accuracy of individual base calls, which was insufficient for our purposes. Instead, we checked whether each read overlapping a position of interest supported the reference sequence; any mismatch was counted as an erroneous call.

We analyzed 372,443 nucleotides; 276 of them did not match the reference sequence, while the others did. Thus, the calculated error rate was 0.0741 %, equivalent to Q31 on the Phred quality scale.
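The sketch below illustrates how such an empirical error rate can be counted from aligned reads and converted to a Phred score; the pysam-based pileup loop, the BAM path and the list of confirmed positions are assumptions made for illustration only.

# Count reference mismatches at Sanger-confirmed homozygous reference
# positions and convert the empirical error rate to a Phred score.
# The BAM path, positions and reference bases are illustrative assumptions.
import math
import pysam

def empirical_error_rate(bam_path, positions):
    """positions: iterable of (chrom, 0-based position, reference base)."""
    total, mismatches = 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, pos, ref_base in positions:
            for col in bam.pileup(chrom, pos, pos + 1, truncate=True):
                for read in col.pileups:
                    if read.is_del or read.is_refskip:
                        continue
                    base = read.alignment.query_sequence[read.query_position]
                    total += 1
                    mismatches += base.upper() != ref_base.upper()
    return mismatches, total

# With the counts reported in the text: 276 mismatches in 372,443 bases.
p = 276 / 372443
print(f"error rate = {p:.4%}, Phred Q = {-10 * math.log10(p):.1f}")
# -> error rate = 0.0741%, Phred Q = 31.3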

For 69 positions with confirmed heterozygous mutations, the percentage of reads supporting the alternate allele was estimated (fig. 2).

Based on these data and the frequency of erroneous calls, we calculated, for coverage depths ranging from 2x to 50x, the expected frequency of every combination of total depth and number of reads supporting the alternate allele (from 0 to the maximum). These data were used to estimate the frequencies of two error types: a truly heterozygous variant called as homozygous reference and a truly heterozygous variant called as homozygous alternate, at different cut-off levels for homozygous reference and homozygous alternate calls. To filter out homozygous reference calls, we varied the read-count threshold from 2 to 10; to filter out homozygous alternate calls, we varied the threshold on the percentage of reads supporting the alternate allele between 70, 80 and 90 % (see the table). We found that for short mutations (SNPs and indels) the accuracy of the applied method reached 99.7 %, with a sensitivity of 98 % at 12x coverage. Lower coverage led to a considerable drop in sensitivity (sigmoidal in character) and therefore cannot be recommended. When planning a laboratory experiment, the average number of reads per base required to achieve 12x coverage of the target region should be determined. We therefore plotted the relationship between the average depth of coverage and the percentage of the target region covered by at least 12 reads (fig. 3).
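The following sketch shows how the frequencies of the two error types can be computed for a given depth and pair of cut-offs, assuming a simple binomial model for the reads at a truly heterozygous site; the allele fraction of 0.5 and the example thresholds are illustrative assumptions rather than the exact values underlying the table.

# Sketch: probability that a truly heterozygous site is miscalled as
# homozygous reference or homozygous alternate at a given depth, using a
# binomial read model. Parameter values are illustrative assumptions.
from math import comb

def het_miscall_probs(depth, p_alt=0.5, min_alt_reads=2, hom_alt_cutoff=0.9):
    """Return (P(called hom ref), P(called hom alt)) for a true het site."""
    p_hom_ref = p_hom_alt = 0.0
    for k in range(depth + 1):                       # k = alt-supporting reads
        p_k = comb(depth, k) * p_alt**k * (1 - p_alt)**(depth - k)
        if k < min_alt_reads:                        # too few alt reads
            p_hom_ref += p_k
        elif k / depth >= hom_alt_cutoff:            # almost all reads are alt
            p_hom_alt += p_k
    return p_hom_ref, p_hom_alt

for depth in (2, 5, 8, 12, 20, 50):
    print(depth, het_miscall_probs(depth))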

We found that to cover >90 % of the target regions at a depth of at least 12x, 40x average coverage by deduplicated reads is required.
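A minimal sketch of how the fraction of the target covered at a given depth can be derived from per-base coverage is shown below; the use of samtools depth output and the file names are illustrative assumptions, not the authors' exact procedure.

# Sketch: fraction of target bases covered by at least 12 deduplicated reads,
# computed from 'samtools depth -a -b target.bed sample01.dedup.bam' output
# (tab-separated columns: chrom, pos, depth). File names and the threshold
# are assumptions.
def fraction_at_least(depth_file, threshold=12):
    covered = total = 0
    with open(depth_file) as fh:
        for line in fh:
            depth = int(line.rstrip("\n").split("\t")[2])
            total += 1
            covered += depth >= threshold
    return covered / total if total else 0.0

print(f"{fraction_at_least('sample01.depth.txt'):.1%} of target bases at >=12x")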

DISCUSSION

Sanger sequencing did not confirm 15 of the 89 mutations detected by NGS; 8 of the 15 unvalidated mutations were homozygous and not heterozygous as suggested by NGS. Such an outcome largely depends on the error model employed by GATK, the software used to obtain the set of variants for the studied genome, which interprets single reads carrying reference or non-reference alleles differently during variant calling. GATK employs the reference confidence model in combination with cohort analysis [13, 14]. Therefore, if a single read carries a non-reference allele, GATK treats the nucleotide variants from this read as sequencing errors and ignores them when calculating the genotype; if a single read carries the reference allele, GATK considers the probability of error to be low and returns a heterozygous (rather than homozygous) genotype. Moreover, in our study the majority of mutations unconfirmed by Sanger sequencing were detected at a low depth of coverage (≤10x). These results confirm that accurate mutation calls require deep sequencing in order to avoid single sequencing errors that could distort the data [15].

The depth of coverage per base is a probabilistic value and can be estimated with reliable precision. We showed that the error rate of the data obtained on the HiSeq 2500 System corresponds to the instrumental error. We also calculated the minimal coverage (12x) required for accurate sequencing. This value is lower than the one proposed by Bentley et al. [4], which may be explained by the improved equipment and newer reagents used in our study and, therefore, fewer sequencing errors. State-of-the-art bioinformatic methods also allow for better error filtering without loss of sensitivity.

CONCLUSIONS

Our work demonstrates that covering at least 90 % of the target genome at 12x or more requires 40x coverage by deduplicated reads. This value depends on the enrichment reagents and protocol applied, and on the read types and lengths. Besides, depending on the protocols used for library preparation and nucleic acid extraction, the degree of duplication in the obtained sequences may vary, which must be accounted for when calculating the desired number of sequenced nucleotides per sample.
