10Jan2022

Why does GC content matter?

Rates stratified by dinucleotide counts are significantly different than singletons.

In particular, the dinucleotide on which fragment rates depend the most is the pair surrounding the fragment end (the breakpoint), shown on the right. Fragments are much more likely to start within a CpG dinucleotide than within any other dinucleotide. Fragmentation effect.
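The breakpoint dependence described above can be sketched as a simple tally. This is a minimal illustration and not the authors' code: the function name, the 0-based coordinates, and the convention that a fragment starting at index `s` has its breakpoint between `s - 1` and `s` are all assumptions made for the example.

```python
from collections import Counter


def breakpoint_dinucleotide_rates(reference, fragment_starts):
    """Fragments per available site, for each dinucleotide spanning a breakpoint.

    Assumption: a fragment starting at 0-based index s has its breakpoint
    between s - 1 and s, so the spanning pair is reference[s - 1:s + 1].
    """
    ref = reference.upper()
    # How often each dinucleotide occurs in the reference (sites available).
    available = Counter(ref[i - 1:i + 1] for i in range(1, len(ref)))
    # How often a fragment starts inside each dinucleotide.
    observed = Counter(
        ref[s - 1:s + 1] for s in fragment_starts if 1 <= s < len(ref)
    )
    return {dinuc: observed[dinuc] / n for dinuc, n in available.items()}


rates = breakpoint_dinucleotide_rates("ACGTCGA", [2, 5, 5, 3])
```

Dividing by the number of available sites matters here: a dinucleotide that is common in the reference would otherwise look favored simply because there are more places for a fragment to start.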

A horizontal dotted line marks the relative abundance of the base at mappable positions. Local effects captured by the fragment model drive the GC curves found at larger scales. For all three bin sizes, the predicted counts (black) trace the observed loess line (blue), and also capture some of the variability around the curve.

Aggregation of single location estimates. (A-C) Estimates based on the fragment GC curve (black) trace similar paths as loess (cyan) estimated on observed counts (blue) on multiple scales. (D-F) Estimates based on alternative models compared with observed counts on 1 kb bins. See Supplementary methods for details on how models for E and F were defined and estimated.

In contrast, models based on a smaller portion of the fragment do not trace the observed curves. Figure 7 D shows the estimates from the read W 0, The methods of correction used for Figure 7 E and F are described in detail in the Supplementary data.

Corrections based on the fragment and fragment-length models remove most GC-dependent fragment count variation. The same holds for all bin sizes. Since adding length did not change the results greatly, we use the more parsimonious model for the rest of this work. We visualize the correction in a region of chromosome 1 that has no CN changes.

In Figure 8 A, the uncorrected but scaled 1 kb bin counts display large low-frequency variations, which can be mistaken for CN events. The fragment model removes these variations better than the loess model. In Figure 8 B, a histogram of corrected counts shows that the fragment correction produces a tighter distribution of scaled counts around 1 compared with the loess model.

Corrected counts of normal sample. Each point represents counts from both libraries (forward strand). A similar correction on the tumor data (both libraries, forward strand) reveals a hidden CN gain in Figure 9. GC curves for both the loess and fragment models were estimated from chromosome 1, and corrected counts for a CN gain on chromosome 2 are shown.

The CN gain is hidden in the uncorrected data due to low-frequency count variation driven by GC content. Both the fragment model correction and the loess correction reveal the CN gain. The fragment correction provides better separation between bands (see histograms in Figure 9 B). Also, it successfully corrects for different binning resolutions (Supplementary Figure S3). Note that chromosome 1 was used for GC estimation because it does not seem to have large CN changes, as seen in Figure 2.

CN gain from tumor sample. Counts and corrected counts at position 29 kb on chromosome 2. GC curves estimated on chromosome 1, which has no large CN changes. (B) Histogram of normalized counts at 28-30 Mb (underlined on left plots). The estimated GC effect and mappability explain most of the variation in the fragment coverage of the normal genome, though not all of it. The GC model removes most of the variability in the binned counts, much more so than corrections based only on mappability.

The RV of the fragment model is considerably smaller than that of the loess model. It is still larger than Poisson, though small areas with extremely high coverage cause most of this extra variance.

Computed on 1 kb bins from normal sample (forward strand, library 1), after removing outlier bins. For a comparison more robust to these high-coverage regions, we compare quantiles rather than variances. In Figure 10, we compare the 0. The variation in bins with very low observed counts is largely explained by mappability. However, mappability cannot explain variation of higher counts, and the spread between the quantiles is approximately double that of the Poisson.

Models taking GC content into account produce much tighter spreads. The fragment-length model (the green curve) consistently leaves less variation around the estimated rates than the loess model (blue). Comparison to Poisson variation. Models that predict better will have narrower vertical spreads.

Variation around the mean of the fragment model (green), the loess (blue) and mappability (black) are compared to variation around a Poisson (red). In the above analysis, we described a single tumor-normal pair produced by a single lab, but our results are general to many examined samples from multiple labs. In Figure 11, we show four descriptive plots from a different data set based on an HCC cell line (see Table 1 for details). The GC has a strong effect on fragment counts, and this relation is unimodal (Figure 11 A).

A distinct difference is the lack of length dependence of the fragments (data not shown). The AT preference near fragment ends is also missing, further proving that it is not the major source of the GC bias. Two additional sets of data are shown in the Supplementary Data. GC plots for Dataset 2. (C) GC curve at fragment model W 2, Large biases in fragment counts related to the GC composition of regions were found in the data sets we examined.

These observed effects have a recurring unimodal shape, but varied considerably between different samples. We have shown that this GC effect is mostly driven by the GC composition of the full fragment. Conditioning on the GC of the fragments captures the strongest bias, and removing this effect provides the best correction, compared with alternative GC windows. When single base pair predictions based on the fragment composition are aggregated, the results trace the observed GC dependence.
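The aggregation step described above can be sketched as follows. The excerpt does not spell out the correction formula, so dividing observed bin counts by the summed predicted rates is an assumption here, chosen so that corrected values scatter around 1 where the model fits. Function and variable names are hypothetical.

```python
def bin_sums(values, bin_size):
    """Sum per-position values into consecutive, non-overlapping bins."""
    return [sum(values[i:i + bin_size]) for i in range(0, len(values), bin_size)]


def corrected_bin_counts(observed_per_pos, predicted_rate_per_pos, bin_size):
    """Observed / predicted per bin (assumed form of the correction).

    observed_per_pos: fragments starting at each position.
    predicted_rate_per_pos: predicted fragment rate at each position,
        for example looked up from a fragment GC curve.
    """
    observed = bin_sums(observed_per_pos, bin_size)
    predicted = bin_sums(predicted_rate_per_pos, bin_size)
    # Bins with no predicted fragments carry no information; mark them NaN.
    return [o / p if p > 0 else float("nan") for o, p in zip(observed, predicted)]


observed_per_pos = [1, 0, 2, 1, 0, 0, 1, 0]
predicted_rate_per_pos = [0.5, 0.5, 1.0, 1.0, 0.25, 0.25, 0.25, 0.25]
corrected = corrected_bin_counts(observed_per_pos, predicted_rate_per_pos, 4)
```

Because the predictions live at single positions, the same inputs can be re-binned at any resolution by changing `bin_size`, without re-estimating the GC curve.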

This cannot be said about local effects that take only the reads into account. This conclusion holds for various data sets, with different fragment length composition, read lengths and GC effect shapes. That the GC curve is unimodal is key to this analysis.

In all data sets shown, the rate of GC-poor or GC-rich fragments is significantly lower than average, in many cases zero. Unimodality was overlooked by Dohm et al.

Even in humans, it is hard to spot this effect if counts are binned by GC quantiles instead of GC values. Nevertheless, it is this departure from linearity that allowed pinpointing an optimal scale—the fragment size.

In that, unimodality gives us important clues as to the causes of the GC bias. While we have described other sequence-related biases, we believe they are not driving the strong coverage GC biases. These include an increased coverage when the ends are AT rich, and location-specific fragmentation biases near the fragment ends.

They are also surprisingly negligible in the context of larger bins. Still, they might locally mitigate the fragment GC effect: the effect of fragment length on GC curve seems to be associated with these biases. Our conclusions seem to complement those of Aird et al. We have shown this is indeed the case. It should be noted that even these optimized PCR protocols can still display significant biases and may require GC correction.

Our refined description of the GC effect is of practical value for GC correction. First of all, the non-linearity of the GC effect is a warning sign regarding two-sample correction methods. In the main example we study, the pair of normal and tumor samples do not have the same GC curves. We have seen this in additional data sets as well. Using normal counts to correct tumor counts could sometimes produce GC-related artifacts, which might lead to faulty segmentations. The GC effects of samples should be carefully studied before such corrections are made.

A single sample correction for GC requires a model, and we demonstrate the importance of choosing the best model. Overlapping windows smaller than the fragment fail to remove the bulk of the GC effect. Similarly, using read coverage rather than fragment count hurts the correction.

Instead, measuring the fragment rate for single base pair positions decouples the GC modeling from the downstream analysis.

Thus, it removes the lower threshold on the scale of analysis, providing single base pair estimates, which can be later smoothed by the researcher as needed or binned into uneven bins if needed. An important benefit of DNA-seq over previous technologies is that simply repeating the experiment can increase the resolution of the analysis.

Our model ensures that this increased resolution does not hurt the GC correction. Unlike other bias correction methods, such as BEADS (14), we generate weights (predicted fragment rates) for the genomic locations rather than for the observed reads.

Mappable genomic positions are stratified according to the GC of a hypothetical fragment, and rates per GC stratum are estimated by counting the fragments at those same positions. Estimating predicted rates for both covered and uncovered locations can help detect deletions, and these predicted rates form a natural input for downstream analysis using heterogeneous Poisson models. This procedure can be critical when length information is unavailable i. In this work, we estimated DNA abundance from non-tumor genomes, implicitly assuming that abundance of DNA along the genome is uniform.
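The stratification described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published implementation: the hypothetical fragment is taken to start exactly at the position and to have a single fixed length `frag_len`, coordinates are 0-based, and all names are made up for the example.

```python
from collections import Counter


def gc_count(reference, start, length):
    """Number of G or C bases in reference[start:start + length]."""
    window = reference[start:start + length].upper()
    return window.count("G") + window.count("C")


def fragment_rates_by_gc(reference, mappable_starts, fragment_starts, frag_len):
    """Rate per GC stratum = fragments observed / mappable positions available."""
    observed = Counter(fragment_starts)  # fragments starting at each position
    n_positions = Counter()
    n_fragments = Counter()
    for pos in mappable_starts:
        if pos + frag_len > len(reference):
            continue  # hypothetical fragment would run off the reference
        gc = gc_count(reference, pos, frag_len)
        n_positions[gc] += 1
        n_fragments[gc] += observed[pos]  # 0 for uncovered positions
    return {gc: n_fragments[gc] / n_positions[gc] for gc in n_positions}


rates = fragment_rates_by_gc("AAAAGGGGAAAA", range(9), [2, 2, 6, 4, 0], 4)
```

Uncovered positions still contribute to the denominator, which is what allows a rate to be predicted for locations with no reads at all.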

It is true that CN variation may occur in non-tumor sequences; these jumps are rare however, and by random sampling we hope to average over any large CN changes. That the windows are small should reduce the dependence between GC and specific positions in the genome. From our experience, estimating GC curves using small windows turned out to be surprisingly robust to CN changes on tumor data as displayed above. To extend this method to other applications or protocols would require identifying regions in which the signal of interest is not expected to vary, and perhaps co-estimation of the abundance and the GC effect.

That said, for CN purposes there is enough data to get stable estimates of the GC effect. Our prediction accounts for a large portion of the variation, but residual variation is still present.

Additional inhomogeneities in fragment rates include unexplained hot spots or zero-counts, as well as milder low and high frequency variation in the counts.

The first two categories may be due to errors in the annotation of the genome or amplification artifacts. The latter points to the existence of additional factors that affect fragment rates, which is to be expected. We have discussed additional sequence-related biases, including fragmentation and AT preference.

The tools developed here, primarily the total variation scores, allow analysts to further investigate these effects as needed. Nevertheless, by and large, our model successfully describes the bulk of the low-frequency variability, which confounds segmentation to CN regions. One effect that we have not deeply explored is the relation between sequencing error probability and the GC effect.

In the Supplementary Data, we have shown evidence that the global GC of the fragment can affect the sequencing error probability.

Especially for longer reads, changing the parameterization of the mapping processes can sometimes produce different mappability patterns related to the GC composition.

There have been reports (11) that specific sequences in reads are more prone to errors, for example a GGC sequence. A better model for reads that are harder to sequence would allow better estimation of the fragment GC effect in the GC-rich regions, and improve the accuracy of the corrections.

Jointly correcting by the GC of the read as well as the GC of the fragment may be a useful approximation for this effect. Our analysis focused only on DNA-seq data from human subjects, but results from this work can be extended.

GC content biases were seen in additional experimental protocols using high-throughput sequencing. Moreover, when the length of the fragments is constrained (exon sequencing, RNA-seq), a model taking both GC and fragment length into account may prove important. Fitting the model for each application is a challenge; still, we believe that all these applications can benefit from our refined GC model.

We would also like to thank Niels Richard Hansen for useful suggestions, and Kasper Hansen, John Weinstein, Laurent Jacob, Claudio Lottaz, and anonymous reviewers for their insightful comments on a draft of this manuscript.

Summarizing and correcting the GC content bias in high-throughput sequencing. Oxford Academic. Genomic DNA base composition (GC content) is predicted to significantly affect genome functioning and species ecology. GC content showed a quadratic relationship with genome size, with the decreases in GC content in larger genomes possibly being a consequence of the higher biochemical costs of GC base synthesis.

In polymerase chain reaction (PCR) experiments, the GC-content of short oligonucleotides known as primers is often used to predict their annealing temperature to the template DNA. A higher GC-content level indicates a relatively higher melting temperature. Why do primers need high GC content? GC bonds contribute more to the stability—i. Is GC stronger than AT? How is GC content measured? The GC content calculation algorithm has been integrated into our Codon Optimization Software, which serves our protein expression services.
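The link between primer GC content and melting temperature can be illustrated with the Wallace rule, a common rule of thumb for short oligonucleotides: roughly 2 °C per A or T and 4 °C per G or C. The rule is not named in the text above and is brought in here only as an example; it is a coarse approximation, and the function name is hypothetical.

```python
def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C). Intended only for short primers;
    salt concentration and nearest-neighbor effects are ignored.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


at_rich = wallace_tm("ATATATATATAT")
gc_rich = wallace_tm("GCGCGCGCGCGC")
```

Two primers of the same length can differ by a factor of two in this estimate purely because of their GC content, which is why GC content is checked during primer design.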

Under pressure, such as when exposed to heat, the GC-rich sequences can take far more abuse than GC-low sequences. What causes GC bias? Meaning few reads map to very low and very high GC regions. That is an empirical result. The major factor that causes it is likely "enrichment PCR" bias.

How do you find the GC content of a sequence? Steps: Trace through the sequence and tally the number of cytosine (C) or guanine (G) nucleotides.
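The tally above can be written as a short function. The final division by the number of bases is the usual definition of GC content as a fraction; ignoring ambiguous symbols such as N is an assumption of this sketch, and the function name is hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among the A, C, G and T bases of a sequence.

    Assumption: symbols other than A/C/G/T (for example N) are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


example = gc_content("ATGCGC")
```

Multiply the result by 100 to report GC content as a percentage.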

sirocatwei1978's Ownd
