Batch effects and the reproducibility of genomics studies

Data reproducibility has recently become one of the most debated topics in the scientific community (1,2). The lack of reproducibility has been attributed to a combination of factors, ranging from poor data collection practices to flawed or misleading analysis methodologies. Because of the massive number of samples that are analyzed, the field of high-throughput genomics is particularly vulnerable to reproducibility problems (3). A significant contributor to the lack of data reproducibility in genomics is the phenomenon known as batch effects (4,5).

Batch effects are variations in data that are not caused by biological differences in the primary samples. These confounding artifacts are collectively named “batch effects” because the processing date, or batch, is used as a surrogate to indicate them. The causes of batch effects are technical in nature and range from the trivial, such as sample mislabeling, to the complex, such as degradation of the sample material. Batch effects in genomic analysis are especially troubling because the reduced cost of DNA sequencing has paved the way for its application in clinical settings, where mistakes can have potentially disastrous consequences. One of the most compelling examples of the consequences of batch effects is a 2007 report published in the Journal of Clinical Oncology (6). The article, retracted five years later, proposed a method to select personalized treatments for patients suffering from ovarian cancer. A more rigorous analysis of the data showed that the study suffered from batch effects to a degree that invalidated its conclusions. Strikingly, this is not an isolated case: a 2010 review in Nature Reviews Genetics analyzed nine independent genomic studies and concluded that, in all of them, variation linked to batch effects accounted for 32% to 100% of the differences observed between control and experimental groups (7).

There have been many suggestions on how to alleviate the consequences of batch effects, from applying good laboratory practices to randomizing sample processing. Possibly the most effective way to deal with batch effects at the back end of the process is to use bioinformatics tools to identify affected data and eliminate them from the analysis (8,9). The major downside of this option, however, is that extensive manipulation of the data may decrease the statistical power of the study, making its conclusions less robust (10).
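To illustrate the back-end, bioinformatics-based approach in the simplest possible terms, the sketch below simulates a batch shift, detects it with a principal component analysis, and removes it by centering each gene within each batch. This is only a toy stand-in for dedicated adjustment tools such as those evaluated in references 8 and 9; the expression matrix, batch labels, and variable names are hypothetical.

```python
# Minimal sketch: detecting and removing a simulated batch effect.
# Hypothetical toy data; dedicated tools implement far more careful corrections.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 40 samples x 100 genes, processed in two batches.
# Batch 2 carries an artificial technical shift on every gene.
n_samples, n_genes = 40, 100
expr = pd.DataFrame(rng.normal(size=(n_samples, n_genes)))
batches = np.array(["batch1"] * 20 + ["batch2"] * 20)
expr.loc[batches == "batch2"] += 1.5  # simulated batch effect

# Detection: the first principal component separates the batches,
# a common symptom of a dominant batch effect.
pc1 = PCA(n_components=1).fit_transform(expr)[:, 0]
print("mean PC1 per batch:",
      {b: round(pc1[batches == b].mean(), 2) for b in np.unique(batches)})

# Naive correction: center each gene within each batch.
corrected = expr.copy()
for b in np.unique(batches):
    mask = batches == b
    corrected.loc[mask] = expr.loc[mask] - expr.loc[mask].mean()

pc1_corr = PCA(n_components=1).fit_transform(corrected)[:, 0]
print("mean PC1 per batch after centering:",
      {b: round(pc1_corr[batches == b].mean(), 2) for b in np.unique(batches)})
```

After centering, the batches no longer separate along the first principal component; in a real study this kind of aggressive adjustment is exactly what can erode statistical power if biological groups are confounded with batches (10).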

The alternative is to focus on the front end of the genomics workflow and eliminate as many potential sources of technical variability as possible. Manual sample handling is a significant cause of technical variability, especially when the number of specimens is high. Pipetting errors, switched tubes, inconsistent incubation times, and differential manipulation of samples are all potential causes of technical artifacts that cannot be controlled reliably by hand. Automating the liquid handling and nucleic acid extraction steps is a simple way to ensure consistent sample quality for downstream analysis (11). For applications such as next-generation sequencing, automated assay setup on a dedicated liquid handler is also recommended to reduce technical variability across samples. Another way to mitigate batch effects is to increase pipeline throughput so that all relevant experimental samples can be processed on the same day. Fully automated workstations that handle everything from primary sample processing to nucleic acid purification can successfully reduce batch effects. This approach can be suboptimal, however, if the automated purification instrument uses a small array of probes and processes samples serially, because handling small groups of samples sequentially introduces yet another potential source of batch effects. High-throughput parallel sample processing is therefore crucial to mitigate possible batch effects.
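Front-end planning also includes the randomization of sample processing mentioned above: spreading experimental groups evenly across processing runs prevents a batch from coinciding with a biological group. The sketch below shows one simple way to do this (a stratified, round-robin assignment); the sample identifiers, group labels, and batch size are hypothetical placeholders.

```python
# Minimal sketch: assign samples to processing batches at random while
# keeping experimental groups balanced across batches.
# Sample IDs, group labels, and batch size are hypothetical.
import random
from collections import defaultdict

random.seed(42)

# Hypothetical study: 24 samples, half controls and half treated.
samples = [(f"S{i:02d}", "control" if i < 12 else "treated") for i in range(24)]
batch_size = 8
n_batches = len(samples) // batch_size

# Shuffle within each experimental group, then deal samples out
# round-robin so every batch receives a similar mix of groups.
by_group = defaultdict(list)
for sample_id, group in samples:
    by_group[group].append(sample_id)
for ids in by_group.values():
    random.shuffle(ids)

batches = defaultdict(list)
turn = 0
for group_ids in by_group.values():
    for sample_id in group_ids:
        batches[turn % n_batches].append(sample_id)
        turn += 1

for batch, ids in sorted(batches.items()):
    print(f"batch {batch + 1}: {ids}")
```

With this layout each batch contains the same proportion of controls and treated samples, so any residual batch-to-batch variation cannot masquerade as a group difference.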

Despite the progress in sample manipulation and data analysis in the field of genomics, batch effects remain a significant hurdle in the interpretation of large data sets. Although we are still in the process of understanding all of the causes of batch effects, steps can be taken to address the issue and improve the consistency of genomics studies. The combination of automation, good laboratory practices, and bioinformatics is rapidly improving data reliability, opening the way for many new applications that will profoundly change the field of applied genomics.

  1. Challenges in irreproducible research. Nature (2014) Special Supplement https://www.nature.com/collections/wjsrmrdnsm
  2. Mobley A, Linder SK, Braeuer R, Zwelling L (2013) A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. PLoS ONE 8(5): e63221
  3. Devailly G, Mantsoki A, Michoel T, Joshi A (2015) Variable reproducibility in genome-scale public data: A case study using ENCODE ChIP sequencing resource. FEBS Letters 589, pages 3866–3870
  4. Scherer A (2009) Batch Effects and Noise in Microarray Experiments: Sources and Solutions. Wiley Series in Probability and Statistics
  5. Parker HS, Leek JT (2012) The practical effect of batch on genomic prediction. Stat Appl Genet Mol Biol 11(3), Article 10
  6. Dressman HK et al. (2007) An Integrated Genomic-Based Approach to Individualized Treatment of Patients With Advanced-Stage Ovarian Cancer. Journal of Clinical Oncology 25(5), pages 517-524, Retracted
  7. Leek JT, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, pages 733-739
  8. Tom JA, et al. (2017) Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinformatics 18:351
  9. Chen C, et al. (2011) Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods. PLoS ONE 6(2): e17238
  10. Nygaard V, Rødland EA, and Hovig E (2016) Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics 17(1), pages 29–39
  11. Schöfl G, et al. (2017) 2.7 million samples genotyped for HLA by next generation sequencing: lessons learned. BMC Genomics 18:161