Authors: Kolmykov S.K., Kondrakhin Y.V., Yevshin I.S., Sharipov R.N., Ryabova A.S., Kolpakov F.A.
Title: Population size estimation for quality control of ChIP-Seq datasets
Bibliographic reference: Kolmykov S.K., Kondrakhin Y.V., Yevshin I.S., Sharipov R.N., Ryabova A.S., Kolpakov F.A. Population size estimation for quality control of ChIP-Seq datasets // PLoS ONE. - 2019. - Vol. 14. - Iss. 8. - Art. e0221760. - ISSN 1932-6203.
External systems: DOI: 10.1371/journal.pone.0221760; Russian Science Citation Index (РИНЦ): 41632832; PubMed: 31465497; SCOPUS: 2-s2.0-85071401703; WoS: 000485058200046;
Abstract (eng): Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) is a widely used experimental technology for identifying functional protein-DNA interactions. Databases such as ENCODE, GTRD, ChIP-Atlas and ReMap now systematically collect and annotate large numbers of ChIP-Seq datasets. Comprehensive control of dataset quality is indispensable for selecting the most reliable data for further analysis. In addition to existing quality control metrics, we have developed two novel metrics that make it possible to control false positives and false negatives in ChIP-Seq datasets. For this purpose, we have adapted a well-known population size estimate to determine the unknown number of genuine transcription factor binding regions. The proposed metrics are determined from the overlap of distinct binding sites obtained by processing one ChIP-Seq experiment with different peak callers. The metrics can also be useful for assessing the quality of datasets obtained by processing distinct ChIP-Seq experiments with a given peak caller. We have also shown that these metrics appear to be useful not only for dataset selection but also for comparing peak callers and for identifying site motifs based on ChIP-Seq datasets. The algorithm for determining the false positive control metric and the false negative control metric for ChIP-Seq datasets was implemented as a plugin for the BioUML platform: https://ict.biouml.org/bioumlweb/chipseq_analysis.html.
Keywords: DATABASE; CAPTURE-RECAPTURE; FACTOR-BINDING SITES;
Published: 2019
Cited references:
1. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6; 489:57–74. https://doi.org/10.1038/nature11247 PMID: 22955616
2. Yevshin I, Sharipov R, Kolmykov S, Kondrakhin Y, Kolpakov F. GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res. 2019 Jan; 47(D1):D100–D105. https://doi.org/10.1093/nar/gky1128 PMID: 30445619
3. Oki S, Ohta T, Shioi G, Hatanaka H, Ogasawara O, Okuda Y, et al. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO reports. 2018 Nov 9; 19(12):e46255. https://doi.org/10.15252/embr.201846255 PMID: 30413482
4. Cheneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2018 Jan 4; 46(D1):D267–D275. https://doi.org/10.1093/nar/gkx1092 PMID: 29126285
5. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012; 22(9):1813–1831. https://doi.org/10.1101/gr.136184.111 PMID: 22955991
6. Chao A, Bunge J. Estimating the number of species in a stochastic abundance model. Biometrics. 2002 Sep; 58:531–539. PMID: 12229987
7. Woodward M. Epidemiology: Study Design and Data Analysis. London: Chapman and Hall/CRC; 2013.
8. Hope VD, Hickman M, Tilling K. Capturing crack cocaine use: estimating the prevalence of crack cocaine use in London using capture–recapture with covariates. Addiction. 2005 Sep 15; 100(11):1701–1708. https://doi.org/10.1111/j.1360-0443.2005.01244.x PMID: 16277630
9. Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003 Jul 1; 31(13):3576–3579. https://doi.org/10.1093/nar/gkg585 PMID: 12824369
10. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME Suite. Nucleic Acids Res. 2015 Jul 1; 43(W1):W39–W49. https://doi.org/10.1093/nar/gkv416 PMID: 25953851
11. Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, et al. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016 Jan 4; 44(D1):D116–D125. https://doi.org/10.1093/nar/gkv1249 PMID: 26586801
12. Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018 Jan 4; 46(D1):D260–D266. https://doi.org/10.1093/nar/gkx1126 PMID: 29140473
13. Hume MA, Barrera LA, Gisselbrecht SS, Bulyk ML. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2015 Jan 28; 43(D1):D117–D122.
14. Thomas R, Thomas S, Holloway AK, Pollard KS. Features that define the best ChIP-Seq peak calling algorithms. Brief Bioinform. 2017 May; 18(3):441–450. https://doi.org/10.1093/bib/bbw035 PMID: 27169896
15. Laajala TD, Raghav S, Tuomela S, Lahesmaa R, Aittokallio T, Elo LL. A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics. 2009 Dec 18; 10(1):618.
16. Harmanci A, Rozowsky J, Gerstein M. MUSIC: identification of enriched regions in Chip-Seq experiments using a mappability-corrected multiscale signal processing framework. Genome Biol. 2014 Oct 8; 15(10):474. https://doi.org/10.1186/s13059-014-0474-3 PMID: 25292436
17. Koohy H, Down TA, Spivakov M, Hubbard T. A comparison of peak callers used for DNase-Seq data. PLoS ONE. 2014 May 8; 9(5):e96303. https://doi.org/10.1371/journal.pone.0096303 PMID: 24810143
18. Micsinai M, Parisi F, Strino F, Asp P, Dynlacht BD, Kluger Y. Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic Acids Res. 2012 May 1; 40(9):e70. https://doi.org/10.1093/nar/gks048 PMID: 22307239
19. Guo Y, Mahony S, Gifford DK. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput. Biol. 2012 Aug 9; 8(8):e1002638. https://doi.org/10.1371/journal.pcbi.1002638 PMID: 22912568
20. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008 Sep 17; 9(9):R137. https://doi.org/10.1186/gb-2008-9-9-r137 PMID: 18798982
21. Zhang X, Robertson G, Krzywinski M, Ning K, Droit A, Jones S, et al. PICS: probabilistic inference for ChIP-seq. Biometrics. 2011 Mar 14; 67(1):151–163. https://doi.org/10.1111/j.1541-0420.2010.01441.x PMID: 20528864
22. Narlikar L, Jothi R. ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder. Methods Mol. Biol. 2011 Nov 18; 802:305–322.
23. Chao A. Estimating the population size for capture–recapture data with unequal catchability. Biometrics. 1987 Dec; 43(4):783–791. PMID: 3427163
24. Lanumteang K, Bohning D. An extension of Chao’s estimator of population size based on the first three capture frequency counts. Comput. Stat. Data An. 2011 Feb 22; 55(7):2302–2311.
25. Zelterman D. Robust estimation in truncated discrete distributions with application to capture-recapture experiments. J. Stat. Plan. Inf. 1988 Mar 25; 18(2):225–237.
26. McCrea RS, Morgan BJT. Analysis of Capture-Recapture Data. London: Chapman and Hall/CRC; 2014.
27. Chapman DH. Some properties of the hypergeometric distribution with applications to zoological surveys. Univ. Calif. Publ. Stat. 1951; 1:131–160.
28. Yevshin I, Sharipov R, Valeev T, Kel A, Kolpakov F. GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 2017 Jan; 45(D1):D61–D67. https://doi.org/10.1093/nar/gkw951 PMID: 27924024
29. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010 Oct 15; 26(20):2622–3. https://doi.org/10.1093/bioinformatics/btq488 PMID: 20736340
30. Kolpakov F, Akberdin I, Kashapov T, Kolmykov S, Kondrakhin Y, Kutumova E, et al. BioUML: an integrated environment for systems biology and collaborative analysis of biomedical data. Nucleic Acids Res [Preprint]. 2019 May 27. Available from: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz440/5498754 https://doi.org/10.1093/nar/gkz440.