Utah's Foremost Platform for Undergraduate Research Presentation
2018 Abstracts

Predicting Transcription Factor Binding Sites Across Multiple Cell Lines

Lucas Pinto; Dane Jo; Ashton Omdahl; Megan McGhie; Caroline Tyler; Shun Sambongi; Caleb Cranney, Brigham Young University

Transcription factors (TFs) are proteins that bind to specific sites in the genome and regulate cellular transcription. Where TFs bind informs how genetic networks are regulated and has proven useful in uncovering the molecular mechanisms of many diseases. Currently, a primary experimental technique for locating TF binding sites (TFBSs) is chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). While this method is the gold standard for finding TFBSs, its use is limited by cost, by the large number of possible combinations of cell types and transcription factors, and by other experimental limitations. Thus, methodological or analytical advances that generalize across cell types and transcription factors while maintaining the high reliability of ChIP-seq would be a great asset to the biology research community. Although ChIP-seq may not be feasible across many cell types, gene expression profiles (through RNA-seq), chromatin accessibility regions (through DNase-seq), genomic data, and DNA shape data are all readily obtained for nearly all cell types and at a lower cost. Given these conditions, we propose a method for predicting transcription factor binding probability using these four data types. The data used for our paper (including ChIP-seq) are provided through an open-source organization called DREAM. Our method is a supervised learning approach. For each individual transcription factor, we extract as many relevant features as possible from the four data types and use them to train the learning algorithm. Based on these features, the learning algorithm (a) implements a voting mechanism based on the cell type it was trained on and (b) trains and tests on all cell types. Features are extracted for designated values in genomic regions defined by a 200-base-pair sliding window that spans nearly the entire genome. Our classifier is a random forest of decision trees.
With our current selection of features, our performance for predicting TF binding probability ranges from 75% to 99%, depending on which data values were used. As we pursue further features to improve our prediction accuracy, we anticipate extracting additional results from various datasets as well as narrowing gene clusters into TF-specific modules.
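As a rough illustration of two steps described above, the 200-base-pair sliding window and the voting across cell types, the following minimal sketch may help. It is not the authors' implementation. The stride of 50 bp, the use of GC content as a feature, the averaging rule for voting, and all function and cell-type names are assumptions made for this example. No random forest is trained here; the per-cell-type probabilities are made-up stand-ins for model outputs.

```python
from collections import Counter

WINDOW = 200  # window width in base pairs, as stated in the abstract
STRIDE = 50   # assumed step between windows; the abstract does not give one


def sliding_windows(sequence, window=WINDOW, stride=STRIDE):
    """Yield (start, subsequence) for every full-width window."""
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]


def gc_content(subsequence):
    """Fraction of G and C bases: a hypothetical sequence-derived feature."""
    counts = Counter(subsequence.upper())
    return (counts["G"] + counts["C"]) / len(subsequence)


def vote(probabilities_by_cell_type):
    """Average binding probabilities from models trained on different
    cell types (an assumed, simple form of voting)."""
    values = list(probabilities_by_cell_type.values())
    return sum(values) / len(values)


# Toy 400 bp sequence; real input would be a reference genome.
toy_sequence = "ACGT" * 100
features = [(start, gc_content(sub))
            for start, sub in sliding_windows(toy_sequence)]

# Made-up probabilities for one window from three hypothetical models.
combined = vote({"cell_A": 0.9, "cell_B": 0.7, "cell_C": 0.2})
```

In a fuller pipeline, each window's feature vector would combine values from all four data types and be passed to a random forest classifier. Averaging is only one way to combine per-cell-type predictions; a majority vote over hard labels would be another.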