An Interactive, Taxonomy-Driven Tool for Genetic Sequence Database Mining

Jarom Schow, Utah Valley University

Biology

DNA and protein sequence data from GenBank and other publicly available databases can be used to perform phylogenetic analysis. However, assembling data sets for taxa of interest from GenBank is a time-consuming, labor-intensive manual process. To improve this process, we have developed a new set of software tools that identify, organize, and present existing sequence data in a way that facilitates data set creation for organisms of interest. The software provides an interactive, taxonomy-driven user interface for viewing and selecting available gene sequence data and exporting it to common genetic analysis file formats. To identify available genetic data, the user selects one or more taxa (species, genus, family, etc.) of interest. The software then identifies all available sequence data for every member of the selected taxa. The sequences are sorted by gene and taxon to determine availability and data coverage. Results are then displayed as a hierarchical taxonomy alongside a list of sequence data organized by gene and availability. This enables the user to quickly identify which genes and taxa currently have the best coverage and to select the desired data for export. A local database implemented with BioSQL and populated with sequence data from GenBank and taxa from the NCBI taxonomy database was used to access and organize the data. The software is written in C++ using the Qt framework for speed, robustness, and cross-platform interoperability.
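
The taxon-expansion and coverage step described above can be illustrated with a minimal sketch in standard C++. This is not the tool's actual code: the names SequenceRecord, GeneCoverage, ChildMap, collectLeaves, and rankGenesByCoverage are hypothetical, the taxonomy is assumed to be an in-memory parent-to-children map rather than a BioSQL query, and coverage is assumed to mean the fraction of selected leaf taxa with at least one sequence for a gene.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record: one sequence entry as it might be read from a local database.
struct SequenceRecord {
    std::string taxon;      // leaf taxon (e.g. species) the sequence belongs to
    std::string gene;       // gene label
    std::string accession;  // placeholder identifier
};

// Hypothetical summary of how well one gene covers the selected taxa.
struct GeneCoverage {
    std::string gene;
    std::set<std::string> taxa;  // selected taxa with at least one sequence
    double fraction;             // taxa.size() / number of selected taxa
};

// Assumed taxonomy representation: parent taxon -> direct children.
using ChildMap = std::map<std::string, std::vector<std::string>>;

// Expand a selected taxon (species, genus, family, ...) into its leaf members.
void collectLeaves(const ChildMap& children, const std::string& taxon,
                   std::set<std::string>& out) {
    auto it = children.find(taxon);
    if (it == children.end() || it->second.empty()) {
        out.insert(taxon);  // no children: treat as a leaf
        return;
    }
    for (const auto& child : it->second) {
        collectLeaves(children, child, out);
    }
}

// Group sequences by gene, count distinct selected taxa per gene,
// and return genes ordered from best to worst coverage.
std::vector<GeneCoverage> rankGenesByCoverage(
        const std::vector<SequenceRecord>& records,
        const std::set<std::string>& selectedTaxa) {
    assert(!selectedTaxa.empty());  // coverage is undefined for an empty selection

    std::map<std::string, std::set<std::string>> taxaByGene;
    for (const auto& rec : records) {
        if (selectedTaxa.count(rec.taxon) > 0) {
            taxaByGene[rec.gene].insert(rec.taxon);
        }
    }

    std::vector<GeneCoverage> result;
    for (const auto& entry : taxaByGene) {
        const double fraction =
            static_cast<double>(entry.second.size()) / selectedTaxa.size();
        result.push_back(GeneCoverage{entry.first, entry.second, fraction});
    }

    // Ties keep the alphabetical gene order inherited from the map.
    std::stable_sort(result.begin(), result.end(),
                     [](const GeneCoverage& a, const GeneCoverage& b) {
                         return a.taxa.size() > b.taxa.size();
                     });
    return result;
}
```

In this sketch, selecting a genus would first be expanded with collectLeaves, and the resulting set passed to rankGenesByCoverage together with the sequence records. The genes at the front of the returned list are the ones with sequences for the most selected taxa, which is the information the interface is described as presenting for export decisions.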