Authors: Barbara Jetton, Carl E Hjelmen
Mentors: Carl E Hjelmen
Institution: Utah Valley University
This project's emphasis is the creation of an accessible and reusable tool for broader scientific inquiries into evolutionary relatedness. As of January 2023, GenBank contains 2.9 billion nucleotide sequences representing 504,000 distinct species. Despite this abundance of data, comprehensive and up-to-date phylogenies are lacking, which impedes investigation into genetic histories and trait evolution. To address this problem, I am developing an open-source pipeline to expedite the construction of these evolutionary trees. My specific aim is to build a phylogeny for the order Diptera (flies) and use it to investigate the evolution of chromosome number across the more than 2500 species with chromosome count data on karyotype.org. Using R and the packages “reutils”, “ape”, and “seqinr”, I have written reusable, general-purpose scripts. The first script pulls accession numbers from NCBI GenBank for each species based on the requested gene names. The second script uses the curated accession numbers to pull FASTA sequence data and writes one multi-FASTA file per gene, producing the comprehensive dataset needed for alignment and phylogenetic tree construction. This effort will yield updated insights into the evolutionary history of Diptera as it relates to chromosome number, and the results can be used in further research in comparative biology. Additionally, these scripts can be used to investigate and reconstruct phylogenetic information for any species group with sequence data available on GenBank.
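The two-script workflow described above can be illustrated with a minimal sketch in R. This is not the project's actual code: the function names (build_query, search_uids, write_gene_fasta), the search-term format, and the choice of the "nuccore" database are assumptions made for illustration. The sketch relies on reutils::esearch and reutils::uid for the search step, and on ape::read.GenBank and ape::write.FASTA for the download step; both steps require network access. Curating the search hits into one accession number per species is not shown.

```r
# Illustrative sketch only; all function names here are hypothetical.

# Build an Entrez search term for one species and one gene.
# The "[Organism]" and "[Gene]" field tags are an assumed query style.
build_query <- function(species, gene) {
  sprintf('"%s"[Organism] AND %s[Gene]', species, gene)
}

# Step 1 (needs network + reutils): search GenBank nucleotide records and
# return the NCBI record identifiers (UIDs) of the top hits.
search_uids <- function(species, gene, max_hits = 5) {
  hits <- reutils::esearch(build_query(species, gene),
                           db = "nuccore", retmax = max_hits)
  reutils::uid(hits)
}

# Step 2 (needs network + ape): download the sequences for a vector of
# curated accession numbers and write one multi-FASTA file for the gene.
write_gene_fasta <- function(accessions, gene, out_dir = ".") {
  seqs <- ape::read.GenBank(accessions)   # DNAbin list named by accession
  out_file <- file.path(out_dir, paste0(gene, ".fasta"))
  ape::write.FASTA(seqs, out_file)
  invisible(out_file)
}
```

Under these assumptions, one would call search_uids() once per species and gene, review the hits, and then pass the curated accession numbers for each gene to write_gene_fasta(), giving one multi-FASTA file per gene ready for alignment.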