PhyloFisher is a software package, written in Python 3, that provides a protocol for phylogenomic dataset assembly and data exploration. It aids in the construction and curation of protein sequence-based phylogenomic datasets, conducts post-assembly analyses, and allows visualization of the results. PhyloFisher currently includes a manually curated starting dataset of 240 proteins from 304 eukaryotic taxa representing the full breadth of known diversity in the eukaryotic tree of life. Importantly, this dataset also includes identified paralogs of each of the 240 proteins from all investigated taxa, which is crucial for the identification of probable orthologs. Although PhyloFisher includes this pan-eukaryotic dataset, the tool is flexible and can work with any dataset consisting of protein sequences derived from eukaryotes. Together, these features make PhyloFisher a broadly useful, user-friendly software tool for sophisticated phylogenomic analyses of eukaryotes.
VIDEO DESCRIPTION
Check out these video seminars about PhyloFisher:
https://www.youtube.com/watch?v=TcKhLiUxBTU&t=1s
Presented by Matthew Brown on 5/14/2020 as part of the Protist Genomics Seminar Series.
https://www.youtube.com/watch?v=MJfb5CalWPA
Presented by Martin Kolisko on 9/17/2021 as part of the EXLIR Friday Coffee Science Series.
DOWNLOAD
INSTALLATION: http://github.com/TheBrownLab/PhyloFisher
DATASET DOWNLOAD: https://doi.org/10.6084/m9.figshare.15141900.v1
HOW TO CITE PhyloFisher?
Tice AK*, Žihala D*, Pánek T*, Jones RE", Salomaki E", Nenarokov S, Burki F, Eliáš M, Eme L, Roger AJ, Rokas A, Shen X, Strassert JFH, Kolísko M, Brown MW. 2021. PhyloFisher: A phylogenomic package for resolving eukaryotic relationships. PLoS Biology. 19(8): e3001365. * & " =Contributed equally
UTILITIES PROVIDED IN PhyloFisher
aa_comp_calculator.py: Calculates amino acid composition and performs hierarchical clustering of the data using Euclidean distances, in order to examine whether amino acid composition may contribute to the groupings inferred in the phylogeny.
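The composition and distance steps described for aa_comp_calculator.py can be sketched as follows. This is a minimal illustration, not PhyloFisher's own code: the function names (`aa_composition`, `distance_matrix`, `closest_pair`) and the input layout (a dict mapping taxon name to concatenated sequence) are assumptions made for the example. The resulting pairwise distances are the kind of input a hierarchical clustering routine would take.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the relative frequency of each of the 20 standard amino acids.

    Gaps and ambiguous characters are ignored (an assumption of this sketch).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(supermatrix):
    """Map each unordered taxon pair (sorted by name) to its composition distance."""
    comps = {taxon: aa_composition(seq) for taxon, seq in supermatrix.items()}
    taxa = sorted(comps)
    return {
        (a, b): euclidean(comps[a], comps[b])
        for i, a in enumerate(taxa)
        for b in taxa[i + 1:]
    }


def closest_pair(supermatrix):
    """Return the pair of taxa with the most similar amino acid composition."""
    dists = distance_matrix(supermatrix)
    return min(dists, key=dists.get)
```

Taxa that sit close together in this distance space but are not expected to be related are candidates for composition-driven grouping in the tree.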
astral_runner.py: Generates input files and infers a coalescent-based species tree given a set of single ortholog trees and bootstrap trees using ASTRAL-III.
bipartition_examiner.py: Calculates the observed occurrences of clades of interest in bootstrap trees.
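Counting how often a clade of interest appears in a set of bootstrap trees, as bipartition_examiner.py does, can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: the trees are simple Newick strings without quoted labels, the helper names (`leaf_groups`, `clade_occurrences`) are hypothetical, and a clade counts as present when either it or its complement forms a parenthesised group, so that unrooted trees are handled.

```python
import re


def leaf_groups(newick):
    """Return (all leaf names, set of leaf sets under each parenthesised group)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, groups, leaves = [], set(), set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            group = stack.pop()
            groups.add(frozenset(group))
            if stack:
                stack[-1] |= group  # leaves also belong to the enclosing group
        elif tok not in ",;" and prev != ")":
            # A label directly after ")" is a support value or branch length.
            name = tok.split(":")[0].strip()
            if name:
                leaves.add(name)
                if stack:
                    stack[-1].add(name)
        prev = tok
    return leaves, groups


def clade_occurrences(bootstrap_trees, clade):
    """Count the trees in which `clade` (an iterable of leaf names) is recovered."""
    clade = frozenset(clade)
    hits = 0
    for newick in bootstrap_trees:
        leaves, groups = leaf_groups(newick)
        if clade in groups or frozenset(leaves - clade) in groups:
            hits += 1
    return hits
```

Dividing the count by the number of bootstrap trees gives the observed frequency of the clade.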
fast_site_removal.py: The fastest evolving sites are expected to be the most prone to phylogenetic signal saturation and systematic model misspecification in phylogenomic analyses. This tool will remove the fastest evolving sites within the phylogenomic supermatrix in a stepwise fashion, leading to a user-defined set of new matrices.
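The stepwise removal described for fast_site_removal.py can be sketched as below. The sketch assumes per-site rates have already been estimated by some other program and are supplied as a list; the function name `stepwise_site_removal`, its parameters, and the dict-based alignment layout are hypothetical and chosen only for illustration.

```python
def stepwise_site_removal(alignment, site_rates, step, max_steps=None):
    """Return a list of (n_removed, reduced_alignment) pairs.

    alignment  : dict mapping taxon name to aligned sequence (equal lengths)
    site_rates : one estimated rate per alignment column
    step       : number of additional fastest sites dropped at each step
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    n_sites = len(site_rates)
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("every sequence must have one character per site rate")

    # Column indices ordered from fastest to slowest evolving.
    fastest_first = sorted(range(n_sites), key=lambda i: site_rates[i], reverse=True)

    results = []
    n_removed = step
    while n_removed < n_sites and (max_steps is None or len(results) < max_steps):
        drop = set(fastest_first[:n_removed])
        reduced = {
            taxon: "".join(ch for i, ch in enumerate(seq) if i not in drop)
            for taxon, seq in alignment.items()
        }
        results.append((n_removed, reduced))
        n_removed += step
    return results
```

Each reduced matrix can then be analysed separately to see whether support for a relationship changes as the fastest sites are progressively excluded.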
fast_tax_removal.py: Removes the fastest evolving taxa, based on tip-to-tip branch length. This tool will remove the fastest evolving taxa within the phylogenomic supermatrix in a stepwise fashion, leading to a user-defined set of new matrices with these taxa removed.
genetic_code.py: Checks stop-to-sense and sense-to-sense codon reassignment signal in transcriptome/genome data.
heterotachy.py: Within-site rate variation (heterotachy) has been shown to cause artifactual relationships in molecular phylogenetic reconstructions. This tool will remove the most heterotachous sites within a phylogenomic supermatrix in a stepwise fashion, leading to a user-defined set of new matrices.
mammal_modeler.py: Generates a MAMMaL site-heterogeneous model from a user-supplied tree and supermatrix, with estimated frequencies for a user-defined number of classes, using the methods described in Susko et al. 2018.
random_sample_iteration.py: Randomly subsamples the gene set included in the supermatrix into a set of new matrices. It constructs supermatrices from randomly sampled genes with user-defined options. These include sampling all genes in a random fashion within a user-defined sampling confidence interval and the percentage of subsampling a user requires per sampling step.
rtc_binner.py: Calculates the relative tree certainty (RTC) score of each single ortholog tree in RAxML and bins the trees based on their RTC scores into top 25%, top 50%, and top 75% sets. Supermatrices are constructed from these bins of orthologs.
SR4_class_recoder.py: Attempts to minimize phylogenetic saturation by recoding the input supermatrix into the four-character state scheme of SR4, based on amino acid binning (Susko and Roger 2007).
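The binning step of rtc_binner.py can be illustrated with a few lines of Python. The sketch assumes the RTC scores have already been computed and are passed in as a dict of ortholog name to score; the function name `bin_by_rtc` and the choice to round bin sizes up are assumptions of this example rather than documented behaviour.

```python
from math import ceil


def bin_by_rtc(rtc_scores, fractions=(0.25, 0.50, 0.75)):
    """Return {fraction: orthologs in the top `fraction` by RTC score}.

    rtc_scores : dict mapping ortholog name to its RTC score
    Bin sizes are rounded up (an assumption), so small gene sets still
    yield non-empty bins. The bins are nested: top 25% is within top 50%.
    """
    ranked = sorted(rtc_scores, key=rtc_scores.get, reverse=True)
    return {f: ranked[:ceil(f * len(ranked))] for f in fractions}
```

A supermatrix would then be concatenated from the orthologs listed in each bin.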
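SR4 recoding as used by SR4_class_recoder.py amounts to a character-by-character substitution. The sketch below uses the four SR4 amino acid bins of Susko and Roger (2007). The output symbols A, C, G, and T, the function name `recode_sr4`, and the treatment of ambiguous residues as gaps are assumptions made for this example.

```python
# SR4 bins (Susko and Roger 2007); the state labels are an arbitrary choice.
SR4_GROUPS = {
    "A": "AGNPST",
    "C": "CHWY",
    "G": "DEKQR",
    "T": "FILMV",
}
SR4_MAP = {aa: state for state, members in SR4_GROUPS.items() for aa in members}


def recode_sr4(sequence):
    """Recode an aligned protein sequence into the four SR4 states."""
    recoded = []
    for ch in sequence.upper():
        if ch in SR4_MAP:
            recoded.append(SR4_MAP[ch])
        elif ch in "-?":
            recoded.append(ch)   # keep gaps and missing data as they are
        else:
            recoded.append("-")  # ambiguous residues such as X become gaps (assumption)
    return "".join(recoded)
```

For example, `recode_sr4("MKTAY-X")` returns `"TGAAC--"`.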
taxon_collapser.py: Allows users to combine multiple operational taxonomic units into one single taxon. For example, if a user has multiple proteomes derived from single-cell libraries from a taxon or multiple strains of the same species (or genus etc.), a user may decide to collapse all these strains/libraries into a single hybrid taxon.
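One way to picture what taxon_collapser.py does is the sketch below, which merges several operational taxonomic units into a single hybrid taxon gene by gene. The selection rule used here (keep the member sequence with the most non-gap characters) is an assumption chosen for illustration and may differ from the rule PhyloFisher applies; the function name `collapse_taxa` and the nested-dict input layout are likewise hypothetical.

```python
def collapse_taxa(gene_alignments, members, new_name):
    """Merge the taxa in `members` into one taxon called `new_name`.

    gene_alignments : dict mapping gene name to {taxon name: sequence}
    members         : taxon names to collapse (e.g. strains of one species)
    For each gene, the member sequence with the most informative
    (non-gap, non-X) characters represents the hybrid taxon.
    """
    members = set(members)
    collapsed = {}
    for gene, seqs in gene_alignments.items():
        kept = {taxon: seq for taxon, seq in seqs.items() if taxon not in members}
        candidates = [seq for taxon, seq in seqs.items() if taxon in members]
        if candidates:
            kept[new_name] = max(
                candidates, key=lambda s: sum(ch not in "-?X" for ch in s)
            )
        collapsed[gene] = kept
    return collapsed
```

Genes for which none of the member taxa has a sequence simply lack the hybrid taxon in the output.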
FUNDING
This project was supported primarily by the United States National Science Foundation (NSF) Division of Environmental Biology (DEB) grants 1456054 and 2100888 (http://www.nsf.gov), awarded to MWB. TP’s postdoctoral stay in MWB’s laboratory was supported by the J.W. Fulbright Commission of the Czech Republic. The ME and MK labs are supported by the Czech Science Foundation (https://gacr.cz/) (grants 18-18699S and 18-28103S, respectively) and by the ‘Centre for Research of Pathogenicity and Virulence of Parasites’ (European Regional Development Funds, https://opvvv.msmt.cz/, CZ.02.1.01/0.0/0.0/16_019/0000759). ES was supported by International Mobilities of Researchers of the Biology Centre through Ministerstvo školství, mládeže a tělovýchovy České republiky (MSMT) (https://www.msmt.cz/) (CZ.02.2.69/0.0/0.0/16_027/0008357) and by the Marie Skłodowska-Curie Actions Individual Fellowship CZ SMART program through MSMT (CZ.02.2.69/0.0/0.0/20_079/0017809). Research on phylogenomics in AR’s lab is supported by the National Science Foundation (http://www.nsf.gov) (DEB-1442113). LE is supported by a grant from the European Research Council (https://erc.europa.eu/) (ERC Starting grant 803151). FB thanks Science for Life Laboratory for supporting the work of JFHS in his laboratory, and JFHS thanks the German Research Foundation (https://www.dfg.de/) (DFG; STR1349/2-1, project # 432453260) for support. MK thanks IT4Innovations National Super Computer Center (https://www.it4i.cz/), Technical University of Ostrava, Ostrava, Czech Republic (project #Open-20-18) for providing computational resources.
PhyloFisher DEVELOPMENT TEAM
Software Developers
David Žihala, PhD - Google Scholar | Robert Jones | Serafim Nenarokov
Tree Thinkers
Alexander K Tice, PhD - Google Scholar | Eric Salomaki, PhD - Google Scholar
Core PI Team
Matthew W. Brown, PhD - Google Scholar | Martin Kolisko, PhD - Google Scholar
Fabien Burki, PhD - Google Scholar | Marek Eliáš, PhD - Google Scholar
Laura Eme, PhD - Google Scholar | Tomáš Pánek, PhD - Google Scholar | Andrew Roger, PhD - Google Scholar
Antonis Rokas - Google Scholar | Xing-Xing Shen - Google Scholar
THIRD PARTY SOFTWARE/DATA INCLUDED
If you use PhyloFisher, please also cite these invaluable resources.
ASTRAL-III (Zhang et al., 2018) - Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 2018;19: 153. doi:10.1186/s12859-018-2129-y
BLAST v. 2.9.0 (Camacho et al., 2009) - Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10(1), 421. https://doi.org/10.1186/1471-2105-10-421
BMGE v. 1.1.2 (Criscuolo & Gribaldo, 2010) - Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): A new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10(1), 210. https://doi.org/10.1186/1471-2148-10-210
CD-HIT v. 4.8.1 (Fu et al., 2012) - Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150–3152. https://doi.org/10.1093/bioinformatics/bts565
DIAMOND v. 0.9.24 (Buchfink et al., 2015) - Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176
DIST_EST v.1.0 (Susko et al., 2003) - Susko, E., & Roger, A. J. (2007). On Reduced Amino Acid Alphabets for Phylogenetic Inference. Molecular Biology and Evolution, 24(9), 2139–2150. https://doi.org/10.1093/molbev/msm144
DIVVIER v. 1.01 (Ali et al., 2019) - Ali, R. H., Bogusz, M., & Whelan, S. (2019). Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments. Molecular Biology and Evolution, 36(10), 2340–2351. https://doi.org/10.1093/molbev/msz142
ETE3 3.1.1 (Huerta-Cepas et al., 2016) - Huerta-Cepas, J., Serra, F., & Bork, P. (2016). ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution, 33(6), 1635–1638. https://doi.org/10.1093/molbev/msw046
FastTree v. 2.1.11 (Price et al., 2010) - Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3), e9490. https://doi.org/10.1371/journal.pone.0009490
HMMER v. 3.2.1 (Mistry et al., 2013) - Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A., & Punta, M. (2013). Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Research, 41(12), e121–e121. https://doi.org/10.1093/nar/gkt263
MAFFT v.7.455 (Katoh & Standley, 2013) - Katoh, K., & Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4), 772–780. https://doi.org/10.1093/molbev/mst010
MAMMaL v. 1.1.1 (Susko et al., 2018) - Susko, E., Lincker, L., & Roger, A. J. (2018). Accelerated Estimation of Frequency Classes in Site-heterogeneous Profile Mixture Models. Molecular Biology and Evolution, 35(5), 1266–1283.
OrthoMCL v 5.0 (Chen et al., 2006) - Chen, F., Mackey, A. J., Stoeckert, C. J., Jr, & Roos, D. S. (2006). OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Research, 34(suppl_1), D363–D368. https://doi.org/10.1093/nar/gkj123
PREQUAL v. 1.02 (Whelan et al., 2018) - Whelan, S., Irisarri, I., & Burki, F. (2018). PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics, 34(22), 3929–3930. https://doi.org/10.1093/bioinformatics/bty448
RAxML v. 8.2.12 (Stamatakis, 2014) - Stamatakis, A. (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313. https://doi.org/10.1093/bioinformatics/btu033
trimAl v.1.4.rev15 (Capella-Gutiérrez et al., 2009) - Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972–1973. https://doi.org/10.1093/bioinformatics/btp348