NGPHYLO: Tools for large scale phylogenetic analysis

Project funded by FCT (PTDC/CCI-BIO/29676/2017), from 01/10/2018 to 30/09/2022.

Abstract

The current ability to rapidly sequence whole microbial genomes is revolutionizing microbiology and epidemiological surveillance. It has a high impact on the identification of antimicrobial resistance genes and virulence factors, which can have direct clinical application in the treatment of patients, and on the detection of outbreaks in hospital settings or in the food industry, e.g., by monitoring the spread of antimicrobial resistance, an ever-growing concern. It has also allowed more complex phylogenetic analyses based on whole-genome data. The bottleneck has, however, shifted to data analysis. From a computational point of view, a growing concern is how algorithms and tools can be scaled up to analyse thousands of genetic loci in thousands of isolates. This project therefore aims to: (1) research and design efficient and scalable data structures and algorithms that allow phylogenetic analyses at large scale; (2) develop tools suitable for large-scale phylogenetic analysis, deployable in cloud and HPC environments; (3) make tools available as reusable components, enabling the construction of more complex, parametrizable pipeline workflows; (4) develop and integrate intuitive and user-friendly interfaces. The research team reflects the multidisciplinary character of the project, gathering researchers from two national research institutes, INESC-ID and IMM. Prior collaborations of this team in bioinformatics and computational biology focused on typing data analysis, phylogenetic inference and the development of software tools for these tasks.

Results

Papers and preprints

  1. L. Rita, A. P. Francisco, J. Carriço and V. Borges: Community Finding with Applications on Phylogenetic Networks. In XV Mediterranean Conference on Medical and Biological Engineering and Computing – MEDICON 2019, IFMBE Proceedings 76, September 2019.
  2. M. E. Coimbra, A. P. Francisco, L. M. S. Russo, G. De Bernardo, S. Ladra and G. Navarro: On Dynamic Succinct Graph Representations. In 2020 Data Compression Conference (DCC), IEEE, March 2020.
  3. L. M. S. Russo, A. S. D. Correia, G. Navarro and A. P. Francisco: Approximating Optimal Bidirectional Macro Schemes. In 2020 Data Compression Conference (DCC), IEEE, March 2020.
  4. L. M. S. Russo, A. P. Francisco and T. Rocher: Incremental Multiple Longest Common Sub-Sequences. May 2020 (preprint).
  5. D. M. Costa, A. P. Francisco and L. M. S. Russo: Hardness of Modern Games. May 2020 (preprint).
  6. L. M. S. Russo and A. P. Francisco: Small Longest Tandem Scattered Subsequences. June 2020 (preprint).
  7. L. M. S. Russo, T. Dietz, J. R. Figueira, A. P. Francisco and S. Ruzika: Sparsifying parity-check matrices. Applied Soft Computing, 96:106601, November 2020.
  8. R. Mamede, P. Vila-Cerqueira, M. Silva, J. A. Carriço and M. Ramirez: Chewie Nomenclature Server (chewie-NS): a deployable nomenclature server for easy sharing of core and whole genome MLST schemas. Nucleic Acids Research, 49(D1):D660–D666, January 2021.
  9. M. E. Coimbra, A. P. Francisco and L. Veiga: An analysis of the graph processing landscape. Journal of Big Data, 8(55), April 2021 (arXiv).
  10. C. Vaz, M. Nascimento, J. A. Carriço, T. Rocher and A. P. Francisco: Distance-based phylogenetic inference from typing data: a unifying view. Briefings in Bioinformatics, 22(3):bbaa147, May 2021.
  11. L. M. S. Russo: Range minimum queries in minimal space. Theoretical Computer Science, 909:19-38, March 2022 (arXiv).
  12. A. P. Francisco, T. Gagie, D. Köppl, S. Ladra and G. Navarro: Graph Compression for Adjacency-Matrix Multiplication. SN Computer Science, 3(193), March 2022.
  13. M. E. Coimbra, J. Hrotkó, A. P. Francisco, L. M. S. Russo, G. de Bernardo, S. Ladra and G. Navarro: A practical succinct dynamic graph representation. Information and Computation, 285(B):104862, May 2022.
  14. J. N. F. Alves, L. M. S. Russo and A. P. Francisco: Cache-Oblivious Hilbert Curve based Blocking-Scheme for Matrix Transposition. ACM Transactions on Mathematical Software, August 2022.
  15. L. M. S. Russo, D. Castro, A. Ilic, P. Romano and A. D. Correia: Stochastic simulated annealing for directed feedback vertex set. Applied Soft Computing, 129:109607, November 2022.
  16. L. M. S. Russo, D. Costa, R. Henriques, H. Bannai and A. P. Francisco: Order-preserving pattern matching indeterminate strings. Information and Computation, 289(A):104924, November 2022 (arXiv).
  17. M. Luís and C. Vaz: FLOWViZ: Framework for Phylogenetic Processing. November 2022 (preprint).

Theses

  1. L. Rita: Community Finding with Applications on Phylogenetic Networks. MSc thesis, IST, Universidade de Lisboa, June 2019.
  2. A. S. Teixeira: Complex networks analysis from an edge perspective. PhD thesis, IST, Universidade de Lisboa, July 2019.
  3. J. Espada: Large scale phylogenetic inference from noisy data based on minimum weight spanning arborescences. MSc thesis, IST, Universidade de Lisboa, October 2019.
  4. J. F. Alves: Cache-Oblivious Nested Loops Based on Hilbert Curves. MSc thesis, IST, Universidade de Lisboa, November 2019.
  5. M. Oliveira e Costa: Simulation based approach to bacterial evolution. MSc thesis, IST, Universidade de Lisboa, December 2019.
  6. J. Hrotkó: A graph algorithm library based on compact data structures. MSc thesis, IST, Universidade de Lisboa, January 2021.
  7. L. B. Silva: Library of efficient algorithms for phylogenetic analysis. MSc thesis, IST, Universidade de Lisboa, January 2021.
  8. B. Lourenço: A framework for large scale phylogenetic analysis. MSc thesis, IST, Universidade de Lisboa, January 2021.
  9. M. E. Coimbra: Graph Processing: Distributed Frameworks, Approximations and Compact Data Structures. PhD thesis, IST, Universidade de Lisboa, September 2021.
  10. I. Sousa, F. Filipe and A. Baptista: Visualização Radial e Dendograma de árvores filogenéticas [Radial and Dendrogram Visualization of Phylogenetic Trees]. BSc thesis, ISEL, Instituto Politécnico de Lisboa, September 2021.
  11. V. Revés: Force Direct Visualization for Phylogenetic Tree. BSc thesis, ISEL, Instituto Politécnico de Lisboa, September 2021.
  12. F. Pesquita: Learning Data Structures. MSc thesis, IST, Universidade de Lisboa, November 2021.
  13. M. Caires: Computing Geodesic Tree Distances with the GTP Algorithm. MSc thesis, IST, Universidade de Lisboa, December 2021.
  14. F. Sena: Fast matroid intersection and applications. MSc thesis, IST, Universidade de Lisboa, June 2022.
  15. M. Luís: FLOWViZ Framework for Phylogenetic Processing. MSc thesis, ISEL, Instituto Politécnico de Lisboa, December 2022.

Communications

  1. A. S. Teixeira: The impact of network structure on biological and social outcomes. Max Planck Institute for Evolutionary Biology Seminar Series, August 2021.
  2. C. Vaz and A. P. Francisco: Towards the optimization of large-scale phylogenetic trees. In International Conference on Computational Statistics (COMPSTAT), August 2022.
  3. A. P. Francisco, P. T. Monteiro, G. Ribeiro and A. S. Teixeira: On edge significance in large phylogenetic (spanning) trees. In International Conference on Computational Statistics (COMPSTAT), August 2022.
  4. M. Luís and C. Vaz: FLOWViZ: Framework for Phylogenetic Processing (pp 206-217). In Inforum, September 2022.
  5. C. Vaz, A. Correia, I. Sousa, A. Baptista, F. Filipe, M. Ferreira and A. P. Francisco: A JavaScript library for interactive data visualization in phylogenetics (pp 218-229). In Inforum, September 2022.

Tools and libraries

  1. sdk2tree (C++ library): Dynamic and static k2tree graph data structure implementations.
  2. Chewie-NS: Enabling the use of gene-by-gene typing methods through a public and centralized service.
  3. phylolib: A library of efficient algorithms for phylogenetic analysis.
  4. phyloDB: A framework for large scale phylogenetic analysis.
  5. FLOWViZ: An integration framework for phylogenetic processing.
  6. phytotree: Radial and Dendrogram Visualization for Phylogenetic Trees.
  7. Force Direct Visualization for Phylogenetic Trees.

The mailing lists

ngphylo
General discussion concerning this project.

To subscribe to a list, send an email to mailing-list-name+help@thor.inesc-id.pt, replacing mailing-list-name with the name of the list (e.g., ngphylo); the reply explains how to subscribe.