The abundance of genome data produced by new sequencing techniques has put the field of bioinformatics in the spotlight. Bioinformatics is an interdisciplinary field that develops and improves methods for storing, retrieving, organizing and analyzing biological data. As such, a major activity in bioinformatics is the development of software tools that generate useful biological knowledge. In this context, my background in electrical and computer engineering, coupled with extensive experience in working on open research issues in Life Sciences, has provided me with a unique perspective on the algorithmic aspects of bioinformatics approaches, as well as some insights into potential future directions.

My earliest work focused primarily on the design and application of data mining algorithms, applicable both to research questions in bioinformatics, such as protein classification (Psomopoulos et al., 2004; Polychroniadou et al., 2006; Gkekas et al., 2008) and molecular dynamics (Mprouza et al., 2008), and to e-infrastructures (Gkekas et al., 2008; Psomopoulos & Mitkas, 2009; Psomopoulos & Mitkas, 2010). These works consolidated my knowledge of the efficient design of novel data analysis and modeling algorithms, and established a long-standing working relationship with European e-infrastructures, leading to my later selection as an EGI Champion in Bioinformatics in 2013.

Building on this knowledge and experience, the next stage of my research was oriented towards a better understanding of the current situation in Life Sciences, by addressing key questions in Comparative Genomics. One of the first novel approaches in this context was the definition of fuzzy phylogenetic profiles and their application to the detection of genomic idiosyncrasies (Psomopoulos et al., 2013), i.e. sets of genes found in a specific genome with peculiar phylogenetic properties, such as intra-genome correlations or inter-genome relationships. The method was shown to be highly efficient in terms of computational complexity and to scale well, with various uses, including as a validation approach for further studies. A second study focused on the definition of the pangenome and the development of a computational workflow that defined the salient features of the Chlamydiales pangenome and its evolutionary history (Psomopoulos et al., 2012). This work involved the definition and development of a robust computational pipeline for pangenomes, and allowed for a deeper understanding of the biological concepts underlying the evolutionary history of the bacterial order under study. Finally, building on both fuzzy phylogenetic profiles and pangenomes, one of the most recent works is PathTrace, an efficient algorithm for parsimony-based reconstruction of the evolutionary history of individual metabolic pathways (paper under review). By deploying a pangenome-driven approach, it is demonstrated that the inferred patterns are largely insensitive to noise, as opposed to gene content reconstruction methods. In addition, there is a strong indication that the resulting reconstructions are closely correlated with the evolutionary distance of the taxa under study, suggesting that a diligent selection of target pangenomes is essential for maintaining cohesiveness of the method and consistency of the inference, serving as an internal control for an arbitrary selection of queries.

Going beyond the research and development of novel tools and algorithms in bioinformatics, a third research direction is the integration and automation of bioinformatics workflows, especially within the reproducible environment provided by e-infrastructures (Kintsakis et al., 2016). My long-standing and active involvement in two major e-infrastructures, i.e. EGI and RDA, allowed for some significant insights into the future steps for computational sciences, leading to a joint perspective article (Duarte et al., 2015). Additionally, my experience in working with large, publicly available datasets, as well as my role as co-chair of the RDA Data Discovery Paradigms Interest Group, led to an in-depth investigation of the use cases, requirements and best practices for data repositories (Gregory et al., 2018), in collaboration with DANS, Elsevier, the Australian National Data Service and the University of Colorado. At the technical level, another research output in this direction is Hermes, a system for the seamless delivery of containerized bioinformatics workflows in hybrid cloud (HTC) environments (Kintsakis et al., 2017), which combines the theoretical aspects of workflow management systems in bioinformatics with the functionality provided by containers in a cloud environment. An important aspect of this work is that Hermes fosters the reproducibility of scientific workflows by supporting standardization of the software execution environment, thus leading to consistent scientific workflow results and accelerating scientific output.

References

  1. Psomopoulos, F. E., Diplaris, S., & Mitkas, P. A. (2004). A finite state automata based technique for protein classification rules induction. Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics (in Conjunction with ECML/PKDD), 54–60.
  2. Polychroniadou, H. E., Psomopoulos, F. E., & Mitkas, P. A. (2006). G-Class: A Divide and Conquer Application for Grid Protein Classification. Proceedings of the 2nd ADMKD 2006: Workshop on Data Mining and Knowledge Discovery (in Conjunction with ADBIS 2006: The 10th East-European Conference on Advances in Databases and Information Systems), 121–132.
  3. Gkekas, C. N., Psomopoulos, F. E., & Mitkas, P. A. (2008). Exploiting parallel data mining processing for protein annotation. Student EUREKA 2008: 2nd Panhellenic Scientific Student Conference, 242–252.
  4. Mprouza, I. K., Psomopoulos, F. E., & Mitkas, P. A. (2008). AMoS: Agent-based Molecular Simulations. Student EUREKA 2008: 2nd Panhellenic Scientific Student Conference, 175–186.
  5. Gkekas, C. N., Psomopoulos, F. E., & Mitkas, P. A. (2008). A Parallel Data Mining Application for Gene Ontology Term Prediction. 3rd EGEE User Forum, 1.
  6. Psomopoulos, F. E., & Mitkas, P. A. (2009). BADGE: Bioinformatics Algorithm Development for Grid Environments. 13th Panhellenic Conference on Informatics, 93–107.
  7. Psomopoulos, F. E., & Mitkas, P. A. (2010). Bioinformatics algorithm development for Grid environments. Journal of Systems and Software, 83(7), 1249–1257. https://doi.org/10.1016/j.jss.2010.01.051
  8. Psomopoulos, F. E., Mitkas, P. A., & Ouzounis, C. A. (2013). Detection of genomic idiosyncrasies using fuzzy phylogenetic profiles. PLoS ONE, 8(1), e52854. https://doi.org/10.1371/journal.pone.0052854
  9. Psomopoulos, F. E., Siarkou, V. I., Papanikolaou, N., Iliopoulos, I., Tsaftaris, A. S., Promponas, V. J., & Ouzounis, C. A. (2012). The Chlamydiales pangenome revisited: structural stability and functional coherence. Genes, 3(2), 291–319. https://doi.org/10.3390/genes3020291
  10. Kintsakis, A. M., Psomopoulos, F. E., & Mitkas, P. A. (2016). Data-aware optimization of bioinformatics workflows in hybrid clouds. Journal of Big Data, 3(1), 20. https://doi.org/10.1186/s40537-016-0055-2
  11. Duarte, A. M. S., Psomopoulos, F. E., Blanchet, C., Bonvin, A. M. J. J., Corpas, M., Franc, A., Jimenez, R. C., de Lucas, J. M., Nyrönen, T., Sipos, G., & Suhr, S. B. (2015). Future opportunities and trends for e-infrastructures and life sciences: going beyond the grid to enable life science data analysis. Frontiers in Genetics, 6, 197. https://doi.org/10.3389/fgene.2015.00197
  12. Gregory, K., Khalsa, S. J., Michener, W. K., Psomopoulos, F. E., de Waard, A., & Wu, M. (2018). Eleven quick tips for finding research data. PLOS Computational Biology, 14(4), 1–7. https://doi.org/10.1371/journal.pcbi.1006038
  13. Kintsakis, A. M., Psomopoulos, F. E., Symeonidis, A. L., & Mitkas, P. A. (2017). Hermes: Seamless delivery of containerized bioinformatics workflows in hybrid cloud (HTC) environments. SoftwareX, 6, 217–224. https://doi.org/10.1016/j.softx.2017.07.007