Introduction to Machine Learning

Applications of ML in Bioinformatics

Machine learning techniques are applied to extract knowledge from data in several biological domains. The following figure (retrieved from Pedro Larrañaga et al., Briefings in Bioinformatics 7:1, 2006) gives an overview of the main biological problems to which computational methods are being applied.

Classification of the topics where machine learning methods are applied (https://doi.org/10.1093/bib/bbk007)

Examples of different Machine Learning / Data Mining techniques that can be applied to different NGS data analysis pipelines.

An extensive list of applications of Machine Learning in Bioinformatics can be found in Pedro Larrañaga et al., Briefings in Bioinformatics 7:1, 2006.

How to choose the right Machine Learning technique?

Tip 4 in “Ten quick tips for machine learning in computational biology” by Davide Chicco (BioData Mining, Vol. 10, No. 35, 2017) provides a nice overview of what to keep in mind when choosing a Machine Learning technique in Bioinformatics.

Which algorithm should you start with? In short: the simplest one!

Once you understand what kind of biological problem you are trying to solve, and which method category fits your situation, you have to choose the machine learning algorithm with which to start your project. Even though it is always advisable to use multiple techniques and compare their results, deciding which one to start with can be tricky.

Many textbooks suggest selecting a machine learning method by taking into account only the problem representation, while Pedro Domingos (“A few useful things to know about machine learning”, Commun ACM. 2012; 55(10):78–87) suggests also taking into account the cost evaluation and the performance optimization.
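The three aspects named above (representation, evaluation, optimization) can be made concrete with a very small learner. The following is a minimal sketch, not taken from either cited paper: the representation is assumed to be a single-threshold rule on one feature, the evaluation is plain accuracy, and the optimization is an exhaustive search over candidate thresholds. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def stump_predict(threshold, x):
    """Representation: a single-threshold rule on one feature."""
    return 1 if x > threshold else 0


def accuracy(threshold, xs, ys):
    """Evaluation: fraction of samples the rule labels correctly."""
    hits = sum(stump_predict(threshold, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def fit_stump(xs, ys):
    """Optimization: exhaustive search over candidate thresholds."""
    values = sorted(set(xs))
    # One candidate below all values, plus the midpoints between neighbours.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, xs, ys))


# Hypothetical toy data: one made-up feature value per sample, binary labels.
xs = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3, 2.7, 3.1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

best = fit_stump(xs, ys)
print(f"threshold={best:.2f} accuracy={accuracy(best, xs, ys):.3f}")
```

Swapping any one of the three functions (a different rule family, a different score, a different search) yields a different learning algorithm, which is why all three are worth considering when choosing a method.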

This algorithm-selection step, which usually occurs at the beginning of a machine learning journey, can be dangerous for beginners. An inexperienced practitioner might end up choosing a complicated, inappropriate data mining method that leads to bad results and wastes precious time and energy. Therefore, this is our tip for algorithm selection: if undecided, start with the simplest algorithm (Hand DJ, “Classifier technology and the illusion of progress”. Stat Sci. 2006; 21(1):1–14).

By employing a simple algorithm, you will be able to keep everything under control and better understand what is happening while the method is applied. In addition, a simple algorithm tends to generalize better, is less prone to overfitting, and is easier and faster to train than complex methods. As David J. Hand explained, complex models should be employed only if the dataset features provide some reasonable justification for their usage.
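One way to act on this tip is to fit a simple baseline first and compare its training and held-out accuracy with those of a more flexible method. The sketch below uses only the Python standard library and synthetic data, both assumptions made for illustration; in a real project a library such as scikit-learn and real biological data would normally take their place. The simple model is a nearest-centroid classifier, and the flexible model is a 1-nearest-neighbour classifier that memorizes the training set.

```python
import random


def make_data(n, rng):
    """Two overlapping 2-D Gaussian classes (synthetic, illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.5 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(train):
    """Simple model: one mean point per class."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [p for p, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids


def predict_centroid(centroids, point):
    """Assign the class whose centroid is closest."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


def predict_1nn(train, point):
    """Flexible model: copy the label of the closest training point."""
    return min(train, key=lambda item: sq_dist(item[0], point))[1]


def accuracy(predict, data):
    """Fraction of samples for which predict(point) matches the label."""
    return sum(predict(p) == y for p, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(200, rng), make_data(1000, rng)
centroids = fit_centroids(train)

models = {
    "nearest centroid": lambda p: predict_centroid(centroids, p),
    "1-nearest neighbour": lambda p: predict_1nn(train, p),
}
results = {name: (accuracy(f, train), accuracy(f, test))
           for name, f in models.items()}

for name, (train_acc, test_acc) in results.items():
    print(f"{name:20s} train={train_acc:.2f} test={test_acc:.2f}")
```

The 1-nearest-neighbour model fits its own training points perfectly, so the number to watch is the gap between training and held-out accuracy. On overlapping classes like these, a large gap for the flexible model is the typical sign of overfitting, although the exact figures depend on the data and the random seed.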

Slides / material

The slides for this section are available here