Prior-knowledge modelling in machine learning

1 October 2017

Timo Deist

Thanks to the financial support of the René Vogels Stichting, I visited the Department of Radiotherapy at the Massachusetts General Hospital (MGH)/Harvard Medical School between October and December 2017. Together with Dr. David Craft and his group, I investigated how to enhance existing machine learning methods by including prior knowledge from (biological) simulations. These machine learning methods could then be used in a wide range of applications involving biological systems, e.g., (radiotherapy) treatment outcome prediction.

Currently, many ‘big data’ or artificial intelligence studies of biological systems use the traditional machine learning (ML) approach, in which statistical models are fit to a set of collected samples. Because data collection is a major bottleneck for statistical analyses in the medical and biological sciences, we believe that prior knowledge of the underlying biological system should play a prominent role in these analyses. Biological systems have been studied extensively but, given their complexity, our understanding of their behavior is often only approximate. Current mathematical representations of these systems are therefore not sufficient on their own to make accurate predictions, but their accuracy could potentially be improved by combining them with machine learning methods.

Methods
We developed an approach that links kernelized machine learning algorithms (such as support vector machines) with these approximate (biological) simulation models, so that both the prior knowledge from the simulation models and the statistical information from the sample data can be exploited. The approach, which we called SimKern ML, imposes only a few restrictions, so it can be used in a wide range of applications with different types of simulation models. We tested the method in four artificial scenarios:

a model describing processes in a cell affected by radiation damage,
a biological flowering time model,
a Boolean cancer model,
a network-flow model.

In all four cases, we compared the SimKern ML method to the traditional machine learning approach (Standard ML), which uses only the sample data and no simulation model. We studied how prediction accuracy increases as more samples are provided to both SimKern ML and Standard ML. Furthermore, we investigated how changes to the simulation models affect SimKern ML’s performance.
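
This report does not spell out how a simulation model enters the kernel, so the following is only a minimal sketch of one way the idea could be realised, not the actual SimKern ML implementation. It assumes that two samples count as similar when an approximate simulator produces similar outputs for them, averaged over several draws of the simulator's uncertain parameters. The toy simulator `simulate`, the helper `simulation_kernel`, and the toy data are all hypothetical, and a kernel perceptron stands in for the support vector machine to keep the example free of dependencies.

```python
import math
import random


def simulate(x, theta):
    """Hypothetical approximate simulator (an assumption, not from the report).

    Maps a sample's features x to a short output trajectory, given
    uncertain model parameters theta.
    """
    rate = theta[0] * x[0] + theta[1] * x[1]
    return [math.tanh(rate * t) for t in (0.5, 1.0, 2.0)]


def simulation_kernel(samples, n_draws=20, gamma=1.0, seed=0):
    """Average RBF similarity of simulated outputs over parameter draws."""
    rng = random.Random(seed)
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for _ in range(n_draws):
        # One draw of the uncertain simulator parameters (assumed ranges).
        theta = (rng.uniform(0.5, 1.5), rng.uniform(-1.5, -0.5))
        outputs = [simulate(x, theta) for x in samples]
        for i in range(n):
            for j in range(n):
                d2 = sum((a - b) ** 2 for a, b in zip(outputs[i], outputs[j]))
                K[i][j] += math.exp(-gamma * d2) / n_draws
    return K


def train_kernel_perceptron(K, y, epochs=50):
    """Kernel perceptron on a precomputed kernel (stand-in for an SVM)."""
    n = len(y)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
            if y[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


# Toy data: the label depends on which feature is larger, which the
# simulator captures only approximately.
samples = [(0.2, 0.1), (0.9, 0.3), (0.4, 0.8), (1.0, 0.2), (0.1, 0.9), (0.7, 0.6)]
labels = [1 if x[0] > x[1] else -1 for x in samples]

train_idx = [0, 1, 2, 3]
test_idx = [4, 5]

# Kernel over all samples; the training block is the top-left submatrix.
K_all = simulation_kernel(samples)
K_train = [[K_all[i][j] for j in train_idx] for i in train_idx]
y_train = [labels[i] for i in train_idx]

alpha = train_kernel_perceptron(K_train, y_train)

preds = []
for i in test_idx:
    score = sum(a * y * K_all[j][i] for a, y, j in zip(alpha, y_train, train_idx))
    preds.append(1 if score > 0 else -1)

print("predictions:", preds, "true labels:", [labels[i] for i in test_idx])
```

In this sketch the prior knowledge lives entirely in the kernel matrix, so any learner that accepts a precomputed kernel could replace the perceptron. A Standard ML baseline would instead compute the kernel directly on the raw features.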

Results
Our results illustrate that there is a trade-off between uninformed Standard ML and SimKern ML that depends on the amount of sample data available. When only a few samples are available, SimKern ML provides superior prediction accuracy. When the available sample data grows beyond a certain size, Standard ML approaches outperform SimKern ML. The approximate understanding of a biological system used to build simulation models (and consequently also the SimKern ML models) may introduce a bias in the predictions. Therefore, once enough sample data is available, Standard ML eventually extracts a better representation of the biological system than the (potentially biased) SimKern ML. Because biological and medical studies notoriously lack sufficient samples, there is potential benefit in applying SimKern ML.

Conclusions
We developed a simulation-assisted machine learning methodology, SimKern ML, which allows researchers to link their biological simulation models to machine learning algorithms. We studied the performance of SimKern ML in multiple artificial scenarios and conducted sensitivity analyses of various factors influencing its performance.

During this three-month visit, David Craft and I had many in-depth discussions to develop our methodology. Furthermore, staying in the Boston/Cambridge area allowed me to attend many lectures and conferences hosted by local research institutes (MIT, Broad Institute) that would normally be out of my reach. I value this research visit as a key experience in my PhD studies, both academically and personally. I am therefore very grateful to the René Vogels Stichting for their support.
