Introduction to Machine Learning - streamed
The goal of this exam is to perform an analysis on data related to heart disease.
In particular, we want to explore the relationship between a target variable (whether a patient has heart disease or not) and several other variables such as cholesterol level, age, …
The data is in the file 'heartData_simplified.csv', which is a cleaned and simplified version of the UCI heart disease data set.
We ask that you explore the data set and answer the questions in commented R code (or an Rmd file if you know how). You should send your code to monique.zahn@sib.swiss by the 7th of August.
Do not hesitate to comment your code to explain your thought process to us and to detail the conclusions you draw from the analysis.
heartData <- read.csv('heartData_simplified.csv')
# Convert the categorical variables to factors
heartData$target <- as.factor(heartData$target)
heartData$sex <- as.factor(heartData$sex)
heartData$thal <- as.factor(heartData$thal)
Perform a PCA on the age, chol, thalach, ca and oldpeak features. Do the PCA axes help you to visually distinguish patients along categorical features such as target, sex or thal?
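As a possible starting point for the PCA question, here is a minimal sketch rather than a model answer. It assumes base R's prcomp() with centring and scaling, since the five features are on very different scales. The data frame demo is a synthetic stand-in that only borrows the column names; its values are random and carry no meaning, so replace it with heartData when working on the real data.

```r
# Synthetic stand-in (random values, NOT the real data); use heartData instead.
set.seed(1)
n <- 100
demo <- data.frame(
  age     = rnorm(n, mean = 54, sd = 9),
  chol    = rnorm(n, mean = 245, sd = 50),
  thalach = rnorm(n, mean = 150, sd = 23),
  ca      = sample(0:3, n, replace = TRUE),
  oldpeak = rexp(n, rate = 1),
  target  = factor(sample(0:1, n, replace = TRUE))
)

numFeatures <- c("age", "chol", "thalach", "ca", "oldpeak")

# Centre and scale so that no single feature dominates the axes
pca <- prcomp(demo[, numFeatures], center = TRUE, scale. = TRUE)

# Proportion of variance carried by each principal axis
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(varExplained, 3))

# First two axes, coloured by one categorical feature (here: target)
groupColours <- as.integer(demo$target) + 1
plot(pca$x[, 1], pca$x[, 2], col = groupColours, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(demo$target),
       col = seq_along(levels(demo$target)) + 1, pch = 19)
```

The same plot can be repeated with sex or thal in place of target to compare how well the axes separate each grouping.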
Perform a hierarchical clustering on all features except target. Evaluate the quality of your clustering and explore the different options (distance measures, clustering methods).
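One way to approach the clustering question is sketched below, under several assumptions that are not part of the exam text: the Gower dissimilarity from cluster::daisy() (chosen because the features mix numeric and factor columns), cophenetic correlation to compare linkage methods, average silhouette width as the quality measure, and an arbitrary cut at k = 2 clusters. The data frame demo is again a synthetic stand-in with random values; replace it with heartData.

```r
library(cluster)  # daisy() and silhouette()

# Synthetic stand-in (random values, NOT the real data); use heartData instead.
set.seed(1)
n <- 100
demo <- data.frame(
  age     = rnorm(n, mean = 54, sd = 9),
  sex     = factor(sample(0:1, n, replace = TRUE)),
  chol    = rnorm(n, mean = 245, sd = 50),
  thalach = rnorm(n, mean = 150, sd = 23),
  oldpeak = rexp(n, rate = 1),
  ca      = sample(0:3, n, replace = TRUE),
  thal    = factor(sample(1:3, n, replace = TRUE)),
  target  = factor(sample(0:1, n, replace = TRUE))
)

# All features except target
features <- demo[, setdiff(names(demo), "target")]

# Gower dissimilarity copes with mixed numeric and factor columns
d <- daisy(features, metric = "gower")

# Compare linkage methods: correlation between the original dissimilarities
# and the cophenetic distances implied by each dendrogram
linkages <- c("average", "complete", "single")
cophCor <- sapply(linkages, function(m) {
  cor(as.numeric(d), as.numeric(cophenetic(hclust(d, method = m))))
})
print(round(cophCor, 3))

# Cut one tree and measure cluster quality with the average silhouette width
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for illustration
sil <- silhouette(clusters, d)
avgWidth <- mean(sil[, "sil_width"])
print(avgWidth)

plot(hc, labels = FALSE, main = "Average linkage, Gower dissimilarity")
```

Repeating the silhouette step over several values of k and several linkages gives a simple basis for comparing the options.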
regression
target using the other features with the train set