Introduction to Machine Learning - streamed
The goal of this exam is to perform an analysis of data related to heart disease.
In particular, we want to explore the relationship between a target
variable (whether a patient has heart disease or not) and several other variables such as cholesterol level, age, …
The data is in the file 'heartData_simplified.csv', which is a cleaned and simplified version of the UCI heart disease data set.
We ask that you explore the data set and answer the questions in commented R code (or Rmd if you know how). You should send your code to monique.zahn@sib.swiss by the 7th of August.
Do not hesitate to comment your code to explain your thought process to us and to detail the conclusions you draw from the analysis.
# Load the data and convert the categorical columns to factors
heartData <- read.csv('heartData_simplified.csv')
heartData$target <- as.factor(heartData$target)
heartData$sex <- as.factor(heartData$sex)
heartData$thal <- as.factor(heartData$thal)
Perform a PCA on the age, chol, thalach, ca and oldpeak features. Do the PCA axes help you to visually distinguish patients along different categorical features such as target, sex or thal?
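One possible way to approach the PCA question is sketched below. It is not a model answer. It assumes the five features are numeric and scales them before the PCA because they are measured in very different units. So that the sketch runs on its own, it builds a synthetic stand-in data frame (random values, same column names as in the question); with the real data you would use the heartData object loaded above instead.

```r
# Synthetic stand-in (random values, NOT the real data); with the real file,
# use the heartData object read from 'heartData_simplified.csv' instead.
set.seed(1)
n <- 100
heartData <- data.frame(
  age     = round(rnorm(n, 55, 9)),
  chol    = round(rnorm(n, 245, 50)),
  thalach = round(rnorm(n, 150, 22)),
  ca      = sample(0:3, n, replace = TRUE),
  oldpeak = round(rexp(n, 1), 1),
  sex     = factor(sample(0:1, n, replace = TRUE)),
  thal    = factor(sample(1:3, n, replace = TRUE)),
  target  = factor(sample(0:1, n, replace = TRUE))
)

numCols <- c("age", "chol", "thalach", "ca", "oldpeak")

# Centre and scale each feature so that none dominates through its units
pca <- prcomp(heartData[, numCols], center = TRUE, scale. = TRUE)

# Proportion of the total variance carried by each principal component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(varExplained, 3))

# Scores on the first two axes, coloured by one categorical feature
scores <- as.data.frame(pca$x[, 1:2])
plot(scores$PC1, scores$PC2,
     col = as.integer(heartData$target) + 1, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA scores coloured by target")
legend("topright", legend = levels(heartData$target),
       col = seq_along(levels(heartData$target)) + 1, pch = 19,
       title = "target")
```

Repeating the plot with heartData$sex or heartData$thal in place of heartData$target shows whether any of these groupings separate along the first two axes.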
Perform a hierarchical clustering on all features but target. Evaluate the quality of your clustering and explore the different options (distances, clustering method).
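The clustering question could be explored along the following lines. This is a sketch under several assumptions, not a reference solution: it uses the Gower dissimilarity from the cluster package because the features mix numeric columns and factors, it compares three linkage methods, and it cuts each tree into k = 2 clusters purely for illustration. The data frame is again a synthetic stand-in with the column names from the question.

```r
library(cluster)  # daisy() for Gower dissimilarity, silhouette() for quality

# Synthetic stand-in (random values, NOT the real data)
set.seed(1)
n <- 100
heartData <- data.frame(
  age     = round(rnorm(n, 55, 9)),
  chol    = round(rnorm(n, 245, 50)),
  thalach = round(rnorm(n, 150, 22)),
  ca      = sample(0:3, n, replace = TRUE),
  oldpeak = round(rexp(n, 1), 1),
  sex     = factor(sample(0:1, n, replace = TRUE)),
  thal    = factor(sample(1:3, n, replace = TRUE)),
  target  = factor(sample(0:1, n, replace = TRUE))
)

# All features except target
features <- heartData[, setdiff(names(heartData), "target")]

# Gower dissimilarity handles numeric and categorical columns together
d <- daisy(features, metric = "gower")

# Compare linkage methods with two quality measures:
#  - cophenetic correlation: how faithfully the tree preserves the distances
#  - mean silhouette width for a cut into k = 2 clusters (k is an assumption)
methods <- c("average", "complete", "single")
quality <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  clusters <- cutree(hc, k = 2)
  sil <- silhouette(clusters, d)
  c(cophenetic = cor(as.numeric(d), as.numeric(cophenetic(hc))),
    silhouette = mean(sil[, "sil_width"]))
})
print(round(quality, 3))

# Dendrogram for one of the methods
plot(hclust(d, method = "average"), labels = FALSE,
     main = "Average linkage on Gower dissimilarity")
```

Other distances can be tried by changing how d is computed, for instance dist() on the scaled numeric columns only, and other values of k by changing the cutree() call.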
Fit a regression model for target using the other features, with the train set.
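The wording of this question is incomplete, so the sketch below rests on assumptions. Because target is a two-level factor, it uses a logistic regression (glm with family = binomial). The train set is not defined in the text, so a random 70/30 split is used as a hypothetical stand-in. The data frame is a synthetic stand-in with the column names from the questions above.

```r
# Synthetic stand-in (random values, NOT the real data)
set.seed(1)
n <- 100
heartData <- data.frame(
  age     = round(rnorm(n, 55, 9)),
  chol    = round(rnorm(n, 245, 50)),
  thalach = round(rnorm(n, 150, 22)),
  ca      = sample(0:3, n, replace = TRUE),
  oldpeak = round(rexp(n, 1), 1),
  sex     = factor(sample(0:1, n, replace = TRUE)),
  thal    = factor(sample(1:3, n, replace = TRUE)),
  target  = factor(sample(0:1, n, replace = TRUE))
)

# Hypothetical 70/30 train/test split (the split is not given in the text)
set.seed(42)
trainIdx <- sample(seq_len(nrow(heartData)), size = floor(0.7 * nrow(heartData)))
trainSet <- heartData[trainIdx, ]
testSet  <- heartData[-trainIdx, ]

# Logistic regression of target on all other features, fitted on the train set
fit <- glm(target ~ ., data = trainSet, family = binomial)
print(summary(fit)$coefficients)

# Predicted probability of the second factor level, thresholded at 0.5
probs <- predict(fit, newdata = testSet, type = "response")
lv <- levels(heartData$target)
pred <- factor(ifelse(probs > 0.5, lv[2], lv[1]), levels = lv)

# Confusion table and accuracy on the held-out rows
print(table(predicted = pred, observed = testSet$target))
accuracy <- mean(pred == testSet$target)
print(accuracy)
```

On the synthetic stand-in the accuracy is meaningless, since target is random there; the sketch only shows the mechanics of fitting on one subset and evaluating on another.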