Professor of Stochastic Modelling, School of Mathematics & Statistics, Newcastle University

# Multivariate Data Analysis

## Computer Practical 4

This computer practical can be accessed via the course web page:

It may be helpful to have the course web page, the course notes, and this practical page all open in different tabs of your web browser during this practical session. In particular, it may save time to copy-and-paste R commands rather than re-typing them.

• Work through all of the R code in the notes from p.72 to the end of Chapter 2, paying particular attention to the material relating to the construction and analysis of Principal Components. When you construct the 3d plot of the first 3 principal components of the galaxy data, make sure that you interact with the plot, by clicking and dragging corners of the enclosing cube in order to look at the data from different angles.
• For the nci microarray data, try repeating the PCA given in the notes, but instead using the princomp() function. What goes wrong?
• For the zip.train data set, produce an interactive 3d plot of the first 3 principal components. By interacting with the plot, see that the first 3 principal components do provide enough information to allow classification of most images.
• For the zip.train data set, form a subset of the data corresponding to the digit 3.
  • Form the principal components for these data using the prcomp() function.
  • Produce a scatterplot of the second principal component against the first.
  • Plot images representing the loadings for the first 4 principal components of those images.
  • What proportion of variation is explained by the first 1, 2, 3, and 4 principal components?
• Work through all of the R code in the notes from Chapter 4 (Discrimination and Classification), starting on p.123.
• R contains a famous (old!) dataset called iris, consisting of measurements on 4 quantitative variables together with a fifth qualitative variable giving a species classification. We will use this dataset to try out some classification techniques.
  • Start by using columns 3 and 4 to predict the classification in column 5. Use LDA and compute the misclassification rate.
  • Produce a scatterplot of columns 3 and 4, with points coloured according to the true species, and then highlight the misclassified points.
• You may also use this practical session to work on Project 2 and get help with Project 2.
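
As a starting point for the iris exercise in the list above, one possible approach can be sketched as follows. This is only a sketch under some assumptions: it uses lda() from the MASS package (assumed to be installed, as it is with most standard R installations), it measures the misclassification rate by predicting the same data used for fitting (resubstitution), and the variable names (X, species, fit, wrong) are illustrative rather than taken from the notes. The notes may set this up differently.

```r
# Sketch: LDA on the iris data using columns 3 and 4 only.
library(MASS)  # provides lda()

X <- as.matrix(iris[, 3:4])   # petal length and petal width
species <- iris[, 5]          # true species labels (a factor)

# Fit LDA and predict the class of each training observation
fit <- lda(X, grouping = species)
pred <- predict(fit, X)$class

# Resubstitution misclassification rate (an assumption: no hold-out set is used)
wrong <- pred != species
misclass_rate <- mean(wrong)
print(table(true = species, predicted = pred))
print(misclass_rate)

# Scatterplot coloured by true species, with misclassified points circled
plot(X, col = as.integer(species), pch = 19,
     main = "iris: LDA using columns 3 and 4")
points(X[wrong, , drop = FALSE], pch = 1, cex = 2, lwd = 2)
legend("topleft", legend = levels(species), col = 1:3, pch = 19)
```

Because the rate is computed on the training data, it is likely to be somewhat optimistic; holding out part of the data before fitting would give a more honest estimate.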

 darren.wilkinson@ncl.ac.uk http://www.staff.ncl.ac.uk/d.j.wilkinson/