Professor of Stochastic Modelling
School of Mathematics & Statistics
Newcastle University
'R' is a programming language for data analysis and statistics. It is free, and very widely used by professional statisticians. It is also very popular in certain application areas, including bioinformatics. R is a dynamically typed interpreted language, and is typically used interactively. It has many built-in functions and libraries, and is extensible, allowing users to define their own functions and procedures using R, C or Fortran. It also has a simple object system.
Vectors are a fundamental concept in R, as many functions operate on and return vectors, so it is best to master these as soon as possible. For the technically inclined, in R, a (numeric) vector is an object consisting of a one-dimensional array of scalars.
> rep(1,10)
[1] 1 1 1 1 1 1 1 1 1 1
Here rep is a function that returns a vector (here, 1 repeated 10 times). You can get documentation for rep by typing
> ?rep
You can assign any object (including vectors) to a name using the assignment operator <-, and combine vectors and scalars with the c function.
> a<-rep(1,10)
> b<-1:10
> c(a,b)
[1] 1 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 9 10
> a+b
[1] 2 3 4 5 6 7 8 9 10 11
> a+2*b
[1] 3 5 7 9 11 13 15 17 19 21
> a/b
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571
[8] 0.1250000 0.1111111 0.1000000
> c(1,2,3)
[1] 1 2 3
Note that arithmetic operations act element-wise on vectors. To look at any object (function or data), just type its name.
> b
[1] 1 2 3 4 5 6 7 8 9 10
To list all of your objects, use ls(). Note that because of the existence of a function called c (and another called t), it is best to avoid using these as variable names in user-defined functions (this is a common source of bugs).
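To make the element-wise rule concrete, here is a minimal sketch that recomputes a+2*b with an explicit loop and checks that both approaches agree. The names vec, res and i are introduced purely for illustration and are not part of the session above.

```r
a <- rep(1, 10)
b <- 1:10

# Vectorised form: the scalar 2 multiplies every element of b
vec <- a + 2 * b

# Equivalent explicit loop, shown only for comparison
res <- numeric(length(b))
for (i in seq_along(b)) {
  res[i] <- a[i] + 2 * b[i]
}

# Both approaches should give exactly the same vector
identical(vec, res)
```

The vectorised form is shorter and is the usual idiom in R; the loop is there only to show what the single expression is doing.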
Vectors can be "sliced" very simply:
> d<-c(3,5,3,6,8,5,4,6,7)
> d
[1] 3 5 3 6 8 5 4 6 7
> d[4]
[1] 6
> d[2:4]
[1] 5 3 6
> d<7
[1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
> d[d<7]
[1] 3 5 3 6 5 4 6
Vectors can be sorted and randomly sampled. The following command generates some lottery numbers.
> sort(sample(1:49,6))
[1] 2 17 23 24 25 35
Get help on sort and sample to see how they work. R is also good at simulating vectors of quantities from standard probability distributions, so
> rpois(20,2)
[1] 1 4 0 2 2 3 3 3 3 4 2 2 4 2 1 3 4 1 2 1
generates 20 Poisson random variates with mean 2, and
> rnorm(5,1,2)
[1] -0.01251322 -0.03181018 0.30426031 3.24302197 -2.04370284
generates 5 Normal random variates with mean 1 and standard deviation (not variance) 2. There are lots of functions that act on vectors.
> x<-rnorm(50,2,3)
> x
 [1]  2.04360635  5.01113289 -1.52215979 -0.19789766  1.41945311 -0.08850784
 [7] -0.91161025  3.47199019  6.13447194  4.62796165  0.07600234 -2.99687943
[13]  1.75153104  8.55000833  3.11921624  3.38411717  3.86860456  0.29103619
[19]  1.25823419  3.88427191 -0.77722215 -0.57774833  2.99937058  4.29042603
[25]  6.10597239  2.83832381  3.73618138  4.12999252  6.23009274  1.07251421
[31] -0.19645150  1.77581296  2.08783542  1.62948606  2.74911850  0.44028844
[37]  1.80996899  1.86436309  0.29372974  2.37077354  1.54285955  4.40098545
[43] -3.01913118 -0.23174209  3.58252631  5.18954147  3.61988373  4.08815220
[49]  6.30878505  4.56744882
> mean(x)
[1] 2.361934
> length(x)
[1] 50
> var(x)
[1] 6.142175
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0190  0.3304  2.2290  2.3620  4.0370  8.5500
> cumsum(x)
 [1]   2.043606   7.054739   5.532579   5.334682   6.754135   6.665627
 [7]   5.754017   9.226007  15.360479  19.988441  20.064443  17.067563
[13]  18.819095  27.369103  30.488319  33.872436  37.741041  38.032077
[19]  39.290311  43.174583  42.397361  41.819613  44.818983  49.109409
[25]  55.215382  58.053705  61.789887  65.919879  72.149972  73.222486
[31]  73.026035  74.801848  76.889683  78.519169  81.268288  81.708576
[37]  83.518545  85.382908  85.676638  88.047412  89.590271  93.991257
[43]  90.972125  90.740383  94.322910  99.512451 103.132335 107.220487
[49] 113.529272 118.096721
> sum(x)
[1] 118.0967
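The random numbers above will differ on every run. As a rough sketch, the following fixes the random seed with set.seed so that results can be repeated, and then checks how some of the summary functions relate to each other. The seed value 42 and the names n, manual_var and pos are arbitrary choices made for this illustration.

```r
set.seed(42)            # arbitrary seed, used only to make the run repeatable
x <- rnorm(50, 2, 3)    # 50 Normal variates, mean 2, standard deviation 3

n <- length(x)

# var() uses the n-1 divisor (the sample variance)
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(var(x), manual_var)

# The last element of the cumulative sum is the total
all.equal(cumsum(x)[n], sum(x))

# Logical indexing combined with vector functions
pos <- x[x > 0]         # keep only the positive values
length(pos) / n         # proportion of positive values
mean(x > 0)             # same proportion: TRUE counts as 1, FALSE as 0
```

Using all.equal rather than == for the comparisons allows for small floating-point differences.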
R has lots of great functions for producing publication quality plots - running demo(graphics) will give an idea of some of the possibilities. Some more basic commands are given below - try them in turn and see what they do.
> plot(1:50,cumsum(x))
> plot(1:50,cumsum(x),type="l",col="red")
> plot(x,0.6*x+rnorm(50,0.3))
> curve(0.6*x,-5,10,add=TRUE)
> hist(x)
> hist(x,20)
> hist(x,freq=FALSE)
> curve(dnorm(x,2,3),-5,10,add=TRUE)
> boxplot(x,2*x)
> barplot(d)
Study the help file for each of these commands to get a feel for the way each can be customised.
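As one possible illustration of customisation, the sketch below adds axis labels and a title to the cumulative sum plot and writes it to a PDF file rather than the screen. The file name cumsum-plot.pdf, the temporary directory and the label text are all assumptions made for this example; pdf and dev.off are the standard R functions for opening and closing a PDF graphics device.

```r
x <- rnorm(50, 2, 3)

# "cumsum-plot.pdf" is a hypothetical file name, placed in a temporary directory
outfile <- file.path(tempdir(), "cumsum-plot.pdf")

pdf(outfile, width = 6, height = 4)   # open a PDF device (sizes in inches)
plot(1:50, cumsum(x), type = "l", col = "red",
     xlab = "Index", ylab = "Cumulative sum", main = "Running total of x")
dev.off()                             # close the device so the file is written

file.exists(outfile)
```

Replacing outfile with a path of your own choosing will save the figure somewhere more permanent.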
R is a full programming language, and before long, you are likely to want to add your own functions. Consider the following declaration.
rchi<-function(n,p=2) {
  x<-matrix(rnorm(n*p),nrow=n,ncol=p)
  y<-x*x
  as.vector(y %*% rep(1,p))
}
The first line declares the object rchi to be a function with two arguments, n and p, the second of which will default to a value of 2 if not specified. Then everything between { and } is the function body, which can use the variables n and p as well as any globally defined objects. The second line declares a local variable x to be a matrix with n rows and p columns, whose elements are standard normal random variables. The next line forms a new matrix y whose elements are the squares of the elements in x. The last line computes the matrix-vector product of y and a vector of p ones, then coerces the resulting n by 1 matrix into a vector. The result of the last line of the function body is the return value of the function. In fact, this function provides a fairly efficient way of simulating Chi-squared random quantities with p degrees of freedom, but that isn't particularly important. The function is just another R object, and hence can be viewed by entering rchi on a line by itself. It can be edited with fix(rchi). The function can be called just like any other, so
> rchi(10,3)
[1] 1.847349 5.590369 3.994036 4.243734 2.104224 1.027634 1.119508 6.653095
[9] 5.660968 5.384954
> rchi(10)
[1] 0.09356735 3.63633129 1.34073206 1.79412360 1.46038656 2.67362870
[7] 0.50413958 6.04307710 1.03116671 1.39662895
generates two sets of 10 chi-squared random variates, with 3 and 2 degrees of freedom respectively. You might want to look at my R Hints and Tips for a few tips on working with functions in R. For more information on programming in R, see my computer practicals on Stochastic simulation and MCMC (using R and C).
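One simple sanity check on a simulation function like this is to compare sample moments with the known theoretical values: a chi-squared distribution with p degrees of freedom has mean p and variance 2p. The sketch below repeats the definition of rchi so that it is self-contained; the seed, the sample size of 10000 and the names sim and m are arbitrary choices made for illustration.

```r
rchi <- function(n, p = 2) {
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)   # n by p standard normals
  y <- x * x                                      # element-wise squares
  as.vector(y %*% rep(1, p))                      # sum across each row
}

set.seed(1)              # arbitrary seed, for repeatability only
sim <- rchi(10000, 3)
mean(sim)                # should be close to p = 3
var(sim)                 # should be close to 2p = 6

# The matrix-vector product with a vector of ones is just a row sum
m <- matrix(1:6, nrow = 3)^2
all.equal(as.vector(m %*% rep(1, 2)), rowSums(m))
```

The last comparison shows that rowSums(y) would be an alternative way of writing the final line of the function body.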
Of course, in order to use R for data analysis, it is necessary to be able to read data into R from other sources. It is often also desirable to be able to output data from R in a format that can be read by other applications. Unsurprisingly, R has a range of functions for accomplishing these tasks, but we shall just look here at the two simplest.
The simplest way to get data into R is to read a list of numbers from a text file using the scan command. So, to load data from the file scandata.txt, use a command like
> x<-scan("scandata.txt")
Read 50 items
In general, you may need to use the getwd and setwd commands to inspect and change the working directory that R is using. Similarly, a vector of numbers can be output with a command like
> write(x,"scandata.txt")
More often, we will be concerned with loading tabular data output from a spreadsheet or database or even another stats package. R copes best with whitespace-separated data, but can be persuaded to read other formats with some effort. The key command here is read.table (and the corresponding output command write.table). So to read the data in the file mytable.txt, do
> mytab<-read.table("mytable.txt",header=TRUE)
> mytab
   Name Shoe.size Height
1  Fred         9    170
2   Jim        10    180
3  Bill         9    185
4  Jane         7    175
5  Jill         6    170
6 Janet         8    180
Note that R does contain some rudimentary functions for editing data frames like this (and other objects), so
> mytabnew<-edit(mytab)
will pop up a simple editor for mytab, and on quitting, the edited version will be stored in mytabnew. Data frames like mytab are a key object type in R, and tend to be used a lot. Here are some ways to interact with data frames.
> mytab$Height
[1] 170 180 185 175 170 180
> mytab[,2]
[1] 9 10 9 7 6 8
> plot(mytab[,2],mytab[,3])
> mytab[4,]
  Name Shoe.size Height
4 Jane         7    175
> mytab[5,3]
[1] 170
> mytab[mytab$Name=="Jane",]
  Name Shoe.size Height
4 Jane         7    175
> mytab$Height[mytab$Shoe.size > 8]
[1] 170 180 185
Also see the help on source and dump for inputting and outputting objects of other sorts.
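If you do not have a file like mytable.txt to hand, the following sketch builds the same table directly with data.frame, writes it out with write.table, and reads it back with read.table. The temporary file and the names tmp and mytab2 are assumptions made for this example; the data values are copied from the table shown earlier.

```r
# Rebuild the example table by hand
mytab <- data.frame(
  Name      = c("Fred", "Jim", "Bill", "Jane", "Jill", "Janet"),
  Shoe.size = c(9, 10, 9, 7, 6, 8),
  Height    = c(170, 180, 185, 175, 170, 180)
)

# Write it as whitespace-separated text to a temporary file, then read it back
tmp <- tempfile(fileext = ".txt")
write.table(mytab, tmp, row.names = FALSE, quote = FALSE)
mytab2 <- read.table(tmp, header = TRUE)

# Rows with shoe size above 8, keeping only two of the columns
mytab2[mytab2$Shoe.size > 8, c("Name", "Height")]

# Mean height for that subset
mean(mytab2$Height[mytab2$Shoe.size > 8])
```

Setting row.names=FALSE and quote=FALSE keeps the output file in the same plain layout that read.table with header=TRUE expects.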
One of the great things about R is that it comes with a load of excellent documentation (from CRAN). The next thing to work through is the official Introduction to R, which covers more stuff in more depth than this very quick intro. My hints and tips are worth a quick scan. If you are interested in programming in R, you'll probably now want to work through my computer practicals on stochastic simulation using R and C, which also include some basic information on extending R with C. After that, you are probably ready to dive into my R links - good luck!
darren.wilkinson@ncl.ac.uk
http://www.staff.ncl.ac.uk/d.j.wilkinson/