Professor of Stochastic Modelling, School of Mathematics & Statistics, Newcastle University

# A quick introduction to R

'R' is a programming language for data analysis and statistics. It is free, and very widely used by professional statisticians. It is also very popular in certain application areas, including bioinformatics. R is a dynamically typed interpreted language, and is typically used interactively. It has many built-in functions and libraries, and is extensible, allowing users to define their own functions and procedures using R, C or Fortran. It also has a simple object system.

### Vectors

Vectors are a fundamental concept in R, as many functions operate on and return vectors, so it is best to master these as soon as possible. For the technically inclined, in R, a (numeric) vector is an object consisting of a one-dimensional array of scalars.

```
> rep(1,10)
[1] 1 1 1 1 1 1 1 1 1 1
>
```
Here rep is a function that returns a vector (here, 1 repeated 10 times). You can get documentation for rep by typing
```
> ?rep
```
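As a small illustration of what that help page covers, the sketch below uses the times and each arguments of rep. The input values are arbitrary and chosen only for illustration.

```r
# Repeat the whole vector three times: 1 2 1 2 1 2
r1 <- rep(c(1, 2), times = 3)

# Repeat each element three times before moving on: 1 1 1 2 2 2
r2 <- rep(c(1, 2), each = 3)

print(r1)
print(r2)
```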
You can assign any object (including vectors) using the assignment operator <-, and combine vectors and scalars with the c function.
```
> a<-rep(1,10)
> b<-1:10
> c(a,b)
[1]  1  1  1  1  1  1  1  1  1  1  1  2  3  4  5  6  7  8  9 10
> a+b
[1]  2  3  4  5  6  7  8  9 10 11
> a+2*b
[1]  3  5  7  9 11 13 15 17 19 21
> a/b
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571
[8] 0.1250000 0.1111111 0.1000000
> c(1,2,3)
[1] 1 2 3
>
```
Note that arithmetic operations act element-wise on vectors. To look at any object (function or data), just type its name.
```
> b
[1]  1  2  3  4  5  6  7  8  9 10
```
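To make "element-wise" concrete, the following sketch compares the vectorised expression a+2*b with an explicit loop over the elements. The names v1 and v2 are introduced here purely for illustration; in practice the vectorised form is the one to use.

```r
a <- rep(1, 10)
b <- 1:10

# Vectorised form: the scalar 2 is applied to every element of b
v1 <- a + 2 * b

# Equivalent explicit loop, shown only to spell out what happens per element
v2 <- numeric(length(a))
for (i in seq_along(a)) {
  v2[i] <- a[i] + 2 * b[i]
}

print(all.equal(v1, v2))
```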
To list all of your objects, use ls(). Note that because of the existence of a function called c (and another called t), it is best to avoid using these as variable names in user-defined functions (this is a common source of bugs).

Vectors can be "sliced" very simply:

```
> d<-c(3,5,3,6,8,5,4,6,7)
> d
[1] 3 5 3 6 8 5 4 6 7
> d[4]
[1] 6
> d[2:4]
[1] 5 3 6
> d<7
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
> d[d<7]
[1] 3 5 3 6 5 4 6
>
```
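A few further slicing idioms may be useful. This sketch goes slightly beyond the session above: it combines two conditions, uses which to get positions rather than values, and uses a negative index to drop an element. The names mid, pos and rest are illustrative only.

```r
d <- c(3, 5, 3, 6, 8, 5, 4, 6, 7)

# Combine two conditions with &: keeps 5 6 5 4 6
mid <- d[d > 3 & d < 7]

# which() returns the positions where the condition holds: 1 2 3 4 6 7 8
pos <- which(d < 7)

# A negative index drops that element: everything except the first
rest <- d[-1]

print(mid)
print(pos)
print(rest)
```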
Vectors can be sorted and randomly sampled. The following command generates some lottery numbers.
```
> sort(sample(1:49,6))
[1]  2 17 23 24 25 35
```
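Because sample is random, each run gives different numbers. If a repeatable draw is wanted, one option is to fix the random number seed first with set.seed. The seed value 42 below is an arbitrary choice for illustration.

```r
set.seed(42)                      # arbitrary seed, fixed only for repeatability
draw1 <- sort(sample(1:49, 6))

set.seed(42)                      # resetting the seed reproduces the same draw
draw2 <- sort(sample(1:49, 6))

print(draw1)
print(identical(draw1, draw2))
```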
Get help on sort and sample to see how they work. R is also good at simulating vectors of quantities from standard probability distributions, so
```
> rpois(20,2)
[1] 1 4 0 2 2 3 3 3 3 4 2 2 4 2 1 3 4 1 2 1
```
generates 20 Poisson random variates with mean 2, and
```
> rnorm(5,1,2)
[1] -0.01251322 -0.03181018  0.30426031  3.24302197 -2.04370284
```
generates 5 Normal random variates with mean 1 and standard deviation (not variance) 2. There are lots of functions that act on vectors.
```
> x<-rnorm(50,2,3)
> x
[1]  2.04360635  5.01113289 -1.52215979 -0.19789766  1.41945311 -0.08850784
[7] -0.91161025  3.47199019  6.13447194  4.62796165  0.07600234 -2.99687943
[13]  1.75153104  8.55000833  3.11921624  3.38411717  3.86860456  0.29103619
[19]  1.25823419  3.88427191 -0.77722215 -0.57774833  2.99937058  4.29042603
[25]  6.10597239  2.83832381  3.73618138  4.12999252  6.23009274  1.07251421
[31] -0.19645150  1.77581296  2.08783542  1.62948606  2.74911850  0.44028844
[37]  1.80996899  1.86436309  0.29372974  2.37077354  1.54285955  4.40098545
[43] -3.01913118 -0.23174209  3.58252631  5.18954147  3.61988373  4.08815220
[49]  6.30878505  4.56744882
> mean(x)
[1] 2.361934
> length(x)
[1] 50
> var(x)
[1] 6.142175
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0190  0.3304  2.2290  2.3620  4.0370  8.5500
> cumsum(x)
[1]   2.043606   7.054739   5.532579   5.334682   6.754135   6.665627
[7]   5.754017   9.226007  15.360479  19.988441  20.064443  17.067563
[13]  18.819095  27.369103  30.488319  33.872436  37.741041  38.032077
[19]  39.290311  43.174583  42.397361  41.819613  44.818983  49.109409
[25]  55.215382  58.053705  61.789887  65.919879  72.149972  73.222486
[31]  73.026035  74.801848  76.889683  78.519169  81.268288  81.708576
[37]  83.518545  85.382908  85.676638  88.047412  89.590271  93.991257
[43]  90.972125  90.740383  94.322910  99.512451 103.132335 107.220487
[49] 113.529272 118.096721
> sum(x)
[1] 118.0967
>
```
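The summary functions in the session above are related in simple ways, and checking those relationships is a good way to confirm what each one computes. The sketch below uses freshly simulated data, so the numbers will differ from the session; the seed and the helper names are illustrative assumptions. It also checks, on a large sample, that the third argument of rnorm is the standard deviation rather than the variance.

```r
set.seed(123)              # arbitrary seed so the run is repeatable
x <- rnorm(50, 2, 3)
n <- length(x)

# The final element of cumsum(x) is the overall total
total_from_cumsum <- cumsum(x)[n]

# var() uses the n - 1 divisor
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

print(all.equal(total_from_cumsum, sum(x)))
print(all.equal(var_by_hand, var(x)))
print(all.equal(sd(x)^2, var(x)))

# Large sample with mean 1 and standard deviation 2 (so variance near 4)
z <- rnorm(100000, 1, 2)
print(c(mean = mean(z), sd = sd(z), var = var(z)))
```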

### Plotting

R has lots of great functions for producing publication-quality plots; running demo(graphics) will give an idea of some of the possibilities. Some more basic commands are given below; try them in turn and see what they do.

```
> plot(1:50,cumsum(x))
> plot(1:50,cumsum(x),type="l",col="red")
> plot(x,0.6*x+rnorm(50,0.3))
> hist(x)
> hist(x,20)
> hist(x,freq=FALSE)
> boxplot(x,2*x)
> barplot(d)
>
```
Study the help file for each of these commands to get a feel for the way each can be customised.
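As one example of such customisation, the sketch below adds a title and axis labels and writes the plots to a PDF file rather than the screen. The titles, labels and the temporary file name are illustrative choices, not part of the session above.

```r
set.seed(1)                               # arbitrary seed for repeatability
x <- rnorm(50, 2, 3)

plot_file <- tempfile(fileext = ".pdf")   # hypothetical output location
pdf(plot_file)                            # open a PDF graphics device

# Line plot with a title and axis labels
plot(1:50, cumsum(x), type = "l", col = "red",
     main = "Cumulative sum of x", xlab = "Index", ylab = "cumsum(x)")

# Density-scaled histogram with roughly 20 bins (a second page in the PDF)
hist(x, breaks = 20, freq = FALSE, main = "Histogram of x")

invisible(dev.off())                      # close the device so the file is written
print(file.exists(plot_file))
```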

### User-defined functions

R is a full programming language, and before long, you are likely to want to add your own functions. Consider the following declaration.

```
rchi<-function(n,p=2) {
  x<-matrix(rnorm(n*p),nrow=n,ncol=p)
  y<-x*x
  as.vector(y %*% rep(1,p))
}
```
The first line declares the object rchi to be a function with two arguments, n and p, the second of which defaults to 2 if not specified. Everything between { and } is the function body, which can use the variables n and p as well as any globally defined objects. The second line declares a local variable x to be a matrix with n rows and p columns, whose elements are standard normal random variates. The next line forms a new matrix y whose elements are the squares of the elements of x. The last line computes the matrix-vector product of y and a vector of p ones, then coerces the resulting n by 1 matrix into a vector. The value of the last expression in the function body is the return value of the function.

In fact, this function provides a fairly efficient way of simulating Chi-squared random quantities with p degrees of freedom, but that isn't particularly important. The function is just another R object, and hence can be viewed by entering rchi on a line by itself. It can be edited by doing fix(rchi). The function can be called just like any other, so
```
> rchi(10,3)
[1] 1.847349 5.590369 3.994036 4.243734 2.104224 1.027634 1.119508 6.653095
[9] 5.660968 5.384954
> rchi(10)
[1] 0.09356735 3.63633129 1.34073206 1.79412360 1.46038656 2.67362870
[7] 0.50413958 6.04307710 1.03116671 1.39662895
>
```
generates 10 chi-squared random variates with 3 and 2 degrees of freedom, respectively. You might want to look at my R Hints and Tips for a few tips on working with functions in R. For more information on programming in R, see my computer practicals on Stochastic simulation and MCMC (using R and C).
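For comparison, here is a minimal sketch of the same idea written with rowSums in place of the matrix-vector product. The name rchi2 is hypothetical and not part of the original material. The check at the end relies on the standard facts that a Chi-squared variate with p degrees of freedom has mean p and variance 2p.

```r
# Hypothetical variant of rchi: sum the squared normals across each row
rchi2 <- function(n, p = 2) {
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)
  rowSums(x^2)
}

set.seed(1)                      # arbitrary seed for repeatability
sims <- rchi2(100000, 3)

# Sample moments should be close to the theoretical values 3 and 6
print(c(mean = mean(sims), var = var(sims)))
```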

Of course, in order to use R for data analysis, it is necessary to be able to read data into R from other sources. It is often also desirable to be able to output data from R in a format that can be read by other applications. Unsurprisingly, R has a range of functions for accomplishing these tasks, but we shall just look here at the two simplest.

The simplest way to get data into R is to read a list of numbers from a text file using the scan command. So, to load data from the file scandata.txt, use a command like
```
> x<-scan("scandata.txt")
>
```
In general, you may need to use the getwd and setwd commands to inspect and change the working directory that R is using. Similarly, a vector of numbers can be output with a command like
```
> write(x,"scandata.txt")
>
```
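The two commands can be tried together as a round trip. The sketch below writes a short vector to a temporary file and reads it back; the file location and the values are made up for illustration, so nothing in the working directory is touched.

```r
data_file <- tempfile(fileext = ".txt")   # hypothetical file, created in a temp directory

x <- c(1.5, 2.25, 3, 4.75, 10, 11.5)
write(x, data_file)                       # whitespace-separated numbers

y <- scan(data_file, quiet = TRUE)        # read them back as a numeric vector
print(all.equal(x, y))

unlink(data_file)                         # tidy up the temporary file
```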
More often, we will be concerned with loading tabular data output from a spreadsheet or database or even another stats package. R copes best with whitespace-separated data, but can be persuaded to read other formats with some effort. The key command here is read.table (and the corresponding output command write.table). So to read the data in the file mytable.txt, do
```
> mytab<-read.table("mytable.txt",header=TRUE)
> mytab
   Name Shoe.size Height
1  Fred         9    170
2   Jim        10    180
3  Bill         9    185
4  Jane         7    175
5  Jill         6    170
6 Janet         8    180
>
```
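If you do not have a suitable file to hand, the following sketch builds the same table directly, writes it out with write.table, and reads it back with read.table. The temporary file name is an assumption for illustration. The arguments quote=FALSE and row.names=FALSE are used so that the file looks like a plain whitespace-separated table with a header line.

```r
# Rebuild the example table directly
mytab <- data.frame(
  Name = c("Fred", "Jim", "Bill", "Jane", "Jill", "Janet"),
  Shoe.size = c(9, 10, 9, 7, 6, 8),
  Height = c(170, 180, 185, 175, 170, 180)
)

table_file <- tempfile(fileext = ".txt")  # hypothetical file in a temp directory

# Write a header line plus one whitespace-separated row per person
write.table(mytab, table_file, quote = FALSE, row.names = FALSE)

# Read it back, treating the first line as column names
mytab2 <- read.table(table_file, header = TRUE)
print(mytab2)
```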
Note that R does contain some primitive functions for editing data frames like this (and other objects), so
```
> mytabnew<-edit(mytab)
```
will pop up a simple editor for mytab, and on quitting, the edited version will be stored in mytabnew. Data frames like mytab are a key object type in R, and tend to be used a lot. Here are some ways to interact with data frames.
```
> mytab$Height
[1] 170 180 185 175 170 180
> mytab[,2]
[1]  9 10  9  7  6  8
> plot(mytab[,2],mytab[,3])
> mytab[4,]
  Name Shoe.size Height
4 Jane         7    175
> mytab[5,3]
[1] 170
> mytab[mytab$Name=="Jane",]
  Name Shoe.size Height
4 Jane         7    175
> mytab$Height[mytab$Shoe.size > 8]
[1] 170 180 185
>
```
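The same indexing ideas extend to selecting and reordering whole rows. The sketch below rebuilds the example table so that it runs on its own; the names mean_height, tall and by_height are illustrative only.

```r
mytab <- data.frame(
  Name = c("Fred", "Jim", "Bill", "Jane", "Jill", "Janet"),
  Shoe.size = c(9, 10, 9, 7, 6, 8),
  Height = c(170, 180, 185, 175, 170, 180)
)

# Columns are ordinary vectors, so vector functions apply directly
mean_height <- mean(mytab$Height)

# Keep only the rows where the condition holds (Jim, Bill and Janet)
tall <- mytab[mytab$Height >= 180, ]

# order() gives the row permutation that sorts by Height
by_height <- mytab[order(mytab$Height), ]

print(mean_height)
print(tall)
print(by_height)
```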
Also see the help on source and dump for inputting and outputting objects of other sorts.
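As a rough sketch of how these two fit together, the example below dumps a small function to a file as R code and then reads it back with source. The function square, the temporary file, and the separate environment used to hold the restored copy are all assumptions made for illustration.

```r
# A small object to save; any R object could be used here
square <- function(v) v * v

obj_file <- tempfile(fileext = ".R")   # hypothetical file in a temp directory

# dump() writes R code that recreates the named object
dump("square", file = obj_file)

# source() runs that code; here it is evaluated in a fresh environment
restored <- new.env()
source(obj_file, local = restored)

result <- restored$square(1:3)
print(result)
```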