Guiyuan Lei - iNotes

The use of R packages to analyse data:

I am currently working on identifying differential expression and network inference for microarray data using R packages in Bioconductor. Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data. We use several R packages: affy, affyPLM, smida, limma, time course and GeneNet. These packages have been integrated into Bioconductor. Please refer to Bioconductor on how to install Bioconductor and these packages.

The following is part of this tutorial on using R and Bioconductor to analyse microarray data, including how to load microarray data into R, pre-process microarray data, identify differential expression and infer network. The full tutorial with lots of figures will be available soon. The whole R code will also be available upon request. Alternatively, you can download pdf file for a presentation "Tutorial: analysing Microarray data using BioConductor" which I gave recently.

• Load data

Two strains, wild type and mutants, of the yeast Saccharomyces cerevisiae were monitored for four hours. At each hour three microarray experiments were carried out for each strain. In total we have thirty arrays. The raw data files are named as follows,yeast01.cel, yeast02.cel, ..., yeast30.cel. These files are stored in the directory data2. To load the data, use the following command:

    library(affy)
    fns2 = list.celfiles(path="data2", full.names=TRUE)
    rawdata = ReadAffy(filenames=fns2)

• Extract Cerevisiae probesets

Yeast2 affymetrix chip contains probesets for both cerevisiae and pombe, we only need cerevisiae probesets, so we should extract only the relevant data before normalisation. I have one thread on Filter out pombe probeset from cerevisiae probesets for yeast2 Affymetrix chip in Bioconductor mailing list.

We download s_cerevisiae.msk file from Affymetrix website to filter out the pombe probesets.

    s_cerevisiae<-scan("s_cerevisiae.msk", skip=2, list("", ""))
    pombe_filter_out<-s_cerevisiae[[1]]

    #Get the whole 10928 Probeset ID from yeast2GENENAME
    library(yeast2)
    genenames = as.list(yeast2GENENAME)
    YeastProbeID <- names(genenames) 

    #Get the Probeset ID for Cerevisiae
    CerevisiaeProbeID <- YeastProbeID[-match(pombe_filter_out,YeastProbeID)]
    #Get the Gene Name for Yeast, need ProbeID if the gene name is 'NA'
    YeastGeneName<-character()
    for(i in 1:length(YeastProbeID)){
         YeastGeneName[i]=genenames[i][[1]]
         if(is.na(YeastGeneName[i])){
             YeastGeneName[i]=YeastProbeID[i] 
           }
       }

    #Get the Gene name for Cerevisiae
    CerevisiaeGeneName<-YeastGeneName[-match(pombe_filter_out,YeastProbeID)]

There are 5,028 pombe probesets in this mask file. After filtering we are left with 5,900 cerevisiae and spike-in probesets. For these probesets, we can obtain 4,640 genenames from yeast2GENENAME.

We use the RemoveProbes.r [1] to remove the mappings from the x- and y-coordinates to the pombe probesets in the cdf environment so that when R is told to look at the environment, R cannot detect that those probesets are there.

    #Get the raw subset data for Cerevisiae, use RemoveProbes() function
    #load function to remove pombe probesets
    source("RemoveProbes.r")
    library(yeast2probe)
    cleancdf = cleancdfname("yeast2")
    RemoveProbes(listOutProbes=NULL, pombe_filter_out, "yeast2cdf", "yeast2probe")

• Probeset level to gene level

It's worth mentioning that the expression after RMA is probeset expression. There are usually several probesets map to one gene in Affymetrix. If you prefer use gene level expression instead of probeset expression, you can use the following R code to convert the probeset level expression to gene level expression.

    CerevisiaeGeneNameLevels<-factor(CerevisiaeGeneName)
    #Function to average the expression of probesets which map to same gene
    probeset2genelevel<-function(onesample){ 
       return(tapply(onesample,CerevisiaeGeneNameLevels,mean)) 
      }
    #Do the average for each column/array
    CerevisiaeGeneData<-apply(CerevisiaeProbeData,2,probeset2genelevel)

Based on the November 2007 version of Yeast2GENENAME (in this list, 4640 among 5900 cerevisiae probesets have gene names) from Bioconductor, We found that for yeast2 Affymetrix chip, after filer out pombe probesets, there is mainly one probeset per gene, but there are 35 genes with two probesets, and one gene with three probesets. That is, 5863 genes for 5900 probesets.

• Quality assessment

Before doing conducting more complex statistical analysis, it is advisable to examine the unprocessed data. Bioconductor provides a number of methods for doing this.

    #raw image
    image(rawdata[,1])
    #density plot
    hist(rawdata, lty=1:30, lwd=2)
    legend(14, 0.60, legend=sampleNames(rawdata), lty=1:30, lwd=2)
    #MA plots
    MAplot(rawdata,which=1)
    #RNA degradation
    RNAdeg <- AffyRNAdeg(rawdata)
    plotAffyRNAdeg(RNAdeg)

• Normalising the data

Since the time-course data set is a larger, we will use Robust Multiple-array Average (RMA) to normalise the data. RMA normalises across the set of hybridizations, at the probe level. The data can be normalised via,

    eset.rma=rma(rawdata)
    CerevisiaeProbeData= exprs(eset.rma)

However, the row names of probeset expression rownames(CerevisiaeProbeData) have different order from the vector CerevisiaeProbeID. So

    CerevisiaeGeneName<-CerevisiaeGeneName[match(rownames(CerevisiaeProbeData),CerevisiaeProbeID)]
    CerevisiaeProbeID<-rownames(CerevisiaeProbeData)

to ensure that the correct probeset IDs and names are used.

Boxplot provides a graphical view of the media, quartiles, maximum and minimum of a data set. We can compare the raw data and normalized data. Typically, differences in spread and position are correctted by normalisation.

    library(affyPLM)
    par(mfrow=c(1,2))
    boxplot(rawdata, col="red", main="Cerevisiae Probe intensities")
    boxplot(eset.rma, col="blue", main="Cerevisiae RMA expression values")

• Model fitting for identifying differential Expression

The standard package for identifying differentially expressed genes is limma. The function lmFit() within the limma library fits a linear model for each gene given a series of arrays. The expression data should be log-ratios for two-color array platforms or log-expression values for one-channel platforms. The coefficients of the fitted models describe the differences between the RNA sources hybridised to the arrays.

    library(limma)
    #Design matrix
    levels = c("m0","m1", "m2", "m3", "m4","w0","w1", "w2", "w3", "w4")
    dimnames(eset.matrix)[[2]]= rep(levels, 3)
    TS <- factor(X, levels= levels)
    design <- model.matrix(~0+TS)
    colnames(design) <- levels(TS)
    #Model Fitting
    fit <- lmFit(CerevisiaeProbeData, design) 
    mc <- makeContrasts('m1-w1','m2-w2','m3-w3','m4-w4',levels=design)
    fit2 <- contrasts.fit(fit, mc)
    eb <- eBayes(fit2)

• Rank differential expression

The following R code plots time course expression for top 3 differentially expressed genes according to F-statistic.

modF <- eb$F #$F is to get F-statistic
which.M <- c(seq(1,5), seq(11,15), seq(21,25))
which.W <- c(seq(6,10), seq(16,20), seq(26,30))
par(mfrow=c(1,3),ask=T,cex=0.5)#cex:font size
for(i in 0:2){
    indx <- rank(modF) == nrow(CerevisiaeProbeData)-i
    row1 = CerevisiaeProbeData[indx, which.M]
    row2 = CerevisiaeProbeData[indx, which.W]
    id=CerevisiaeProbeID[indx]
    #genes with multiple probesets occur more than once in CerevisiaeGeneName
    name = CerevisiaeGeneName[indx]
    genetitle<-paste(sprintf("%.30s",id)," ",sprintf("%.30s",name)," Rank=", i+1) 
    time=c(0,1,2,3,4)
    plot(time,row1[1:5],
        ylim=range(min(row1,row2), max(row1,row2)), 
         ylab="Expression", main=genetitle,pch='M',type='b',col=2)
    
    lines(time, row1[6:10] , pch='M', type='b', col=2)
    lines(time, row1[11:15], pch='M', type='b', col=2)
    
   
    lines(time, row2[1:5], pch='W', type='b', col=1)
    lines(time, row2[6:10], pch='W', type='b', col=1)
    lines(time, row2[11:15], pch='W', type='b', col=1)
  }

• Up and down regulated list

The advantage of F-test is that it can rank the differential expression by considering combined contrasts. For the ranked probesets, there is command to tell the up or down regulated for each contrast (corresponding to each time point for our case).

#Check the number of probesets which are significantly differential expression
modFpvalue <- eb$F.p.value #$F.p.value is the p value for F-test
#The adjustment methods include the Bonferroni correction (``bonferroni") 
#in which the p-values are multiplied by the number of comparisons.
selectedgenesindx <- p.adjust(eb$F.p.value,method="bonferroni") < 0.05 
Sig<-modFpvalue[selectedgenesindx]
nsiggenes<-length(Sig) 
results1 <- decideTests(eb, method="global")

modF <- eb$F #$F is F-test
modFordered<-order(modF, decreasing = TRUE)

#Get the significantly differential probesets
CerevisiaeRankProbe<-CerevisiaeProbeID[modFordered[1:nsiggenes]]
CerevisiaeRankGeneName<-CerevisiaeGeneName[modFordered[1:nsiggenes]]

updown<-results1[modFordered[1:nsiggenes],]
write.table(cbind(CerevisiaeRankProbe,CerevisiaeRankGeneName,updown),file="updown.csv",sep=",",row.names=FALSE,col.names=FALSE)

• Heatmap

The top 100 differentially expressed genes according to F-statistic is plotted as heatmap.

#top 100 differentially expressed probesets
ngenes = 100
m = matrix(nrow=ngenes,ncol=30)
rnames = vector("list", length(1))
library(Hmisc)
for(i in 0:(ngenes-1)){
    indx <- rank(modF) == nrow(CerevisiaeProbeData) - i
    m[i+1,] = CerevisiaeProbeData[indx,]
    rnames[i+1]=CerevisiaeGeneName[indx]
    }

rownames(m) = rnames
row3 = m[1:ngenes, which.M] 
row4 = row3 
for(j in 0:2){ 
    for(i in 0:4){ 
        row4[,3*i+j+1] = row3[,i+5*j+1] 
    } 
} 
arraynames = c('r1t0','r2t0','r3t0','r1t1','r2t1','r3t1','r1t2','r2t2','r3t2','r1t3','r2t3','r3t3','r1t4','r2t4','r3t4')
pdf("heatmap.pdf", height = 12, width = 12,paper="A4") 
heatmap(row4,Colv=NA,labCol=arraynames)
dev.off()

• Network inference

GeneNet can infer dynamical association gene networks by calculating regularised estimator of dynamical pairwise correlation and then estimating partial dynamical correlation. The links between two genes in the inferred network represent direct interactions between two genes.

The first step is to create longitudinal R object. The expression data of top 100 differentially expressed genes is stored in a matrix m, where the rows are genes and the 30 columns of m are the arrays. For network inference network for time series data, need to transform m (after transform, rows of m represents 30 arrays) and rearrange the order of row according to time point (new matrix mnew). The first 6 rows of mnew are the 6 arrays of time point 0 (t=0), then the 6 arrays of time point 1 (t=60), the 6 arrays of time point 2 (t=120), the 6 arrays of time point 3 (t=180), , the 6 arrays of time point 4 (t=240).

#Build longitudinal data for top 100 Probesets
#Need to transform data matrix, rows to be arrays
m = t(m)
#Need to rearrange the rows so that the rows are ordered by time points
#Note that TS <- factor(levels, levels= c('m0','m1', 'm2', 'm3', 'm4',
#         'w0','w1', 'w2', 'w3', 'w4'))
#using the value of design matrix, the entry design[i,j] with value 1 means 
#array i is for time point j!!!
#Works for uneven repeats, like repeats=c(6,5,6,6,6) 
mnew = t(matrix(nrow=ngenes,ncol=30))
#Get the index of array ordered by time point
arrayindx<-numeric(0)
ntime=5
for(j in 1:ntime)
{arrayindx<-c(arrayindx,grep(1,design[,j]),grep(1,design[,j+ntime]))
}
mnew<-m[arrayindx,]

We then compute the partial correlations and assign (local) values to all possible edges to allow us to choose the top edges according to a threshold. Finally dot file is output for graphviz to generate a graph. The R code is as following:

    library("GeneNet")
    # step 1: create longitudinal object #
    mlong = as.longitudinal(mnew,repeats=c(6,6,6,6,6), 
                                 time=c(0,1,2,3,4))

    # step 2: compute partial correlations #
    pcor.dyn <- ggm.estimate.pcor(mlong, method = "dynamic")

    # step 3: assign (local) fdr values to all possible edges #
    m.edges <- network.test.edges(pcor.dyn,direct=TRUE,df.ggm=14, df.dir=29)
    dim(m.edges)

    # step 4: construct graph containing the 150 top edges #
    m.net <- extract.network(m.edges, method.ggm="number", cutoff.ggm=150)

    # step 5: plot graph using graphviz #
    node.labels <- colnames(m)
    network.make.dot(filename="net.dot", m.net, node.labels, main="Yeast Network")

• References

W.G. Alvord, J.A. Roayaei, O.A. Quinones and K.T. Schneider (2007) A microarray analysis for differential gene expression in the soybean genome using Bioconductor and R, Briefings in Bioinformatics.

Last modified:
13 Feb, 2008