Gene Expression Analysis

by Katja Nowick, University of Leipzig

Affymetrix microarrays

Loading data into R

All Affymetrix .CEL files you want to analyze together should be in the same directory. The .CEL files contain the locations and intensities of the probes. Start R. Set the working directory to the directory where the Affymetrix .CEL files are located:


Install the library affy for preprocessing and analysis of Affymetrix expression microarray data:


Some libraries, such as affy, depend on other libraries. If this is the case, R will check if you have them installed already. If not, R will automatically also install these libraries for you. Load the library affy:


To read in all CEL files you use the function ReadAffy(). The information of all microarrays will be stored in an AffyBatch object which we will call here arrays.:


An object is in principle just a complicated variable. But it has special object-specific features and you can do complicated functions with it. We can look at the object arrays by just calling it:


You can see that arrays is an AffyBatch object. You can also see the size of the array, that it is a HG-U133_Plus_2 microarray with 54675 affy IDs. Each affy ID stands for one probe set with which you measure the expression of genes. In total, we have 8 samples in our dataset. Since this is the first time that we are working with HG-U133_Plus_2 microarrays, R installed the annotation for this type of microarray. Next time you ask for the information about the arrays object R does not need to do it again:


To get the sample names of each array in your AffyBatch object use the function sampleNames() on your object arrays:


These are just the file names so far. We know that the first 5 samples are from chimpanzee brains and the next 5 samples are from human brains. So we can assign meaningful sample names to each array:

sampleNames(arrays)=c("Chimp1", "Chimp2", "Chimp3", "Chimp4", "Chimp5", "Human1", "Human2", "Human3", "Human4", "Human5")

Quality control

There are many methods for Quality control; check the affy vignette for a complete overview. Here we will explore three QC methods.

1. Look at the images of the scanned arrays. Are there any scratches or lighter areas? Using the function image() you can open the image of each array in the AffyBatch object (you have to click Enter to move to next image):


If you want to open any particular array of your dataset, use indexing:


In each array you can see the signature of Affymetrix control probes (e.g. the edges or the letters in the left upper corner). None of the arrays have scratches or any other major defects.

2. Look at distribution of expression intensity values of each array. Are they all the same? Do you see any batch effects?:


They are all similar with no obvious batch effects. We will take care of the small differences in intensities between arrays later by normalization.

3. RNA degradation. The function arraysDEG() calculates for each array of the AffyBatch object the expression intensity values of each probe and sorts them from 5' to 3'. For mRNA of good quality you expect the intensity to be about the same across the whole transcript. RNA degradation always starts from the 5' end. Thus, when RNA is degraded, the expression intensity is lower for probes matching the 5' end of the transcript. Looking at the plot by plotAffyRMAdeg() and the slope of intensities for each array by summaryAffyRNAdeg() let's you assess how good the quality of your samples is and if all your samples are of equal quality. Slopes less than 5 are generally considered to represent good quality samples.:


The slopes of all samples are about 2 and approximately similar. Thus we have good quality data.

Calculating expression intensities

The library affy implements multiple methods for the calculation of expression values. Here we are going to explore the methods RMA and MAS5. The function rma() performes background correction, normalization, and calculation of the expression values based on the RMA method. Expression values are calculated based on summarization of the signal intensities of the probes belonging to a probeset. We will store the result of this function in the object arraysRMA:


We can look at the content of arraysRMA:


The arraysRMA object contains rma values of all microarrays in our study, knows the names of our samples, what kind of microarray we were using etc.

The function mas5() does pre-processing of the microarray data based on Mas5 algorithm. It considers the signal intensities from perfect and mismatch probes. This function takes a few minutes:


To get a table with the expression values for each array, we call the function exprs() on our objects arraysRMA and arraysMAS5:


Let's have a look at our tables. How big are they and what is their content? You can use functions we have learned in the first lecture, e.g.:


Note: RMA gives you log2 transformed values while MAS5 gives you untransformed values. Even if you were log2 transforming MAS5 values, you would still not get the same expression values as with the RMA function, because the two methods differ in their algorithms and statistics.

It makes sense to save a table with the calculated expression values to be able to use it in other analyses. Of course you could use the function write.table() that we have learned two in the first lecture. The affy library also offers a more convenient function for saving the tables:

write.exprs(arraysRMA, "RMAvalues.txt")
write.exprs(arraysMAS5, "MAS5values.txt")

Open the two files to see what they look like. You can either use a text editor or Excel. Don't close your R session! Use a different terminal for this.

It usually makes sense, not to consider genes that are not detected (not sufficiently expressed above background) for future analysis. With the function mas5calls() on the object arrays you can determine which genes are significantly detected above background. This function will calculate presence/marginal/absence calls. We will store the result of this function in the object



Let's also save a table with the presence/marginal/absence calls for each gene:

write.exprs(arraysPMA, file="PMAvalues.txt")

Exercise I

1. Produce a boxplot of the microarrays after normalization. How do they differ from the boxplot we produced for the un-normalized data? How do the boxplots for RMA and MAS5 processed data differ?

2. With the function mas5calls() we obtained presence/marginal/absence calls. See the help pages for this function and find out how you can obtain the p-values for calling a probeset detected. Create a table with detection p-values for each probeset and sample and call it arraysDETP. Hint: you need to use another function for doing this. From the help page on mas5calls() you will find out what this other function is called. Then save a table with the detection p-values.

3. Make a table containing the expression values of only the probesets detected in at least 1 sample. Call this table "arraysRMA0.05". Hint: Start with the arraysPMA or arraysDETP table. Create a vector with counts of number of samples each probe is detected in. You can count using the function length(). Add this vector as column to the arraysRMAtable. Make a sub-table of the arraysRMAtable with RMA values of probesets detected in at least one sample. How many probesets were detected in at least one sample? How many probesets were detected in all samples?

Identification of differentially expressed genes

The library multtest allows to perform various statistical tests to identify genes that are differentially expressed between two or more groups of samples. Install and load the package multtest:


Here we will use the library multtest to perform a tests for differentially expression between the two species for each probeset in our table with expression values. We will work with the table that contains only probesets expressed in at least 1 sample. Let's have a look at it again:


We first need to define to which group each sample belongs to. The first 4 samples are from humans, the next 4 from chimpanzees. We will assign the group labels 0 and 1 to them, respectively:


To run the t-test, we need to delete the column with the detection p-values from our arraysRMA0.05 table. We save this as a new table:


Now we can run the t-test using the function mt.teststat() implemented in multtest. This function will perform a t-test for each probeset. We will save the result of this function in the variable stats. This variable will be a vector with results of t-test statistics for each probeset.

stats=mt.teststat(tmp_arraysRMA0.05, groups, test="t")

We can convert the t-values into p-values using the following function:


These are raw (unadjusted) p-values. Since we performed the t-test on so many probesets, we should adjust the raw p-values for multiple testing. Here we do this using the method by Benjamini and Hochberg as an example:

adjp=p.adjust(rawp, method="BH")

Now adjp is a vector with the adjusted p-values of the t-test. We can add this vector as a column to table with the expression values:

arraysRMAstats=cbind(arraysRMA0.05, adjp)

We can save a table with all expression and p-values:

write.table(arraysRMAstats, "stats.txt", col.names=TRUE, row.names=TRUE, sep="\t", quote=TRUE)

Exercise II

1. How many probesets are significant with a) p<0.05, b) p<0.01?

2. Although t-test is commonly used for the analysis of differential expression, it would be more corect to use a non-parametric test, since the observations (gene expression levels of different genes) are not independent from each other. Use the mt.teststat() function and change the arguments so that you run a non-paramtric test on the arraysRMA0.05 table.

3. Calculate the fold change for the significant probesets. Hint: Create a table that contains expression values of only the significant probesets. Call it "arraysRMAstats0.05". Create a vector that will contain the fold changes for each probeset. Loop through the arraysRMAstats0.05 table to calculate the fold change for each probeset. Fold change = 2^(abs(mean Humans - mean Chimps)). Add the vector with the fold changes to the table with expression values.

4. Which gene names belong to the significantly changed probesets? Hint: Use the file HG-U133_Plus_2.annot.csv, which was downloaded from the Affymetrix website. This is the probeset annotation table that Affymetrix provides. Read in the annotation table. Have a look at the table to see in which column the probeset IDs and which column the gene names are. Since both tables have probeset IDs you can use the annotation table to get the gene names for your table with significant probesets (arraysRMAstats0.05) via the probeset IDs. You can loop through your table with significant probesets to pick the respective gene names. Save a table containing the following columns probeset ID, gene name, adjusted p-value, and Fold change. Example:

Probeset_ID Gene_Name adjusted_p FoldChange
1487_at ESRRA 0.00966883021716861 1.02961618826135
1494_f_at CYP2A6 0.00255709198503476 1.00938856217344
1552257_a_at TTLL12 0.00918289865845312 1.12925100545465
1552275_s_at PXK 0.00377940863516511 1.13029726897967
1552279_a_at HCP1 0.000216072645015747 1.00341554938146


We are going to work with a dataset consisting of 3 human and 3 chimpanzee brain transcriptome data, sequenced by Illumina RNA-Seq, paired-end 76 bp (taken from Somel et al. 2012). The reads were mapped to the human and chimpanzee genome, respectively. For each transcript the number of reads were counted. The counts are in the files "....readCount". We will use this dataset to try out two different R libraries for identifying genes differentially expressed between the two species and compare the results. Note, that in this tutorial we will use the terms "genes" and "differentially expressed genes". You can use the same library also for counts on transcripts, exons, ESTs, tags etc. and hence identify differentially expressed transcripts, exons, ESTs, tags etc. The data given to you for this tutorial are counts per transcripts.

We first read in our data with the read counts using the function read.table(). The human and chimpanzee data are stored in two separate files:

Hcounts=read.table("segemehl.hg19.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id")

Ccounts=read.table("segemehl.panTro3.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id")

The tables look similar to the rma tables from before, just that now we have discrete values (counts). The first column contains the RefSeq transcript IDs. For the cimpanzee data we already provide the human RefSeq IDs for the orthologous genes. We need to combine the data into one table:

counts=cbind(Hcounts, Ccounts)

The first 3 samples come from humans, the next three from chimpanzees. Let's rename the columns:

colnames(counts)=c("H1", "H2", "H3", "C1", "C2", "C3")

We also need to create a vector that can be used to assign our data to the respective species:


Now we have prepared our input data and can use the libraries.


DESeq is based on a model using the negative binomial distribution. Start by loading the library:


Using the function newCountDataSet() we can create an object that can be used by DeSeq to perform the analysis. It will hold the count information for each gene for each sample:

counts_DESeq_obj=newCountDataSet(counts, groups)

Have a look at the object:


You got 38439 features - these stand for the number of genes assessed and 6 samples. You also got the sample names stored in the object. For the analysis we need to estimate the effective library size to normalize for. Imagine, a gene has the same number of counts in two samples. But the library size (total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. You can estimate the size factors from the count data using the function estimateSizeFactors():


Have a look at the size factors:


The human samples all have smaller effective library sizes.

Next we need to estimate for each condition a function that allows to predict the dispersion. The core assumption of this method it that the mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across replicates:


Now we can test for differential expression of the genes between the two groups (species) by calling the function nbinomTest(). We provide the counts_DESeq_obj and the names of the groups of our samples to this function. This might take a few minutes:

DESeq_DEgenes=nbinomTest(counts_DESeq_obj, "0", "1")

Have a look at the variable DESeq_DEgenes you produced:


It is a data frame with p values (raw and adjusted), mean values, fold changes, and other results. The Mean is the mean of all 6 samples. Then you got the means of the human and of the chimpanzee samples. The fold change is calculated dividing the mean of chimps by the mean of humans, the p-value comes from the binomial test. Also adjusted p-values are provided and were calculated using the method of Benjamini and Hochberg, which controls for the False Discovery Rate (FDR).

You can visualize your result by plotting the log2 fold changes against the base means and coloring in red those genes that are significant cant at 5% FDR:

plot(DESeq_DEgenes$baseMean,DESeq_DEgenes$log2FoldChange,log="x", pch=20, cex=.1,col = ifelse( DESeq_DEgenes$padj < 0.05, "red", "black" ) )

You can see that you got quite a lot of significantly differentially expressed genes. Genes with higher mean expression seems to lead to higher FC.

To extract the significantly differentially expressed genes we can use one of the next two statements. They do the same thing, just in the first case we use the column name, in the second case the column number for indexing.:


Although you should not attempt to compare between species or conditions without having replicates, the DESeq manual provides some hints on what you can do to attenuate if you have no replicates.


edgeR uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes.:


For the edgeR library we use the function DGEList() to create an object with the sample and read count information:

count_edgeR_obj=DGEList(counts=counts, group=groups)

Here, when you create this object of class DGEList, R already calculates the library size for you.:


You see how samples got assigned to groups (species), what the library sizes are, the counts for each gene etc.

edgeR uses the quantile-adjusted conditional maximum likelihood (qCML) method to estimate the dispersion(s) before comparing two groups of samples. It first determines the common dispersion and then the dispersion for each gene. For the gene-wise dispersion it implements an empirical Bayes strategy for squeezing the gene-wise dispersions towards the common dispersion. This may take a few seconds:


The edgeR test for differential expression is similar to a Fisher's exact test and is based on the qCML method. This may take a few minutes:


To see the top differentially expressed genes use the functiontopTags():


By default, the p-values in this table are adjusted for multiple testing by the Benjamini and Hochberg method. We can also sort the genes by fold change instead of by p-value:

topTags(edgeR_DEgenes, = "logFC")

The edgeR_DEgenes object contains multiple elements. The first one is the table with logFC, logCPM, and p-values for each gene. To get access to this table use edgeR_DEgenes$table:


Then a table with the significant genes can be extracted:


The edgeR manual provides also some hints on what to do if you have no replicates.

Exercises III

1. Compare the lists of genes that are called differentially expressed between the two species by the two methods (DESeq, edgeR). How many genes are DE with each method? How many are DE by both methods? What are the Gene Symbols of the DE genes with both methods? How many genes are differentially expressed with affy, DESEq, and edgeR?

2. Calculate DE genes from the RNA-Seq data mapped by Bowtie (it's the same data as we used before, just mapped with Bowtie instead of with Segemehl). How do the lists of DE genes differ depending on how the reads were mapped? You can chose either DESeq or edgeR for this analysis.

3. There are RNA-Seq counts from blood from 4 individuals from the 1000 Genomes project, 3 Europeans (CEU, FIN, and TSI) and one African. Find out which genes are differentially expressed in the African compared to the Europeans.

4. The files Sample1.CEL - Sample8.CEL contain Affymetrix microarray data from blood cells of 4 humans (1-4) and four chimpanzees (5-8). Which genes are DE between the two species in blood? How many genes are DE in blood and brain?