R is Free Software and runs on a variety of platforms (Unix, Windows, and MacOS). At the time of writing, the current version of R is 2.2.1. For more in-depth information about installation, see the R Installation and Administration Manual. Here we will cover installation on Linux.
Although you can install R on a Linux system the old-fashioned way, either from source code or from pre-compiled binaries, it is much easier with a package manager, since it will spare you from having to worry about dependencies. Here I will show you how to install R on a Red Hat Linux distribution (Fedora Core 3, to be specific) using yum (Yellow Dog Updater Modified).
For starters, you will need root access. Once you are root, you will have to let yum know where to find the necessary installation files. Change your working directory to the yum repository directory:
[root]# cd /etc/yum.repos.d/
The required files are available from one of CRAN's mirror sites. (Just in case you're wondering, CRAN stands for "Comprehensive R Archive Network".) Add a file named CRAN.repo to the directory /etc/yum.repos.d/. The file should have the following contents (substitute the URL for your preferred mirror site):
[CRAN]
name=http://cran.cnr.Berkeley.edu
baseurl=http://cran.cnr.Berkeley.edu/bin/linux/redhat/fc3/i386/
enabled=1
gpgcheck=0
(Note: When I recently tried re-installing the program, no public key was available for the main R installation file on the Berkeley mirror. To work around this problem, I instructed yum to ignore the public key by setting the gpgcheck flag to false (zero). You should first try running yum with this line removed. If the installation fails, you can then add it back and try again.)
Once this has been done, yum will know about CRAN and can do all the work of installing R for you. Just run the following command, and the rest should happen automatically (output of command omitted to save space).
[root]# yum install R
If you already have R installed, you can use yum to ensure that you have the latest version by running this command instead:
[root]# yum update R
R can be run in one of two modes: interactively or non-interactively (running R code from a saved file). For the rest of the tutorial, we will assume that you are running R in interactive mode. A brief discussion of how R can be run non-interactively is provided towards the end.
You can start an interactive R session by running the program from the commandline. When you do so, the program will automatically produce output that looks like this:
[stuart]$ R

R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1 (2005-12-20 r36812)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
R is now running in interactive mode. This means that you can use it in the style of a calculator: you type an instruction, and R executes it as soon as you hit the return key. Below is a sample interactive session with R, where R is used to do some simple arithmetic. Try duplicating this session. Just remember not to type the right angle bracket (>) that every line starts with. It is R's interactive command prompt, and not R code. (The hash marks and everything following them are comments for the reader's use and can be omitted, since they will be ignored by R.)
> 2 + 6   # addition
[1] 8
> 6 / 2   # division
[1] 3
> 2 * 6   # multiplication
[1] 12
> 6 - 2   # subtraction
[1] 4
> 2 ^ 6   # 2 to the 6th power
[1] 64
The same operations can be carried out on vectors of values. In the first line of the following R code, we create a vector of values in sequential order and assign it to the variable v (short for vector). We then view the contents of v before showing a simpler way of creating a vector of sequential values. Finally, we show that mathematical operations performed on a vector apply to all of the items within that vector.
> v = c(1, 2, 3, 4)   # create a vector
> v                   # view contents of vector
[1] 1 2 3 4
> v = 1:4             # easier way of creating numerical sequence
> v
[1] 1 2 3 4
> v + 1               # add 1 to every item in vector
[1] 2 3 4 5
> v ^ 2               # square each item in vector
[1]  1  4  9 16
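As a further illustration of arithmetic on vectors, the sketch below combines two vectors of equal length item by item and then summarizes the result. The vector w and the variable names are invented for this example.

```r
# Hypothetical vectors, invented for illustration
v <- 1:4
w <- c(10, 20, 30, 40)

total  <- v + w       # item-by-item addition of two vectors
scaled <- v * 2       # every item doubled

print(total)
print(scaled)
print(length(v))      # number of items in the vector
print(mean(w))        # arithmetic mean of the items
```

When both vectors have the same length, the operation pairs up the first items, the second items, and so on.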
When you are ready to end your R session, you will need to type q() in order to quit. When you do so, R will ask you whether you want to save your workspace image. The workspace image consists of the variables you created during the session; your command history is saved along with it. If you say yes, hidden files (.RData and .Rhistory) will be saved in the directory where you originally started R. The next time you run R from this directory, it will load these hidden files and retain the data and history from your last session. If you say no, nothing will be saved.
> q()
Save workspace image? [y/n/c]: y
[stuart]$
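Individual variables can also be saved and restored by hand. The following is a rough sketch using save() and load(); the variable name and the temporary file are chosen purely for illustration. Answering y at the quit prompt does something similar for the whole workspace.

```r
v <- 1:4
path <- tempfile(fileext = ".RData")   # temporary file, for illustration only

save(v, file = path)   # write the variable to disk
rm(v)                  # remove it from the workspace
load(path)             # restore it from the file
print(v)
print(ls())            # list the variables currently in the workspace
```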
A critical feature for any statistical package is some means of obtaining data from outside sources: text files, spreadsheets, relational databases, etc. For more complete documentation of the data importation and exportation facilities of R, see the R data manual. Here we will show the basics of importing tabular data into R.
To see how this works, let's look at a spreadsheet with some sample data. We'll use the results of the 2004 European Union Parliament elections (from wikipedia.org), which are saved as an Open Office spreadsheet in EU-2004.sxc. When viewed within Open Office, it should look like this:
As you can see, this spreadsheet consists of ten rows, a header row and nine rows of data. For each row, there are four columns of data: the acronym of a political party, its full name, the number of votes it received in the 2004 election, and the number of seats won.
We will save this data from Open Office into a plain text tabular format for ease of importation into R. To make the parsing of the data as trouble-free as possible, we will use both column separators (commas) and column delimiters (double quotes) (EU-2004.csv).
Once the data has been exported, it can be read by R using the function read.table(), which minimally takes a single argument, the name of the data file to be read. In addition, we will explicitly specify the column separator with the option sep. If the data file contains header information (i.e., a row that labels each column of data), as ours does, the option header should also be set to TRUE in order to ensure that the first line isn't treated as data. If the data is not saved into a variable, R will print a limited number of rows for inspection, as illustrated below:
> read.table("data/EU-2004.csv", sep=",", header=TRUE)
   abbrev                                                           name
1  EPP-ED                     European People's Party/European Democrats
2     PES                                   Party of European Socialists
3    ALDE                  Alliance of Liberals and Democrats for Europe
4   G-EFA                                  Greens/European Free Alliance
5 GUE-NGL Confederal Group of the European United Left/Nordic Green Left
6      ID                                     Independence and Democracy
7     UEN                                  Union for a Europe of Nations
8     FRP                                              Far-Right Parties
9     NAP                                         Non-Affiliated Parties
     votes seats
1 50007368   268
2 38824140   199
3 16935211    88
4 10210061    42
5  9365076    42
6  8007854    38
7  5837438    27
8  5261614    13
9 10569182    15
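If you do not have EU-2004.csv at hand, the import step can be tried on a small stand-in file. The sketch below writes three rows (copied from the table above) to a temporary file, reads them back with read.table(), and inspects the result. The temporary file and the choice of rows are assumptions made for this example.

```r
# Write a tiny comma-separated file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"abbrev","name","votes","seats"',
  '"PES","Party of European Socialists",38824140,199',
  '"ALDE","Alliance of Liberals and Democrats for Europe",16935211,88',
  '"G-EFA","Greens/European Free Alliance",10210061,42'
), path)

d <- read.table(path, sep = ",", header = TRUE)

str(d)           # one line per column: name, type, first few values
print(dim(d))    # number of rows and number of columns
print(names(d))  # column labels taken from the header row
```

Saving the result into a variable (here d) keeps the table available for later commands instead of just printing it.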
Once it's been imported, tabular data can be easily manipulated within R. For example, if we wanted to know the total number of EU seats or the total number of votes, we simply pass one column's worth of data to a function that calculates the total. There are two ways of doing this: bracket notation and dollar sign notation.
Bracket notation allows you to specify particular rows and columns. Once a table has been read and assigned to a variable, specific rows and columns can be accessed as [row(s), column(s)], where row(s) and column(s) are either single integers or vectors of integers. Leaving one of the two positions empty selects every row or every column. These usages are illustrated below:
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)
> d[,3]      # every row of third column
[1] 50007368 38824140 16935211 10210061  9365076  8007854  5837438  5261614
[9] 10569182
> d[1:3,3]   # first three rows of third column
[1] 50007368 38824140 16935211
> d[1,]      # every column of first row
  abbrev                                       name    votes seats
1 EPP-ED European People's Party/European Democrats 50007368   268
> d[1,3:4]   # third and fourth column of first row
     votes seats
1 50007368   268
An alternative to bracket notation is dollar sign notation, which takes advantage of the headers in tabular data. Rather than specifying columns by position, we simply specify the name of the column as it appears in the header, as illustrated below:
> d$votes   # use labels
[1] 50007368 38824140 16935211 10210061  9365076  8007854  5837438  5261614
[9] 10569182
> d$seats   # use labels
[1] 268 199  88  42  42  38  27  13  15
Although bracket notation and dollar sign notation provide equivalent results, it is generally preferable to use dollar sign notation, since it relies on the labels within a data set, which are less likely to change than the relative position of the columns (which may get shifted around if new columns are added).
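Bracket notation can also take column names, and it accepts a condition in the row position. The sketch below builds a small data frame by hand (values copied from the EU-2004 table) so that it can be run without the data file; the variable names by_name, by_index, and big are invented for the example.

```r
# A small data frame built by hand; values copied from the EU-2004 table
d <- data.frame(
  abbrev = c("EPP-ED", "PES", "ALDE", "G-EFA"),
  votes  = c(50007368, 38824140, 16935211, 10210061),
  seats  = c(268, 199, 88, 42)
)

by_name  <- d[, "seats"]         # brackets accept a column name
by_index <- d[, 3]               # the same column, selected by position
big      <- d[d$seats > 100, ]   # only the rows where the condition holds

print(identical(by_name, by_index))
print(big)
```

Selecting by name inside brackets keeps the robustness of dollar sign notation while still allowing rows to be restricted.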
Now that we are able to extract a specific column, obtaining totals is fairly trivial, and requires nothing more than using the column values as input to the function sum(), as shown below:
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)
> sum(d$votes)   # total votes
[1] 155017944
> sum(d$seats)   # total seats
[1] 732
> sum(d[,4])     # ditto
[1] 732
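Totals combine naturally with vector arithmetic. As a small sketch, the following computes each party's share of the seats as a percentage; the seat counts are copied from the table above, and the variable name share is made up for the example.

```r
# Seat counts copied from the EU-2004 table, labeled by party abbreviation
seats <- c(268, 199, 88, 42, 42, 38, 27, 13, 15)
names(seats) <- c("EPP-ED", "PES", "ALDE", "G-EFA", "GUE-NGL",
                  "ID", "UEN", "FRP", "NAP")

# Divide every item by the total, then round to one decimal place
share <- round(100 * seats / sum(seats), 1)

print(share)
print(mean(seats))   # average number of seats per party
```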
With these data manipulation basics, we are now in a position to import data into R for statistical analysis. In the following section, we will see how to create various types of graphs using various data sets saved in tabular format.
One of R's strengths is its graphics capabilities, which allow you to create many different types of graphs. In this tutorial, we will cover a few of the better-known types: pie charts, bar charts/histograms, scatterplots, and line graphs.
A pie chart is commonly used to display the relative proportions of different values in a data set (more technically, the frequency of the levels of a categorical variable). For example, if we wanted to show how many seats the various political parties won in the 2004 European Union Parliament elections, we could display the information in the form of a pie chart. The R code in the interactive session shown below provides a first-pass attempt at this:
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)  # read data and assign to variable 'd'
> seats <- d$seats                                           # obtain 'seats' column and assign to variable
> pie(seats)                                                 # pass variable to pie-drawing command
You can run this code within R yourself. The first line will import the data from the external data file discussed in the previous section. The second line accesses the column of data with the number of seats using dollar sign notation and saves it into a variable called seats. That variable is then passed to the function pie(), which does the work of creating the graph. Once you type the last line, R will open a new popup window containing a graph that should look like this:
This graph is not very useful as it stands. The most obvious shortcoming is that the various slices of the pie are unlabeled. Labels can be assigned using the option labels, which we set using the column of abbreviations from the data set, as shown in the third line. In addition, a title is added using the option main (where \n stands for a line break). Note that when you hit return on an unfinished line, R continues on the next line with a different command prompt, the plus sign (+) rather than the normal right angle bracket (>).
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)  # read data
> pie(d$seats,                                               # create pie chart with seats column
+     labels=d$abbrev,                                       # use abbrev column for slice labels
+     main="2004 EU Parliament Results\n(Seats by Party)")   # add title
The result of running the R code above is a graph that significantly improves upon the previous one:
Although they are frequently used in the popular press, pie charts are unpopular among statisticians. If you consult the R help page for pie() by typing ?pie, you will see the following warning in the "Notes" section: "Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data."
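The help page mentions dot charts as one alternative. As a minimal sketch of what that could look like for the seat counts (values copied from the EU-2004 table, sorted so the largest party appears at the top), one might use the base graphics function dotchart(); the variable names are invented for the example.

```r
# Seat counts and party abbreviations copied from the EU-2004 table
seats  <- c(268, 199, 88, 42, 42, 38, 27, 13, 15)
abbrev <- c("EPP-ED", "PES", "ALDE", "G-EFA", "GUE-NGL",
            "ID", "UEN", "FRP", "NAP")

ord <- order(seats)   # positions that sort the seat counts in ascending order

dotchart(seats[ord],
         labels = abbrev[ord],   # one label per dot
         xlab = "Seats",
         main = "2004 EU Parliament Results\n(Seats by Party)")
```

Each party gets its own line, and the position of the dot along a common scale carries the value, which is the kind of linear judgment the warning says the eye is good at.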
A bar graph or bar chart has rectangular bars of lengths that represent the quantity or frequency of data values. (A histogram is a particular type of bar graph, one that displays only the frequency of data values, rather than their quantity.) The bars can be horizontally or vertically oriented.
Bar graphs are produced in much the same way as pie charts, although the command used to create them, barplot(), differs slightly from pie(). The main difference is that the bar labels are set with the option names.arg rather than labels. (Also note that we have explicitly converted the abbreviations into character data, since they would otherwise be treated as factors for analysis and therefore displayed as numbers.) Below, we make a bar graph of the same data that we used for our pie chart in the last section:
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)      # read data
> barplot(d$seats,                                               # create bar plot on seats won
+         names.arg=as.character(d$abbrev),                      # set labels
+         main="2004 EU Parliament Results\n(Seats by Party)")   # add title
This code will produce the following bar graph:
The main problem with this bar graph is that some of the bar labels are too wide and are therefore omitted by R. There are a number of different ways to solve this problem. The easiest solution is to shrink the bar labels, which can be done by explicitly instructing R to use a smaller font for them. This is done with the option cex.names, which determines the font size for the bar labels. We will shrink them to 70% (0.7) of the default size.
> d <- read.table("data/EU-2004.csv", sep=",", header=TRUE)      # read data
> barplot(d$seats,                                               # create bar plot on seats won
+         names.arg=as.character(d$abbrev),                      # set labels
+         cex.names=0.7,                                         # labels are 70% of normal size
+         col="lightgray",                                       # use light gray bars
+         main="2004 EU Parliament Results\n(Seats by Party)")   # add title
This produces a bar graph in which the bars are light gray and individually labeled:
Note that it is much easier to observe small differences with a bar graph than with a pie chart. For example, it is quite easy to see that the GUE-NGL (Confederal Group of the European United Left/Nordic Green Left) obtained more seats than the ID (Independence and Democracy) in the bar graph, whereas in the pie chart it is difficult, if not impossible, to discern this fact.
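Another possible way of making room for the bar labels is to draw the bars horizontally. The following sketch assumes the same seat counts (copied from the EU-2004 table) and uses the barplot() option horiz together with the graphics parameter las; whether the labels fit without further margin adjustments depends on your graphics window.

```r
# Seat counts and party abbreviations copied from the EU-2004 table
seats  <- c(268, 199, 88, 42, 42, 38, 27, 13, 15)
abbrev <- c("EPP-ED", "PES", "ALDE", "G-EFA", "GUE-NGL",
            "ID", "UEN", "FRP", "NAP")

mids <- barplot(seats,
                names.arg = abbrev,   # one label per bar
                horiz = TRUE,         # draw the bars horizontally
                las = 1,              # print axis labels horizontally
                col = "lightgray",
                main = "2004 EU Parliament Results\n(Seats by Party)")

print(mids)   # barplot() returns the midpoint of each bar
```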
A scatterplot is a graph used in statistics to display the relationship between two sets of related numerical data. Each observation is drawn as a point in a space defined by two scales, which are placed on the horizontal and vertical axes (the well-known x-axis and y-axis, respectively).
To use a linguistic example, let's look at word frequency distributions. More specifically, let's look at the relationship between the total number of words in a text (the token count) and the number of unique words in a text (the type count). For example, consider a sentence such as "Boys will be boys". It has four words, but only three of them are unique, since boys occurs twice. In other words, it consists of four tokens, but only three types.
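To make the distinction concrete, here is a minimal sketch of counting tokens and types in R for the example sentence. It is not the script used to produce the counts in this tutorial; it simply lower-cases the sentence, splits it on spaces, and counts the distinct words.

```r
sentence <- "Boys will be boys"

# Lower-case the text so that "Boys" and "boys" count as the same word,
# then split it into words wherever one or more spaces occur
tokens <- strsplit(tolower(sentence), " +")[[1]]
types  <- unique(tokens)

print(length(tokens))   # token count
print(length(types))    # type count
print(table(tokens))    # how often each type occurs
```

Real texts would need more careful handling of punctuation than this sketch provides.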
Using a simple Python script (type-token.py), I have calculated the type and token count for every text in a large collection of folk tales published in the Wantok newspaper of Papua New Guinea. The results can be found in corpus-counts.csv. These folk tales were originally published in Tok Pisin (an English-based creole) and later translated into English by Thomas Slone in 1001 Papua New Guinean Nights. Below, we produce a scatterplot of these values with the token count on the x-axis and the type count on the y-axis.
> d <- read.table("data/corpus-counts.csv", header=TRUE, sep="\t")
> plot(d$eng.token, d$eng.type,                           # plot English
+      xlab="Tokens", ylab="Types", main="English")       # label axes and add title
> plot(d$tkp.token, d$tkp.type,                           # plot Tok Pisin
+      xlab="Tokens", ylab="Types", main="Tok Pisin")     # label axes and add title
If we place the two side-by-side, we can eyeball the two graphs and see that, among other things, the overall number of types appears to be lower in Tok Pisin. In other words, the two graphs provide evidence that Tok Pisin has a more restricted vocabulary than English.
The two scatterplots are difficult to compare directly because R automatically fits the axis ranges to the data set being plotted, and therefore uses different ranges for the x- and y-axes of each graph. For Tok Pisin, the range is roughly 0 to 3000 for the x-axis and 0 to 300 for the y-axis, whereas for English it is roughly 0 to 2500 for the x-axis and 0 to 500 for the y-axis. We can put both graphs on the same scale by explicitly setting the ranges with the options xlim and ylim. (Graphs omitted to save space.)
> data = read.table("data/corpus-counts.csv", header=TRUE, sep="\t")   # read data
> plot(data$eng.token, data$eng.type, main="English",                  # plot English
+      xlab="Tokens (Words)", ylab="Types (Unique Words)",             # label axes
+      xlim=c(0,3000), ylim=c(1,500))                                  # set range of axes
> plot(data$tkp.token, data$tkp.type, main="Tok Pisin",                # plot Tok Pisin
+      xlab="Tokens (Words)", ylab="Types (Unique Words)",             # label axes
+      xlim=c(0,3000), ylim=c(1,500))                                  # set range of axes
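Rather than reading the limits off the graphs, the shared ranges can be computed with range(), and the two graphs can be drawn side by side with the graphics parameter mfrow. The sketch below uses a handful of made-up counts in place of corpus-counts.csv, so the numbers are purely illustrative.

```r
# Hypothetical token and type counts, invented for illustration only
eng.token <- c(120, 850, 2400)
eng.type  <- c(70, 310, 480)
tkp.token <- c(150, 990, 2900)
tkp.type  <- c(60, 200, 290)

# Smallest and largest value across both data sets
xr <- range(eng.token, tkp.token)
yr <- range(eng.type, tkp.type)

old <- par(mfrow = c(1, 2))   # one row, two columns of graphs
plot(eng.token, eng.type, main = "English",
     xlab = "Tokens", ylab = "Types", xlim = xr, ylim = yr)
plot(tkp.token, tkp.type, main = "Tok Pisin",
     xlab = "Tokens", ylab = "Types", xlim = xr, ylim = yr)
par(old)                      # restore the previous layout
```

Computing the limits from the data means the code keeps working if the data file changes.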
A better way of visualizing the same information would be to place both scatterplots on a single graph. Below, we instruct R to plot the two data sets on one graph and to distinguish the two using color: Tok Pisin in red and English in blue. The color for a plot is set using the col option. Note that the first plot is done using the high-level function plot(), whereas the second is done using the low-level function points(). Although these two functions have the same syntax, they have different uses. plot() will create a new graph; points() simply adds data points to a pre-existing graph.
> d <- read.table("data/corpus-counts.csv", header=TRUE, sep="\t")
> plot(d$eng.token, d$eng.type, col="blue",     # plot English in blue
+      xlab="Tokens", ylab="Types",             # label the x and y axis
+      xlim=c(0,3100), ylim=c(0,550),           # explicitly set range of x and y axis
+      main="English vs. Tok Pisin")            # add title
> points(d$tkp.token, d$tkp.type, col="red")    # plot Tok Pisin in red
Because we have used color to distinguish the plots for the two data sets, we need a legend that explains the color scheme. This can be done with the legend() command. The location of the legend is specified by its x,y coordinates. The elements in the legend are specified as points by setting the plotting character option pch to the default plotting character, which is 1. The colors of the points are determined by the option col, which takes a vector of two colors. The text for the blue and red points is provided by the option legend.
> legend(150, 500,                          # add legend at coord 150,500
+        pch=1, col=c("blue", "red"),       # legend has blue and red point
+        legend=c("English", "Tok Pisin"))  # label blue and red points
The resulting graph, shown below, is much easier to read, and the trends in the two scatterplots are much more visually striking.
A line graph is like a scatterplot, except that its data points are connected by a line. Since line graphs are commonly used to visualize activity in the stock market, we will show how to produce a line graph of the Dow Jones Industrial Average (DJIA) during a twenty-year period (1985 to 2005). The data is saved as a tab-delimited text file, dow-jones.csv, and comes from djindexes.com.
In the following sample R code, we read the data and plot it with the year on the x-axis and the Dow Jones on the y-axis. Note that the command plot() is run with the same syntax as before, the only difference being the option type, which is explicitly set to "l" to obtain a line graph. A title is added with main, and the x-axis is labeled with xlab and the y-axis with ylab.
> d <- read.table("data/dow-jones.csv", sep="\t", header=TRUE)         # read in tab-delimited data
> plot(d$year, d$start, type="l",                                      # plot points and connect with line
+      main="Dow Jones Industrial Average", xlab="Year", ylab="DJIA")  # add labels
The resulting graph should look like this:
It shows a steady increase in the DJIA until 2000, when there is an abrupt drop followed by a gradual recovery. This is, of course, the infamous bursting of the dotcom bubble and the eventual recovery from it.
In the examples provided in this tutorial, we have used the R environment to display graphs in its popup window. But sooner or later, you will want to save a graph into an external file for incorporation into a document (such as this one). By default, R writes all graphs to its popup window, but this is not the only available device (the technical term for the destination of any graphics drawn by R). R supports a variety of image formats. To see the available options, type ?device within an R session. The options include postscript, PDF, PNG, and JPEG (among others). It is even possible to have R generate the commands required to draw a graph in LaTeX.
Below, we will show how to export a graph into a JPEG image file. The main trick is to call the function jpeg(). It has only one required argument, which is the filepath to the JPEG file. Once the function jpeg() has been called, all subsequent graphing will be done inside of the specified JPEG file. Therefore, the following code will create a line graph in a file named test.jpg. (The file will be saved in the directory where R was originally run.)
> data <- read.table("data/dow-jones.csv", sep="\t", header=TRUE)      # read data
> jpeg("test.jpg")                                                     # graph to JPEG file
> plot(data$year, data$start, type="l",                                # create graph
+      xlab="Year", ylab="DJIA", main="Dow Jones Industrial Average")  # label axes and add title
> dev.off()                                                            # close file
> q()
Before quitting R, it is important to close the file with the command dev.off(). This ensures that the image is completely written to disk and that later plotting commands are no longer sent to it.
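The same open, plot, close pattern applies to the other file devices. As a small sketch, the following writes a graph to a PDF file instead, using made-up data and a temporary file name so that it can be run anywhere, and then checks that the file exists.

```r
out <- tempfile(fileext = ".pdf")   # temporary file name, for illustration only

pdf(out)                            # open a PDF graphics device
plot(1:10, (1:10)^2, type = "l",    # made-up data: the squares of 1 to 10
     xlab = "x", ylab = "x squared", main = "Device test")
invisible(dev.off())                # close the device so the file is complete

print(file.exists(out))             # was the file created?
print(file.info(out)$size)          # its size in bytes
```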
As you can see, the basics of creating graphs in external image files are fairly straightforward. For more in-depth information, see the Graphics section of the official R introduction.
So far, we have run R interactively, but it is also possible to run it non-interactively by storing R code in a separate file and redirecting it to R. For example, we can save the code from the Dow Jones line graph into a file (dow-jones.r):
# -----------------------------------------------------------------------
# Author:      Stuart Robinson
# Date:        18 Jan 2006
# Description: This code looks for the file data/dow-jones.csv
#              and uses the data in it to create a line graph
#              of the Dow Jones Industrial Average, saved as a JPEG
#              file /tmp/test.jpg.
# -----------------------------------------------------------------------
d <- read.table("data/dow-jones.csv", sep="\t", header=TRUE)  # read data
jpeg("/tmp/test.jpg")                                         # file for graph
plot(d$year, d$start,                                         # year as x, DJIA as y
     type="l",                                                # plot lines, not points
     xlab="Year",                                             # label x axis
     ylab="DJIA",                                             # label y axis
     main="Dow Jones Industrial Average")                     # add title
dev.off()                                                     # close file
q()                                                           # quit
There are two things to note about the formatting. First, there are no right angle brackets, since these are only used as command prompts by R in interactive mode. Second, whitespace can be used to give commands a more readable formatting.
The script can now be run from a Unix commandline as follows:
[stuart]$ R --no-save < dow-jones.r
There are two things to observe about how the R code is run above. First, the contents of dow-jones.r are sent to R using redirection (the left angle bracket). Second, as you will recall, when you quit R in interactive mode, you were asked whether to save the session. When a script is run non-interactively, this question must be answered in advance with a commandline option, either --save or --no-save; otherwise, an error results. (The option --vanilla conveniently wraps a number of options into one; for more information, see Invoking R.)
[stuart]$ R < dow-jones.r
Fatal error: you must specify '--save', '--no-save' or '--vanilla'
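A saved script can also be run from inside an interactive session with the function source(). The sketch below writes a two-line script to a temporary file and then runs it; the script contents and the file name are invented for the example.

```r
# Create a small script file in a temporary location
script <- tempfile(fileext = ".r")
writeLines(c(
  "x <- c(2, 4, 6)   # hypothetical data",
  "total <- sum(x)"
), script)

source(script)   # run the commands in the file within the current session
print(total)     # variables created by the script are now available
```

Unlike redirection from the commandline, this leaves you inside R afterwards, with the script's variables ready for inspection.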
When learning to use R, you will need raw data sets on which you can test various features of the language. Fortunately, R comes with an assortment of built-in data sets. These are an eclectic bunch, ranging in nature from Nile, which provides data about water flow in the river Nile from 1871 to 1970, to cars, which provides the speed of cars and braking distance from the 1920s. It also includes such gems as Titanic, which provides data concerning who did and did not survive the wreck of that ill-fated cruise ship.
To see the full list of these data sets, simply type data(). To obtain more information about a particular data set, you can type the name of the data set preceded by a question mark (e.g., ?AirPassengers). Most data sets also come with example R code, which you can study to improve your understanding of the language.
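As a quick sketch of how the built-in data sets can be explored, the following commands inspect the three data sets mentioned above using standard functions.

```r
data(cars)                   # load the built-in 'cars' data set
print(names(cars))           # column labels
print(nrow(cars))            # number of observations
print(summary(cars$speed))   # minimum, quartiles, mean, and maximum

print(length(Nile))          # one flow measurement per year
print(dim(Titanic))          # Titanic is a four-way table of counts
print(sum(Titanic))          # total number of people recorded
```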
R provides the means to perform a wide variety of statistical techniques. Because it is the tool of choice for many professional statisticians, it has been thoroughly tested, and its functionality is very comprehensive. Your knowledge of statistics will most likely be the main limitation on what you can do with it.
There is a good deal of documentation for R, both online and in print. For another introductory tutorial on R, try Analyzing Statistics with GNU R from the O'Reilly Press website ONLamp.com. The R homepage provides a more in-depth introduction to R as well as a listing of all official R documentation. There is also a good deal of documentation in print. A partial listing of books on R can be found on the R homepage, and the website for John Verzani's textbook Using R for Introductory Statistics provides another listing of R-related materials (including sample data sets).