Monthly Archives: August 2014

Creating classes in R

Creating a new class

Although the casual user might not realize it, R is actually a fully object oriented language, as every variable used in an R program is an object, or instance of a class. Classes in R are of two main types: S3 and S4. S3 classes (so named because they were defined for version 3 of the S language, the precursor to R) are older and, although many built-in R classes are of the S3 type, it’s considered good practice to create any new classes according to the more recent S4 standard, so that’s what we’ll look at in this post.

If you’re familiar with class definition techniques in languages such as Java, C++ or C#, R’s methods for defining classes will seem a bit bizarre. At a minimum, an R class must have a name; it can optionally have one or more data fields, known as slots, each of which must be of an existing data type. A class is created using setClass():

setClass("numbers", representation(a = "numeric", b = "numeric"))
num1 = new("numbers", a = 12, b = 42)
num1@a 
[1] 12

We’ve created a class called numbers which contains two numeric slots: a and b. The second argument of setClass(), built here with the representation() function, lists the slot names and their associated data types.

An object can be created from a class using the new() function (this is about the only feature of R classes that would be familiar to a ‘regular’ object-oriented programmer!), which takes as its first argument the name of the class, followed by initial values for its slots. Once the num1 object has been created, its slots can be referred to by using the object’s name followed by @ followed by the slot name, as shown.
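
As a minimal sketch of the points above, the following creates a class, builds an object and shows that new() rejects a value of the wrong type for a slot. The class name numbers2 and the variable names are made up for illustration, and the example assumes a version of R recent enough to accept the slots argument, which setClass() offers as an alternative to representation().

```r
library(methods)

# Hypothetical class for illustration; slots = is an alternative to representation()
setClass("numbers2", slots = c(a = "numeric", b = "numeric"))

num = new("numbers2", a = 12, b = 42)
num@a                        # read slot a
num@b = 43                   # slots can also be assigned with @
isVirtualClass("numbers2")   # FALSE: the class can be instantiated

# A slot value of the wrong type is rejected when the object is created
result = tryCatch(new("numbers2", a = "twelve"),
                  error = function(e) "rejected")
result
```

Wrapping the failing call in tryCatch() is only there so that the script keeps running; at the console you would simply see the error message.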

Adding methods to a class

In most OO languages, methods can be added to a class by writing them inside the class definition. Such methods belong to that class and need have no connection with any code outside the class (indeed, proper object oriented design often precludes outside connections). In R, things are quite different. A method can be added to a class using the setMethod() function, but the procedure for doing so is a bit tricky. As an example, suppose we want to add a method to numbers which prints out the slot a for a given object. In order to do this, we must override an existing function so that it operates on a numbers object; we can’t just invent a new method from scratch.

For example, there is a print() function built in to R, so we could call our new method print and customize it so that it prints out the a slot of a numbers object. Here’s how it’s done:

setMethod("print", "numbers", function(x) { 
  cat(paste("a =", x@a))})
print(num1)
a = 12

The first argument to setMethod() is the method’s name, which must match that of an existing function. The second argument is the class to which the method is to be added. The third argument is a definition of the method which overrides the existing definition, and which will be called whenever print() is invoked on a numbers object. In this case, the function uses the cat() function to print out "a =" followed by the value of a. The function is invoked as shown.

One important point must be emphasized here. The argument name (x) in the function definition must match that in the definition of the function that is being overridden. If you’re overriding a built-in R function, you’ll need to check the documentation to see what name is used for the argument(s) of the function you’re overriding. The documentation for print() gives the first argument name as x, so we have to use that name in our own definition. In fact, the documentation says explicitly: “x: an object used to select a method”.
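
The same rule applies to show(), the function R calls when an S4 object is printed automatically at the console. Its documented argument name is object, so a method for it has to use that name. Here is a small sketch, using a separate hypothetical class numbersShow so as not to disturb the numbers class above:

```r
library(methods)

# Hypothetical class used only for this example
setClass("numbersShow", representation(a = "numeric", b = "numeric"))

# The argument must be called 'object' to match the definition of show()
setMethod("show", "numbersShow", function(object) {
  cat("a =", object@a, "\n")
})

n = new("numbersShow", a = 12, b = 42)
n                                # at the console, typing the name calls our method
shown = capture.output(show(n))  # capture the printed text for inspection
shown
```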

What if we want to add a method with a name of our own choosing? In that case, we need to define a function with that name outside the class first and then override it as a method within the class. For example, if we wanted a method a.b that prints out both a and b we could write:

a.b = function(obj) {}
setMethod("a.b", "numbers", function(obj) { 
  cat(paste("a =", obj@a, " b =", obj@b))})
a.b(num1)
a = 12  b = 42

We first define a.b as an empty function that takes a single argument called obj. We can then use setMethod() to override this function so that it works for a numbers object. Again, we must use the same argument name (obj) in the method definition as was used in the original function definition. Calling a.b() on a numbers object gives the expected result. If we call a.b on any other data type, the original (empty) definition of a.b is called which returns nothing, so the result is NULL.
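
An alternative to starting from an empty ordinary function is to declare the generic function explicitly with setGeneric(). This is only a sketch: the names numbersG and describe are invented for the example. One difference from the approach above is that there is no fallback definition, so calling the generic on an object with no matching method raises an error instead of returning NULL.

```r
library(methods)

# Hypothetical class used only for this example
setClass("numbersG", representation(a = "numeric", b = "numeric"))

# Declare the generic; standardGeneric() dispatches to the matching method
setGeneric("describe", function(obj) standardGeneric("describe"))

# The method uses the same argument name (obj) as the generic
setMethod("describe", "numbersG", function(obj) {
  paste("a =", obj@a, " b =", obj@b)
})

msg = describe(new("numbersG", a = 12, b = 42))
msg

# There is no method for a character argument, so this call fails
outcome = tryCatch(describe("text"), error = function(e) "no method")
outcome
```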

Prototypes and default values

In our definition of the numbers class, the slots a and b were defined as numeric data types, but no default values were given. If we create a new object without giving values for these slots, we get an object with empty numeric vectors:

> num2 = new("numbers")
> num2
An object of class "numbers"
Slot "a":
numeric(0)
Slot "b":
numeric(0)

If we want the option of not specifying one or more of the arguments, we can provide a prototype parameter to setClass():

setClass("numbersDef", 
         representation(a = "numeric", b = "numeric"),
         prototype(a = 100, b = 666))
> num2 = new("numbersDef")
> num2
An object of class "numbersDef"
Slot "a":
[1] 100
Slot "b":
[1] 666
> num3 = new("numbersDef", b = 222)
> num3
An object of class "numbersDef"
Slot "a":
[1] 100
Slot "b":
[1] 222

We can now create a numbersDef object by specifying none, one or both slots, with the prototype default values filling in any missing slots.

Reading an R data frame from a file; Customized coercion for date-times

Reading a data file into a data frame

For any realistic use of data frames, we’ll be dealing with large sets of data, usually stored in an external file. R has a number of methods for reading data from various file types, but we’ll look at one of the simplest here, which is reading from .csv (comma-separated values) files. CSV files are produced by many applications, including popular spreadsheets such as Excel and LibreOffice. Data in a CSV file are given in rows with each row consisting of a fixed number of columns separated by commas. For illustration, I’ll use a data file containing weather readings for April 2014 taken from my weather station. There are 25 columns in this file, giving data on things like temperature, rainfall, wind speed and direction and so on. We’ll load this file into R and then do a few manipulations of the data.

> april2014 = read.csv("april2014.csv", stringsAsFactors = F)
> str(april2014)
'data.frame':	2880 obs. of  25 variables:
 $ dateTime       : chr  "2014-04-01 00:00:00" "2014-04-01 00:15:00" "2014-04-01 00:30:00" "2014-04-01 00:45:00" ...
 $ archiveInterval: int  15 15 15 15 15 15 15 15 15 15 ...
 $ iconFlags      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ moreFlags      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ packedTime     : int  15 30 45 60 75 90 105 120 135 150 ...
 $ outsideTemp    : num  6.72 6.67 6.67 6.61 6.56 ...
 $ hiOutsideTemp  : num  6.78 6.72 6.67 6.67 6.61 ...
 $ lowOutsideTemp : num  6.72 6.67 6.61 6.61 6.56 ...
 $ insideTemp     : num  21.1 20.9 20.8 20.8 20.5 ...
 $ barometer      : num  1014 1014 1014 1014 1014 ...
 $ outsideHum     : int  94 94 94 94 94 94 94 94 94 94 ...
 $ insideHum      : int  53 53 53 53 53 53 53 53 53 53 ...
 $ rain           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hiRainRate     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ windSpeed      : num  6.44 6.44 6.44 8.05 8.05 ...
 $ hiWindSpeed    : num  17.7 12.9 16.1 17.7 19.3 ...
 $ windDirection  : int  3 3 3 3 3 3 3 3 3 2 ...
 $ hiWindDirection: int  4 4 3 5 3 1 4 4 3 2 ...
 $ numWindSamples : int  342 341 341 343 343 343 342 342 342 342 ...
 $ solarRad       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hiSolarRad     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ UV             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hiUV           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DayTime        : num  1 1.01 1.02 1.03 1.04 ...
 $ Year           : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...

We use the read.csv() function to read the file into a data frame. All the data are numeric with the exception of the dateTime column, which contains the date and time as a character string; passing stringsAsFactors = F prevents R from interpreting dateTime as a factor. The str() output shows the structure of the resulting data frame. The weather station records data every 15 minutes, so dateTime starts at midnight on April 1 and advances in 15 minute intervals.

Converting a date-time string to a date-time object

It’s useful to convert the character strings giving the date and time to a proper date-time object. Unfortunately, the functions for doing this have non-intuitive names. There is a function called as.Date() but it returns only the date part, ignoring the time. If we want a proper date-time variable, we can use as.POSIXct() (I told you it was non-intuitive!). The acronym POSIX stands for Portable Operating System Interface (the X is a nod to the standard’s Unix heritage) and refers to a collection of IEEE standards. The ‘ct’ stands for ‘calendar time’. We can convert the dateTime column to POSIXct as follows:

> april2014[,"dateTime"] = as.POSIXct(april2014[,"dateTime"])
> str(april2014$dateTime)
 POSIXct[1:2880], format: "2014-04-01 00:00:00" "2014-04-01 00:15:00" ...

As our date-time data are already in a standard format, we don’t need to specify the format for as.POSIXct(). If the date-time is in some other format, we can specify it explicitly, as in

> april2014[,"dateTime"] = as.POSIXct(april2014[,"dateTime"], format="%Y-%m-%d %H:%M:%S")

Other date formats are possible; the R help entry for strptime gives the details. [To get help for this command, type ?strptime at the R console prompt in RStudio. The help will appear in the lower right panel.]
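
To make the format argument concrete, here is a small sketch that parses date-times written day-first without seconds. The sample strings are invented for illustration and are not taken from the weather file; the time zone is fixed to UTC so that the result does not depend on the machine’s settings.

```r
# Invented sample strings in day/month/year order, without seconds
stamps = c("01/04/2014 00:15", "01/04/2014 00:30")

# %d = day, %m = month, %Y = four-digit year, %H = hour, %M = minute
parsed = as.POSIXct(stamps, format = "%d/%m/%Y %H:%M", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")

# Date-time objects support arithmetic, unlike the original strings
gap = difftime(parsed[2], parsed[1], units = "mins")
as.numeric(gap)

# With an explicit format, a string that doesn't match gives NA rather than an error
unparsed = as.POSIXct("not a date", format = "%d/%m/%Y %H:%M", tz = "UTC")
is.na(unparsed)
```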

Reading data by specifying column classes

There is another way of reading the data that avoids the necessity of converting character strings to POSIXct date-time objects after reading. We can specify the classes (data types) of the columns in the CSV file as part of the read.csv command. In our example with the weather data, we know that all columns contain numerical data except the first which is a date-time in POSIXct format, so we can create a vector specifying these data types and pass it to read.csv.

> classes = c("POSIXct", rep("numeric", times = 24))
> april2014 = read.csv("april2014.csv", colClasses = classes) 

We’ve used the rep() function to generate a vector containing 24 strings, all saying "numeric" and concatenated it onto a "POSIXct" string.

This gives a slightly different structure to the data frame april2014, as all columns except the first are now of type numeric rather than some being numeric and some being integer, but we could fine-tune the data types by giving a more detailed classes vector if we wanted to.

We cheated a bit here, since this works only if the date-times are in the default POSIXct format as shown above. It is possible to tell read.csv the format of a date-time that isn’t in the default form, but it’s a bit tricky.

The technique relies on the fact that what read.csv does when given a colClasses vector is try to coerce the raw character string read from the CSV file into the data type specified for that column. In order for this to work, there needs to be what is known as an 'as' function that performs this coercion (like the as.POSIXct() function we used above to coerce the string to a POSIXct object). R provides as functions for all the basic data types like numeric and also a few other data types like POSIXct. However, it’s possible to create your own data type and write an as function that coerces a string (or, indeed any other data type) into that new data type. We can use this method to read date-times in a non-standard format. Here’s the code:

> setClass("myDateTime")
> setAs("character","myDateTime", function(from) 
+ as.POSIXct(from, format="%Y-%m-%d %H:%M:%S") )
> customClasses = c("myDateTime", rep("numeric", times = 24))
> april2014 = read.csv("april2014.csv", colClasses = customClasses)
> str(april2014)
'data.frame':	2880 obs. of  25 variables:
 $ dateTime       : POSIXct, format: "2014-04-01 00:00:00" "2014-04-01 00:15:00" "2014-04-01 00:30:00" ...

First we call setClass() to define a new class called myDateTime. Then we use setAs() to define a coercion from character to myDateTime. setAs() takes 3 arguments (in its most basic form). The first is the data type we want to coerce from, the second is the data type to coerce to, and the third is a function that takes a single argument which must be an instance of the ‘from’ data type. This function returns an instance of the ‘to’ data type. In this case, the function uses the built-in as.POSIXct() function to coerce the date-time string with the given format to a POSIXct object. In R, functions can be passed as arguments to other functions, and the value of the last expression evaluated in a function is that function’s return value.

As can be seen in the structure of april2014, the dateTime column in the data frame has the POSIXct data type.
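
The same technique can be tried without the weather file by passing the CSV content to read.csv() through its text argument. In this sketch the class name dmyDateTime, the two data rows and the day-first format are all invented for illustration; only the setClass()/setAs()/colClasses pattern comes from the code above.

```r
library(methods)

# Hypothetical class name; it exists only so that colClasses has something to refer to
setClass("dmyDateTime")
setAs("character", "dmyDateTime", function(from)
  as.POSIXct(from, format = "%d/%m/%Y %H:%M", tz = "UTC"))

# Invented CSV content with a non-default date-time format
csvText = "dateTime,outsideTemp
01/04/2014 00:00,6.72
01/04/2014 00:15,6.67"

weather = read.csv(text = csvText,
                   colClasses = c("dmyDateTime", "numeric"))
str(weather)
class(weather$dateTime)
```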

Clearly there are a lot of techniques that we’ve glossed over here, but we’ll hopefully return to these in later posts for a more thorough understanding of how R handles classes and functions.

Data frames in R: basic operations

Creating data frames

The data frame in R is a two dimensional data structure. The data within each column in a data frame must be all of the same type, but separate columns can contain data of different types. It is probably the most commonly used data type in R, as its structure resembles that of a spreadsheet. We can create a simple data frame using R’s built in data editor. We’ll build a data frame that contains the ASCII codes of a few letters. First, we create an empty data frame with two columns; the first column contains the letters and is of type character, and the second column contains the codes and is of type integer. After that, we invoke the data editor using the edit() function:

> ascii = data.frame(Symbol = character(), Code = integer(),
+                    stringsAsFactors = F)
> ascii = edit(ascii)
> ascii
  Symbol Code
1      A   65
2      B   66
3      C   67
4      D   68
5      E   69
6      F   70

Some things to note here:

  1. The function for creating a data frame is data.frame(). If you come from a more traditional object oriented background, you might think we are calling a function named frame() from an object named data. However, in R, the period or full stop ‘.’ has no special meaning and is just another character that is allowed in variable and function names. Thus the name data.frame is a single function name and doesn’t refer to any object.
  2. The names Symbol and Code are the labels of the two columns. When defining a data frame, we list the column names and their associated data types; there’s no need to put the names in quotes.
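  3. By default (in versions of R before 4.0.0), a character column in a data frame is interpreted as defining factors, which are basically labels we can use to categorize the rows in a data frame (more on this later). If we don’t want strings to be factors, we need to explicitly switch this behaviour off, which is what stringsAsFactors = F does. From R 4.0.0 onwards character columns are left as strings by default, so the argument is no longer needed there, although it does no harm.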
  4. When opening the data editor with edit(ascii) make sure to assign the result to a data frame variable, otherwise all your edits will be lost! The edit() function should pop up a separate window in RStudio. Just type in the values you want and then close the window by clicking the little X icon in the upper right.
  5. When R prints out the contents of a data frame, it provides names for the rows if you didn’t specify them yourself. Here the row names are just the numbers 1 through 6; they aren’t part of the data stored in the data frame.
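
The data editor is interactive, so it can’t be used in a script. As a sketch of a non-interactive alternative, the same data frame can be built directly by giving data.frame() the column contents. This relies on the built-in constant LETTERS and on the fact that the codes for A to F are the consecutive integers 65 to 70.

```r
# Build the table directly instead of typing values into the editor
ascii = data.frame(Symbol = LETTERS[1:6], Code = 65:70,
                   stringsAsFactors = FALSE)
ascii
str(ascii)

# utf8ToInt() gives the code of a single character, which lets us check a row
codeOfA = utf8ToInt("A")
codeOfA
```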

Row and column names

If we want to change (or add) the row or column names we can use rownames() or colnames():

> rownames(ascii) = c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")
> colnames(ascii) = c("letter", "asciiCode")
> ascii
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70

The row and column names can be used to access individual elements, but somewhat confusingly, these names must now be enclosed in quotes:

> ascii["beta","letter"]
[1] "B"
> ascii["beta","asciiCode"]
[1] 66

We can use the $ notation as with lists to refer to columns, but not rows. A few examples of selecting rows and columns:

> beta = ascii["beta",]
> beta
     letter asciiCode
beta      B        66
> str(beta)
'data.frame':	1 obs. of  2 variables:
 $ letter   : chr "B"
 $ asciiCode: num 66
> ascii$letter
[1] "A" "B" "C" "D" "E" "F"
> str(ascii$letter)
 chr [1:6] "A" "B" "C" "D" "E" "F"

In the first command, we select row beta. The str() function shows the structure of a variable; in this case we see that the isolated row is itself a data frame. However, when we isolate the letter column with ascii$letter, we see that it is a character vector, not a data frame.

Adding rows and columns

We can add extra rows or columns using rbind() or cbind():

> x = cbind(ascii,Reverse = c(6,5,4,3,2,1))
> x
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1
> x = rbind(ascii,eta = c("G",71))
> x
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71

These functions create and return a new data frame by adding to an existing data frame, so don’t forget to save the result in a variable.
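
One thing to watch with rbind(): c("G",71) is a character vector, so a row supplied that way carries its code as the string "71", which can turn the whole asciiCode column into character. A sketch of one way to keep the column types is to supply the new row as a one-row data frame with matching column names. The ascii data frame is rebuilt here so that the example is self-contained, and the name newRow is made up for illustration.

```r
# Rebuild the example data frame so this block runs on its own
ascii = data.frame(letter = LETTERS[1:6], asciiCode = 65:70,
                   row.names = c("alpha", "beta", "gamma",
                                 "delta", "epsilon", "zeta"),
                   stringsAsFactors = FALSE)

# The new row as a one-row data frame: each column keeps its own type
newRow = data.frame(letter = "G", asciiCode = 71L,
                    row.names = "eta", stringsAsFactors = FALSE)

typed = rbind(ascii, newRow)
typed
class(typed$asciiCode)   # still a number column
```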

Deleting rows and columns

Deleting rows and columns using the index numbers of the desired rows and columns is done using the - operator:

> x[-2,]
        letter asciiCode
alpha        A        65
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71
> x[-c(2:4),]
        letter asciiCode
alpha        A        65
epsilon      E        69
zeta         F        70
eta          G        71
> x[,-1]
[1] "65" "66" "67" "68" "69" "70" "71"

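The first expression deletes row 2, the second deletes rows 2 through 4, and the third deletes the first column, leaving only column 2, which is returned as a vector. The codes in that vector are printed in quotes because c("G",71) is a character vector, so adding the eta row with rbind() turned the asciiCode column into character strings. All these commands create and return a new data frame (or vector) without modifying the original.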

Deleting using row and column names is a bit trickier. For columns, we can delete by setting the column to NULL, but be warned that this deletes the column from the original data frame! If you want to save the original and produce a new data frame with the column deleted, create a copy of the original first:

> xsave = x
> x$letter = NULL
> x
        asciiCode
alpha          65
beta           66
gamma          67
delta          68
epsilon        69
zeta           70
eta            71
> xsave
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71

We copy x to xsave and then delete the letter column. We see that x has this column deleted but xsave remains unaltered. We can’t use the $ notation to delete rows.

A safer, non-destructive method that works for both rows and columns is as shown:

> y = cbind(ascii,Reverse = c(6,5,4,3,2,1))
> y
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1
> !colnames(y) %in% c("letter", "Reverse")
[1] FALSE  TRUE FALSE
> y[,!colnames(y) %in% c("letter", "Reverse")]
[1] 65 66 67 68 69 70
> y
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1

We first show y as it starts out. The next command is a rather cryptic expression which produces the logical vector printed beneath it. The colnames() function returns a vector of the column names of y. The %in% operator tests each element in its left operand to see if it is present in its right operand and generates a logical vector with TRUE if the item is present and FALSE if it isn’t; the leading ! then negates that vector, so that TRUE marks the columns we want to keep. If we then pass this logical vector as the column index for y, only those columns in a TRUE location will be kept, so the result is that the letter and Reverse columns are deleted. Because only one column remains, the result is returned as a plain vector. Finally we print out y to show that it’s unaltered. The same technique works for rows with rownames(y) replacing colnames(y).
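
Here is a sketch of the row version of this technique, together with the drop = FALSE option for keeping a single remaining column as a data frame rather than a vector. The data frame y is rebuilt so that the block runs on its own; the names keepRows and oneCol are made up for illustration.

```r
# Rebuild y so this block is self-contained
y = data.frame(letter = LETTERS[1:6], asciiCode = 65:70, Reverse = 6:1,
               row.names = c("alpha", "beta", "gamma",
                             "delta", "epsilon", "zeta"),
               stringsAsFactors = FALSE)

# TRUE for every row whose name is NOT in the list of rows to remove
keepRows = !rownames(y) %in% c("beta", "delta")
fewerRows = y[keepRows, ]
fewerRows

# drop = FALSE keeps the result a data frame even with one column left
oneCol = y[, !colnames(y) %in% c("letter", "Reverse"), drop = FALSE]
oneCol

nrow(y)   # y itself is unchanged
```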

Lists in R

Although the R vector is a list of items, it suffers from the constraint that all elements in a vector must be the same data type. The list data type is a one dimensional list of items, where each item can be a different data type. Items in a list can be anything, including vectors, matrices and even other lists. A list can be created using the list() function:

> x = list(42, T, "wibble", matrix(1:10,5,2))
> x
[[1]]
[1] 42
[[2]]
[1] TRUE
[[3]]
[1] "wibble"
[[4]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

The list x contains double, logical, character and matrix data types. To refer to an element in the list, we need to use the double-bracket notation, so x[[1]] is the first element in the list, which is a vector with a single element (42). The element x[[4]] is the 5×2 matrix shown.

Note the difference between the following two objects:

> intList = list(1:10)
> intVector = 1:10
> intList
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10
> intVector
 [1]  1  2  3  4  5  6  7  8  9 10

The object intList is a list containing a single element which is the vector of integers from 1 to 10. The object intVector is a vector with 10 elements. To get the number 4 out of intList, we’d need to say intList[[1]][4] while for intVector we say intVector[4].

The typeof() function returns ‘list’ when applied to a list, no matter what that list contains. As ‘list’ isn’t a numeric type, we can’t use any of the mathematical operations on a list.
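
A short sketch of two points that often trip people up: single brackets on a list return a (shorter) list, whereas double brackets return the element itself; and to do arithmetic on a list of numbers we first have to get the numbers out, for example with unlist(), or apply a function to each element with lapply(). The names nums, total and doubled are made up for the example.

```r
x = list(42, TRUE, "wibble", matrix(1:10, 5, 2))

class(x[1])     # "list": single brackets give a list containing the first element
class(x[[1]])   # "numeric": double brackets give the element itself
typeof(x)       # "list"

nums = list(1, 2, 3)
total = sum(unlist(nums))                   # flatten to a vector, then add up
doubled = lapply(nums, function(n) n * 2)   # apply a function to each element
total
doubled
```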

We can name the elements of a list by using the names() function. For our list x above we could say:

> names(x) = c("number", "logical", "string", "matrix")
> x
$number
[1] 42
$logical
[1] TRUE
$string
[1] "wibble"
$matrix
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Notice that the name of each element is prefixed by a $. We can use the $ notation to refer to list elements:

> x$matrix
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

We can also use the notation x[["matrix"]] to get the same element. (Vector elements can also be named, but we can’t use the $ notation to refer to vector elements; we must use vec["name"].)

Apart from that, there’s not a lot that can be done with lists at the top level. Their main use is as a storage container for other objects, and as the basis for the data frame, which is a much more commonly used data type.

Matrices in R

The natural next step after looking at R vectors is to examine matrices. Although a matrix, being a two dimensional grid of values, may seem the natural choice for representing tables of data, it is actually better to use R’s data frame for that purpose, as R provides many more functions for manipulating data frames. It’s best to use a matrix primarily for those operations you would normally perform on a mathematical matrix, such as matrix multiplication, inversion and so on. Here’s a simple matrix:

> x = matrix(1:10, 2, 5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

The matrix x is composed of the integers from 1 to 10, arranged in 2 rows and 5 columns. By default, R fills a matrix column-wise. We can refer to a particular element in a matrix by giving its row and column index (again, remember that R indexes start at 1, not 0), so x[2,3] is 6. If we specify just one of these indexes, we get a vector containing either a single row or single column:

> x[,2]    # Column 2
[1] 3 4
> x[1,]    # Row 1
[1] 1 3 5 7 9

It’s important to note that a vector is not a one dimensional matrix in R; they are quite different beasts. If you do want a one dimensional matrix you can get it by including the option ‘drop = FALSE’:

> x[,2, drop=FALSE]
     [,1]
[1,]    3
[2,]    4
> x[1,,drop=FALSE]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9

The first expression gives a 2×1 matrix, while the second gives a 1×5 matrix. The word ‘drop’ refers to dropping any dimension that is left with only one row or column after the selection. The default is ‘drop=TRUE’ meaning that when you select a single row or column, the matrix reduces from 2 to 1 dimension, becoming a vector. We can even produce a 1×1 matrix by isolating a single element as in x[2, 4, drop=FALSE].

Sometimes it’s more convenient if we can refer to a row or column by a specific name, rather than by its index. Names can be any string:

> colnames(x) = c("A","B","C","D","E")
> rownames(x) = c("Alpha", "Beta")
> x
      A B C D  E
Alpha 1 3 5 7  9
Beta  2 4 6 8 10

With these names, we can refer to x["Alpha", "B"] to get the value 3.

The t() function returns the transpose of a matrix:

> t(x)
  Alpha Beta
A     1    2
B     3    4
C     5    6
D     7    8
E     9   10

Addition, subtraction and component-wise multiplication are done with the +, - and * operators as usual. To do ‘real’ matrix multiplication, use the %*% operator, which works only if the number of columns in its left operand is equal to the number of rows in the right operand. We can multiply a matrix by its transpose (or a function of its transpose) in either order:

> sqrt(t(x)) %*% x
         A         B        C        D        E
A 3.828427  8.656854 13.48528 18.31371 23.14214
B 5.732051 13.196152 20.66025 28.12436 35.58846
C 7.135047 16.506163 25.87728 35.24839 44.61951
D 8.302606 19.250962 30.19932 41.14768 52.09603
E 9.324555 21.649111 33.97367 46.29822 58.62278
> x %*% t(x)
      Alpha Beta
Alpha   165  190
Beta    190  220

A matrix inverse is found with the solve() function. The inverse of a matrix is defined only if the matrix is square and its rows and columns are linearly independent. If either of these conditions is violated, R will give an error.

> xtx = x %*% t(x)
> xtxinv = solve(xtx)
> xtxinv
      Alpha   Beta
Alpha  1.10 -0.950
Beta  -0.95  0.825
> xtxinv %*% xtx
      Alpha Beta
Alpha     1    0
Beta      0    1

We’ve done a check that the matrix multiplied by its inverse gives the identity matrix.
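
The solve() function can also be given a second argument, in which case it solves a linear system directly instead of returning the inverse. The sketch below uses a small made-up matrix A and right-hand side b (not the matrices above), and also shows the error case for a matrix whose columns are not linearly independent; tryCatch() is used only so that the script carries on after the error.

```r
# Solve the system A %*% s = b without forming the inverse explicitly
A = matrix(c(2, 1, 1, 3), 2, 2)   # filled column-wise: rows are (2, 1) and (1, 3)
b = c(3, 5)
s = solve(A, b)
s
A %*% s                           # reproduces b, up to rounding

# The second column is twice the first, so this matrix has no inverse
singular = matrix(c(1, 2, 2, 4), 2, 2)
outcome = tryCatch(solve(singular), error = function(e) "not invertible")
outcome
```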

R programming: first steps

R is a (very) high level programming language that is used for statistical analysis. It’s open source and free, and is used in many industrial-strength applications, so if you’re planning on doing any data analysis, it’s worth a look.

Here, I’ll start with installing R and looking at a few basic concepts.

Installing R and RStudio

R runs on all major computing platforms (Windows, Mac, Linux) but I’ll restrict myself to Windows, since that’s all I have. Installing R is quite straightforward. Visit the CRAN website and download the R package, then install it in the usual way (on Windows, the install file is an exe, so just run it).

Although R on its own runs from the command line and many tutorials assume this is the environment you’re using, if you’re used to an IDE for your programming in other languages I’d highly recommend that you now install RStudio, which is a free (for non-commercial use) graphical interface for R development. You can download it from the RStudio website. From now on, I’ll assume this is the environment we’re using. RStudio should find your R installation, but if it doesn’t, or you want to change the version of R it uses, open Tools –> Global Options and select the General tab.

R is an object-oriented, functional language that contains most of the usual features such as arithmetic operators, if statements and loops. Since these don’t differ much from other languages such as Java or C#, we won’t dwell on them here. Rather, we’ll start off by examining some simple operations on data sets, which is what R is designed for.

One feature of R that is at once powerful and frustrating is that there is a pre-written function to do almost anything you can think of. It’s frustrating because the sheer number of such functions makes it virtually impossible to remember them (so remember Google is your friend) and even if you do, their usage is often far from obvious. There is a built-in ‘help’ facility in R, but, at least for novices, it’s often far from helpful.

Anyway, open up RStudio and follow along. You should find that there is a Console window in the lower left (or possibly covering the entire left side) into which you can type R commands.

Data types

R comes with several built-in primitive data types such as ‘logical’, ‘character’, ‘double’, ‘numeric’, ‘integer’ and ‘complex’. R doesn’t require you to declare your variables before using them; rather the type of the variable is determined by how it is used, and the same variable name can be reassigned to different data types within the same program. The current type of a variable can be found with the typeof() function:

> x = 12 
> typeof(x) 
[1] "double" 

A note about the assignment operator: in older versions of R the backwards arrow <- was the only acceptable assignment operator so the x = 12 statement above would be written x <- 12. More recent versions of R allow both <- and the more intuitive = for assignment. I'll use = here since it's what I'm used to from other languages.
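
As a quick sketch of how the type follows the value, the same variable name is reassigned below to values of several types and typeof() is called after each assignment. The particular values are arbitrary.

```r
x = 12
typeof(x)      # "double": plain numbers are doubles by default

x = 12L        # the L suffix asks for an integer
typeof(x)      # "integer"

x = "twelve"
typeof(x)      # "character"

x = TRUE
typeof(x)      # "logical"

x = 2 + 3i
typeof(x)      # "complex"

# Collect the results in one vector for comparison
types = c(typeof(12), typeof(12L), typeof("twelve"), typeof(TRUE), typeof(2 + 3i))
types
```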

Vectors

For anything beyond trivial commands, we’ll be dealing with collections of data so we need to see how R handles these. There are four main types of object that are used for storing data: vectors, lists, matrices and data frames. The simplest of these is the vector.

We can define a vector using the range operator (a colon : ) if we want, say, a sequence of integers:

 
> v = 1:3 
> v
[1] 1 2 3

(Typing a variable on a line by itself prints out the value of that variable.)

Another way to create a vector is to use the c() function (‘c’ for ‘concatenate’), which produces a vector from its arguments. Thus we could have written the above vector as

 
> v = c(1,2,3) 
> v
[1] 1 2 3

There is, however, a subtle distinction between the two, which can be seen by using typeof(v). In the first example, the range operator : produces a vector of integers from 1 to 3 so typeof(v) produces ‘integer’. In the second example just listing the numbers 1, 2, 3 makes R think they are doubles.

Notice that applying typeof() to a vector produces the type of its elements. R has no separate ‘vector’ type: even a single number such as 12 is simply a vector of length 1.

If you’ve been wondering what the [1] at the start of the output line means, it indicates that the first element printed in that line is element 1 of the vector. Printing out a longer vector causes the output to appear on several lines, and the index of the first element in each line is printed at the start of that line. Try it yourself by generating a vector with the integers 1 through 100 and then print it.

The elements of a vector can be accessed individually using square bracket notation, so that v[1] is the first element of v, v[2] is the second element, and so on. Note that vector indexes begin at 1, not 0 as in many other languages like Java and C#.
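
Square brackets accept more than a single index. The sketch below, using an arbitrary example vector, shows selecting several elements at once, excluding an element with a negative index, and selecting by a logical condition.

```r
v = c(10, 20, 30, 40)

v[1]          # the first element (indexes start at 1)
v[c(1, 3)]    # elements 1 and 3
v[2:4]        # elements 2 through 4
v[-1]         # everything except the first element
v[v > 25]     # only the elements greater than 25
length(v)     # number of elements
```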

Coercion

We can apply c() to a list of any types of arguments, even mixed types. However, the elements of a vector must be all the same type, so what happens if we try something like

 
> u = c(42, "Hello", TRUE) 
> typeof(u) 
[1] "character" 
> u 
[1] "42" "Hello" "TRUE" 

This illustrates coercion; each element is coerced into the most general data type. In this case 42 is double, “Hello” is character and TRUE is logical. A character string, in general, can’t be interpreted as a number or a logical variable (true or false), while the other elements can be interpreted as just character strings rather than as actual values. So in this case, c() coerces all the elements to be of type character. When we print u we see all its elements are in quotes, indicating that they are merely strings, not values.
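
The following sketch shows the direction in which c() coerces for a few other mixtures, and how to coerce explicitly with the as functions. A string that can’t be read as a number becomes NA (with a warning, which is suppressed here so the script runs quietly). The variable name bad is made up for the example.

```r
typeof(c(TRUE, 2L))      # "integer": logical is coerced to integer
typeof(c(2L, 3.5))       # "double": integer is coerced to double
typeof(c(3.5, "a"))      # "character": everything can be written as a string

as.numeric("42")         # explicit coercion from string to number: 42
as.logical("TRUE")       # and from string to logical: TRUE
as.character(42)         # number to string: "42"

# "Hello" can't be interpreted as a number, so the result is NA
bad = suppressWarnings(as.numeric("Hello"))
is.na(bad)
```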

R scripts

For anything more than the odd isolated command, typing R statements at the command line can get tedious, especially if you need to repeat several commands. It’s easier to create an R script file and run that. In RStudio, click on the ‘New’ icon in the top left (a blank sheet with a green circle with a plus sign) and create a new R Script. An empty window will appear in the top left. Any code entered in this window can be run by clicking the Source icon at the top right of this window (or pressing Ctrl + Shift + S). This runs the code, but the value of a bare expression (a line containing just x, say) isn’t printed automatically; only explicit print() calls produce output. If you want to see everything, you can open the drop-down menu to the right of the Source icon and select “Source with echo” (or press Ctrl + Shift + Enter). This will echo the code as well as the output in the console.
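Outside RStudio, a rough equivalent can be sketched with base R's source() function, on the assumption that the two menu options correspond to calling source() without and with echo = TRUE. The temporary file and its three-line contents are made up for the example.

```r
# Write a tiny script to a temporary file (the path is arbitrary)
script = tempfile(fileext = ".R")
writeLines(c("x = 1:3", "x", "print(sum(x))"), script)

quiet  = capture.output(source(script))               # like 'Source'
echoed = capture.output(source(script, echo = TRUE))  # like 'Source with echo'

quiet    # only the result of the explicit print() call
echoed   # the commands together with all of their results
```

Here capture.output() is used only so the two results can be compared side by side; normally you would just call source() directly.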

It’s also worth noting that any R objects created by your code (either in the console or from running a script) remain in your environment until you clear it. They are visible in RStudio in the top right panel under the Environment tab.
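The same information is available from the console. A small sketch, with made-up object names:

```r
tmp_a = 42
tmp_b = "hello"
ls()              # names of the objects in your environment, including tmp_a and tmp_b
rm(tmp_a)         # remove a single object
exists("tmp_a")   # FALSE
exists("tmp_b")   # TRUE
# rm(list = ls()) would remove everything, so it is left commented out here
```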

Anyway, back to vectors. Here are a few things you can do with vectors that illustrate some of the built-in R functions and operators.

 
x = 1:10
y = seq(2, 20, by = 2)
x
y
x + y    # Add corresponding elements
x - y    # Subtract corresponding elements
x * y    # Multiply corresponding elements
x / y    # Divide corresponding elements
x %*% y  # Inner product (produces 1 value)
crossprod(x, y)  # Caution! Same as x %*% y
sqrt(x)  # Square root of each element
x^3      # Cube of each element
sum(x)   # Add up all elements
mean(x)  # Mean of elements
var(x)   # Variance of elements
x = c(x, 11:15) # Add 11:15 to end of x and reassign x to result
y = c(y, seq(22, 30, by = 2))  # Extend y similarly
x
y
z = 1:5
x + z    # Add vectors of different lengths
q = 1:6
y + q    # Longer length not a multiple of shorter length

Try copying and pasting this code into RStudio and running it to see what you get. Line 2 shows the seq() function, which generalizes the colon operator by allowing you to specify a step size with ‘by’. The four standard arithmetic operators each operate on every element in the two vectors, so x + y performs x[1] + y[1], x[2] + y[2] and so on.
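A few spot checks make the element-by-element behaviour concrete. This is only a sketch; the names s and steps are arbitrary.

```r
x = 1:10
y = seq(2, 20, by = 2)          # 2, 4, ..., 20
steps = seq(1, 2, by = 0.25)    # the step need not be an integer: 1.00 1.25 1.50 1.75 2.00
s = x + y                       # 3, 6, 9, ..., 30
s[1] == x[1] + y[1]             # TRUE
s[10] == x[10] + y[10]          # TRUE
all(x / y == 0.5)               # TRUE: each element of y is twice the matching element of x
```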

The %*% operator on line 9 performs an inner product, which is essentially the same thing as a dot or scalar product between the two vectors, equal to x[1] * y[1] + x[2] * y[2] + … There is also a function called, confusingly, crossprod() which does the same thing. This is NOT the cross or vector product that you may be familiar with from linear algebra! As far as I can tell, if you want the cross product you’ll need to write an R function to do that yourself (though it’s not hard).
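Here is one way such a function might look, as a minimal sketch for vectors of length 3. The name cross3 is made up and is not part of base R. The last lines also check that %*% agrees with summing the elementwise products.

```r
# cross3() is a hypothetical helper, not a built-in R function
cross3 = function(a, b) {
  stopifnot(length(a) == 3, length(b) == 3)
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}

i = c(1, 0, 0)
j = c(0, 1, 0)
k = cross3(i, j)        # 0 0 1
drop(k %*% i)           # 0: k is perpendicular to i

x = 1:10
y = seq(2, 20, by = 2)
drop(x %*% y)           # 770; drop() turns the 1 x 1 matrix into a plain number
sum(x * y)              # 770, the same inner product
```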

The caret ^ is the exponentiation operator, and operates on each vector element separately. sum(), mean() and var() calculate the sum, mean and variance of the elements in the vector, so each returns a single value.
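For x = 1:10 the three summaries can be checked by hand. One detail worth knowing is that var() computes the sample variance, dividing by n - 1 rather than n; the name manual_var below is just for illustration.

```r
x = 1:10
x^2                      # 1 4 9 ... 100, one result per element
sum(x)                   # 55
mean(x)                  # 5.5
var(x)                   # 9.166667
manual_var = sum((x - mean(x))^2) / (length(x) - 1)
all.equal(var(x), manual_var)   # TRUE
```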

It’s possible to extend a vector using c() as shown in lines 16 and 17. We add 11:15 to the end of x and then reassign x to be this longer vector.
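In isolation, the extension step looks like this:

```r
x = 1:10
x = c(x, 11:15)       # append 11:15 and reassign x
length(x)             # 15
identical(x, 1:15)    # TRUE
```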

Finally, it’s worth noting what R does if we use vectors of different length in the arithmetic operations. We create a vector z of length 5 and add it to x, which is now length 15. R repeats the shorter vector enough times to make it fit the longer vector, so that x + z produces a vector with elements [1+1, 2+2, 3+3, 4+4, 5+5, 6+1, 7+2, 8+3,…]. If the longer vector’s length is a multiple of the shorter vector’s length, R will do the computation silently, but if this isn’t the case, as with y+q, it will still wrap the shorter vector enough times to match the longer one, but you’ll get a warning that the length of the longer vector isn’t a multiple of that of the shorter.
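The recycling rule can be spelled out with rep(), and the warning can be captured to see its text. This is a sketch using the same vectors as above; tryCatch() is used here only to grab the warning message instead of letting it print.

```r
x = 1:15
z = 1:5
x + z                                     # 2 4 6 8 10 7 9 11 13 15 12 14 16 18 20
z_long = rep(z, length.out = length(x))   # what R effectively does with z
identical(x + z, x + z_long)              # TRUE

y = seq(2, 30, by = 2)                    # length 15
q = 1:6                                   # 15 is not a multiple of 6
msg = tryCatch(y + q, warning = function(w) conditionMessage(w))
msg                                       # the text of the warning
```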

Some commands won’t work on vectors of different lengths. For example, the %*% operator requires two vectors of equal length.
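For instance, with two short made-up vectors of lengths 3 and 2, %*% stops with an error rather than recycling. Again tryCatch() is there only to capture the error message.

```r
a = 1:3
b = 1:2
res = tryCatch(a %*% b, error = function(e) conditionMessage(e))
res    # "non-conformable arguments"
```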