## Four Fundamentals

The essence of R:

R <- c(1:4)
R
## [1] 1 2 3 4

(See Vectors later).

• Vector-based: R is not a procedural language

[Two] reasons to use R for Data Science:

• Designed for data: R can manipulate big data sets
• Graphics Are Graspable: people understand graphical data

[Three] fundamental principles of R per John Chambers:

• Objects: Everything that exists in R is an object
• Functions: Everything that happens in R is a function call
• Interfaces: interfaces to other software are an integral part of R

[Four] ways of programming R:

• Command line: entering R commands in a terminal
• Source file: running a set of commands from a saved file
• R GUI: available for Mac, Windows, and Linux
• Code chunks in RStudio: allows debugging as you write

## Basic Maths

R has all the basic mathematical functions:

1 + 1
## [1] 2
1 + 2 + 3
## [1] 6
3 * 7 * 2
## [1] 42
4 / 3
## [1] 1.333333

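R obeys the standard order of mathematical operations (PEMDAS):

1. Parentheses ( )
2. Exponents ^
3. Multiplication *
4. Division /
5. Addition +
6. Subtraction -

Multiplication and division share the same precedence and are evaluated left to right, as are addition and subtraction.

(2 ^ 5) + (2 * 5)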
## [1] 42
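
As a small sketch of how these precedence rules play out, the lines below compare a few expressions with and without parentheses. The variable names are hypothetical, chosen only for this illustration.

```r
# Multiplication is evaluated before addition
noParens <- 2 + 3 * 4       # 14
# Parentheses are evaluated first
withParens <- (2 + 3) * 4   # 20
# The exponent binds tighter than the unary minus
negSquare <- -2 ^ 2         # -4
# Exponents group right to left: 2 ^ (3 ^ 2) = 2 ^ 9
tower <- 2 ^ 3 ^ 2          # 512
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.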

The use of white space between operators is recommended.

## Variables

Unlike statically typed languages such as C++, R does not require variable types to be declared. An R variable can represent any data type or R object, such as a function, result, or graphical plot. R variables can be reassigned, even to a value of a different type.

• Variable names can contain alphanumeric characters, periods . and underscores _, but must start with a letter (or with a period that is not followed by a digit)
• Variable names are case sensitive
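
A minimal sketch of case sensitivity and of reusing one variable for different kinds of object. The names `value` and `Value` are hypothetical.

```r
value <- 42             # a numeric value
Value <- "forty-two"    # a different variable: names are case sensitive
class(value)            # "numeric"
class(Value)            # "character"

# The same name can later hold a completely different kind of object
value <- mean           # now a function
class(value)            # "function"
```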

### Assigning variables

R variable assignment operators are <- (default) and = (acceptable).

x <- 2
x
## [1] 2
y = 5
y
## [1] 5

You can also assign left-to-right with ->, but variables are not often assigned that way.

7 -> z
z
## [1] 7

Assignment operations can be chained to assign one value to multiple variables:

a <- b <- 42
a
## [1] 42
b
## [1] 42

You can also use the built-in assign function:

assign("q", 4)
q
## [1] 4

### Removing variables

rm(variablename) removes a variable.

rm(q)
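
One way to confirm that a variable has gone is to check with exists() before and after. This is only a sketch, and `tmpValue` is a hypothetical name.

```r
tmpValue <- 10
before <- exists("tmpValue")   # TRUE: the variable is defined
rm(tmpValue)
after <- exists("tmpValue")    # FALSE: the variable has been removed
```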

## Data Types

R has four main data types:

• Numeric
• Character (a.k.a Nominal)
• Date
• Logical

You can check the type of a variable with class(variablename):

x <- "eh?"
x
## [1] "eh?"
class(x)
## [1] "character"
y <- 99
y
## [1] 99
class(y)
## [1] "numeric"

### Numeric data types

Numeric data includes both integers and decimals — positive, negative, and zero — similar to float or double in other languages. A numeric value stored in a variable is automatically assumed to be numeric in R.

You can test whether data is numeric with is.numeric():

is.numeric(y)
## [1] TRUE

And if it’s an integer with is.integer():

is.integer(y)
## [1] FALSE

The result is FALSE because a whole number is stored as numeric (double) by default. To create an integer you must append L to the value:

y <- 99L
is.integer(y)
## [1] TRUE

R promotes integers to numeric when needed.
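
A short sketch of when that promotion happens. The variable names are hypothetical.

```r
i <- 5L
intProduct <- i * 2L    # integer times integer stays integer
mixed <- i * 2.5        # integer times double is promoted to numeric
divided <- i / 2L       # division always returns numeric, here 2.5

class(intProduct)       # "integer"
class(mixed)            # "numeric"
class(divided)          # "numeric"
```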

### Character data types

R handles Character data in two primary ways: as character and as factor. They are treated differently:

x <- "data"
x
## [1] "data"
class(x)
## [1] "character"

and

y <- factor("data")
y
## [1] data
## Levels: data

The levels are attributes of that factor.

To find the number of characters in a character (or numeric) value:

nchar(x)
## [1] 4

This does not work for factor data.
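
A possible workaround, sketched below, is to convert the factor to character first. The tryCatch() wrapper is only there so the example keeps running after the error, and the names are hypothetical.

```r
f <- factor("data")

# nchar() on a factor raises an error; capture it instead of stopping
attempt <- tryCatch(nchar(f), error = function(e) "error")
attempt                      # "error"

# Converting to character first gives the expected count
nChars <- nchar(as.character(f))
nChars                       # 4
```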

### Date data types

R has numerous types of dates. Date and POSIXct are the most useful.

date1 <- as.Date("2018-03-28")
date1
## [1] "2018-03-28"
class(date1)
## [1] "Date"
as.numeric(date1)
## [1] 17618

and

date2 <- as.POSIXct("2018-03-28 10:45")
date2
## [1] "2018-03-28 10:45:00 PDT"
class(date2)
## [1] "POSIXct" "POSIXt"
as.numeric(date2)
## [1] 1522259100

as.numeric() returns the underlying number as a plain numeric value; the original date object keeps its class:

class(date1)
## [1] "Date"
class(as.numeric(date1))
## [1] "numeric"
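
Because a Date is stored as a number of days, simple arithmetic works on it directly. The sketch below assumes the same date as above; the other variable names are hypothetical.

```r
date1 <- as.Date("2018-03-28")

nextWeek <- date1 + 7                                  # "2018-04-04"
daysBetween <- as.numeric(as.Date("2018-04-04") - date1)  # 7

# Rebuilding the date from its underlying day count
rebuilt <- as.Date(17618, origin = "1970-01-01")       # "2018-03-28"
```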

### Logical data types

Logicals can be either TRUE (T or 1) or FALSE (F or 0). T and F are not recommended as they are simply shortcuts to TRUE and FALSE and can be overwritten, causing woe, anguish, mayhem, and rioting. (TRUE or F?)

Logical data types have a similar test function is.logical():

k <- TRUE
class(k)
## [1] "logical"
is.logical(k)
## [1] TRUE
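
A sketch of both points: logicals behave as 1 and 0 in arithmetic, and T is an ordinary variable that can be overwritten. The names `flags`, `nTrue`, and `tIsTrue` are hypothetical.

```r
flags <- c(TRUE, FALSE, TRUE)
nTrue <- sum(flags)      # TRUE counts as 1, FALSE as 0, so this is 2

T <- 0                   # legal: T is just a variable, unlike TRUE
tIsTrue <- isTRUE(T)     # FALSE, which is why T and F are risky
rm(T)                    # remove the override so T means TRUE again
```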

## Data Structures

R data structures are containers for data elements:

• Vectors – collections of only same-type elements
• Matrices – rectangular containers of only same-type elements
• Data Frames – contain vectors of many types, all of the same length
• Arrays – multidimensional Vectors whose elements are all of the same type
• Lists – containers for elements of any mix of data types

### Vectors

Vectors are the heart of R; it is a vectorised language. An R Vector is:

A collection of elements of the same type.

Operations are applied to each element of a vector without the need to loop through them. This separates R from other programming languages and makes it most suited to manipulation and graphical presentation of data.

Vectors do not have a dimension: unlike mathematical vectors, there is no distinction between column and row orientation.

#### Creating a vector

Vectors are created with c, meaning “combine”:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
x
## [1] 1 2 3 4 5 6 7 8

Operations are applied to all elements at once:

x + 2
## [1]  3  4  5  6  7  8  9 10
x - 3
## [1] -2 -1  0  1  2  3  4  5
x * 2
## [1]  2  4  6  8 10 12 14 16
x / 4
## [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
x^2
## [1]  1  4  9 16 25 36 49 64
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
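
To make the contrast with explicit looping concrete, the sketch below doubles every element twice, once with a for loop and once with a single vectorised expression. The names `looped` and `vectorised` are hypothetical.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Loop version: visit each element in turn
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- x[i] * 2
}

# Vectorised version: one expression for the whole vector
vectorised <- x * 2

identical(looped, vectorised)   # TRUE
```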

#### Vector creation shortcuts

1:8
## [1] 1 2 3 4 5 6 7 8
8:1
## [1] 8 7 6 5 4 3 2 1
-3:4
## [1] -3 -2 -1  0  1  2  3  4
4:-3
## [1]  4  3  2  1  0 -1 -2 -3

#### Accessing vector elements

Any element of a Vector can be directly accessed using [square brackets] to point to it:

x
## [1] 1 2 3 4 5 6 7 8
x[4]
## [1] 4
x[8]
## [1] 8
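
Square brackets also accept a vector of positions, negative positions, or a logical test. A brief sketch, with hypothetical result names:

```r
x <- 1:8

picked <- x[c(2, 4)]    # elements 2 and 4: 2 4
dropped <- x[-1]        # everything except the first element: 2 3 4 5 6 7 8
filtered <- x[x > 5]    # elements passing a logical test: 6 7 8
beyond <- x[10]         # a position past the end gives NA
```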

#### Counting within Vectors

You can check the length of a vector:

x
## [1] 1 2 3 4 5 6 7 8
length(x)
## [1] 8
y
## [1] data
## Levels: data
length(y)
## [1] 1
length(x + y)
## Warning in Ops.factor(x, y): '+' not meaningful for factors
## [1] 8

and count the number of characters in each element of a vector:

q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
q
## [1] "One"   "Two"   "Three" "Four"  "Five"  "Six"   "Seven" "Eight"
nchar(q)
## [1] 3 3 5 4 4 3 5 5

#### Combining Vectors

Two vectors of the same or different length can be combined:

##### Vectors of the same length

x <- 1:8
x
## [1] 1 2 3 4 5 6 7 8
y <- -3:4
y
## [1] -3 -2 -1  0  1  2  3  4
x + y
## [1] -2  0  2  4  6  8 10 12
x - y
## [1] 4 4 4 4 4 4 4 4
x * y
## [1] -3 -4 -3  0  5 12 21 32
x / y
## [1] -0.3333333 -1.0000000 -3.0000000        Inf  5.0000000  3.0000000
## [7]  2.3333333  2.0000000
x^y
## [1]    1.0000000    0.2500000    0.3333333    1.0000000    5.0000000
## [6]   36.0000000  343.0000000 4096.0000000

##### Vectors of different lengths

For two vectors of different lengths, the shorter vector is recycled. R issues a warning if the longer length is not a multiple of the shorter:

x + c(1, 2)
## [1]  2  4  4  6  6  8  8 10
x + c(1, 2, 3)
## Warning in x + c(1, 2, 3): longer object length is not a multiple of
## shorter object length
## [1]  2  4  6  5  7  9  8 10

#### Comparison of two Vectors

x <- c(1:8)
x
## [1] 1 2 3 4 5 6 7 8
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
y <- c(3:10)
y
## [1]  3  4  5  6  7  8  9 10
x > y
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The all() function tests whether all elements are TRUE:

x <-  10:1
y <-  -4:5
x
##  [1] 10  9  8  7  6  5  4  3  2  1
y
##  [1] -4 -3 -2 -1  0  1  2  3  4  5
all(x < y)
## [1] FALSE

The any() function tests whether any element is TRUE:

any(x < y)
## [1] TRUE
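
Beyond all() and any(), it is often useful to know how many comparisons are TRUE and where they occur. The sketch below reuses the same two vectors; `positions` and `howMany` are hypothetical names.

```r
x <- 10:1
y <- -4:5

positions <- which(x < y)   # indices where the test is TRUE: 9 10
howMany <- sum(x < y)       # number of TRUE results: 2
```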


#### Factor Vectors

Factors are an important concept in R. Factors contain levels, which are the unique values of that factor variable.

q
## [1] "One"   "Two"   "Three" "Four"  "Five"  "Six"   "Seven" "Eight"
qFactor <- as.factor(q)
qFactor
## [1] One   Two   Three Four  Five  Six   Seven Eight
## Levels: Eight Five Four One Seven Six Three Two

Note that the order of levels does not matter unless the ordered argument is set to TRUE:

factor(x=c("High School", "Doctorate", "Masters", "College"),
levels=c("High School", "College", "Masters", "Doctorate"),
ordered=TRUE)
## [1] High School Doctorate   Masters     College
## Levels: High School < College < Masters < Doctorate
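
Once a factor is ordered, its levels can be compared. The sketch below stores the same factor in a hypothetical variable `edu` and inspects it.

```r
edu <- factor(x = c("High School", "Doctorate", "Masters", "College"),
              levels = c("High School", "College", "Masters", "Doctorate"),
              ordered = TRUE)

levels(edu)                       # the levels in their defined order
codes <- as.integer(edu)          # position of each value in the levels: 1 4 3 2
belowMasters <- edu < "Masters"   # TRUE FALSE FALSE TRUE
```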

### Matrices

A familiar mathematical structure, matrices are essential to statistics.

A Matrix is a rectangular structure of rows and columns in which every element is of the same type, often all numerics.

Matrices can be acted upon similarly to Vectors, with element-by-element addition, subtraction, multiplication, division, and equality tests.

#### Creating a Matrix

A <- matrix(1:12, nrow=3)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Any element of a matrix can be directly accessed using [square bracket] co-ordinates:

A[2,3]
## [1] 8
A[3,4]
## [1] 12
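
Leaving one co-ordinate empty selects a whole row or column. The sketch below also shows the byrow argument, which fills the matrix row by row instead of column by column. The name `byRow` is hypothetical.

```r
A <- matrix(1:12, nrow = 3)

secondRow <- A[2, ]     # 2 5 8 11
thirdCol <- A[, 3]      # 7 8 9

# Fill by row instead of the default column-wise fill
byRow <- matrix(1:12, nrow = 3, byrow = TRUE)
byRow[1, ]              # 1 2 3 4
```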

#### Dimensions of a Matrix

nrow(A)
## [1] 3
ncol(A)
## [1] 4
dim(A)
## [1] 3 4

A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
B <-  matrix(13:24, nrow=3)
B
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24
A + B
##      [,1] [,2] [,3] [,4]
## [1,]   14   20   26   32
## [2,]   16   22   28   34
## [3,]   18   24   30   36

#### Multiplying Matrices

A * B
##      [,1] [,2] [,3] [,4]
## [1,]   13   64  133  220
## [2,]   28   85  160  253
## [3,]   45  108  189  288
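
Note that * multiplies element by element. For the matrix product of linear algebra, R provides the %*% operator, which needs conformable dimensions. As a sketch, B is transposed with t() so that a 3x4 matrix is multiplied by a 4x3 matrix; `P` is a hypothetical name.

```r
A <- matrix(1:12, nrow = 3)
B <- matrix(13:24, nrow = 3)

# Matrix product of a 3x4 and a 4x3 matrix gives a 3x3 result
P <- A %*% t(B)
dim(P)          # 3 3

# Each entry is a row-by-row sum of products
P[1, 1] == sum(A[1, ] * B[1, ])   # TRUE
```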

#### Logical querying

A == B
##       [,1]  [,2]  [,3]  [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE

#### Naming rows and columns

colnames(A) <- c("A1", "A2", "A3", "A4")
rownames(A) <- c("First", "Second", "Third")
A
##        A1 A2 A3 A4
## First   1  4  7 10
## Second  2  5  8 11
## Third   3  6  9 12
A["First", "A2"]
## [1] 4
A[1,2]
## [1] 4

Two special vectors – letters and LETTERS – can be used to name matrix columns or rows with lowercase and UPPERCASE letters:

C <- matrix(21:40, nrow=2)
colnames(C) <- LETTERS[1:10]
rownames(C) <- c(letters[1:2])
C
##    A  B  C  D  E  F  G  H  I  J
## a 21 23 25 27 29 31 33 35 37 39
## b 22 24 26 28 30 32 34 36 38 40

### Dataframes

The data.frame is perhaps the primary reason for R’s growing popularity as a powerful, focussed, and flexible language for use in all aspects of Data Science.

A data.frame is a rectangular collection of vectors, all of which are of the same length but possibly of differing data types.

A Data Frame looks like an Excel spreadsheet in that the data is organised into columns and rows. In statistical terms, each column is a variable while each row contains specific observations. Similar to a Matrix only in that it is also rectangular, a data.frame is a much more flexible and comprehensive data structure.

#### Creating a Dataframe

Start with three vectors of equal length:

(x <- 8:1)
## [1] 8 7 6 5 4 3 2 1
(y <- -3:4)
## [1] -3 -2 -1  0  1  2  3  4
(q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"))
## [1] "One"   "Two"   "Three" "Four"  "Five"  "Six"   "Seven" "Eight"

The simplest way of creating a Dataframe is with the data.frame() function:

theDF <- data.frame(x, y, q)
theDF

This creates an 8x3 data.frame consisting of three vectors. Notice that the data types are included below the column headings.

To assign names to the vectors:

theDF <- data.frame(First=x, Second=y, Third=q)
theDF

To assign names to the rows:

rownames(theDF) <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
theDF

#### Examining a Dataframe

The nrow(), ncol(), dim(), rownames(), and names() functions are available to investigate its properties:

(nrow(theDF))
## [1] 8
(ncol(theDF))
## [1] 3
(dim(theDF))
## [1] 8 3
(rownames(theDF))
## [1] "One"   "Two"   "Three" "Four"  "Five"  "Six"   "Seven" "Eight"
(names(theDF))
## [1] "First"  "Second" "Third"

Elements of any vector of a data.frame can be directly accessed using the $ or [row, col] operators:

(theDF$Second)
## [1] -3 -2 -1  0  1  2  3  4
(theDF[7, 3])
## [1] Seven
## Levels: Eight Five Four One Seven Six Three Two
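
The factor result shown above depends on character columns being converted to factors. That was the default of data.frame() before R 4.0.0; from R 4.0.0 the default keeps them as character. The sketch below makes the choice explicit with the stringsAsFactors argument; `dfFactor` and `dfChar` are hypothetical names.

```r
x <- 8:1
y <- -3:4
q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")

# Explicitly convert strings to factors (the older default behaviour)
dfFactor <- data.frame(First = x, Second = y, Third = q, stringsAsFactors = TRUE)
class(dfFactor$Third)    # "factor"

# Explicitly keep strings as character (the default since R 4.0.0)
dfChar <- data.frame(First = x, Second = y, Third = q, stringsAsFactors = FALSE)
class(dfChar$Third)      # "character"
```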

To specify an entire row, leave out the column specification; to specify an entire column, leave out the row specification:

(theDF[2, ])
(theDF[, 2])
## [1] -3 -2 -1  0  1  2  3  4

To specify more than one row or column, use a vector of indices:

(theDF[3:5, 2:3])

To specify multiple columns by name, use a character vector of the column names:

(theDF[, c("First", "Third")])
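
Rows can also be selected with a logical condition in place of indices. A minimal sketch, rebuilding the same data.frame so the block is self-contained; `positive` and `both` are hypothetical names.

```r
theDF <- data.frame(First = 8:1, Second = -3:4,
                    Third = c("One", "Two", "Three", "Four",
                              "Five", "Six", "Seven", "Eight"),
                    stringsAsFactors = FALSE)

# Keep only the rows where Second is positive
positive <- theDF[theDF$Second > 0, ]
nrow(positive)       # 4
positive$Third       # "Five" "Six" "Seven" "Eight"

# subset() expresses the same idea and can combine conditions
both <- subset(theDF, Second > 0 & First < 3)
nrow(both)           # 2
```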

To find the class of the entire data.frame:

(class(theDF))
## [1] "data.frame"

or the class of any vector:

(class(theDF$Third))
## [1] "factor"

#### Displaying a Dataframe

data.frames can be small, large, big, huge, or ginormous, depending on their size. The head() and tail() functions print only the first or last few rows, or the number of rows you set:

(head(theDF))
(head(theDF, n=5))
(tail(theDF, n=5))

### Arrays

An Array is a multidimensional Vector whose elements are all of the same type, with a dimensions attribute (dim) whose entries can also be named (dimnames).

#### Creating Arrays

When creating an Array, the dim argument gives the size of each dimension: the first element is the number of rows, the second the number of columns, and the remaining elements are the outer dimensions (rows, columns, number of matrices):

theArray <- array(1:12, dim = c(2, 3, 2))
theArray
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

#### Accessing Arrays

Individual elements of an Array are accessed using square brackets, similar to a Vector but in this case by [row, column, array #].

theArray[1, , ]
##      [,1] [,2]
## [1,]    1    7
## [2,]    3    9
## [3,]    5   11
theArray[2, , ]
##      [,1] [,2]
## [1,]    2    8
## [2,]    4   10
## [3,]    6   12
theArray[1, , 1]
## [1] 1 3 5
theArray[1, , 2]
## [1]  7  9 11
theArray[, , 1]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
theArray[, , 2]
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

### Lists

Lists are used to store any number of items of any type: all numeric or all character vectors, or a mix of them; complete data.frames; and even other lists.

#### Creating Lists

Lists are created with the list() function. Each argument to the function becomes an element of the list:

list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Single-element lists can contain multi-element vectors:

list(c(1, 2, 3))
## [[1]]
## [1] 1 2 3

Here’s a two-element list with the second element a five-element vector:

list1 <- list(c(1, 2, 3), 3:7)
list1
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7

A two-element list with the first element an array, the second element a ten-element vector:

list2 <- list(theArray, 1:10)
list2
## [[1]]
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10

#### Creating Empty Lists

Empty lists of a determined length are created using vector():

(emptyList <- vector(mode = "list", length = 4))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL

Note: Enclosing an expression in round brackets displays the results immediately after execution.

#### Naming Lists

Lists can have names, and each element of a list can have a unique name:

names(list2)
## NULL
(names(list2) <- c("The Array", "The Vector"))
## [1] "The Array"  "The Vector"
list2
## $`The Array`
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $`The Vector`
##  [1]  1  2  3  4  5  6  7  8  9 10

#### Naming List Elements

Names can also be assigned to list elements during creation using name-value pairs. An element that is itself a list can be named in the same way:

(list3 <- list(theARR=theArray, theVECT=1:10, List3=list2))
## $theARR
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $theVECT
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $List3$`The Vector`
##  [1]  1  2  3  4  5  6  7  8  9 10
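
Named elements can then be retrieved with $ or with double square brackets. The sketch below rebuilds the same lists so it runs on its own; note that single brackets return a one-element list, not the element itself. The result names are hypothetical.

```r
theArray <- array(1:12, dim = c(2, 3, 2))
list2 <- list(theArray, 1:10)
names(list2) <- c("The Array", "The Vector")
list3 <- list(theARR = theArray, theVECT = 1:10, List3 = list2)

third <- list3$theVECT[3]                     # third value of the vector: 3
nested <- list3[["List3"]][["The Vector"]]    # reach into the nested list

class(list3["theVECT"])     # "list": single brackets keep the list wrapper
class(list3[["theVECT"]])   # "integer": double brackets extract the element
```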

New elements can be added to a list by appending a numeric or named index that does not yet exist:

length(list3)
## [1] 3

Adding a numeric index:

list3[[4]] <- 11
length(list3)
## [1] 4
list3
## $theARR
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $theVECT
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $List3$`The Vector`
##  [1]  1  2  3  4  5  6  7  8  9 10
##
##
## [[4]]
## [1] 11

Adding a named index:

list3[["AddedElement"]] <- 12:16
length(list3)
## [1] 5
list3
## $theARR
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $theVECT
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
##
##
## $List3$`The Vector`
##  [1]  1  2  3  4  5  6  7  8  9 10
##
##
## [[4]]
## [1] 11
##
## $AddedElement
## [1] 12 13 14 15 16
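
Elements can be removed as well as added: assigning NULL to an element deletes it. A minimal sketch using a small hypothetical list, `demoList`, so the output stays short:

```r
demoList <- list(first = 1:3, second = "two", third = TRUE)
lengthBefore <- length(demoList)    # 3

# Assigning NULL removes the element entirely
demoList[["second"]] <- NULL
lengthAfter <- length(demoList)     # 2
names(demoList)                     # "first" "third"
```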

## Introducing Functions

R is a functional language, so almost every operation in R involves either creating new functions or accessing functions in package libraries.

An R programmer must choose and load a suitable package, select an appropriate function, and supply the arguments needed to make it work.

This can be as simple as calling a function against an element:

mean(x)
## [1] 4.5

More complicated functions require supplying their arguments either in the correct order, or specifying their name with an equals sign.

Either way means knowing the capabilities and requirements of any function.
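
As a sketch of the two calling styles, the lines below call seq() and round() first by position and then by name. Named arguments can appear in any order. The result names are hypothetical.

```r
# Arguments matched by position
byPosition <- seq(1, 10, 3)                   # 1 4 7 10

# The same call with arguments matched by name
byName <- seq(from = 1, to = 10, by = 3)

identical(byPosition, byName)                 # TRUE

# Named arguments may be given in any order
rounded <- round(digits = 2, x = 3.14159)     # 3.14
```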

### Function Help

If you know the name of a function, entering a question mark followed by the function name in the Console pane will display its documentation in the Help pane:

?mean

For help on a binary operator (e.g. +, *, ==), surround it with backticks:

?`==`

If you are not sure which function to use, you can search using only part of the name with apropos():

apropos("mea")
##  [1] ".colMeans"          ".rowMeans"          "colMeans"
##  [4] "influence.measures" "kmeans"             "mean"
##  [7] "mean.Date"          "mean.default"       "mean.difftime"
## [10] "mean.POSIXct"       "mean.POSIXlt"       "rowMeans"
## [13] "weighted.mean"

Construction and use of functions will be detailed later in this Tutorial series.

## Nothing There?

It’s rare that a data set is complete in every detail. Some observations will be missing, while others may be accurately reported as being not available. Missing values can be represented in many ways: by a dash -, a period ., or even the number 99.

R has two types of missing data: NA and NULL.

### NA

R recognises NA as an element of a Vector.

is.na() tests each element of a Vector:

z <- c(3, 4, NA, 5, 6, NA)
z
## [1]  3  4 NA  5  6 NA
is.na(z)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

This works with any type of Vector.

#### Removing NAs

Some Base R functions return NA if even a single element is NA, for example mean():

mean(z)
## [1] NA

na.rm=TRUE removes any NAs, allowing these functions to proceed:

mean(z, na.rm=TRUE)
## [1] 4.5
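
For functions without an na.rm argument, is.na() can be used to count, locate, or drop the missing values directly. A short sketch with the same vector; the result names are hypothetical.

```r
z <- c(3, 4, NA, 5, 6, NA)

nMissing <- sum(is.na(z))       # how many values are missing: 2
whereMissing <- which(is.na(z)) # their positions: 3 6
complete <- z[!is.na(z)]        # the vector without NAs: 3 4 5 6
```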

### NULL

NULL is very Zen: it is nothingness, which means that there isn’t even anything missing. NULL is atomic and can’t exist inside a Vector. If used in a Vector, it simply disappears:

z <- c(3, 4, NULL, 5, 6)
z
## [1] 3 4 5 6

The test for NULL is is.null():

no <- NULL
is.null(no)
## [1] TRUE
is.null(z)
## [1] FALSE

is.null(z) returns FALSE because the NULL was dropped when the Vector was created, so z is an ordinary numeric Vector.
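
The difference shows up in lengths: NULL has length zero and vanishes from a vector, while a list can hold NULL as an element. A sketch with hypothetical names:

```r
lenNull <- length(NULL)                  # 0
lenVec <- length(c(3, 4, NULL, 5, 6))    # 4: the NULL is dropped

# A list, unlike a vector, can keep NULL as one of its elements
holder <- list(3, NULL, 5)
lenList <- length(holder)                # 3
is.null(holder[[2]])                     # TRUE
```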

## Pipes

Functions can be chained together using the magrittr package, which introduces the %>% operator. For simplicity, %>% can be read as “then”:

library("magrittr")
x <- 1:12
mean(x)
## [1] 6.5
x %>% mean
## [1] 6.5

Using pipes means that code can be read left-to-right, which is more natural than the traditional right-to-left R <- operator. Additional arguments can be named and included inside parentheses after the function call:

z %>% mean(na.rm=TRUE)
## [1] 4.5
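
Pipes become more useful when several steps are chained. So that it runs without extra packages, the sketch below swaps in base R's native |> operator, which assumes R 4.1.0 or later; with magrittr loaded, %>% could be used in the same positions. The vector and result names are hypothetical.

```r
values <- c(3, 4, NA, 5, 6, NA)

# Read left to right: take values, then square roots, then the mean, then round
piped <- values |> sqrt() |> mean(na.rm = TRUE) |> round(2)

# The equivalent nested call has to be read inside out
nested <- round(mean(sqrt(values), na.rm = TRUE), 2)

identical(piped, nested)   # TRUE
```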

## Cheat Sheet

A guide to what we have covered so far (and more) can be found in this PDF: Data Science Free - basicR