The essence of R:

`## [1] 1 2 3 4`

(See Vectors later).

**[One]** important difference about R:

**Vector-based**: R operates on whole vectors at once, rather than on one value at a time as in most procedural languages

**[Two]** reasons to use R for Data Science:

- **Designed for data**: R can manipulate big data sets
- **Graphics Are Graspable**: people understand graphical data

**[Three]** fundamental principles of R per John Chambers:

- **Objects**: Everything that exists in R is an object
- **Functions**: Everything that happens in R is a function call
- **Interfaces**: Interfaces to other software are an integral part of R

**[Four]** ways of programming R:

- **Command line**: entering R commands in a terminal
- **Source file**: running a set of commands from a saved file
- **R GUI interface**: available for Mac, Windows, and Linux
- **Code chunks in RStudio**: allows debugging as you write

R has all the basic mathematical functions:

`## [1] 2`

`## [1] 6`

`## [1] 42`

`## [1] 1.333333`

R obeys the standard order of mathematical operations (**PEMDAS**):

- **P**arentheses `( )`
- **E**xponents `^`
- **M**ultiplication `*`
- **D**ivision `/`
- **A**ddition `+`
- **S**ubtraction `-`

`## [1] 42`

The use of white space between operators is recommended.
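A minimal sketch of the order of operations. The numbers are illustrative and are not the tutorial's original commands:

```r
4 + 2 * 7        # 18: multiplication happens before addition
(4 + 2) * 7      # 42: parentheses are evaluated first
2 * 3^2          # 18: the exponent is applied before multiplying
-2^2             # -4: the exponent binds tighter than the minus sign
(-2)^2           # 4
10 / 4           # 2.5: division returns a decimal result
10 %/% 4         # 2: integer division
10 %% 4          # 2: remainder
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.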

Unlike statically typed languages such as C++, R does not require variable types to be declared. An R variable can represent any data type or R object, such as a function, result, or graphical plot. R variables can be reassigned, even to a value of a different type.

- Variable names can **contain alphanumeric characters**, **periods** `.`, and underscores `_`
- **They cannot** *start* with a number or underscore
- Variable names are **case sensitive**

R variable assignment operators are `<-` (default) and `=` (acceptable).

`## [1] 2`

`## [1] 5`

You can also assign left-to-right with `->`, but variables are not often assigned that way.

`## [1] 7`

Assignment operations can be used successively to assign a value to multiple variables:

`## [1] 42`

`## [1] 42`

You can also use the built-in `assign` function:

`## [1] 4`
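The assignment forms described above can be sketched together as follows. The variable names and values are illustrative assumptions, not the tutorial's original commands:

```r
x <- 2            # preferred assignment operator
y = 5             # also accepted
7 -> z            # left-to-right assignment (rare)
a <- b <- 42      # successive assignment: both a and b become 42
assign("j", 4)    # assign() takes the variable name as a string

x <- "now text"   # a variable can be reassigned to a different type
MyValue <- 1      # names are case sensitive: MyValue and myvalue differ
myvalue <- 2
my.value_2 <- 3   # periods and underscores are allowed inside a name
```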

R has four main data types:

- Numeric
- Character (a.k.a Nominal)
- Date
- Logical

You can check the type of a variable with `class(variablename)`:

`## [1] "eh?"`

`## [1] "character"`

`## [1] 99`

`## [1] "numeric"`

`Numeric` data types

Numeric data includes both integers and decimals (positive, negative, and zero), similar to `float` or `double` in other languages. A numeric value stored in a variable is automatically assumed to be numeric in R.

You can test whether data is numeric with `is.numeric()`:

`## [1] TRUE`

And whether it is an integer with `is.integer()`:

`## [1] FALSE`

The result is `FALSE` because, to store a value as an integer, you must append `L` to it:

`## [1] TRUE`

R promotes `integer` values to `numeric` when needed.
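A short sketch of these type checks. The variable names are assumptions chosen for illustration:

```r
s <- "eh?"
class(s)               # "character"

n <- 99
class(n)               # "numeric"
is.numeric(n)          # TRUE
is.integer(n)          # FALSE: a plain 99 is stored as a double

i <- 99L               # the L suffix stores the value as an integer
is.integer(i)          # TRUE
is.numeric(i)          # TRUE: integers also count as numeric

class(i * 2.5)         # "numeric": the integer is promoted to a double
class(5L / 2L)         # "numeric": dividing two integers gives a decimal
```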

`Character` data types

R handles Character data in two primary ways: as `character` and as `factor`. They are treated differently:

`## [1] "data"`

`## [1] "character"`

and

```
## [1] data
## Levels: data
```

The `levels` are attributes of that factor.

To find the length, in characters, of a `character` (or `numeric`) value:

`## [1] 4`

This does not work for `factor` data.
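A sketch of the difference, assuming `nchar()` is the function used to count characters (the original command is not shown):

```r
x <- "data"
class(x)                  # "character"
nchar(x)                  # 4 characters

f <- factor("data")
class(f)                  # "factor"
levels(f)                 # "data": the levels are stored as an attribute

nchar(123456)             # 6: a number is converted to text first

# nchar() on a factor is an error, so convert to character first
result <- tryCatch(nchar(f), error = function(e) "error")
nchar(as.character(f))    # 4
```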

`Date` data types

R has numerous types of dates. `Date` and `POSIXct` are the most useful.

`## [1] "2018-03-28"`

`## [1] "Date"`

`## [1] 17618`

and

`## [1] "2018-03-28 10:45:00 PDT"`

`## [1] "POSIXct" "POSIXt"`

`## [1] 1522259100`

Using `as.numeric` also changes the underlying type:

`## [1] "Date"`

`## [1] "numeric"`
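A sketch of both date types. The UTC time zone is an assumption made so that the result does not depend on the local time zone; 17:45 UTC is the same instant as 10:45 PDT:

```r
d <- as.Date("2018-03-28")
class(d)                   # "Date"
as.numeric(d)              # 17618 days since 1970-01-01

# tz = "UTC" is an assumption; the tutorial's output was produced in PDT
p <- as.POSIXct("2018-03-28 17:45:00", tz = "UTC")
class(p)                   # "POSIXct" "POSIXt"
as.numeric(p)              # 1522259100 seconds since 1970-01-01 00:00:00 UTC

class(as.numeric(d))       # "numeric": the Date class is dropped
```

A `Date` counts days and a `POSIXct` counts seconds, which is why the two underlying numbers differ so much in size.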

`Logical` data types

`Logical`s can be either `TRUE` (`T` or `1`) or `FALSE` (`F` or `0`). `T` and `F` are not recommended as they are simply shortcuts to `TRUE` and `FALSE` and can be overwritten, causing woe, anguish, mayhem, and rioting. (`TRUE` or `F`?)

Logical data types have a similar test function `is.logical()`:

`## [1] "logical"`

`## [1] TRUE`
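A sketch of why `T` and `F` are risky. The variable names are illustrative assumptions:

```r
k <- TRUE
class(k)                      # "logical"
is.logical(k)                 # TRUE

TRUE + TRUE                   # 2: TRUE counts as 1 and FALSE as 0 in arithmetic

# T and F are ordinary variables, so they can be overwritten
T <- 0
overwritten <- isTRUE(T)      # FALSE: T no longer means TRUE
rm(T)                         # remove our copy so T refers to TRUE again
```

`TRUE` and `FALSE` are reserved words and cannot be reassigned, which is why they are the safer choice.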

R data structures are containers for data elements:

- **Vectors**: collections of only *same-type elements*
- **Matrices**: rectangular containers of only *same-type elements*
- **Data Frames**: contain *many types of vectors*, all of the same length
- **Arrays**: Vectors with *dimensions* for each *same-type element*
- **Lists**: containers for elements of *multi-type data types*

Vectors are the heart of R; it is a **vectorised language**. An R `Vector` is:

A collection of elements of the same type.

**Operations are applied to each element of a vector without the need to loop through them**. This separates R from other programming languages and makes it most suited to manipulation and graphical presentation of data.

**Vectors do not have a dimension**: there is no `column` or `row` vector. Unlike mathematical vectors, there is no difference between column or row orientation.

Vectors are created with `c`, meaning “combine”:

`## [1] 1 2 3 4 5 6 7 8`

Operations are applied to all elements at once:

`## [1] 3 4 5 6 7 8 9 10`

`## [1] -2 -1 0 1 2 3 4 5`

`## [1] 2 4 6 8 10 12 14 16`

`## [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00`

`## [1] 1 4 9 16 25 36 49 64`

`## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427`

`## [1] 1 2 3 4 5 6 7 8`

`## [1] 8 7 6 5 4 3 2 1`

`## [1] -3 -2 -1 0 1 2 3 4`

`## [1] 4 3 2 1 0 -1 -2 -3`
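Commands along these lines would produce output like that shown above. This is a sketch: the original commands are not shown, and the name `x` is an assumption:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

x + 2      # 3 4 5 6 7 8 9 10
x - 3
x * 2
x / 4
x^2        # 1 4 9 16 25 36 49 64
sqrt(x)

# The colon operator builds a sequence of consecutive numbers
1:8        # same values as x
8:1        # counts down
-3:4       # -3 -2 -1 0 1 2 3 4
```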

Any element of a `Vector` can be directly accessed using [square brackets] to point to it:

`## [1] 1 2 3 4 5 6 7 8`

`## [1] 4`

`## [1] 8`
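A brief sketch of indexing. The last three forms go slightly beyond what the text describes and are included as common variations:

```r
x <- 1:8
x[4]          # 4: indexing starts at 1, not 0
x[8]          # 8
x[c(1, 3)]    # 1 3: a vector of positions selects several elements
x[-1]         # everything except the first element
x[x > 5]      # 6 7 8: a logical test can also select elements
```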

You can check the length of a vector:

`## [1] 1 2 3 4 5 6 7 8`

`## [1] 8`

```
## [1] data
## Levels: data
```

`## [1] 1`

`## Warning in Ops.factor(x, y): '+' not meaningful for factors`

`## [1] 8`

and count the number of characters in each element of a vector:

`## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"`

`## [1] 3 3 5 4 4 3 5 5`

Two vectors of the same or different length can be combined:

`## [1] 1 2 3 4 5 6 7 8`

`## [1] -3 -2 -1 0 1 2 3 4`

`## [1] -2 0 2 4 6 8 10 12`

`## [1] 4 4 4 4 4 4 4 4`

`## [1] -3 -4 -3 0 5 12 21 32`

```
## [1] -0.3333333 -1.0000000 -3.0000000 Inf 5.0000000 3.0000000
## [7] 2.3333333 2.0000000
```

```
## [1] 1.0000000 0.2500000 0.3333333 1.0000000 5.0000000
## [6] 36.0000000 343.0000000 4096.0000000
```

**For two vectors of different lengths, the shorter vector is recycled**, and R may issue a warning:

`## [1] 2 4 4 6 6 8 8 10`

```
## Warning in x + c(1, 2, 3): longer object length is not a multiple of
## shorter object length
```

`## [1] 2 4 6 5 7 9 8 10`
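A sketch of combining vectors and of recycling. The names `x` and `y` are assumptions; `suppressWarnings()` is used only so that the example runs quietly:

```r
x <- 1:8
y <- -3:4

x + y               # element by element: -2 0 2 4 6 8 10 12
x * y               # -3 -4 -3 0 5 12 21 32

x + c(1, 2)         # c(1, 2) is reused four times: 2 4 4 6 6 8 8 10

# 8 is not a multiple of 3, so R recycles and issues a warning
partial <- suppressWarnings(x + c(1, 2, 3))   # 2 4 6 5 7 9 8 10
```

Recycling is silent when the longer length is an exact multiple of the shorter one, so an unintended length mismatch can go unnoticed.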

`## [1] 1 2 3 4 5 6 7 8`

`## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE`

`## [1] 3 4 5 6 7 8 9 10`

`## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE`

The `all()` function tests whether all elements are `TRUE`:

`## [1] 10 9 8 7 6 5 4 3 2 1`

`## [1] -4 -3 -2 -1 0 1 2 3 4 5`

`## [1] FALSE`

The `any()` function tests whether any element is `TRUE`:

`## [1] TRUE`
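A sketch of both tests, assuming the two vectors shown in the output above are named `x` and `y`:

```r
x <- 10:1
y <- -4:5

x < y          # one TRUE or FALSE per pair of elements
all(x < y)     # FALSE: not every element of x is below y
any(x < y)     # TRUE: at least one is
sum(x < y)     # 2: TRUE counts as 1, so this counts the matches
```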


`Factors` are an important concept in R. **Factors contain levels**, which are the unique values of that `factor` variable.

`## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"`

```
## [1] One Two Three Four Five Six Seven Eight
## Levels: Eight Five Four One Seven Six Three Two
```

Note that the order of `levels` does not matter unless the `ordered` argument is set to `TRUE`:

```
factor(x = c("High School", "Doctorate", "Masters", "College"),
       levels = c("High School", "College", "Masters", "Doctorate"),
       ordered = TRUE)
```

```
## [1] High School Doctorate Masters College
## Levels: High School < College < Masters < Doctorate
```
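A sketch of how levels behave. The names `q`, `f`, and `edu` are assumptions for illustration:

```r
q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
f <- factor(q)
levels(f)          # sorted alphabetically: "Eight" "Five" "Four" ...
as.integer(f)      # each value is stored as the position of its level

edu <- factor(c("High School", "Doctorate", "Masters", "College"),
              levels = c("High School", "College", "Masters", "Doctorate"),
              ordered = TRUE)
edu[1] < edu[2]    # TRUE: High School ranks below Doctorate
max(edu)           # Doctorate
```

Comparisons such as `<` and `max()` are only meaningful for ordered factors.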

A familiar mathematical structure, `matrices` are essential to statistics.

A `Matrix` is a rectangular structure of rows and columns in which every element is of the same type, often all numerics.

`Matrices` can be acted upon similarly to `Vectors`, with **PEMDAS**-style element-by-element addition, subtraction, multiplication, division, and equality.

```
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
```

Any element of a `matrix` can be directly accessed using [square bracket] co-ordinates:

`## [1] 8`

`## [1] 12`

```
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
```

```
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
```

```
## [,1] [,2] [,3] [,4]
## [1,] 14 20 26 32
## [2,] 16 22 28 34
## [3,] 18 24 30 36
```

```
## [,1] [,2] [,3] [,4]
## [1,] 13 64 133 220
## [2,] 28 85 160 253
## [3,] 45 108 189 288
```

```
## [,1] [,2] [,3] [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
```

```
## A1 A2 A3 A4
## First 1 4 7 10
## Second 2 5 8 11
## Third 3 6 9 12
```

`## [1] 4`

`## [1] 4`
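A sketch of matrix creation, indexing, and element-by-element operations. The names `A` and `B` are assumptions; the row and column names are taken from the output above:

```r
A <- matrix(1:12, nrow = 3)     # filled column by column
B <- matrix(13:24, nrow = 3)

A[2, 3]       # 8: row 2, column 3
A[3, 4]       # 12

A + B         # element-by-element addition
A * B         # element-by-element product, not matrix multiplication
A == B        # all FALSE

rownames(A) <- c("First", "Second", "Third")
colnames(A) <- c("A1", "A2", "A3", "A4")
nrow(A)       # 3
ncol(A)       # 4
```

True matrix multiplication uses the `%*%` operator and requires compatible dimensions.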

Two special built-in `vectors`, `letters` and `LETTERS`, can be used to give matrix columns or rows lowercase or UPPERCASE letter names:

```
## A B C D E F G H I J
## a 21 23 25 27 29 31 33 35 37 39
## b 22 24 26 28 30 32 34 36 38 40
```
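A sketch that would produce a matrix like the one above. The name `M` is an assumption:

```r
letters[1:3]     # "a" "b" "c"
LETTERS[24:26]   # "X" "Y" "Z"

M <- matrix(21:40, nrow = 2)
colnames(M) <- LETTERS[1:10]
rownames(M) <- letters[1:2]
M["b", "C"]      # 26: names can be used in place of index numbers
```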

The `data.frame` is perhaps the primary reason for R’s growing popularity as a powerful, focussed, and flexible language for use in all aspects of Data Science.

A `data.frame` is a rectangular collection of vectors, all of which are of the same length but which may be of differing data types.

A `Data Frame` looks like an **Excel spreadsheet** in that the data is organised into **columns** and **rows**. In statistical terms, each column is a *variable* while each row contains specific *observations*. Similar to a Matrix only in that it is also rectangular, a `data.frame` is a much more flexible and comprehensive data structure.

Using the existing vectors:

`## [1] 8 7 6 5 4 3 2 1`

`## [1] -3 -2 -1 0 1 2 3 4`

`## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"`

The simplest way of creating a `data.frame` is with the `data.frame()` function:

This creates an 8x3 `data.frame` consisting of three `vectors`. Notice that the data types are included below the column headings.

To assign names to the `vectors`:

To assign names to the rows:

The `nrow()`, `ncol()`, `dim()`, `rownames()`, and `names()` functions are available to investigate its properties:

`## [1] 8`

`## [1] 3`

`## [1] 8 3`

`## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"`

`## [1] "First" "Second" "Third"`
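A sketch of the creation and naming steps described above. The name `theDF` is hypothetical, and `stringsAsFactors = TRUE` is an assumption made to reproduce the factor column shown in this tutorial (recent R versions default to `FALSE`):

```r
x <- 8:1
y <- -3:4
q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")

# stringsAsFactors = TRUE is an assumption; see the note above
theDF <- data.frame(x, y, q, stringsAsFactors = TRUE)

names(theDF) <- c("First", "Second", "Third")    # column names
rownames(theDF) <- q                             # row names

nrow(theDF)        # 8
ncol(theDF)        # 3
dim(theDF)         # 8 3
rownames(theDF)    # "One" "Two" "Three" ...
names(theDF)       # "First" "Second" "Third"
```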

Elements of any `vector` of a `data.frame` can be directly accessed using the `$` or `[row, col]` operators:

`## [1] -3 -2 -1 0 1 2 3 4`

```
## [1] Seven
## Levels: Eight Five Four One Seven Six Three Two
```

To specify an entire row, leave out the column specification; to specify an entire column, leave out the row specification:

`## [1] -3 -2 -1 0 1 2 3 4`

To specify more than one row or column, use a `vector` of indices:

To specify multiple columns by name, use a `character vector` of the column names:

To find the `class` of the entire `data.frame`:

`## [1] "data.frame"`

or the `class` of any `vector`:

`## [1] "factor"`
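The access patterns above can be sketched as follows. `theDF` is a hypothetical name, and `stringsAsFactors = TRUE` is again an assumption made to match the `factor` output shown:

```r
q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
theDF <- data.frame(First = 8:1, Second = -3:4, Third = q,
                    stringsAsFactors = TRUE)

theDF$Second                    # one column by name
theDF[7, 3]                     # row 7, column 3: Seven
theDF[2, ]                      # entire second row
theDF[, 2]                      # entire second column
theDF[c(1, 3), 2:3]             # rows 1 and 3, columns 2 and 3
theDF[, c("First", "Third")]    # columns chosen by name

class(theDF)                    # "data.frame"
class(theDF$Third)              # "factor"
```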

`data.frames` can be small, large, big, huge, or ginormous, depending on their size. The `head()` and `tail()` functions print only the first or last few rows, or the number of rows you set:
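A small sketch, using a hypothetical 100-row `data.frame` named `bigDF`:

```r
bigDF <- data.frame(id = 1:100, square = (1:100)^2)

head(bigDF)           # first 6 rows by default
tail(bigDF)           # last 6 rows
head(bigDF, n = 3)    # first 3 rows
tail(bigDF, n = 2)    # last 2 rows
```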

An `Array` is a multidimensional Vector whose elements are all of the same type, and which also has a dimensions attribute (`dim`) whose entries can be named (`dimnames`).

`Arrays`

To create an `Array`, supply a vector of dimensions: the first element is the number of rows, the second the number of columns, and the remaining elements are the sizes of the outer dimensions (`row`, `column`, `number of arrays`):

```
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
```

Individual elements of an `Array` are accessed using square brackets similar to a `Vector` but in this case by `[row, column, array #]`.

```
## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
## [3,] 5 11
```

```
## [,1] [,2]
## [1,] 2 8
## [2,] 4 10
## [3,] 6 12
```

`## [1] 1 3 5`

`## [1] 7 9 11`

```
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
```

```
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
```
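A sketch that would produce output like that shown above. The name `theArray` is an assumption:

```r
theArray <- array(1:12, dim = c(2, 3, 2))   # 2 rows, 3 columns, 2 layers

dim(theArray)          # 2 3 2
theArray[1, , ]        # row 1 of every layer, as a 3 x 2 matrix
theArray[1, , 1]       # 1 3 5: row 1 of the first layer
theArray[1, , 2]       # 7 9 11
theArray[, , 1]        # the whole first layer
theArray[2, 3, 2]      # 12: a single element
```

Leaving an index blank selects everything along that dimension.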

`Lists` are used to store any number of items of any type: all `numeric` or all `character` vectors, or a mix of them; complete `data.frames`; and even other `lists`.

`Lists` are created with the `list()` function. Each argument to the function becomes an element of the list:

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
```

Single-element lists can contain multi-element vectors:

```
## [[1]]
## [1] 1 2 3
```

Here’s a two-element list with the second element a five-element `vector`:

```
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7
```

A two-element `list` with the first element an `array`, the second element a ten-element `vector`:

```
## [[1]]
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
```

Empty `lists` of a predetermined length are created using the `vector()` function:

```
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
```

**Note**: Enclosing an expression in round brackets displays the results immediately after execution.
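The list-building steps above can be sketched as follows. All variable names are assumptions for illustration:

```r
list1 <- list(1, 2, 3)               # three elements, one number each
list2 <- list(c(1, 2, 3))            # one element holding a three-number vector
list3 <- list(c(1, 2, 3), 3:7)       # two elements of different lengths

theArray <- array(1:12, dim = c(2, 3, 2))
list4 <- list(theArray, 1:10)        # an array and a vector side by side

# Wrapping the assignment in parentheses also prints the result
(emptyList <- vector(mode = "list", length = 4))

length(list1)        # 3
length(list2)        # 1
list3[[2]]           # 3 4 5 6 7: double brackets extract one element
```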

`Lists` can have names, and each element of a `list` can have a unique name:

`## NULL`

`## [1] "The Array" "The Vector"`

```
## $`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
```

Names can also be assigned to `list` elements during creation using name-value pairs. This can also include naming the `list` itself:

```
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
```

New elements can be added to a `list` by appending a `numeric` or `named` index that does not yet exist:

`## [1] 3`

Adding a `numeric` index:

`## [1] 4`

```
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
##
##
## [[4]]
## [1] 11
```

Adding a `named` index:

`## [1] 5`

```
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
##
##
## [[4]]
## [1] 11
##
## $AddedElement
## [1] 12 13 14 15 16
```
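A sketch of naming and extending a list. The element names follow the output above; the variable names `list3` and `list4` are assumptions:

```r
theArray <- array(1:12, dim = c(2, 3, 2))
list3 <- list(theArray, 1:10)

names(list3)                                   # NULL: no names yet
names(list3) <- c("The Array", "The Vector")

# Names can also be given as name-value pairs at creation
list4 <- list(theARR = theArray, theVECT = 1:10, List3 = list3)
length(list4)                                  # 3

list4[[4]] <- 11                               # new numeric index
length(list4)                                  # 4

list4[["AddedElement"]] <- 12:16               # new named index
length(list4)                                  # 5
names(list4)      # "theARR" "theVECT" "List3" "" "AddedElement"
```

The element added by numeric index has an empty name, which is why it prints as `[[4]]` rather than `$name`.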

R is a functional language, so almost every operation in R involves either creating new functions or accessing functions in package libraries.

**An R programmer must choose and load a suitable package, select an appropriate function, and supply the arguments needed to make it work**.

This can be as simple as calling a function against an element:

`## [1] 4.5`

More complicated functions require their arguments to be supplied either in the correct order or by name, using an equals sign.

Either way means knowing the capabilities and requirements of any function.
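A sketch of positional and named arguments, using base R functions chosen for illustration:

```r
x <- 1:8
mean(x)                            # 4.5

# Arguments matched by position: from, to, by
seq(1, 10, 2)                      # 1 3 5 7 9

# The same call with named arguments, which may come in any order
seq(by = 2, from = 1, to = 10)

round(3.14159, digits = 2)         # 3.14

args(seq.default)                  # lists the arguments a function accepts
```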

**If you know the name of a function**, entering a question mark followed by the function name in the *Console* pane will display its documentation in the *Viewer* pane:

For help on a binary operator (e.g. `+`, `*`, `==`), surround it with backticks:

**If you are not sure which function to use**, you can search using only part of the name with `apropos()`:

```
## [1] ".colMeans" ".rowMeans" "colMeans"
## [4] "influence.measures" "kmeans" "mean"
## [7] "mean.Date" "mean.default" "mean.difftime"
## [10] "mean.POSIXct" "mean.POSIXlt" "rowMeans"
## [13] "weighted.mean"
```
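A sketch of the help lookups described above. The `interactive()` guard is an addition so that the example also runs as a script:

```r
# Opening help pages only makes sense in an interactive session
if (interactive()) {
  ?mean           # documentation for a named function
  ?`+`            # operators need backticks
  ?`==`
}

matches <- apropos("mean")    # every visible object with "mean" in its name
"mean" %in% matches           # TRUE
```

The exact list returned by `apropos()` depends on which packages are loaded.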

Construction and use of functions will be detailed later in this Tutorial series.

It’s rare that a data set is complete in every detail. Some observations will be missing, while others may be accurately reported as being not available. Missing values can be represented in many ways: by a dash `-`, a period `.`, or even the number `99`.

`R` has two types of missing data: `NA` and `NULL`.

**R recognises NA as an element of a Vector**.

`is.na()` tests each element of a Vector:

`## [1] 3 4 NA 5 6 NA`

`## [1] FALSE FALSE TRUE FALSE FALSE TRUE`

This works with any type of Vector.

Some `Base R` functions return `NA` if even a single element is `NA`, for example `mean()`:

`## [1] NA`

`na.rm=TRUE` removes any NAs, allowing these functions to proceed:

`## [1] 4.5`
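A sketch of handling `NA`, assuming the vector shown above is named `z`:

```r
z <- c(3, 4, NA, 5, 6, NA)

is.na(z)                   # FALSE FALSE TRUE FALSE FALSE TRUE
sum(is.na(z))              # 2 missing values
length(z)                  # 6: each NA still occupies a position

mean(z)                    # NA
mean(z, na.rm = TRUE)      # 4.5

is.na(c("a", NA, "c"))     # works for character vectors too
```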

`NULL` is very Zen: it is nothingness, which means that there isn’t even anything missing. `NULL` is atomical and can’t exist in a Vector. If used in a Vector, it simply disappears:

`## [1] 3 4 5 6`

The test for `NULL` is `is.null()`:

`## [1] TRUE`

`## [1] FALSE`

`is.null(z)` returns `FALSE` because `z` is still a Vector: the `NULL` placed within it was dropped and is not recognised.
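A sketch of this behaviour. The names `z` and `d` are assumptions:

```r
z <- c(3, 4, NULL, 5, 6)
z                  # 3 4 5 6
length(z)          # 4: the NULL left no gap behind

d <- NULL
is.null(d)         # TRUE
is.null(z)         # FALSE

length(NULL)       # 0, whereas length(NA) is 1
```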

**Functions can be chained together using the magrittr package**, which introduces the `%>%` operator:

`## [1] 6.5`

`## [1] 6.5`

**Using pipes means that code can be read left-to-right**, which is more natural than the traditional right-to-left R `<-` operator. **Additional arguments** can be named and included **inside parentheses** after the function call:

`## [1] 4.5`
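A sketch of piping. The data are illustrative rather than the tutorial's own, and the `requireNamespace()` guard is an addition so that the example still runs where magrittr is not installed:

```r
x <- c(3, 4, NA, 5, 6, NA)   # illustrative data, not the tutorial's

nested <- mean(x, na.rm = TRUE)          # traditional call: 4.5
piped <- NULL

# magrittr is an add-on package, so only use it if it is installed
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped <- x %>% mean(na.rm = TRUE)      # same result, read left to right

  # Several steps chained: flag the NAs, then count them
  na_count <- x %>% is.na() %>% sum()    # 2
}
```

In everyday use, `library(magrittr)` at the top of a script makes `%>%` available without the explicit assignment.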

A guide to what we have covered so far (and more) can be found in this PDF: Data Science Free - basicR