Introduction to R
Basics of Data

BUAN 327
Yegin Genc

Agenda

  • Course overview and mechanics
  • Built-in data types
  • Built-in functions and operators
  • First data structures: Vectors and arrays

Why good statisticians learn to program

  • Independence: Otherwise, you rely on someone else having given you exactly the right tool

  • Honesty: Otherwise, you end up distorting your problem to match the tools you have

  • Clarity: Making your method something a machine can do disciplines your thinking and makes it public; that's science

How this class will work

  • No programming knowledge presumed
  • Some stats. knowledge presumed
  • General programming mixed with data-manipulation and statistical inference
  • Class will be very cumulative

Mechanics

  • Lectures: concepts, methods, examples
  • In-class Assignments to try stuff out and get fast feedback
  • HW weekly to do longer and more complex things
  • Projects:
    • Project 1: Descriptive statistics
    • Project 2: Inferencial Statistics
  • Midterm and Final: in class, hands-on

R as statistical programming environment

Download and review at https://www.r-project.org/ alt text

The R Console

alt text

Basic interaction with R is by typing in the console, a.k.a. terminal or command-line

You type in commands, R gives back answers (or errors)

Menus and other graphical interfaces are extras built on top of the console

RStudio as the user interface for R

Statistical programming in a nutshell: Functional programming

2 sorts of things (objects): data and functions

  • Data: things like 7, “seven”, \( 7.000 \), the matrix \( \left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right] \)

  • Functions: things like \( \log{} \), \( + \) (two arguments), \( < \) (two), \( \mod{} \) (two), mean (one)

A function is a machine which turns input objects (arguments) into an output object (return value), possibly with side effects, according to a definite rule

Before functions, data

Different kinds of data object

All data is represented in binary format, by bits (TRUE/FALSE, YES/NO, 1/0)

  • Booleans Direct binary values: TRUE or FALSE in R
  • Integers: whole numbers (positive, negative or zero), represented by a fixed-length block of bits
  • Characters fixed-length blocks of bits, with special coding; strings = sequences of characters
  • Floating point numbers: a fraction (with a finite number of bits) times an exponent, like \( 1.87 \times {10}^{6} \), but in binary form
  • Missing or ill-defined values: NA, NaN, etc.

R as a calculator - Operators

7+5
[1] 12
7-5
[1] 2
7*5
[1] 35
7^5
[1] 16807
7/5
[1] 1.4
7 %% 5
[1] 2
7 %/% 5
[1] 1

Operators cont'd.

Comparisons are also binary operators; they take two objects, like numbers, and give a Boolean

7 > 5
[1] TRUE
7 < 5
[1] FALSE
7 >= 7
[1] TRUE
7 <= 5
[1] FALSE
7 == 5
[1] FALSE
7 != 5
[1] TRUE

Boolean operators

Basically “and” and “or”:

(5 > 7) & (6*7 == 42)
[1] FALSE
(5 > 7) | (6*7 == 42)
[1] TRUE

More types

typeof() function returns the type

is.foo() functions return Booleans for whether the argument is of type foo

as.foo() (tries to) “cast” its argument to type foo — to translate it sensibly into a foo-type value

typeof(7)
[1] "double"
is.numeric(7)
[1] TRUE
is.na(7)
[1] FALSE
is.na(7/0)
[1] FALSE
is.na(0/0)
[1] TRUE

Why is 7/0 not NA, but 0/0 is?

is.character(7)
[1] FALSE
is.character("7")
[1] TRUE
is.character("seven")
[1] TRUE
is.na("seven")
[1] FALSE
as.character(5/6)
[1] "0.833333333333333"
as.numeric(as.character(5/6))
[1] 0.8333333
6*as.numeric(as.character(5/6))
[1] 5
5/6 == as.numeric(as.character(5/6))
[1] FALSE

(why is that last FALSE?)

Data can have names

We can give names to data objects; these give us variables

A few variables are built in:

pi
[1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi*10
[1] 31.41593
cos(pi)
[1] -1

Most variables are created with the assignment operator, <- or =

approx.pi <- 22/7
approx.pi
[1] 3.142857
diameter.in.cubits = 10
approx.pi*diameter.in.cubits
[1] 31.42857

The assignment operator also changes values:

circumference.in.cubits <- approx.pi*diameter.in.cubits
circumference.in.cubits
[1] 31.42857
circumference.in.cubits <- 30
circumference.in.cubits
[1] 30

Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read

Avoid “magic constants”; use named variables you will be graded on this!

Named variables are a first step towards abstraction

The workspace

What names have you defined values for?

ls()
[1] "approx.pi"               "circumference.in.cubits"
[3] "diameter.in.cubits"     
objects()
[1] "approx.pi"               "circumference.in.cubits"
[3] "diameter.in.cubits"     

Getting rid of variables:

rm("circumference.in.cubits")
ls()
[1] "approx.pi"          "diameter.in.cubits"

First data structure: vectors

Group related data values into one object, a data structure

A vector is a sequence of values, all of the same type

x <- c(7, 8, 10, 45)
x
[1]  7  8 10 45
is.vector(x)
[1] TRUE

c() function returns a vector containing all its arguments in order

x[1] is the first element, x[4] is the 4th element
x[-4] is a vector containing all but the fourth element

vector(length=6) returns an empty vector of length 6; helpful for filling things up later

weekly.hours <- vector(length=5)
weekly.hours[5] <- 8

Vector arithmetic

Operators apply to vectors “pairwise” or “elementwise”:

y <- c(-7, -8, -10, -45)
x+y
[1] 0 0 0 0
x*y
[1]   -49   -64  -100 -2025

Recycling

Recycling repeats elements in shorter vector when combined with longer

x + c(-7,-8)
[1]  0  0  3 37
x^c(1,0,-1,0.5)
[1] 7.000000 1.000000 0.100000 6.708204

Single numbers are vectors of length 1 for purposes of recycling:

2*x
[1] 14 16 20 90

Can also do pairwise comparisons:

x > 9
[1] FALSE FALSE  TRUE  TRUE

Note: returns Boolean vector

Boolean operators work elementwise:

(x > 9) & (x < 20)
[1] FALSE FALSE  TRUE FALSE

To compare whole vectors, best to use identical() or all.equal():

x == -y
[1] TRUE TRUE TRUE TRUE
identical(x,-y)
[1] TRUE
identical(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
[1] FALSE
all.equal(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
[1] TRUE

Functions on vectors

Lots of functions take vectors as arguments:

  • mean(), median(), sd(), var(), max(), min(), length(), sum(): return single numbers
  • sort() returns a new vector
  • hist() takes a vector of numbers and produces a histogram, a highly structured object, with the side-effect of making a plot
  • Similarly ecdf() produces a cumulative-density-function object
  • summary() gives a five-number summary of numerical vectors
  • any() and all() are useful on Boolean vectors

Addressing vectors

Vector of indices:

x[2];x[4]
[1] 8
[1] 45
x[c(2,4)]
[1]  8 45

Vector of negative indices

x[c(-1,-3)]
[1]  8 45

(why that, and not 8 10?)

Boolean vector:

x[x>9]
[1] 10 45
y[x>9]
[1] -10 -45

which() turns a Boolean vector in vector of TRUE indices:

places <- which(x > 9)
places
[1] 3 4
y[places]
[1] -10 -45

Named components

You can give names to elements or components of vectors

names(x) <- c("v1","v2","v3","fred")
names(x)
[1] "v1"   "v2"   "v3"   "fred"
x[c("fred","v1")]
fred   v1 
  45    7 

note the labels in what R prints; not actually part of the value

names(x) is just another vector (of characters):

names(y) <- names(x)
sort(names(x))
[1] "fred" "v1"   "v2"   "v3"  
which(names(x)=="fred")
[1] 4

Take-Aways

  • We write programs by composing functions to manipulate data
  • The basic data types let us represent Booleans, numbers, and characters
  • Data structure let us group related values together
  • Vectors let us group values of the same type
  • Use variables rather a profusion of magic constants
  • Name components of structures to make data more meaningful