More Data Structures

BUAN 327 Yegin Genc

Agenda

• Matrices
• Arrays
• Lists
• Dataframes
• Structures of structures

Vector structures, starting with arrays

Many data structures in R are made by adding bells and whistles to vectors, hence the name "vector structures"

A matrix in R is a collection of homogeneous elements arranged in 2 dimensions

matrix(1:15, nrow = 4)
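     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12    1

Note that 15 values cannot fill a 4-row matrix evenly, so R recycles the first value into the last cell and issues a warning.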

Matrices

• A matrix is a vector with a dim attribute, i.e. an integer vector giving the number of rows and columns
• To create matrices use matrix()
• The functions dim(), nrow() and ncol() return the dimensions of the matrix
• Rows and columns can have names: dimnames(), rownames(), colnames()
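
As a minimal sketch of the points above, the following builds a small matrix and inspects its dimensions and names. The objects m and v and the row and column labels are made up for illustration.

```r
# Hypothetical 2 x 3 matrix, filled column by column (the default)
m <- matrix(1:6, nrow = 2)

dim(m)    # number of rows, then number of columns
nrow(m)
ncol(m)

# Attach illustrative row and column names
rownames(m) <- c("a", "b")
colnames(m) <- c("x", "y", "z")
dimnames(m)   # a list: row names first, then column names

# The dim attribute is what turns a plain vector into a matrix
v <- 1:6
dim(v) <- c(2, 3)
is.matrix(v)
```

Setting dim on an existing vector, as in the last step, is one way to see that a matrix is just a vector with extra bookkeeping.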

Running example: resource allocation ("mathematical programming")

Factory makes cars and trucks, using labor and steel

• a car takes 40 hours of labor and 1 ton of steel
• a truck takes 60 hours and 3 tons of steel
• resources: 1600 hours of labor and 70 tons of steel each week

            Labor (hours)   Steel (tons)
Cars                   40              1
Trucks                 60              3
---------   -------------   ------------
Resources            1600             70

Matrices

factory <- matrix(c(40,1,60,3),nrow=2)
is.array(factory)
[1] TRUE
is.matrix(factory)
[1] TRUE

We could also specify ncol, and/or set byrow=TRUE to fill by rows instead of by columns.

Element-wise operations with the usual arithmetic and comparison operators (e.g., factory/3)

Compare whole matrices with identical() or all.equal()
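
A short sketch of these three ideas, assuming the factory matrix defined above. The names factory.byrow and rescaled are introduced here only for illustration.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2)

# Same matrix, but supplying the values row by row
factory.byrow <- matrix(c(40, 60, 1, 3), ncol = 2, byrow = TRUE)
identical(factory, factory.byrow)   # TRUE

# Element-wise arithmetic and comparison
factory / 3     # every entry divided by 3
factory > 10    # logical matrix of the same shape

# identical() demands exact equality; all.equal() allows a small
# numerical tolerance, which is safer after floating-point arithmetic
rescaled <- (factory / 3) * 3
isTRUE(all.equal(factory, rescaled))   # TRUE
```

Wrapping all.equal() in isTRUE() is the usual idiom, since all.equal() returns a description of the differences rather than FALSE when the objects differ.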

Matrix multiplication

Matrix multiplication gets a special operator, %*%

six.sevens <- matrix(rep(7,6),ncol=3)
six.sevens
     [,1] [,2] [,3]
[1,]    7    7    7
[2,]    7    7    7
factory %*% six.sevens # [2x2] * [2x3]
     [,1] [,2] [,3]
[1,]  700  700  700
[2,]   28   28   28

What happens if you try six.sevens %*% factory?

Transpose:

t(factory)
     [,1] [,2]
[1,]   40    1
[2,]   60    3

Determinant:

det(factory)
[1] 60

The diagonal

The diag() function can extract the diagonal entries of a matrix:

diag(factory)
[1] 40  3

It can also change the diagonal:

diag(factory) <- c(35,4)
factory
     [,1] [,2]
[1,]   35   60
[2,]    1    4

Re-set it for later:

diag(factory) <- c(40,3)

diag() can also create a diagonal matrix from a vector, or an identity matrix of a given size:

diag(c(3,4))
     [,1] [,2]
[1,]    3    0
[2,]    0    4
diag(2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Names in matrices

We can name either rows or columns or both, with rownames() and colnames()

These are just character vectors, and we use the same function to get and to set their values

Names help us understand what we're working with

Names can be used to coordinate different objects

rownames(factory) <- c("labor","steel")
colnames(factory) <- c("cars","trucks")
factory
      cars trucks
labor   40     60
steel    1      3
available <- c(1600,70)
names(available) <- c("labor","steel")
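
To illustrate how names coordinate objects, here is a sketch that checks a production plan against the available resources. The plan itself (10 cars, 20 trucks) and the names output and needed are assumptions made for this example, not part of the slides.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2,
                  dimnames = list(c("labor", "steel"), c("cars", "trucks")))
available <- c(labor = 1600, steel = 70)

# Hypothetical weekly production plan (an assumption for illustration)
output <- c(cars = 10, trucks = 20)

# Resources needed: [2x2] %*% [2x1]; selecting columns by name makes sure
# the matrix columns line up with the entries of the plan
needed <- factory[, names(output)] %*% output
needed

# Compare with what is available, matching rows by name rather than position
needed[names(available), 1] <= available
```

Indexing by name means the check still works if someone later reorders the rows of factory or the entries of available.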

Doing the same thing to each row or column

Take the mean: rowMeans(), colMeans(): input is matrix, output is vector. Also rowSums(), etc.

summary(): vector-style summary of each column

colMeans(factory)
  cars trucks
  20.5   31.5
summary(factory)
      cars           trucks
 Min.   : 1.00   Min.   : 3.00
 1st Qu.:10.75   1st Qu.:17.25
 Median :20.50   Median :31.50
 Mean   :20.50   Mean   :31.50
 3rd Qu.:30.25   3rd Qu.:45.75
 Max.   :40.00   Max.   :60.00

Extra*

apply() takes 3 arguments: the array or matrix, then 1 for rows or 2 for columns, then the function to apply to each row or column

rowMeans(factory)
labor steel
   50     2
apply(factory,1,mean)
labor steel
   50     2

What would apply(factory,1,sd) do?
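
The examples above work row by row. As a complementary sketch, the same call with 2 as the second argument works column by column, and the function can be one written on the spot. The anonymous function below is illustrative only.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2,
                  dimnames = list(c("labor", "steel"), c("cars", "trucks")))

# Second argument 2: apply the function to each column
apply(factory, 2, mean)   # same values as colMeans(factory)

# Any function of a vector works, including one defined inline
apply(factory, 2, function(col) max(col) - min(col))   # spread within each column

# A function returning several values gives a matrix,
# with one column of results per column of the input
apply(factory, 2, range)
```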

Arrays

Arrays are basically matrices in higher dimensions

x <- c(7, 8, 10, 45 , 70, 80 , 100, 250)
x.arr <- array(x,dim=c(2,2,2))
x.arr
, , 1

     [,1] [,2]
[1,]    7   10
[2,]    8   45

, , 2

     [,1] [,2]
[1,]   70  100
[2,]   80  250

dim gives the size of each dimension (here rows, columns, and layers); the array is filled by columns

Can have $$3, 4, \ldots n$$ dimensional arrays; dim is a length-$$n$$ vector
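
A small sketch of what column-wise filling means for the 2 x 2 x 2 array above: the first index varies fastest, then the second, then the third.

```r
x <- c(7, 8, 10, 45, 70, 80, 100, 250)
x.arr <- array(x, dim = c(2, 2, 2))

# One index per dimension: row 1, column 2, layer 2
x.arr[1, 2, 2]   # 100

# Leaving an index empty keeps that whole dimension:
# the second layer is an ordinary 2 x 2 matrix
x.arr[, , 2]

# The same element via the underlying vector: position
# 1 + (2 - 1) * 2 + (2 - 1) * 4 = 7
x.arr[7]         # 100
```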

Some properties of the array:

dim(x.arr)
[1] 2 2 2
is.vector(x.arr)
[1] FALSE
is.array(x.arr)
[1] TRUE
typeof(x.arr)
[1] "double"
str(x.arr)
num [1:2, 1:2, 1:2] 7 8 10 45 70 80 100 250
attributes(x.arr)
$dim
[1] 2 2 2

typeof() returns the type of the elements

str() gives the structure: here, a numeric array, with three dimensions, each indexed 1–2, and then the actual numbers

Exercise: try all these with x

Accessing and operating on arrays

Can access a 2-D array either by pairs of indices or by the underlying vector:

x <- c(7, 8, 10, 45)
x.arr <- array(x,dim=c(2,2))
x.arr
     [,1] [,2]
[1,]    7   10
[2,]    8   45
x.arr[1,2]
[1] 10
x.arr[3]
[1] 10

Omitting an index means “all of it”:

x.arr[c(1:2),2]
[1] 10 45
x.arr[,2]
[1] 10 45

Functions on arrays

Using a vector-style function on a vector structure will go down to the underlying vector, unless the function is set up to handle arrays specially:

which(x.arr > 9)
[1] 3 4

Many functions do preserve array structure:

y <- -x
y.arr <- array(y,dim=c(2,2))
y.arr + x.arr
     [,1] [,2]
[1,]    0    0
[2,]    0    0

Others specifically act on each row or column of the array separately:

rowSums(x.arr)
[1] 17 53

We will see a lot more of this idea

Lists

Sequence of values, not necessarily all of the same type

my.distribution <- list("exponential",7,FALSE)
my.distribution
[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

Most of what you can do with vectors you can also do with lists

Expanding and contracting lists

Add to lists with c() (also works with vectors):

my.distribution <- c(my.distribution,7)
my.distribution
[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

[[4]]
[1] 7

Chop off the end of a list by setting the length to something smaller (also works with vectors):

length(my.distribution)
[1] 4
length(my.distribution) <- 3
my.distribution
[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

Extra - Accessing pieces of lists

Can use [ ] as with vectors

or use [[ ]], but only with a single index

[[ ]] drops names and structures, [ ] does not

is.character(my.distribution)
[1] FALSE
is.character(my.distribution[[1]])
[1] TRUE
my.distribution[[2]]^2
[1] 49

What happens if you try my.distribution[2]^2?

What happens if you try [[ ]] on a vector?

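
To make the difference between [ ] and [[ ]] concrete, here is a minimal sketch on a named list. The list params and its element names are hypothetical, chosen only for this example.

```r
# Hypothetical named list mixing types (names are illustrative)
params <- list(family = "exponential", rate = 7, censored = FALSE)

# [ ] returns a shorter list and keeps the names
sub <- params["rate"]
is.list(sub)     # TRUE
names(sub)       # "rate"

# [[ ]] returns the element itself, without the list wrapper
r <- params[["rate"]]
is.numeric(r)    # TRUE
r^2              # 49
```

The rule of thumb this suggests: use [ ] when a list is wanted back, and [[ ]] when the value stored inside is wanted.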
Dataframes

Dataframe = the classic data table, $$n$$ rows for cases, $$p$$ columns for variables

Not just a matrix because columns can have different types

Many matrix functions also work for dataframes (rowSums(), summary(), apply())

but no matrix multiplication of dataframes, even if all columns are numeric

Dataframes, Encore

• 2D tables of data
• Each case/unit is a row
• Each variable is a column
• Variables can be of any type (numbers, text, Booleans, …)
• Both rows and columns can get names

Creating an example dataframe

library(datasets)
states <- data.frame(state.x77, abb=state.abb, region=state.region, division=state.division)

Here data.frame() combines a pre-existing matrix (state.x77), a vector of characters (state.abb), and two vectors of qualitative categorical variables (factors; state.region, state.division)

Column names are preserved or guessed if not explicitly set

colnames(states)
 [1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"
 [6] "HS.Grad"    "Frost"      "Area"       "abb"        "region"
[11] "division"
states[1,]
        Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
        abb region           division
Alabama  AL  South East South Central

Dataframe access

• By row and column index

states[49,3]
[1] 0.7

• By row and column names

states["Wisconsin","Illiteracy"]
[1] 0.7

Dataframe access (cont'd)

• All of a row:

states["Wisconsin",]
          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Wisconsin       4589   4468        0.7    72.48      3    54.5   149 54464
          abb        region           division
Wisconsin  WI North Central East North Central

Exercise: what class is states["Wisconsin",]?

Dataframe access (cont'd.)

• All of a column:

head(states[,3])
[1] 2.1 1.5 1.8 1.9 1.1 0.7
head(states[,"Illiteracy"])
[1] 2.1 1.5 1.8 1.9 1.1 0.7
head(states$Illiteracy)
[1] 2.1 1.5 1.8 1.9 1.1 0.7
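
As a brief sketch of what these column selections return: a single column comes back as a plain vector, while several columns come back as a smaller dataframe. The name two.cols is made up for this example.

```r
library(datasets)
states <- data.frame(state.x77, abb = state.abb, region = state.region,
                     division = state.division)

# A single column selected with [ , j] or $ is a plain vector
is.numeric(states[, "Illiteracy"])
identical(states[, 3], states$Illiteracy)

# Selecting several columns returns a smaller dataframe
two.cols <- states[, c("Income", "Illiteracy")]
class(two.cols)
dim(two.cols)
```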

Dataframe access (cont'd.)

• Rows matching a condition:

states[states$division=="New England", "Illiteracy"]
[1] 1.1 0.7 1.1 0.7 1.3 0.6
states[states$region=="South", "Illiteracy"]
[1] 2.1 1.9 0.9 1.3 2.0 1.6 2.8 0.9 2.4 1.8 1.1 2.3 1.7 2.2 1.4 1.4
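
The condition in the row position is nothing more than a logical vector with one entry per row. A sketch that makes this explicit, with in.south and south as names introduced only for illustration:

```r
library(datasets)
states <- data.frame(state.x77, abb = state.abb, region = state.region,
                     division = state.division)

# The condition is a logical vector, one TRUE/FALSE per state
in.south <- states$region == "South"
length(in.south)   # 50
sum(in.south)      # 16, matching the 16 values printed above

# Put it in the row position; leave the column position
# empty to keep every column
south <- states[in.south, ]
dim(south)

# Conditions can be combined with & (and) and | (or)
states[in.south & states$Illiteracy < 1, "abb"]
```

Storing the condition in a variable first makes it easy to check how many rows it selects before using it.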

Replacing values

Parts or all of the dataframe can be assigned to:

summary(states$HS.Grad)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  37.80   48.05   53.25   53.11   59.15   67.30
states$HS.Grad <- states$HS.Grad/100
summary(states$HS.Grad)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.3780  0.4805  0.5325  0.5311  0.5915  0.6730
states$HS.Grad <- 100*states$HS.Grad
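
Two further forms of assignment, sketched under the same setup: adding a new column, and replacing only the cells that match a condition. The column name HS.Grad.frac, the copy capped, and the cap of 2 are all made up for this example.

```r
library(datasets)
states <- data.frame(state.x77, abb = state.abb, region = state.region,
                     division = state.division)

# Assigning to a column name that does not exist yet adds a new column
states$HS.Grad.frac <- states$HS.Grad / 100
ncol(states)   # 12: the original 11 columns plus the new one

# Replace only the cells chosen by a condition; work on a copy
# so the original table is left alone
capped <- states
capped[capped$Illiteracy > 2, "Illiteracy"] <- 2
max(capped$Illiteracy)   # 2
max(states$Illiteracy)   # unchanged in the original
```

Keeping the rescaled values in a separate column avoids having to undo the change afterwards, as was done with HS.Grad above.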