More Data Structures

BUAN 327
Yegin Genc

Agenda

Matrices
Arrays
Lists
Dataframes
Structures of structures

Vector structures, starting with arrays

Many data structures in R are made by adding bells and whistles to vectors, hence the name “vector structures”

A matrix in R is a collection of homogeneous elements arranged in 2 dimensions

matrix(1:16, nrow = 4)

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Matrices

A matrix is a vector with a dim attribute, i.e. an integer vector giving the number of rows and columns
To create matrices use matrix()
The functions dim(), nrow() and ncol() return the dimensions of the matrix
Rows and columns can have names: dimnames(), rownames(), colnames()
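
A minimal sketch of the points above, using made-up objects v and m (not from the slides): giving a plain vector a dim attribute turns it into a matrix, and dimnames() attaches row and column names.

```r
# Start from a plain vector and give it a dim attribute
v <- 1:6
dim(v) <- c(2, 3)        # 2 rows, 3 columns, filled by columns

is.matrix(v)             # TRUE: a vector with a dim attribute is a matrix
dim(v)                   # 2 3
nrow(v)                  # 2
ncol(v)                  # 3

# The same matrix built with matrix(), then given names
m <- matrix(1:6, nrow = 2)
identical(v, m)          # TRUE
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))
m["r2", "c"]             # 6
```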

Running example: resource allocation ("mathematical programming")

Factory makes cars and trucks, using labor and steel

a car takes 40 hours of labor and 1 ton of steel
a truck takes 60 hours and 3 tons of steel
resources: 1600 hours of labor and 70 tons of steel each week

            Labor (hours)   Steel (tons)
Cars                   40              1
Trucks                 60              3
---------   -------------   ------------
Resources            1600             70

Matrices

factory <- matrix(c(40,1,60,3),nrow=2)
is.array(factory)

[1] TRUE

is.matrix(factory)

[1] TRUE

Could also specify ncol, and/or byrow=TRUE to fill by rows instead of by columns.
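
A quick sketch of the two fill orders. The names by.col and by.row are illustrative only; both calls should produce the same matrix as factory.

```r
# Same four numbers, two fill orders
by.col <- matrix(c(40, 1, 60, 3), nrow = 2)                 # default: fill by columns
by.row <- matrix(c(40, 60, 1, 3), ncol = 2, byrow = TRUE)   # fill by rows

by.row
identical(by.col, by.row)   # TRUE
```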

Element-wise operations with the usual arithmetic and comparison operators (e.g., factory/3)

Compare whole matrices with identical() or all.equal()
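
A small sketch of element-wise operations and whole-matrix comparison. The matrices a and b are made up for illustration; they show why all.equal() is usually the safer test for floating-point values.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2)
factory / 3      # each entry divided by 3
factory >= 3     # logical matrix, same shape as factory

# Round-off: squaring a square root need not give back the exact input
a <- matrix(c(2, 3, 5, 7), nrow = 2)
b <- sqrt(a)^2
identical(a, b)            # FALSE: entries differ in the last bits
isTRUE(all.equal(a, b))    # TRUE: equal up to numerical tolerance
```

all.equal() returns a description of the differences rather than FALSE, so wrap it in isTRUE() when you need a logical.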

Matrix multiplication

Gets a special operator

six.sevens <- matrix(rep(7,6),ncol=3)
six.sevens

     [,1] [,2] [,3]
[1,]    7    7    7
[2,]    7    7    7

factory %*% six.sevens # [2x2] * [2x3]

     [,1] [,2] [,3]
[1,]  700  700  700
[2,]   28   28   28

What happens if you try six.sevens %*% factory?

Matrix operators

Transpose:

t(factory)

     [,1] [,2]
[1,]   40    1
[2,]   60    3

Determinant:

det(factory)

[1] 60

The diagonal

The diag() function can extract the diagonal entries of a matrix:

diag(factory)

[1] 40  3

It can also change the diagonal:

diag(factory) <- c(35,4)
factory

     [,1] [,2]
[1,]   35   60
[2,]    1    4

Re-set it for later:

diag(factory) <- c(40,3)

Creating a diagonal or identity matrix

diag(c(3,4))

     [,1] [,2]
[1,]    3    0
[2,]    0    4

diag(2)

     [,1] [,2]
[1,]    1    0
[2,]    0    1
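
One way to see what these matrices do, as a sketch: multiplying by the identity leaves factory unchanged, and multiplying on the right by a diagonal matrix rescales each column.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2)

# The identity matrix leaves factory unchanged
identical(factory %*% diag(2), factory)   # TRUE

# A diagonal matrix on the right rescales columns: column 1 by 3, column 2 by 4
scaled <- factory %*% diag(c(3, 4))
scaled
```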

Names in matrices

We can name either rows or columns or both, with rownames() and colnames()

These are just character vectors, and we use the same function to get and to set their values

Names help us understand what we're working with

Names can be used to coordinate different objects

rownames(factory) <- c("labor","steel")
colnames(factory) <- c("cars","trucks")
factory

      cars trucks
labor   40     60
steel    1      3

available <- c(1600,70)
names(available) <- c("labor","steel")
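
A sketch of how names coordinate objects in the running example. The production plan output is hypothetical (made-up numbers, not from the slides); indexing by name keeps the plan, the matrix, and the resources lined up.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2,
                  dimnames = list(c("labor", "steel"), c("cars", "trucks")))
available <- c(labor = 1600, steel = 70)

# Hypothetical weekly plan: 20 cars and 10 trucks
output <- c(cars = 20, trucks = 10)

# Index by name so the plan matches the matrix columns
needed <- factory %*% output[colnames(factory)]
needed                                   # labor 1400, steel 50

# Compare with resources, again matched by name
needed[, 1] <= available[rownames(factory)]
```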

Doing the same thing to each row or column

Take the mean: rowMeans(), colMeans(): input is matrix, output is vector. Also rowSums(), etc.

summary(): vector-style summary of each column

colMeans(factory)

  cars trucks 
  20.5   31.5

summary(factory)

      cars           trucks     
 Min.   : 1.00   Min.   : 3.00  
 1st Qu.:10.75   1st Qu.:17.25  
 Median :20.50   Median :31.50  
 Mean   :20.50   Mean   :31.50  
 3rd Qu.:30.25   3rd Qu.:45.75  
 Max.   :40.00   Max.   :60.00

Extra*

apply() takes 3 arguments: the array or matrix, then 1 for rows or 2 for columns, then the name of the function to apply to each row or column

rowMeans(factory)

labor steel 
   50     2

apply(factory,1,mean)

labor steel 
   50     2
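
A further sketch of apply(), this time over columns and with a small anonymous function. The function computing the range is only an illustration.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2,
                  dimnames = list(c("labor", "steel"), c("cars", "trucks")))

apply(factory, 2, max)     # per column: cars 40, trucks 60
apply(factory, 2, mean)    # same values as colMeans(factory)

# Any function of a vector works, including one written on the spot
spread <- apply(factory, 1, function(r) max(r) - min(r))
spread                     # per row: labor 20, steel 2
```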

What would apply(factory,1,sd) do?

Arrays

Arrays are basically matrices in higher dimensions

x <- c(7, 8, 10, 45 , 70, 80 , 100, 250)
x.arr <- array(x,dim=c(2,2,2))
x.arr

, , 1

     [,1] [,2]
[1,]    7   10
[2,]    8   45

, , 2

     [,1] [,2]
[1,]   70  100
[2,]   80  250

dim gives the extent of each dimension (here 2 rows, 2 columns, 2 layers); the array is filled by columns, one layer at a time

Can have \( 3, 4, \ldots n \) dimensional arrays; dim is a length-\( n \) vector
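
A sketch of indexing in more than two dimensions: one index per dimension, with the layer last. The 4-dimensional array four.d is made up for illustration.

```r
x.arr <- array(c(7, 8, 10, 45, 70, 80, 100, 250), dim = c(2, 2, 2))

x.arr[1, 2, 2]     # row 1, column 2, layer 2: 100
x.arr[, , 2]       # the whole second layer, a 2x2 matrix

# A 4-dimensional array: dim is a length-4 vector
four.d <- array(1:24, dim = c(2, 3, 2, 2))
length(dim(four.d))   # 4
```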

Some properties of the array:

dim(x.arr)

[1] 2 2 2

is.vector(x.arr)

[1] FALSE

is.array(x.arr)

[1] TRUE

typeof(x.arr)

[1] "double"

str(x.arr)

 num [1:2, 1:2, 1:2] 7 8 10 45 70 80 100 250

attributes(x.arr)

$dim
[1] 2 2 2

typeof() returns the type of the elements

str() gives the structure: here, a numeric array, with three dimensions, each indexed 1–2, and then the actual numbers

Exercise: try all these with x

Accessing and operating on arrays

Can access a 2-D array either by pairs of indices or by the underlying vector:

x <- c(7, 8, 10, 45)
x.arr <- array(x,dim=c(2,2))
x.arr

     [,1] [,2]
[1,]    7   10
[2,]    8   45

x.arr[1,2]

[1] 10

x.arr[3]

[1] 10

Omitting an index means “all of it”:

x.arr[c(1:2),2]

[1] 10 45

x.arr[,2]

[1] 10 45
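
The link between the two kinds of index can be sketched as follows: because the array is stored column by column, position [i, j] sits at place i + (j - 1) * nrow in the underlying vector. The variables i, j and k are illustrative.

```r
x.arr <- array(c(7, 8, 10, 45), dim = c(2, 2))

i <- 1; j <- 2
k <- i + (j - 1) * nrow(x.arr)   # 3
x.arr[k] == x.arr[i, j]          # TRUE: both are 10

# Going the other way: vector position 3 is row 1, column 2
arrayInd(3, dim(x.arr))

# Extracting one column gives a plain vector unless drop = FALSE
col2 <- x.arr[, 2, drop = FALSE]
dim(col2)                        # 2 1
```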

Functions on arrays

Using a vector-style function on a vector structure will go down to the underlying vector, unless the function is set up to handle arrays specially:

which(x.arr > 9)

[1] 3 4

Many functions do preserve array structure:

y <- -x
y.arr <- array(y,dim=c(2,2))
y.arr + x.arr

     [,1] [,2]
[1,]    0    0
[2,]    0    0

Others specifically act on each row or column of the array separately:

rowSums(x.arr)

[1] 17 53

We will see a lot more of this idea

Lists

Sequence of values, not necessarily all of the same type

my.distribution <- list("exponential",7,FALSE)
my.distribution

[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

Most of what you can do with vectors you can also do with lists
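
A few vector-style operations on the list above, as a sketch. The component names family, mean and is.symmetric are hypothetical, chosen only for illustration.

```r
my.distribution <- list("exponential", 7, FALSE)

length(my.distribution)           # 3
shorter <- my.distribution[-1]    # negative index drops the first element
length(shorter)                   # 2
backwards <- rev(my.distribution) # reversed order, still a list

# Elements can be named, just like vector elements
names(my.distribution) <- c("family", "mean", "is.symmetric")
my.distribution$family            # "exponential"
```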

Expanding and contracting lists

Add to lists with c() (also works with vectors):

my.distribution <- c(my.distribution,7)
my.distribution

[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

[[4]]
[1] 7

Chop off the end of a list by setting the length to something smaller (also works with vectors):

length(my.distribution)

[1] 4

length(my.distribution) <- 3
my.distribution

[[1]]
[1] "exponential"

[[2]]
[1] 7

[[3]]
[1] FALSE

Extra - Accessing pieces of lists

Can use [ ] as with vectors
Or use [[ ]], but only with a single index
[[ ]] drops names and structures, [ ] does not

is.character(my.distribution)

[1] FALSE

is.character(my.distribution[[1]])

[1] TRUE

my.distribution[[2]]^2

[1] 49

What happens if you try my.distribution[2]^2? What happens if you try [[ ]] on a vector?

Dataframes

Dataframe = the classic data table, \( n \) rows for cases, \( p \) columns for variables

Not just a matrix because columns can have different types

Many matrix functions also work for dataframes (rowSums(), summary(), apply())

but no matrix multiplication of dataframes, even if all columns are numeric
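
A sketch of that limitation, using a small made-up dataframe df that holds the factory numbers: rowSums() works directly, %*% does not, and converting with as.matrix() first is one way around it.

```r
df <- data.frame(cars = c(40, 1), trucks = c(60, 3),
                 row.names = c("labor", "steel"))

rowSums(df)                    # works: labor 100, steel 4

# %*% refuses a dataframe, even an all-numeric one
attempt <- try(df %*% c(1, 1), silent = TRUE)
inherits(attempt, "try-error") # TRUE

# Convert to a matrix first
product <- as.matrix(df) %*% c(1, 1)
product
```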

Dataframes, Encore

2D tables of data
Each case/unit is a row
Each variable is a column
Variables can be of any type (numbers, text, Booleans, …)
Both rows and columns can get names

Creating an example dataframe

library(datasets)
states <- data.frame(state.x77, abb=state.abb, region=state.region, division=state.division)

Here data.frame() combines a pre-existing matrix (state.x77), a character vector (state.abb), and two factors, i.e. qualitative categorical variables (state.region, state.division)

Column names are preserved or guessed if not explicitly set

colnames(states)

 [1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"    
 [6] "HS.Grad"    "Frost"      "Area"       "abb"        "region"    
[11] "division"

states[1,]

        Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
        abb region           division
Alabama  AL  South East South Central

Dataframe access

By row and column index

states[49,3]

[1] 0.7

By row and column names

states["Wisconsin","Illiteracy"]

[1] 0.7

Dataframe access (cont'd.)

All of a row:

states["Wisconsin",]

          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Wisconsin       4589   4468        0.7    72.48      3    54.5   149 54464
          abb        region           division
Wisconsin  WI North Central East North Central

Exercise: what class is states["Wisconsin",]?

Dataframe access (cont'd.)

All of a column:

head(states[,3])

[1] 2.1 1.5 1.8 1.9 1.1 0.7

head(states[,"Illiteracy"])

[1] 2.1 1.5 1.8 1.9 1.1 0.7

head(states$Illiteracy)

[1] 2.1 1.5 1.8 1.9 1.1 0.7

Dataframe access (cont'd.)

Rows matching a condition:

states[states$division=="New England", "Illiteracy"]

[1] 1.1 0.7 1.1 0.7 1.3 0.6

states[states$region=="South", "Illiteracy"]

 [1] 2.1 1.9 0.9 1.3 2.0 1.6 2.8 0.9 2.4 1.8 1.1 2.3 1.7 2.2 1.4 1.4
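
Conditions can be combined with & and |, and several columns can be selected at once. A sketch, where the cutoff of 1.5 is an arbitrary choice for illustration:

```r
library(datasets)
states <- data.frame(state.x77, abb = state.abb,
                     region = state.region, division = state.division)

# Southern states with illiteracy below an illustrative cutoff
low.south <- states[states$region == "South" & states$Illiteracy < 1.5,
                    c("abb", "Illiteracy")]
low.south

# Selecting two columns keeps the result a dataframe
class(low.south)   # "data.frame"
```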

Replacing values

Parts or all of the dataframe can be assigned to:

summary(states$HS.Grad)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  37.80   48.05   53.25   53.11   59.15   67.30

states$HS.Grad <- states$HS.Grad/100
summary(states$HS.Grad)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3780  0.4805  0.5325  0.5311  0.5915  0.6730

states$HS.Grad <- 100*states$HS.Grad
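
Assignment also works on just the cells picked out by an index or a condition. A sketch on a small made-up dataframe scores (not from the slides):

```r
# Hypothetical data, for illustration only
scores <- data.frame(id = 1:4, value = c(5, -2, 7, -1))

# Replace only the entries matching a condition
scores[scores$value < 0, "value"] <- 0
scores$value       # 5 0 7 0

# Replace a single cell by row and column
scores[2, "id"] <- 20L
scores$id          # 1 20 3 4
```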

Adding rows and columns

We can add rows or columns to an array or data-frame with rbind() and cbind(), but be careful about forced type conversions
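
A sketch of both behaviors. The objects a.matrix and a.data.frame are defined here with made-up values, as an assumption about what such an example might look like: rbind() on a dataframe keeps each column's type, while rbind() on a matrix forces everything to a single type.

```r
# Made-up example objects
a.matrix <- matrix(c(35, 8, 10, 4), nrow = 2,
                   dimnames = list(NULL, c("v1", "v2")))
a.data.frame <- data.frame(a.matrix, logicals = c(TRUE, FALSE))

# Dataframe: the new row is a list, each column keeps its own type
bigger <- rbind(a.data.frame, list(v1 = -3, v2 = -5, logicals = TRUE))
sapply(bigger, class)     # numeric, numeric, logical

# Matrix: adding a character row converts every entry to character
forced <- rbind(a.matrix, c("a", "b"))
typeof(forced)            # "character"

# cbind() with a logical column turns TRUE/FALSE into 1/0
flagged <- cbind(a.matrix, flag = c(TRUE, FALSE))
typeof(flagged)           # "double"
```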

