Data Preprocessing




IS 665 Data Mining, Data Warehousing and Visualization

Agenda

  • Loading your data
  • Cleaning your data
    • Technically correct data
    • Consistent data
    • Handling missing data
    • Misclassification and outliers
  • Data Manipulation

Why preprocess?

Data is usually not in the form we need it to be.

  • It is not in a form that data mining models will accept
  • Fields may be obsolete or redundant
  • There are missing values
  • Values may be inconsistent with policy or common sense

Data preprocessing alone can account for 10–60% of the time and effort of the entire data mining process.

Statistical Analysis Steps

  • Raw Data: Not necessarily all correct, and not formatted for systems to accept.
  • Technically Correct Data: Can be read by the system, but that doesn't mean the values are correct (e.g. negative age values, missing data).
  • Consistent Data: Data ready for analysis and interpretation.
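To illustrate these stages, here is a small sketch with a made-up age column (the variable name and values are invented for this example):

```r
# Hypothetical raw 'age' column, read in as text
raw.age <- c("34", "28", "-1", "NA", "45")

# Technically correct: store numeric data as numbers
# ("NA" becomes a missing value, with a coercion warning)
age <- as.numeric(raw.age)

# Consistent: values must also make sense (ages cannot be negative)
age[which(age < 0)] <- NA
summary(age)
```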

Loading Data

Loading data

  • There are many data sets available in R. They are good for exploring / practicing R functions and for testing your scripts.
# install.packages('datasets')
require("datasets")
`?`(datasets)
library(help = "datasets")
data("airquality")
head(airquality, 3)  #View(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
  • But mostly you will load your own data sets…

Loading Data: read.table and its cousins

  • The following commands are good for loading fixed-width or CSV-like formats (not for XML-like formats):
read.table      read.fwf
read.csv        read.csv2
read.delim      read.delim2
  • All of them return a data.frame.
  • Always check the first line: is it a row of column names or a data point?
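To see what the header option does, here is a small sketch using read.csv's text argument with an inline, made-up CSV (so no file is needed):

```r
csv.text <- "name,score\nalice,90\nbob,85"

# header = TRUE (the default for read.csv): first line becomes column names
with.header <- read.csv(text = csv.text)
names(with.header)  # "name" "score"

# header = FALSE: the first line is treated as a data row,
# and columns get the default names V1, V2, ...
no.header <- read.csv(text = csv.text, header = FALSE)
nrow(no.header)  # 3 rows instead of 2
```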

Loading data (cont.)

With read.table you can load data either from a web source or from a local file.

  • Below is an example of reading data from an online source.
data.url = "http://yegingenc.com/lectures/data/SampleStudentGrades.txt"
sample.grades = read.table(data.url, header = T)
head(sample.grades)
  Student_ID Semester Grades
1          1  14_Fall     82
2          2  14_Fall     73
3          3  14_Fall     88
4          4  14_Fall     96
5          5  14_Fall     77
6          6  14_Fall     51

You need to be very careful with the options, e.g. what happens if we don't set header = T?

  • Try creating a dummy CSV file and see if you can load it into R. Run ?read.csv for help on how to use the function.
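One way to run this exercise without creating a file by hand, sketched here with a temporary file (the column names and values are made up):

```r
# Write a dummy CSV to a temporary file, then load it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, grade = c(80, 91, 77)), tmp, row.names = FALSE)

dummy <- read.csv(tmp)  # header = TRUE is the default here
str(dummy)              # check that column types were guessed correctly
unlink(tmp)             # clean up the temporary file
```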

Other ways to read data

With readLines you can read data in completely raw form. While this requires more work to make the data look like what read.table produces, it gives you full control over how each line is interpreted and transformed into fields in a rectangular data set.

data.url = "http://yegingenc.com/lectures/data/SampleStudentGrades.txt"
sample.grades2 = readLines(data.url)
head(sample.grades2)
[1] "Student_ID\tSemester\tGrades" "1\t14_Fall\t82"              
[3] "2\t14_Fall\t73"               "3\t14_Fall\t88"              
[5] "4\t14_Fall\t96"               "5\t14_Fall\t77"              
  • We will see how to clean this data next

Cleaning Data

1. Technically correct data

2. Consistent Data

Cleaning Data 1: Technically correct data

Converting raw data to technically correct data means making sure that

  • variables are correctly populated and each value belongs to a variable,
  • data types in the system match the variable types (text variables are stored as text, numeric variables as numbers, etc.).

To this end, when working with data in R, make sure:

  • Data is stored as a data.frame with suitable column names,
  • The type of each column in the data.frame matches the value domain of the variable.
    • Numeric data –> numeric or integer
    • Textual data –> character
    • Categorical data –> factor
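A common pitfall with these conversions is going from factor back to numeric: as.numeric applied to a factor returns the internal level codes, not the original values. A small sketch with made-up values:

```r
x <- c("12", "7", "12")      # numeric data stored as text
as.numeric(x)                # 12 7 12
f <- as.factor(x)            # factor with levels "12" "7"

as.numeric(f)                # 1 2 1  -- level codes, not the values!
as.numeric(as.character(f))  # 12 7 12 -- the safe way
```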

Cleaning Data 1 (cont.)

read.table does some of the data cleaning for you. For example, all numerical values are converted to the numeric type.

# Reminder: sample.grades was loaded by read.table
summary(sample.grades)
   Student_ID         Semester      Grades     
 Min.   :  1.0   14_Fall  :48   Min.   :21.00  
 1st Qu.: 32.5   15_Fall  :51   1st Qu.:72.00  
 Median : 64.0   15_Spring:28   Median :81.50  
 Mean   : 64.0                  Mean   :77.98  
 3rd Qu.: 95.5                  3rd Qu.:87.03  
 Max.   :127.0                  Max.   :99.28  

summary() is a good function for getting descriptive statistics for each variable in the data set. Depending on the variable type, it gives different statistics.

  • For numeric variables (e.g. Grades): mean, median, and quartiles.
  • For categorical variables (e.g. Semester): frequencies.

Cleaning Data 1 (cont.)

   Student_ID         Semester      Grades     
 Min.   :  1.0   14_Fall  :48   Min.   :21.00  
 1st Qu.: 32.5   15_Fall  :51   1st Qu.:72.00  
 Median : 64.0   15_Spring:28   Median :81.50  
 Mean   : 64.0                  Mean   :77.98  
 3rd Qu.: 95.5                  3rd Qu.:87.03  
 Max.   :127.0                  Max.   :99.28  

  • The problem, though, is that sometimes numeric values are used for other purposes, for example Student_ID. (A mean Student_ID doesn't make sense.)
  • So we either change it to a factor or make it the row name.
# Convert it to factor
sample.grades$Student_ID <- as.factor(sample.grades$Student_ID)


# Alternatively set it to a row name (and remove it from the
# dataset)
row.names(sample.grades) <- as.character(sample.grades$Student_ID)
sample.grades <- sample.grades[, -1]
sample.grades["1", ]
  Semester Grades
1  14_Fall     82

Cleaning Data 1 (cont.)

With readLines you can exercise precise control over how each line is interpreted and transformed into fields in a rectangular data set, but this requires more work.

data2 = readLines("http://yegingenc.com/lectures/data/SampleStudentGrades.txt")

# Split each line by tab
grades.raw = strsplit(data2, "\t")

# First line is the column names so let's save them
grades.colnames <- grades.raw[[1]]

# Create vectors for each variable and identify their
# variable types
student.ids <- sapply(grades.raw[-1], function(x) x[1])
semester <- sapply(grades.raw[-1], function(x) x[2])
s.grades <- sapply(grades.raw[-1], function(x) x[3])

# Create a data frame, identifying the variable types all
# at once
grades.df <- data.frame(semester = as.factor(semester), grade = as.numeric(s.grades), 
    row.names = student.ids)
summary(grades.df)
      semester      grade      
 14_Fall  :48   Min.   :21.00  
 15_Fall  :51   1st Qu.:72.00  
 15_Spring:28   Median :81.50  
                Mean   :77.98  
                3rd Qu.:87.03  
                Max.   :99.28  
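The per-column sapply steps above can also be collapsed by binding the split lines into a character matrix first. A sketch of this alternative, using a few inline lines instead of the URL so it runs standalone:

```r
lines <- c("Student_ID\tSemester\tGrades",
           "1\t14_Fall\t82",
           "2\t14_Fall\t73")
parts <- strsplit(lines, "\t")

m <- do.call(rbind, parts[-1])  # character matrix of the data rows
colnames(m) <- parts[[1]]       # first line holds the column names

grades.df2 <- data.frame(Semester = as.factor(m[, "Semester"]),
                         grade = as.numeric(m[, "Grades"]),
                         row.names = m[, "Student_ID"])
grades.df2["1", "grade"]  # 82
```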

Cleaning Data 2: Consistent data