Introduction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The data come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which the participants did the exercise (the ‘classe’ variable).

The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Data Processing

Pre-processing/set-up

The necessary packages are loaded: caret for modelling and rattle for plotting the decision tree.

library(caret)
library(rattle)

Data download and read

## Download and unzip the data, provided it doesn't already exist
if(!file.exists("pml-training.csv")){
  fileUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileUrl,destfile="pml-training.csv",method="curl")
}
if(!file.exists("pml-testing.csv")){
  fileUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileUrl,destfile="pml-testing.csv",method="curl")
}
## Read in the data (skip if already loaded)
if(!exists("trainIn")){
  trainIn <- read.csv("pml-training.csv")
}
if(!exists("testIn")){
  testIn <- read.csv("pml-testing.csv")
}
dim(trainIn)
## [1] 19622   160
dim(testIn)
## [1]  20 160
str(trainIn)

As can be seen, the training data set contains 19622 observations of 160 variables, while the testing set contains 20 observations of the same variables.

Cleaning data

Both the training and test data sets need to be trimmed down. The first seven variables are identifiers (row number, user name, timestamps, and window markers) rather than sensor measurements, so they cannot help predict the ‘classe’ variable and are removed, along with variables that are mostly NA and variables with near-zero variance.

train <- trainIn[,-c(1:7)]
trainNZV <- nearZeroVar(train)
train <- train[,-trainNZV]
train <- train[, colSums(is.na(train)) == 0] 

test <- testIn[,-c(1:7)]
testNZV <- nearZeroVar(test)
test <- test[,-testNZV]
test <- test[, colSums(is.na(test)) == 0] 

dim(train)
## [1] 19622    53
dim(test)
## [1] 20 53
## Check to see if data contains missing values
anyNA(train)
## [1] FALSE

Each data set has now been reduced to 53 variables, and the training data contains no missing values.
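As an aside, the near-zero-variance criterion that nearZeroVar applies can be illustrated on toy data. A hedged sketch in base R of the two statistics caret computes (the cutoffs quoted in the comments are caret's documented defaults):

```r
## Illustrative sketch of the near-zero-variance idea that nearZeroVar() automates.
## A predictor is suspect when one value dominates and there are very few
## distinct values relative to the number of observations.
x <- c(rep(0, 98), 1, 2)                       # 98 of 100 values are identical
freq <- sort(table(x), decreasing = TRUE)
freq_ratio     <- as.numeric(freq[1] / freq[2])   # most common vs. second most common value
percent_unique <- 100 * length(unique(x)) / length(x)
freq_ratio      # 98 -> far above caret's default freqCut of 95/5 = 19
percent_unique  # 3  -> far below caret's default uniqueCut of 10
```

Both conditions are met here, so nearZeroVar would flag x as a near-zero-variance predictor.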

Model

Two models will be analysed: a Decision Tree and a Random Forest.

The cleaned training data is partitioned into a 60% training set and a 40% testing (validation) set; the 20-observation test file is held back for the final predictions.

## Data slicing
set.seed(5)
inTrain <- createDataPartition(train$classe, p=0.6, list=FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]

Cross-validation is used as the resampling method, configured with the ‘trainControl’ function. The number of folds is set to 3.

trctrl <- trainControl(method="cv", number=3)
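Under the hood, 3-fold cross-validation splits the training rows into three folds and fits the model three times, each time holding one fold out for scoring. A minimal base-R sketch of the fold assignment (illustrative only; caret's createFolds additionally stratifies the folds by the outcome, which this sketch omits):

```r
## Hedged sketch of a 3-fold assignment; caret performs this internally.
set.seed(5)
n <- 12                                     # pretend we have 12 training rows
folds <- sample(rep(1:3, length.out = n))   # randomly assign each row to a fold
for (k in 1:3) {
  held_out <- which(folds == k)             # rows scored in resampling iteration k
  ## the model is fit on the remaining rows and evaluated on held_out
}
table(folds)                                # each fold gets n/3 = 4 rows
```

The reported accuracy is then the average over the three held-out folds.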

Prediction with Classification Trees

## Training Decision Tree classifier denoted by 'rpart'
modelTree <- train(classe ~ ., method="rpart", trControl=trctrl, data=training)

The resulting decision tree plot can be seen in Appendix 1.

The model is then validated on the held-out testing set.

# display confusion matrix and model accuracy
trainPrTr <- predict(modelTree, testing)
cmTree <- confusionMatrix(testing$classe, trainPrTr)
cmTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2030   40  159    0    3
##          B  655  502  361    0    0
##          C  651   50  667    0    0
##          D  587  233  466    0    0
##          E  205  201  382    0  654
## 
## Overall Statistics
##                                         
##                Accuracy : 0.4911        
##                  95% CI : (0.48, 0.5022)
##     No Information Rate : 0.5261        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.3342        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4918  0.48928  0.32776       NA  0.99543
## Specificity            0.9457  0.85103  0.87937   0.8361  0.89039
## Pos Pred Value         0.9095  0.33070  0.48757       NA  0.45354
## Neg Pred Value         0.6263  0.91719  0.78882       NA  0.99953
## Prevalence             0.5261  0.13077  0.25937   0.0000  0.08374
## Detection Rate         0.2587  0.06398  0.08501   0.0000  0.08335
## Detection Prevalence   0.2845  0.19347  0.17436   0.1639  0.18379
## Balanced Accuracy      0.7187  0.67015  0.60357       NA  0.94291
# Calculation of accuracy and out of sample error
accTree <- sum(trainPrTr == testing$classe)/length(trainPrTr)
ooseTree <- 1 - accTree

Prediction with Random Forest

## Training Random Forest denoted by 'rf'
modelRF <- train(classe ~ ., method="rf", trControl=trctrl, data = training)

The model is then validated on the held-out testing set.

# display confusion matrix and model accuracy
trainPrRF <- predict(modelRF, testing)
cmRF <- confusionMatrix(testing$classe, trainPrRF)
cmRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    0    0    0    0
##          B   13 1497    8    0    0
##          C    0   15 1343   10    0
##          D    0    3   13 1269    1
##          E    0    1    2   12 1427
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9901          
##                  95% CI : (0.9876, 0.9921)
##     No Information Rate : 0.2861          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9874          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9942   0.9875   0.9832   0.9830   0.9993
## Specificity            1.0000   0.9967   0.9961   0.9974   0.9977
## Pos Pred Value         1.0000   0.9862   0.9817   0.9868   0.9896
## Neg Pred Value         0.9977   0.9970   0.9964   0.9966   0.9998
## Prevalence             0.2861   0.1932   0.1741   0.1645   0.1820
## Detection Rate         0.2845   0.1908   0.1712   0.1617   0.1819
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9971   0.9921   0.9897   0.9902   0.9985
# Calculation of accuracy and out of sample error
accRF <- sum(trainPrRF == testing$classe)/length(trainPrRF)
ooseRF <- 1 - accRF

A plot of model accuracy against the number of randomly selected predictors can be seen in Appendix 2; accuracy is highest with 27 predictors. In addition, the varImp function (Appendix 3) ranks the 20 most important variables. The variable roll_belt has the highest importance score, meaning it contributes the most to the model's predictions.

Results

print(data.frame(
    "Model" = c('Classification Tree', 'Random Forest'),
    "Accuracy" = c(accTree, accRF),
    "Out of Sample Error" = c(ooseTree, ooseRF)), digits = 3)
##                 Model Accuracy Out.of.Sample.Error
## 1 Classification Tree    0.491             0.50892
## 2       Random Forest    0.990             0.00994

The table above shows that the Random Forest model has the higher accuracy of the two models and, by extension, the lower out of sample error. It will therefore be used on the test data.

# Prediction of new values
final <- predict(modelRF, newdata=test)
final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Appendix

Appendix 1

fancyRpartPlot(modelTree$finalModel)
Figure 1: Decision Tree

Appendix 2

plot(modelRF)
Figure 2: Accuracy of Random Forest Model versus Randomly selected Predictors

Appendix 3

varImp(modelRF)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## pitch_forearm          59.76
## yaw_belt               53.51
## magnet_dumbbell_y      45.30
## pitch_belt             43.46
## roll_forearm           43.26
## magnet_dumbbell_z      43.06
## accel_dumbbell_y       21.96
## accel_forearm_x        17.66
## roll_dumbbell          16.92
## magnet_dumbbell_x      16.84
## magnet_belt_z          15.76
## accel_dumbbell_z       14.61
## magnet_forearm_z       13.83
## total_accel_dumbbell   13.53
## accel_belt_z           13.08
## gyros_belt_z           12.44
## magnet_belt_y          11.96
## yaw_arm                10.71
## magnet_belt_x          10.64