Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The data are taken from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways (class A is the correct execution; classes B-E are common mistakes). The goal of this project is to predict the manner in which the participants did the exercise, recorded in the ‘classe’ variable.
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
The necessary packages are loaded.
library(caret)    # model training, resampling, and evaluation
library(rattle)   # fancyRpartPlot for the decision tree diagram
## Download the data files, provided they don't already exist
if(!file.exists("pml-training.csv")){
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileUrl, destfile = "pml-training.csv", method = "curl")
}
if(!file.exists("pml-testing.csv")){
  fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileUrl, destfile = "pml-testing.csv", method = "curl")
}
## Read in the data, unless it is already in the workspace
if(!"trainIn" %in% ls()){
  trainIn <- read.csv("pml-training.csv")
}
if(!"testIn" %in% ls()){
  testIn <- read.csv("pml-testing.csv")
}
dim(trainIn)
## [1] 19622 160
dim(testIn)
## [1] 20 160
str(trainIn)
As can be seen, the training data set contains 19622 observations of 160 variables, while the testing set contains 20 observations of the same 160 variables.
Both the training and test data sets need to be trimmed down - the first seven variables (row index, user name, timestamps, and window indicators) cannot help predict the ‘classe’ variable and are therefore removed, along with near-zero-variance variables and variables consisting mostly of NA values.
## Remove the first seven columns, near-zero-variance columns, and
## columns containing NA values from the training set
train <- trainIn[, -c(1:7)]
trainNZV <- nearZeroVar(train)
train <- train[, -trainNZV]
train <- train[, colSums(is.na(train)) == 0]
train$classe <- as.factor(train$classe)  # read.csv in R >= 4.0 returns character
## Apply the same cleaning steps to the test set
test <- testIn[, -c(1:7)]
testNZV <- nearZeroVar(test)
test <- test[, -testNZV]
test <- test[, colSums(is.na(test)) == 0]
dim(train)
## [1] 19622 53
dim(test)
## [1] 20 53
## Check to see if data contains missing values
anyNA(train)
## [1] FALSE
Both data sets have now been reduced to 53 variables, with no missing values remaining.
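Because the two sets were cleaned independently, it is worth confirming that they kept matching predictor columns. A quick sanity check, assuming the standard column names of this data set (‘classe’ in the training set, ‘problem_id’ in its place in the test set):
## Confirm that only the outcome/identifier columns differ
setdiff(names(train), names(test))   # expected: "classe"
setdiff(names(test), names(train))   # expected: "problem_id"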
Two models will be analysed: a decision tree and a random forest.
The data is partitioned to create a 60% training set and a 40% test set.
## Data slicing
set.seed(5)
inTrain <- createDataPartition(train$classe, p=0.6, list=FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
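The resulting partition sizes can be confirmed directly; they should be roughly 60% and 40% of the 19,622 rows:
## Confirm the sizes of the two partitions
dim(training)
dim(testing)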
Cross-validation is used as the resampling method, specified with the ‘trainControl’ function; the number of folds is set to 3.
trctrl <- trainControl(method="cv", number=3)
## Training Decision Tree classifier denoted by 'rpart'
modelTree <- train(classe ~ ., method="rpart", trControl=trctrl, data=training)
The resulting decision tree diagram can be seen in Appendix 1 (Figure 1).
The model is then validated on the held-out 40% partition.
# display confusion matrix and model accuracy
trainPrTr <- predict(modelTree, testing)
cmTree <- confusionMatrix(testing$classe, trainPrTr)
cmTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2030 40 159 0 3
## B 655 502 361 0 0
## C 651 50 667 0 0
## D 587 233 466 0 0
## E 205 201 382 0 654
##
## Overall Statistics
##
## Accuracy : 0.4911
## 95% CI : (0.48, 0.5022)
## No Information Rate : 0.5261
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3342
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4918 0.48928 0.32776 NA 0.99543
## Specificity 0.9457 0.85103 0.87937 0.8361 0.89039
## Pos Pred Value 0.9095 0.33070 0.48757 NA 0.45354
## Neg Pred Value 0.6263 0.91719 0.78882 NA 0.99953
## Prevalence 0.5261 0.13077 0.25937 0.0000 0.08374
## Detection Rate 0.2587 0.06398 0.08501 0.0000 0.08335
## Detection Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Balanced Accuracy 0.7187 0.67015 0.60357 NA 0.94291
# Calculation of accuracy and out of sample error
accTree <- sum(trainPrTr == testing$classe)/length(trainPrTr)
ooseTree <- 1 - accTree
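The same accuracy is also available directly from the confusionMatrix object, so the manual calculation above mainly serves as a cross-check:
## Accuracy as reported by confusionMatrix
cmTree$overall["Accuracy"]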
## Training Random Forest denoted by 'rf'
modelRF <- train(classe ~ ., method="rf", trControl=trctrl, data = training)
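Fitting a random forest through caret on roughly 11,800 training rows can take several minutes. If runtime is a concern, a smaller number of trees can be passed through to the underlying randomForest call; a sketch only, as the results below were produced with the defaults:
## Optional faster fit: fewer trees than the default 500
## (ntree is passed through to randomForest); not used for the results below
# modelRFfast <- train(classe ~ ., method = "rf", trControl = trctrl,
#                      data = training, ntree = 100)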
The random forest model is likewise validated on the held-out partition.
# display confusion matrix and model accuracy
trainPrRF <- predict(modelRF, testing)
cmRF <- confusionMatrix(testing$classe, trainPrRF)
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 0 0 0 0
## B 13 1497 8 0 0
## C 0 15 1343 10 0
## D 0 3 13 1269 1
## E 0 1 2 12 1427
##
## Overall Statistics
##
## Accuracy : 0.9901
## 95% CI : (0.9876, 0.9921)
## No Information Rate : 0.2861
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9874
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9942 0.9875 0.9832 0.9830 0.9993
## Specificity 1.0000 0.9967 0.9961 0.9974 0.9977
## Pos Pred Value 1.0000 0.9862 0.9817 0.9868 0.9896
## Neg Pred Value 0.9977 0.9970 0.9964 0.9966 0.9998
## Prevalence 0.2861 0.1932 0.1741 0.1645 0.1820
## Detection Rate 0.2845 0.1908 0.1712 0.1617 0.1819
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9971 0.9921 0.9897 0.9902 0.9985
# Calculation of accuracy and out of sample error
accRF <- sum(trainPrRF == testing$classe)/length(trainPrRF)
ooseRF <- 1 - accRF
A plot of model accuracy against the number of randomly selected predictors can be seen in Appendix 2 (Figure 2); accuracy peaks at 27 predictors, the value chosen by cross-validation. In addition, the varImp function (Appendix 3) lists the 20 most important variables. The variable roll_belt has the highest importance, meaning it contributes the most to the model's predictions.
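The tuning value selected by cross-validation can also be read directly from the fitted caret object:
## mtry value chosen by cross-validation (27 here)
modelRF$bestTune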
print(data.frame(
  "Model" = c('Classification Tree', 'Random Forest'),
  "Accuracy" = c(accTree, accRF),
  "Out of Sample Error" = c(ooseTree, ooseRF)), digits = 3)
## Model Accuracy Out.of.Sample.Error
## 1 Classification Tree 0.491 0.50892
## 2 Random Forest 0.990 0.00994
The table above shows that the random forest model has the higher accuracy of the two and, by extension, the lower out-of-sample error, so it is used on the test data.
# Prediction of new values
final <- predict(modelRF, newdata=test)
final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
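If the predictions need to be kept, they can be written out alongside their problem identifiers; a minimal sketch, with an arbitrary file name:
## Save predictions with their problem IDs (file name is illustrative)
write.csv(data.frame(problem_id = test$problem_id, prediction = final),
          "predictions.csv", row.names = FALSE)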
Appendix 1
fancyRpartPlot(modelTree$finalModel)
Figure 1: Decision Tree
Appendix 2
plot(modelRF)
Figure 2: Accuracy of the Random Forest Model versus the Number of Randomly Selected Predictors
Appendix 3
varImp(modelRF)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 59.76
## yaw_belt 53.51
## magnet_dumbbell_y 45.30
## pitch_belt 43.46
## roll_forearm 43.26
## magnet_dumbbell_z 43.06
## accel_dumbbell_y 21.96
## accel_forearm_x 17.66
## roll_dumbbell 16.92
## magnet_dumbbell_x 16.84
## magnet_belt_z 15.76
## accel_dumbbell_z 14.61
## magnet_forearm_z 13.83
## total_accel_dumbbell 13.53
## accel_belt_z 13.08
## gyros_belt_z 12.44
## magnet_belt_y 11.96
## yaw_arm 10.71
## magnet_belt_x 10.64