Activity Quality Prediction using sensor data

Outline

  1. Executive Summary
  2. Code Walkthrough
    1. Data preparation
    2. Feature selection and Model training
    3. Results

Executive Summary

Background:
The aim of this project was to classify the way in which a particular activity was performed, in order to give potential users feedback on how well they performed that activity. This is the logical step beyond using sensor data to identify which activity is being performed. The subtle challenge is that it is harder to classify how well an activity is being performed than to distinguish one type of activity from another.
Methodology:
For this analysis I split the training data set into 3 folds to generate training, validation and testing subsets on which to train and validate my chosen machine learning (ML) approach. The ML method of choice was the random forest, using the caret package implementation in R. I performed a 3-fold cross-validation to estimate the out-of-sample error: each fold was used in turn to train the model, and prediction accuracy was then estimated on the remaining two folds. Since I did not use the validation folds to optimise the model, both held-out folds in each cycle were suitable for use in the out-of-sample error estimate.
Results:
This methodology was used to fit a random forest model with an expected out-of-sample accuracy of 96.77 % ± 0.21 %, implying an expected out-of-sample error of only 3.23 % ± 0.21 %.

Code Walkthrough

The training and testing of the model are described in more detail in the sections below. Note that, for better code clarity, I separated the code into the main instructions (in the project_code.R file) and helper functions designed to streamline the analysis workflow (helper_functions.R).
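
The contents of helper_functions.R are not reproduced in this report. For context, the sketch below illustrates one plausible form of the two helpers used later (subsetFeatures and testModel); it assumes subsetFeatures simply keeps the requested columns plus the classe outcome in each fold, and that testModel wraps caret's confusionMatrix, so the actual implementations may differ.

######## Illustrative helper sketches (not the actual helper_functions.R) ####
subsetFeatures <- function(folds, features) {
  # keep only the selected feature columns plus the outcome in every fold
  lapply(folds, function(fold) fold[, c(features, "classe")])
}
testModel <- function(newData, model) {
  # predict on a data fold and summarise performance as a confusion matrix
  confusionMatrix(predict(model, newdata = newData), newData$classe)
}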

Data preparation

The data used for this analysis come from the Weight Lifting Exercises Dataset (http://groupware.les.inf.puc-rio.br/har). The data were split into 3 even partitions (enough data in each fold to fit the model, but not enough to overfit); the model was trained on one fold and then evaluated on the other two folds. Importantly, this splitting process partitioned the data at random, so the model would avoid the risk of overfitting to individual experimental subjects. In short, the final model would not be biased towards the training habits of the particular subjects included in the training set.

######## Analysis setup ######################################################
set.seed(294); require(caret); source("./helper_functions.R")
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
######## Data Preparation ####################################################
#load data
trainData = read.csv("../data/pml-training.csv", stringsAsFactors=F)
trainData$classe <- factor(trainData$classe)

# k-fold cross validation: randomly split trainData into 3 homogenous subsets
dataFoldSegments = createFolds(y = trainData$classe, k=3)

# subset data according to folds.
# This is a bit tedious, but conforms to the input format expected by the helper functions
targetFolds = list()
targetFolds$train = trainData[dataFoldSegments$Fold1,]
targetFolds$validation = trainData[dataFoldSegments$Fold2,]
targetFolds$test = trainData[dataFoldSegments$Fold3,]
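
As a quick sanity check (not part of the original script), one can confirm that createFolds produced three roughly equal partitions with similar class proportions:

# optional sanity check: folds should be roughly equal in size with
# similar class proportions
sapply(dataFoldSegments, length)                        # rows per fold
round(prop.table(table(targetFolds$train$classe)), 3)   # class balance in fold 1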

Feature selection and Model training

The random forest method was chosen as it performed best in preliminary testing, and it was also the approach used by the group from which this data was obtained (http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201). The model was trained on each data subset in turn in order to estimate the out-of-sample error one could expect when the model is applied to new input data in the future.

The raw data from the accelerometer, gyroscope and magnetometer were used as the input feature set with which to train the model. This was because the information embedded within summary data features, such as acceleration or pitch, was already contained within the raw data from which those summary features were derived.

######## feature selection ###################################################
# a) get column indices for desired features
rawSensorFeatures = grep("^accel|^gyros|^magn", names(trainData), value=T)
targetFeatures = rawSensorFeatures
# b) remove unwanted features from the data set
targetFolds <- subsetFeatures(folds = targetFolds, features = targetFeatures)

######## Model Cross validation - Training ###################################
# Cycle 1: train = train, validation = validation, test = test
rfModel1 = train(classe ~ ., data=targetFolds$train, method="rf")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# Cycle 2: train = validation, validation = test, test = train
rfModel2 = train(classe ~ ., data=targetFolds$validation, method="rf")
# Cycle 3: train = test, validation = train, test = validation
rfModel3 = train(classe ~ ., data=targetFolds$test, method="rf")
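
Note that train() with method="rf" tunes the forest's mtry parameter internally using caret's default bootstrap resampling. The selected value and the corresponding resampling accuracy can be inspected, for example, with:

# inspect the tuning results of the first model
print(rfModel1)      # resampling summary and accuracy per candidate mtry
rfModel1$bestTune    # mtry value selected for the final model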

Results

Each model was first tested on the data used to train it, as a check on the usefulness of the approach; poor performance on the training data would indicate that the wrong type of model had been fitted. Each model was then tested on the validation and test data folds in order to estimate out-of-sample accuracy. The final estimates were simply an aggregate of all six out-of-sample tests from the three training cycles.

######## Model Cross validation - Testing ####################################
require(caret)
# Cycle 1: model testing
rfTest1a = testModel(targetFolds$train, rfModel1)
rfTest1b = testModel(targetFolds$validation, rfModel1)
rfTest1c = testModel(targetFolds$test, rfModel1)
# Cycle 2: model testing
rfTest2a = testModel(targetFolds$validation, rfModel2)
rfTest2b = testModel(targetFolds$test, rfModel2)
rfTest2c = testModel(targetFolds$train, rfModel2)
# Cycle 3: model testing
rfTest3a = testModel(targetFolds$test, rfModel3)
rfTest3b = testModel(targetFolds$train, rfModel3)
rfTest3c = testModel(targetFolds$validation, rfModel3)

######## Summary of cross validation results #################################
accuracyTest = data.frame(rbind(
  cycle1 = c( rfTest1a$overall[1], rfTest1b$overall[1], rfTest1c$overall[1] ),
  cycle2 = c( rfTest2a$overall[1], rfTest2b$overall[1], rfTest2c$overall[1] ),
  cycle3 = c( rfTest3a$overall[1], rfTest3b$overall[1], rfTest3c$overall[1] )
  ), stringsAsFactors=F)
names(accuracyTest) <- c("train","validation", "test")

accVals = c(accuracyTest$validation, accuracyTest$test)
accMean = round(mean(accVals) * 100, 2)
accSd = round(sd(accVals) * 100, 2)

errorVals = 1 - accVals
errorMean = round(mean(errorVals) *100, 2)
errorSd = round(sd(errorVals) * 100, 2)

accuracyTest  
##        train validation      test
## cycle1     1  0.9651483 0.9691179
## cycle2     1  0.9691179 0.9660499
## cycle3     1  0.9704848 0.9665240

From this analysis, the estimated out-of-sample accuracy measured across the 6 “non-training” data folds tested over the cross-validation cycles was 96.77 % ± 0.21 %, which means that the estimated out-of-sample error was only 3.23 % ± 0.21 %. While it is important to note that this out-of-sample error estimate may be optimistic, the model constructed in this study is quite a good predictor of the quality of the exercise of interest from raw sensor data.
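
For reference, the figures quoted above correspond to the summary values computed earlier and could be printed, for example, as follows:

# report the cross-validated accuracy and error estimates computed above
cat(sprintf("Estimated out-of-sample accuracy: %.2f %% +/- %.2f %%\n", accMean, accSd))
cat(sprintf("Estimated out-of-sample error:    %.2f %% +/- %.2f %%\n", errorMean, errorSd))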