Gradient Boosting Machine (GBM) Tutorial

A fundamental understanding of GBM models is assumed; please seek out other resources to build that understanding, and use this tutorial as a computational example.

Read data

More information can be found in the Data Import and Export tutorial notebook.

In [1]:
train = read.csv("train.csv")
test = read.csv("test.csv")
head(train)
Out[1]:
  RowID  CalendarYear  ModelYear  Make  Model  Cat1  Cat2  Cat3  Cat4  Cat5  Var5         Var6        Var7        Var8         NVCat  NVVar1      NVVar2      NVVar3      NVVar4      Response
1 418079 2005          2004       AU    AU.14  B     C     A     A     A      0.02168622  -0.6852558  -0.5912954  -0.2584976   F      -0.2315299  -0.2661168   4.209404   -0.2514189  0
2 232625 2006          2003       R     R.30   B     C     B     A     A     -0.3529838   -0.4232408  -0.6281582   0.05436843  O      -0.2315299  -0.2661168  -0.2723372  -0.2514189  0
3 379029 2006          2006       AU    AU.14  B     A     A     A     A      0.06926337  -0.6852558  -0.5912954  -0.163872    M      -0.2315299  -0.2661168  -0.2723372  -0.2514189  1
4 181458 2007          2000       BU    BU.38  F     C     A     C     A      0.0871048   -0.1717531  -0.9722114   0.2064257   O      -0.2315299  -0.2661168  -0.2723372  -0.2514189  0
5 192434 2005          1999       BU    BU.38  F     A     A     C     A      0.0871048   -0.2723481  -0.9722114   0.290987    M      -0.2315299  -0.2661168  -0.2723372  -0.2514189  1
6 443321 2007          2005       AU    AU.11  B     C     B     A     A      0.1406291   -0.1635651  -0.5421449   0.5499219   O      -0.2315299  -0.2661168  -0.2723372  -0.2514189  0
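
As an optional sanity check (added here, not part of the original flow), base R's dim and table can confirm the sizes of the two data sets and the distribution of the response:

dim(train)               # rows and columns in the training set
dim(test)                # rows and columns in the testing set
table(train$Response)    # counts of the binary response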

Set Random Seed

Set a random seed to allow reproducibility of these results.

In [2]:
randomSeed = 1337
set.seed(randomSeed)

Load Package

R has a wide variety of open source packages for machine learning. This tutorial will use the standard 'gbm' package to make a GBM.
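
If the package is not already installed, it can first be installed from CRAN:

install.packages("gbm")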

In [3]:
library(gbm)
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.1

Error Metric

R-Bloggers has a nice article explaining the Log Loss function.
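
The LogLossBinary function below implements this metric: for actual responses y_i in {0, 1} and predicted probabilities p_i,

    LogLoss = -(1/N) * sum_{i=1}^{N} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

with predictions clamped away from 0 and 1 so the logarithm stays finite.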

In [4]:
LogLossBinary = function(actual, predicted, eps = 1e-15) {
    # Clamp predictions away from 0 and 1 to avoid taking log(0)
    predicted = pmin(pmax(predicted, eps), 1 - eps)
    - (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))) / length(actual)
}
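
As a quick check on toy values (added for illustration), good predictions give a small loss while confident wrong predictions are penalized heavily:

LogLossBinary(c(1, 0), c(0.9, 0.1))   # ~0.105: confident and correct
LogLossBinary(c(1, 0), c(0.1, 0.9))   # ~2.303: confident and wrong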

Fit a simple GBM

The simple GBM below is fit using only 4 predictors. See the gbm package's references for more information on choosing appropriate hyperparameters and on more sophisticated methods. These examples do not necessarily reflect best practices and are intended for illustration only; all analyses are also done without any preprocessing of the data.

In [5]:
gbmModel = gbm(formula = Response ~ Var1 + Var2 + Cat5 + Cat6,
               distribution = "bernoulli",
               data = train,
               n.trees = 2500,
               shrinkage = .01,
               n.minobsinnode = 20)

Find what the model predicts on the training data set, using only the first 1,500 of the 2,500 fitted trees.

In [6]:
gbmTrainPredictions = predict(object = gbmModel,
                              newdata = train,
                              n.trees = 1500,
                              type = "response")
In [7]:
head(data.frame("Actual" = train$Response, 
                "PredictedProbability" = gbmTrainPredictions))
Out[7]:
  Actual  PredictedProbability
1      0             0.2827823
2      0             0.2780227
3      1              0.282297
4      0             0.2780227
5      1             0.2780227
6      0              0.238471

Calculate Log Loss on the training set. We can calculate this value because we know the response for the training set.

In [8]:
LogLossBinary(train$Response, gbmTrainPredictions)
Out[8]:
0.590021043705372
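
For context (an added baseline, not part of the original tutorial; baseRate is just an illustrative name), a naive model that always predicts the overall response rate gives a Log Loss that any useful model should beat:

baseRate = mean(train$Response)                            # overall proportion of Response == 1
LogLossBinary(train$Response, rep(baseRate, nrow(train)))  # baseline Log Loss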

Fit a GBM and test with a holdout subset

Create a subset of the training data to hold out and test the model's predictive power.

In [9]:
dataSubsetProportion = .2
randomRows = sample(1:nrow(train), floor(nrow(train) * dataSubsetProportion))
trainingHoldoutSet = train[randomRows, ]
trainingNonHoldoutSet = train[!(1:nrow(train) %in% randomRows), ]
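
A quick check (added for illustration) that the holdout and non-holdout sets partition the training data with no overlap:

nrow(trainingHoldoutSet) + nrow(trainingNonHoldoutSet) == nrow(train)   # TRUE: sizes add up
length(intersect(rownames(trainingHoldoutSet),
                 rownames(trainingNonHoldoutSet))) == 0                 # TRUE: no shared rows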

Drop ID columns in the data sets used for modeling. Also drop the Model covariate for this illustration.

In [10]:
trainingHoldoutSet$RowID = NULL
trainingNonHoldoutSet$RowID = NULL
trainingHoldoutSet$Model = NULL
trainingNonHoldoutSet$Model = NULL

Fit a GBM on 5 predictors using the training data set that is not being held out.

In [11]:
gbmForTesting = gbm(formula = Response ~ Var1 + Var2 + Var3 + NVCat + NVVar1,
                    distribution = "bernoulli",
                    data = trainingNonHoldoutSet,
                    n.trees = 1500,
                    shrinkage = .01,
                    n.minobsinnode = 50)
In [12]:
summary(gbmForTesting, plotit = FALSE)
Out[12]:
          var    rel.inf
NVCat   NVCat   67.72105
Var2     Var2   22.97079
Var1     Var1   4.897743
Var3     Var3   3.416278
NVVar1 NVVar1  0.9941413
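
For a closer look at the most influential predictor, gbm's plot method can draw a partial dependence plot (a sketch; plot.gbm's i.var argument selects the predictor and type = "response" puts the result on the probability scale):

plot(gbmForTesting, i.var = "NVCat", type = "response")   # partial dependence of NVCat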

Make predictions on the holdout and non-holdout data sets, using only the first 100 of the model's 1,500 fitted trees.

In [13]:
gbmHoldoutPredictions = predict(object = gbmForTesting,
                                newdata = trainingHoldoutSet,
                                n.trees = 100,
                                type = "response")

gbmNonHoldoutPredictions = predict(object = gbmForTesting,
                                   newdata = trainingNonHoldoutSet,
                                   n.trees = 100,
                                   type = "response")

Calculate Log Loss on the holdout and non-holdout sets.

In [14]:
print(paste(LogLossBinary(train$Response[randomRows], gbmHoldoutPredictions), 
            "Holdout Log Loss"))
print(paste(LogLossBinary(train$Response[!(1:nrow(train) %in% randomRows)], gbmNonHoldoutPredictions), 
            "Non-Holdout Log Loss"))
[1] "0.592384640817789 Holdout Log Loss"
[1] "0.586523687486559 Non-Holdout Log Loss"

Fit a GBM with cross validation

Let's fit a GBM with 5-fold cross validation and use the cross validation procedure to find the best number of trees for prediction. Because the cross validation folds are assigned randomly and can be fit on multiple cores, a random seed must be set before calling gbm, and n.cores = 1 is used here to keep the results reproducible.

In [15]:
gbmWithCrossValidation = gbm(formula = Response ~ .,
                             distribution = "bernoulli",
                             data = trainingNonHoldoutSet,
                             n.trees = 2000,
                             shrinkage = .1,
                             n.minobsinnode = 200, 
                             cv.folds = 5,
                             n.cores = 1)

Find the best number of trees for prediction. In the plot produced by gbm.perf, the black line is the training Bernoulli deviance and the green line is the cross-validated (testing) Bernoulli deviance. The number of trees selected for prediction, indicated by the vertical blue line, is the iteration that minimizes the testing error on the cross-validation folds.

In [16]:
bestTreeForPrediction = gbm.perf(gbmWithCrossValidation)
Using cv method...
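
With the cv method, gbm.perf returns the iteration with the lowest cross-validated error; the same index can be recovered directly from the cv.error vector stored on the fitted object (which gbm populates when cv.folds > 1):

which.min(gbmWithCrossValidation$cv.error)   # iteration minimizing cross-validated deviance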

Make predictions on the holdout and non-holdout data sets, as before, and view their Log Loss values.

In [17]:
gbmHoldoutPredictions = predict(object = gbmWithCrossValidation,
                                newdata = trainingHoldoutSet,
                                n.trees = bestTreeForPrediction,
                                type = "response")

gbmNonHoldoutPredictions = predict(object = gbmWithCrossValidation,
                                   newdata = trainingNonHoldoutSet,
                                   n.trees = bestTreeForPrediction,
                                   type = "response")

Calculate Log Loss for holdout and non-holdout sets.

In [18]:
print(paste(LogLossBinary(train$Response[randomRows], gbmHoldoutPredictions), 
            "Holdout Log Loss"))
print(paste(LogLossBinary(train$Response[!(1:nrow(train) %in% randomRows)], gbmNonHoldoutPredictions), 
            "Non-Holdout Log Loss"))
[1] "0.584214215280609 Holdout Log Loss"
[1] "0.578276452403044 Non-Holdout Log Loss"

Baseline GBM Benchmark

Fit one more GBM model on 6 predictors and create predictions on the testing data set. These predictions are scored as the 'GBM Benchmark' on the competition leaderboard.

In [19]:
gbmForTesting = gbm(formula = Response ~ Var1 + Var2 + Var3 + NVCat + NVVar1 + NVVar2,
                    distribution = "bernoulli",
                    data = trainingNonHoldoutSet,
                    n.trees = 1500,
                    shrinkage = .1,
                    n.minobsinnode = 50)

gbmTestPredictions = predict(object = gbmForTesting,
                             newdata = test,
                             n.trees = 1000,
                             type = "response")
In [20]:
outputDataSet = data.frame("RowID" = test$RowID,
                           "ProbabilityOfResponse" = gbmTestPredictions)
In [21]:
write.csv(outputDataSet, "gbmBenchmarkSubmission.csv", row.names = FALSE)
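
Before submitting (an added check, not part of the original flow), confirm that the output contains one prediction per test row and the expected columns:

stopifnot(nrow(outputDataSet) == nrow(test))   # one prediction per test observation
head(outputDataSet)                            # peek at the submission format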