A fundamental understanding of GBM models is assumed; please seek out other resources to build that understanding, and use this tutorial as a computational example.
Read in the training and testing data sets. More information on loading data can be found in the Data Import and Export tutorial notebook.
train = read.csv("train.csv")
test = read.csv("test.csv")
head(train)
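Before modeling, it is worth checking the size of the data and the balance of the response. A quick sketch, assuming the training set contains the Response column used throughout this tutorial:
dim(train)             # rows and columns in the training set
table(train$Response)  # counts of the binary response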
Set a random seed so that these results are reproducible.
randomSeed = 1337
set.seed(randomSeed)
R has a wide variety of open source packages for machine learning. This tutorial will use the standard 'gbm' package to fit a GBM.
library(gbm)
Define a Log Loss function; this is the evaluation metric used to assess the models throughout this tutorial.
LogLossBinary = function(actual, predicted, eps = 1e-15) {
  # Clamp predictions away from exactly 0 and 1 so the logarithms stay finite
  predicted = pmin(pmax(predicted, eps), 1 - eps)
  - (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))) / length(actual)
}
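As a quick sanity check of the function, here is a toy example with made-up values (not from the data set). The eps clamping matters because a confident wrong prediction of exactly 0 or 1 would otherwise produce an infinite loss:
toyActual = c(1, 0, 1, 0)
toyPredicted = c(0.9, 0.1, 0.6, 0.4)
LogLossBinary(toyActual, toyPredicted)      # small loss: predictions agree with the actuals
LogLossBinary(toyActual, 1 - toyPredicted)  # much larger loss: predictions are inverted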
The simple GBM below is fit using only 4 predictors. See the gbm package's documentation and references for more information on choosing appropriate hyperparameters and more sophisticated methods. These examples do not necessarily reflect best practices and are meant for illustration only. All analyses are also done without any preprocessing of the data.
gbmModel = gbm(formula = Response ~ Var1 + Var2 + Cat5 + Cat6,
               distribution = "bernoulli",
               data = train,
               n.trees = 2500,
               shrinkage = .01,
               n.minobsinnode = 20)
Generate the model's predictions on the training data set. Note that fewer trees can be used for prediction than were fit; here only the first 1500 of the 2500 trees are used.
gbmTrainPredictions = predict(object = gbmModel,
                              newdata = train,
                              n.trees = 1500,
                              type = "response")

head(data.frame("Actual" = train$Response,
                "PredictedProbability" = gbmTrainPredictions))
Calculate the Log Loss on the training set. We can compute this value because we know the response for the training set; keep in mind that error measured on the same data the model was fit to is optimistic.
LogLossBinary(train$Response, gbmTrainPredictions)
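For context, it can help to compare this against a naive baseline that predicts the overall training response rate for every row; the GBM's training Log Loss should be noticeably lower. A minimal sketch:
baselineRate = mean(train$Response)  # overall proportion of positive responses
LogLossBinary(train$Response, rep(baselineRate, nrow(train)))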
Create a subset of the training data to hold out and test the model's predictive power.
dataSubsetProportion = .2
randomRows = sample(1:nrow(train), floor(nrow(train) * dataSubsetProportion))
trainingHoldoutSet = train[randomRows, ]
trainingNonHoldoutSet = train[!(1:nrow(train) %in% randomRows), ]
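A quick check that the two subsets really partition the training data:
nrow(trainingHoldoutSet)  # roughly 20% of the training rows
nrow(trainingHoldoutSet) + nrow(trainingNonHoldoutSet) == nrow(train)  # should be TRUE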
Drop ID columns in the data sets used for modeling. Also drop the Model covariate for this illustration.
trainingHoldoutSet$RowID = NULL
trainingNonHoldoutSet$RowID = NULL
trainingHoldoutSet$Model = NULL
trainingNonHoldoutSet$Model = NULL
Fit a GBM on 5 predictors using the training data set that is not being held out.
gbmForTesting = gbm(formula = Response ~ Var1 + Var2 + Var3 + NVCat + NVVar1,
                    distribution = "bernoulli",
                    data = trainingNonHoldoutSet,
                    n.trees = 1500,
                    shrinkage = .01,
                    n.minobsinnode = 50)
summary(gbmForTesting, plot = FALSE)
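The relative influence summary above shows which predictors matter most; to see how a single predictor relates to the prediction, the gbm package also provides partial dependence plots through plot.gbm. A quick sketch for Var1 (any predictor in the model could be substituted):
plot(gbmForTesting, i.var = "Var1", n.trees = 1500)  # marginal effect of Var1 on the link scale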
Make predictions with the model on the holdout and non-holdout data sets, using only the first 100 of the 1500 fitted trees.
gbmHoldoutPredictions = predict(object = gbmForTesting,
                                newdata = trainingHoldoutSet,
                                n.trees = 100,
                                type = "response")

gbmNonHoldoutPredictions = predict(object = gbmForTesting,
                                   newdata = trainingNonHoldoutSet,
                                   n.trees = 100,
                                   type = "response")
Calculate Log Loss on the holdout and non-holdout sets.
print(paste(LogLossBinary(train$Response[randomRows], gbmHoldoutPredictions),
            "Holdout Log Loss"))
print(paste(LogLossBinary(train$Response[!(1:nrow(train) %in% randomRows)], gbmNonHoldoutPredictions),
            "Non-Holdout Log Loss"))
Let's fit a GBM with 5-fold cross validation and use the cross validation procedure to find the best number of trees for prediction. Because gbm can fit the cross validation folds in parallel across cores, n.cores is set to 1 so that the fold fitting remains reproducible with the random seed set earlier.
gbmWithCrossValidation = gbm(formula = Response ~ .,
                             distribution = "bernoulli",
                             data = trainingNonHoldoutSet,
                             n.trees = 2000,
                             shrinkage = .1,
                             n.minobsinnode = 200,
                             cv.folds = 5,
                             n.cores = 1)
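When cv.folds is greater than 1, the fitted object stores the cross-validated deviance after each tree in its cv.error component; gbm.perf, used next, simply selects the minimum. It can be inspected directly:
head(gbmWithCrossValidation$cv.error)       # CV deviance after 1, 2, ... trees
which.min(gbmWithCrossValidation$cv.error)  # tree count minimizing CV deviance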
Find the best number of trees for prediction. In the plot produced by gbm.perf, the black line is the training Bernoulli deviance and the green line is the cross-validated testing Bernoulli deviance. The number of trees selected for prediction, indicated by the vertical blue line, is the one that minimizes the testing error on the cross-validation folds.
bestTreeForPrediction = gbm.perf(gbmWithCrossValidation)
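gbm.perf chooses the cross-validation estimate automatically when cv.folds was supplied, but the method can also be set explicitly. An out-of-bag estimate is available too, because gbm's default bag.fraction is 0.5; per the gbm documentation it tends to underestimate the optimal number of trees, so the two can disagree:
gbm.perf(gbmWithCrossValidation, method = "cv", plot.it = FALSE)   # same value as above
gbm.perf(gbmWithCrossValidation, method = "OOB", plot.it = FALSE)  # usually a smaller, conservative estimate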
Make predictions on the holdout and non-holdout data sets, as before, and view their Log Loss values.
gbmHoldoutPredictions = predict(object = gbmWithCrossValidation,
                                newdata = trainingHoldoutSet,
                                n.trees = bestTreeForPrediction,
                                type = "response")

gbmNonHoldoutPredictions = predict(object = gbmWithCrossValidation,
                                   newdata = trainingNonHoldoutSet,
                                   n.trees = bestTreeForPrediction,
                                   type = "response")
Calculate Log Loss for holdout and non-holdout sets.
print(paste(LogLossBinary(train$Response[randomRows], gbmHoldoutPredictions),
            "Holdout Log Loss"))
print(paste(LogLossBinary(train$Response[!(1:nrow(train) %in% randomRows)], gbmNonHoldoutPredictions),
            "Non-Holdout Log Loss"))
Fit one more GBM model on 6 predictors and create predictions on the testing data set. These predictions are scored as the 'GBM Benchmark' on the competition leaderboard.
gbmForTesting = gbm(formula = Response ~ Var1 + Var2 + Var3 + NVCat + NVVar1 + NVVar2,
                    distribution = "bernoulli",
                    data = trainingNonHoldoutSet,
                    n.trees = 1500,
                    shrinkage = .1,
                    n.minobsinnode = 50)
# Predict with the 6-predictor model just fit above
gbmTestPredictions = predict(object = gbmForTesting,
                             newdata = test,
                             n.trees = 1000,
                             type = "response")
outputDataSet = data.frame("RowID" = test$RowID,
                           "ProbabilityOfResponse" = gbmTestPredictions)

write.csv(outputDataSet, "gbmBenchmarkSubmission.csv", row.names = FALSE)
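As a final sanity check, the submission file can be read back in to confirm it has one probability per test row and the expected column names:
submissionCheck = read.csv("gbmBenchmarkSubmission.csv")
head(submissionCheck)
nrow(submissionCheck) == nrow(test)  # should be TRUE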