ESEM-2011
This page provides supplementary material for our research paper submitted to ESEM 2011. We provide access to both the data and the scripts used to perform the experiments. Our goal is to foster transparency and to make our experiments easy to replicate and reproduce.
The experiments were conducted using the R programming language, so the data and scripts are in the corresponding formats. We assume that readers have a basic understanding of R in order to use the resources shared below. In the future, we will make the data available in more widely accessible formats such as CSV.
Your feedback to improve this page is welcome.
— Rahul Premraj and Kim Herzig
Script to run the experiments:
Load all required R libraries.
> rm(list = ls(all = TRUE))
> library(caret)
> library(gdata)
> library(plyr)
> library(reshape)
> library(R.utils)
We demonstrate our experiments on code data from JRuby 1.0 using the stratified repeated holdout setup. The code can be easily adapted/re-used to work with data from other projects and other experimental setups used in the paper.
> load("jruby_1.0.RData")
> ls()
[1] "all.data" "code.data" "network.data" "project" "version"
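Several utility functions are needed to execute the experiments. The first one splits a data set into inputs (dataX) and outputs (dataY) and assigns both to the global environment. Duplicate files are dropped, and the last column is converted into a two-level factor: "One" for values greater than zero and "Zero" otherwise.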
> prepData <- function(data) {
+ data <- data[!duplicated(data$file), ]
+ rownames(data) <- data$file
+ data$file <- NULL
+ dataX <<- data[, 1:(ncol(data) - 1)]
+ dataY <- data[, ncol(data)]
+ dataY <<- factor(ifelse(dataY > 0, "One", "Zero"))
+ }
> prepData(code.data)
> ls()
[1] "all.data" "code.data" "dataX" "dataY" "network.data" "prepData" "project"
[8] "version"
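To see what prepData() does without loading the JRuby data, the following minimal sketch applies the same steps to a small made-up data frame. The data frame `toy`, its column names, and the helper `splitData()` are hypothetical. Unlike prepData(), the helper returns its results in a list instead of assigning global variables.

```r
# Hypothetical toy data: one row per file, last column is the outcome count.
toy <- data.frame(
  file    = c("A.java", "B.java", "B.java", "C.java"),
  loc     = c(120, 45, 45, 300),
  calls   = c(10, 2, 2, 25),
  outcome = c(0, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Same steps as prepData(), but returning a list rather than using <<-.
splitData <- function(data) {
  data <- data[!duplicated(data$file), ]          # keep the first row per file
  rownames(data) <- data$file
  data$file <- NULL
  x <- data[, 1:(ncol(data) - 1), drop = FALSE]   # all but the last column
  y <- data[, ncol(data)]                         # last column
  list(x = x, y = factor(ifelse(y > 0, "One", "Zero")))
}

toySplit <- splitData(toy)
print(toySplit$x)
print(table(toySplit$y))
```

Returning a list keeps the function free of side effects, which may be convenient when the same steps are applied to several data sets in one session.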
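This generic function trains a prediction model with the given method. It relies on the variables trainX, trainY, tuneLengthValue, and train.control, which are defined in the steps below, and it selects tuning parameters by Kappa.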
> trainModel <- function(method) {
+ model <- train(trainX, trainY, method = method, tuneLength = tuneLengthValue, trControl = train.control,
+ metric = "Kappa")
+ return(model)
+ }
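trainModel() asks train() to select tuning parameters by Kappa rather than by plain accuracy. As a reminder of what that statistic measures, here is a minimal base R sketch of Cohen's kappa for a 2x2 confusion matrix. The counts are invented for illustration and the helper name `cohenKappa()` is hypothetical; this is not the code caret uses internally.

```r
# Cohen's kappa: agreement between predictions and observations,
# corrected for the agreement expected by chance.
cohenKappa <- function(conf) {
  n <- sum(conf)
  observed <- sum(diag(conf)) / n                        # observed agreement
  expected <- sum(rowSums(conf) * colSums(conf)) / n^2   # chance agreement
  (observed - expected) / (1 - expected)
}

# Invented counts; rows are predictions, columns are observations.
conf <- matrix(c(30, 10, 20, 40), nrow = 2,
               dimnames = list(Predicted = c("One", "Zero"),
                               Observed  = c("One", "Zero")))

kappaValue <- cohenKappa(conf)
print(kappaValue)  # approximately 0.4
```

A value of 0 means agreement no better than chance, and 1 means perfect agreement.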
These functions process the results from the prediction models.
> getPrecision <- function(x) as.numeric(unname(x$byClass[3]))
> getRecall <- function(x) as.numeric(unname(x$byClass[1]))
> getFmeasure <- function(x, y) 2 * ((x * y)/(x + y))
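The helpers above read precision and recall by position from the byClass element of a confusionMatrix() result. For readers who want to check the definitions independently of caret, the following base R sketch computes the same three measures from a plain confusion table. The counts are illustrative only: the number of true negatives is arbitrary, and the other three were picked so that the output matches the knn3 row of the results table at the end of this page.

```r
# Illustrative counts (not taken from the data): TP = 20, FP = 13, FN = 23, TN = 100.
pred <- factor(rep(c("One", "One", "Zero", "Zero"), times = c(20, 13, 23, 100)),
               levels = c("One", "Zero"))
obs  <- factor(rep(c("One", "Zero", "One", "Zero"), times = c(20, 13, 23, 100)),
               levels = c("One", "Zero"))
tab <- table(Predicted = pred, Observed = obs)

tp <- tab["One", "One"]    # predicted "One", observed "One"
fp <- tab["One", "Zero"]   # predicted "One", observed "Zero"
fn <- tab["Zero", "One"]   # predicted "Zero", observed "One"

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
fMeasure  <- 2 * (precision * recall) / (precision + recall)

print(round(c(Precision = precision, Recall = recall, F.Measure = fMeasure), 7))
```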
The following code snippets run the experiments. For the sake of simplicity and illustration, we only present the code for one stratified random split. To use multiple cores on our machines, we ran a slightly modified version of this code with the (architecture-specific) R packages foreach and doMC, which allowed us to run the experiments on 300 samples of the data in parallel.
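The parallel version of the script is not reproduced on this page. As a rough sketch of how repeated runs can be spread over several cores, the following uses mclapply() from the parallel package that ships with R. This is a swapped-in alternative, not the foreach/doMC setup used for the paper. The function `runOneSplit()` is a hypothetical stand-in for Steps 1 to 6, and the data are synthetic.

```r
library(parallel)

# Synthetic two-class outcome standing in for dataY.
set.seed(1)
toyY <- factor(sample(c("One", "Zero"), 150, replace = TRUE, prob = c(0.3, 0.7)))

# Hypothetical stand-in for one run of Steps 1 to 6: draw a random 2/3 split
# and report the share of "One" in the training part.
runOneSplit <- function(i) {
  set.seed(i)  # one seed per run keeps each run reproducible
  inTrain <- sample(length(toyY), size = round(2/3 * length(toyY)))
  data.frame(run = i, shareOne = mean(toyY[inTrain] == "One"))
}

# Forking is not available on Windows, so fall back to a single core there.
nCores <- if (.Platform$OS.type == "windows") 1L else 2L
runs <- mclapply(1:300, runOneSplit, mc.cores = nCores)

# Combine the per-run rows into one data frame.
results <- do.call(rbind, runs)
print(summary(results$shareOne))
```

Setting the seed inside each run, rather than once before the loop, means a single run can be repeated later without re-running the others.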
Each of the following snippets is annotated with the corresponding step from Section 5 of the paper. The descriptions of the steps have been copied verbatim from the paper.
Step 1 – createDataPartition(): Generate 300 training and test sets from the data using stratified sampling (note that the following steps were run on each pair of training/test sets).
> inTrain <- createDataPartition(dataY, times = 1, p = 2/3)
> trainX <- dataX[inTrain[[1]], ]
> trainY <- dataY[inTrain[[1]]]
> testX <- dataX[-inTrain[[1]], ]
> testY <- dataY[-inTrain[[1]]]
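createDataPartition() draws the training rows separately within each class, so that both parts keep roughly the original class balance. The base R sketch below illustrates that idea on synthetic labels. `stratifiedSplit()` is a hypothetical helper and a simplification, not caret's implementation.

```r
# Sample a share p of the indices within each class of y.
stratifiedSplit <- function(y, p = 2/3) {
  idxByClass <- split(seq_along(y), y)
  picked <- lapply(idxByClass, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Synthetic, imbalanced outcome: 30 "One" and 90 "Zero".
set.seed(42)
toyY <- factor(rep(c("One", "Zero"), times = c(30, 90)))

inTrain <- stratifiedSplit(toyY)
print(table(toyY[inTrain]))    # training part keeps the 1:3 ratio
print(table(toyY[-inTrain]))   # test part keeps the 1:3 ratio
```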
Step 2 – nearZeroVar(): Remove numerical input columns from the training set that have near zero variance (essentially a single value) to avoid any undue influence on the models. The same columns were then removed from the test set.
> train.nzv <- nearZeroVar(trainX)
> if (length(train.nzv) > 0) {
+ trainX <- trainX[, -train.nzv]
+ testX <- testX[, -train.nzv]
+ }
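As an illustration of what "near zero variance" means in Step 2, the sketch below flags columns in base R using two quantities: the ratio between the counts of the most frequent and second most frequent values, and the percentage of distinct values. The cutoffs 95/5 and 10 mirror what we understand to be the defaults of nearZeroVar(), but `flagNearZeroVar()` is a hypothetical simplification and the data are synthetic.

```r
flagNearZeroVar <- function(x, freqCut = 95/5, uniqueCut = 10) {
  flags <- vapply(x, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)            # constant column
    freqRatio <- counts[[1]] / counts[[2]]           # dominance of the top value
    percentUnique <- 100 * length(counts) / length(col)
    freqRatio > freqCut && percentUnique <= uniqueCut
  }, logical(1))
  which(flags)   # named vector of flagged column positions
}

# Synthetic inputs: one informative, one almost constant, one constant column.
set.seed(7)
toyX <- data.frame(
  varied         = rnorm(100),
  almostConstant = c(rep(0, 98), 1, 2),
  constant       = rep(5, 100)
)

nzv <- flagNearZeroVar(toyX)
print(names(nzv))
toyXReduced <- toyX[, -nzv, drop = FALSE]
print(colnames(toyXReduced))
```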
Step 3 – findCorrelation(): Remove input columns from the training set that correlate with other columns with a correlation coefficient above .90 to avoid any undue influence on the models. The same columns were then removed from the test set.
> trainX.corr <- cor(trainX)
> trainX.highcorr <- findCorrelation(trainX.corr, 0.9)
> if (length(trainX.highcorr) > 0) {
+ trainX <- trainX[, -trainX.highcorr]
+ testX <- testX[, -trainX.highcorr]
+ }
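To make Step 3 concrete, here is a simplified base R sketch of correlation-based filtering: whenever two columns have an absolute correlation above the cutoff, the one with the larger mean absolute correlation to the remaining columns is dropped. This is a greedy approximation written for illustration. `dropHighlyCorrelated()` is a hypothetical helper, the data are synthetic, and findCorrelation() may select columns differently.

```r
dropHighlyCorrelated <- function(x, cutoff = 0.9) {
  corr <- abs(cor(x))
  diag(corr) <- 0                       # ignore self-correlation
  removed <- character(0)
  repeat {
    keep <- setdiff(colnames(corr), removed)
    sub <- corr[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub) <= cutoff) break
    # Most strongly correlated pair among the remaining columns.
    pair <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    candidates <- keep[pair]
    # Drop the member of the pair that is more correlated with everything else.
    meanCorr <- rowMeans(sub[candidates, , drop = FALSE])
    removed <- c(removed, candidates[which.max(meanCorr)])
  }
  removed
}

# Synthetic inputs: x2 is almost a copy of x1, x3 is independent.
set.seed(3)
x1 <- rnorm(200)
toyX <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

dropped <- dropHighlyCorrelated(toyX)
reduced <- toyX[, setdiff(colnames(toyX), dropped), drop = FALSE]
print(dropped)
print(colnames(reduced))
```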
Step 4 – preProcess() and predict(): Center and rescale the training data to minimize the effect of large values on the prediction model. We additionally experimented with performing principal component analysis on our data (similar to Z & N), but this often led to inferior results. Hence we restricted ourselves to normalizing our data by centering and rescaling it. The corresponding test data was centered and rescaled with the parameters estimated on the training data, using the predict() function.
> xTrans <- preProcess(trainX, method = c("center", "scale"))
> trainX <- predict(xTrans, trainX)
> testX <- predict(xTrans, testX)
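The important detail in Step 4 is that the centering and scaling parameters are estimated on the training data only and then reused for the test data. The base R sketch below shows the same idea with scale() on synthetic data. It illustrates the principle and is not a substitute for preProcess().

```r
# Synthetic training and test inputs on very different scales.
set.seed(11)
toyTrain <- data.frame(a = rnorm(60, mean = 50, sd = 10), b = rexp(60, rate = 0.1))
toyTest  <- data.frame(a = rnorm(30, mean = 50, sd = 10), b = rexp(30, rate = 0.1))

# Estimate the parameters on the training data only.
trainMeans <- colMeans(toyTrain)
trainSds   <- sapply(toyTrain, sd)

# Apply the same parameters to both parts.
trainScaled <- scale(toyTrain, center = trainMeans, scale = trainSds)
testScaled  <- scale(toyTest,  center = trainMeans, scale = trainSds)

print(round(colMeans(trainScaled), 10))       # zero by construction
print(round(apply(trainScaled, 2, sd), 10))   # one by construction
print(round(colMeans(testScaled), 2))         # close to, but not exactly, zero
```

Because the test data never influence the estimated means and standard deviations, the test set stays a fair stand-in for unseen data.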
Step 5 – train(): We used several prediction models for our experiments. These are listed in Table III. Each model offers one or more parameters that can be tuned to optimize performance. This is handled internally by the train() function once the number of candidate values to evaluate (tuneLength) is specified. We set this number to 5.
> train.control <- trainControl(number = 2)
> set.seed(2)
> tuneLengthValue <- 5
> svmRadialFit <- trainModel("svmRadial")
Fitting: sigma=2.523217, C=0.25
Fitting: sigma=2.523217, C=0.5
Fitting: sigma=2.523217, C=1
Fitting: sigma=2.523217, C=2
Fitting: sigma=2.523217, C=4
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> multinomFit <- trainModel("multinom")
Fitting: decay=0
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.071046
final value 186.988533
converged
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 172.820357
iter 20 value 170.667647
final value 170.667625
converged
Fitting: decay=1e-04
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.071455
final value 186.989001
converged
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 172.820790
iter 20 value 170.668299
final value 170.668277
converged
Fitting: decay=0.001
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.075143
final value 186.993210
converged
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 172.824685
iter 20 value 170.674161
final value 170.674139
converged
Fitting: decay=0.01
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.112222
final value 187.035170
converged
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 172.863571
iter 20 value 170.732468
final value 170.732448
converged
Fitting: decay=0.1
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.504346
final value 187.442268
converged
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 173.246182
iter 20 value 171.287476
final value 171.287467
converged
Aggregating results
Selecting tuning parameters
Fitting model on full training set
# weights: 12 (11 variable)
initial value 277.952019
iter 10 value 188.040943
final value 186.127838
converged
> rpartFit <- trainModel("rpart")
Fitting: maxdepth=9
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> knnFit <- trainModel("knn")
Fitting: k=5
Fitting: k=7
Fitting: k=9
Fitting: k=11
Fitting: k=13
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> treebagFit <- trainModel("treebag")
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> nbFit <- trainModel("nb")
Fitting: usekernel=TRUE
Fitting: usekernel=FALSE
Aggregating results
Selecting tuning parameters
Fitting model on full training set
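With tuneLength set to 5, the log above shows five candidate values for the cost parameter C of svmRadial and five for k of knn. Both follow simple sequences, which the base R lines below reproduce. This only mirrors the values printed in the log. How caret builds its tuning grids internally is not shown here, and the variable names `costValues` and `kValues` are hypothetical.

```r
tuneLengthValue <- 5

# Powers of two centred on 1, as printed for svmRadial.
costValues <- 2^(seq_len(tuneLengthValue) - 3)

# Odd neighbourhood sizes starting at 5, as printed for knn.
kValues <- seq(from = 5, by = 2, length.out = tuneLengthValue)

print(costValues)  # 0.25 0.50 1.00 2.00 4.00
print(kValues)     # 5 7 9 11 13
```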
Step 6 – extractPrediction(): Each trained model was evaluated against the test data. The evaluation measures that we computed include precision, recall, and F-measure.
> models <- list(knn3 = knnFit, multinom = multinomFit, rpart = rpartFit, svmRadial = svmRadialFit,
+ treebag = treebagFit, nb = nbFit)
> pred.values <- extractPrediction(models, testX, testY)
> pred.values <- subset(pred.values, dataType == "Test")
> pred.values.split <- split(pred.values, pred.values$object)
> n.row = length(pred.values.split)
> results <- NULL
> results <- dataFrame(colClasses = c(Model = "character", Precision = "double", Recall = "double",
+ F.Measure = "double"), nrow = n.row)
> for (j in 1:length(pred.values.split)) {
+ conf.matrix <- confusionMatrix(pred.values.split[[j]]$pred, pred.values.split[[j]]$obs,
+ positive = "One")
+ precision <- getPrecision(conf.matrix)
+ recall <- getRecall(conf.matrix)
+ f.measure <- getFmeasure(precision, recall)
+ results[j, 1] <- names(pred.values.split)[j]
+ results[j, 2:4] <- c(precision, recall, f.measure)
+ }
> print(results)
Model Precision Recall F.Measure
1 knn3 0.6060606 0.4651163 0.5263158
2 multinom 0.6153846 0.1860465 0.2857143
3 nb 0.5217391 0.5581395 0.5393258
4 rpart 0.6486486 0.5581395 0.6000000
5 svmRadial 0.7000000 0.3255814 0.4444444
6 treebag 0.5952381 0.5813953 0.5882353