As part of my course on statistical learning, we created 3D graphics to foster a more intuitive understanding of the various methods that are used to relax the assumption of linearity (in the predictors) in regression and classification methods.

The authors of our text (The Elements of Statistical Learning, 2nd Edition) provide a Mixture Simulation data set that has two continuous predictors and a binary outcome. This data is used to demonstrate classification procedures by plotting classification boundaries in the two predictors, which are determined by one or more surfaces (e.g., a probability surface such as that produced by logistic regression, or multiple intersecting surfaces as in linear discriminant analysis). In our class laboratory, we used the R package `rgl` to create a 3D representation of these surfaces for a variety of semiparametric classification procedures.

Chapter 6 presents local logistic regression and kernel density classification, among other kernel (local) classification and regression methods. Below is the code and graphic (a 2D projection) associated with the local linear logistic regression in these data:

```r
library(rgl)
load(url("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda"))
dat <- ESL.mixture
ddat <- data.frame(y=dat$y, x1=dat$x[,1], x2=dat$x[,2])

## create 3D graphic, rotate to view 2D x1/x2 projection
par3d(FOV=1, userMatrix=diag(4))
plot3d(dat$xnew[,1], dat$xnew[,2], dat$prob, type="n",
       xlab="x1", ylab="x2", zlab="",
       axes=FALSE, box=TRUE, aspect=1)

## plot points and bounding box
x1r <- range(dat$px1)
x2r <- range(dat$px2)
pts <- plot3d(dat$x[,1], dat$x[,2], 1, type="p", radius=0.5, add=TRUE,
              col=ifelse(dat$y, "orange", "blue"))
lns <- lines3d(x1r[c(1,2,2,1,1)], x2r[c(1,1,2,2,1)], 1)

## draw Bayes (true) classification boundary in blue
dat$probm <- with(dat, matrix(prob, length(px1), length(px2)))
dat$cls <- with(dat, contourLines(px1, px2, probm, levels=0.5))
pls0 <- lapply(dat$cls, function(p) lines3d(p$x, p$y, z=1, color="blue"))

## compute probabilities and plot classification boundary
## associated with local linear logistic regression
probs.loc <- apply(dat$xnew, 1, function(x0) {
  ## smoothing parameter
  l <- 1/2
  ## compute (Gaussian) kernel weights
  d <- colSums((rbind(ddat$x1, ddat$x2) - x0)^2)
  k <- exp(-d/2/l^2)
  ## local fit at x0
  fit <- suppressWarnings(glm(y ~ x1 + x2, data=ddat, weights=k,
                              family=binomial(link="logit")))
  ## predict at x0
  as.numeric(predict(fit, type="response", newdata=as.data.frame(t(x0))))
})
dat$probm.loc <- with(dat, matrix(probs.loc, length(px1), length(px2)))
dat$cls.loc <- with(dat, contourLines(px1, px2, probm.loc, levels=0.5))
pls <- lapply(dat$cls.loc, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.loc, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

In the above graphic, the solid blue line represents the true Bayes decision boundary (i.e., {x: Pr("orange"|x) = 0.5}), which is computed from the model used to simulate these data. The probability surface (estimated by the local logistic regression) is represented in gray, and the corresponding estimated decision boundary occurs where the plane f(x) = 0.5 (in light gray) intersects the probability surface. The solid black line is a projection of this intersection. Here is a link to the interactive version of this graphic: local logistic regression.

Below is the code and graphic associated with the kernel density classification (note: this code below should only be executed after the above code, since the 3D graphic is modified, rather than created anew):

```r
## clear the surface, decision plane, and decision boundary
pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)

## kernel density classification
## compute kernel density estimates for each class
dens.kde <- lapply(unique(ddat$y), function(uy) {
  apply(dat$xnew, 1, function(x0) {
    ## subset to current class
    dsub <- subset(ddat, y==uy)
    ## smoothing parameter
    l <- 1/2
    ## kernel density estimate at x0
    mean(dnorm(dsub$x1-x0[1], 0, l)*dnorm(dsub$x2-x0[2], 0, l))
  })
})

## compute prior for each class (sample proportion)
prir.kde <- table(ddat$y)/length(dat$y)

## compute posterior probability Pr(y=1|x)
probs.kde <- prir.kde[2]*dens.kde[[2]] /
  (prir.kde[1]*dens.kde[[1]] + prir.kde[2]*dens.kde[[2]])

## plot classification boundary associated
## with kernel density classification
dat$probm.kde <- with(dat, matrix(probs.kde, length(px1), length(px2)))
dat$cls.kde <- with(dat, contourLines(px1, px2, probm.kde, levels=0.5))
pls <- lapply(dat$cls.kde, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.kde, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

Here are links to the interactive versions of both graphics: local logistic regression, kernel density classification

The solid line represents the Bayes decision boundary (i.e., {x: Pr("orange"|x) = 0.5}), which is computed from the model used to simulate these data. The Bayes decision boundary and other boundaries are determined by one or more surfaces (e.g., Pr("orange"|x)), which are generally omitted from the graphics. In class, we decided to use the R package `rgl` to create a 3D representation of this surface. Below is the code and graphic (well, a 2D projection) associated with the Bayes decision boundary:

```r
library(rgl)
load(url("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda"))
dat <- ESL.mixture

## create 3D graphic, rotate to view 2D x1/x2 projection
par3d(FOV=1, userMatrix=diag(4))
plot3d(dat$xnew[,1], dat$xnew[,2], dat$prob, type="n",
       xlab="x1", ylab="x2", zlab="",
       axes=FALSE, box=TRUE, aspect=1)

## plot points and bounding box
x1r <- range(dat$px1)
x2r <- range(dat$px2)
pts <- plot3d(dat$x[,1], dat$x[,2], 1, type="p", radius=0.5, add=TRUE,
              col=ifelse(dat$y, "orange", "blue"))
lns <- lines3d(x1r[c(1,2,2,1,1)], x2r[c(1,1,2,2,1)], 1)

## draw Bayes (true) decision boundary; provided by authors
dat$probm <- with(dat, matrix(prob, length(px1), length(px2)))
dat$cls <- with(dat, contourLines(px1, px2, probm, levels=0.5))
pls <- lapply(dat$cls, function(p) lines3d(p$x, p$y, z=1))

## plot marginal (w.r.t. mixture) probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, dat$prob, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

In the above graphic, the probability surface is represented in gray, and the Bayes decision boundary occurs where the plane f(x) = 0.5 (in light gray) intersects with the probability surface.

Of course, the classification task is to estimate a decision boundary given the data. Chapter 5 presents two multidimensional spline approaches, in conjunction with binary logistic regression, to estimate a decision boundary. The upper panel of Figure 5.11 in the book shows the decision boundary associated with additive natural cubic splines in x_{1} and x_{2} (4 df in each direction; 1+(4-1)+(4-1) = 7 parameters), and the lower panel shows the corresponding tensor product splines (4x4 = 16 parameters), which are, of course, much more flexible. The code and graphics below reproduce the decision boundaries shown in Figure 5.11, and additionally illustrate the estimated probability surface (note: this code below should only be executed after the above code, since the 3D graphic is modified, rather than created anew):
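The parameter counts quoted above can be verified directly from the basis dimensions; this quick check is mine, not part of the original lab code:

```r
library(splines)

## number of basis columns for each term in the two models
x <- seq(0, 1, length.out = 50)
n.add <- 1 + ncol(ns(x, df = 3)) + ncol(ns(x, df = 3))  # intercept + 3 + 3
n.tpr <- ncol(ns(x, df = 4, intercept = TRUE))^2        # 4 x 4 tensor product
c(additive = n.add, tensor = n.tpr)  ## 7 and 16
```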

Reproducing Figure 5.11 (top):

```r
## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)

## fit additive natural cubic spline model
library(splines)
ddat <- data.frame(y=dat$y, x1=dat$x[,1], x2=dat$x[,2])
form.add <- y ~ ns(x1, df=3) + ns(x2, df=3)
fit.add <- glm(form.add, data=ddat, family=binomial(link="logit"))

## compute probabilities, plot classification boundary
probs.add <- predict(fit.add, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.add <- with(dat, matrix(probs.add, length(px1), length(px2)))
dat$cls.add <- with(dat, contourLines(px1, px2, probm.add, levels=0.5))
pls <- lapply(dat$cls.add, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.add, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

Reproducing Figure 5.11 (bottom):

```r
## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)

## fit tensor product natural cubic spline model
form.tpr <- y ~ 0 + ns(x1, df=4, intercept=TRUE):ns(x2, df=4, intercept=TRUE)
fit.tpr <- glm(form.tpr, data=ddat, family=binomial(link="logit"))

## compute probabilities, plot classification boundary
probs.tpr <- predict(fit.tpr, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.tpr <- with(dat, matrix(probs.tpr, length(px1), length(px2)))
dat$cls.tpr <- with(dat, contourLines(px1, px2, probm.tpr, levels=0.5))
pls <- lapply(dat$cls.tpr, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.tpr, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

Although the graphics above are static, it is possible to embed an interactive 3D version within a web page (e.g., see the `rgl` vignette; best with Google Chrome), using the `rgl` function `writeWebGL`. I gave up on trying to embed such a graphic into this WordPress blog post, but I have created a separate page for the interactive 3D version of Figure 5.11b. Duncan Murdoch's work with this package is really nice!


The software company GGASoftware has extended the work that I and others did on the `sas7bdat` R package by developing a Java library called Parso, which also reads `sas7bdat` files. They have worked out most of the remaining kinks. For example, the Parso library reads `sas7bdat` files with compressed data (i.e., written with `COMPRESS=yes` or `COMPRESS=binary`). I hope to eventually bring the project full circle, and incorporate their improvements into the sas7bdat file format documentation and code in the `sas7bdat` package.

The Parso library is made available under terms of the GPLv3, and is also available under a commercial license. So, last weekend, with the help of Tobias Verbeke's `helloJavaWorld` R package template, I implemented an R package that wraps the functionality of the Parso library. The new package, `sas7bdat.parso` (currently hosted exclusively on GitHub), depends on the R package `rJava`, and implements the functions `s7b2csv` and `read.sas7bdat.parso`. The former function is the workhorse, which reads a sas7bdat file and writes a corresponding CSV file. All of the file input/output happens in the Java implementation (for speed and simplicity). The latter function `read.sas7bdat.parso` simply converts a sas7bdat file to a temporary CSV file (i.e., using `tempfile`), and then reads the CSV file using `read.csv`. There may still be some kinks in the Parso library, or in the wrapper R package, but I hope that this additional resource will help finally eliminate the SAS data file barrier that many of us have experienced for years.
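The convert-then-read pattern described above can be sketched in a few self-contained lines. This is only an illustration of the tempfile round trip: `write.csv` stands in for the Java-backed `s7b2csv`, which is what the real package calls.

```r
## minimal sketch of the tempfile round trip used by read.sas7bdat.parso;
## write.csv below is a stand-in for the Parso-backed s7b2csv converter
round_trip <- function(dat) {
  csv <- tempfile(fileext = ".csv")
  on.exit(file.remove(csv))
  write.csv(dat, csv, row.names = FALSE)  # stand-in for s7b2csv(file, csv)
  read.csv(csv)                           # same final step as the package
}
round_trip(data.frame(x = 1:3, y = c("a", "b", "c")))
```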

Installation of the R package `rJava` is more complicated than simply calling `install.packages("rJava")`. In order for the `rJava` package to work, and hence the `sas7bdat.parso` package, a JDK (Java Development Kit) must be installed. You can download the Oracle JDK from the Oracle website. Once the JDK is installed, the easiest way to install the `sas7bdat.parso` library is using the `install_github` function in the `devtools` package (e.g., `library("devtools"); install_github("biostatmatt/sas7bdat.parso")`). For additional details on installing the `rJava` package, see the RForge site.

```r
set.seed(42)

# Consider an infinite population where the association
# between two variables x and y is described by the following:
#   y = x^2 + e
#   x ~ U(0, 15)
#   e ~ N(0, 10)
# We seek a linear approximation of the relationship between
# x and y that takes the form below, and minimizes the average
# squared deviation (i.e., the 'best' approximation):
#   hat(y) = a + b*x

# This function simulates "n" observations from the population.
simulate <- function(n=20) {
  x <- runif(n, 0, 15)
  y <- rnorm(n, x^2, 10)
  data.frame(y=y, x=x)
}

# This function finds the 'sample' best linear approximation to the
# relationship between x and y.
sam_fit <- function(dat) lm(y ~ x, data=dat)

# We can approximate the 'population' best linear approximation by
# taking a very large sample. Note that this only works well for
# statistics that converge to a population quantity.
pop_fit <- sam_fit(simulate(1e6))

## > pop_fit
##
## Call:
## lm(formula = y ~ x, data = simulate(1e+06))
##
## Coefficients:
## (Intercept)            x
##      -37.53        15.00
```
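As a check on `pop_fit`, the population least-squares coefficients can also be derived in closed form (this derivation is mine, not from the original post): for x ~ U(0, 15), E[x^k] = 15^k/(k+1), and the best linear approximation has slope b = Cov(x, x^2)/Var(x) and intercept a = E[y] - b*E[x].

```r
## closed-form population coefficients for y = x^2 + e, x ~ U(0, 15)
Ex  <- 15/2    # E[x]
Ex2 <- 15^2/3  # E[x^2] (= E[y], since E[e] = 0)
Ex3 <- 15^3/4  # E[x^3]
b <- (Ex3 - Ex*Ex2) / (Ex2 - Ex^2)  # Cov(x, x^2) / Var(x)
a <- Ex2 - b*Ex
c(a = a, b = b)  ## -37.5 and 15, matching the large-sample pop_fit
```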

The figure below illustrates the true quadratic curve and the best linear approximation (`pop_fit`), overlaid against 10000 samples.
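For completeness, a rough version of that figure can be drawn with a few lines. This is my sketch: `simulate` is redefined here so the block stands alone, and the line uses the population coefficients -37.5 and 15 reported above in place of `pop_fit`.

```r
## simulate 10000 observations and overlay the true curve and best line
simulate <- function(n=20) {
  x <- runif(n, 0, 15)
  data.frame(y = rnorm(n, x^2, 10), x = x)
}
dat10k <- simulate(1e4)
plot(y ~ x, data = dat10k, pch = ".", col = "gray")
curve(x^2, add = TRUE, lwd = 2)              # true quadratic curve
abline(a = -37.5, b = 15, lwd = 2, lty = 2)  # best linear approximation
```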

```r
# This function creates a level confidence region for the intercept
# and slope of the best linear approximation, and tests whether
# the region includes the corresponding population values.
# The 'adj' parameter adjusts the critical value, making the
# confidence region larger or smaller.
sam_int <- function(dat, val=c(a=0, b=0), level=0.95, adj=0.00) {
  s <- sam_fit(dat)
  d <- coef(s) - val
  v <- vcov(s)
  c <- qchisq(level+adj, 2)
  as.numeric(d %*% solve(v) %*% d) < c
}

# By specifying that the region has 95% confidence, we intend that
# the region includes the population quantity in 95% of samples.
# We can assess the coverage of the above 95% confidence region by
# drawing repeated samples from the population and checking whether
# the associated confidence regions include the population values:
coverage20 <- mean(replicate(1e4,
  sam_int(simulate(20), val=coef(pop_fit))))

## > coverage20
## [1] 0.855

# Because the true coverage is less than the nominal coverage,
# the confidence region is anti-conservative. However, suppose that
# we can adjust the critical value of the region so that the true
# coverage is equal to the nominal value:
coverage20a <- mean(replicate(1e4,
  sam_int(simulate(20), val=coef(pop_fit), adj=0.045)))

## > coverage20a
## [1] 0.9476
```

The code above illustrates that if we had access to a population, we can adjust the coverage of a confidence region to be correct. The animation (created using Yihui's animation package) below illustrates the original and corrected confidence regions for ten different samples of size 20, overlaid against 10000 sample estimates of `a` and `b`. Unfortunately, we don't generally have access to the population. Hence, the adjustment must be made by an alternative, empirical, mechanism. In the next post, I will show how to use iterated Monte Carlo (specifically the double bootstrap) to make such an adjustment.

In the prediction framework, we use model diagnostics to verify that the model fits well, which has a direct bearing on the quality of predictions. For example, a line generally does not fit a quadratic curve well. However, it is possible to make accurate inferences about a linear approximation to a quadratic curve. Hence, good model fit is not required to make quality inferences. Rather, the requirement is that the associated probability statements are correct.

Assessing model diagnostics is an indirect mechanism to comfort ourselves about the quality of inferences. As an alternative, we might attempt a more direct check, for example, by constructing an empirical estimate of coverage. We may then go further and adjust, or calibrate, the confidence interval to have the correct empirical coverage. These ideas are fundamental parts of the 'double bootstrap' and 'iterated Monte Carlo' methods. For the sake of argument, I will state that this type of empirical check and calibration is sufficient to fully replace model diagnostics for statistical inference. It is also my hypothesis that model diagnostics have historically been favored over iterated Monte Carlo methods (the double bootstrap appeared in the late 1980s) because the latter are more computationally intensive. Current computational tools mitigate, but do not eliminate, this concern.
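To preview the calibration idea, here is a toy example of my own devising (not the double bootstrap itself): for a skewed population, the usual t-interval for the mean under-covers, and inflating the critical value moves the empirical coverage back toward the nominal level.

```r
set.seed(1)
## empirical coverage of a two-sided t-interval for the mean of an
## Exp(1) population (true mean 1), with a critical-value multiplier
cover <- function(n, mult = 1, reps = 2000) {
  mean(replicate(reps, {
    x <- rexp(n)
    m <- mean(x); s <- sd(x)/sqrt(n)
    crit <- mult * qt(0.975, n - 1)
    (m - crit*s < 1) && (1 < m + crit*s)
  }))
}
cover(10)        # falls short of the nominal 0.95 (anti-conservative)
cover(10, 1.3)   # a larger critical value raises the empirical coverage
```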

I will present examples with R code in a later post.

```sh
#!/bin/sh
REXEC="/usr/local/bin/R --vanilla --slave"
$REXEC <<EOF
# create a random file name
pngfile <- paste0(format(Sys.time(), "%Y%m%d%H%M%S"),
                  paste(sample(letters, 10), collapse=""), ".png")

# create temporary graphic file
png(pngfile, type = "cairo")
x <- seq(-10, 10, length = 30)
y <- x
f <- function(x, y) { r <- sqrt(x^2 + y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
op <- par(bg = rgb(1, 1, 1, 0))
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "X", ylab = "Y", zlab = "Sinc( r )")
invisible(dev.off())

# write headers
pngsize <- file.info(pngfile)[["size"]]
cat("Content-type: image/png\n")
cat(paste("Content-length: ", pngsize, "\n\n", sep=""))

# open pipe to stdout and pass image data
con <- pipe("cat", "wb")
writeBin(readBin(pngfile, 'raw', n=pngsize), con)
flush(con)
close(con)

# remove intermediate graphic
invisible(file.remove(pngfile))
EOF
```

We have a loan with Volkswagen Credit (VC), and today received an offer to postpone our December payment of $447.50, for a small fee of $25. VC does a good job of making the offer read like a friendly holiday gesture:

'Tis the season to spread holiday cheer and join in the spirit of giving. That's why Volkswagen Credit wants to thank loyal customers like you by offering the opportunity to skip your December 2013 payment...

This is a terrible opportunity! It's essentially a new, one-month loan. The annualized simple interest rate paid to VC on this loan would have been 25 / 447.50 * 100 * 12 = 67%! That would be considered usurious in Canada.
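For the record, the arithmetic uses only the two numbers from the offer:

```r
## annualized simple interest rate on the one-month "skip a payment" loan
fee <- 25
payment <- 447.50
annualized <- fee / payment * 12 * 100  # monthly rate, annualized, in percent
round(annualized)  ## 67
```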
