```
## This script illustrates the nested versus non-nested
## random effects functionality in the R packages lme4 (lmer)
## and nlme (lme).
library("lme4")
library("nlme")
data("Oxide")
Oxide <- as.data.frame(Oxide)
## In the Oxide data, Site is nested in Wafer, which
## is nested in Lot. But, the latter appear crossed:
xtabs(~ Lot + Wafer, Oxide)
## Create a variable that identifies unique Wafers
Oxide$LotWafer <- with(Oxide, interaction(Lot,Wafer))
## For continuous response 'Thickness',
## fit nested model y_{ijk} = a + b_i + g_{ij} + e_{ijk}
## for Lot i = 1:8, Wafer j = 1:3, and Site k = 1:3,
## where b_i ~ N(0, \sigma_1^2),
## g_{ij} ~ N(0, \sigma_2^2),
## and b_i is independent of g_{ij}
## The following four models are identical:
## lme4
lmer(Thickness ~ (1 | Lot/Wafer), data=Oxide)
lmer(Thickness ~ (1 | Lot) + (1 | LotWafer), data=Oxide)
## Note: the equivalence of the above formulations makes
## clear that the intercept indexed by Wafer within Lot
## has the same variance across Lots.
## nlme
lme(Thickness ~ 1, random= ~1 | Lot/Wafer, data=Oxide)
lme(Thickness ~ 1, random=list(~1|Lot, ~1|Wafer), data=Oxide)
## Note: the second formulation illustrates that lme assumes
## nesting in the order that grouping factors are listed. I
## think that this was a poor implementation decision, and
## that the latter should indicate non-nested grouping.
## Fit non-nested (crossed) model y_{ijk} = a + b_i + g_j + e_{ijk}
## for Lot i = 1:8, Wafer j = 1:3, and Site k = 1:3,
## where b_i ~ N(0, \sigma_1^2),
## g_j ~ N(0, \sigma_2^2),
## and b_i is independent of g_j
## lme4
lmer(Thickness ~ (1 | Lot) + (1 | Wafer), data=Oxide)
lmer(Thickness ~ (1 | Wafer) + (1 | Lot), data=Oxide)
## nlme: There is no 'easy' way to fit crossed random
## effects with nlme. The trick below assigns a random
## slope to each level of the grouping variables, which
## are indexed by the levels of a dummy grouping variable
## with only one level. We also specify, for each grouping
## factor, that the covariance matrix is proportional to
## the identity matrix.
Oxide$Dummy <- factor(1)
Oxide <- groupedData(Thickness ~ 1 | Dummy, Oxide)
lme(Thickness ~ 1, data=Oxide,
    random=pdBlocked(list(pdIdent(~ 0 + Lot),
                          pdIdent(~ 0 + Wafer))))
```

The image below is my interpretation of the nlme (lme) trick for non-nested (crossed) random effects. The idea is to assign a random slope (no intercept) to each level of the grouping factors, which are each indexed by the levels of a dummy variable that has exactly one level. The `pdIdent` function ensures that these random effects are uncorrelated and have a common variance. The `pdBlocked` function specifies that the random effects are also independent across the two grouping factors.

A Monte Carlo simulation can be implemented to study how the control group frequencies, the odds ratio associated with treatment allocation (i.e., the 'treatment effect'), and the sample size affect the power or precision of a null hypothesis test or confidence interval for the treatment effect.

In order to simulate this process, it's necessary to specify each of the following:

- control group frequencies
- treatment effect
- sample size
- testing or confidence interval procedure

Ideally, the control group frequencies would be informed by preliminary data, but expert opinion can also be useful. Once specified, the control group frequencies can be converted to intercepts in the POCL model framework. There is an analytical solution for this; see the link above. But, a quick and dirty method is to simulate a large sample from the control group population, and then fit an intercept-only POCL model to those data. The code below demonstrates this, using the `polr` function from the `MASS` package.

```
## load MASS for polr()
library(MASS)
## specify frequencies of 11 ordered categories
prbs <- c(1,5,10,15,20,40,60,80,80,60,40)
prbs <- prbs/sum(prbs)
## sample 1000 observations with probabilities prbs
resp <- factor(replicate(1000, sample(0:10, 1, prob=prbs)),
               ordered=TRUE, levels=0:10)
## fit POCL model; extract intercepts (zeta here)
alph <- polr(resp~1)$zeta
```
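The analytical conversion can also be sketched directly: in the POCL parameterization used by `polr`, each intercept is the logit of the corresponding cumulative probability. The sketch below uses only base R (the name `alph_exact` is illustrative); the fitted `polr` intercepts above should be close to these values.

```r
## analytical POCL intercepts: alpha_j = logit(P(Y <= j))
## using the control group frequencies specified above
prbs <- c(1,5,10,15,20,40,60,80,80,60,40)
prbs <- prbs/sum(prbs)
## cumulative probabilities for categories 0..9
## (the last category has cumulative probability 1, so no intercept)
alph_exact <- qlogis(cumsum(prbs)[1:10])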

As in most other types of power analysis, the treatment effect can represent the minimum effect that the study should be designed to detect with a specified degree of power; or in a precision analysis, the maximum confidence interval width in a specified fraction of samples. In this case, the treatment effect is encoded as a log odds ratio, i.e., a slope parameter in the POCL model.
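To make the encoding concrete, the sketch below (base R; the intercepts are illustrative, not those fitted above) verifies the proportional odds property: a single slope shifts every cumulative logit by the same amount, so the cumulative odds ratio is constant across all category cut-points.

```r
## proportional odds: the cumulative odds ratio is constant across cut-points
expit <- function(x) 1/(1 + exp(-x))
odds  <- function(p) p/(1 - p)
alph  <- qlogis(c(0.2, 0.5, 0.8))   # illustrative intercepts
beta  <- log(0.25)                  # treatment log odds ratio
## cumulative odds ratio comparing treated (x=1) to control (x=0)
## under logit P(Y <= j | x) = alpha_j - x*beta
or_cut <- odds(expit(alph - beta)) / odds(expit(alph))
## every element equals exp(-beta) = 4, i.e., OR = 0.25 for larger outcomes
```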

Given the intercept and slope parameters, observations from the POCL model can be simulated as follows, where treatment is assigned by permuted block randomization (blocks of size four, two treatment groups):

```
## convenience functions
logit <- function(p) log(1/(1/p-1))
expit <- function(x) 1/(1/exp(x) + 1)
## block randomization
## n - number of randomizations
## m - block size
## levs - levels of treatment
block_rand <- function(n, m, levs=LETTERS[1:m]) {
    if(m %% length(levs) != 0)
        stop("length(levs) must be a factor of 'm'")
    k <- if(n %% m > 0) n %/% m + 1 else n %/% m
    l <- m %/% length(levs)
    ## truncate to exactly n assignments (the last block may be partial)
    factor(c(replicate(k, sample(rep(levs, l),
        length(levs)*l, replace=FALSE))), levels=levs)[seq_len(n)]
}
## simulate from POCL model
## n - sample size
## a - alpha
## b - beta
## levs - levels of outcome
pocl_simulate <- function(n, a, b, levs=0:length(a)) {
    dat <- data.frame(Treatment=block_rand(n, 4, LETTERS[1:2]))
    des <- model.matrix(~ 0 + Treatment, data=dat)
    nlev <- length(a) + 1
    yalp <- c(-Inf, a, Inf)
    xbet <- matrix(c(rep(0, nrow(des)),
                     rep(des %*% b, nlev-1),
                     rep(0, nrow(des))), nrow(des), nlev+1)
    prbs <- sapply(1:nlev, function(lev) {
        yunc <- rep(lev, nrow(des))
        expit(yalp[yunc+1] - xbet[cbind(1:nrow(des), yunc+1)]) -
            expit(yalp[yunc] - xbet[cbind(1:nrow(des), yunc)])
    })
    colnames(prbs) <- levs
    dat$y <- apply(prbs, 1, function(p) sample(levs, 1, prob=p))
    dat$y <- unname(factor(dat$y, levels=levs, ordered=TRUE))
    return(dat)
}
```
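As a quick sanity check on the randomization scheme, the sketch below (reproducing a version of `block_rand`, truncated to exactly n assignments, so the snippet is self-contained) confirms that each complete block of four contains exactly two assignments to each treatment:

```r
## check within-block balance of permuted block randomization
block_rand <- function(n, m, levs=LETTERS[1:m]) {
    if(m %% length(levs) != 0)
        stop("length(levs) must be a factor of 'm'")
    k <- if(n %% m > 0) n %/% m + 1 else n %/% m
    l <- m %/% length(levs)
    factor(c(replicate(k, sample(rep(levs, l),
        length(levs)*l, replace=FALSE))), levels=levs)[seq_len(n)]
}
set.seed(1)
trt <- block_rand(40, 4, LETTERS[1:2])
## count "A" assignments within each consecutive block of four;
## every column should equal 2
colSums(matrix(trt == "A", nrow=4))
```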

The testing procedure we consider here is a likelihood ratio test with 5% type-I error rate:

```
## Likelihood ratio test with 0.05 p-value threshold
## dat - data from pocl_simulate
pocl_test <- function(dat) {
    fit <- polr(y ~ Treatment, data=dat)
    anova(fit, update(fit, ~ . - Treatment))$"Pr(Chi)"[2] < 0.05
}
```
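The mechanics of this test can be illustrated on a toy dataset in which the outcome is independent of treatment, so the null hypothesis holds and the p-value should be uniformly distributed. All names below are illustrative.

```r
## toy illustration of the likelihood ratio test used in pocl_test
library(MASS)  # for polr
set.seed(42)
toy <- data.frame(
    Treatment = factor(rep(c("A","B"), each=50)),
    y = factor(sample(0:4, 100, replace=TRUE), levels=0:4, ordered=TRUE))
fit <- polr(y ~ Treatment, data=toy)
## LRT p-value comparing against the intercept-only model
pval <- anova(fit, update(fit, ~ . - Treatment))$"Pr(Chi)"[2]
```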

The code below demonstrates the calculation of statistical power associated with a sample of size 50 and odds ratio 0.25, where the control group frequencies of each category are as specified above. When executed, which takes some time, this gives about 80% power.

```
## power: n=50, OR=0.25
mean(replicate(10000, pocl_test(pocl_simulate(50, a=alph, b=c(0, log(0.25))))))
```

The figure below illustrates the power associated with a sequence of odds ratios. The dashed line represents the nominal type-I error rate 0.05.

Simulation-based power and precision analysis is a powerful technique that ensures the reported statistical power reflects the intended statistical analysis (often, in research proposals, the proposed statistical analysis is not the same as that used to evaluate statistical power). In addition to the simple analysis described above, it is also possible to evaluate an adjusted analysis, i.e., the power to detect a treatment effect after adjustment for covariate effects. Of course, this requires that the latter effects be specified, and that there is some mechanism to simulate covariates. This can be a difficult task, but it makes clear that a realistic power analysis involves many assumptions.

Another advantage to simulation-based power analysis is that it requires implementation of the planned statistical procedure before the study begins, which ensures its feasibility and provides an opportunity to consider details that might otherwise be overlooked. Of course, it may also accelerate the 'real' analysis, once the data are collected.

Here is the complete R script:

```
## load MASS for polr()
library(MASS)
## specify frequencies of 11 ordered categories
prbs <- c(1,5,10,15,20,40,60,80,80,60,40)
prbs <- prbs/sum(prbs)
## sample 1000 observations with probabilities prbs
resp <- factor(replicate(1000, sample(0:10, 1, prob=prbs)),
               ordered=TRUE, levels=0:10)
## fit POCL model; extract intercepts (zeta here)
alph <- polr(resp~1)$zeta
## convenience functions
logit <- function(p) log(1/(1/p-1))
expit <- function(x) 1/(1/exp(x) + 1)
## block randomization
## n - number of randomizations
## m - block size
## levs - levels of treatment
block_rand <- function(n, m, levs=LETTERS[1:m]) {
    if(m %% length(levs) != 0)
        stop("length(levs) must be a factor of 'm'")
    k <- if(n %% m > 0) n %/% m + 1 else n %/% m
    l <- m %/% length(levs)
    ## truncate to exactly n assignments (the last block may be partial)
    factor(c(replicate(k, sample(rep(levs, l),
        length(levs)*l, replace=FALSE))), levels=levs)[seq_len(n)]
}
## simulate from POCL model
## n - sample size
## a - alpha
## b - beta
## levs - levels of outcome
pocl_simulate <- function(n, a, b, levs=0:length(a)) {
    dat <- data.frame(Treatment=block_rand(n, 4, LETTERS[1:2]))
    des <- model.matrix(~ 0 + Treatment, data=dat)
    nlev <- length(a) + 1
    yalp <- c(-Inf, a, Inf)
    xbet <- matrix(c(rep(0, nrow(des)),
                     rep(des %*% b, nlev-1),
                     rep(0, nrow(des))), nrow(des), nlev+1)
    prbs <- sapply(1:nlev, function(lev) {
        yunc <- rep(lev, nrow(des))
        expit(yalp[yunc+1] - xbet[cbind(1:nrow(des), yunc+1)]) -
            expit(yalp[yunc] - xbet[cbind(1:nrow(des), yunc)])
    })
    colnames(prbs) <- levs
    dat$y <- apply(prbs, 1, function(p) sample(levs, 1, prob=p))
    dat$y <- unname(factor(dat$y, levels=levs, ordered=TRUE))
    return(dat)
}
## Likelihood ratio test with 0.05 p-value threshold
## dat - data from pocl_simulate
pocl_test <- function(dat) {
    fit <- polr(y ~ Treatment, data=dat)
    anova(fit, update(fit, ~ . - Treatment))$"Pr(Chi)"[2] < 0.05
}
## power: n=50, OR=0.25
mean(replicate(10000, pocl_test(pocl_simulate(50, a=alph, b=c(0, log(0.25))))))
```

As part of my course on statistical learning, we created 3D graphics to foster a more intuitive understanding of the various methods that are used to relax the assumption of linearity (in the predictors) in regression and classification methods.

The authors of our text (The Elements of Statistical Learning, 2nd Edition) provide a Mixture Simulation data set that has two continuous predictors and a binary outcome. This data is used to demonstrate classification procedures by plotting classification boundaries in the two predictors, which are determined by one or more surfaces (e.g., a probability surface such as that produced by logistic regression, or multiple intersecting surfaces as in linear discriminant analysis). In our class laboratory, we used the R package `rgl` to create a 3D representation of these surfaces for a variety of semiparametric classification procedures.

Chapter 6 presents local logistic regression and kernel density classification, among other kernel (local) classification and regression methods. Below is the code and graphic (a 2D projection) associated with the local linear logistic regression in these data:

```
library(rgl)
load(url("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda"))
dat <- ESL.mixture
ddat <- data.frame(y=dat$y, x1=dat$x[,1], x2=dat$x[,2])
## create 3D graphic, rotate to view 2D x1/x2 projection
par3d(FOV=1, userMatrix=diag(4))
plot3d(dat$xnew[,1], dat$xnew[,2], dat$prob, type="n",
       xlab="x1", ylab="x2", zlab="", axes=FALSE, box=TRUE, aspect=1)
## plot points and bounding box
x1r <- range(dat$px1)
x2r <- range(dat$px2)
pts <- plot3d(dat$x[,1], dat$x[,2], 1, type="p", radius=0.5, add=TRUE,
              col=ifelse(dat$y, "orange", "blue"))
lns <- lines3d(x1r[c(1,2,2,1,1)], x2r[c(1,1,2,2,1)], 1)
## draw Bayes (true) classification boundary in blue
dat$probm <- with(dat, matrix(prob, length(px1), length(px2)))
dat$cls <- with(dat, contourLines(px1, px2, probm, levels=0.5))
pls0 <- lapply(dat$cls, function(p) lines3d(p$x, p$y, z=1, color="blue"))
## compute probabilities and plot classification boundary
## associated with local linear logistic regression
probs.loc <- apply(dat$xnew, 1, function(x0) {
    ## smoothing parameter
    l <- 1/2
    ## compute (Gaussian) kernel weights
    d <- colSums((rbind(ddat$x1, ddat$x2) - x0)^2)
    k <- exp(-d/2/l^2)
    ## local fit at x0
    fit <- suppressWarnings(glm(y ~ x1 + x2, data=ddat, weights=k,
                                family=binomial(link="logit")))
    ## predict at x0
    as.numeric(predict(fit, type="response", newdata=as.data.frame(t(x0))))
})
dat$probm.loc <- with(dat, matrix(probs.loc, length(px1), length(px2)))
dat$cls.loc <- with(dat, contourLines(px1, px2, probm.loc, levels=0.5))
pls <- lapply(dat$cls.loc, function(p) lines3d(p$x, p$y, z=1))
## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.loc, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

In the above graphic, the solid blue line represents the true Bayes decision boundary (i.e., {x: Pr("orange"|x) = 0.5}), which is computed from the model used to simulate these data. The probability surface (generated by the local logistic regression) is represented in gray, and the corresponding Bayes decision boundary occurs where the plane f(x) = 0.5 (in light gray) intersects with the probability surface. The solid black line is a projection of this intersection. Here is a link to the interactive version of this graphic: local logistic regression.
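The core of the local fit is easy to isolate. The sketch below fits a kernel-weighted logistic regression at a single point x0, which is what the `apply` loop above does at every grid point; it uses simulated toy data, and all names are illustrative.

```r
## local (kernel-weighted) logistic regression at a single point x0
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 - x2))   # toy binary outcome
x0 <- c(0, 0)                         # evaluation point
l  <- 1/2                             # smoothing parameter
## Gaussian kernel weights based on squared distance to x0
d <- (x1 - x0[1])^2 + (x2 - x0[2])^2
k <- exp(-d/2/l^2)
## weighted logistic fit and predicted probability at x0
fit <- suppressWarnings(glm(y ~ x1 + x2, weights=k,
                            family=binomial(link="logit")))
p0 <- as.numeric(predict(fit, type="response",
                         newdata=data.frame(x1=x0[1], x2=x0[2])))
```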

Below is the code and graphic associated with the kernel density classification (note: the code below should only be executed after the above code, since the 3D graphic is modified, rather than created anew):

```
## clear the surface, decision plane, and decision boundary
pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)
## kernel density classification
## compute kernel density estimates for each class
dens.kde <- lapply(unique(ddat$y), function(uy) {
    apply(dat$xnew, 1, function(x0) {
        ## subset to current class
        dsub <- subset(ddat, y==uy)
        ## smoothing parameter
        l <- 1/2
        ## kernel density estimate at x0
        mean(dnorm(dsub$x1 - x0[1], 0, l)*dnorm(dsub$x2 - x0[2], 0, l))
    })
})
## compute prior for each class (sample proportion)
prir.kde <- table(ddat$y)/length(ddat$y)
## compute posterior probability Pr(y=1|x)
probs.kde <- prir.kde[2]*dens.kde[[2]] /
    (prir.kde[1]*dens.kde[[1]] + prir.kde[2]*dens.kde[[2]])
## plot classification boundary associated
## with kernel density classification
dat$probm.kde <- with(dat, matrix(probs.kde, length(px1), length(px2)))
dat$cls.kde <- with(dat, contourLines(px1, px2, probm.kde, levels=0.5))
pls <- lapply(dat$cls.kde, function(p) lines3d(p$x, p$y, z=1))
## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.kde, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```
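A one-dimensional toy version of the kernel density classifier (base R, simulated illustrative data) may clarify the computation: estimate a density per class, then combine with the class priors via Bayes' rule.

```r
## kernel density classification in one dimension (toy data)
set.seed(2)
xA <- rnorm(100, mean=0, sd=1)   # class A sample
xB <- rnorm(100, mean=2, sd=1)   # class B sample
x0 <- 0.5                        # point to classify
l  <- 1/2                        # kernel bandwidth
## Gaussian kernel density estimate for each class at x0
dA <- mean(dnorm(x0 - xA, 0, l))
dB <- mean(dnorm(x0 - xB, 0, l))
## equal priors; posterior Pr(class B | x0) by Bayes' rule
pA <- 0.5; pB <- 0.5
post_B <- pB*dB / (pA*dA + pB*dB)
## x0 is closer to the class A mean, so post_B should fall below 0.5
```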

Here are links to the interactive versions of both graphics: local logistic regression, kernel density classification

The solid line represents the Bayes decision boundary (i.e., {x: Pr("orange"|x) = 0.5}), which is computed from the model used to simulate these data. The Bayes decision boundary and other boundaries are determined by one or more surfaces (e.g., Pr("orange"|x)), which are generally omitted from the graphics. In class, we decided to use the R package `rgl` to create a 3D representation of this surface. Below is the code and graphic (well, a 2D projection) associated with the Bayes decision boundary:

```
library(rgl)
load(url("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda"))
dat <- ESL.mixture
## create 3D graphic, rotate to view 2D x1/x2 projection
par3d(FOV=1, userMatrix=diag(4))
plot3d(dat$xnew[,1], dat$xnew[,2], dat$prob, type="n",
       xlab="x1", ylab="x2", zlab="", axes=FALSE, box=TRUE, aspect=1)
## plot points and bounding box
x1r <- range(dat$px1)
x2r <- range(dat$px2)
pts <- plot3d(dat$x[,1], dat$x[,2], 1, type="p", radius=0.5, add=TRUE,
              col=ifelse(dat$y, "orange", "blue"))
lns <- lines3d(x1r[c(1,2,2,1,1)], x2r[c(1,1,2,2,1)], 1)
## draw Bayes (true) decision boundary; provided by authors
dat$probm <- with(dat, matrix(prob, length(px1), length(px2)))
dat$cls <- with(dat, contourLines(px1, px2, probm, levels=0.5))
pls <- lapply(dat$cls, function(p) lines3d(p$x, p$y, z=1))
## plot marginal (w.r.t. mixture) probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, dat$prob, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

In the above graphic, the probability surface is represented in gray, and the Bayes decision boundary occurs where the plane f(x) = 0.5 (in light gray) intersects with the probability surface.

Of course, the classification task is to estimate a decision boundary given the data. Chapter 5 presents two multidimensional splines approaches, in conjunction with binary logistic regression, to estimate a decision boundary. The upper panel of Figure 5.11 in the book shows the decision boundary associated with additive natural cubic splines in x_{1} and x_{2} (4 df in each direction; 1+(4-1)+(4-1) = 7 parameters), and the lower panel shows the corresponding tensor product splines (4x4 = 16 parameters), which are much more flexible, of course. The code and graphics below reproduce the decision boundaries shown in Figure 5.11, and additionally illustrate the estimated probability surface (note: the code below should only be executed after the above code, since the 3D graphic is modified, rather than created anew):
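The parameter counts quoted above can be checked directly with `model.matrix` on toy data (illustrative; the `splines` package ships with base R):

```r
## count parameters in the additive and tensor product spline bases
library(splines)
set.seed(3)
x1 <- runif(50); x2 <- runif(50)
## additive: intercept + 3 + 3 = 7 columns
p_add <- ncol(model.matrix(~ ns(x1, df=3) + ns(x2, df=3)))
## tensor product: 4 x 4 = 16 columns
p_tpr <- ncol(model.matrix(~ 0 + ns(x1, df=4, intercept=TRUE):
                               ns(x2, df=4, intercept=TRUE)))
c(p_add, p_tpr)  # 7 and 16
```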

Reproducing Figure 5.11 (top):

```
## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)
## fit additive natural cubic spline model
library(splines)
ddat <- data.frame(y=dat$y, x1=dat$x[,1], x2=dat$x[,2])
form.add <- y ~ ns(x1, df=3) + ns(x2, df=3)
fit.add <- glm(form.add, data=ddat, family=binomial(link="logit"))
## compute probabilities, plot classification boundary
probs.add <- predict(fit.add, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.add <- with(dat, matrix(probs.add, length(px1), length(px2)))
dat$cls.add <- with(dat, contourLines(px1, px2, probm.add, levels=0.5))
pls <- lapply(dat$cls.add, function(p) lines3d(p$x, p$y, z=1))
## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.add, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

Reproducing Figure 5.11 (bottom):

```
## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)
## fit tensor product natural cubic spline model
form.tpr <- y ~ 0 + ns(x1, df=4, intercept=TRUE):
                    ns(x2, df=4, intercept=TRUE)
fit.tpr <- glm(form.tpr, data=ddat, family=binomial(link="logit"))
## compute probabilities, plot classification boundary
probs.tpr <- predict(fit.tpr, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.tpr <- with(dat, matrix(probs.tpr, length(px1), length(px2)))
dat$cls.tpr <- with(dat, contourLines(px1, px2, probm.tpr, levels=0.5))
pls <- lapply(dat$cls.tpr, function(p) lines3d(p$x, p$y, z=1))
## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.tpr, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
```

Although the graphics above are static, it is possible to embed an interactive 3D version within a web page (e.g., see the `rgl` vignette; best with Google Chrome), using the `rgl` function `writeWebGL`. I gave up on trying to embed such a graphic into this WordPress blog post, but I have created a separate page for the interactive 3D version of Figure 5.11b. Duncan Murdoch's work with this package is really nice!


The software company GGASoftware has extended my work, and that of others, on the `sas7bdat` R package by developing a Java library called Parso, which also reads `sas7bdat` files. They have worked out most of the remaining kinks. For example, the Parso library reads `sas7bdat` files with compressed data (i.e., written with `COMPRESS=yes` or `COMPRESS=binary`). I hope to eventually bring the project full circle, and incorporate their improvements into the sas7bdat file format documentation and code in the `sas7bdat` package.

The Parso library is made available under the terms of the GPLv3, and is also available under a commercial license. So, last weekend, with the help of Tobias Verbeke's `helloJavaWorld` R package template, I implemented an R package that wraps the functionality of the Parso library. The new package, `sas7bdat.parso` (currently hosted exclusively on GitHub), depends on the R package `rJava`, and implements the functions `s7b2csv` and `read.sas7bdat.parso`. The former function is the workhorse, which reads a sas7bdat file and writes a corresponding CSV file. All of the file input/output happens in the Java implementation (for speed and simplicity). The latter function, `read.sas7bdat.parso`, simply converts a sas7bdat file to a temporary CSV file (i.e., using `tempfile`), and then reads the CSV file using `read.csv`. There may still be some kinks in the Parso library, or in the wrapper R package, but I hope that this additional resource will help finally eliminate the SAS data file barrier that many of us have experienced for years.

Installation of the R package `rJava` is more complicated than simply calling `install.packages("rJava")`. In order for the `rJava` package to work, and hence the `sas7bdat.parso` package, a JDK (Java Development Kit) must be installed. You can download the Oracle JDK from the Oracle website. Once the JDK is installed, the easiest way to install the `sas7bdat.parso` library is using the `install_github` function in the `devtools` package (e.g., `library("devtools"); install_github("biostatmatt/sas7bdat.parso")`). For additional details on installing the `rJava` package, see the RForge site.