Passing-Bablok Regression: R code for SAS users

While at the Joint Statistical Meeting a few weeks ago I was talking to a friend about various aspects of clinical trials. He indicated that no current R package was able to reproduce Passing-Bablok (PB) regression so that it exactly matched SAS. He ultimately wrote a couple of functions and kindly shared them with me so that I could share them here.

There is currently an R package (mcr) that will do Passing-Bablok regression but, as it turns out, the output does not precisely match SAS. Sadly, I don't currently have a copy of SAS available for this purpose, so I can't independently run the code and comparisons, though I would be interested in seeing such a comparison. However, I ran some examples below comparing these functions to the R mcr package.

The code snippets he shared with me only calculate the slope, the intercept, and the confidence interval for each coefficient. It's not ready to be a CRAN package, but the short functions and output are convenient and easy to use.

One interesting characteristic of comparing the two PB regressions (at least in the simulation below) is that the mcr package always produces a slightly smaller slope coefficient. In the simulation below the maximum difference is .00054 and the minimum difference is essentially zero (1.59e-12). The details of the differences are something I must look into at a later date.

Passing-Bablok Comparison Scatter Plot

Passing-Bablok Regression Differences

library(mcr)

# Function to generate some correlated data
genData = function(numobs=100, randval){
  R = matrix(cbind(1, .80, .80, 1), nrow=2)
  U = t(chol(R))
  nvars = dim(U)[1]
  set.seed(randval)
  random.normal = matrix(rnorm(nvars*numobs, 100, 10), nrow=nvars, ncol=numobs)
  X = U %*% random.normal
  newX = t(X)
  raw = newX
  return(raw)
}

PB.reg = function(X, Y, alpha=.05) {

  ## (1) Calculate all pairwise slopes
  x <- X
  y <- Y
  dat <- cbind(x, y)
  n <- length(x)
  S <- array(NA, dim=rep(n, 2))
  for(i in 1:(n-1)){
    for(j in (i+1):n) {
      if(i != j) {
        S[i,j] <- (y[i] - y[j])/(x[i] - x[j])
      }
    }
  }
  S <- sort(na.exclude(as.vector(S)))
  K <- sum(S <= -1) - .5 * sum(S == -1)
  N <- length(S)

  ## (2) Slope estimate: shifted median of the pairwise slopes
  b <- ifelse(N %% 2, S[(N+1)/2 + K], mean(S[N/2 + K + 0:1]))

  ### CI for b
  C.gamma <- qnorm(1 - alpha/2) * sqrt(n*(n-1)*(2*n+5)/18)
  M1 <- round((N - C.gamma)/2, 0)
  M2 <- N - M1 + 1
  CI.b <- c(LB=S[M1+K], UB=S[M2+K])

  ## (3) Intercept estimate
  a <- median(y - b*x)
  ### CI for a
  CI.a <- c(LB=median(y - CI.b["UB"]*x),
            UB=median(y - CI.b["LB"]*x))

  return(list(a=a, CI.a=CI.a, b=b, CI.b=CI.b))
}

# A quick function to give only the b coefficient
PB.fit.red = function(X, Y){
  fit.lm = PB.reg(X, Y)
  fit.lm$b
}

# Generate some one-time-use correlated data
simData = genData(numobs=20, c(12345))

# Updated function
pb.fit = PB.reg(simData[,1], simData[,2])

# Look at just the b coefficient, single example
PB.fit.red(simData[,1], simData[,2])

# mcreg function, single example
pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
slot(pb.fit2, "glob.coef")[2]

# Run a small simulation
nsims = 250
sim.data = matrix(NA, nrow=nsims, ncol=2)

# for() loop used for demonstration purposes.
# Not efficient code: applying a function over the vectors is more efficient.
for(i in 1:nsims){
  simData = genData(numobs=20, i)

  # Look at just the b coefficient, multiple simulations
  sim.data[i,1] = PB.fit.red(simData[,1], simData[,2])

  # mcreg function, multiple simulations
  pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
  sim.data[i,2] = slot(pb.fit2, "glob.coef")[2]
}

delta = sim.data[,1] - sim.data[,2]
mean(delta)
max(delta)
min(delta)

hist(delta, nclass=50, main="Distribution of Differences in PB Regression")
plot(sim.data, main="Difference Between Two Types of Passing-Bablok Non-Parametric Regression",
     xlab="Improved PB Function", ylab="mcr Package")

When Discussing Confidence Level With Others…

This post spawned from a discussion I had the other day. Confidence intervals are a notoriously difficult topic for those unfamiliar with statistics. I can't really think of another statistical topic that is so widely published in newspapers, on television, and elsewhere, yet that so few people really understand. It's been this way since the moment Jerzy Neyman proposed the idea (in an appendix, no less) in 1937.

What the Confidence Interval is Not

There are a lot of things that the confidence interval is not. Unfortunately, many of these are often used to define the confidence interval.

  • It is not the probability that the true value is in the confidence interval.
  • It is not that we will obtain the true value 95% of the time.
  • We are not 95% sure that the true value lies within the interval of the one sample.
  • It is not the probability that we are correct.
  • It does not say anything about how accurate the current estimate is.
  • It does not mean that if we calculate a 95% confidence interval then the true value is, with certainty, contained within that one interval.

The Confidence Interval

There are several core assumptions that need to be met to use confidence intervals, often including random selection and independent and identically distributed (IID) data, among others. When one computes a confidence interval repeatedly, one will find that the true value lies within the computed interval 95 percent of the time. That means that in the long run, if we keep on computing these confidence intervals, then 95% of those intervals will contain the true value.

Margin of Error by Proportion

The other 5%

When we have a “95% Confidence Interval” it means that if we repeatedly conduct this survey using the exact same procedures, then 95% of the intervals will contain the actual “true value” in the long run. But that leaves a remaining 5%. Where did that go? This gets into hypothesis testing, rejecting the null (H_{0}), and concluding the alternative (H_{a}). That 5% is the probability of making a Type I error and is often identified by the Greek letter alpha (\alpha); it is also called the significance level. This means that the probability of rejecting the null hypothesis when the null hypothesis is in fact true is 5%.
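As a small illustration of how the significance level ties to the interval itself (a quick sketch of my own, not part of the original discussion), the confidence level determines the critical value used to build the interval:

alpha = .05
qnorm(1-alpha/2)   # ~1.96, the multiplier for a 95% confidence interval

alpha = .10
qnorm(1-alpha/2)   # ~1.64, the multiplier for a 90% confidence interval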

The Population

Simply looking at the formulas used to calculate a confidence interval, we can see that it is a function of the data (mean and variance). Unless the finite population correction (FPC) is used, it is not related to the population size. If we have a population of one hundred thousand or one hundred million, the confidence interval will be the same. With populations of that size the FPC is so minuscule that it won't really change anything anyway.
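To put a number on how little the population size matters, here is a minimal sketch (my own, using the standard FPC factor sqrt((N-n)/(N-1))):

# Finite population correction factor for a sample of n = 1000
fpc = function(N, n) sqrt((N-n)/(N-1))
fpc(1e5, 1000)  # population of one hundred thousand: ~0.995
fpc(1e8, 1000)  # population of one hundred million:  ~0.999995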

The Margin of Error

A direct component of the confidence interval is the margin of error. This is the number most widely seen in the news, whether in print, on TV, or otherwise. Often, however, the confidence level is excluded from these articles. One can normally assume a 95% confidence level, but what makes the whole thing difficult is that the margin of error could instead be based on a 90% confidence level, making the margin of error smaller and giving an artificial impression of the survey's accuracy. The graph below shows the sample size needed for a given margin of error. This graph is based on the conservative 50% proportion: because .5*.5 maximizes p(1-p), any other proportion will decrease the margin of error. Often the "magic number" for sample size seems to be in the neighborhood of 1000 respondents (with, according to Pew, a 9% response rate for telephone surveys).
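As a quick check on that "magic number" (a sketch assuming a 95% confidence level and the conservative 50% proportion):

# Margin of error for a proportion: z * sqrt(p*(1-p)/n)
n = 1000; p = .5
qnorm(.975) * sqrt(p*(1-p)/n)   # about 0.031, i.e. roughly +/- 3%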

 

The Other Error

Margin of error isn't the only error. Keep in mind that the word error should not be confused with there being a mistake in the research. Error simply means random variation due to sampling. So when a survey or other study indicates a margin of error of +/- 3%, that is simply the error (variation) due to random sampling. There are all sorts of other types of error that can work their way into the research including, but not limited to, differential response, question wording on surveys, weather, and so on. Many books have been written on this topic.

Some Examples

ci-plot

Margin of Error

alpha = .01
reps = 100000
true.mean = 0
true.var = 1

true.prop = .25

raw = replicate(reps, rnorm(100,true.mean,true.var))

# Calculate the mean and standard error for each of the replicates
raw.mean = apply(raw, 2, mean)
raw.se = apply(raw, 2, sd)/sqrt( nrow(raw) )

# Calculate the margin of error
raw.moe = raw.se * qnorm(1-alpha/2)

# Set up upper and lower bound matrix. This format is useful for the graphs
raw.moe.mat = rbind(raw.mean+raw.moe, raw.mean-raw.moe)
row.names(raw.moe.mat) = c(alpha/2, 1-alpha/2)

# Calculate the empirical (simulated) confidence level
( raw.CI = (1-sum(
  as.numeric( apply(raw.moe.mat, 2, min) > 0 | apply(raw.moe.mat, 2, max) < 0 )
)/reps)*100 )

# Try some binomial distribution data
raw.bin.mean = rbinom(reps, 50, prob=true.prop)/50
raw.bin.moe = sqrt(raw.bin.mean*(1-raw.bin.mean)/50)*qnorm(1-alpha/2)
raw.bin.moe.mat = rbind(raw.bin.mean+raw.bin.moe, raw.bin.mean-raw.bin.moe)
row.names(raw.bin.moe.mat) = c(alpha/2, 1-alpha/2)
( raw.bin.CI = (1-sum(
  as.numeric( apply(raw.bin.moe.mat, 2, min) > true.prop | apply(raw.bin.moe.mat, 2, max) <= true.prop )
)/reps)*100 )

# Plot the first 100 simulated confidence intervals
par(mfrow=c(1,1))
ind = 1:100
ind.odd = seq(1, 100, by=2)
ind.even = seq(2, 100, by=2)
matplot(rbind(ind,ind), raw.moe.mat[,1:100], type="l", lty=1, col=1,
        xlab="Sample Identifier", ylab="Response Value",
        main=expression(paste("Confidence Intervals with ", alpha, "=.01")),
        sub=paste("Simulated Confidence Level: ", raw.CI, "%", sep=""), xaxt='n')
axis(side=1, at=ind.odd, tcl=-1.0, lty=1, lwd=0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl=-0.7, lty=1, lwd=0.5, labels=rep("", length(ind.even)), cex.axis=.75)
points(ind, raw.mean[1:100], pch=19, cex=.4)
abline(h=0, col="#0000FF")

# Margin of error as a function of sample size (conservative 50% proportion)
size.seq = seq(0, 10000, by=500)[-1]
moe.seq = sqrt( (.5*(1-.5))/size.seq ) * qnorm(1-alpha/2)
plot(size.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Sample Size',
     ylab='Margin of Error', xlab='Sample Size', sub='Based on 50% Proportion')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl=-1.0, lty=1, lwd=0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0, 15, by=.005), tcl=-0.7, lty=1, lwd=0.5, labels=seq(0, 15, by=.005), cex.axis=.75)
abline(h=seq(0, 15, by=.005), col='#CCCCCC')
abline(v=size.seq, col='#CCCCCC')

# Margin of error as a function of the proportion (n = 1000)
size.seq = seq(0, 1, by=.01)
moe.seq = sqrt( (size.seq*(1-size.seq))/1000 ) * qnorm(1-alpha/2)
plot(size.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Proportion',
     ylab='Margin of Error', xlab='Proportion', sub='Based on n = 1000')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl=-1.0, lty=1, lwd=0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0, 15, by=.005), tcl=-0.7, lty=1, lwd=0.5, labels=seq(0, 15, by=.005), cex.axis=.75)
abline(h=seq(0, 15, by=.005), col='#CCCCCC')
abline(v=.5, col="#CCCCCC")

Data Scientists and Statisticians: Can't We All Just Get Along?

It seems that the term "data science" has taken the world by storm. It's a title that conjures up almost mystical abilities of a person garnering information from oceans of data with ease. It's as if a data scientist can wave his or her hand like a Jedi Knight and simply tell the data what it should be.

What is interesting about the field of data science is its perceived (possibly real) threat to other fields, namely statistics. It seems to me that the two fields are distinct areas. Though the two fields can exist separately on their own, each is weak without the other. Hilary Mason (of Bitly) shares her definition of a data scientist; I suppose my definition differs from hers. Statisticians need to understand the science and structure of data, and data scientists need to understand statistics. Larry Wasserman over at the Normal Deviate blog shares his thoughts on statistics and data science. There are other blogs, but these two are probably sufficient.

Data science is emerging as a field of absolutes, and that is something that the general public can wrap their heads around. It's no wonder that statisticians are feeling threatened by data scientists. Here are two (albeit extreme) examples:

If a statistician presents an estimate to a journalist and says "here is the point estimate of the number of people listening to a given radio station, and the margin of error is +/- 3% with a 90% confidence interval," there is almost always a follow-up discussion about the margin of error, how the standard error was calculated (simple random, stratified, cluster), and why it is a 90% confidence interval rather than a 95% confidence interval. Then someone is bound to ask what a confidence interval is anyway. Extend this even further and the statistician gives the journalist a p-value. Now there is an argument among statisticians about hypothesis testing, and the terms "frequentist" and "Bayesian" start getting thrown around.

It's no wonder that people don't want to work with statisticians. Not only are they confusing to the general public, but statisticians can't even agree (even if it's a friendly disagreement) on what is correct. Now take the following data scientist example:

A data scientist looks through a small file of 50 billion records where people have listened to songs through a registration-based online radio station (e.g. Spotify, Pandora, TuneIn, etc.). This data scientist then merges and matches the records to a handful of public data sources to give the dataset a dimensionality of 10000. The data scientist then simply reports that there are X number of listeners in a given metro area listening for Y amount of time and produces a great SVG graph that can be dynamically updated each week with the click of a button on a website. It is a fairly simple task and just about everyone can understand what it means.

I feel that there will always be a need for a solid foundation in statistics. There will always exist natural variation that must be measured and accounted for. There will always be data that is so expensive that only a limited number of observations can feasibly be collected, or data so difficult to obtain that only a handful of observations can be collected at all. I would conjecture that a data scientist would not have a clue what to do with that data without help from someone with a background in statistics. At the same time, if a statistician were told that there is a 50 billion by 10000 dimension dataset sitting on a Hadoop cluster, then I would also guess that many statisticians would be hard pressed to set the data up for analysis without consulting a data scientist. And a data scientist would probably struggle if asked to take those 10000 dimensions and reduce them down to a digestible and understandable set.

 

Take another example: genetic sequencing. A data scientist could work the data and discover that in one sequence there is something different. Then a domain expert can come in and find that the mutation is in the BRCA1 gene and that BRCA1 relates to breast cancer. A statistician can then be consulted to find the risk and probability that the particular mutation will result in increased mortality and the probability that the patient will ultimately get breast cancer.

Ultimately, the way I see it, the two disciplines need to come together and become one. I see no reason why it can't be part of the curriculum in statistics departments to teach students how to work with real-world data. Those working in the data science and statistics fields need to have statistical training while also having the ability to work with data regardless of its location, format, or size.

JSM 2013 – Wednesday

I was able to attend a continuing education short course workshop at the JSM conference that proved to be quite insightful.  The discussion was on data mining and was titled "Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets".  The presentation was given by Dan Steinberg and the examples that he gave were based on a proprietary software package called SPM (Salford Predictive Modeler).  I have not personally used the software so I'm in no position to endorse or discourage its use.  I generally prefer open source solutions unless there is resounding evidence to use commercial products, so I'm interested in seeing how this software operates.  The slides for this presentation (as well as other continuing education courses) are available on their website http://info.salford-systems.com/jsm-2013-ctw.  Much of the workshop dealt with a dataset relating to car insurance fraud and how to use CART models and Random Forests.  As an aside, I made a post a while back giving some examples in R of those models.  The workshop was educational and informative on how to approach these types of problems using a different software package.  I'm particularly interested in comparing SPM to R, or seeing if others have already run some comparisons.

JSM 2013 – Tuesday

The Joint Statistical Meeting in Montreal has proven to be very good.   Here are a few highlights from Tuesday's sessions.  There is one major problem: there are too many good sessions to attend.  During one time block I had six sessions that I wanted to go to.  Unfortunately, it is simply not possible to make it to all of them.  However, the recurring theme is that if you don't know at least R then you will quickly be left in the dust.  Based on the sessions so far, knowing R is a must, and knowing other languages such as Java, Scala, or Python will certainly help.

Session on Analytics and Data Visualization in Professional Sports


During the morning I attended a session on statistics in sports.  It was mostly several sabermetric presentations, with some basketball in there too.  One presentation caught my attention due to the granularity of the data that the presenter used.  Benjamin Baumer's R package, openWAR, is an open-source version of WAR (Wins Above Replacement) in baseball.  With the data that the package accesses, it is able to identify every play as well as the spatial location of where the batter hit the ball on the field.  If someone is interested in sports statistics, or just in playing with a lot of publicly available data, then openWAR is a great resource (currently available on GitHub at https://github.com/beanumber/openWAR).  This presentation also discussed the distribution of the players on the field and their ability to field the ball once it was hit.  A different presentation, from Sportvision, covered the location and trajectory of the ball as the pitcher throws it.  Sportvision also shows the location in the strike zone where the batter hits the ball the hardest.  They are the same company that does the 1st & 10 graphics (i.e. the yellow line needed for a 1st down).
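If anyone wants to try the package, it can presumably be installed straight from that GitHub repository; a minimal sketch (assuming the devtools package is installed) would be:

# Install openWAR from the GitHub repository linked above
library(devtools)
install_github("beanumber/openWAR")
library(openWAR)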

Session on Statistical Computing: Software and Graphics

I attended the Statistical Computing session, and 5 of the 6 presentations were on R packages.  The first was a presentation on Muste, the R implementation of Survo.  I have not used Survo before but I will certainly do some research into it.  The next presentation was by Stephan Ritter, the maintainer of the relaxnet and widenet packages.  The third presentation was by David Dahl, the maintainer of jvmr.  With this package one can integrate Scala and Java into R without any special compilation.  TIBCO Spotfire then presented the TIBCO Enterprise Runtime for R (TERR).  This looks to be an interesting solution to some of the data management issues that exist in R; the presenter indicated that it does a very good job of managing system resources.  The fifth presentation discussed the Rcpp package, and the final presentation, by Christopher Fonnesbeck, was on PyMC, which allows a user to perform Bayesian statistical analysis in Python.

JSM 2013 – Monday

I am currently attending the 2013 Joint Statistical Meeting in Montreal. I will try to share a few of the things that I take away each day.


Last night (Monday) I attended the JSM keynote address by Nate Silver and it proved to be a very interesting discussion.  Silver is best known for his work on http://www.fivethirtyeight.com.  His speech was good and focused on the journalistic component of statistics.  He shared that, in his opinion, statisticians should blog and contribute to journalism.  He also added, though it's a matter of personal opinion, and I don't agree, that a statistician should get a few years of hands-on work before going on for an advanced degree.  I'm of the philosophy that you should just do it all at once; otherwise you might find yourself in a position where, for one reason or another, you simply can't go back to school.

I think Silver gave the quote of the conference.  Silver was asked his opinion on the difference between Data Scientist and Statistician.  His response was, "Data Scientist is just a sexed up version of Statistician."  He then added that the titles are effectively redundant: just do good work and call yourself whatever you want.

The question was also asked during the Q&A portion why he feels election exit polls should not be trusted.  I disagree with Silver on this point.  He feels that exit polls are wrong, and his arguments include the sample design of the exit poll (a cluster design).  His argument was that a cluster design increases the margin of error.  This is a true statement, but it misses the whole point of sample design and the fact that the exit poll uses a 99.9% confidence level to call a race, which alone increases the margin of error.  This is because news networks are not in the business of calling races wrong and looking foolish.  Exit polls serve their purpose and have ultimately been quite accurate in calling races.  Exit polls also help give context to why a candidate won or lost.
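To see how much those two factors matter, here is a rough sketch with made-up numbers (a notional n of 2000 and an illustrative design effect of 1.5 for the cluster sample; neither figure comes from Silver's talk):

# Margin of error for a proportion, inflated by a design effect (deff)
moe = function(p, n, conf, deff=1) qnorm(1-(1-conf)/2) * sqrt(deff*p*(1-p)/n)
moe(.5, 2000, .95)             # simple random sample at a 95% level: ~0.022
moe(.5, 2000, .999, deff=1.5)  # cluster design at a 99.9% level:     ~0.045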

Imputing Missing Data With Expectation-Maximization

It can be fairly common to find missing values in a dataset. Having only a few missing values isn't generally a problem and those records can be deleted listwise; in other words, the entire record is simply removed from the analysis. The problem is that even a limited amount of missing data can translate into a significant number of records being omitted. In the example below, about two-thirds of the records would end up being omitted due to missing values.

The distribution of the missing values in the data is very important. If the data are missing at random, then that is less serious than when there is a pattern of missing values that is, at least to some extent, dependent on the missing variables.

There are many approaches that can be used to impute missing data. The easiest way is to simply calculate the mean of each variable and substitute that for each of the missing values. The problem with this is that it reduces the variance and the absolute value of the covariance. Another common approach is Expectation-Maximization (EM). This technique iterates through the data while preserving its covariance structure.
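The update used in the code below is the standard conditional mean of a multivariate normal: for a record with observed block o and missing block m,

x_{m} \leftarrow \mu_{m} + \Sigma_{mo}\Sigma_{oo}^{-1}(x_{o} - \mu_{o})

where the mean vector \mu and covariance matrix \Sigma are re-estimated from the imputed data after each pass, until the imputed values stop changing.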

library(e1071)

# Generate complete data and then randomly insert missing values
raw <- replicate(10, rpois(50, 100))
raw.orig <- raw
rand.miss <- rdiscrete(50, probs=rep(1:length(raw)), values=seq(1, length(raw)))
raw[rand.miss] <- NA
raw <- data.frame(raw)

# Covariance of the complete cases only
var(na.omit(raw))

EMalg <- function(x, tol=.001){
  missvals <- is.na(x)
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0
  sig <- as.matrix(var(na.exclude(x)))
  mean.vec <- as.matrix(apply(na.exclude(x), 2, mean))

  while(reach.tol != 1) {
    for(i in 1:nrow(x)) {
      pick.miss <- ( c(missvals[i,]) )
      if ( sum(pick.miss) != 0 ) {
        # we need the inverse of the covariance of the observed variables
        inv.S <- solve(sig[!pick.miss, !pick.miss])

        # Run the EM step: conditional mean of the missing values given the observed
        new.impute[i, pick.miss] <- mean.vec[pick.miss] +
          sig[pick.miss, !pick.miss] %*% inv.S %*%
          (t(new.impute[i, !pick.miss]) - t(t(mean.vec[!pick.miss])))
      }
    }

    # Re-estimate the covariance matrix and mean vector
    sig <- var(new.impute)
    mean.vec <- as.matrix(apply(new.impute, 2, mean))

    # we don't want this to run on the first iteration or else it fails
    if(count.iter > 1){
      reach.tol <- 1  # assume convergence, then flag any cell still changing by more than tol
      for(l in 1:nrow(new.impute)){
        for(m in 1:ncol(new.impute)){
          if( abs(old.impute[l,m] - new.impute[l,m]) > tol ) {
            reach.tol <- 0
          }
        }
      }
    }
    count.iter <- count.iter + 1  # used for debugging to ensure the process is iterating properly
    old.impute <- new.impute
  }
  return(new.impute)
}

raw.imputed <- EMalg(raw, tol=.0001)

# Covariance after imputation
var(raw.imputed)

plot(raw.imputed[,2], raw.imputed[,3], pch=16,
     main="Scatterplot of Missing Data", sub="Missing Values in Red",
     xlab="X", ylab="Y")

# overlay the imputed values on the plot
plot.imputed <- raw.imputed[ row.names( subset(raw, is.na(raw[,2]) | is.na(raw[,3])) ), ]
points(plot.imputed[,2], plot.imputed[,3], pch=16, col='red')

Example Graph of Missing Data

Once the missing values are imputed, it is important to review the data and run the standard assumption tests before proceeding with further analysis.  This is only one of many approaches for imputing missing data.  Others include random forests or machine learning approaches that train a model directly over the incomplete data.
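As one point of comparison, a random-forest-based imputation can be run with the missForest package (a minimal sketch, not part of the original example, assuming missForest is installed):

# Random-forest-based imputation as an alternative to the EM approach
library(missForest)
rf.imp <- missForest(raw)       # 'raw' is the data frame with NAs from above
raw.imputed.rf <- rf.imp$ximp   # the imputed dataset
var(raw.imputed.rf)             # compare the covariance structure to the EM result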

Some Common Approaches for Analyzing Likert Scales and Other Categorical Data

Analyzing Likert scale responses really comes down to what you want to accomplish (e.g. are you trying to provide a formal report with probabilities, or are you simply trying to understand the data better?). Sometimes a couple of graphs are sufficient and a formal statistical test isn't even necessary. However, with how easy it is to conduct some of these statistical tests, it is often best to just formalize the analysis. There are several approaches that can be used. Here are just a few of them.

The code to set up the data for some testing is as follows.  Note that this is the same code used in Plotting Likert Scales:


set.seed(1234)
library(e1071)
probs <- cbind(c(.4,.2/3,.2/3,.2/3,.4), c(.1/4,.1/4,.9,.1/4,.1/4), c(.2,.2,.2,.2,.2))
my.n <- 100
my.len <- ncol(probs)*my.n
raw <- matrix(NA, nrow=my.len, ncol=2)
raw <- NULL
for(i in 1:ncol(probs)){
  raw <- rbind(raw, cbind(i, rdiscrete(my.n, probs=probs[,i], values=1:5)))
}
raw <- data.frame(raw)
names(raw) <- c("group","value")
raw$group <- as.factor(raw$group)
raw.1.2 <- subset(raw, raw$group %in% c(1,2))

T-TEST

I might as well get this one out of the way. It sure is easy to take this approach, which helps explain why it is probably one of the more controversial approaches; even something like Excel will spit this out without much thought. You have to stretch the assumptions of the t-test to their outer limits, so if taking this approach one must very carefully verify the t-test assumptions. Most notably:

  • Z follows a standard normal distribution
  • The variance S^2 follows a chi square distribution
  • The variances of the two populations should be equal (unless using Welch's test)
  • The two populations should be independent

The two-independent-populations assumption gets a little sticky unless you truly are looking at two different populations in your data (e.g. male/female or Hispanic/non-Hispanic). In other words, just because you're comparing two different questions from your questionnaire doesn't necessarily mean you have two independent populations.


> t.test(raw.1.2$value ~ raw.1.2$group, var.equal=TRUE)

Two Sample t-test

data: raw.1.2$value by raw.1.2$group
t = 2.5622, df = 198, p-value = 0.01114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1151745 0.8848255
sample estimates:
mean in group 1 mean in group 2
3.4 2.9

CHI SQUARE / FISHER EXACT TEST

This ends up being the better approach and it's relatively easy to understand. The chi square test is designed to handle categorical frequency data and tests the association between two variables.

When the sample size is too small and the assumptions of the chi square test are no longer satisfied, an alternative option is to use Fisher's Exact Test. The classical example of this is Fisher's Lady Tasting Tea problem. Though it is designed for a 2×2 table, there are ways to generalize it to larger tables and R makes it quite simple.


( c.test <- chisq.test(raw$group, raw$value) )

Pearson's Chi-squared test

data: raw$group and raw$value
X-squared = 195.1726, df = 8, p-value < 2.2e-16



In the above example some of the cells are quite small, which could mean that the chi square approach may not work.  So Fisher's 2 x 2 test can be expanded and we can test this data.  However, keep in mind the assumption that the margins are fixed.  Due to limitations in workspace size in R, I have found it easiest to simulate the p-value and achieve the desired outcome that way.


sim.table <- table(raw$group, raw$value)

fisher.test(sim.table, simulate.p.value=TRUE, B=1e6) # Simulate due to workspace limitations

Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06
replicates)

data: sim.table
p-value = 1e-06
alternative hypothesis: two.sided



WILCOXON SIGNED-RANK TEST

This is used when the data come from a related sample and are from the same population. In other words, it works well on a matched-pairs sample.  So, assuming group 1 and group 2 come from the same population and are just different measurements, we can take this approach.


wilcox.test(raw.1.2$value[raw.1.2$group==1], raw.1.2$value[raw.1.2$group==2], paired=TRUE)


MANN-WHITNEY

This tests whether two independent samples come from the same distribution.  In this case the only difference between the Mann-Whitney test and the Wilcoxon signed-rank test above is that the paired sample is specified for the signed-rank test.


wilcox.test(raw.1.2$value[raw.1.2$group==1], raw.1.2$value[raw.1.2$group==2])


KRUSKAL-WALLIS TEST
This is the analysis of variance equivalent for categorical data.   I feel that it is probably very underused.  This is probably due to ANOVA being beyond the scope of most casual analysts, and throwing in categorical data makes it that much more obscure.  Like ANOVA, it also assumes independent populations.  But once you understand exactly what you're testing and what type of data you're dealing with, the implementation of the test is quite simple:


kruskal.test(raw$value ~ raw$group)

Kruskal-Wallis rank sum test

data: raw$value by raw$group
Kruskal-Wallis chi-squared = 13.9105, df = 2, p-value = 0.0009536


This can be compared to the parametric ANOVA:


fit <- lm(raw$value ~ raw$group)
anova(fit)

Analysis of Variance Table

Response: raw$value
Df Sum Sq Mean Sq F value Pr(>F)
raw$group 2 14.91 7.4533 4.0289 0.01878 *
Residuals 297 549.44 1.8500
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Using R to Produce Scalable Vector Graphics for the Web

Statistical software is normally used during the analysis stage of a project, and a cleaned-up static graphic is created for the presentation.  If the presentation is in a web format then some additional considerations are needed. The trick is to implement those graphs in that web format so the graph is of the highest possible quality.  If all that is needed is an image, then simply saving the graph as a JPG or PNG and posting it to a website is quite simple and usually sufficient.  However, if more flexibility and higher quality are needed, then some additional work is required.

One of the best ways to present a graph is using vector graphics (as opposed to raster graphics).  What this means is that a user can zoom in and there won't be any degradation in image quality.  Several formats support vector graphics, including PDF and SVG.

Scalable Vector Graphics are a great way to put together graphs using an XML-based format. This means the graphic can be embedded directly into a website and, as an added bonus, it can become a dynamic image changing with user input.   R will generate the base structure of the graphic, but dynamic SVG requires a bit more work outside of R.   Most modern browsers (IE 8 is not considered modern anymore, so it is not supported) support this type of graphic format. Supported browsers include IE 9, Firefox, and Chrome.

Using the example from a previous post, I can convert the image into a Scalable Vector Graphic.  The code that R writes to the SVG file can be copied and pasted directly into a web page. I have a side-by-side comparison of the graphs using earthquake data from the week prior to June 28, 2013.

Earthquakes Week Prior to 2013-06-08

library(maps)
library(maptools)
library(rgdal)
svg(filename = "c:\\eq.svg",
width = 7, height = 7
)

eq = read.table(file="http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt", fill=TRUE, sep=",", header=T)
plot.new()
my.map <- map("state", interior = FALSE, plot=F)
x.lim <- my.map$range[1:2]; x.lim[1] <- x.lim[1]-1; x.lim[2] <- x.lim[2]+1
y.lim <- my.map$range[3:4]; y.lim[1] <- y.lim[1]-1; y.lim[2] <- y.lim[2]+1

map("state", interior = FALSE, xlim=x.lim, ylim=y.lim)
map("state", boundary = FALSE, col="gray", add = TRUE)
title("Magnitude 1+ Earthquakes Over the Past 7 Days")

eq$mag.size <- NULL
eq$mag.size[eq$Magnitude>=1 & eq$Magnitude<2] <- .75
eq$mag.size[eq$Magnitude>=2 & eq$Magnitude<3] <- 1.0
eq$mag.size[eq$Magnitude>=3 & eq$Magnitude<4] <- 1.5
eq$mag.size[eq$Magnitude>=4] <- 2.0

eq$mag.col <- NULL
eq$mag.col[eq$Magnitude>=1 & eq$Magnitude<2] <- 'blue'
eq$mag.col[eq$Magnitude>=2 & eq$Magnitude<3] <- 'green'
eq$mag.col[eq$Magnitude>=3 & eq$Magnitude<4] <- 'orange'
eq$mag.col[eq$Magnitude>=4] <- 'red'

points(x=eq$Lon, y=eq$Lat, pch=16, cex=eq$mag.size, col=eq$mag.col)

eq$magnitude.text <- eq$Magnitude
eq$magnitude.text[eq$Magnitude<4] <- NA
text(x=eq$Lon, y=eq$Lat, col='black', labels=eq$magnitude.text, adj=c(2.5), cex=0.5)

legend('bottomright', c('M 1-2','M 2-3','M 3-4','M 4+'), ncol=2, pch=16,
       col=c('blue','green','orange','red'))
box()
dev.off()

The package ggplot2 has a function that will identify that one wants an SVG file based on the filename provided.  This graph shows the depth of the earthquake compared to the magnitude.

 

library(ggplot2)
( g <- ggplot(eq, aes(x = Magnitude, y = Depth)) +
    geom_point() +
    geom_point(data = eq, color = eq$mag.col, size = 3) )
ggsave(g, file="c:\\eqDepthMagnitude.svg")

Software Packages for Graphs and Charts

Graphs can be an important feature of analysis. A graph that has been well designed and put together can make summary statistics much more readable and increase their interpretability. It also makes reports and articles look more professional.

There are many software packages available for designing great graphs and charts.  This seems to be a constant discussion I have with others over which graphing software should be used. There are some packages out there that simply make a graph; you can even use something like default Microsoft Excel or PowerPoint, but then you end up with an "Excel graph". Whenever I see one of those graphs, sadly, my knee-jerk reaction is to investigate the analysis further. So there is something to be said for having professional-looking graphs.  With that said, given some time and work, I can agree that Excel can create some attractive graphs.

There are a lot of packages out there and trying to learn every one of them is almost impossible.  My suggestion is to find one that produces a high quality graph and stick with it. There are many commercial software packages including the most common: SPSS, SAS, Minitab.  There’s even Photoshop/Illustrator for those who want absolute control over a completely custom graph. I’ll highlight a few (open source) packages that I have used for various purposes here.

1) R — This does, of course, do statistical analysis, but it can also produce some very high quality graphs. It will produce just about every graph imaginable, and if it doesn't already do the graph then chances are you can write a function or package to do it. R graphs are built from the ground up: usually, if you don't want something in the graph then it won't be there unless you say otherwise (see the short sketch after this list).

2) Gretl — This is a full-featured statistical software package that handles a handful of graphs.  It's specifically designed for economics, quite easily performs general statistical tests, and even handles iterative estimation (i.e. Monte Carlo).  A review of the software package is available at the Journal of Statistical Software.  The graphs are quick and easy to make, and there is a command line option for more control.  The graphs are produced by gnuplot and are of high enough quality that they can be used in publications.  The range of graphs is limited, but the ones it does produce are pretty good: there's the standard boxplot, scatterplot, and qqplot.

3) Orange — I was working with several large datasets several years ago and looked into using Orange.  This is a very powerful data mining package.  It also has a graphing module to produce some very good graphs.  It produces the standard graphs, but it really focuses on data mining graphs and plots.  These include hierarchical clustering, scatterplots, and a survey plot for identifying correlations between two variables.

4) GGobi  — I was first introduced to GGobi back in about 2000.  It is a great way to visualize multi-dimensional data.  It can be a bit mind-boggling at times trying to wrap your head around a 5-dimensional space being projected onto a two-dimensional plane. I've also used it to help spot outliers in a 4+ dimensional space and to just gain a better understanding of the data.

5) gnuplot — This is strictly a graphing program, and it does a good job of creating graphs.  It will make a wide variety of graphs and has many built-in demos that can be easily modified and rerun on a new dataset. It also makes it really easy to rotate three-dimensional scatterplots by simply clicking and dragging the image.  Here is an example from gnuplot.

Example gnuplot
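As a small illustration of the point in item 1 that R graphs are built from the ground up (a quick sketch of my own), nothing shows up on a base R plot until it is explicitly added:

# Start from an empty canvas and add each element explicitly
x <- rnorm(50); y <- rnorm(50)
plot(x, y, type="n", axes=FALSE, xlab="", ylab="")   # blank canvas
points(x, y, pch=16, col="steelblue")                # add the points
axis(1); axis(2)                                     # add the axes
title(main="Built Up One Piece at a Time", xlab="X", ylab="Y")
box()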

 

These are just a few of the software packages that I have used and with which I have had success.  There are many graphing packages out there and I would be interested in other programs that others have used to create graphs and charts.