Beta Distribution and the NJ U.S. Senate Election

The beta distribution is a highly flexible distribution that applies to many situations, and it works particularly well when the data are percentages or proportions. The upcoming New Jersey U.S. Senate election on Wednesday fits that criterion quite well, so here I applied the beta distribution to some pre-election polls, with the numbers obtained through the poll aggregator www.realclearpolitics.com.

The candidates in this Wednesday's New Jersey election — to fill the vacant seat left by the death of Frank Lautenberg — are Cory Booker and Steve Lonegan. Though there are other third-party candidates running, the race is effectively between Booker and Lonegan. More complex models could be used, but by reducing the race to these two candidates the beta distribution can be applied to the data and a simple simulation of the outcome can be run from the given polls.

Some Historical Notes

This general election falls on a non-standard Election Day (Wednesday, October 16th); it happens to be the first time that a New Jersey general election has been held on a Wednesday. Aside from the current Republican senator, who was appointed by Chris Christie, the last time New Jersey had a Republican U.S. Senator was back in the early 1980's, and even then he too was appointed to the office.

2013 NJ US Senate -- Booker v. Lonegan

The Beta Distribution

As can be seen from the U.S. Senate elections since 1990, the Democratic candidate has won by an average margin of about 8.9 percentage points (a quick check of that average appears in the snippet after the list).

2012 — Menendez: 58.9% v. Kyrillos: 39.4%
2008 — Lautenberg: 55.5% v. Zimmer: 42.5%
2006 — Menendez: 53.3% v. Kean Jr.: 44.3%
2002 — Lautenberg: 53.9% v. Forrester: 44.0%
2000 — Corzine: 50.1% v. Franks: 47.1%
1996 — Torricelli: 52.7% v. Zimmer: 42.6%
1994 — Lautenberg: 50.3% v. Haytaian: 47.0%
1990 — Bradley: 50.5% v. Whitman: 47.4%
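As a quick check on that 8.9-point figure, here is a short R snippet that simply averages the margins transcribed from the list above:

## Democratic and Republican percentages from the eight elections listed above
dem.pct = c(58.9, 55.5, 53.3, 53.9, 50.1, 52.7, 50.3, 50.5)
rep.pct = c(39.4, 42.5, 44.3, 44.0, 47.1, 42.6, 47.0, 47.4)
mean(dem.pct - rep.pct) ## roughly 8.9 percentage points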

Based on recent pre-election polling it looks like Booker will likely win by a similar margin, and maybe a little more than the historical average of 8.9 points; the pre-election polls put the margin closer to 12 percentage points.  The difference between Booker's and Lonegan's shares is modeled here with a beta distribution, and we can see that the threshold of zero (0) is far out in the tail of the distribution.  So, based on historical elections and current pre-election polling, the likelihood that Booker will win appears to be very high.

Histogram of Simulated Differences for 2013 U.S. Senate Election

Example Code


library(MCMCpack) ## provides rdirichlet()
library(MASS) ## provides fitdistr(); loaded explicitly in case MCMCpack does not attach MASS
## Set up several of the recent polls but will only work with the most recent ones
raw.1 = NULL
raw.1 = data.frame( rbind(
Quinnipiac = c(.53,.41,899),
RSC = c(.50,.39,729),
FD= c(.45,.29,702),
Mon = c(.53, .40,571)
)
)
raw.1 = rbind(raw.1, c(apply(raw.1[,1:2],2,weighted.mean,raw.1[,3]),sum(raw.1[,3])))
names(raw.1) = c("Cand1","Cand2","size")
raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2
raw.1.no.und = data.frame(raw.1[5,1:2] + raw.1[5,1:2]/sum(raw.1[5,1:2])*raw.1[5,4],size=raw.1[5,3],Other.und=0)
raw = rbind(raw.1, raw.1.no.und)
###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################
j= 4
prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j], raw$Cand2[j], 1-raw$Cand1[j]-raw$Cand2[j])+1
)
if(export==1){
mean(p[,1]>p[,2])
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )

sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters
sim.dir.diff = sim.dir[,1]-sim.dir[,2] ## Get the margin (Booker minus Lonegan); a Beta is fit to this simulated difference below
sim.dir = cbind(sim.dir, sim.dir[,1]-sim.dir[,2])
## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking.
## In this case I ran the function a few times to set the start values close to the output
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=302,shape2=270))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=229,shape2=343))
fit.distr.margin = fitdistr(sim.dir[,4], "beta",
start=list(shape1=5,shape2=5))
## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,20), xlim=c(.3,.6), col='blue', lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the NJ U.S. Senate Election 2013",
sub=paste("Probability that Booker beats Lonegan: ", round(cand1.win.probs[6],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col='red', lty=2, lwd=2) ## Candidate 2

abline(v=c(median(sim.dir[,1]), median(sim.dir[,2])), col=c('blue','red'), lwd=2, lty=c(1,2))
legend("topleft",c("Booker","Lonegan"), lwd=2, col=c('blue','red'), lty=c(1,2))
## Draw a histogram of simulated data
hist(sim.dir[,4], nclass=100, main="Histogram of the Candidate Differences", xlab="Candidate Difference")
abline(v=0, col=c('black'), lwd=2, lty=c(1))

Random Sequence of Heads and Tails: For R Users

Rick Wicklin on the SAS blog made a post today on how to tell whether a sequence of coin flips was random.  I figured it was only fair to port the SAS IML code over to R.  Just like Rick Wicklin's example, this is the Wald-Wolfowitz runs test for randomness.  I tried to match his code line-for-line.
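For reference, the moments computed inside the function below are the standard Wald-Wolfowitz quantities: with n recoded +1's, m recoded -1's, and R observed runs,

\mu_{R} = \frac{2nm}{n+m} + 1, \qquad \sigma^{2}_{R} = \frac{(\mu_{R}-1)(\mu_{R}-2)}{n+m-1}, \qquad Z = \frac{R-\mu_{R}}{\sigma_{R}}

with a 0.5 continuity correction applied to the numerator of Z for small samples, exactly as the code does.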

flips = matrix(c('H','T','T','H','H','H','T','T','T','T','T','T','T','H','H','H','T','H','T','H','H','H','T','H','H','H','T','H','T','H'))

RunsTest = function(flip.seq){
  u = unique(flip.seq) # unique value (should be two)
  
  d = rep(-1, nrow(flip.seq)*ncol(flip.seq)) # recode as vector of -1, +1
  d[flip.seq==u[1]] = 1
  
  n = sum(d > 0) # count +1's
  m = sum(d < 0) # count -1's
  
  dif = c(ifelse(d[1] < 0, 2, -2), diff( sign(d) )) # take the lag and find differences
  
  R = sum(dif==2 | dif==-2) # count up the number of runs
  
  ww.mu = 2*n*m / (n+m) + 1 # get the mean
  ww.var = (ww.mu-1)*(ww.mu-2)/(n+m-1) # get the variance
  sigma = sqrt(ww.var) # standard deviation
  
  # compute test statistics
  if((n+m) > 50){
    Z  = (R-ww.mu) / sigma
  } else if ((R-ww.mu) < 0){
    Z = (R-ww.mu+0.5) / sigma
  } else {
    Z = (R-ww.mu-0.5)/sigma
  }
  
  
  pval = 2*(1-pnorm(abs(Z))) # compute a two-sided p-value 
  
  ret.val = list(Z=Z, p.value=pval)
return(ret.val)
}

runs.test = RunsTest(flips)
runs.test


> runs.test
$Z
[1] -0.1617764

$p.value
[1] 0.8714819

That’s Smooth

I had someone ask me the other day how to take a scatterplot and draw something other than a straight line through the graph using Excel.  Yes, it can be done in Excel and it’s really quite simple, but there are some limitations when using the stock Excel dialog screens. So it is probably in one’s best interest to use a higher quality statistical software package.

There are several ways to add a curve to a scatter plot. This could be something as simple as adding a polynomial to the model or something a little more complex like a kernel curve, LOESS, or splines.

Smoothers

Linear vs. Non-Linear

First, a distinction needs to be made between linear and non-linear regression. Linear regression can also include drawing “lines” that, when plotted on a graph, actually have a curved shape. Polynomial regression is a type of linear regression that produces curves and is often referred to as curvilinear regression. Specifically, if you take the partial derivative of the model with respect to a parameter and the result is a function that still includes the unknown parameters, then the model is considered nonlinear. A polynomial remains a linear combination of its parameters. Take the following linear model equation.

  y = \beta_{0} + x\beta_{1} + x^{2}\beta_{2} + x^{3}\beta_{3}

The partial derivatives with respect to the parameters, as seen below, are no longer functions of \beta. This means that even though the equation is a polynomial in x it is still a linear model.

\begin{array}{l} \frac{\partial y}{\partial\beta_{0}}=1\\  \frac{\partial y}{\partial\beta_{1}}=x\\  \frac{\partial y}{\partial\beta_{2}}=x^{2} \\  \frac{\partial y}{\partial\beta_{3}}=x^{3}  \end{array}
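Since the model is linear in the parameters, ordinary least squares handles the polynomial directly; here is a minimal sketch using the cars dataset that appears later in this post:

## A cubic polynomial fit is still an ordinary linear model
fit.poly = lm(dist ~ poly(speed, 3), data = cars)
summary(fit.poly)$coefficients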

Example Data

An example that I often use is the speed of a car and its fuel economy, which typically shows a strong curvilinear relationship. A lot of other data, including biological data and data relating speed and weight, can also have some form of curved relationship.  This graph is from the 2012 Fuel Economy Guide.

FuelEconomy.gov Speed and MPG

Curvilinear data can also be simulated.  In this case I added a cubic component and show the straight-line regression.

Simulated Curve

 The example I’ll use is the cars dataset that ships with R in the datasets package.  It has two variables: the speed of the car and the stopping distance.

Polynomials

Polynomials can be a very good choice because a straightforward closed-form solution can easily be obtained.  Additionally, polynomial models are fairly easy to implement and can easily be compared to other predictive models.
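The closed-form solution referred to here is just the usual least-squares estimator, with X the design matrix holding the polynomial terms:

\hat{\beta} = (X^{T}X)^{-1}X^{T}y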

Polynomial

 

Smoothing Spline

Like other smoothers, the spline uses a tuning value to determine its smoothness; for splines this is the number of knots.  The more knots, the tighter the fit of the model; fewer knots produce a smoother curve.  Here the natural spline (green) and the smoothing spline (blue) are fairly similar.
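A minimal sketch of that knot trade-off, using the cars data from the full example below (the specific knot counts are only for illustration):

## Fewer knots force a smoother curve; more knots allow a tighter fit
fit.loose = smooth.spline(cars$speed, cars$dist, nknots = 5)
fit.tight = smooth.spline(cars$speed, cars$dist, nknots = 15)
plot(cars$speed, cars$dist, pch = 16, cex = .5, xlab = "Speed (MPH)", ylab = "Stopping Distance (feet)")
lines(fit.loose, col = 'green')
lines(fit.tight, col = 'blue')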

 Splines

Kernel Smoothing

Kernel smoothing uses a weighted average of the neighboring x/y points. The sensitivity is controlled by a predefined value called the bandwidth.  The greater the bandwidth, the smoother the curve will be, as local variation gets dampened by the additional data points, though it doesn't handle localized variation very well.  A smaller bandwidth is very sensitive to local variation and will allow some very dramatic changes in the curve.  For example, here is a comparison of a bandwidth of 1.0 and 5.0.  The smoother curve is, of course, the larger bandwidth.
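A minimal sketch of that bandwidth comparison, again with the cars data used in the full example below:

## Kernel smoothers with a narrow and a wide bandwidth
plot(cars$speed, cars$dist, pch = 16, cex = .5, xlab = "Speed (MPH)", ylab = "Stopping Distance (feet)")
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth = 1), col = 'red') ## follows local variation closely
lines(ksmooth(cars$speed, cars$dist, "normal", bandwidth = 5), col = 'purple') ## much smoother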

Kernel Comparison

 

LOESS and LOWESS

This is a non-parametric locally weighted regression using a nearest-neighbor approach.  It too uses a value to control the smoothing; in R this is referred to as the span, but it can also be thought of like the bandwidth in kernel smoothing.  A nice feature that comes with LOESS is its ability to produce a measure of uncertainty around the prediction.  This means that confidence intervals can easily be produced and plotted.

 LOESS

Closing Thoughts

It can be a little troublesome trying to determine which model is best or whether the curve is over- or under-fit.  It may also depend on how much sensitivity to local variation you want to allow.  There are options available to help determine the approach you may want to use; for example, the mgcv package in R has some features to help with this.  As can be seen in this example, all of the approaches result in very similar solutions.  If all you need is some sort of smoother through a scatterplot then it may be best to use the simplest approach.  However, if one is trying to make predictions then a lot more care needs to be taken to fit the model properly.

Example Code


library(graphics)
library(splines) # Used for the ns() function -- (natural cubic splines)

R = matrix(cbind(1,.99, .99,1),nrow=2)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,10,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1")
raw$predictor1.3 = raw$predictor1^3
raw$predictor1.2 = raw$predictor1^2
fit = lm(raw$response ~ raw$predictor1.3)

plot(raw$response ~ raw$predictor1.3, pch=16, cex=.4, xlab="Predictor", ylab="Response", main="Simulated Data with Slight Curve")
abline(fit)

x = with(cars, speed)
y = with(cars, dist)
eval.length = 50

# This LOESS section shows two different R functions arriving at the same solution.
# Be careful using the LOESS defaults as they differ and will produce different solutions.
fit.loess = loess.smooth(x, y, evaluation = eval.length,
family="gaussian", span=.75, degree=1)
fit.loess2= loess(y ~ x, family="gaussian",
span=.75, degree=1)

## Set a simple 95% CI on the fit.loess model
new.x = seq(min(x),max(x), length.out=eval.length)
ci = cbind(
predict(fit.loess2, data.frame(x=new.x)),
predict(fit.loess2, data.frame(x=new.x))+
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2),
predict(fit.loess2, data.frame(x=new.x))-
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2)
)

## Linear Model
fit = lm(y ~ x )

## Polynomial
fit.3 = lm(y ~ poly(x,3) )

## Natural Spline
fit.ns.3 = lm(y ~ ns(x, 3) )
## Smoothing Spline
fit.sp = smooth.spline(y ~ x, nknots=15)

plot(x,y, xlim=c(min(x),max(x)), ylim=c(min(y),max(y)), pch=16, cex=.5,
ylab = "Stopping Distance (feet)", xlab= "Speed (MPH)", main="Comparison of Models"
, sub="Splines")
## Add additional models on top of graph. It can get cluttered with all the models.

## LOESS with Confidence Intervals
matplot(new.x, ci, lty = c(1,2,2), col=c(1,2,2), type = "l", add=T)
## Linear
lines(new.x, predict(fit, data.frame(x=new.x)), col='orange', lty=3)
## Polynomial
lines(new.x, predict(fit.3, data.frame(x=new.x)), col='light blue', lty=4)
## Natural Spline
lines(new.x, predict(fit.ns.3, data.frame(x=new.x)), col='green', lty=5)
## Smoothing Spline
lines(fit.sp, col='blue', lty=6)
## Kernel Curve
lines(ksmooth(x, y, "normal", bandwidth = 5), col = 'purple', lty=7)
legend("topleft",c("Linear","Polynomial","Natural Spline","Smoothing Spline","Kernel"),
col=c('black','light blue','green','blue','purple'), lty=c(3,4,5,6,7), lwd=2)

 

The Uncertainty of Predictions

There are many kinds of intervals in statistics.  To name a few of the common ones: confidence intervals, prediction intervals, credible intervals, and tolerance intervals. Each is useful and serves its own purpose.

I’ve been recently working on a couple of projects that involve making predictions from a regression model, and I’ve been doing some work on estimation and the variability of those estimates. In my case I needed to predict an individual outcome drawn from the distribution (this differs from estimating the mean of that distribution), and often I end up estimating several individual outcomes simultaneously. It follows that we need to assume the underlying regression model still applies and is appropriate for the new individual observations.

Prediction vs. Confidence Intervals

Prediction Intervals and Confidence Intervals

Prediction intervals are useful in that they give us an understanding of the likely values an individual observation may take.  There is a subtle, but important, difference between prediction intervals and confidence intervals.  With a confidence interval we are trying to obtain upper and lower limits for a parameter.  With a prediction interval we are interested in upper and lower limits for an individual observation.  One key item to note is that prediction intervals will always be wider than the corresponding confidence intervals.  The reason is that when predicting a new observation ( \hat{y_{i}} ) we encounter both the uncertainty in the estimated mean and the variation of y around that mean.
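To make the comparison concrete, here is the usual textbook form for a least-squares fit (notation is my own shorthand, with \hat{\sigma}^{2} the residual variance estimate, X the design matrix, and x_{h} the predictor vector for the new point):

\begin{array}{l} \hat{y}_{h} \pm t_{1-\alpha/2,\,n-p}\ \hat{\sigma}\sqrt{x_{h}^{T}(X^{T}X)^{-1}x_{h}} \quad \mbox{(confidence interval for the mean response)} \\ \hat{y}_{h} \pm t_{1-\alpha/2,\,n-p}\ \hat{\sigma}\sqrt{1+x_{h}^{T}(X^{T}X)^{-1}x_{h}} \quad \mbox{(prediction interval for a new observation)} \end{array}

The extra 1 under the square root is the variance of the new observation itself, which is why the prediction interval is always the wider of the two.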

These intervals often get confused with confidence intervals (which alone have plenty of nuances of their own). But they really need to be kept separate, because calling a prediction interval a confidence interval (or vice versa) gives a false impression of the variability.

Doing Many Prediction Intervals

Sometimes it is necessary to estimate several new observations at once.  In that case we want an overall family confidence level of (1-\alpha). To compensate for simultaneously estimating many observations there are a couple of different procedures: Scheffé and Bonferroni.  It seems, based on various conferences that I’ve attended, that Bonferroni is the most common, probably because it is not only conservative in its application but also very simple to implement.
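A minimal sketch of the Bonferroni version in R; the only real change is replacing \alpha with \alpha/g in the interval level for g simultaneous predictions (the data and model here are hypothetical stand-ins):

## Bonferroni-adjusted prediction intervals for g new observations
g = 5
alpha = 0.05
fit.cars = lm(dist ~ speed, data = cars)
new.obs = data.frame(speed = c(5, 10, 15, 20, 25))
predict(fit.cars, new.obs, interval = "prediction", level = 1 - alpha/g)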

Bayesian Posterior Predictive Distribution

It’s worth mentioning the posterior predictive distribution from a Bayesian approach.  In Bayesian statistics the parameter is treated as random and often unknown (as compared to the frequentist view where it’s fixed and often unknown).  So here we can use the posterior predictive distribution for new data. It also happens to be a good way to see how well the model performed.

Posterior Prediction Distribution

Example Code

library(LearnBayes)
alpha = 0.05
y = rnorm(100,0, 1)
x1= y+rnorm(100,1,1)
x2= y+rnorm(100,2,2)

##### Least-squares fit
fit=lm(y~x1+x2,x=TRUE,y=TRUE)

new = data.frame(x1 = seq(-5, 5, 0.5), x2=seq(-3, 7, 0.5))

pred.w.plim = predict(fit, new, interval = "prediction")
pred.w.clim = predict(fit, new, interval = "confidence")
matplot(new$x1, cbind(pred.w.clim, pred.w.plim[,-1]),
lty = c(1,2,2,3,3), col=c(1,2,2,3,3), type = "l",
ylab = "predicted y", xlab= "x1", main="Confidence vs. Prediction Intervals")
legend("topleft", c("Confidence","Prediction"), lwd=2, lty=c(2,3), col=c(2,3), cex=.7)

##### Sampling from posterior
theta.sample=blinreg(fit$y,fit$x,5000)

######### Model checking via posterior predictive distribution
pred.draws=blinregpred(fit$x,theta.sample)
pred.sum=apply(pred.draws,2,quantile,c(alpha/2,1-alpha/2))

par(mfrow=c(1,1))
ind=1:ncol( pred.sum )
ind.odd=seq(1,ncol( pred.sum ), by=2)
ind.even=seq(2,ncol( pred.sum ), by=2)
matplot(rbind(ind,ind),pred.sum,type="l",lty=1,col=1,
xlab="Identifier",ylab="Response Variable y", main="Posterior Predictive Distribution"
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,y,pch=19, cex=.4)

out=(y>pred.sum[2,] | y<pred.sum[1,]) ## flag observations falling outside the posterior predictive interval

Waiting in One Line or Multiple Lines

Whenever I go to the grocery store it always seems to be a lesson in statistics. I go get the things I need to buy and then I try to select the checkout register that will minimize the amount of time I have to wait. Inevitably, I select the one line where there is some sort of problem and I just sit there and wait and wait.  I will often mark the lines that I am not waiting in to see which line I would have made it through faster. So then the question is whether or not to get out of line and try to find a new line that is faster. As a statistician I know that wait time is often modeled using a memory-less exponential distribution, and I could very easily choose the line with the fewest people in front of me only to find that the cash register ran out of paper or there is some other delay. So it's a cruel statistical game trying to find the one line that minimizes my wait time. But this game could easily be simplified, and the anxiety of watching customers who arrived later finish their transactions sooner could be removed, by eliminating multiple lines and having all customers go through one line — breaking off at the very end — so that customers generally get through faster and with less variability. However, I do understand that sometimes there are operational limitations due to the physical store layout or otherwise.

I ran a simulation of the two wait-line strategies. The first is where there is only one line that everyone goes through, breaking off at the end to the next available cash register.  I have often seen this approach in places like banks, airport security, and U.S. Customs. The second strategy is where there are multiple lines and once you're in a line you stay there until you complete your transaction.  This approach is often seen at places like Costco, Walmart, Target, and just about every other grocery store.

Multiple Lines Wait Time Max/Min

The assumption I made for this simulation is that the wait time is distributed as an exponential distribution with a mean of three minutes (EXP(\theta=3)).  This next graph shows a comparison — when using a single-line strategy — between the maximum wait time and the minimum wait time.
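For reference, the exponential density with mean \theta is

f(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x \ge 0

and R's rexp() is parameterized by the rate 1/\theta, which is why the code below draws the times with rexp(total.obs, 1/3).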

Single Line Wait Time Max/Min

This final graph shows the range difference for each of the service locations (cash registers).  When using multiple lines there is a wide range between the maximum total time at one of the locations and the minimum time at another.  However, when using only one line that range gets dampened down, because the process tries to equalize the wait time across the locations.  Ultimately, in order to get every last person out of the system it will take roughly the same amount of total time.  But what ends up happening with multiple lines is that a whole group of people can get stuck in one problem line, creating higher variability.  And anyone who is involved in process control, such as manufacturing, knows that higher variability is generally not a good thing and is less efficient.

Single and Multiple Wait Time Range Difference


library(reshape2) ## For use in the acast() function
service.locations = 3 ## i.e. Number of cash registers
total.obs = service.locations*10 ## Just needed a number and wanted to make it divisible by the service locations
nsims = 10000 ## Number of simulation replicates
x = replicate(nsims, rexp(total.obs, 1/3))

## Set some loading variables

cum.sum = aggreg.multi.mat = aggreg.one.mat = NULL

for(j in 1:ncol(x)){
## Multiple lines assigned randomly (sequentially)
wait.multiple = cbind( (seq(1:total.obs) %% service.locations)+1, x[,j] )
aggreg.values = aggregate(wait.multiple[,2], by=list( wait.multiple[,1] ), sum)
aggreg.multi.mat = rbind(aggreg.multi.mat, t( acast(aggreg.values, Group.1~., value.var="x") ) )

## One line and breaking off at the very end
wait.bucket = matrix(NA, nrow=nrow(x), ncol=service.locations)
queue = x[,j]

## Preload the first service locations
for(d in (1:service.locations)){
wait.bucket[d,d] = x[d,j]
}

## Cumulative sum without NA then find min location time as it will serve the next customer
for(i in (service.locations+1):nrow(x) ){
for(k in 1:service.locations){
cum.sum[k] = sum(wait.bucket[1:i,k], na.rm=T)

}
wait.bucket[i,which(cum.sum==min(cum.sum), arr.ind=TRUE)] = x[i,j]
}
aggreg.one.mat = rbind(aggreg.one.mat, apply(wait.bucket, 2, sum, na.rm=T))

}

## View the graphs
my.hist.one.1 = hist( apply(aggreg.one.mat,1,max), nclass=100, plot=FALSE)
my.hist.one.2 = hist( apply(aggreg.one.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.one.1$counts, my.hist.one.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max), nclass=100, xlim=c(0,60), col=3, main=expression(paste("Single-Line -- Distribution of Max Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")
hist( apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), col=2, main=expression(paste("Single-Line -- Distribution of Min Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")

my.hist.data.1 = hist( apply(aggreg.multi.mat,1,max), nclass=100, plot=FALSE)
my.hist.data.2 = hist( apply(aggreg.multi.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.data.1$counts, my.hist.data.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.multi.mat,1,max), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=3, main=expression(paste("Multiple Lines -- Distribution of Max Wait Time With EXP(",theta,"=3) ")), xlab="Total Minutes Per Register")
hist( apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), col=2, main=expression(paste("Multiple Lines -- Distribution of Min Wait Time With EXP(",theta,"=3) ")), xlab="Total Minutes Per Register")

par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max)-apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), col=2, main="Single Line", xlab="Range Difference Between Max and Min in Minutes of Service Locations")
hist( apply(aggreg.multi.mat,1,max)-apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), col=3, main="Multiple Lines", xlab="Range Difference Between Max and Min in Minutes of Service Locations")

Profile Likelihood for New Jersey U.S. Senate Special Election

As it stands right now Cory Booker has a very good chance of winning the New Jersey Special U.S. Senate election on October 16 to replace Frank Lautenberg and fill the remainder of his term for the next 15 months.  So with the election only about a month away I took advantage of some of the data from the pre-election polls to produce a likelihood estimate based on the four most recent polls listed on www.realclearpolitics.com.

Contour Plot of Joint Likelihood

Though it may not be the perfect choice of distribution, the normal distribution will work well here, and we can easily use the likelihood function to produce the profile likelihood of the data.  This makes it easier to describe one parameter at a time while eliminating the nuisance parameter.  In this case we're particularly interested in the estimate, \mu.
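Concretely, for a normal sample the profile likelihood of \mu comes from maximizing the likelihood over the nuisance parameter \sigma^{2}, which gives \hat{\sigma}^{2}(\mu)=\frac{1}{n}\sum_{i}(x_{i}-\mu)^{2} and

L_{p}(\mu) \propto \left( \frac{1}{n}\sum_{i=1}^{n}(x_{i}-\mu)^{2} \right)^{-n/2}

This is what logLikeFun() in the code below evaluates on the log scale.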

Likelihood approaches can be useful because they not only show where the likelihood is maximized but also show the other likely values of \theta.  Furthermore, likelihood approaches can have better small-sample properties than those based on asymptotic convergence (e.g. standard errors).

There are some differences between likelihood and Bayesian/frequentist approaches.  In this case we can produce what is referred to as a likelihood interval. However, likelihood intervals suffer from a calibration problem: a ‘5% likelihood’ does not have a specific meaning, whereas a 5% or 10% probability does.  So really, probability-based inference can be the better option. But in some cases probability-based conclusions can simply be ridiculous and effectively have no useful value, and in those cases probability-based inference is probably not the best choice.

In some cases the likelihood interval is the same as the confidence interval.  In particular, for interval estimates in the normal-mean case, we can choose a cutoff of about 15% so that the likelihood interval corresponds to a 95% confidence interval in frequentist terms.
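The 15% figure comes from the usual chi-squared calibration of the likelihood ratio: a cutoff c matches an approximate (1-\alpha) confidence interval when

c = \exp\left( -\tfrac{1}{2}\chi^{2}_{1,\,1-\alpha} \right), \qquad \exp\left( -\tfrac{1}{2}\chi^{2}_{1,\,0.95} \right) \approx 0.147

or, in R, exp(-qchisq(0.95, 1)/2).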

How is the Likelihood Interval Interpreted?

  • It can be interpreted the same way as a confidence interval if an exact or large-sample approximation is justified.  The likelihood must be reasonably regular (i.e. the log-likelihood can be approximated by a quadratic function), though this requirement can be quite forgiving.
  • If the likelihood is clearly not regular (as can happen in small-sample problems or with non-normal distributions) then the interval can be interpreted as a pure likelihood interval.

The following graphs show the estimate for Cory Booker as well as the variance across the four polls.  Given these data (as of Friday, Sep. 13) and assuming this distribution, the maximum likelihood estimate puts Cory Booker at 55.5%. You'll also note that since this value is the MLE it is the same value that www.realclearpolitics.com uses on their website. We can also see that the other likely values fall somewhere between 50.5% and 60.5%.

Contour Plot of Joint Likelihood

Cory Booker Profile Likelihood

 


## Profile Likelihoods

par(mfrow=c(1,1))

x = c(64,50,54,54)

boxplot(x, col=3, main='Boxplot of All Polls', ylab='Cory Booker Percent')

n = length(x)

np = 500
sx = sqrt(var(x))

mu.theta = seq(mean(x)-8,mean(x)+8,len=np)
sigma.theta = seq(sx/2.75,sx*2.75,len=np)

logLikeFun <- function(mu, x){
## However, only the middle portion of this equation is necessary
a = -n/2 * log( 2 * pi ) - n/2 * log( sum((mu-x)^2)/n ) - 1/(2 * sum((mu-x)^2)/n )*sum((mu-x)^2)
a
}
logLikeFunSigma = function(mu,sigma){
a = -n/2*log(sigma^2) -
(sum(x^2) - 2*mu*sum(x) + n*mu^2)/(2*sigma^2)
-a
}
li = function(th,like,alpha=0.15){
that = mean(th[like==max(like)])
lowth = th[th < that]
lowlik = like[th < that]
if (length(lowth) < 2){
lowval = min(th)
}
if (length(lowth) > 1){
lowval = approx(lowlik,lowth,xout=alpha)$y
}

upth = th[th > that]
if (length(upth) < 2 ){
return(c(lowval,max(th)))
}
if (length(upth) > 1){
uplik = like[th > that]
upval = approx(uplik,upth,xout=alpha)$y
return(c(lowval,upval))
}
}
## Joint Likelihood
ll.joint = outer(mu.theta, sigma.theta,'logLikeFunSigma')
like.joint = exp(min(ll.joint)-ll.joint)
contour(mu.theta, sigma.theta^2, like.joint,
xlab=expression(mu), ylab=expression(sigma^2),
level=c(.1,.3,.5,.7,.9), lwd=2,
main="Joint Likelihood Contour of\nCory Booker Estimate and Variance of Polls")

## Profile Likelihood for mean mu
log.like = lapply(mu.theta, logLikeFun, x)
log.like = unlist(log.like)
like = exp(log.like - max(log.like)) ## normalize the value to 1.0

## Profile Likelihood (mu, sigma^2=sigma.hat^2)
se = sqrt(var(x)*(n-1)/n)/sqrt(n)
estimate.like = dnorm(mean(x), mean=mu.theta, sd=se)
estimate.like = estimate.like/max(estimate.like)

## Profile Likelihood (mu, sigma=1)
log.like.sigma = logLikeFunSigma(mu.theta, sigma=1)
sigma.like = exp(min(log.like.sigma) - log.like.sigma)

## Profile likelihood for sigma^2

plot(mu.theta, like, type='n', main=expression(paste('Profile Likelihood of ',mu,' for Cory Booker')),
ylab='Likelihood', xlab=expression(mu), ylim=c(0,1)) ## create blank graph of the correct size
lines(mu.theta, like,lty=1,lwd=2, col=2)
lines(mu.theta, estimate.like, lty=2, lwd=2, col=3)
text(mean(x)+.75,0,format(mean(x),digits=3) )
abline(v=c(mean(x),li(mu.theta, like),li(mu.theta, estimate.like)), lty=1, lwd=c(2.5,1,1,1,1), col=c(1,2,2,3,3))
abline(h=.15)
legend('topright',c(expression(paste('L(',mu,')')),expression(paste('L(',mu,',',sigma^2,'=',hat(sigma)^2,') ' ))), lwd=2, col=c(2,3), bg='#FFFFFF')

 

A Monty Simulation

I was listening to Science Friday from Sep 6th, and one of the discussions by Ira Flatow was on the well-known Monty Hall Problem.  This problem has been hashed out many times; in fact, I was first introduced to the probability aspects of the problem in my introductory statistics class as an undergraduate.  Though the solution is not initially intuitive it is fairly straightforward, and a discussion is available on Wikipedia. I figured I would take a few minutes to play the game a million times.  This problem is so ubiquitous that Mythbusters did an episode on it back in 2011 where they too kept playing the game over and over.  So there is no need for me to detail it out, but here's a brief simulation showing that the probability of winning, if you change doors, is about 2/3.  The following graphs show the cumulative results of the first 100 iterations comparing the two different strategies.

Monty Hall Simulation


library(e1071)
doors = c("A","B","C")
nsim=1000000

sim.mat = matrix(NA, nrow=nsim, ncol=5)
winning.loc = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
first.choice = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
sim.mat[,1] = winning.loc
sim.mat[,2] = first.choice

for(i in 1:length(winning.loc)){
if(winning.loc[i] != first.choice[i]){
## The contestant has selected a goat, so the host must show the other goat and not the winner.
sim.mat[i,3] = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
} else {
rand.goat.options = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
## The contestant has selected the winner and the host then randomly selects one of the other two doors.
rand.goat = rdiscrete(1, rand.goat.options , prob=rep(1,length(rand.goat.options)) )
sim.mat[i,3] = rand.goat
}
## Second choice changes
sim.mat[i,5] = doors[ !(doors %in% c(sim.mat[i,3], first.choice[i])) ]
}
## Second choice is the same as the first choice
sim.mat[,4] = sim.mat[,2]
## names = TRUE.VAL, FIRST.CHOICE, GOAT.LOC.SHOWN, SECOND.CHOICE.STAY
sim.val.stay = as.numeric( (sim.mat[,4]==sim.mat[,1])*1 )
sim.val.change = as.numeric( (sim.mat[,5]==sim.mat[,1])*1 )
mean(sim.val.stay)
mean(sim.val.change)

## Probability by iteration
covergence.prob.change = cumsum(sim.val.change)/cumsum(rep(1,length(sim.val.change)))
covergence.prob.stay = cumsum(sim.val.stay)/cumsum(rep(1,length(sim.val.stay)))

par(mfrow=c(1,2))
plot(head(covergence.prob.stay, 100), pch=16, cex=.5,
ylab="Probability",
main="Probability of Winning\nKeeping the Same Door")
plot(head(covergence.prob.change, 100), pch=16, cex=.5, ylab="Probability",
main="Probability of winning\nChanging Doors")

The Eve of the NYC Democratic Mayoral Primary Election

It is the eve of the New York City Democratic mayoral primary election.  This is a simple follow-up on my post from last Friday, as I was curious how the final pre-Election Day polling would fall.  It's fairly clear who is going to get the most votes, but though we have a very good idea who is going to take the nomination it is still not guaranteed.  Using the same line of reasoning that I used in my previous post, we can calculate the probability that Bill de Blasio not only gets the most votes but also reaches the 40% threshold needed to avoid the run-off scheduled for October 1st.  The following shows the five leading candidates and their distributions of the vote.  For the distribution shown in the graph below I proportionally assigned the remaining undecided vote to each of the five main candidates.  There are other candidates who will likely get a few votes, but their share will only be a few percentage points, so for the purpose of this estimation and the graph I am only showing the five candidates.

The results once again come from www.realclearpolitics.com; I used only the two most recent polls and averaged their results.  The biggest discrepancy between the two polls is in Christine Quinn's numbers, where there is a seven-point difference. Another item to note is the difference in methodology between the two polls: NBC/WSJ/Marist uses phone operators while PPP uses IVR.  The first graph simply shows results from the two polls combined.  In the second graph I take the undecided voters and proportionally assign them to the top five candidates.  If we believe that the undecided voters will break this way then de Blasio will almost certainly win and avoid the run-off.

This graph simply takes the votes as provided by PPP and NBC/WSJ/Marist.  This does not include the undecided voters though we know that some of the undecided will likely go to each of the candidates.

FinalNYCDemocraticEstimation

The following graph proportionally assigns the undecided vote to each of the five top candidates.  This shows that Bill de Blasio has an extremely high likelihood of not only getting the most votes but also clearing the minimum threshold.
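The reallocation itself is just proportional scaling: each candidate picks up a share of the undecided vote equal to their share of the decided vote. A minimal sketch with hypothetical numbers (this mirrors the raw.1.no.und line in the code below):

## Hypothetical decided shares for five candidates; the remainder is undecided
decided = c(.37, .20, .17, .08, .05)
undecided = 1 - sum(decided)
decided + decided/sum(decided) * undecided ## adjusted shares now sum to 1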

NYC Democratic Mayor Primary Election Including Undecided Voters

library(MCMCpack) ## provides rdirichlet()
library(MASS) ## provides fitdistr(); loaded explicitly in case MCMCpack does not attach MASS
## Set up several of the recent polls but will only work with the most recent ones
raw.1 = NULL
raw.1 = data.frame( rbind(
PPP = c(.38,.19,.13,.09,.05,750),
NBCWSJMAR = c(.36,.20,.20,.07,.05,505)
)
)
raw.1 = rbind(raw.1, c(apply(raw.1[,1:5],2,weighted.mean,raw.1[,6]),sum(raw.1[,6])))
names(raw.1) = c("Cand1","Cand2","Cand3","Cand4","Cand5","size")
raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3-raw.1$Cand4-raw.1$Cand5
raw.1.no.und = data.frame(raw.1[3,1:5] + raw.1[3,1:5]/sum(raw.1[3,1:5])*raw.1[3,7],size=raw.1[3,6],Other.und=0) ## proportionally allocate the undecided vote
raw = rbind(raw.1, raw.1.no.und)
###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################
j= 4
prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j], raw$Cand2[j], raw$Cand3[j], raw$Cand4[j], raw$Cand5[j], 1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j]-raw$Cand4[j]-raw$Cand5[j])+1
)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.40) ## de Blasio needs to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )

sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking.
## But for demonstration purposes this will suffice
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=10,shape2=10))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=10,shape2=10))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=10,shape2=10))
fit.distr.4 = fitdistr(sim.dir[,4], "beta",
start=list(shape1=10,shape2=10))
fit.distr.5 = fitdistr(sim.dir[,5], "beta",
start=list(shape1=10,shape2=10))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,60), xlim=c(0,.50), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets >= 40% of Vote: ", round(cand1.win.probs[4],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
curve(dbeta(x,fit.distr.4$estimate[1],fit.distr.4$estimate[2]), add=T, col=5, lty=3, lwd=2) ## Candidate 4
curve(dbeta(x,fit.distr.5$estimate[1],fit.distr.5$estimate[2]), add=T, col=6, lty=3, lwd=2) ## Candidate 5
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3]), median(sim.dir[,4]), median(sim.dir[,5])), col=c(1,3,4,5,6), lwd=2, lty=c(1,2,3))
legend("topright",c("de Blasio","Thompson","Quinn","Weiner","Liu"), lwd=2, col=c(1,3,4,5,6), lty=c(1,2,3,4,5))

Probability of Avoiding a Run-off in the NYC 2013 Democratic Primary Election

The New York City Democratic mayoral primary election takes place this coming Tuesday (Sep. 10th) and there are several candidates in the running. Bill de Blasio is the front runner and is expected to win. However, there is a catch: even if he takes a plurality of the vote he may not actually win. In New York, if the leading candidate does not get at least 40% of the vote then the election goes to a run-off on October 1st, meaning that Bill Thompson or Christine Quinn (whoever takes 2nd) will then have a second chance to win. This type of problem can be a little difficult to handle using standard univariate statistics. When there are only two candidates a simple Beta distribution can be used to adequately model the data. However, with nine declared candidates the Beta distribution just doesn't work very well. The Dirichlet distribution is the multivariate generalization of the Beta and can be used to simulate the outcome when there are multiple candidates. A convenient feature of the Dirichlet is that its marginal distributions are Beta distributed, making computations on the marginals fairly simple.
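For reference, if the vote shares are modeled as (p_{1},\ldots,p_{k}) \sim \mbox{Dirichlet}(\alpha_{1},\ldots,\alpha_{k}), then each marginal share is

p_{i} \sim \mbox{Beta}\left( \alpha_{i},\ \sum_{j \ne i}\alpha_{j} \right)

which is why a Beta can be fit to each simulated candidate share further down.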

I gave a presentation at an AAPOR conference a while back where I used this approach but expanded the model to all 50 states and then applied winner-take-all electoral votes (except for Maine and Nebraska, which use a proportional style). In this way I was able to calculate the posterior distribution of the electoral vote and compute credible intervals and other distributional characteristics. This approach has proven to be a very good way to make estimates and a great way to visualize the distribution of the data.

So then the question is whether Bill de Blasio will get more votes than the next candidate, and what the probability is that he will take it all without a run-off. There have been several polls by many well-respected polling organizations including Quinnipiac, the NY Times, and NBC, among others.

This post is designed to be a more technical discussion of the underlying distributional components of the 2013 NYC Democratic mayoral election (and really elections in general) rather than a comprehensive discussion. For the sake of simplicity I will take the most recent available poll on www.realclearpolitics.com, which was Quinnipiac, fielded from Aug 8th to Sep 1.  More polls are certain to arrive after this weekend, so it will be interesting to see what happens to the probability distributions prior to Election Day.

Here we can see that de Blasio comes very close to obtaining not only a plurality of votes but also the necessary 40%. However, one item that may not necessarily be emphasized in these polls is that there is still a group of undecided voters. Normally that does not make a major difference if the undecided voters are expected to break proportionally (or even somewhat proportionally) across the candidates, because whoever takes the most votes simply wins. In this case, though, even if a small portion of the undecided voters break for de Blasio it could put him over the top and prevent a run-off. So what is the probability that he can avoid a run-off? By running a simulation we can put in the constraints that he must beat the next candidate AND get more than 40% of the vote. In this case I set Thompson as the next candidate to compare against.

Based on the given data his chances of avoiding a run-off are not that great, as can be seen in the graph below. However, if he gains his proportional share of the undecided voters then his probability of avoiding a run-off increases. Not by much, but it ends up being just short of a 50/50 chance.  The given model can, of course, be improved with some work.

Maybe I’ll put a little bit of work into the upcoming New Jersey Senate election with Cory Booker and the New Jersey (with Chris Christie) & Virginia (with Cuccinelli & McAuliffe) Governor elections.  I’ll see if time permits.

 2013 NYC Democratic Mayor Distribution

 


library(MCMCpack) ## provides rdirichlet()
library(MASS) ## provides fitdistr(); loaded explicitly in case MCMCpack does not attach MASS

## Set up several of the recent polls but will only work with the most recent one
raw.1 = NULL
raw.1 = data.frame( rbind(
Quinnipiac = c(.43,.20,.18,750),
NYT = c(.32,.18,.17,505),
Quinnipiac2= c(.36,.20,.21,602),
amNY = c(.29, .24,.17, 600),
NBC = c(.24, .18,.24, 355),
Quinnipiac.adj = c(.465, .226, .194, 750) ## adjust for undecided
)
)
names(raw.1) = c("Cand1","Cand2","Cand3","size")
raw.1$Other = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3
raw = raw.1

###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################

prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j],raw$Cand2[j],raw$Cand3[j],1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j])
+1)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.4) ## de Blasio needs to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )
sim.dir = prob.win(1,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=5,shape2=5))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=4,shape2=7))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=2,shape2=9))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,4), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets > 40% of Vote: ", round(cand1.win.probs[1],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3])), col=c(1,3,4), lwd=2, lty=c(1,2,3))
legend("topright",c("de Blasio","Thompson","Quinn"), lwd=2, col=c(1,3,4), lty=c(1,2,3))

The Beta Prior, Likelihood, and Posterior

The Beta distribution (and more generally the Dirichlet) is probably my favorite distribution.  However, sometimes only limited information is available when trying to set up the distribution.  For example, maybe you only know the lowest likely value, the highest likely value, and the median as a measure of center.  That information is sufficient to construct a basic form of the distribution.  The idea of this post is not to elaborate in detail on Bayesian priors and posteriors but to give a real working example of using a prior built from limited knowledge of the distribution, adding some collected data, and arriving at a posterior distribution along with a measure of its uncertainty.

The example below is a simple demonstration of how a prior distribution and current data can be combined to form a posterior distribution.  Earlier this year I gave a presentation at a conference where I modified this simple version of my code to be substantially more complex, using the Dirichlet distribution to make national predictions based on statewide and local samples.  So this approach has some very useful applied statistical properties and can be modified to handle some very complex distributions.

Prior, Likelihood, and Posterior Distributions

A Note on Posterior Intervals

The posterior interval (also called a credible interval or credible region) provides a very intuitive way to describe the measure of uncertainty.  Unlike a confidence interval (discussed in one of my previous posts), a credible interval does in fact provide the probability that the value lies within the interval, because it is based on calculating the probability of different parameter values given the data.  The graph above shows the different values (identified as theta) as well as the simulated posterior interval limits (alpha=.05 in black and .01 in light gray).  In other words, the probability that theta is a member of the 95% credible interval is 0.95 (written as P\left( \theta \in CI \right) = 0.95).
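Since the posterior in this example is itself a Beta, the same limits can also be read straight off the quantile function rather than simulated; a minimal sketch, with hypothetical shape parameters standing in for the posterior values computed below:

## 95% equal-tailed credible interval for a Beta(a, b) posterior
a = 30; b = 12 ## hypothetical posterior shape parameters
qbeta(c(0.025, 0.975), a, b)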

 

The Prior and Posterior Distribution: An Example

The beta.select() function is found in the LearnBayes package.  This is a great function because, by providing two quantiles, one can determine the shape parameters of the Beta distribution.  This is useful for finding the parameters (or a close approximation) of the prior distribution when only limited information is available.  If additional quantiles are known then they can be incorporated to better determine the shape parameters of the Beta distribution.


library(LearnBayes)
Q = data.frame(
quantile=c(
median=0.5,
maximum=0.99999,
minimum=0.00001),
prior=c(
median=0.85,
maximum=0.95,
minimum=0.60)
)

optimalBeta = function(Q) {
q1q = Q$quantile[1]
q1p = Q$prior[1]
q2q = Q$quantile[2]
q2p = Q$prior[2]
q3q = Q$quantile[3]
q3p = Q$prior[3]

# find the beta prior using quantile1 and quantile2
q.med = list(p=q1q, x=q1p)
q.max = list(p=q2q, x=q2p)
q.min = list(p=q3q, x=q3p)

# prior parameters using median and max, and median and min
prior.A = beta.select(q.med,q.max)
prior.B = beta.select(q.med,q.min)

prior.Aa = prior.A[1]
prior.Ab = prior.A[2]

prior.Ba = prior.B[1]
prior.Bb = prior.B[2]

## find the best possible beta prior
## Set a start and stop point range to find the best parameters
if (prior.Aa < prior.Ba) {
start.a = prior.Aa
stop.a = prior.Ba
} else {
start.a = prior.Ba
stop.a = prior.Aa
}

if (prior.Ab < prior.Bb) {
start.b = prior.Ab
stop.b = prior.Bb
} else {
start.b = prior.Bb
stop.b = prior.Ab
}
seq.a = seq(from=start.a, to=stop.a, length.out=1000)
seq.b = seq(from=start.b, to=stop.b, length.out=1000)

seq.grid = expand.grid(seq.a, seq.b)

prior.C.q1 = qbeta(q1q, seq.grid[,1], seq.grid[,2])
prior.C.q2 = qbeta(q2q, seq.grid[,1], seq.grid[,2])
prior.C.q3 = qbeta(q3q, seq.grid[,1], seq.grid[,2])

## Different distance measurements, manhattan, euclidean, or otherwise.
## It would be interesting to run a simulation to measure a variety of distance measurements.
prior.C.delta = abs(prior.C.q1 - q1p) + abs(prior.C.q2 - q2p) + abs(prior.C.q3 - q3p)
## prior.C.delta = sqrt( (prior.C.q1 - q1p)^2 + (prior.C.q2 - q2p)^2 + (prior.C.q3 - q3p)^2 )

optimize.seq = cbind(seq.grid, prior.C.q1, prior.C.q2, prior.C.q3, prior.C.delta)

## Minimize the delta; if the min-delta is not unique then choose the first occurrence
best.a = optimize.seq[,1][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]
best.b = optimize.seq[,2][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]

return(list(a=best.a,b=best.b))
}

prior.dist = optimalBeta(Q)

##########################################################
## Take a look at only the prior
##########################################################
curve(dbeta(x,prior.dist$a,prior.dist$b)) # plot the prior
abline(v=Q$prior[1])

##########################################################
## Take a look at only the likelihood with given successes
##########################################################
calcLikelihood = function(successes, total){
curve(dbinom(successes,total,x)) # plot the likelihood
}

calcLikelihood(45, 50) ## e.g. 45/50 successes
## calculate some properties of the Beta distribution
calcBetaMode = function(aa, bb) {
beta.mode = (aa - 1)/(aa + bb - 2)
return(beta.mode)
}
calcBetaMean = function(aa, bb) {
beta.mean = (aa)/(aa + bb)
return(beta.mean)
}
calcBetaVar = function(aa, bb) {
beta.var = (aa * bb)/(((aa + bb)^2) * (aa + bb + 1))
return(beta.var)
}
calcBetaMedian = function(aa, bb) {
beta.med = (aa-1/3)/(aa+bb-2/3)
return(beta.med)
}
calcBetaSkew = function(aa, bb) {
beta.skew = ( 2*(bb-aa)*sqrt(aa+bb+1) ) / ( (aa+bb+2)*sqrt(aa*bb) ) # standard Beta skewness: 2(b-a)sqrt(a+b+1) / ((a+b+2)sqrt(ab))
return(beta.skew)
}

##########################################################
## Take a look at the prior, likelihood, and posterior
##########################################################
priorToPosterior = function(successes, total, a, b) {
## Note the rule of succession
likelihood.a = successes + 1
likelihood.b = total - successes + 1

## Create posterior
posterior.a = a + successes;
posterior.b = b + total - successes
theta = seq(0.005, 0.995, length = 500)

## Calc density
prior = dbeta(theta, a, b)
likelihood = dbeta(theta, likelihood.a, likelihood.b)
posterior = dbeta(theta, posterior.a, posterior.b)

## Plot prior, likelihood, and posterior

## Can be used to scale down the graph if desired.
## However, the density is different for each prior, likelihood, posterior
m.orig = apply( cbind(prior, likelihood, posterior), 2, max)
m = max(c(prior, likelihood, posterior))

plot(theta, posterior, type = "l", ylab = "Density", lty = 2, lwd = 3,
main = paste("Prior: beta(", round(a,2), ",", round(b,2), "); Data: B(", total, ",", successes, "); ",
"Posterior: beta(", round(posterior.a,2), ",", round(posterior.b,2), ")", sep=""), ylim = c(0, m), col = 1)
lines(theta, likelihood, lty = 1, lwd = 3, col = 2)
lines(theta, prior, lty = 3, lwd = 3, col = 3)
legend("topleft",y=m, c("Prior", "Likelihood", "Posterior"), lty = c(3, 1, 2),
lwd = c(3, 3, 3), col = c(3, 2, 1))

prior.mode = calcBetaMode(a, b)
likelihood.mode = calcBetaMode(likelihood.a, likelihood.b)
posterior.mode = calcBetaMode(posterior.a, posterior.b)
prior.mean = calcBetaMean(a, b)
likelihood.mean = calcBetaMean(likelihood.a, likelihood.b)
posterior.mean = calcBetaMean(posterior.a, posterior.b)
prior.med = calcBetaMedian(a, b)
likelihood.med = calcBetaMedian(likelihood.a, likelihood.b)
posterior.med = calcBetaMedian(posterior.a, posterior.b)
prior.var = calcBetaVar(a, b)
likelihood.var = calcBetaVar(likelihood.a, likelihood.b)
posterior.var = calcBetaVar(posterior.a, posterior.b)
prior.skew = calcBetaSkew(a, b)
likelihood.skew = calcBetaSkew(likelihood.a, likelihood.b)
posterior.skew = calcBetaSkew(posterior.a, posterior.b)

print(paste("Mode: prior=",prior.mode,"; Likelihood=",likelihood.mode,"; Posterior=",posterior.mode))
print(paste("Mean: prior=",prior.mean,"; Likelihood=",likelihood.mean,"; Posterior=",posterior.mean))
print(paste("~Approx Median (for a and b > 1): prior=",prior.med,"; Likelihood=",likelihood.med,", for Posterior=",posterior.med))
print(paste("Var: prior=",prior.var,"; Likelihood=", likelihood.var,"; Posterior=",posterior.var))
print(paste("Skewness: prior=",prior.skew,"; Likelihood=",likelihood.skew,"; Posterior=",posterior.skew))
return(list(a=posterior.a,b=posterior.b))
}

posterior.out = priorToPosterior(25,50, prior.dist$a, prior.dist$b) # 25/50 is current data
beta.sim = rbeta(1000000,posterior.out$a, posterior.out$b)
abline(v=quantile(beta.sim, prob=c(.05/2, 1-.05/2)), col='#000000', lwd=2)
abline(v=quantile(beta.sim, prob=c(.01/2, 1-.01/2)), col='#EEEEEE', lwd=2)