About

We are studying the question of how US-based PT programs’ required observation hours (henceforth referred to as “criterion hours”) for prospective students vary with geographic location.

Dataset

The dataset was sourced from the PTCAS website (for programs & criterion) & Google Maps (for program locations). It has the following shape:

dim(oh_data)
## [1] 249   8

The first row indicates the number of programs (rows) and the later indicates the number of descriptive qualities (columns). Previewing it:

head(oh_data, n=10)

Outlining the non-trivial columns:

Summary Statistics

Requirement Category

oh_data %>%
  ggplot() +
    geom_bar(aes(as.factor(Category))) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab('Requirement Category') +
    ylab('Number of Programs')

where the categories are:

categories

Criterion Hours

Here’s a summary of the number of criterion hours:

summary(oh_data$criterion_hours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   40.00   50.00   65.03  100.00  300.00      24

The proportion of programs with criterion hours are:

sum(!is.na(oh_data$criterion_hours)) / nrow(oh_data)
## [1] 0.9036145

The proportion of programs with required or highly recommended hours are:

sum(oh_data$Category < 4) / nrow(oh_data)
## [1] 0.9638554

The distribution of hours across all programs looks like:

oh_data %>%
  filter(!is.na(criterion_hours)) %>%
  ggplot() +
    geom_histogram(aes(criterion_hours), bins = 15)

Peeking at distributions conditioned on the requirement category (specifically 1 & 2, which do impose a requirement):

oh_data %>%
  filter(!is.na(criterion_hours) & Category %in% c(1,2)) %>%
  ggplot() +
    geom_violin(aes(as.factor(Category), criterion_hours))

The distributions are similar, with modes at 40 & 100 hours.

Mean criterion hours:

mean(oh_data$criterion_hours, na.rm = TRUE)
## [1] 65.03111

Standard deviation:

sd(oh_data$criterion_hours, na.rm = TRUE)
## [1] 38.94598

Incorporating Geography

For analysis, we will only include programs in the contiguous US:

oh_analysis_data <- oh_data %>%
  filter(Instituition != 'University of Puerto Rico' & Instituition != 'Andrews University')

Visualizing

Points

First, we view each program as a point on the US map, where color indicates the hours requirement:

us <- map_data('state')
usa <- map_data('usa') %>%
  filter(region == 'main')

gg <- oh_analysis_data %>%
  filter(!is.na(criterion_hours)) %>%
  ggplot()  +
    geom_polygon(data = us, aes(x=long, y = lat, group = group), fill='grey', color='white') +
    geom_point(aes(x=Longitude, y=Latitude, color = criterion_hours), size=10) + 
    scale_color_distiller(palette=1, direction=1) + 
    guides(color=guide_colorbar(title='Required Hours',
                               barwidth = 2, barheight = 20, 
                               title.theme=element_text(size=15), 
                               label.theme = element_text(size=20))) +
    ggtitle('PTCAS Program Criterion Hours') +
    theme(plot.title = element_text(size=30),
          axis.text=element_text(size=20),
          axis.title=element_text(size=20)) +
  xlab('Longitude') +
  ylab('Latitude') +
    theme(plot.title = element_text(size=30),
        axis.text=element_text(size=20),
        axis.title=element_text(size=20),
        axis.line=element_blank(),axis.text.x=element_blank(),
        axis.text.y=element_blank(),axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),plot.background=element_blank())

gg

There is heavy concentration of programs in the Northeast US, so we zoom in:

gg + coord_cartesian(xlim = c(-79, -67), ylim = c(35, 47))

Heatmap

To build a more interprettable visual, we need to:

  • Populate dead space by imputing expected number of required hours (modeling)
  • Capture trends
  • Capture pockets

RuleFit is a model fitting procedure that identifies “similar” hyperrectangles (2D rectangles, in our case) in the data. Regularization (LASSO) & cross validation are used to fit the model. The final model is selected using the cross validation error minimizing regularization parameter; this is more aggressive (i.e. higher false positive rate) than using the regularization parameter within a standard error of the minimum.

train_data <- oh_analysis_data %>%
  filter(!is.na(criterion_hours))
model <- xrf(criterion_hours ~ Latitude + Longitude, 
             train_data, 
             family = 'gaussian',
             xgb_control = list(nrounds = 5, max_depth = 2),
             deoverlap = TRUE)

box <- us %>%
  summarize(
    min_lat = min(lat),
    max_lat = max(lat),
    min_long = min(long),
    max_long = max(long)
  )
points <- expand.grid(
  Latitude = seq(box$min_lat, box$max_lat, by = .1),
  Longitude = seq(box$min_long, box$max_long, by = .1),
   criterion_hours = -1 # https://github.com/holub008/xrf/issues/9
) %>%
  filter(point.in.polygon(Longitude, Latitude, usa$long, usa$lat) > 0)

points$ehours <- predict(model, points , lambda = 'lambda.min')[,1]

ggplot(points) +
  geom_raster(aes(x = Longitude, y = Latitude, fill = ehours), interpolate = TRUE) +
  geom_polygon(data = us, aes(x=long, y = lat, group = group), fill=NA, color='black') + 
  scale_fill_distiller(palette=1, direction=1) + 
  guides(fill=guide_colorbar(title='Required Hours',
                             barwidth = 2, barheight = 20, 
                             title.theme=element_text(size=15), 
                             label.theme = element_text(size=20),
                             draw.ulim = FALSE, draw.llim = TRUE)) +
  ggtitle('PTCAS Program Criterion Hours By Geography') +
  theme(plot.title = element_text(size=30),
        axis.text=element_text(size=20),
        axis.title=element_text(size=20),
        axis.line=element_blank(),axis.text.x=element_blank(),
        axis.text.y=element_blank(),axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),plot.background=element_blank())

There certainly appears to be a trend of decreasing criterion hours as Longitude increases (moving towards the East). There appears to be pockets of (regionally) high criterion in the LA area & the “central” east coast, while New England is a pocket of particularly low criterion

Correlations

Main Effect Lat/Long

Visualizing

To visualize main effect correlations, we look at marginal x/y plots of latitude and longitude. Smoothing splines are used to trend the data.

oh_analysis_data %>%
  filter(!is.na(criterion_hours)) %>%
  ggplot(aes(Latitude, log(criterion_hours))) +
    geom_point() +
    geom_smooth(method='lm')

Criterion hours look pretty flat across Latitude, with perhaps a slight decrease moving North.

oh_analysis_data %>%
  filter(!is.na(criterion_hours)) %>%
  ggplot(aes(Longitude, log(criterion_hours))) +
    geom_point() +
    geom_smooth(method="lm")

There is a definite decrease in criterion hours moving west to east, with a possible jump upwards at the coast.

Inference (Non-zero Effects)

To determine if the main effect correlations are non-zero, we perform a bootstrap on the correlation coefficients, building 95% intervals:

bs <- bootstrap(oh_analysis_data %>% filter(!is.na(Minimum.Hours)), corr_lat = cor(Latitude, Minimum.Hours, method='spearman'), corr_long = cor(Longitude, Minimum.Hours, method='spearman'))

summary.bootstrap(bs)
## Number of boostrap samples: 100 
## Empirical interval level: 0.95 
## 
##  statistic       mean standard_error   ci_lower   ci_upper
##   corr_lat -0.1212494     0.08005684 -0.2732268 0.04184330
##  corr_long -0.1351852     0.06714396 -0.2492979 0.00153078

Both effects are fairly unconvincing, showing weak strength of relationship and questionable sigificance.

Spatial Autocorrelation

Spatial autocorrelation is a measure of the extent to which any given observation depends on the value of other observations in the shared locality. Do required hours exhibit this property?

We should already feel confident that yes, they do, since Latitude (a geographic component) correlates. However, we can answer specific questions using different types of autocorrelation. In this section, we use Moran’s I to compute autocorrelation, while varying the weighting matrix to answer specific questions.

morani <- function(x, W) {
  n <- length(x)
  stopifnot(n == dim(W)[1] && n == dim(W)[2])
  
  squared_deviation_sum <- 0
  weighted_covariance_sum <- 0
  xbar <- mean(x)
  for (i in 1:n) {
    deviation_i <- x[i] - xbar
    for (j in 1:n) {
      weighted_covariance_sum <- weighted_covariance_sum + W[i, j] * deviation_i * (x[j] - xbar)
    }
    squared_deviation_sum <- squared_deviation_sum + deviation_i ^ 2
  }
  
  (n / sum(W)) * (weighted_covariance_sum / squared_deviation_sum)
}

Nearest Neighbors

Here we posit the question: Do programs typically have similar criterion to their nearest neighboring programs? We select a parameterization of k=10 neighboring programs, which is a plausibly small enough set of schools that a program would consider in their own determination, while large enough to likely provide a signal. Distance between programs is simply the shortest distance between the two programs on a globe (Haversine distance).

haversine_distance <- function(lat1, lon1, lat2, lon2) {
  haversine(c(lat1, lon1), c(lat2, lon2))
}

# points must be a dataframe with Latitude and Longitude columns
# returns distances in KM exponeniated by distance_exp
compute_pairwise_distances <- function(points, distance_exp = 1) {
  distances <- matrix(0, nrow=nrow(points), ncol=nrow(points))
  for (i in 1:nrow(points)) {
    for (j in i:nrow(points)) {
      dist <- haversine_distance(points[i, 'Latitude'], points[i, 'Longitude'],
                                points[j, 'Latitude'], points[j, 'Longitude']) ^ distance_exp
      distances[i, j] <- dist
      distances[j, i] <- dist
    }
  }
  
  distances
}

compute_nearest_neighbors <- function(d, k=5) {
  stopifnot(dim(d)[1] == dim(d)[2])
  stopifnot(dim(d)[1] > k)
  
  neighbor_graph <- matrix(0, nrow=nrow(d), ncol=nrow(d))

  for (i in 1:nrow(d)) {
    kth_distance <- sort(d[i,])[k + 1]
    nn_ix <- which(d[i,] <= kth_distance)
    nn_ix <- nn_ix[nn_ix != i]
    neighbor_graph[i, nn_ix] <- 1
  }
  
  neighbor_graph
}

relevant_oh_data <- oh_analysis_data %>%
  filter(!is.na(criterion_hours))

program_distances <- compute_pairwise_distances(relevant_oh_data)
nn <- compute_nearest_neighbors(program_distances, 10)
(moran_observed <- morani(relevant_oh_data$criterion_hours, nn))
## [1] 0.1419268

Moran’s I is calculated as .142; at over 200 programs, its expectation under the null is nearly 0. This value, which suggests a positive autocorrelation (programs take similar criterion to those programs around them), is quite a bit larger, but how do we assess significance? Monte Carlo simulation will be used to build a null distribution of Moran’s I (i.e. the empirical distribution when no autocorrelation exists):

# Spatial Monte-Carlo: randomly reassign values to points and recompute Moran's I to build null distribution
spatial_mc <- function(data, trials = 1000) {
  program_distances <- compute_pairwise_distances(data)
  nn <- compute_nearest_neighbors(program_distances, 10)
  
  empirical_distribution <- c()
  for (trail_ix in 1:trials) {
    resampled_outcome <- sample(data$criterion_hours, nrow(data), replace=TRUE)
    moran_resampled <- morani(resampled_outcome, nn)
    empirical_distribution <- c(empirical_distribution, moran_resampled)
  }
  
  empirical_distribution
}

moran_null <- spatial_mc(relevant_oh_data)
data.frame(moran_i = moran_null) %>%
ggplot() +
  geom_histogram(aes(moran_i)) +
  geom_vline(xintercept = moran_observed) + 
  xlab("Moran's I") +
  ggtitle("Spatial Autocorrelation: Monte Carlo Null Distribution vs. Observed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1 - sum(moran_null < moran_observed) / length(moran_null)
## [1] 0

Moran’s I shows high signficance, with p < .001.

To better visualize this correlation, here is a plot of program observation hours (x-axis) against mean observation hours of neighboring 5 programs (y-axis):

nn_mean <- c()
for (ix in 1:nrow(nn)) {
  neighbor_ixs <- which(nn[ix,] > 0)
  nn_mean <- c(nn_mean, mean(relevant_oh_data[neighbor_ixs, 'criterion_hours']))
}
relevant_oh_data$nn_mean <- nn_mean
ggplot(relevant_oh_data, aes(criterion_hours, nn_mean)) +
  geom_jitter() +
  geom_smooth(method='lm')

Notice that the linear model visually suggests a smaller effect than reality, since there are several high leverage points near the max extent of required hours. Moran’s I is robust to these high leverage points. Nonetheless, the trend is visually apparent, particularly when the x axis is log-transformed:

ggplot(relevant_oh_data, aes(log(criterion_hours), nn_mean)) +
  geom_jitter() +
  geom_smooth(method='lm')

Haversine Distance (straight line distance)

Haversine distance can be used to demonstrate the effect of continuous distance on required hours. The subtle difference from nearest neighbors is that continuous distance considers all programs, not just the closest k.

# invert, so that nearer programs have higher weight
haversine_weights <- 1 / program_distances
haversine_weights[is.infinite(haversine_weights)] <- 0
(haversine_moran <- morani(relevant_oh_data$criterion_hours, haversine_weights))
## [1] 0.1988289

And significance testing:

spatial_continuous_mc <- function(data, trials = 1000) {
  haversine_weights <- 1 / program_distances
  haversine_weights[is.infinite(haversine_weights)] <- 0
  
  empirical_distribution <- c()
  for (trail_ix in 1:trials) {
    resampled_outcome <- sample(data$criterion_hours, nrow(data), replace=TRUE)
    moran_resampled <- morani(resampled_outcome, haversine_weights)
    empirical_distribution <- c(empirical_distribution, moran_resampled)
  }
  
  empirical_distribution
}

haversine_null <- spatial_continuous_mc(relevant_oh_data)
data.frame(moran_i = haversine_null) %>%
ggplot() +
  geom_histogram(aes(moran_i)) +
  geom_vline(xintercept = haversine_moran) + 
  xlab("Moran's I") +
  ggtitle("Spatial Autocorrelation: Monte Carlo Null Distribution vs. Observed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1 - sum(haversine_null < haversine_moran) / length(haversine_null)
## [1] 0.007

There is certainly significance, although the result is less strong; this fits the hypothesis that effects are local - downweighting by a linear function of Haversine distance likely doesn’t model how little a given program considers the criterion hours of a program across the country. In other words, a nearest neighbor approach more correctly captures how program administrators think about the criterion hours of programs around them.

Visually the programs required hours against distance weighted mean of observation hours:

continuous_means <- c()
for (ix in 1:nrow(nn)) {
  continuous_means <- c(continuous_means, mean(program_distances[ix,] * relevant_oh_data$criterion_hours))
}

relevant_oh_data$continuous_mean <- continuous_means
ggplot(relevant_oh_data, aes(criterion_hours, continuous_means)) +
  geom_jitter() +
  geom_smooth(method='lm') +
  ylab('Distance weighted required hours')

Interpretation of this visual is less intuitive.

More importantly, this test does not align with nearly as plausible of a hypothesis: that programs look at a discrete set of nearby programs to determine their own criteria (see Nearest Neighbors section).

Program Density Estimation

Are criterion hours tied to geographic program density? One hypothesis is that increased program density correlates with descreased availability of student openings in that locale, and hence lower criterion.

density_estimate_grid <- kde2d(oh_analysis_data$Longitude, oh_analysis_data$Latitude, n=100)

gr <- with(density_estimate_grid, data.frame(expand.grid(x,y), as.vector(z)))
names(gr) <- c("Longitude", "Latitude", "z")
density_estimate <- loess(z~Longitude*Latitude, data=gr)

oh_analysis_data$density_estimate <- predict(density_estimate, oh_analysis_data)

ggplot(oh_analysis_data, aes(density_estimate, log(criterion_hours))) +
  geom_point() +
  geom_smooth(method='lm')
## Warning: Removed 24 rows containing non-finite values (stat_smooth).
## Warning: Removed 24 rows containing missing values (geom_point).

cor.test(oh_analysis_data$criterion_hours, oh_analysis_data$density_estimate, 
         use="complete.obs", alternative = "less", method="spearman")
## Warning in cor.test.default(oh_analysis_data$criterion_hours,
## oh_analysis_data$density_estimate, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  oh_analysis_data$criterion_hours and oh_analysis_data$density_estimate
## S = 2034316, p-value = 0.06694
## alternative hypothesis: true rho is less than 0
## sample estimates:
##        rho 
## -0.1006872

The trend is possibly there (weak significance, at p=.06), and the scatter plot shows that the effect size is small (< 5 hours variation across the full range of program density values).