Spatial autocorrelation measures the extent to which any given observation depends on the values of other observations in its locality. Do required hours exhibit this property?
We should already suspect that they do, since Latitude (a geographic component) correlates with required hours. However, different weighting schemes let us answer more pointed questions. In this section, we use Moran’s I to compute autocorrelation, varying the weighting matrix to probe different notions of “neighborhood.”
Nearest Neighbors
Here we pose the question: do programs typically require criterion hours similar to those of their nearest neighboring programs? We select a parameterization of k=10 neighboring programs: a plausibly small enough set of schools that a program might consult in its own determination, while large enough to provide a signal. Distance between programs is simply the shortest distance between the two programs on a globe (the great-circle, or Haversine, distance).
# great-circle distance in km; haversine() is provided by an external package
haversine_distance <- function(lat1, lon1, lat2, lon2) {
  haversine(c(lat1, lon1), c(lat2, lon2))
}
# points must be a dataframe with Latitude and Longitude columns
# returns distances in km, exponentiated by distance_exp
compute_pairwise_distances <- function(points, distance_exp = 1) {
  distances <- matrix(0, nrow = nrow(points), ncol = nrow(points))
  for (i in 1:nrow(points)) {
    for (j in i:nrow(points)) {
      dist <- haversine_distance(points[i, 'Latitude'], points[i, 'Longitude'],
                                 points[j, 'Latitude'], points[j, 'Longitude']) ^ distance_exp
      distances[i, j] <- dist
      distances[j, i] <- dist
    }
  }
  distances
}
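The `haversine` call above is delegated to an external package; for reference, the formula itself is short enough to write directly. A minimal sketch (the name `haversine_km` is ours), assuming coordinates in degrees and the mean Earth radius of 6371 km:

```r
# Haversine (great-circle) distance in km between two points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(a))
}
```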
compute_nearest_neighbors <- function(d, k = 5) {
  stopifnot(dim(d)[1] == dim(d)[2])
  stopifnot(dim(d)[1] > k)
  neighbor_graph <- matrix(0, nrow = nrow(d), ncol = nrow(d))
  for (i in 1:nrow(d)) {
    # the sorted row includes the self-distance of 0, so the (k + 1)th entry
    # is the distance to the kth nearest neighbor; ties may admit extras
    kth_distance <- sort(d[i, ])[k + 1]
    nn_ix <- which(d[i, ] <= kth_distance)
    nn_ix <- nn_ix[nn_ix != i]
    neighbor_graph[i, nn_ix] <- 1
  }
  neighbor_graph
}
relevant_oh_data <- oh_analysis_data %>%
  filter(!is.na(criterion_hours))
program_distances <- compute_pairwise_distances(relevant_oh_data)
nn <- compute_nearest_neighbors(program_distances, 10)
(moran_observed <- morani(relevant_oh_data$criterion_hours, nn))
## [1] 0.1419268
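For readers unfamiliar with the statistic, Moran’s I for values x and weight matrix W (zero diagonal) is I = (n / Σᵢⱼ wᵢⱼ) · Σᵢⱼ wᵢⱼ(xᵢ − x̄)(xⱼ − x̄) / Σᵢ(xᵢ − x̄)². A minimal sketch of that definition follows; the `morani` implementation used here may differ in details such as row-standardization of the weights:

```r
# Moran's I computed directly from its definition.
# x: numeric vector of observations; w: spatial weight matrix, zero diagonal.
moran_i <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)                       # centered values
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}
```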
Moran’s I is calculated as .142; with over 200 programs, its expectation under the null is nearly 0. The observed value suggests positive autocorrelation (programs require criterion hours similar to those of nearby programs) and sits well above the null expectation, but how do we assess significance? Monte Carlo simulation will be used to build a null distribution of Moran’s I (i.e. the empirical distribution when no autocorrelation exists):
# Spatial Monte Carlo: randomly reassign values to points and recompute
# Moran's I to build a null distribution
spatial_mc <- function(data, trials = 1000) {
  program_distances <- compute_pairwise_distances(data)
  nn <- compute_nearest_neighbors(program_distances, 10)
  empirical_distribution <- numeric(trials)
  for (trial_ix in 1:trials) {
    # resampling with replacement breaks any spatial structure in the outcome
    resampled_outcome <- sample(data$criterion_hours, nrow(data), replace = TRUE)
    empirical_distribution[trial_ix] <- morani(resampled_outcome, nn)
  }
  empirical_distribution
}
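Note that `spatial_mc` resamples with replacement, a bootstrap-style null. A common alternative is a permutation null, which shuffles the observed values across locations without replacement so the set of values is held exactly fixed. A sketch (the names `spatial_perm` and `stat_fn` are ours; the statistic is passed in so the sketch is not tied to any particular Moran implementation):

```r
# Permutation null: shuffle values across locations without replacement.
# stat_fn: function(x, w) returning a spatial statistic (e.g. Moran's I).
spatial_perm <- function(x, w, stat_fn, trials = 1000) {
  replicate(trials, stat_fn(sample(x), w))
}
```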
moran_null <- spatial_mc(relevant_oh_data)
data.frame(moran_i = moran_null) %>%
  ggplot() +
  geom_histogram(aes(moran_i)) +
  geom_vline(xintercept = moran_observed) +
  xlab("Moran's I") +
  ggtitle("Spatial Autocorrelation: Monte Carlo Null Distribution vs. Observed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
1 - sum(moran_null < moran_observed) / length(moran_null)
## [1] 0
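The p-value above comes out as exactly 0 because no simulated value exceeded the observed one. With a finite number of trials, a common finite-sample correction adds 1 to numerator and denominator so the estimate is bounded below by 1/(trials + 1). A sketch (the name `empirical_p` is ours, not part of the original analysis):

```r
# One-sided empirical p-value with the standard +1 finite-sample correction.
empirical_p <- function(null_dist, observed) {
  (1 + sum(null_dist >= observed)) / (1 + length(null_dist))
}
```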
Moran’s I shows high significance, with p < .001.
To better visualize this correlation, here is a plot of each program’s observation hours (x-axis) against the mean observation hours of its 10 nearest neighboring programs (y-axis):
nn_mean <- c()
for (ix in 1:nrow(nn)) {
  neighbor_ixs <- which(nn[ix, ] > 0)
  nn_mean <- c(nn_mean, mean(relevant_oh_data[neighbor_ixs, 'criterion_hours']))
}
relevant_oh_data$nn_mean <- nn_mean
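The neighbor-mean loop above computes what the spatial literature calls the spatial lag; with a binary neighbor matrix it can equivalently be written as a single matrix product. A sketch under the same 0/1 matrix convention (the name `spatial_lag` is ours):

```r
# Spatial lag: mean of each point's neighbors' values, via matrix product.
# nn: binary n x n neighbor matrix; x: numeric vector of length n.
spatial_lag <- function(nn, x) {
  as.vector(nn %*% x) / rowSums(nn)
}
```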
ggplot(relevant_oh_data, aes(criterion_hours, nn_mean)) +
  geom_jitter() +
  geom_smooth(method = 'lm')
Notice that the linear fit visually understates the effect, since several high-leverage points sit near the maximum extent of required hours. Moran’s I is less sensitive to these high-leverage points. Nonetheless, the trend is visually apparent, particularly when the x-axis is log-transformed:
ggplot(relevant_oh_data, aes(log(criterion_hours), nn_mean)) +
  geom_jitter() +
  geom_smooth(method = 'lm')