This was originally published for Nested Knowledge. A "nest" is Nested Knowledge's product—effectively a meta-analysis.

Statistical Inference on Citation Networks and Literature Search

In this article, we explore the citation network encountered in a literature search for ischemic stroke treatment devices.

About

Literature Search

In order to build a meta-analytical database, Nested Knowledge searches a large volume of publications indexed in medical journal literature databases (e.g. PubMed). Searches are executed via keyword search or MeSH tag matching, often resulting in large result sets of studies to consider for inclusion in our database. The decision to use a study is based on inclusion criteria (e.g. the study must be a clinical trial, it must be published in English). These criteria are automated as much as possible, but the process of inclusion/exclusion still requires a good deal of time from our expert reviewers.

Citation Networks

Citation networks provide a medium to visualize & reason about a set of studies. For the graph theory uninitiated, studies may be formulated as points (nodes or vertices) and citations as connections between studies (edges). In the below example, we have 3 studies: A, B, & C. The connections between these points can be read as: "study A cites study B", and "study B cites study C".
Simple citation network
Citation networks can handle arbitrary pairwise relations, for example studies with multiple citations or no citations.
More complex citation network
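As a rough illustration of the node/edge formulation, the sketch below stores a citation network as a plain dictionary mapping each study to the set of studies it cites. The function name `build_citation_graph` and the study labels are hypothetical, chosen only for this example; they are not part of any Nested Knowledge tooling.

```python
def build_citation_graph(studies, citations):
    """Return {study: set of studies it cites} (directed edges).

    `studies` lists every node, so studies with no citations still appear.
    `citations` is an iterable of (citing_study, cited_study) pairs.
    """
    graph = {study: set() for study in studies}
    for citing, cited in citations:
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # make sure the cited study is a node too
    return graph


# Hypothetical example: A cites B and C, B cites C, and D has no citations.
graph = build_citation_graph(
    studies=["A", "B", "C", "D"],
    citations=[("A", "B"), ("A", "C"), ("B", "C")],
)

for study, cited in graph.items():
    print(study, "cites", sorted(cited) if cited else "nothing")
```

A dictionary of sets is only one possible representation; a graph library would serve equally well.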
 

Ischemic Stroke Therapies

The first nest we built at Nested Knowledge investigates how choice of therapy impacts patient outcomes for ischemic stroke sufferers. Treatments include thrombolytic therapies (e.g. IV-tPA) and thrombectomy (e.g. stent retrievers, contact aspiration). As such, we used 20 literature search terms in total, "trevo" among them. After automated & expert (manual) inclusion/exclusion, we have a list of studies like:
title                         search_term  exclusion_reason
Alternative technique for...  trevo        Not related
[Mechanical thrombectomy ...  trevo        Foreign Language
Haloperidol and Risperido...  trevo        non-human
Optimizating Clot Retriev...  trevo        Not related
Safety and Efficacy of Me...  trevo        Meta-analysis
Mechanical thrombectomy w...  trevo        Not related
Association of Rewarming ...  trevo        Not related
Functional outcomes and r...  trevo        Meta-analysis
...
All 8 of these studies were excluded: for being non-English, non-human, or a meta-analysis, or simply for not being related to the research question. In total, 3948 studies (!) were considered and 67 were included, giving a hit rate of around 2% (67/3948 ≈ 1.7%).

Citation Network

Using citation data, we can visualize the studies in the literature search. Note that the ~2,000 studies without citations were excluded from the network to de-clutter the presentation.
Citation network
A couple characteristics jump out:

Adding in Inclusion

Let's make the network more useful by adding in whether a study was included (orange) or excluded (green).
Study Inclusion Citation Network
Woah! There appear to be some interesting things going on here. Below are some observations, including possible explanations:
These insights lead to a hypothesis:
Included studies tend to lie close to one another on the citation network
where "lying close" is defined by shortest path between two studies. This hypothesis is meaningful because, if true, it suggests a method for reducing time spent on literature search inclusion/exclusion. After a certain point in time in the search, if a study is not near other included studies, we might feel confident that the study will be excluded. We could then move it to the bottom of the priority list or even auto-exclude it, resulting in saved time for our expert reviewers. So, how do we test the significance of this hypothesis?
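To make the prioritization idea concrete, here is a minimal sketch that measures each study's shortest-path distance (in citation hops) to its nearest included study, then sorts unscreened studies so the closest come first. It assumes citations are treated as undirected links for the purpose of distance; the function names and the toy network are hypothetical.

```python
from collections import deque


def distance_to_nearest_included(neighbors, included):
    """Multi-source breadth-first search over an undirected citation network.

    Returns {study: hops to the closest included study}. Studies that cannot
    reach any included study are absent from the result.
    """
    dist = {study: 0 for study in included}
    queue = deque(included)
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def screening_order(neighbors, included, unscreened):
    """Sort unscreened studies by closeness to included ones (ties by name)."""
    dist = distance_to_nearest_included(neighbors, included)
    return sorted(unscreened, key=lambda s: (dist.get(s, float("inf")), s))


# Hypothetical network: a chain A - B - C - D, plus E with no citation links.
neighbors = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),
}
order = screening_order(neighbors, included=["A"], unscreened=["D", "E", "B", "C"])
print(order)
```

Studies that are unreachable from every included study fall to the bottom of the list, which matches the "move it to the bottom of the priority list" option rather than auto-exclusion.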

Inference on Networks

Spatial Autocorrelation

Spatial autocorrelation is a measure of how dependent the effect or occurrence of some phenomenon is on its surroundings. For example, the red coloration of the dots on the left shows a high degree of spatial autocorrelation, while that of the dots on the right does not.
Autocorrelation vs. Independence
In this framework, our hypothesis can be reformulated:
Study inclusion is spatially autocorrelated.
Moran's I is a commonly used statistic for measuring spatial autocorrelation. For network data, Moran's I effectively takes a parameter describing how many neighbors to include (i.e. the degree of autocorrelation that we expect). We can visualize this parameter with the below network:
Shortest Paths Example
Focused on point 0:
In other words, this parameter controls how far-reaching we expect the autocorrelation to be.
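The sketch below shows one way Moran's I could be computed on a network with such a parameter. It assumes binary weights: two studies count as neighbors (weight 1) when their shortest-path distance is at most `max_hops`, and weight 0 otherwise. That weighting scheme, the function names, and the toy path network are assumptions for illustration, not necessarily the exact setup behind the figures in this article.

```python
from collections import deque


def hops_from(neighbors, source, max_hops):
    """Breadth-first search returning {study: hops} for studies within max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if dist[current] == max_hops:
            continue  # do not expand beyond the allowed range
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def morans_i(neighbors, values, max_hops=1):
    """Moran's I = (N / W) * sum_ij w_ij * d_i * d_j / sum_i d_i^2,

    where d_i is the deviation of study i from the mean, w_ij = 1 if i and j
    are distinct and within max_hops of each other, and W is the sum of w_ij.
    Assumes at least one neighbor pair and non-constant values.
    """
    nodes = list(neighbors)
    n = len(nodes)
    mean = sum(values[s] for s in nodes) / n
    dev = {s: values[s] - mean for s in nodes}
    cross = 0.0
    total_weight = 0
    for i in nodes:
        for j in hops_from(neighbors, i, max_hops):
            if j != i:
                cross += dev[i] * dev[j]
                total_weight += 1
    variance_sum = sum(d * d for d in dev.values())
    return (n / total_weight) * cross / variance_sum


# Hypothetical path network 0 - 1 - 2 - 3 with inclusion coded as 1/0.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
clustered = {0: 1, 1: 1, 2: 0, 3: 0}    # included studies sit together
alternating = {0: 1, 1: 0, 2: 1, 3: 0}  # included studies are spread apart

print(hops_from(path, 0, max_hops=2))
print(morans_i(path, clustered), morans_i(path, alternating))
```

On this toy network the clustered labeling yields a positive Moran's I and the alternating labeling a negative one, which is the qualitative behavior the statistic is meant to capture.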

Monte Carlo Simulation

Hypothesis testing centers on demonstrating that a "null hypothesis" is not justifiable, which leads to a mutually exclusive alternative. Our (alternative) hypothesis is that "study inclusion is spatially autocorrelated". So, our null hypothesis could be "study inclusion is independent of spatial location". If this null hypothesis is true, the observed locations of included studies are indistinguishable from a random assignment of inclusion to locations. Monte Carlo simulation is the literal implementation of this idea. The procedure is to:
  1. Compute the statistic of interest on the observed data
  2. Randomly reassign inclusion to studies (without replacement) and compute the statistic of interest
  3. Repeat (2) many times
The extent to which the statistic computed in (1) is extreme in the empirical distribution computed in (2) indicates statistical significance.
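The three steps above can be sketched as a generic permutation test. For brevity, the statistic used here is a simple stand-in (the share of citation links joining two studies with the same inclusion label) rather than Moran's I; any function of the labels could be passed in its place. The one-sided comparison, the `(count + 1) / (trials + 1)` p-value convention, and all names and data are assumptions for illustration.

```python
import random


def permutation_test(statistic, labels, n_trials=100, seed=0):
    """Monte Carlo test of `statistic` against random relabeling.

    `labels` maps study -> 1 (included) or 0 (excluded). Each trial shuffles
    the labels among the studies, so the number of included studies is fixed.
    """
    rng = random.Random(seed)
    observed = statistic(labels)          # step 1: statistic on observed data
    studies = list(labels)
    values = [labels[s] for s in studies]
    simulated = []
    for _ in range(n_trials):             # step 3: repeat many times
        rng.shuffle(values)               # step 2: reassign without replacement
        simulated.append(statistic(dict(zip(studies, values))))
    # One-sided p-value: how often is a random relabeling at least as extreme?
    n_extreme = sum(1 for s in simulated if s >= observed)
    p_value = (n_extreme + 1) / (n_trials + 1)
    return observed, simulated, p_value


# Hypothetical network: included studies A, B, C only cite one another.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("F", "G"), ("G", "H")]
labels = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0, "G": 0, "H": 0}


def same_label_share(current_labels):
    """Stand-in statistic: fraction of links whose endpoints share a label."""
    same = sum(current_labels[a] == current_labels[b] for a, b in edges)
    return same / len(edges)


observed, simulated, p_value = permutation_test(same_label_share, labels)
print(observed, p_value)
```

Fixing the seed makes the run reproducible; in practice the number of trials would be chosen according to how small a p-value needs to be resolved.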

Testing Our Hypothesis

Below is a histogram of the Monte Carlo empirical distribution with Moran's I parameter of 1 and again for 2. The observed Moran's I is the black vertical bar.
Moran's I (1st nearest neighbors)

Moran's I (2nd nearest neighbors)
Given that 100 Monte Carlo trials were run, both p-values are reported as < .01, the finest resolution that 100 trials can support. Judging by how far the observed values sit from the simulated distributions, the significance is more convincing than that bound suggests. Note that the autocorrelation appears even more significant the more far-reaching we allow it to be (parameter=2).
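The link between the number of trials and the reported bound can be checked with a little arithmetic. The sketch below assumes the common convention of counting the observed value as one of the outcomes, so the smallest attainable p-value is 1 / (trials + 1); the article does not state which convention was used, and the function name is hypothetical.

```python
from fractions import Fraction


def smallest_p_value(n_trials):
    """Smallest p-value a Monte Carlo test can report with n_trials trials,

    assuming p = (number of simulated values at least as extreme + 1)
    / (n_trials + 1), reached when no simulated value is as extreme as observed.
    """
    return Fraction(1, n_trials + 1)


for n_trials in (99, 100, 999, 1000):
    p = smallest_p_value(n_trials)
    print(n_trials, "trials -> smallest p =", p, "=", float(p))
```

Under this convention, 100 trials is the fewest that can support "p < .01", and resolving smaller p-values would call for more trials.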

In Action

Given the significant effect with plausible causal mechanisms, we'll be implementing inclusion probability ordering in literature search for future nests. On the immediate horizon is a comparison of flow diverter efficacy for hemorrhagic stroke. Stay tuned!