This
is an explanatory note for the ‘How to’ training videos tinyurl.com/intrstats3 (or directly Youtube3
).
The Excel worksheet is posted online at Tinyurl.com/SPATIALSSM (or directly Dataverse).
The
Research Question we start with is simply:
“Do
states with more residents in poverty have longer/shorter life expectancies? By
how much?”
First, the spoiler: naïve or a-spatial analyses will overestimate the effect, almost always (depending on the extent of spatial ‘excess similarity’ of values in both variables, between neighboring states)[i]. A visual 'proof' is below: neighboring states 'push up/down' their neighbors' values, one variable at a time.
*
To run a spatial lag regression in Excel, one needs 2 pieces of distinct data: (1).
The 2 variables for the US states; (2). The ‘shape file’ of the US
states, i.e. the geographic information systems (GIS) set of files encoding the
location of, and boundaries between, the states.
***
The steps involved in this are:
1. Obtain a 49x49 matrix data file marking which states neighbors which other state
* Find a ‘shape file’
for the US states online: e.g. Census – States level (cb_2018_us_state_20m.zip)
* (to go from the 51 states file to the
contiguous 49, use QGIS[ii]) Unzip in a folder, and
then in GeoDa (free) open it; in Tools \
Weights Manager \ Create, Select an ID variable, choose an ID (GEOID e.g. or
better the 2 letter state abbreviation one), and Contiguity Weight \ Queen Contiguity:
what saves then is a *.gal file, in essence a text file: open it with Notepad
e.g. to see its structure; for CT e.g., it’s 2 lines: The state name, the total
number of its ‘queen contiguity’ neighbors, then on the next line the names of
the neighbors
* This is the data we’ll process in Excel to
turn into a 49 * 49 matrix, full with 0’s except spots where the column state
is the neighbor of the row state (plus, we scale the non-zero numbers to add up
to 1: so for CT’s 3 neighbors, each cell gets a .33): all this is shown in the
Excel file Poverty_lifeexp_matrix_reg.xlsx in successive worksheets: queen_49Orig,
HowTo, ProcessStandardize, STANDimport49b49. This last one
will be used to turn a OLS regression into a spatial lag regression: that’s
all!
2. Generate the spatial lag variables
*
Multiply
each variable, a column of 49 rows (a vertical vector) by the standardized
weight matrix (in worksheet Reg_Lag): formula is simply MMULT(B2:AX50,P2:P50);
the result is the spatial lag derivative of the initial Life Expectancy variable
found in P2:P50!
3. Run the spatial lag regression
*
Use the
mean centered data in columns B, C, and D to run a multiple regression by hand’
in Excel; it merely means implementing (in steps however, the formula entered
in 1 chink did not run!) the formula for the beta/regression coefficient found
in Greene, p.23 eq. 3-10.
β p +1
x 1 = (X’p+1 x
N ·XN x p+1)-1
· X’ p+1 x N ·y N x 1
for p predictors (here p = 2), N = 49 states, X are the predictors, y is the outcome. The “+1” addition is because one would need a ‘vector of 1s’ added in the matrix of predictors, this is added in the Excel.
*
What we see is that the naïve β = -0.44, while the proper spatial β
= -0.31.
Conclusion:
States
with 10% points more residents in poverty have a lower life expectancy at birth
by 3.7 months; naïve analyses would yield instead an inflated (biased up) 5.3
months value.
*
Now anyone can run a spatial regression without much fuss; working in Stata or
R this can be done quite quickly, but what’s happening behind the scenes would
be lost: we unpacked it here.
Some more details:
A. Keeping track of the matching by state
is essential: many options exist for this, best in this instance is to use the
2 letter state abbreviation, and keep checking whether one messes the order or
not, at each step: copy and paste alongside the columns to check.
Alternatively, Excel can also do ‘matching’, see e.g. WWW.
* For larger files, like the ~3,080 US
counties, or the ~65,000 US census tracts, this ‘by-hand’ process becomes a
little cumbersome (Excel could still do it… ), so other automated options are
recommended: Stats’s sp module is a simple and instructive one: see Chuck
Huber’s ‘how to’ blog posting. See also Di Liu’s post.
B. Checking the results can be done in
GeoDa straight away: see Luc Anselin’s Guide (a PDF here)
C. Accounting for the spatial ‘auto’-correlation
is much like accounting for the prior time values – where the true meaning of ‘auto’
comes from: prior values of the same variable are the main ‘driver’ of current values;
one can easily a prior time (=time lag) outcome as co-predictor too, along with
the spatial lag co-predictor.
*******Additional resources****************************
Some books to
refer to when needing stats reviewing/reminding
* Kenny, D. A.
(1987). Statistics for the social and behavioral sciences:
Little, Brown Boston.
* Greene_2002_Econometric
Analysis
*Reference
cited**
Cameron, A., & Trivedi, P. (2009). Microeconometrics Using Stata. College
Station, TX: Stata Press.
Footnotes:
[i]
This is commonly called ‘nonindependence’ or less intuitively
‘auto’-correlation, even though the concept applies to 1 variable only: %
poverty exhibits this, and separately life expectancy exhibits it too, the
extent of is is given (commonly) by Moran’s I, which is ‘kind of’ a
correlation, meaning theoretically ranging from -1 to +1. At least two features
however makes it pretty different: (1). Its ‘null’ (no non-independence…) value
is not 0, but ; (2). The ‘what correlates with what’ is less visible, economists
call it more properly “correlated observations”, E(yi, yj)
= 0, see (Cameron & Trivedi, 2009), p. 81 .