EvaluationHelp: Run a Spatial lag regression in Excel

This is an explanatory note for the ‘How to’ training videos tinyurl.com/intrstats3 (or directly Youtube: youtube.com/watch?v=K_Emygh7axo ). The Excel worksheet is posted online at Tinyurl.com/SPATIALSSM (or directly Dataverse).

The Research Question we start with is simply:

“Do states with more residents in poverty have longer/shorter life expectancies? By how much?”

First, the spoiler: naïve or a-spatial analyses will overestimate the effect, almost always (depending on the extent of spatial ‘excess similarity’ of values in both variables, between neighboring states)[i]. A visual 'proof' is below: neighboring states 'push up/down' their neighbors' values, one variable at a time.

* To run a spatial lag regression in Excel, one needs 2 pieces of distinct data: (1). The 2 variables for the US states; (2). The ‘shape file’ of the US states, i.e. the geographic information systems (GIS) set of files encoding the location of, and boundaries between, the states.

*** The steps involved in this are:

1. Obtain a 49x49 matrix data file marking which states neighbors which other state

* Find a ‘shape file’ for the US states online: e.g. Census – States level (cb_2018_us_state_20m.zip)

* (to go from the 51 states file to the contiguous 49, use QGIS[ii]) Unzip in a folder, and then in GeoDa (free) open it; in Tools \ Weights Manager \ Create, Select an ID variable, choose an ID (GEOID e.g. or better the 2 letter state abbreviation one), and Contiguity Weight \ Queen Contiguity: what saves then is a *.gal file, in essence a text file: open it with Notepad e.g. to see its structure; for CT e.g., it’s 2 lines: The state name, the total number of its ‘queen contiguity’ neighbors, then on the next line the names of the neighbors

CT 3

NY MA RI

* This is the data we’ll process in Excel to turn into a 49 * 49 matrix, full with 0’s except spots where the column state is the neighbor of the row state (plus, we scale the non-zero numbers to add up to 1: so for CT’s 3 neighbors, each cell gets a .33): all this is shown in the Excel file Poverty_lifeexp_matrix_reg.xlsx in successive worksheets: queen_49Orig, HowTo, ProcessStandardize, STANDimport49b49. This last one will be used to turn a OLS regression into a spatial lag regression: that’s all!

2. Generate the spatial lag variables

* Multiply each variable, a column of 49 rows (a vertical vector) by the standardized weight matrix (in worksheet Reg_Lag): formula is simply MMULT(B2:AX50,P2:P50); the result is the spatial lag derivative of the initial Life Expectancy variable found in P2:P50!

3. Run the spatial lag regression

* Use the mean centered data in columns B, C, and D to run a multiple regression by hand’ in Excel; it merely means implementing (in steps however, the formula entered in 1 chink did not run!) the formula for the beta/regression coefficient found in Greene, p.23 eq. 3-10.

β_{p +1
x 1} = (X’_{p+1 x
N}·X_{N x p+1})^-1 · X’_{p+1 x N}·y _{N x 1}

for p predictors (here p = 2), N = 49 states, X are the predictors, y is the outcome. The “+1” addition is because one would need a ‘vector of 1s’ added in the matrix of predictors, this is added in the Excel.

* What we see is that the naïve β = -0.44, while the proper spatial β = -0.31.

Conclusion:

States with 10% points more residents in poverty have a lower life expectancy at birth by 3.7 months; naïve analyses would yield instead an inflated (biased up) 5.3 months value.

* Now anyone can run a spatial regression without much fuss; working in Stata or R this can be done quite quickly, but what’s happening behind the scenes would be lost: we unpacked it here.

Some more details:

A. Keeping track of the matching by state is essential: many options exist for this, best in this instance is to use the 2 letter state abbreviation, and keep checking whether one messes the order or not, at each step: copy and paste alongside the columns to check. Alternatively, Excel can also do ‘matching’, see e.g. WWW.

* For larger files, like the ~3,080 US counties, or the ~65,000 US census tracts, this ‘by-hand’ process becomes a little cumbersome (Excel could still do it… ), so other automated options are recommended: Stats’s sp module is a simple and instructive one: see Chuck Huber’s ‘how to’ blog posting. See also Di Liu’s post.

B. Checking the results can be done in GeoDa straight away: see Luc Anselin’s Guide (a PDF here)

* There are two ways to check this in GeoDa: B.a. Run a Classic Regression, then a Spatial Lag (with Weight File defined); B.b. Create a spatial lag variable using the Calculator \ Spatial lag option.

C. Accounting for the spatial ‘auto’-correlation is much like accounting for the prior time values – where the true meaning of ‘auto’ comes from: prior values of the same variable are the main ‘driver’ of current values; one can easily a prior time (=time lag) outcome as co-predictor too, along with the spatial lag co-predictor.

*******Additional resources****************************

Some books to refer to when needing stats reviewing/reminding

* Kenny, D. A. (1987). Statistics for the social and behavioral sciences: Little, Brown Boston.

* Greene_2002_Econometric Analysis

*Reference cited**

Cameron, A., & Trivedi, P. (2009). Microeconometrics Using Stata. College Station, TX: Stata Press.

Footnotes:

[i] This is commonly called ‘nonindependence’ or less intuitively ‘auto’-correlation, even though the concept applies to 1 variable only: % poverty exhibits this, and separately life expectancy exhibits it too, the extent of is is given (commonly) by Moran’s I, which is ‘kind of’ a correlation, meaning theoretically ranging from -1 to +1. At least two features however makes it pretty different: (1). Its ‘null’ (no non-independence…) value is not 0, but ; (2). The ‘what correlates with what’ is less visible, economists call it more properly “correlated observations”, E(y_i, y_j) = 0, see (Cameron & Trivedi, 2009), p. 81 .

[ii] Handling 'shape files’ to delete unwanted regions, and for ‘joining’ and other operations, can be best done in QGIS; this is another task, see e.g. WWW.

EvaluationHelp

Friday, December 27, 2024

Run a Spatial lag regression in Excel

No comments:

Post a Comment

About Me

Blog Archive