This is an explanatory note for the ‘How to’ training videos tinyurl.com/intrstats1 (or directly Youtube1) & tinyurl.com/intrstats2 (or Youtube2).
I provide
details to assist in answering some research questions (RQ), using simulated
data, with several basic statistical tests: chi-square test (then McNemar) and
t-test, for ‘independent’ and ‘dependent samples’. The Excel worksheet is
posted online at Tinyurl.com/101statsexcel (or Osf.Io). The RQs are motivated by a study on
weight loss, whose data are also posted online at Dataverse (Coman, 2024), and center around body mass index (BMI, weight), Hemoglobin
A1c (HgA1c, blood glucose), gender, and time. I asked: RQ.1. Are there more overweight
males than females? (& RQ.1.b. Do males and females
differ in body mass index?); RQ.2. Do BMI levels change over time?;
RQ.3. Is the level of HgA1c predicted by BMI?
These RQs directly invite the analyses best equipped to answer them. [i]
All these tests merely compare differences against some standard reference level of similarity (no difference):
1. Do cases (persons/patients) differ in their values on 1 variable only, say BMI? They may, they may not: if all had the same BMI, there would be nothing to explain. If ½ of the sample seem to have somewhat similar high BMI values, and the other ½ somewhat similar low BMI values, the differences are mainly between the low and high ‘clusters’ (we may have 2 classes of folks, and the within-class differences are rather small compared to the between-class ones).
1.a. These questions beat around a causal bush, to be honest: differences in BMI are of interest mostly because of the obesity epidemic in several countries, so what we truly want to know is not just ‘what explains differences in male BMI’, but what determines John’s BMI and Jake’s BMI, so that we can tell John to exercise 30 min/day and tell Jake to exercise 45 min/day (whatever comes out of the analyses), if they want to drop their BMI by some 5 kg/m² (the unit for BMI).
1.b. Eventually, this ‘what drives differences’ knowledge is needed for another practical (and causal) inquiry: How much average weight loss would prevent, say, half of those who (are not diabetic now, but) would become diabetic in a year from actually becoming diabetic?
2. Are ‘these folks’ different from ‘those
folks’ (diabetic vs. ‘normal’) in terms of something else, like weight (BMI)?
This “2 variable” question can take on different ‘shapes’ depending on how we ‘carve out’ each variable: from a ‘both continuous’ first step, we can look at a graph like below, where each diamond is a person, and split it into 2 halves, either vertically, or horizontally, or into 4 quadrants, using some ‘arbitrary’ lines (in our case set at the sample means).
2.a. If we ignore where the diamonds sit within each quadrant, and just compare the 4 ‘groups’ of folks, we fall back on an RQ setup with 2 categorical variables: this is handled in the Excel we work through in the Youtube-Training-1, in the “Are there more overweight males than females?” section.[ii]
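The 2x2 comparison can also be sketched outside Excel. The snippet below is a minimal illustration, not the worksheet's own computation: the counts are made up, the function names are hypothetical, and no continuity correction is applied. It uses only the Python standard library, relying on the fact that the upper tail of a chi-square with 1 degree of freedom equals erfc(sqrt(chi2/2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p

def mcnemar(b, c):
    """McNemar chi-square (no correction) from the 2 discordant counts of a paired 2x2 table."""
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # also 1 df
    return chi2, p

# Made-up counts: rows = males, females; columns = overweight, not overweight
chi2_indep, p_indep = chi_square_2x2(30, 20, 20, 30)

# Made-up paired counts (same persons at 2 time points): 15 changed one way, 5 the other
chi2_paired, p_paired = mcnemar(15, 5)

print(round(chi2_indep, 3), round(p_indep, 4))
print(round(chi2_paired, 3), round(p_paired, 4))
```

The first function answers the ‘independent samples’ version of the question; McNemar is its ‘dependent samples’ counterpart and only looks at the persons who switched category.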
***
Note that a 2x1 table of counts (of the 2 categories (0,.) and (1,.) of normal/over-weight), in which one enters the means of the other variable (HgA1c here) instead of the counts, turns the data into a format ripe for a comparison-of-means line of questioning: a t-test of independent samples would fit here like a glove (the one-way Anova test yields an equivalent result, since with 2 groups and a pooled variance F = t²)!
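The equivalence of the 2-group t-test and the one-way Anova can be checked numerically. This is a sketch under stated assumptions: the HgA1c values are invented, the helper names are hypothetical, and the t-test is the pooled-variance (equal variances) version, which is the one that matches Anova exactly.

```python
from statistics import mean, variance

def pooled_t(x, y):
    """Independent-samples t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(x) - mean(y)) / se

def one_way_anova_f(groups):
    """F statistic: between-groups mean square over within-groups mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up HgA1c values for the 2 weight groups
normal_weight = [5.1, 5.4, 5.0, 5.6, 5.3]
overweight = [5.9, 6.2, 5.7, 6.4, 6.0]

t_stat = pooled_t(normal_weight, overweight)
f_stat = one_way_anova_f([normal_weight, overweight])
print(t_stat ** 2, f_stat)  # the 2 numbers agree up to rounding error
```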
***
Also note that, if we add a 3rd variable in a 2x2 table (like normal/over-weight
and normal/diabetic), say blood pressure, in the form of the mean of each
cross-group, one ends up with a two-way Anova structure in which there are 2 ‘main
effects’ on blood pressure [iii].
2.b. The 2 continuous variables shown in the scatter plot invite questions of ‘going hand-in-hand’: are most of the folks situated in the Low&Low (0,0) and High&High (1,1) quadrants, with only a few in the other 2? Then we have a positive relation; if we push this mental exercise to placing ALL the diamonds on a straight line (at a 45 degree angle), the 2 variables become identical [iv].
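The quadrant counting and the ‘hand-in-hand’ idea can be sketched as follows. The BMI and HgA1c values are invented for illustration, the function names are hypothetical, and the cut lines are set at the sample means, as in the scatter plot described above.

```python
from statistics import mean

def quadrant_counts(x, y):
    """Count cases in each (x above mean?, y above mean?) quadrant."""
    mx, my = mean(x), mean(y)
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for xi, yi in zip(x, y):
        counts[(int(xi > mx), int(yi > my))] += 1
    return counts

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up values for 6 persons
bmi = [22, 24, 27, 31, 35, 38]
hga1c = [5.0, 5.6, 5.4, 6.1, 6.0, 6.8]

counts = quadrant_counts(bmi, hga1c)
r = pearson_r(bmi, hga1c)

# Limiting case: all points on one straight line, so the correlation is 1
r_line = pearson_r(bmi, [2 * b + 1 for b in bmi])
print(counts, round(r, 3), round(r_line, 3))
```

With these values all cases land in the Low&Low and High&High quadrants, and the correlation is positive; the straight-line case is the perfect-collinearity end point.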
*** We show
how to run a simple linear regression analysis using Excel’s ‘powers’, but also
how to run a multiple regression, using Excel’s matrix multiplication (and Greene’s
formulas, p. 23, eq. 3-10).
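For readers who want to see the matrix route outside Excel, here is a minimal sketch. It assumes the formula meant is the usual least-squares solution b = (X'X)^(-1) X'y (Greene's eq. 3-10 is not reproduced here, so this is an assumption), and it solves the normal equations by elimination instead of forming the inverse explicitly. The data and all function names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of 2 lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.
    b is a column vector given as a list of 1-item rows."""
    n = len(a)
    aug = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def ols(x_rows, y):
    """Least-squares coefficients (intercept first) from the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in x_rows]  # prepend the column of 1s for the intercept
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = matmul(Xt, [[v] for v in y])
    return solve(XtX, Xty)

# Made-up data built so that outcome = 1 + 2*x1 + 3*x2 exactly
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
outcome = [9, 8, 19, 18, 29]

coefficients = ols(predictors, outcome)
print([round(c, 6) for c in coefficients])
```

Because the made-up outcome is an exact linear function of the 2 predictors, the recovered coefficients are the intercept 1 and the slopes 2 and 3 (up to rounding error); with real data there would be residuals.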
Some cold showers:
A. The statistical tests themselves are related, and each ‘falls back’ on another under some limiting constraints [v]; they also rest on specific assumptions, which can or cannot be relaxed easily (e.g. equality of variances in t-tests); a better way to ‘open up the black box’ of such mathematical straitjackets is to model all ‘parts’ flexibly, e.g. in multiple-group structural equation models (SEM), like in this article (Coman et al., 2014).
B. Using mathematical formulas to derive
specific estimates (e.g. the standard error of the mean difference) can only
take us so far: statistics is not as exact as arithmetic/algebra[vi].
Additional resources
Some books to refer to when needing stats reviewing/reminding:
* Devore (2016). Probability and Statistics for Engineering and the Sciences.
* Kenny, D. A. (1987). Statistics for the social and behavioral sciences. Boston: Little, Brown.
* Hernán, M. A., & Robins, J. M. (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC. (SAS, Stata, R, Python)
* Greene (2002). Econometric Analysis.
* Barreto, H., & Howland, F. (2005). Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel.
References cited
Coman, E. (2024). Data and appendix for: "Restructuring basic statistical curricula: mixing older analytic methods with modern software tools in psychological research". Retrieved from: https://doi.org/10.7910/DVN/QDXM7U
Coman, E. N., Iordache, E., Dierker, L., Fifield, J., Schensul, J. J., Suggs, S., & Barbour, R. (2014). Statistical power of alternative structural models for comparative effectiveness research: advantages of modeling unreliability. Journal of Modern Applied Statistical Methods, 13(1), 71-90. https://pubmed.ncbi.nlm.nih.gov/26640421/
Stevens, J. (2009). Applied multivariate statistics for the social sciences. Lawrence Erlbaum.
Footnotes:
[i] Note that one puts the cart before the horse when “dichotomizing a continuous variable, then using statistical tests for a categorical variable”! One in fact asks the question either in a continuous framework (Does BMI differ between biological genders?) OR in a categorical framework (Are there more/fewer overweight persons among males vs. females?). It is the RQ that should trigger transforming a variable, not the search for a convenient analytic model. The additional research question ‘what does overweight mean?’ is buried when one gallantly splits a continuous variable around some convenient value, like the sample mean; for some specific variables, like HgA1c, this question becomes essential: what HgA1c level qualifies a patient as ‘diabetic’? (i.e. “When does diabetes ‘come into existence’?”)
[ii] Note that we used ‘biological gender’ here, where we could have used ‘diabetic vs. not’, just to make the meaning of this ‘categorical’ variable more tangible: biological gender itself, however, can be conceptualized as a continuous measure, and it has been, in cases where the gender assignment is questioned (like this tennis example, or the recent Olympics boxing controversy, see ‘unspecified gender eligibility tests’).
[iii] More generally, Anova is a special case of the log linear model where the cell frequencies are replaced by the cell means of a third variable (see Stevens, 2009, ch. 14, Categorical Data Analysis: The Log Linear Model).
[iv] This is the end point of the problem called multi-collinearity: we use two variables in statistical models but, unbeknownst to us, they are correlated 1.0, i.e. one is a linear function of the other, so we don’t have 2 variables, but 1!
[v]
* t and z tests are approximately equivalent for samples of n > 30; t uses the sample variance, z needs the population variance (www);
* F = t²: if you square a t-statistic, you get an F-statistic with 1 degree of freedom in the numerator (www1 & www2).
* When the denominator degrees of freedom in an F-statistic become very large, (numerator degrees of freedom) * F approaches a chi-square distribution with the numerator degrees of freedom: chi-squared = (numerator degrees of freedom) * F (www).
[vi] In math 1 ≠ 2, ever, while statistically, sometimes 1 = 2 can ‘happen’: if 1 and 2 represent the mean amounts of cash ($) that boys and girls in a classroom have on them, we may conclude they ‘have the same amounts of cash’, depending on the variability of individual values (i.e. if the second mean is within 1.96 standard errors of the other mean). The t-test formula for 2 independent sample means is t = (mean1 - mean2) / sqrt((sd1^2/n1) + (sd2^2/n2)), where sd1 and sd2 are the 2 standard deviations; for say 10 boys and 10 girls, with sd1 = 1.2 and sd2 = 1.2, t = 1 / 0.536656315 = 1.863389981, which is smaller than the 1.96 value beyond which there is only a very small chance (<.05) of observing such a difference between the sample means if the two population means were in fact equal (‘null’ hypothesis). Note that 1.96 is the large-sample cutoff; the exact t cutoff for 18 degrees of freedom is larger still, about 2.10. We hence cannot reject the ‘null’, so 1 and 2 are statistically indistinguishable (not significantly different).
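The arithmetic in footnote [vi] can be reproduced with a few lines. This is only a sketch of the formula quoted there; the function name is hypothetical, and the second call (20 boys and 20 girls) is an added illustration, not part of the original example.

```python
import math

def t_independent(mean1, mean2, sd1, sd2, n1, n2):
    """t = (mean1 - mean2) / sqrt(sd1^2/n1 + sd2^2/n2), as in footnote [vi]."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# Means of 2 and 1, sd = 1.2 in both groups, 10 boys and 10 girls
t_small = t_independent(2, 1, 1.2, 1.2, 10, 10)

# Same means and sds, but 20 per group: the standard error shrinks
t_large = t_independent(2, 1, 1.2, 1.2, 20, 20)

print(round(t_small, 4))  # 1.8634, below 1.96
print(t_large > 1.96)     # True
```

The same 1-unit difference between means is ‘indistinguishable from 0’ with 10 cases per group but not with 20, which is the sense in which 1 = 2 can sometimes ‘happen’ statistically.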