Data analysis and statistics

rich-ramsey.github.io/talks/sbs-retreat-25/

Richard Ramsey
www.rich-ramsey.com

Aim

Provide the lab with a general context for thinking about data analysis and statistics.

Warning

I’m not a statistician. I’m not a statistician. I’m not a statistician…

Overview


  1. Background.
  2. Caution.
  3. Regression.
  4. Summary.


Background

Sampling from a larger population

Image from: https://danawanzer.github.io/stats-with-jamovi/

My background


  • Cognitive neuroscience / experimental psychology.

  • Social perception and cognition.

  • I am not a statistician. (Did I say that already?)

  • More recently: open science, methods and meta-science.



My undergrad and postgrad stats classes looked like this:

  • Each week described a different statistical test.

  • Your job was to choose the right test for a given type of data and run the test (usually via point-and-click in SPSS).

  • Then you interpret the p-value.

  • Job done. Easy, right?


In the wake of the reproducibility crisis, I felt the need to become more statistically literate.

  • Enter:
    • Richard McElreath’s excellent textbook (McElreath 2020)
    • Solomon Kurz’s brilliant translation into tidyverse principles (Kurz 2023).
    • Various papers and books by Andrew Gelman.

Caution

Champagne inference on a beer budget


A quote from Andrew Gelman (Gelman 2024):

“once the data have been collected, the most important decisions have already been done”


All statistical models are fundamentally limited and need to be framed within the wider scientific context (McElreath 2020), such as:

  • The importance of theory
  • Open data and materials
  • Pre-registration
  • Meta-analyses
  • Computational modelling
  • Data science
  • Experimental design
  • And many more besides


Before we make inferences and draw conclusions, we should spend more time (Scheel et al. 2021):

  • Forming concepts.
  • Developing valid measures.
  • Identifying boundary conditions and auxiliary assumptions.
  • And so on…

Caution (!) is required


  • Statistical inference is not magical.
  • Inferences rest on many assumptions and data quality.
  • Inferences are likely to be fragile/tentative/suggestive.
  • So, be cautious!
  • And try to create research designs that do not rely too heavily on one particular part of your inferential model.

Towards statistical thinking


  • Develop a statistical philosophy rather than rely on historical statistical rituals (Gigerenzer 2018).
  • There are many different approaches to the same question.
  • Have a sense of how your approach fits into the mix of options, in terms of pros and cons.
  • Be able to defend and justify your choices explicitly.

Regression

Single-level linear regression

\[\color{red}{Y_i} = \color{orange}{\beta_0} + \color{green}{\beta_1} \color{blue}{X_i} + \color{violet}{\varepsilon_i}\]

  • \(\color{red}{Y_i}\): The outcome/response variable for observation \(i\)
  • \(\color{orange}{\beta_0}\): The intercept (expected value of \(Y\) when \(X = 0\))
  • \(\color{green}{\beta_1}\): The slope (expected change in \(Y\) for a one-unit increase in \(X\))
  • \(\color{blue}{X_i}\): The predictor variable for observation \(i\)
  • \(\color{violet}{\varepsilon_i}\): The error term for observation \(i\) (estimated by the residual)
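
To make the equation concrete, here is a minimal sketch in base R. It simulates data from the model above and then recovers the parameters with lm(). All parameter values (beta_0 = 2, beta_1 = 0.5, sigma = 1) are made up for illustration, and the errors are assumed to be normally distributed.

```r
# Simulate data from Y_i = beta_0 + beta_1 * X_i + epsilon_i
# (all parameter values below are invented for illustration)
set.seed(1)
n      <- 200
beta_0 <- 2.0   # true intercept
beta_1 <- 0.5   # true slope
sigma  <- 1.0   # standard deviation of the error term (assumed normal)

x <- rnorm(n, mean = 0, sd = 2)                 # predictor X_i
y <- beta_0 + beta_1 * x + rnorm(n, 0, sigma)   # outcome Y_i

# Fit the single-level model; "1 +" makes the intercept explicit
fit <- lm(y ~ 1 + x)

coef(fit)             # estimates of beta_0 and beta_1
head(residuals(fit))  # residuals: estimates of the first few error terms
```

The estimates should land close to, but not exactly on, the values used to simulate the data.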

Common statistical tests are forms of linear regression

https://lindeloev.github.io/tests-as-linear/
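
As one small illustration of this idea, the sketch below runs a Student's t-test and the equivalent regression on the same simulated two-group data. The data set and variable names (d, group, dv) are hypothetical, and the equal-variance version of the t-test is assumed so that the two approaches match.

```r
set.seed(2)
# Hypothetical two-group data set
d <- data.frame(
  group = rep(c("a", "b"), each = 30),
  dv    = c(rnorm(30, mean = 0), rnorm(30, mean = 0.5))
)

# Classic Student's t-test (equal variances assumed)
tt <- t.test(dv ~ group, data = d, var.equal = TRUE)

# The same comparison as a regression with a dummy-coded predictor
fit  <- lm(dv ~ 1 + group, data = d)
lm_t <- summary(fit)$coefficients["groupb", "t value"]
lm_p <- summary(fit)$coefficients["groupb", "Pr(>|t|)"]

# t.test() reports mean(a) - mean(b), the regression reports b - a,
# so the t-values differ only in sign and the p-values are identical
c(t_test = unname(tt$statistic), lm = lm_t)
c(t_test = tt$p.value, lm = lm_p)
```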

Homework

  • Go away, get some data and run single-level regressions in R.
  • Use different types of data: continuous and categorical predictors, etc.
  • I bet there are a million tutorials online.
  • Aim: understand the basics of single-level regression. The rest builds on this core foundation.
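
A possible starting point, using the mtcars data set that ships with R. The choice of data set and variables is arbitrary and only meant as a sketch of the two predictor types.

```r
# mtcars ships with R; the choice of variables here is arbitrary
data(mtcars)

# Continuous predictor: the slope is the expected change in mpg
# per unit of wt (recorded in 1000 lbs)
fit_cont <- lm(mpg ~ 1 + wt, data = mtcars)

# Categorical predictor: factor() gives dummy coding by default, so the
# intercept is the mean of the reference level (4 cylinders) and the
# other coefficients are differences from that level
mtcars$cyl_f <- factor(mtcars$cyl)
fit_cat <- lm(mpg ~ 1 + cyl_f, data = mtcars)

coef(fit_cont)
coef(fit_cat)
tapply(mtcars$mpg, mtcars$cyl_f, mean)  # compare with the coefficients above
```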

Multi-level linear regression

\[\color{red}{Y_{ij}} = (\color{orange}{\gamma_{00}} + \color{yellow}{u_{0j}}) + (\color{green}{\gamma_{10}} + \color{cyan}{u_{1j}})\color{blue}{X_{ij}} + \color{violet}{\varepsilon_{ij}}\]

  • \(\color{red}{Y_{ij}}\): The outcome variable for observation \(i\) in group \(j\)
  • \(\color{orange}{\gamma_{00}}\): The fixed effect intercept (population average)
  • \(\color{yellow}{u_{0j}}\): The random/varying intercept for group \(j\)
  • \(\color{green}{\gamma_{10}}\): The fixed effect slope (population average)
  • \(\color{cyan}{u_{1j}}\): The random/varying slope for group \(j\)
  • \(\color{blue}{X_{ij}}\): The predictor variable for observation \(i\) in group \(j\)
  • \(\color{violet}{\varepsilon_{ij}}\): The error term for observation \(i\) in group \(j\)
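
One way to get a feel for this equation is to simulate data from it. The sketch below does this in base R. All numbers are invented, and it assumes that the varying intercepts and slopes are normally distributed and independent of each other (real models usually also estimate their correlation).

```r
set.seed(3)
n_groups <- 20   # groups j (e.g. participants)
n_obs    <- 50   # observations i per group

gamma_00 <- 500  # fixed intercept (population average)
gamma_10 <- 30   # fixed slope (population average)
sd_u0    <- 40   # SD of the varying intercepts u_0j
sd_u1    <- 15   # SD of the varying slopes u_1j
sd_eps   <- 60   # SD of the error term

# Group-level deviations (drawn independently here, for simplicity)
u_0 <- rnorm(n_groups, 0, sd_u0)
u_1 <- rnorm(n_groups, 0, sd_u1)

d <- data.frame(
  group = rep(seq_len(n_groups), each = n_obs),
  x     = rep(c(0, 1), times = n_groups * n_obs / 2)
)

# Y_ij = (gamma_00 + u_0j) + (gamma_10 + u_1j) * X_ij + epsilon_ij
d$y <- (gamma_00 + u_0[d$group]) +
  (gamma_10 + u_1[d$group]) * d$x +
  rnorm(nrow(d), 0, sd_eps)

head(d)
```

Each group gets its own intercept and slope, scattered around the population averages.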

Multi-level regression: why bother?


  • One general and versatile way to approach data analysis.
  • It avoids picking the “right” statistical test.
  • There is an active community of users and lots of resources.
  • It is suitable for most data in psychology and human neuroscience, which typically have a nested (multi-level) structure, e.g., sub-groups within a bigger group.
  • It takes advantage of partial pooling / shrinkage.

Summary vs trial-level data

summary / aggregated data

trial-level data
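
To show the difference between the two formats, here is a small sketch with hypothetical reaction time data. The names (trial_data, rt, participant, condition) and all values are made up.

```r
set.seed(4)
# Hypothetical trial-level data: 3 participants x 2 conditions x 4 trials
trial_data <- expand.grid(
  trial       = 1:4,
  condition   = c("a", "b"),
  participant = c("p1", "p2", "p3")
)
trial_data$rt <- round(rnorm(nrow(trial_data), mean = 500, sd = 50))

# Summary / aggregated data: one mean per participant and condition
summary_data <- aggregate(rt ~ participant + condition,
                          data = trial_data, FUN = mean)

nrow(trial_data)    # 24 rows: one per trial
nrow(summary_data)  # 6 rows: one per participant-condition cell
```

Aggregating throws away the trial-to-trial variability that a multi-level model can use.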

Partial pooling or shrinkage

Let’s build an intuition and then plot it.
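
A rough numerical sketch of the idea in base R. It is not how a multi-level model is actually fitted: it assumes the within-participant SD (sigma) and between-participant SD (tau) are known, whereas a real model estimates them from the data. All numbers are invented.

```r
set.seed(5)
sigma <- 100  # within-participant SD (assumed known here)
tau   <- 50   # between-participant SD (assumed known here)

# Unequal numbers of trials per participant
n_trials  <- c(2, 5, 10, 50, 200)
true_mean <- rnorm(length(n_trials), mean = 500, sd = tau)
d <- data.frame(
  participant = rep(seq_along(n_trials), times = n_trials),
  rt = rnorm(sum(n_trials), mean = rep(true_mean, times = n_trials), sd = sigma)
)

no_pooling       <- tapply(d$rt, d$participant, mean)  # each participant alone
complete_pooling <- mean(d$rt)                          # one mean for everyone

# Partial pooling: a precision-weighted compromise between the two.
# The weight on a participant's own mean grows with their number of trials.
w <- (n_trials / sigma^2) / (n_trials / sigma^2 + 1 / tau^2)
partial_pooling <- w * no_pooling + (1 - w) * complete_pooling

round(data.frame(n_trials, weight = w,
                 no_pooling = as.vector(no_pooling),
                 partial_pooling = as.vector(partial_pooling)), 2)
```

Participants with few trials are pulled strongly towards the overall mean (shrinkage), while participants with many trials barely move.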

Fixed and varying effects


# specify the model formula with only fixed or population-level terms
formula = dv ~ 1 + condition 


# and now add varying intercepts per participant
formula = dv ~ 1 + condition + 
  (1 | participant)


# and now add varying intercepts and slopes per participant
formula = dv ~ 1 + condition + 
  (1 + condition | participant)
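
As a sketch of how these formulas might be used, the example below simulates hypothetical data and fits the first and last formula. The data, the parameter values and the choice of lme4 are all assumptions for illustration. brms accepts the same formula syntax, and the lme4 part only runs if that package is installed.

```r
set.seed(6)
# Hypothetical data: 20 participants, 2 conditions, 20 trials per cell
d <- expand.grid(trial = 1:20, condition = c("a", "b"), participant = 1:20)
u0 <- rnorm(20, 0, 40)   # varying intercepts per participant
u1 <- rnorm(20, 0, 20)   # varying slopes (condition effect) per participant
d$dv <- 500 + u0[d$participant] +
  (30 + u1[d$participant]) * (d$condition == "b") +
  rnorm(nrow(d), 0, 60)

# Fixed or population-level terms only: base R is enough
fit_fixed <- lm(dv ~ 1 + condition, data = d)
coef(fit_fixed)

# Varying intercepts and slopes per participant need a multi-level package.
# lme4 is used here as one option (an assumption, not the only choice).
if (requireNamespace("lme4", quietly = TRUE)) {
  fit_varying <- lme4::lmer(
    dv ~ 1 + condition + (1 + condition | participant),
    data = d
  )
  print(lme4::fixef(fit_varying))             # population-level estimates
  print(head(coef(fit_varying)$participant))  # per-participant estimates
}
```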

Estimation approaches - basics

Adapted from Kruschke & Liddell, 2018

Estimation approaches - parameters

Estimation approaches - wider reading

Summary

Science needs more David Bowie


  • Statistical reform is important.

  • But science is not a one-trick pony.

  • We need more David Bowie.

https://vocal.media/beat/reinventing-david-bowie

Thanks for your attention


And thanks to these fine folks:

  • John Bartlett for his tutorial on reproducible presentations in R (this is solid gold).

  • Lisa DeBruine for sharing lots of example presentations.

Resources:

And here’s my stuff


References

Gigerenzer, Gerd. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Kruschke, John. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press. https://books.google.com?id=FzvLAwAAQBAJ.
Kurz, A. Solomon. 2023. Statistical Rethinking with Brms, Ggplot2, and the Tidyverse: Second Edition. Version 0.4.0. https://bookdown.org/content/4857/.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Scheel, Anne M., Leonid Tiokhin, Peder M. Isager, and Daniël Lakens. 2021. “Why Hypothesis Testers Should Spend Less Time Testing Hypotheses.” Perspectives on Psychological Science 16: 744–55. https://doi.org/10.1177/1745691620966795.
Winter, Bodo. 2019. Statistics for Linguists: An Introduction Using R. Routledge. https://books.google.com?id=8cbADwAAQBAJ.