Stat 305A: Linear Models (and more)

Overview

This course is about the linear model. It is mainly a course about applied statistics, using the linear model to illustrate important concepts. The structure is as follows: we work through linear models in increasing order of complexity, pausing to talk about statistical ideas along the way to understand how and when to use them.
In regression we're working primarily with real valued responses. The main tool for regression is the linear model, in all its glory, ranging from the humble one sample t test to more elaborate methods like splines and wavelets. We also look at competing methods that are sometimes better than linear regression, because the focus is on the problems, not the tools. The mathematics and computation involved in regression are comparatively simple. Applied statistics remains difficult because connecting methods appropriately to a given problem context is hard. That is the focus of this course.
Here is the syllabus. The first two thirds or so refers to chapters in the scribed notes by Eric Min, described below.
Later lectures are on newly added topics.

Goals

Learn about the linear model \(Y=X\beta+\varepsilon\) in depth and detail. Concepts, use cases, distribution theory, computation, geometric insight, problems and fixes, regularization.

Use applied statistics tools: bootstrap, cross-validation, permutations, and more.

Cross-cutting concepts: reproducibility, random effects, sparsity, Bayes, causal inference and more.
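
As a small illustration of the model \(Y=X\beta+\varepsilon\) in its simplest form (one predictor plus an intercept), here is a minimal sketch of a least squares fit. The course itself uses R; Python is used here only for illustration. The function name `fit_simple_regression` and the data are made up for this example and do not come from the course materials.

```python
def fit_simple_regression(x, y):
    """Least squares fit of y = b0 + b1 * x. Returns (b0, b1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("sum of residuals:", sum(residuals))
```

With an intercept in the model, the residuals sum to zero up to rounding error. That is one of the geometric facts about least squares.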

Prerequisites

This is not a first course in linear models. It is designed to be the last course on linear models for first year statistics PhD students. Many students will already have:

hands on experience modeling data

strong preparation in probability

linear algebra

analysis

lots of statistics theory

some experience with R (we will be using R)

With hard work, you can make up one or one and a half deficits. More than that, and you will feel lost. Try one or two of 141, 191, 202, 203, 216 first.

Classes

Sapp Teaching and Learning Center: STLC 111
Tuesday, Thursday 10:30 to 11:50

Instructor

Art Owen

Sequoia Hall 130

My sunet id is owen

Office hours: Wednesday 11am to noon

TAs

Day	Time	TA	Office	email	Meeting room
Monday	3-5pm	Claire Donnat	Sequoia 237	cdonnat@stanford.edu	Green Earth Sciences Bldg 131
Tuesday	4-6pm	Rina Friedberg	Sequoia 233	rinafriedberg@gmail.com	Sequoia 105 (Girshick)
Thursday	5-7pm	Youngtak Sohn	Sequoia 231	youngtak@stanford.edu	Sequoia 207 (Bowker)
Friday	9-11am	Dan Kluger	Sequoia 241	kluger@stanford.edu	GESB 131

Notes

Here are some scribed notes by Eric Min along with some annotations/corrections by Rob Tibshirani, who used them in 2016/17. I am deeply indebted to Eric and Rob for their help. Eric carefully took down these nice notes in a classroom with awkward sight lines and poor acoustics. Rob fixed some of Eric's typos and some of my oversights. There may still be a few errors and omissions, but these notes are still the best thing for students who want to read ahead. I thank Raj Krishnakumar for scribing some notes on instrumental variables that fit into Eric's notes at Chapter 15.2.
Instructor scribed notes for selected lectures.

Day	Notes	Day	Notes
09/24	Intro, linear model, notation	09/26	Probability review | Noncentral distributions
10/01	Least squares (includes SVD)	10/03	One sample case
10/08	Two sample case	10/10	k sample case (we did not do random effects)
10/15	Plain linear regression (Min Ch 9)	10/17	Multiple regression
10/22	Variable selection etc (Min Ch 13/14)	10/24	Ridge (Min Ch 14)
10/29	Midterm	10/31	Added variable plots, GLS (Min Ch 16)
11/05	Robust regression and outliers (Min Ch 16)	11/07	Bootstrapping regression (Min Ch 17)
11/12	A/B tests and ANCOVA	11/14	Regression discontinuity and instrumental variables
11/19	Bayes I; drawing from Peter Hoff's book (largely Ch 1 and 5)	11/21	Bayes II and random effects (largely Hoff's 8,9, Min's 11)
12/03	Quantile regression	12/05	Review

Here are some further notes to supplement the class. These were written a few years ago and are a bit more formal than what I would write now, but they are useful for Stat 305A. Note: the PDFs have chapter numbers that are not unique.
Overview of 305A | Review of relevant probability | Linear least squares | One way ANOVA | Multi-way ANOVA

From the office of accessible education

syllabus statement

Texts

The book ``R for Data Science'' by Garrett Grolemund and Hadley Wickham is available online. Reading it online could be a good way to come up to speed with R. An earlier one is ``Introductory Statistics with R'' by Peter Dalgaard, available online from Stanford accounts here.
That book explains how to use R. There are R tutorials below as well. Here is a scary story about using a spreadsheet for data analysis. Excel turned gene names into calendar dates in an irreversible way. This is not the only issue. You're much better off writing code to do your data analysis.
Sometimes students ask about other books on regression. Here are some others that are relevant to this course. Googling with the title and author should pull them up.

``Regression: Linear Models in Statistics'', Bingham and Fry 2010. The Stanford library has a digital version.

``Linear regression analysis'', Seber and Lee 2003, More theoretical.

``Regression analysis by example'', Chatterjee and Hadi 2012, Examples.

``Beyond ANOVA'', Rupert G. Miller Jr., 1986 (reprinted 1997), Covers classical material.

``Statistical methods'', Snedecor and Cochran, 1937 (many updates), very pragmatic, valuable for people with all prerequisites except experience with statistics.

``Plane answers to complex questions'', Christensen, (4th ed. 2011), makes maximal use of linear algebra.

``Applied regression analysis'', Draper and Smith, (1966 ++), classic text. Your professors' professors learned or taught from this one.

Several texts are on four hour reserve at the Li and Ma Science library.

Problem Sets 40%

Here is a problem set guide for students taking this course. Here is a guide for TAs grading this course.

The problem sets are available to students registered in the class. The existence of a new problem set will be announced in class.
I post them here as they are added. (Canvas not yet published.)

Be sure to give Axess a working email address:

I expect to send a small number of important emails about problem sets to that address. Most other announcements will be made in class. If you email me about the class, be sure to have stat 305 in your subject line. Otherwise, your email won't show up when I search for course related emails.

Late penalties apply:

Upload your work to gradescope where it will get a time stamp. Work becomes late at midnight on the due date. We will count days late on each problem set. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a pdf file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)
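
The per problem set arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the function name `penalized_score` is made up, the score is assumed not to drop below zero, and the end of quarter forgiveness of up to 3 late days is not modeled because the text does not say how forgiven days are allocated.

```python
def penalized_score(raw_score, max_points, days_late):
    """Score on one problem set after the late penalty.

    Each day late costs 10% of the homework value (max_points).
    More than 3 days late scores 0.
    Assumption: the result is floored at 0.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        return 0.0
    penalty = max_points * days_late / 10  # 10% of the value per day late
    return max(0.0, raw_score - penalty)


# Hypothetical example: a problem set worth 100 points, raw score 90.
for days in range(5):
    print(days, "days late:", penalized_score(90, 100, days))
```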

Midterm Exam 25%

The midterm is on Tuesday October 29 in class.
The midterm is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.

Final Exam 35%

The exam is on Thursday December 12 from 3:30pm to 6:30pm. Do not book travel that conflicts with this date. University policy is that students may not register for two classes with exams at the same time.
The exam is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.

Supplementary materials

Big picture

Peter Norvig on why it takes a long time to get good at something. He does not talk about applied statistics but his points apply here too. If you're talented and work really hard at it, you can get very good at applied statistics in about 10 years. This course is designed to get you started and speed you along the way. Malcolm Gladwell wrote something similar about 10,000 hours in his Outliers book, but I like Norvig's discussion more. There has been some follow-up saying Gladwell got the details wrong and oversimplified the research he reported on. I'm still convinced that getting good at statistics takes time, practice and attention to what you're doing, though it also makes sense that there is no magic number. The goal of this class is to move you faster along the curve than would normally happen in the 10 weeks or so we spend.
Here is Andrew Gelman's blog. He writes often about getting the right answer from statistics. Sometimes he says things I've thought but never seen in writing before. Sometimes I see completely new insights. He has lots of good specific examples of data analysis gone wrong with careful point by point critiques. Good statistical practice is not just about the math or the computing but about how they interact with the underlying science and goals.
Eugene Yan's The First Rule of Machine Learning: Start without Machine Learning. The idea is to first explore the data graphically and with summary statistics, thinking about the use cases. Get something simple going. Then later consider replacing it with a black box once you've built up some domain knowledge.
xkcd on correlation versus causation. This is the first funny statistics joke I have ever seen. Lawyers have so many more to choose from.

Cautions about statistics and data science

Should data be handled with extreme caution, or should you try things and see what happens? If you're too cautious you end up with analysis paralysis (ready, aim, aim, aim, aim). If you're incautious other problems can happen. Fortune favors the bold. So does misfortune. Here are some cautionary tales. There is no sure fire resolution of this dilemma.

John Ioannidis explains why (he thinks that) most published research findings are false. This is scary stuff. Almost nobody believes that the errors he talks about apply to them.

You can be too data driven (scroll down for the best comments about measurement bias)

data considered toxic (it can leak and hurt people)

Jason Kottke's list of cheats including a classifier that learns that a lesion is more likely to be cancerous if the photo includes a ruler. The linked article for this was from Stanford.

Statistical material

Rudy Angeles' R tutorial (2011/12)
Student's t Wikipedia page. In R: qt, pt, dt, rt
F distribution Wikipedia page
GCV for ridge regression
Rudy Angeles' Ridge+Lasso Notes (2006/07)
Empirical likelihood overview 4up | 1up | for regression
Bootstrapping regression:
Hal Varian on causal inference in economics and marketing
Angrist and Krueger on instrumental variables
McFadden on instrumental variables
Gelman and Zelizer's regression discontinuity example
Paul Holland's article on Neyman-Rubin causality.
Dimitris Bertsimas and Rahul Mazumder's work on mixed integer optimization for super robust regression
Related: best subset regression

Subject matter material

Human body temperature
Repeated CIs for the Hubble constant thanks to Justin Dyer
Cardboard boxes | and their strength

Scribing

One year we had a pretty challenging classroom. It included poor acoustics and sight lines and chairs that were awkwardly positioned. I had each lecture scribed by one or two students.

Scribing a class lecture, the grad student desperately attempts to create something of value: