Stat 305A: Linear Models (and more)

Overview

This course is about the linear model. It is mainly a course about applied statistics, using the linear model to illustrate important concepts. The structure is as follows: we work through linear models in increasing order of complexity, pausing to talk about statistical ideas along the way to understand how and when to use them.

In regression we're working primarily with real valued responses. The main tool for regression is the linear model, in all its glory, ranging from the humble one sample t test to more elaborate methods like splines and wavelets. We also look at competing methods that are sometimes better than linear regression, because the focus is on the problems, not the tools. The mathematics and computation involved in regression are comparatively simple. Applied statistics remains difficult because connecting methods appropriately to a given problem context is hard. That is the focus of this course.

Here is the syllabus. The first 2/3 or so refer to chapters in the scribed notes by Eric Min described below.
Later lectures are on newly added topics.


Goals

  1. Learn about the linear model \(Y=X\beta+\varepsilon\) in depth and detail. Concepts, use cases, distribution theory, computation, geometric insight, problems and fixes, regularization.
  2. Use applied statistics tools: bootstrap, cross-validation, permutations, and more.
  3. Cross-cutting concepts: reproducibility, random effects, sparsity, Bayes, causal inference and more.
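
As a reminder of the central object in goal 1 (written in standard notation, which may differ slightly from the scribed notes): assuming \(X\) is a fixed \(n\times p\) matrix with full column rank, the least squares estimate is

\[
\hat\beta = \arg\min_{\beta}\,\|Y-X\beta\|^2 = (X^{\mathsf T}X)^{-1}X^{\mathsf T}Y.
\]

If in addition \(\mathbb{E}(\varepsilon)=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2 I_n\), then \(\mathbb{E}(\hat\beta)=\beta\) and \(\mathrm{Var}(\hat\beta)=\sigma^2(X^{\mathsf T}X)^{-1}\).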

Prerequisites

This is not a first course in linear models. It is designed to be the last course on linear models for first year statistics PhD students. Many students will already have: With hard work, you can make up one or one and a half deficits. More than that, and you will feel lost. Try one or two of 141, 191, 202, 203, 216 first.

Classes

Sapp Teaching and Learning Center: STLC 111
Tuesday, Thursday 10:30 to 11:50

Instructor

Art Owen
Sequoia Hall 130
My sunet id is owen
Office hours: Wednesday 11am to noon

TAs

Day | Time | TA | Office | Email | Meeting room
Monday | 3-5pm | Claire Donnat | Sequoia 237 | cdonnat@stanford.edu | Green Earth Sciences Bldg 131
Tuesday | 4-6pm | Rina Friedberg | Sequoia 233 | rinafriedberg@gmail.com | Sequoia 105 (Girschick)
Thursday | 5-7pm | Youngtak Sohn | Sequoia 231 | youngtak@stanford.edu | Sequoia 207 (Bowker)
Friday | 9-11am | Dan Kluger | Sequoia 241 | kluger@stanford.edu | GESB 131

Notes

Here are some scribed notes by Eric Min along with some annotations/corrections by Rob Tibshirani, who used them in 2016/17. I am deeply indebted to Eric and Rob for their help. Eric carefully took down these nice notes in a classroom with awkward sight lines and poor acoustics. Rob fixed some of Eric's typos and some of my oversights. There may still be a few errors and omissions, but these notes are still the best thing for students who want to read ahead. I thank Raj Krishnakumar for scribing some notes on instrumental variables that fit into Eric's notes at Chapter 15.2.

Instructor scribed notes for selected lectures.
Day    Notes
09/24  Intro, linear model, notation
09/26  Probability review | Noncentral distributions
10/01  Least squares (includes SVD)
10/03  One sample case
10/08  Two sample case
10/10  k sample case (we did not do random effects)
10/15  Plain linear regression (Min Ch 9)
10/17  Multiple regression
10/22  Variable selection etc (Min Ch 13/14)
10/24  Ridge (Min Ch 14)
10/29  Midterm
10/31  Added variable plots, GLS (Min Ch 16)
11/05  Robust regression and outliers (Min Ch 16)
11/07  Bootstrapping regression (Min Ch 17)
11/12  A/B tests and ANCOVA
11/14  Regression discontinuity and instrumental variables
11/19  Bayes I; drawing from Peter Hoff's book (largely Ch 1 and 5)
11/21  Bayes II and random effects (largely Hoff's 8, 9, Min's 11)
12/03  Quantile regression
12/05  Review

Here are some further notes to supplement the class. These were written a few years ago and are a bit more formal than what I would write now, but they are useful for stat 305A. Note: the PDFs have chapter numbers that are not unique.
Overview of 305A | Review of relevant probability | Linear least squares | One way ANOVA | Multi-way ANOVA


From the office of accessible education

syllabus statement

Texts

The book ``R for Data Science'' by Garrett Grolemund and Hadley Wickham is available online. Reading it online could be a good way to come up to speed with R. An earlier one is ``Introductory Statistics with R'' by Peter Dalgaard, which is available online from Stanford accounts here.

That book explains how to use R. There are R tutorials below as well. Here is a scary story about using a spreadsheet for data analysis. Excel turned gene names into calendar dates in an irreversible way. This is not the only issue. You're much better off writing code to do your data analysis.
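
To make the spreadsheet hazard concrete, here is a minimal sketch in Python (used only for illustration; the course itself works in R). It assumes a tiny made-up table whose first column holds date-like gene symbols; the expression values are invented toy numbers. Read by code, every field arrives as plain text and nothing is reinterpreted as a date unless you ask for it.

```python
import csv
import io

# Toy data (made up for illustration): gene symbols that look like dates.
raw = "gene,expression\nSEPT2,5.1\nMARCH1,3.4\nDEC1,7.9\n"

# csv.DictReader returns every field as a string; no silent type guessing.
rows = list(csv.DictReader(io.StringIO(raw)))
genes = [row["gene"] for row in rows]

# Numeric columns are converted explicitly, so the conversion is visible in code.
expression = [float(row["expression"]) for row in rows]

print(genes)
print(expression)
```

The point of the design is that any type conversion is a line of code you wrote and can review, not a default applied when the file was opened.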

Sometimes students ask about other books on regression. Here are some others that are relevant to this course. Googling with the title and author should pull them up.

Several texts are on four hour reserve at the Li and Ma Science library.

Problem Sets 40%

Here is a problem set guide for students taking this course. Here is a guide for TAs grading this course.

The problem sets are available to students registered in the class. The existence of a new problem set will be announced in class.
I post them here as they are added. (Canvas not yet published.)
Be sure to give Axess a working email address:
I expect to send a small number of important emails about the problem sets to that address. Most other announcements will be made in class. If you email me about the class, be sure to have stat 305 in your subject line. Otherwise, your email won't show up when I search for course related emails.
Late penalties apply:
Upload your work to gradescope where it will get a time stamp. Work becomes late at midnight on the due date. We will count days late on each problem set. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a pdf file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)
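
The per-problem-set arithmetic can be sketched as follows. This is one reading of the rule, not an official grading script: the function names are hypothetical, days late are assumed to be counted as whole numbers, the penalty is assumed to be subtracted from the points earned, and the end-of-quarter forgiveness of up to 3 late days is not modeled.

```python
def late_penalty_percent(days_late: int) -> int:
    """Percent of the homework value lost, under the assumptions stated above."""
    if days_late <= 0:
        return 0            # on time
    if days_late > 3:
        return 100          # more than 3 days late ordinarily gets 0
    return 10 * days_late   # 10% of the homework value per day late


def score_after_penalty(raw_score: float, max_score: float, days_late: int) -> float:
    """Points kept after subtracting the late penalty (never below zero)."""
    penalty = max_score * late_penalty_percent(days_late) / 100
    return max(0.0, raw_score - penalty)


# Example: 45 out of 50, handed in 2 days late, loses 20% of 50 = 10 points.
print(score_after_penalty(45, 50, 2))  # 35.0
```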

Midterm Exam 25%

The midterm is on Tuesday October 29 in class.

The midterm is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.


Final Exam 35%

The exam is on Thursday December 12 from 3:30pm to 6:30pm. Do not book travel that conflicts with this date. University policy is that students may not register for two classes with exams at the same time.

The exam is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.


Supplementary materials

Big picture

Cautions about statistics and data science

Should data be handled with extreme caution, or should you try things and see what happens? If you're too cautious you end up with analysis paralysis (ready, aim, aim, aim, aim). If you're incautious other problems can happen. Fortune favors the bold. So does misfortune. Here are some cautionary tales. There is no surefire resolution of this dilemma.

Statistical material

Subject matter material


Scribing

One year we had a pretty challenging classroom. It included poor acoustics and sight lines and chairs that were awkwardly positioned. I had each lecture scribed by one or two students.

Scribing a class lecture, the grad student
desperately attempts to create something of value