This course is about the linear model. It is mainly a course about applied statistics, using the linear model to illustrate important concepts. The structure is as follows: we work through linear models in increasing order of complexity, pausing to talk about statistical ideas along the way to understand how and when to use them.
In regression we're working primarily with real-valued responses. The main tool for regression is the linear model, in all its glory, ranging from the humble one-sample t test to more elaborate methods like splines and wavelets. We also look at competing methods that are sometimes better than linear regression, because the focus is on the problems, not the tools. The mathematics and computation involved in regression are comparatively simple. Applied statistics remains difficult because connecting methods appropriately to a given problem context is hard. That is the focus of this course.
Here is the syllabus. The first 2/3 or so of the lectures refer to chapters in the scribed notes by Eric Min, described below.
Later lectures are on newly added topics.
- Learn about the linear model \(Y=X\beta+\varepsilon\) in depth and detail. Concepts, use cases, distribution theory, computation, geometric insight, problems and fixes, regularization.
- Use applied statistics tools: bootstrap, cross-validation, permutations, and more.
- Cross-cutting concepts: reproducibility, random effects, sparsity, Bayes, causal inference and more.
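As a small illustration of the linear model \(Y=X\beta+\varepsilon\) named in the goals above, here is a minimal sketch of least squares in the simplest case: an intercept and a single predictor. The course itself uses R; this sketch is in Python using only the standard library. The function names `fit_simple_ols` and `residuals` and the toy data are hypothetical, chosen only for illustration.

```python
# Minimal sketch: least squares for y = b0 + b1 * x + error.
# The data below are made up for illustration; they are not course data.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Return the observed responses minus the fitted values."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
res = residuals(x, y, b0, b1)
print("intercept:", b0, "slope:", b1)
```

The residuals from a least squares fit sum to zero and are orthogonal to the predictor, which is the geometric insight behind the normal equations. Checking those two properties is a quick sanity test on any implementation.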
This is not a first course in linear models. It is designed to be the last course on linear models for first-year statistics PhD students. Many students will already have:
- hands-on experience modeling data
- strong preparation in probability
- linear algebra
- lots of statistics theory
- we will be using R
With hard work, you can make up one or one and a half deficits. More than that, and you will feel lost. Try one or two of 141, 191, 202, 203, 216 first.
Sapp Teaching and Learning Center: STLC 111
Tuesday, Thursday 10:30 to 11:50
- Art Owen
- Sequoia Hall 130
- My sunet id is owen
- Office hours: Wednesday 11am to noon
| Day | Time | TA | Office | Email | Meeting room |
|---|---|---|---|---|---|
| Monday | 3-5pm | Claire Donnat | Sequoia 237 | firstname.lastname@example.org | Green Earth Sciences Bldg 131 |
| Tuesday | 4-6pm | Rina Friedberg | Sequoia 233 | email@example.com | Sequoia 105 (Girschick) |
| Thursday | 5-7pm | Youngtak Sohn | Sequoia 231 | firstname.lastname@example.org | Sequoia 207 (Bowker) |
| Friday | 9-11am | Dan Kluger | Sequoia 241 | email@example.com | GESB 131 |
Here are some scribed notes by Eric Min along with some notations/corrections by Rob Tibshirani, who used them in 2016/17. I am deeply indebted to Eric and Rob for their help. Eric carefully took down these nice notes in a classroom with awkward sight lines and poor acoustics. Rob fixed some of Eric's typos and some of my oversights. There may still be a few errors and omissions, but these notes are still the best thing for students who want to read ahead. I thank Raj Krishnakumar for scribing some notes on instrumental variables that fit into Eric's notes at Chapter 15.2.
Instructor scribed notes for selected lectures.
| Day | Notes |
|---|---|
| 09/24 | Intro, linear model, notation |
| 09/26 | Probability review; Noncentral distributions |
| 10/01 | Least squares (includes SVD) |
| 10/03 | One sample case |
| 10/08 | Two sample case |
| 10/10 | k sample case (we did not do random effects) |
| 10/15 | Plain linear regression (Min Ch 9) |
| 10/17 | Multiple regression |
| 10/22 | Variable selection etc (Min Ch 13/14) |
| 10/24 | Ridge (Min Ch 14) |
| 10/29 | Midterm |
| 10/31 | Added variable plots, GLS (Min Ch 16) |
| 11/05 | Robust regression and outliers (Min Ch 16) |
| 11/07 | Bootstrapping regression (Min Ch 17) |
| 11/12 | A/B tests and ANCOVA |
| 11/14 | Regression discontinuity and instrumental variables |
| 11/19 | Bayes I; drawing from Peter Hoff's book (largely Ch 1 and 5) |
| 11/21 | Bayes II and random effects (largely Hoff's 8, 9, Min's 11) |
| 12/03 | Quantile regression |
| 12/05 | Review |
Here are some further notes to supplement the class. These were written a few years ago and are a bit more formal than what I would write now, but they are useful for stat 305A. Note: the PDFs have chapter numbers that are not unique.
Overview of 305A | Review of relevant probability | Linear least squares | One-way ANOVA | Multi-way ANOVA
The book ``R for Data Science'' by Garrett Grolemund and Hadley Wickham is available online. Reading it online could be a good way to come up to speed with R. An earlier one is ``Introductory Statistics with R'' by Peter Dalgaard. It is available online from Stanford accounts here.
That book explains how to use R. There are R tutorials below as well. Here is a scary story about using a spreadsheet for data analysis. Excel turned gene names into calendar dates in an irreversible way. This is not the only issue. You're much better off writing code to do your data analysis.
Sometimes students ask about other books on regression. Here are some others that are relevant to this course. Googling with the title and author should pull them up.
Several texts are on four hour reserve at the Li and Ma Science library.
- ``Regression: Linear Models in Statistics'', Bingham and Fry 2010. The Stanford library has a digital version: Bingham and Fry
- ``Linear regression analysis'', Seber and Lee 2003, More theoretical.
- ``Regression analysis by example'', Chatterjee and Hadi 2012, Examples.
- ``Beyond ANOVA'', Rupert G. Miller Jr., 1986 (reprinted 1997), Covers classical material.
- ``Statistical methods'', Snedecor and Cochran, 1937 (many updates), very pragmatic, valuable for people with all prerequisites except experience with statistics.
- ``Plane answers to complex questions'', Christensen, (4th ed. 2011), makes maximal use of linear algebra.
- ``Applied regression analysis'', Draper and Smith, (1966 ++), classic text. Your professors' professors learned or taught from this one.
Here is a problem set guide for students taking this course. Here is a guide for TAs grading this course.
Be sure to give Axess a working email address:
The problem sets are available to students registered in the class. The existence of a new problem set will be announced in class.
I post them here as they are added. (Canvas not yet published.)
I expect to send a small number of important emails about problem sets and the homework there. Most other announcements will be made in class. If you email me about the class, be sure to have stat 305 in your subject line. Otherwise, your email won't show when I search for course-related emails.
Late penalties apply:
Upload your work to Gradescope, where it will get a time stamp. Work becomes late at midnight on the due date. We will count days late on each problem set. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a PDF file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)
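One reading of the late penalty can be written as simple arithmetic. The sketch below is not an official grading tool. The function name `late_adjusted_score` and the choice to floor scores at zero are assumptions, and the end-of-quarter forgiveness of up to 3 late days is deliberately not modeled.

```python
# Minimal sketch of the per-problem-set late penalty, under one reading
# of the stated policy. Forgiveness of late days is NOT modeled here.

def late_adjusted_score(earned, value, days_late):
    """Return the score after subtracting 10% of the homework value per day late.

    earned:    points earned before any penalty
    value:     total points the homework is worth
    days_late: whole days late (0 means on time)
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        # More than 3 days late ordinarily gets 0.
        return 0.0
    penalty = 0.10 * days_late * value
    # Assumption: a score cannot go below zero.
    return max(0.0, earned - penalty)


print(late_adjusted_score(90, 100, 2))
```

Under this reading, the penalty is taken from the homework's total value rather than from the points earned, so two days late on a 100-point problem set costs 20 points regardless of the raw score.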
The midterm is on Tuesday October 29 in class.
The midterm is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.
The final exam is on Thursday December 12 from 3:30pm to 6:30pm. Do not book travel that conflicts with this date. University policy is that students may not register for two classes with exams at the same time.
The final exam is closed book and is also closed to notes, calculators and phones. You may be asked to supply short derivations or proofs, to give advice on how to handle some hypothetical data, or diagnose a problem based on some regression output.
Here is Andrew Gelman's blog. He writes often about getting the right answer from statistics. Sometimes he says things I've thought but never seen in writing before. Sometimes I see completely new insights. He has lots of good specific examples of data analysis gone wrong with careful point by point critiques. Good statistical practice is not just about the math or the computing but about how they interact with the underlying science and goals.
xkcd on correlation versus causation. This is the first funny statistics joke I have ever seen. Lawyers have so many more to choose from.
Should data be handled with extreme caution, or should you try things and see what happens? If you're too cautious, you end up with analysis paralysis (ready, aim, aim, aim, aim). If you're incautious, other problems can happen. Fortune favors the bold. So does misfortune. Here are some cautionary tales. There is no surefire resolution of this dilemma.
Scribing a class lecture, the grad student desperately attempts to create something of value: