Stat 306a: Discrete data analysis
Overview
Stat 305 looked at regression models for real-valued
response variables. Things change when the response variable
we're looking at is discrete. The binary case is the simplest,
and for it we will study logistic regression, possibly the most
important discrete data analysis method.
Related methods are available for multicategory (ordered or
unordered) responses. Loglinear models handle multivariate
discrete data in which we don't necessarily wish to identify
a response variable.
Counted data are becoming ever more important in the age
of the Internet. Information retrieval made significant progress
when it adopted the point of view that documents can be represented
as large sparse discrete data vectors. Companies involved in
e-commerce develop enormous log files of data and simply counting what
happens (and what things happen together) can yield richly informative
data. In the second portion of the course we'll look at some of these
topics and related methods. Machine translation of natural languages
(not covered here) is also now dominated by data-intensive methods
applied to discrete data.
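As a rough illustration of "simply counting what happens (and what things happen together)", here is a minimal Python sketch. The session data and the function name `count_items_and_pairs` are made up for this example and are not course materials. Only observed items and pairs are stored, which is the sparse representation in miniature.

```python
from collections import Counter
from itertools import combinations

# Hypothetical log: each entry lists the items seen in one session.
sessions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

def count_items_and_pairs(sessions):
    """Count single items and unordered co-occurring pairs across sessions."""
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        items = sorted(set(session))  # drop repeats within a session
        item_counts.update(items)
        # Sorted input means each pair always appears in the same order.
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

item_counts, pair_counts = count_items_and_pairs(sessions)
print(item_counts.most_common())
print(pair_counts.most_common(3))
```

Pairs that never occur take no storage at all, so the same approach scales to vocabularies far larger than any dense table could hold.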
Instructor
 Art Owen
 Sequoia Hall 130
 My userid is owenpelican on stanfordpelican.edu
(remember to remove the seabirds)
 Office hour: Friday 11:00-12:00
Goals
 Deep and thorough understanding of binary data, especially logistic regression.
 Competence in modeling categorical data including hands-on work getting data into a form suitable for analysis.
 Broad exposure to aspects of categorical data that one might otherwise miss.
We only have one quarter. The class should be deep and it should also be
broad. The compromise is to go into depth on key topics, while learning
the basics of related ones.
Classes
1:30 to 2:20 Monday, Wednesday and Friday, starting Monday Jan 4
Topics
 Discrete distributions: Bernoulli, Binomial, Poisson, Multinomial
 Related continuous distributions: Beta, Dirichlet
 Chi-square tests
 Logistic regression
 Loglinear models for contingency tables
 Generalized linear models
 Bradley-Terry and related models
 Rasch and related models
 Predicting ordered and unordered categorical values
 Applications
 market basket analysis
 sequence similarity
 information retrieval
 recommenders
Texts
The main text is "Categorical Data Analysis" (third edition)
by A. Agresti. We will use it for the first half to two
thirds of the course. For the rest of the course we'll look
at ways that categorical data are being used in real-world,
large-scale applications.
For that we'll switch to research articles and
the supplementary text, "Learning Python",
by Lutz and Ascher. That book explains how to use Python.
If you already know how to use Python you don't need to buy it.
You might also find you like another book better, but this
one works well.
Python is good for generating discrete data from raw sources
like text. Then you can dump discrete data to a file and analyze it in R.
Over time you might end up doing more in Python and less in R.
Python has a rich set of
libraries.
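The workflow described above (Python to turn raw text into counts, then a file for R) might look roughly like the following sketch. The two tiny documents, the function names, and the long (doc, term, count) layout are assumptions chosen for illustration. The tokenizer is deliberately crude: it lowercases and keeps only runs of the letters a to z.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical raw text; in practice this would come from files or logs.
documents = {
    "doc1": "The cat sat on the mat.",
    "doc2": "The dog chased the cat!",
}

def term_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def write_long_format(documents, handle):
    """Write one (doc, term, count) row per nonzero count."""
    writer = csv.writer(handle)
    writer.writerow(["doc", "term", "count"])
    for doc_id in sorted(documents):
        counts = term_counts(documents[doc_id])
        for term in sorted(counts):
            writer.writerow([doc_id, term, counts[term]])

# Write to an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
write_long_format(documents, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a handle such as `open("counts.csv", "w", newline="")` (the file name is arbitrary); the result can then be loaded in R with `read.csv("counts.csv")`.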
I'm assuming that you already know how to use R. After all,
Stat 305 is a prerequisite and it is R-based.
TAs
 Jeha Yang jehaPenguin@stanford.edu
Office Hours: Friday 9:00-11:00 Sequoia Hall fishbowl
 Junyang Qian junyangqPenguin@stanford.edu
Office Hours: Wednesday 11:30-1:30 & TBA Sequoia Hall fishbowl
Delete the Antarctic bird from the TAs' email
Evaluation
 Homework: 4 to 6 problem sets (65%)
 Midterm Friday February 12 (35%)
Be sure to give Axess a working email address:
I expect to send a small number of important emails about
problem sets and the homework there.
Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set.
There is no late penalty for work turned in
in class on the due date. Work turned in
within 24 hours of that is 1 day late, 48
hours for 2 days late, etc.
Each day late is penalized by 10% of the homework value.
Homework more than 3 days late will ordinarily get 0.
If you're travelling, you can email a pdf file.
For sickness, interviews and other events,
up to 3 late days total are forgiven at the end of
the quarter. (Work late enough to get zero does not
get redeemed though.)
Supplementary materials

Brown, Cai and Dasgupta's definitive treatment of
Interval estimation for
a binomial proportion

Chapters 1 and 12
including hierarchical models
of Richard McElreath's Bayesian statistics book
 Matan Gavish's
crash course on entropy and related ideas for categorical data analysis
 Gelman et al.
default prior for logistic regression
 Charles McCulloch's notes on
Generalized linear
mixed models from JSTOR
 John D. Cook's notes
on the negative binomial distribution
 Mervyn Silvapulle's
definitive
article on existence of logistic regression MLEs
 Paul Komarek's
logistic regression on steroids (not his term)

Zipf and related things

Models for three sided coins
 Probabilistic models for document collections
 Bethany Percha's
dissertation
on using text mining to find drug-drug interactions

Elizabeth Purdom's R tutorial

Website for Agresti's book
 Laura Thompson's guides to R/Splus computing for Agresti's book

Notes on generalized linear models

Wikipedia pages on some course related distributions

Binomial
In R: qbinom, pbinom, dbinom, rbinom

Poisson
In R: qpois, ppois, dpois, rpois

Hypergeometric
In R: qhyper, phyper, dhyper, rhyper

Negative binomial
In R: qnbinom, pnbinom, dnbinom, rnbinom; also rnegbin of library(MASS)

Beta
In R: qbeta, pbeta, dbeta, rbeta
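The R names above follow one convention: the prefix d gives the probability mass or density, p the cumulative distribution function, q the quantile function, and r random draws. For readers generating data in Python, here is a minimal standard-library sketch of the same four operations for the binomial case. The function names are invented for this example and their argument order does not match R's.

```python
import math
import random

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); plays the role of R's d prefix."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for integer k; plays the role of R's p prefix."""
    return sum(binom_pmf(j, n, p) for j in range(0, min(k, n) + 1))

def binom_quantile(u, n, p):
    """Smallest k with P(X <= k) >= u, for 0 <= u <= 1; R's q prefix."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= u:
            return k
    return n  # guards against rounding error when u is very close to 1

def binom_draw(n, p, rng=random):
    """One random draw, as a sum of n Bernoulli(p) trials; R's r prefix."""
    return sum(rng.random() < p for _ in range(n))

print(binom_pmf(2, 4, 0.5), binom_cdf(1, 4, 0.5), binom_quantile(0.5, 4, 0.5))
```

This is meant to show what the four prefixes compute, not to replace a numerical library: the direct sums are fine for small n but slow and less accurate for large n.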