Stat 306a: Discrete data analysis


Stat 305 looked at regression models for real-valued response variables. Things change when the response variable we're looking at is discrete. The binary case is the simplest, and for it we will study logistic regression, possibly the most important discrete data analysis method. Related methods are available for multicategory (ordered or unordered) responses. Loglinear models are there for multivariate discrete data in which we don't necessarily wish to identify a response variable.

Counted data are becoming ever more important in the age of the Internet. Information retrieval made significant progress when it adopted the point of view that documents can be represented as large sparse discrete data vectors. Companies involved in e-commerce develop enormous log files of data, and simply counting what happens (and what things happen together) can yield richly informative data. In the second portion of the course we'll look at some of these topics and related methods. Machine translation of natural languages (not covered here) is also dominated now by data-intensive methods with discrete data.


Art Owen
Sequoia Hall 130
My userid is owenbuzzard on (remember to remove the carrion eaters)
Office hour: Tuesday 11:00am


11 to 12:15 Monday and Wednesday, starting Monday Jan 3



The main text is "Categorical Data Analysis" (second edition) by A. Agresti. We will use it for the first half to two thirds of the course. For the rest of the course we'll look at ways that categorical data are being used in real world large scale applications. For that we'll switch to research articles and the supplementary text, "Learning Python", by Lutz and Ascher. That book explains how to use Python. If you already know how to use Python you don't need to buy it. You might also find you like another book better, but this one works well.

Python is good for generating discrete data from raw sources like text. Then you can dump discrete data to a file and analyze it in R. Over time you might end up doing more in Python and less in R. Python has a rich set of libraries. I'm assuming that you already know how to use R. After all, Stat 305 is a prerequisite and it is R based.
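As a rough illustration of that Python-to-R workflow, here is a minimal sketch that counts words in a few documents and dumps the counts to a CSV file. The function names (word_counts, write_counts), the tokenizing rule, and the long doc,word,count layout are all assumptions made for this example, not something prescribed by the course.

```python
import csv
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens in a string (a deliberately crude tokenizer)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def write_counts(docs, path):
    """Write one row per (document, word) pair with its count.

    docs is a dict mapping a document id to its raw text. Only words that
    actually occur are written, so the file is a sparse representation.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["doc", "word", "count"])
        for doc_id, text in docs.items():
            for word, count in sorted(word_counts(text).items()):
                writer.writerow([doc_id, word, count])


if __name__ == "__main__":
    # Hypothetical toy documents, just to exercise the functions.
    docs = {"d1": "The cat saw the cat", "d2": "A dog saw the cat"}
    write_counts(docs, "counts.csv")
```

The resulting file can then be read in R with read.csv("counts.csv") and analyzed there.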



Be sure to give Axess a working email address:
I expect to send a small number of important emails about problem sets and the homework there. Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set. Homework is late if it is not turned in in class: HW turned in on the due date but after class ends is one day late, the next day is two days late, and so on. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a pdf file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)
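The per-assignment arithmetic can be sketched as below. This is only one reading of the policy: the function name penalized_score is made up for the example, it assumes the score is floored at zero, and it does not model the late days forgiven at the end of the quarter.

```python
def penalized_score(raw_score, max_score, days_late):
    """Score on one problem set after the late penalty.

    Each day late costs 10% of the homework's total value (max_score),
    and anything more than 3 days late is treated as a zero.
    Forgiven late days are NOT modeled here (assumption of this sketch).
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        return 0.0
    penalty = max_score * days_late / 10
    # Flooring at zero is an assumption; the policy does not say.
    return max(0.0, raw_score - penalty)
```

For example, a homework worth 100 points that earns 90 and comes in two days late would score penalized_score(90, 100, 2), which is 70 under these assumptions.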

Supplementary materials


Problems (closed for the season)