Stat 321 Readings and Lectures

There is no text on this collection of topics. I have put David Skillicorn's book on reserve in the Math-CS library. It covers the matrix decomposition material. Here is a collection of background material for Stat 321.

Lectures and notes

  1. Introduction
    Discussion of the topics and types of data we're looking at.
  2. Analysis of variance
    Similar problems have been handled in the 'all Gaussian' world. They have a rich vocabulary of terms for different kinds of variables that come up. The terms depend on our goals as well as how the variables are distributed. Don't get lost in the details of all these terms. They're there so you can search the literature with them.

    Why we care: What is important is that treating a random effect as a fixed one can trick you into believing the evidence is far stronger than it really is. Imagine 10 people with 10000 blood pressure measurements on each of their two arms. It looks like 200,000 observations. You might conclude with high confidence that left and right arms have different blood pressures. But really there are only 10 people. Even with infinitely many measurements on them, you would still have uncertainty about how the results generalize beyond those 10 people. Intuitively this is obvious. We hope our statistical methods take account of it. But a black box might not properly distinguish between 10 subjects with 10000 observations each and 10000 subjects with 10 observations each. Those are very different.
    For ANOVA-like problems we know how to handle this with some precision. We'd hate to get it wrong on some other problem.
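    A tiny simulation (with hypothetical numbers, assuming Gaussian subject effects and noise) makes the point: when measurements are correlated within subject, the naive standard error that treats all observations as independent is far too small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_reps = 10, 1000

# Each subject has a random effect; measurements repeat within subject.
subject_effect = rng.normal(0.0, 1.0, size=n_subjects)
y = subject_effect[:, None] + rng.normal(0.0, 1.0, size=(n_subjects, n_reps))

# Naive: pretend all n_subjects * n_reps values are independent.
naive_se = y.std(ddof=1) / np.sqrt(y.size)

# Better: only the 10 subject means are (roughly) independent.
subject_means = y.mean(axis=1)
correct_se = subject_means.std(ddof=1) / np.sqrt(n_subjects)
```

    Here naive_se comes out roughly an order of magnitude too small, and taking more measurements per subject cannot fix that.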

  3. Bootstrap lecture | Bootstrap background (Efron and Gong) | Bootstrap caveat (McCullagh) | Unbalanced heteroscedastic setting
    When rows are IID cases and columns are variables we can do a simple bootstrap analysis. If you are new to, or fuzzy on, the bootstrap, this introductory article by Efron and Gong is a good place to start. The bootstrap is approximately right in great generality. Only weak assumptions are needed.

    Why we care: The bootstrap goes badly wrong for 'crossed random effect' type data where both rows and columns are like the subjects in the analysis above. We might find ourselves with a seriously wrong answer. Also: we'd hate to have to take all the theory of restricted maximum likelihood estimators that gets used in the ANOVA case, and try to generalize it to other settings. It's bad enough when everything is Gaussian and mildly unbalanced. It would be awful for other models.
    So we'd like to do something easy like the bootstrap. Unfortunately McCullagh proves the non-existence of an exactly correct bootstrap even for a sample mean over all cases. McCullagh's paper is not easy to read. (How could it be easy to prove that no conceivable resampling will work?) The bootstrap described in the lecture notes is approximately right though.
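    For the IID-rows case the basic bootstrap really is easy. A sketch on simulated data (resample cases with replacement, recompute the statistic each time):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=50)          # 50 IID cases

B = 2000
boot_means = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(x), size=len(x))   # resample cases with replacement
    boot_means[b] = x[idx].mean()

se_hat = boot_means.std(ddof=1)                  # bootstrap standard error
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # percentile interval
```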

  4. Classic categorical data
    Some of these problems echo long-standing ones studied in categorical data analysis. Here we look at some of those old methods. The Rasch model is a famous highlight.
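    The Rasch model puts a person ability theta_i and an item difficulty beta_j into a logistic probability of a correct answer. A simulation sketch, with all parameter values made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_items = 200, 20
theta = rng.normal(0.0, 1.0, n_people)     # person abilities
beta = np.linspace(-2.0, 2.0, n_items)     # item difficulties, easy to hard

# Rasch model: P(person i answers item j correctly) = logistic(theta_i - beta_j)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
x = (rng.random((n_people, n_items)) < p).astype(int)

# In the Rasch model the row and column totals are sufficient statistics.
item_pass_rate = x.mean(axis=0)            # should fall as difficulty rises
```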
  5. Singular Value Decomposition
    Here we look at the SVD and its properties. They underlie many of the algorithms we'll study. Later we'll use the SVD as a data analysis tool.
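    A quick illustration of the key property (the Eckart-Young theorem: truncating the SVD gives the best low rank approximation in Frobenius norm), on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # best rank-2 approximation of A

# The Frobenius error is exactly the root sum of squares of the
# discarded singular values.
err = np.linalg.norm(A - A_k)
```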
  6. Principal Components Analysis
    This is a classical dimension reduction technique from multivariate analysis.
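    A minimal PCA-via-SVD sketch on made-up data: center the columns, take the SVD, and read the variance proportions off the singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
X[:, 1] += 2 * X[:, 0]                 # make two columns correlated

Xc = X - X.mean(axis=0)                # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                     # principal component scores
var_explained = s**2 / (s**2).sum()    # variance proportion per component
```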
  7. Correspondence analysis |article by Greenacre and Hastie
    Correspondence analysis merges principal components and discrete data analysis.
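    The core computation is an SVD of the standardized residuals of a contingency table. A sketch on a made-up 3x3 table of counts:

```python
import numpy as np

# Hypothetical contingency table of counts
N = np.array([[20., 10.,  5.],
              [10., 30., 10.],
              [ 5., 10., 25.]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)    # row and column masses

# Standardized residuals from independence, then their SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S)

row_coords = (U * s) / np.sqrt(r)[:, None]   # principal coordinates of rows
inertia = s**2                               # total inertia = chi-square / n
```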
  8. Heat maps | Wilkinson's IPAM talk | Some AGEMAP slides
    Heatmaps are a good tool for visualizing matrix valued data. They are better for confirmatory use than for exploratory use, especially when the data set is large. Picking the row/column ordering is hard to do and makes a difference. Same for the color scheme.
  9. Clustering
    Here we look at clustering, as a prelude to bi-clustering.
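    For concreteness, a bare-bones Lloyd's algorithm for k-means on made-up data. (Real work should use a library implementation with better initialization such as k-means++; the initialization here is naive.)

```python
import numpy as np

def kmeans(X, centers, n_iter=50):
    """Plain Lloyd's algorithm: alternate assignments and mean updates."""
    centers = centers.copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = d.argmin(axis=1)            # assign each point to nearest center
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(3.0, 0.3, size=(30, 2))])  # two separated blobs
labels, centers = kmeans(X, X[[0, 30]])              # naive initialization
```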
  10. SVD per se
    Notes on the SVD in and of itself for data analysis. In class we looked at Alter et al. for SVD on microarrays, and a second microarray paper by Liu et al., who robustified it and then made heatmaps.
  11. Nonnegative matrix factorization
    The lecture was based on Lee and Seung (1999), Donoho and Stodden (2003), and Berry, Browne, Langville, Pauca, and Plemmons (2006).
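    The multiplicative updates of Lee and Seung for the Frobenius objective are short enough to state in full. A sketch on synthetic data that is exactly low rank (the small eps guards against division by zero):

```python
import numpy as np

rng = np.random.default_rng(6)
V = rng.random((30, 3)) @ rng.random((3, 20))   # nonnegative, rank 3

r, eps = 3, 1e-12
W = rng.random((30, r))
H = rng.random((r, 20))
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # Lee-Seung multiplicative updates
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # keep W and H entrywise nonnegative

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

    The updates never produce negative entries, which is why they are so popular despite their slow convergence.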


Spectral clustering

  1. Janowitz survey
  2. Spectral clustering by Ulrike von Luxburg
  3. Spectral clustering by Shortreed and Meila, includes image segmentation example
  4. Ng, Jordan, Weiss paper
  5. Chris Ding's two tutorials on spectral clustering
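The simplest version of the idea, on toy one-dimensional data: build a similarity graph, form its Laplacian, and split on the sign of the Fiedler vector (the eigenvector for the second-smallest eigenvalue). The Gaussian bandwidth of 0.5 is an arbitrary choice here.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0.0, 0.2, 20),
                    rng.normal(4.0, 0.2, 20)])   # two clusters on a line

# Gaussian similarity graph and its unnormalized Laplacian
W = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.5**2))
L = np.diag(W.sum(axis=1)) - W

vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
fiedler = vecs[:, 1]                  # second-smallest eigenvector
labels = (fiedler > 0).astype(int)    # its sign pattern splits the graph
```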

Matrix Completion

Here are some key papers on matrix completion in roughly chronological order:
  1. Candes and Recht (2008)
    This one does exact completion via nuclear norms, with high probability. It uses m >= Cn^(6/5) r log(n) random measurements. (Fewer when r is not so large.) It introduces an incoherence measure.
  2. Cai, Candes and Shen (2008)
    This one has the singular value thresholding algorithm.
  3. Keshavan, Montanari, and Oh (2009)
    They have a fast algorithm, and require near or below minimal m. When m is below minimal, they get a recovery with a bounded RMSE. When m >= C nr max(log n, r) they get exact recovery. The proof uses a condition that bounds the singular values away from 0 and infinity.
  4. Candes and Tao (2009)
    They use the nuclear norm method and get exact recovery with high probability. The incoherence conditions have changed a little and the results are sharper than those of Candes and Recht. They need m >= C mu^4 nr^2 (log n)^2 [thm 1.1] or >= C mu^2 nr (log n)^6 [thm 1.2].
  5. Candes and Plan (2009)
    Recovery in the presence of noisy data. Minimize the nuclear norm of the estimate, subject to the Frobenius norm of the errors on the observed entries being below some delta. The delta comes, for example, from known noise variance.
  6. Zhu, So, and Ye (2009)
    Randomized basis pursuit. Sample rows or columns of M.
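None of these algorithms is hard to prototype. Here is a crude "impute and project" iteration on synthetic data. This is not any one paper's exact method, but it shows the flavor shared by all of them: alternate between fitting the observed entries and forcing low rank.

```python
import numpy as np

rng = np.random.default_rng(8)
n, r = 30, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # rank-2 target
mask = rng.random((n, n)) < 0.5                        # ~half the entries seen

X = np.where(mask, M, 0.0)
for _ in range(200):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_low = (U[:, :r] * s[:r]) @ Vt[:r, :]   # project onto rank-r matrices
    X = np.where(mask, M, X_low)             # re-impose observed entries

rel_err = np.linalg.norm(X_low - M) / np.linalg.norm(M)
```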

Zipf etc

Here are some references on discrete variables with many levels, particularly Zipf's law.
  1. Wikipedia on Zipf's law
    Basics of Zipf's law.
  2. Montemurro's article on Zipf's law on Gutenberg books (Thanks to David Gleich)
    Really big data sets tend to break Zipf's law.
  3. Cosma Shalizi's words of warning
    about assuming parametric forms for Zipf-like data. (Thanks to Hal Varian)
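Zipf's law says word frequency falls off roughly like a power of rank, so counts look linear on a log-log plot. A toy check on made-up text (a real corpus is needed for anything serious):

```python
import numpy as np

text = ("the quick brown fox jumps over the lazy dog the fox " * 50).split()
words, counts = np.unique(text, return_counts=True)
counts = np.sort(counts)[::-1]               # frequency by rank

ranks = np.arange(1, len(counts) + 1)
# Slope of log-frequency against log-rank; Zipf's law predicts about -1
slope = np.polyfit(np.log(ranks), np.log(counts), 1)[0]
```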

Newman's Q and its relaxation

  1. My brief notes
  2. Relaxation of Q by White and Smyth
  3. Newman's Phys Rev E paper
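Newman's Q compares within-community edge weight to what a random graph with the same degrees would give. A direct implementation on a toy graph (two triangles joined by one bridge edge):

```python
import numpy as np

def modularity(A, labels):
    """Newman's Q for an undirected graph with adjacency matrix A."""
    k = A.sum(axis=1)                       # degrees
    two_m = A.sum()                         # twice the number of edges
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))   # split at the bridge
bad = modularity(A, np.array([0, 1, 0, 1, 0, 1]))    # arbitrary split
```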

Walking and smoothing on graphs

  1. Trustrank
  2. Denny Zhou's website at Microsoft.
    This has lots of papers about smoothing on graphs. I like the one with Scholkopf and Hofmann as a starting point. New ones keep arriving.
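    A minimal version of the smoothing idea in this line of work: seed a few labels, then solve a linear system that spreads them along the graph. Here on a 5-node chain, with an arbitrary choice of alpha = 0.9:

```python
import numpy as np

n = 5
W = np.zeros((n, n))                      # chain graph 0-1-2-3-4
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
S = d_inv_sqrt @ W @ d_inv_sqrt           # symmetrically normalized adjacency

Y = np.zeros(n)
Y[0], Y[4] = 1.0, -1.0                    # seed labels at the two ends
alpha = 0.9
F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
```

    The smoothed scores F shrink toward zero away from the seeds, and the middle node, equidistant from both, scores zero.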


Miscellany

This spot is for articles that don't yet fit into a big enough category.
  1. Paul Komarek's logistic regression on steroids (not his term)
  2. Thomas Hofmann's probabilistic latent semantic analysis
  3. Blei, Ng, and Jordan's latent Dirichlet allocation