Stat 321 Readings and Lectures

There is no single text covering this entire collection of topics. The Skillicorn book mentioned below (available online from the Stanford library) covers much of the matrix decomposition material.

Lectures and notes

    These are the notes from previous offerings of the course as Stat 315c. They will be updated as the class goes on.

  1. Introduction
    Discussion of the topics and types of data we're looking at.

  2. Analysis of variance
    Similar problems have long been handled in the 'all Gaussian' world of the analysis of variance. That literature has a rich vocabulary of terms for the different kinds of variables that come up. The terms depend on our goals as well as on how the variables are distributed. Don't get lost in the details of all these terms; they're there so you can search the literature with them.

    Why we care: What is important is that treating a random effect as a fixed one can trick you into believing the evidence is far stronger than it really is. Imagine 10 people with 10,000 blood pressure measurements on each of their two arms. It looks like 200,000 observations. You might conclude with high confidence that left and right arms have different blood pressures. But really there are only 10 people. Even with infinitely many measurements on them, you would still have uncertainty about how the results apply more generally. Intuitively this is obvious, and we hope our statistical methods take account of it. But a black box might not take proper account of the difference between 10 subjects with 10,000 observations each and 10,000 subjects with 10 observations each. Those are very different.
    For ANOVA-like problems we know how to handle this with some precision. We'd hate to get it wrong on some other problem.
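
    To make that concrete, here is a minimal simulation sketch (not from the lecture notes): it compares the naive standard error that pretends all 200,000 measurements are independent with a subject-level analysis that has only 10 effective units.

      import numpy as np

      rng = np.random.default_rng(0)
      n_subj, n_rep = 10, 10000

      # Per-measurement left-minus-right differences: a subject-level random
      # effect (true mean zero) plus within-subject measurement noise.
      subj_effect = rng.normal(0.0, 1.0, n_subj)
      meas = subj_effect[:, None] + rng.normal(0.0, 1.0, (n_subj, n_rep))

      # Naive analysis: treat all n_subj * n_rep differences as independent.
      naive_se = meas.std(ddof=1) / np.sqrt(meas.size)

      # Subject-level analysis: only 10 effectively independent units.
      subj_means = meas.mean(axis=1)
      subj_se = subj_means.std(ddof=1) / np.sqrt(n_subj)

      print(f"naive SE   = {naive_se:.4f}")   # tiny: overstates the evidence
      print(f"subject SE = {subj_se:.4f}")    # honest: limited by 10 subjects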

  3. 2012 Bootstrap lecture | Pigeonhole Bootstrap lecture
    Bootstrap background (Efron and Gong) | Bootstrap caveat (McCullagh)
    Pigeonhole bootstrap (2007)
    Tensor case with reweighting vs resampling (2011), with Dean Eckles
    When rows are IID cases and columns are variables we can do a simple bootstrap analysis. If you are new to, or fuzzy on, the bootstrap, this introductory article by Efron and Gong is a good place to start. The bootstrap is approximately right in great generality. Only weak assumptions are needed.

    Why we care: The bootstrap goes badly wrong for 'crossed random effect' type data where both rows and columns are like the subjects in the analysis above. We might find ourselves with a seriously wrong answer. Also: we'd hate to have to take all the theory of restricted maximum likelihood estimators that gets used in the ANOVA case, and try to generalize it to other settings. It's bad enough when everything is Gaussian and mildly unbalanced. It would be awful for other models.
    So we'd like to do something easy like the bootstrap. Unfortunately McCullagh proves the non-existence of an exactly correct bootstrap even for a sample mean over all cases. McCullagh's paper is not easy to read. (How could it be easy to prove that no conceivable resampling will work?) The bootstrap described in the lecture notes is approximately right though.
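
    For the easy setting where rows really are IID cases, a row-resampling bootstrap is a few lines of code. This is a generic sketch of that simple case, not the pigeonhole version (which reweights or resamples rows and columns); the function name is just illustrative.

      import numpy as np

      def bootstrap_se(X, stat, B=1000, seed=0):
          """Bootstrap standard error of stat(X) when the rows of X are IID cases."""
          rng = np.random.default_rng(seed)
          n = X.shape[0]
          reps = np.empty(B)
          for b in range(B):
              idx = rng.integers(0, n, n)     # resample rows with replacement
              reps[b] = stat(X[idx])
          return reps.std(ddof=1)

      # Example: standard error of the first column's mean.
      X = np.random.default_rng(1).normal(size=(200, 5))
      print(bootstrap_se(X, lambda Z: Z[:, 0].mean()))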

  4. Classic categorical data
    Some of these problems echo long standing ones studied in categorical data analysis. Here we look at some of those old methods. The Rasch model is a famous highlight.
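
    The Rasch model puts P(X_ij = 1) = exp(a_i - b_j) / (1 + exp(a_i - b_j)) for person i and item j. As a rough sketch (not the classical conditional likelihood fit), one can get estimates of that form from logistic regression on person and item indicators; the light L2 penalty also handles the location indeterminacy in (a, b).

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      n_person, n_item = 100, 20
      a = rng.normal(0, 1, n_person)              # person abilities
      b = rng.normal(0, 1, n_item)                # item difficulties
      p = 1 / (1 + np.exp(-(a[:, None] - b[None, :])))
      X01 = rng.binomial(1, p)                    # simulated right/wrong matrix

      # One row per (person, item) pair: +1 for the person, -1 for the item.
      rows, cols = np.indices((n_person, n_item))
      Z = np.zeros((n_person * n_item, n_person + n_item))
      Z[np.arange(Z.shape[0]), rows.ravel()] = 1.0
      Z[np.arange(Z.shape[0]), n_person + cols.ravel()] = -1.0
      y = X01.ravel()

      fit = LogisticRegression(C=10.0, fit_intercept=False, max_iter=1000).fit(Z, y)
      a_hat, b_hat = fit.coef_[0, :n_person], fit.coef_[0, n_person:]
      print(np.corrcoef(a, a_hat)[0, 1], np.corrcoef(b, b_hat)[0, 1])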

  5. Singular Value Decomposition
    Here we look at the SVD and its properties. They underlie many of the algorithms we'll study. Later we'll use the SVD as a data analysis tool.
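
    As a quick reminder of the key property (Eckart-Young: truncating the SVD gives the best rank k approximation in Frobenius norm), here is a minimal numpy sketch.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(50, 30))

      U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt

      k = 5
      Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # best rank k approximation
      print(np.linalg.norm(X - Xk, 'fro'))               # equals sqrt of the sum of
      print(np.sqrt((s[k:] ** 2).sum()))                 # the discarded s_i^2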

  6. Principal Components Analysis
    This is a classical dimension reduction technique from multivariate analysis.
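
    A minimal sketch of the link to the SVD above: center each column, then the right singular vectors are the principal component directions and U diag(s) gives the scores.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 10))

      Xc = X - X.mean(axis=0)                    # center each variable
      U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

      scores = U * s                             # principal component scores
      loadings = Vt.T                            # component directions
      var_explained = s**2 / (s**2).sum()
      print(var_explained[:3])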

  7. Correspondence analysis | article by Greenacre and Hastie
    Correspondence analysis merges principal components and discrete data analysis.
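
    A bare-bones sketch of the usual computation, the SVD of the standardized residuals of a contingency table; scaling conventions vary by author, and the table below is made up.

      import numpy as np

      N = np.array([[20.,  5., 10.],
                    [ 8., 15.,  7.],
                    [ 2.,  9., 24.]])            # a small contingency table

      P = N / N.sum()
      r = P.sum(axis=1)                          # row masses
      c = P.sum(axis=0)                          # column masses

      # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
      S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

      U, s, Vt = np.linalg.svd(S, full_matrices=False)
      row_coords = (U * s) / np.sqrt(r)[:, None]       # principal coordinates, rows
      col_coords = (Vt.T * s) / np.sqrt(c)[:, None]    # principal coordinates, columns
      print(s**2)                                      # principal inertias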

  8. Heat maps | Wilkinson's IPAM talk | Some AGEMAP slides
    Heatmaps are a good tool for visualizing matrix valued data. They are better for confirmatory use than for exploratory use, especially when the data set is large. Picking the row/column ordering is hard to do and makes a difference. The same goes for the color scheme.
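
    One simple ordering heuristic, offered as a sketch rather than a recommendation: sort rows and columns by the leading singular vectors before drawing the image.

      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)
      X = rng.normal(size=(60, 40)) + np.outer(rng.normal(size=60), rng.normal(size=40))

      U, s, Vt = np.linalg.svd(X, full_matrices=False)
      row_order = np.argsort(U[:, 0])      # order rows by leading left singular vector
      col_order = np.argsort(Vt[0, :])     # order columns by leading right singular vector

      plt.imshow(X[np.ix_(row_order, col_order)], aspect='auto', cmap='RdBu_r')
      plt.colorbar()
      plt.title("Rows and columns ordered by the first singular vectors")
      plt.show()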

  9. Clustering
    Here we look at clustering, as a prelude to bi-clustering. This survey by Janowitz is helpful.

  10. SVD per se
    Notes on the SVD in and of itself for data analysis. In class we looked at Alter et al. for SVD on microarrays, and a second microarray paper by Liu et al., who robustified it and then made heatmaps.

  11. Nonnegative matrix factorization. The lecture was based on Lee and Seung (1999), Donoho and Stodden (2003), and Berry, Browne, Langville, Pauca, and Plemmons (2007). See also David Skillicorn's book for coverage of matrix decompositions.
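
    For reference, a minimal version of the Lee and Seung multiplicative updates for minimizing ||X - WH||_F^2; a sketch only, since real implementations need stopping rules, multiple restarts, and so on.

      import numpy as np

      def nmf(X, k, n_iter=500, eps=1e-10, seed=0):
          """Lee-Seung multiplicative updates for X ~ W @ H with W, H >= 0."""
          rng = np.random.default_rng(seed)
          n, m = X.shape
          W = rng.random((n, k)) + eps
          H = rng.random((k, m)) + eps
          for _ in range(n_iter):
              H *= (W.T @ X) / (W.T @ W @ H + eps)     # update H, keeping it nonnegative
              W *= (X @ H.T) / (W @ H @ H.T + eps)     # update W
          return W, H

      X = np.abs(np.random.default_rng(1).normal(size=(30, 20)))
      W, H = nmf(X, k=4)
      print(np.linalg.norm(X - W @ H, 'fro'))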

  12. Here is a survey of biclustering. We looked at the plaid model by Lazzeroni and Owen and archetypal analysis by Cutler and Breiman (cool, but not a biclustering). Both raise complicated optimization problems. Here's an application of archetypes to the human genotope.

  13. We looked at how to cross-validate an outer product model, such as the SVD. Here are slides on it. The first article is here. Patrick Perry's thesis is here.
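
    A hedged sketch of the held-out block idea, as I read it: partition X as [[A, B], [C, D]], hold out A, take the rank k truncated SVD of D, and predict A by B (D_k)^+ C; the residual on A scores rank k.

      import numpy as np

      def bcv_error(X, k, hold_rows, hold_cols):
          """Bi-cross-validation residual for rank k on one held-out block."""
          keep_rows = np.setdiff1d(np.arange(X.shape[0]), hold_rows)
          keep_cols = np.setdiff1d(np.arange(X.shape[1]), hold_cols)
          A = X[np.ix_(hold_rows, hold_cols)]          # held-out block
          B = X[np.ix_(hold_rows, keep_cols)]
          C = X[np.ix_(keep_rows, hold_cols)]
          D = X[np.ix_(keep_rows, keep_cols)]
          U, s, Vt = np.linalg.svd(D, full_matrices=False)
          Dk_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T   # pseudo-inverse of rank k part of D
          return np.linalg.norm(A - B @ Dk_pinv @ C, 'fro') ** 2

      rng = np.random.default_rng(0)
      X = rng.normal(size=(40, 6)) @ rng.normal(size=(6, 30)) + 0.1 * rng.normal(size=(40, 30))
      errs = [bcv_error(X, k, np.arange(10), np.arange(8)) for k in range(1, 11)]
      print(np.argmin(errs) + 1)      # rank with the smallest held-out error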

  14. Next we consider projection based data reduction methods, leading up to the CUR and CX decompositions. The optimization approach of Bien, Xu and Mahoney does CX using group lasso ideas instead of randomness. Francis Bach does CUR without randomness. A crucial first result is the Johnson-Lindenstrauss lemma: given n data points in m dimensional space, you can project them down to n points in O(log(n)) dimensional space while nearly preserving all interpoint distances. The projected values are generally dense (non-sparse), but having only O(log(n)) dimensions to work with may be more than worth it. Here is Andrei Broder's article on estimating resemblance. Here is Udi Manber's article on duplicate detection. Broder's article has a strange manipulation of the numerator. A much simpler approach in Broder, Charikar, Frieze and Mitzenmacher uses the fact that the probability of a common minimum equals the resemblance. You then use a bunch of random permutations.
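
    A minimal sketch of that min-hash idea: under a random permutation (here, random hash functions standing in for permutations), the probability that the minimum of set A equals the minimum of set B is exactly the resemblance |A intersect B| / |A union B|, so averaging matches over many hashes estimates it. The hash construction below is illustrative only.

      import numpy as np

      def minhash_signature(items, n_hash=200, seed=0):
          """Minimum of each of n_hash random hash functions over the item set."""
          rng = np.random.default_rng(seed)
          p = 2**31 - 1                                  # a Mersenne prime
          a = rng.integers(1, p, n_hash)
          b = rng.integers(0, p, n_hash)
          x = np.array([hash(it) % p for it in items])
          return ((a[:, None] * x[None, :] + b[:, None]) % p).min(axis=1)

      A = set("the quick brown fox jumps over the lazy dog".split())
      B = set("the quick brown fox leaps over a lazy cat".split())

      sigA, sigB = minhash_signature(A), minhash_signature(B)
      print((sigA == sigB).mean())            # estimated resemblance
      print(len(A & B) / len(A | B))          # exact resemblance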

  15. Spectral clustering lecture. Chris Ding's two tutorials give the geometric intuition. Ulrike von Luxburg's survey gives the math and some history. Spectral clustering by Shortreed and Meila includes an image segmentation example. Ng, Jordan, Weiss paper.
    Spectral biclustering (also called co-clustering or bipartite spectral graph partitioning), Inderjit Dhillon: Long tech report | short article | my summary
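
    A compact sketch in the style of the Ng, Jordan, Weiss algorithm: form a similarity matrix, normalize it as D^{-1/2} W D^{-1/2}, take the top k eigenvectors, normalize their rows, and run k-means. The Gaussian similarity and its bandwidth are illustrative choices.

      import numpy as np
      from sklearn.cluster import KMeans

      def spectral_cluster(X, k, sigma=1.0):
          """Ng-Jordan-Weiss style spectral clustering on the rows of X."""
          d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
          W = np.exp(-d2 / (2 * sigma**2))
          np.fill_diagonal(W, 0.0)
          d = W.sum(axis=1)
          S = W / np.sqrt(np.outer(d, d))              # D^{-1/2} W D^{-1/2}
          vals, vecs = np.linalg.eigh(S)
          V = vecs[:, -k:]                             # top k eigenvectors
          V = V / np.linalg.norm(V, axis=1, keepdims=True)
          return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
      print(spectral_cluster(X, 2))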

  16. Tensor methods. We looked at work by Lek-Heng Lim on the strange and difficult problems that come in the transition from matrices to tensors of arbitrary order. For instance, this one. Then Peter Hoff's approach to Gaussian tensors. Then Genevera Allen's work here. Kolda and Bader survey tensor decompositions.

Matrix Completion

Here are some key papers on matrix completion in roughly chronological order:
  1. Candes and Recht (2008)
    This one does exact completion via nuclear norms, with high probability. It uses m >= Cn^(6/5) r log(n) random measurements. (Fewer when r is not so large.) It introduces an incoherence measure.
  2. Cai, Candes and Shen (2008)
    This one has the singular value thresholding algorithm (a minimal sketch of the iteration appears at the end of this list).
  3. Keshavan, Montanari, and Oh (2009)
    They have a fast algorithm and require m near or even below the minimal value. When m is below minimal, they get a recovery with a bounded RMSE. When m >= C n r max(log n, r) they get exact recovery. The proof uses a condition that bounds the singular values away from 0 and infinity.
  4. Candes and Tao (2009)
    They use the nuclear norm method and get exact recovery with high probability. The incoherence conditions have changed a little and the results are sharper than Candes and Recht. They need m >= C mu^4 n r^2 (log n)^2 [Thm 1.1] or m >= C mu^2 n r (log n)^6 [Thm 1.2].
  5. Candes and Plan (2009)
    Recovery in the presence of noisy data. Minimize the nuclear norm of the estimate, subject to the Frobenius norm of the errors on the observed entries being below some delta. The delta comes, for example, from a known noise variance.
  6. Zhu, So, and Ye (2009)
    Randomized basis pursuit. Sample rows or columns of M.
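
A minimal sketch of the singular value thresholding iteration from item 2: shrink the singular values of Y, then take a step on the observed entries. Dense SVDs are used here for brevity, and the tau and delta defaults roughly follow the paper's suggestions; a serious implementation uses partial SVDs and sparsity.

    import numpy as np

    def svt_complete(M_obs, mask, tau=None, n_iter=500):
        """Singular value thresholding for matrix completion.
        M_obs holds the observed values (zero elsewhere); mask is 1 where observed."""
        n1, n2 = M_obs.shape
        if tau is None:
            tau = 5 * np.sqrt(n1 * n2)
        delta = 1.2 * n1 * n2 / mask.sum()               # step size
        Y = np.zeros_like(M_obs)
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(Y, full_matrices=False)
            s = np.maximum(s - tau, 0.0)                 # soft-threshold the singular values
            X = (U * s) @ Vt
            Y += delta * mask * (M_obs - X)              # correct on the observed entries
        return X

    rng = np.random.default_rng(0)
    M = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 50))      # a rank 3 matrix
    mask = (rng.random(M.shape) < 0.4).astype(float)             # observe 40% of the entries
    X = svt_complete(M * mask, mask)
    print(np.linalg.norm((1 - mask) * (X - M)) / np.linalg.norm((1 - mask) * M))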

Zipf etc

Here are some references on discrete variables with many levels, particularly Zipf's law.
  1. Wikipedia on Zipf's law
    Basics of Zipf's law.
  2. Montemurro's article on Zipf's law on Gutenberg books (Thanks to David Gleich)
    Really big data sets tend to break Zipf's law.
  3. Cosma Shalizi's words of warning
    about assuming parametric forms for Zipf-like data. (Thanks to Hal Varian)
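
The basic empirical check behind all of these is a rank-frequency plot on log-log axes, where Zipf's law predicts a roughly straight line with slope near -1. This sketch uses synthetic Zipf draws; with real text you would count word frequencies instead.

    import numpy as np
    import matplotlib.pyplot as plt
    from collections import Counter

    rng = np.random.default_rng(0)
    draws = rng.zipf(1.2, size=200_000)        # synthetic stand-in for word tokens
    counts = np.array(sorted(Counter(draws).values(), reverse=True))
    ranks = np.arange(1, len(counts) + 1)

    plt.loglog(ranks, counts, marker='.', linestyle='none')
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title("Rank-frequency plot; Zipf predicts a line of slope about -1")
    plt.show()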

Newman's Q and its relaxation

  1. My brief notes
  2. Relaxation of Q by White and Smyth
  3. Newman's Phys Rev E paper
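
For reference, Newman's modularity is Q = (1/(2m)) * sum_ij (A_ij - k_i k_j / (2m)) * 1{c_i = c_j}, with A the adjacency matrix, k the degrees, and c the community labels. A direct sketch of the computation:

    import numpy as np

    def modularity(A, labels):
        """Newman's Q for an undirected graph with adjacency matrix A."""
        k = A.sum(axis=1)                        # degrees
        two_m = A.sum()                          # 2m, the total degree
        same = labels[:, None] == labels[None, :]
        return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

    # Two triangles joined by a single edge; the obvious split gets a high Q.
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0
    print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))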

Walking and smoothing on graphs

  1. Trustrank
  2. Denny Zhou's website at Microsoft.
    This has lots of papers about smoothing on graphs. I like the one with Scholkopf and Hofmann as a starting point. New ones keep arriving. (A minimal smoothing sketch appears at the end of this list.)
  3. Xu, Dyer & Owen empirical stationary correlations for smoothing on graphs.
  4. Ya Xu's thesis
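
As a concrete example of this kind of smoothing (the local and global consistency scheme of Zhou et al., written from memory, so treat it as a sketch rather than the exact recipe of the paper mentioned in item 2): with S = D^{-1/2} W D^{-1/2}, the smoothed labels are F = (1 - alpha) (I - alpha S)^{-1} Y.

    import numpy as np

    def smooth_labels(W, Y, alpha=0.9):
        """Closed-form graph smoothing: F = (1 - alpha) (I - alpha S)^{-1} Y."""
        d = W.sum(axis=1)
        S = W / np.sqrt(np.outer(d, d))          # symmetrically normalized adjacency
        n = W.shape[0]
        return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)

    # A 5-node path graph with one labeled node at each end.
    W = np.zeros((5, 5))
    for i in range(4):
        W[i, i + 1] = W[i + 1, i] = 1.0
    Y = np.zeros((5, 2))
    Y[0, 0] = 1.0                                # node 0 labeled class 0
    Y[4, 1] = 1.0                                # node 4 labeled class 1
    print(smooth_labels(W, Y).argmax(axis=1))    # labels spread along the path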

Miscellaneous

This spot is for articles that don't yet fit into a big enough category.
  1. Paul Komarek's logistic regression on steroids (not his term)
  2. Thomas Hofmann's probabilistic latent semantic analysis
  3. Blei, Ng, and Jordan's latent Dirichlet allocation