Stat 321: Learning from matrix valued data
Overview
Very commonly in statistics we arrange our
data into rows and columns.
There are n rows corresponding
to n IID observations
and p columns corresponding to p
fixed variables, often split into predictors
and one or more responses. The columns are
named entities that we wish to study.
The rows are anonymous,
with no intrinsic interest, apart from
what they tell us about the column variables.
Lots of data do not fit this paradigm.
Sometimes the specific rows in our data set
are just as important as the columns.
This type of problem is not new, but it has become
more prevalent recently. Here are some examples
of what the rows, columns, and entries are:
- Terms, documents, counts in information retrieval
- Genes, experiments, expression levels in microarray analysis
- Movies, customers, ratings in recommender systems
- Students, questions, correctness in item response theory
- Web pages and other web pages in link analysis
- Varieties and fertilizers in crop science
The data can always be cast as triples (row ID, col ID, Val)
which looks like a classic setup.
But then the data are very far from IID.
Also, categorical variables with many levels
(like phone numbers and IP addresses)
behave differently from ordinary categorical variables. For example,
there are usually unobserved levels.
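As a small illustration of the triples view, here is a Python sketch. The movie titles, customer names, ratings, and the helper to_triples are all made up for this example; none of them come from a course data set. It only shows the bookkeeping, not any particular method from the course.

```python
from collections import Counter

# Hypothetical toy ratings: rows are movies, columns are customers.
# None marks an entry that was never observed.
movies = ["Alien", "Brazil", "Casablanca"]
customers = ["ann", "bob", "cat", "dan"]
ratings = [
    [5, None, 3, None],
    [None, 4, None, None],
    [2, None, None, None],
]


def to_triples(row_ids, col_ids, matrix):
    """Return the observed entries as (row ID, col ID, value) triples."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(row_ids)
        for j, c in enumerate(col_ids)
        if matrix[i][j] is not None
    ]


triples = to_triples(movies, customers, ratings)

# Count how often each row ID and column ID appears among the triples.
row_counts = Counter(r for r, _, _ in triples)
col_counts = Counter(c for _, c, _ in triples)

# Customers that are known levels but have no observed entries at all.
unobserved_customers = [c for c in customers if col_counts[c] == 0]

for t in triples:
    print(t)
print("customers with no ratings:", unobserved_customers)
```

The same movie or customer shows up in several triples, which is one way to see why the triples should not be treated as IID rows. The customer with no ratings is an unobserved level in the sense above.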
About the name
Some years ago I called this topic 'transposable data'. The
reason is that both the data matrix and its transpose can
be looked at as having named columns and disposable rows used
to learn about the columns, depending on your goals. You could
analyze X on Monday, Wednesday, and Friday
while looking at X' on Tuesday and Thursday.
Here is the link
to the transposable data class I taught for Spring quarter 1999/2000.
I'm open to suggestions for a better name.
Maybe 'My Big Fat Data Matrix' will do.
Many of the data sets are sparsely sampled and so
fit the dyadic data framework. But many other
data settings are not dyadic.
There is overlap with "small n, big p" problems but
just as often it's "big n, big p".
Who should take it?
This course is aimed at people who want to learn about
methods, old and new, for large data sets with named rows and columns.
It is also useful for people looking for a field with opportunities for
new research problems.
Instructor
- Art Owen
- Sequoia Hall 130
- My userid is owenbuzzard on stat.stanfordbuzzard.edu
(remember to remove the carrion eaters)
- Office hour: Tuesday 11:00
Classes
MW 12:50-2:05 in Sequoia Hall 200
Starting Monday April 2
Topics
- Biplots and heatmaps
- Anova models, Rasch models, correspondence analysis
- Clustering, biclustering, spectral clustering
- SVD, non-negative matrix factorization, and generalizations
- PageRank, TrustRank and generalizations
- Prediction on graphs
- Tensor methods for three way data
- Matrix resampling and downsampling
- Random matrix theory and Tracy-Widom laws
- Graph based algorithms
Readings
We will draw on these
research articles
for background reading.
TA
- Hao Chen haochenPenguin@stanford.edu
Office Hours: Friday 11am to 1pm Sequoia 231
Delete the Antarctic bird from the TA's email
Evaluation
- Homework: 3 or 4 problem sets (100%)
I will assume that you can use R and Python. It is feasible to take
the course if you know one and are willing to learn the other.
Be sure to give Axess a working email address:
I expect to send a small number of important emails about
the problem sets there.
Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set.
HW turned in on the due date but after class ends
is one day late. The next day is two days late, and
so on.
Each day late is penalized by 10% of the homework value.
Homework more than 3 days late will ordinarily get 0.
If you're travelling, you can email a pdf file.
For sickness, interviews and other events,
up to 3 late days total are forgiven at the end of
the quarter. (Work late enough to get zero does not
get redeemed though.)
Problems
Problems (password given in class)