Stat 315c: Learning from matrix valued data

Overview

Very commonly in statistics we arrange our data into rows and columns. There are n rows corresponding to n IID observations and p columns corresponding to p fixed variables, often split into predictors and one or more responses. The columns are named entities that we wish to study. The rows are anonymous, with no intrinsic interest, apart from what they tell us about the column variables. Lots of data do not fit this paradigm. Sometimes the specific rows in our data set are just as important as the columns.

This type of problem is not new, but it has become more prevalent recently. Here are some examples of what the rows, columns, and entries are:

The data can always be cast as triples (row ID, col ID, Val), which looks like a classic setup. But then the data are very far from IID. Also, categorical variables with many levels (like phone numbers and IP addresses) behave differently from the usual ones. For example, there are usually unobserved levels.
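As a rough illustration of the (row ID, col ID, Val) representation, here is a minimal Python sketch. The names (row_ids, col_ids, to_triples, transpose_triples) and the toy values are hypothetical and chosen only for this example; None stands in for an entry that was never observed.

```python
# Hypothetical toy data: rows and columns are both named entities.
row_ids = ["alice", "bob", "carol"]
col_ids = ["movie_a", "movie_b"]

# None marks an entry that was never observed (sparse sampling).
matrix = [
    [5, None],
    [None, 3],
    [4, 1],
]


def to_triples(matrix, row_ids, col_ids):
    """Return (row ID, col ID, value) triples for the observed entries."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(row_ids)
        for j, c in enumerate(col_ids)
        if matrix[i][j] is not None
    ]


def transpose_triples(triples):
    """Swap the roles of rows and columns, i.e. the triples of the transpose."""
    return [(c, r, v) for (r, c, v) in triples]


triples = to_triples(matrix, row_ids, col_ids)
for t in triples:
    print(t)
```

Triples that share a row ID or a column ID (here, the two entries for "carol") come from the same entity, which is one way to see why the triples should not be treated as IID observations.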

About the name

Some years ago I called this topic 'transposable data'. The reason is that both the data matrix and its transpose can be looked at as having named columns and disposable rows used to learn about the columns, depending on your goals. You could analyze X on Monday, Wednesday, and Friday while looking at X' on Tuesday and Thursday. I'm open to suggestions for a better name. Maybe 'My Big Fat Data Matrix' will do. Many of the data sets are sparsely sampled and so fit the dyadic data framework. But many other data settings are not dyadic. There is overlap with "small n, big p" problems but just as often it's "big n, big p".

Who should take it?

This course is aimed at people who want to learn about methods, old and new, for large data sets with named rows and columns. It is also useful for people looking for a field with opportunities for new research problems.

Instructor

Art Owen
Sequoia Hall 130
My userid is owenbuzzard on stat.stanfordbuzzard.edu (remember to remove the carrion eaters)
Office hour: Wed 11:00

Classes

MW 2:15-3:30 in Sequoia Hall 200

Starting Wednesday April 1


Topics


Readings

The text 'Understanding Complex Datasets' by David Skillicorn covers several of these topics. We will also draw on these research articles for background reading.

TA


Evaluation

Be sure to give Axess a working email address:
I expect to send a small number of important emails about problem sets and the homework there. Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set. HW turned in on the due date but after class ends is one day late. The next day is two days late and so on. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a pdf file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)

Problems

Problems (passwd given in class)