Research overview
Here is a sample of the problem areas I'm interested in
along with a few of the related works. The
year by year list
has additional works, including some that don't fit into any of these topics.
- Transposable data
Many statistical data matrices have
both many rows and many columns. Often both
rows and columns correspond to entities of
interest (e.g. proteins and genes or movies
and customers) rather than IID samples. This is not
just the p>n setting, because in these problems it is
debatable which dimension plays the role of p and which
the role of n. So questions
arise as to how to bootstrap, cross-validate
and visualize such data. For example:
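As a minimal sketch of the bootstrap question, one simple scheme resamples rows and columns independently and keeps the submatrix they index. This is an illustration under assumptions, not necessarily the approach taken in the related works; the function names `resample_rows_and_columns` and `bootstrap_statistic` are hypothetical.

```python
import numpy as np


def resample_rows_and_columns(X, rng):
    """Draw rows and columns with replacement, independently of each other.

    Assumption: both the row entities and the column entities are treated
    as sampled, so both dimensions are resampled.
    """
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    row_idx = rng.integers(0, n_rows, size=n_rows)
    col_idx = rng.integers(0, n_cols, size=n_cols)
    # np.ix_ builds the cross product of the two index sets.
    return X[np.ix_(row_idx, col_idx)]


def bootstrap_statistic(X, stat, n_boot=200, seed=0):
    """Recompute `stat` on n_boot row-and-column resamples of X."""
    rng = np.random.default_rng(seed)
    return np.array(
        [stat(resample_rows_and_columns(X, rng)) for _ in range(n_boot)]
    )
```

For example, `bootstrap_statistic(X, np.mean).var()` gives a rough resampling variance for the grand mean of `X`. How well such a variance tracks the true one when both dimensions are random is exactly the kind of question raised above.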
- Empirical likelihood
Much of statistical inference is organized around the likelihood
function. That usually requires an unpleasant assumption that the
data come from one of the popular parametric families. Empirical
likelihood uses a data-determined likelihood function to avoid this.
There is no loss of power up to second order asymptotics and
it can either win or lose compared to the true likelihood at
third order.
-
Web page for the book
-
1988 Biometrika paper at
JSTOR for univariate mean
-
1990 Annals of Statistics paper at
JSTOR for multivariate mean
-
1991 Annals of Statistics paper at
JSTOR for linear models
-
Escaping the convex hull EJS (with Sarah Emerson)
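For the univariate mean, the empirical likelihood ratio can be sketched in a few lines. The sketch below profiles out the observation weights with a single Lagrange multiplier and calibrates -2 log R(mu) against a chi-squared distribution with one degree of freedom. The function names `el_log_ratio` and `el_pvalue` are hypothetical, and this is a minimal illustration rather than the implementation used in the works listed above.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2


def el_log_ratio(x, mu):
    """Return -2 log R(mu) for the mean of a univariate sample x.

    Weights are w_i = 1 / (n * (1 + lam * (x_i - mu))), where lam solves
    sum_i (x_i - mu) / (1 + lam * (x_i - mu)) = 0.
    """
    z = np.asarray(x, dtype=float) - mu
    # mu must lie strictly inside the range of the data; otherwise no
    # positive weights can reproduce it and the ratio is zero.
    if z.min() >= 0 or z.max() <= 0:
        return np.inf
    n = len(z)
    # Every weight is at most 1, so 1 + lam * z_i >= 1/n, which brackets lam.
    lo = (1.0 / n - 1.0) / z.max()
    hi = (1.0 / n - 1.0) / z.min()
    lam = brentq(lambda t: np.sum(z / (1.0 + t * z)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))


def el_pvalue(x, mu):
    """Asymptotic p-value for H0: E[x] = mu, using a chi-squared(1) limit."""
    return chi2.sf(el_log_ratio(x, mu), df=1)
```

A confidence interval for the mean is then the set of `mu` values whose p-value exceeds the chosen level. Because that set is always inside the range of the data (the convex hull in higher dimensions), the statistic is infinite outside it.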
- Monte Carlo and quasi-Monte Carlo
Monte Carlo integration typically gets a root mean
square error of O(n^(-1/2)). Quasi-Monte Carlo (QMC) sampling uses deterministic
points more uniformly distributed than random ones, and it gets
an error of O(n^(-1+epsilon)) for any epsilon>0.
Randomizing the QMC points (while preserving their uniformity)
allows replication based error estimates. It can also bring
error cancellation leading to RMSE O(n^(-3/2 + epsilon)).
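The comparison above can be tried numerically. The sketch below integrates a smooth test function over the unit cube with plain Monte Carlo and with scrambled Sobol' points from `scipy.stats.qmc`, using independent replicates to estimate the RMSE of each. The integrand, the dimension, and the helper names (`integrand`, `mc_estimate`, `rqmc_estimate`, `compare`) are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import qmc


def integrand(x):
    """Smooth test function on [0,1]^d whose exact integral is 1."""
    return np.prod(x + 0.5, axis=1)


def mc_estimate(n, d, rng):
    """Plain Monte Carlo: average the integrand over n IID uniform points."""
    return integrand(rng.random((n, d))).mean()


def rqmc_estimate(m, d, seed):
    """Randomized QMC: average over 2**m scrambled Sobol' points."""
    sobol = qmc.Sobol(d=d, scramble=True, seed=seed)
    return integrand(sobol.random_base2(m)).mean()


def rmse(estimates, truth):
    err = np.asarray(estimates) - truth
    return float(np.sqrt(np.mean(err ** 2)))


def compare(m=10, d=4, reps=20, seed=0):
    """Return (MC RMSE, RQMC RMSE) from `reps` independent replicates."""
    rng = np.random.default_rng(seed)
    n = 2 ** m
    mc = [mc_estimate(n, d, rng) for _ in range(reps)]
    # Each replicate uses its own scrambling seed, which is what makes
    # a replication-based error estimate possible for QMC.
    rq = [rqmc_estimate(m, d, seed + r) for r in range(reps)]
    return rmse(mc, 1.0), rmse(rq, 1.0)
```

Calling `compare()` returns the two RMSE values at n = 1024. Repeating it for several values of `m` gives a rough picture of the two convergence rates for this particular integrand.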
- Bioinformatics
The following papers are motivated by large data problems
in biology. Most of the work was done with the
Stuart Kim lab
or as followup on theoretical holes that became
evident in that work.
-
Plaid
models with Laura Lazzeroni.
-
A
gene recommender for completing partially known
clusters (with Kim lab). It is a kind of supervised correlation.
-
The
AGEMAP project (with Kim lab and Kevin Becker's lab
at the National Institute on Aging).
-
Another look at Karl Pearson's
meta-analysis
method, wrongly thought for over 50 years to be inadmissible. It actually
beats Fisher's test on certain alternatives, when the null hypotheses
tend to be violated in the same direction as each other.
-
Aging in the human
kidney
and
muscle with the Kim lab.
-
For correlated hypothesis tests
this paper finds the
variance
of the number of false discoveries.