Information Retrieval and the Vector Space Model
Art B. Owen
Stanford University
owen "at" stat.stanford.edu
Search Engines
Goal: Find documents relevant to a query
Examples:
- Boolean query:
Monte Carlo AND (importance OR stratification) AND NOT Chevrolet
- Natural language query:
Is it raining in Topanga?
- List of words:
Efron bootstrap resample
Word counts
Most engines use word counts in documents
Most use other things too
- links
- titles
- position of word in document
- sponsorship
- present and past user feedback
Term Document Matrix
$X_{ij}$ = number of times term $i$ appears in document $j$ (a small sketch of building $X$ follows this slide)
Documents
- web page
- article
- section
- paragraph
- sentence
Terms
- word e.g. ``airplane''
- n-gram e.g. ``airp'', ``irpl'', ``rpla'', ``plan'', ``lane''
- collocation e.g. ``white house'' or ``New York''
Term-document matrices are huge and sparse
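A minimal Python sketch of building $X$ from a toy corpus; the three example documents and the whitespace tokenizer are illustrative assumptions, and a real engine would store $X$ sparsely:

    from collections import Counter

    # Toy corpus: three tiny "documents" (assumed for illustration only)
    docs = ["the boat sailed from the harbor",
            "a ship left port at dawn",
            "the bootstrap resamples the data"]

    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})

    # X[i][j] = count of term i in document j; almost all entries are 0,
    # which is why real systems use sparse storage
    counts = [Counter(doc) for doc in tokenized]
    X = [[c[term] for c in counts] for term in vocab]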
Further processing
- Stop words
- Ignore very common words
``the'', ``and'', ``what''
- Stemming
- Strip words to root
reformation, reformative, reformed, reforming $\to$ reform
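A rough sketch of both steps; the stop list and suffix rules below are toy assumptions, and a real engine would use a full stop list plus a proper algorithm such as the Porter stemmer:

    # Toy stop list and suffix rules (assumptions, not a real stemmer)
    STOP_WORDS = {"the", "and", "what", "a", "of", "in", "is"}
    SUFFIXES = ["ation", "ative", "ing", "ed"]

    def stem(word):
        # strip the first matching suffix, keeping a minimal root
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)]
        return word

    def preprocess(tokens):
        return [stem(t) for t in tokens if t not in STOP_WORDS]

    print(preprocess("the reformation and reformed what".split()))
    # -> ['reform', 'reform']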
tf-idf
Term frequency, inverse document frequency
$w_{ij} = \mathrm{tf}_{ij} \times \log(n / \mathrm{df}_i)$
where
$\mathrm{tf}_{ij} = X_{ij}$ is the count of term $i$ in document $j$, $\mathrm{df}_i$ is the number of documents containing term $i$, and $n$ is the total number of documents
There are many variations
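A sketch of the variant above, applied to the count matrix $X$ from the earlier sketch (rows = terms, columns = documents):

    import math

    def tf_idf(X):
        # w_ij = tf_ij * log(n / df_i); terms appearing in every
        # document get weight 0 under this variant
        n = len(X[0])                              # number of documents
        W = []
        for row in X:
            df = sum(1 for c in row if c > 0)      # docs containing the term
            idf = math.log(n / df) if df else 0.0
            W.append([c * idf for c in row])
        return W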
Example
Upper left 10x10 corner of the weighted term-document matrix (almost entirely zero):

0     0     0     0     3.08  0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     2.67  0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     4.39  0     0     0
0     0     0     0     0     0     0     0     0     0
0     0     0     0     0     0     0     0     0     0
Example ctd
df | count | Term
(df = number of documents containing the term; count = total occurrences)
24 | 29 | TOMORROW
28 | 31 | SPENT
7 | 12 | FACTS
38 | 63 | EXPLOSIVES
24 | 29 | LEADING
36 | 45 | NATIONS
9 | 18 | 0
58 | 91 | 1
44 | 85 | 2
23 | 31 | OPPORTUNITY
74 | 136 | GENERAL
8 | 10 | TEARS
11 | 13 | VIDEOTAPE
17 | 32 | DEVICES
37 | 43 | FACE
13 | 14 | ALONE
33 | 35 | ALONG
27 | 37 | HAVEN
86 | 137 | FACT
Vector space
Each document is a vector
$d_j = (w_{1j}, w_{2j}, \ldots, w_{pj})$
of transformed counts
Document similarity could be the cosine
$\mathrm{sim}(d_j, d_k) = \frac{d_j^\top d_k}{\|d_j\|\,\|d_k\|}$
A query $q$ is a (very short) document
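A sketch of ranking documents by this similarity; the function names and the dense vectors are assumptions for illustration:

    import math

    def cosine(u, v):
        # cosine of the angle between two weighted vectors
        dot = sum(a * b for a, b in zip(u, v))
        norms = (math.sqrt(sum(a * a for a in u))
                 * math.sqrt(sum(b * b for b in v)))
        return dot / norms if norms else 0.0

    def rank(query_vec, doc_vecs):
        # treat the query as a short document; doc_vecs holds one
        # vector per document (e.g. the columns of W); best match first
        scores = [(cosine(query_vec, d), j) for j, d in enumerate(doc_vecs)]
        return sorted(scores, reverse=True)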
Precision-recall
Given $q$, rank the $n$ documents in order of relevance
Suppose there are $R$ truly relevant documents
Precision$(k)$ =
% of the first $k$ ranked documents that are relevant
Recall$(k)$ =
% of the $R$ relevant documents among the first $k$ ranked documents
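A small sketch of both quantities for a ranked list of document ids:

    def precision_recall_at_k(ranked_ids, relevant, k):
        # relevant is the set of the R truly relevant document ids
        hits = sum(1 for doc in ranked_ids[:k] if doc in relevant)
        return hits / k, hits / len(relevant)

    # e.g. 3 relevant docs among the first 5 results, R = 10 overall:
    # precision(5) = 3/5 = 0.6 and recall(5) = 3/10 = 0.3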
Transposing it
A document has a weighted list of words
A word has a weighted list of documents
Query with a list of documents:
- Today's documents ... the word NASDAQ is hot
- All documents in bovine set
- All documents in dental set
Also
``Words are known by the company they keep''
Do ``boat'' queries find ``ship'' docs?
Maybe we should ``cluster'' the terms
Let $X = (X_{ij})$ be the $p \times n$ term-document matrix
Clustering: approximate by
$X_{ij} \approx \sum_c Z_{ic}\,\mu_{cj}$
$\mu_c$ is the $c$'th cluster mean
$Z_{ic}$ is $1$ if term $i$ is in cluster $c$, zero else
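A minimal k-means sketch over the rows (terms) of $X$; the random initialization and fixed iteration count are simplifying assumptions:

    import random

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def kmeans_rows(X, k, iters=20, seed=0):
        rng = random.Random(seed)
        means = [row[:] for row in rng.sample(X, k)]   # initial cluster means
        for _ in range(iters):
            # assign each term (row) to its nearest cluster mean
            assign = [min(range(k), key=lambda c: sqdist(row, means[c]))
                      for row in X]
            # recompute each mean as the average of its assigned rows
            for c in range(k):
                members = [row for row, a in zip(X, assign) if a == c]
                if members:
                    means[c] = [sum(col) / len(members)
                                for col in zip(*members)]
        return assign, means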
Latent semantic indexing
SVD:
$X = U \Sigma V^\top$
May find a nautical singular vector $u_k$ with
``boat'' and ``ship'' and ``starboard'' etc.
Run queries on the rank-$k$ approximation $X_k = U_k \Sigma_k V_k^\top$, mapping the query to $\hat q = \Sigma_k^{-1} U_k^\top q$
- SVD looks a bit like clusters
- First few singular values have ``less noise''
- Scoring documents with $k$-dimensional vectors is much faster than with $p$-dimensional ones
- Less storage too
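A sketch of this pipeline, assuming numpy and a dense $X$; production LSI would use a sparse truncated SVD, and the function names here are illustrative:

    import numpy as np

    def lsi_fit(X, k):
        # rank-k SVD factors of the (terms x documents) matrix
        U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float),
                                 full_matrices=False)
        return U[:, :k], s[:k], Vt[:k, :]

    def lsi_query(q, U_k, s_k, Vt_k):
        # fold the term-space query q into the k-dimensional latent space
        q_hat = (U_k.T @ q) / s_k
        docs = Vt_k.T                       # one k-vector per document
        sims = docs @ q_hat / (np.linalg.norm(docs, axis=1)
                               * np.linalg.norm(q_hat) + 1e-12)
        return np.argsort(-sims)            # document indices, best first

    # usage: order = lsi_query(q, *lsi_fit(X, k=50))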
References
- Manning and Schütze (1999), ``Foundations of Statistical Natural Language Processing'', MIT Press
- Grammar and Parsing
- Statistical models of word frequency
- Witten, Moffat and Bell (1999), ``Managing Gigabytes'', 2nd Edition, Morgan Kaufmann
- Compression and index construction
- Searching compressed data
- Berry and Browne (1999), ``Understanding Search Engines'', SIAM
- Describes a course project building an engine
- Berry, Dumais and O'Brien (1995), ``Using Linear Algebra for Intelligent Information Retrieval'', SIAM Review, 37(4), 573-595