next up previous
Next: About this document ...


Information Retrieval, and the Vector Space Model

Art B. Owen
Stanford University
owen "at" stat.stanford.edu



Search Engines

Goal: Find documents relevant to a query

Examples:
  1. Boolean query:
    Monte Carlo AND (importance OR stratification) AND NOT Chevrolet
  2. Natural language query:
    Is it raining in Topanga?
  3. List of words:
    Efron bootstrap resample


Word counts

Most engines use word counts in documents

Most use other things too



Term Document Matrix

1#1 number of times term 2#2 is in document 3#3

Documents
  1. web page
  2. article
  3. section
  4. paragraph
  5. sentence

Terms
  1. word e.g. ``airplane''
  2. n-gram e.g. ``airp'', ``irpl'', ``rpla'', ``plan'', ``lane''
  3. collocation e.g. ``white house'' or ``New York''

Term-document matrices are huge and sparse



Further processing

Stop words
Ignore very common words
``the'' ``and'' ``what''
Stemming
Strip words to root
reformation reformative reformed reforming
4#4 reform

tf-idf
Term frequency, inverse document frequency


5#5

where

6#6



There are many variations


Example

Upper left 10x10 corner
0 0 0 0 3.08 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 2.67 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 4.39 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0


Example ctd

7#7 8#8 Term
24 29 TOMORROW
28 31 SPENT
7 12 FACTS
38 63 EXPLOSIVES
24 29 LEADING
36 45 NATIONS
9 18 0
58 91 1
44 85 2
23 31 OPPORTUNITY
74 136 GENERAL
8 10 TEARS
11 13 VIDEOTAPE
17 32 DEVICES
37 43 FACE
13 14 ALONE
33 35 ALONG
27 37 HAVEN
86 137 FACT


Vector space

Each document is a vector 9#9 of transformed counts

Document similarity could be

10#10

A query 11#11 is a (very short) document

Precision-recall

Given 11#11 rank 12#12 documents in order of relevance

Suppose there are 13#13 truly relevant documents

Precision 14#14
% of first 15#15 ranked documents that are relevant
Recall 14#14
% of 13#13 relevant documents among first 15#15 ranked documents



Transposing it

A document has a weighted list of words

A word has a weighted list of documents

Query with a list of documents:
  1. Todays documents...word NASDAQ is hot
  2. All documents in bovine set
  3. All documents in dental set

Also
``Words are known by the company they keep''



Do ``boat'' queries find ``ship'' docs?

Maybe we should ``cluster'' the terms

Let 16#16

Clustering: approximate by

17#17

18#18 is 19#19'th cluster mean

20#20 is 21#21 if term 22#22 in cluster 19#19, zero else



Latent semantic indexing

SVD:

23#23



May find a nautical singular vector 24#24 with ``boat'' and ``ship'' and ``starboard'' etc.

Run queries on 25#25 with 26#26



References

  1. Manning and Shutze (1999) ``Foundations of Statistical Natural Language Processing'', MIT Press
    • Grammar and Parsing
    • Statistical models of word frequency
  2. Witten, Moffat and Bell (1999) ``Managing Gigabytes'' 2nd Edition. Morgan Kaufman.
    • Compression and index construction
    • Searching compressed data
  3. Berry and Browne (1999) ``Understanding Search Engines'' SIAM
    • Describe a course project building an engine
  4. Berry, Dumais, O'Brien (1995) ``Using Linear Algebra for Intelligent Information Retrieval'', SIAM Review v37 N4 pp573-595
    • Emphasize SVD updates






next up previous
Next: About this document ...
Art Owen
2000-04-05