2006.03.09 - SLIDE 1 - IS 240 – Spring 2006

Principles of Information Retrieval
Lecture 16: Filtering & TDT

Prof. Ray Larson
University of California, Berkeley
School of Information Management & Systems
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2006
http://www.sims.berkeley.edu/academics/courses/is240/s06/
SLIDE 2

Overview

• Review – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
SLIDE 4

How LSI Works

• Start with a matrix of terms by documents
• Analyze the matrix using SVD to derive a particular “latent semantic structure model”
• Two-mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns – such as terms and documents
SLIDE 5

How LSI Works

• The rectangular matrix is decomposed by SVD into three other matrices of a special form
  – The resulting matrices contain “singular vectors” and “singular values”
  – The matrices show a breakdown of the original relationships into linearly independent components or factors
  – Many of these components are very small and can be ignored – leading to an approximate model that contains many fewer dimensions
SLIDE 6

How LSI Works

Titles:
C1: Human machine interface for LAB ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey

Italicized words occur in multiple docs and are indexed.
SLIDE 7

How LSI Works

Terms      c1 c2 c3 c4 c5 m1 m2 m3 m4
Human       1  0  0  1  0  0  0  0  0
Interface   1  0  1  0  0  0  0  0  0
Computer    1  1  0  0  0  0  0  0  0
User        0  1  1  0  1  0  0  0  0
System      0  1  1  2  0  0  0  0  0
Response    0  1  0  0  1  0  0  0  0
Time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
Survey      0  1  0  0  0  0  0  0  1
Trees       0  0  0  0  0  1  1  1  0
Graph       0  0  0  0  0  0  1  1  1
Minors      0  0  0  0  0  0  0  1  1
SLIDE 8

How LSI Works

[Figure: SVD to 2 dimensions. The 12 terms (blue dots) and 9 documents (red squares) of the example plotted in the two-dimensional SVD space. The blue square is the query “Human Computer Interaction”; the dotted cone marks cosine 0.9 from the query. Even documents with no terms in common with the query (c3 and c5) lie within the cone.]
SLIDE 9

How LSI Works

X = T0 S0 D0’

where X is the t×d term-by-document matrix, T0 is t×m, S0 is m×m, and D0’ is m×d:
– T0 has orthogonal, unit-length columns (T0’ T0 = I)
– D0 has orthogonal, unit-length columns (D0’ D0 = I)
– S0 is the diagonal matrix of singular values
– t is the number of rows in X
– d is the number of columns in X
– m is the rank of X (<= min(t, d))
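The decomposition can be sketched with numpy: compute the SVD of a term-by-document matrix, keep only the k largest singular values, and use the truncated factors as the reduced-dimension model. The 4×3 matrix below is a made-up toy example, not the slide’s 12×9 matrix.

```python
import numpy as np

# Toy term-by-document matrix X (t = 4 terms, d = 3 documents);
# illustrative counts only.
X = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

# SVD: X = T0 @ diag(S0) @ D0t, with orthonormal columns in T0
# and orthonormal rows in D0t; S0 holds the singular values.
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values -- the rank-k approximation
# that ignores the small components.
k = 2
Xk = T0[:, :k] @ np.diag(S0[:k]) @ D0t[:k, :]

# Document coordinates in the reduced k-dimensional "latent" space.
doc_coords = np.diag(S0[:k]) @ D0t[:k, :]   # shape (k, d)
```

Queries can then be folded into the same reduced space and compared to the document coordinates by cosine similarity.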
SLIDE 10

Overview

• Review – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
SLIDE 11

Filtering

• Characteristics of filtering systems:
  – Designed for unstructured or semi-structured data
  – Deal primarily with text information
  – Deal with large amounts of data
  – Involve streams of incoming data
  – Filtering is based on descriptions of individual or group preferences – profiles. Profiles may be negative (e.g. junk mail filters)
  – Filtering implies removing non-relevant material, as opposed to selecting relevant material
SLIDE 12

Filtering

• Similar to IR, with some key differences
• Similar to routing – sending relevant incoming data to different individuals or groups is virtually identical to filtering with multiple profiles
• Similar to categorization systems – attaching one or more predefined categories to incoming data objects – but categorization is more concerned with static categories (and might be considered information extraction)
SLIDE 13

Structure of an IR System

[Diagram: an information storage and retrieval system. A search line carries interest profiles & queries, which are formulated in terms of descriptors and stored as profiles/search requests (Store 1). A storage line carries documents & data, which undergo indexing (descriptive and subject) and are stored as document representations (Store 2). Comparison/matching of the two stores yields potentially relevant documents. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Adapted from Soergel, p. 19.]
SLIDE 14

Structure of a Filtering System

[Diagram: an information filtering system. Interest profiles from individual or group users are formulated in terms of descriptors and stored as profiles/search requests (Store 1). Raw documents & data arrive as an incoming data stream and undergo indexing/categorization/extraction to become a stream of document surrogates. Comparison/filtering of the surrogate stream against the stored profiles yields potentially relevant documents. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Adapted from Soergel, p. 19.]
SLIDE 15

Major differences between IR and Filtering

• IR is concerned with single uses of the system
• IR recognizes inherent faults of queries
  – Filtering assumes profiles can be better than IR queries
• IR is concerned with collection and organization of texts
  – Filtering is concerned with distribution of texts
• IR is concerned with selection from a static database
  – Filtering is concerned with a dynamic data stream
• IR is concerned with single interaction sessions
  – Filtering is concerned with long-term changes
SLIDE 16

Contextual Differences

• In filtering, the timeliness of the text is often of greatest significance
• Filtering often has a less well-defined user community
• Filtering often has privacy implications (how complete are user profiles? what do they contain?)
• Filtering profiles can (should?) adapt to user feedback
  – Conceptually similar to relevance feedback
SLIDE 17

Methods for Filtering

• Adapted from IR
  – E.g. use a retrieval ranking algorithm against incoming documents
• Collaborative filtering
  – Individual and comparative profiles
SLIDE 18

TREC Filtering Track

• Original Filtering Track
  – Participants are given a starting query
  – They build a profile using the query and the training data
  – The test involves submitting the profile (which is not changed) and then running it against a new data stream
• New Adaptive Filtering Track
  – Same, except the profile can be modified as each new relevant document is encountered
• Since streams are being processed, there is no ranking of documents
SLIDE 19

TREC-8 Filtering Track

• Following slides from the TREC-8 overview by Ellen Voorhees
• http://trec.nist.gov/presentations/TREC8/overview/index.htm
SLIDES 20–23

[Result figures from the TREC-8 overview; no text content.]
SLIDE 24

Overview

• Review – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
SLIDE 25

TDT: Topic Detection and Tracking

• Intended to automatically identify new topics – events, etc. – from a stream of text, and to follow the development/further discussion of those topics
SLIDE 26

Topic Detection and Tracking

• Introduction and Overview
  – The TDT3 R&D Challenge
  – TDT3 Evaluation Methodology

Slides from “NIST Topic Detection and Tracking – Introduction and Overview” by G. Doddington
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm
SLIDE 27

TDT Task Overview*

• 5 R&D Challenges:
  – Story Segmentation
  – Topic Tracking
  – Topic Detection
  – First-Story Detection
  – Link Detection
• TDT3 Corpus Characteristics:†
  – Two types of sources: text and speech
  – Two languages: English (30,000 stories) and Mandarin (10,000 stories)
  – 11 different sources:
    • 8 English: ABC, CNN, VOA, PRI, NBC, MNB, APW, NYT
    • 3 Mandarin: VOA, XIN, ZBN

* see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details
† see http://morph.ldc.upenn.edu/Projects/TDT3/ for details
SLIDE 28

Preliminaries

A topic is …
  a seminal event or activity, along with all directly related events and activities.

A story is …
  a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.
SLIDE 29

Example Topic

Title: Mountain Hikers Lost
– WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.
– WHERE: Orres, France
– WHEN: January 1998
– RULES OF INTERPRETATION: 5. Accidents
SLIDE 30

The Segmentation Task:

To segment the source stream into its constituent stories, for all audio sources (radio and TV only).

[Diagram: audio is transcribed to text (words), which is then divided into story and non-story segments.]
SLIDE 31

Story Segmentation Conditions

• 1 language condition: both English and Mandarin
• 3 audio source conditions: manual transcription; ASR transcription; original audio data
• 3 decision deferral conditions (maximum decision deferral period):

  Source                        Deferral 1   Deferral 2   Deferral 3
  Text, English (words)            100         1,000       10,000
  Text, Mandarin (characters)      150         1,500       15,000
  Audio, both (seconds)             30           300        3,000
SLIDE 32

The Topic Tracking Task:

To detect stories that discuss the target topic, in multiple source streams.

• Find all the stories that discuss a given target topic
  – Training: given Nt sample stories that discuss a given target topic
  – Test: find all subsequent stories that discuss the target topic

[Diagram: a timeline with Nt on-topic training stories followed by unknown test data. New this year: the unknown stories are not guaranteed to be off-topic.]
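The tracking setup can be sketched as a simple centroid tracker: merge the Nt training stories into one term vector, then make a yes/no decision for each subsequent story by thresholding its similarity to that centroid. This is one plausible baseline, not the method any TDT site actually used; the stories and threshold are made up.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track(training_stories, test_stories, threshold=0.25):
    """Yes/no tracking decisions for each test story, against the
    centroid of the Nt on-topic training stories."""
    centroid = Counter()
    for story in training_stories:
        centroid.update(story.lower().split())
    return [cosine(centroid, Counter(s.lower().split())) >= threshold
            for s in test_stories]

training = ["avalanche buries hikers in france",
            "rescue teams search avalanche site"]
test = ["avalanche rescue continues in france",
        "stock markets fall sharply"]
decisions = track(training, test)
# First test story is on-topic; the second is not.
```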
SLIDE 33

Topic Tracking Conditions

• 9 training conditions (Nt sample stories × training language):

  Nt   English (E)   Mandarin (M)   Both
  1    1 (E)         1 (M)          1 (E), 1 (M)
  2    2 (E)         2 (M)          2 (E), 2 (M)
  4    4 (E)         4 (M)          4 (E), 4 (M)

• 1 language test condition: both English and Mandarin
• 3 source conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• 2 story boundary conditions: reference story boundaries provided; no story boundaries provided
SLIDE 34

The Topic Detection Task:

To detect topics in terms of the (clusters of) stories that discuss them.

– Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics.
– New topics must be detected as the incoming stories are processed.
– Input stories are then associated with one of the topics.
SLIDE 35

Topic Detection Conditions

• 3 language conditions: English only; Mandarin only; English and Mandarin together
• 3 source conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• decision deferral conditions: maximum decision deferral period of 1, 10, or 100 source files
• 2 story boundary conditions: reference story boundaries provided; no story boundaries provided
SLIDE 36

The First-Story Detection Task:

To detect the first story that discusses a topic, for all topics.

• There is no supervised topic training (as in Topic Detection)

[Diagram: stories on a timeline; for each topic, the first story is flagged and the later stories on that topic are not.]
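A common baseline for this task (a sketch, not a prescribed TDT method) compares each arriving story to all earlier stories and flags it as a first story when its best match falls below a similarity threshold. The stories and threshold here are illustrative.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_story_flags(stream, threshold=0.4):
    """Yes/no first-story decision for each story in arrival order:
    a story is 'first' if no earlier story is sufficiently similar."""
    seen, flags = [], []
    for story in stream:
        vec = Counter(story.lower().split())
        is_first = all(cosine(vec, old) < threshold for old in seen)
        flags.append(is_first)
        seen.append(vec)
    return flags

stream = [
    "avalanche buries hikers",          # first story on topic 1
    "hikers rescued after avalanche",   # follows topic 1
    "court rules on merger",            # first story on topic 2
]
flags = first_story_flags(stream)
```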
SLIDE 37

First-Story Detection Conditions

• 1 language condition: English only
• 3 source conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• decision deferral conditions: maximum decision deferral period of 1, 10, or 100 source files
• 2 story boundary conditions: reference story boundaries provided; no story boundaries provided
SLIDE 38

The Link Detection Task

To detect whether a pair of stories discuss the same topic.

• The topic discussed is a free variable.
• Topic definition and annotation is unnecessary.
• The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking).
• The link detection task is related to the topic tracking task, with Nt = 1.
SLIDE 39

Link Detection Conditions

• 1 language condition: English only
• 3 source conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• decision deferral conditions: maximum decision deferral period of 1, 10, or 100 source files
• 1 story boundary condition: reference story boundaries provided
SLIDE 40

TDT3 Evaluation Methodology

• All TDT3 tasks are cast as statistical detection (yes-no) tasks:
  – Story Segmentation: Is there a story boundary here?
  – Topic Tracking: Is this story on the given topic?
  – Topic Detection: Is this story in the correct topic-clustered set?
  – First-Story Detection: Is this the first story on a topic?
  – Link Detection: Do these two stories discuss the same topic?
• Performance is measured in terms of detection cost, a weighted sum of miss and false alarm probabilities:
  CDet = CMiss · PMiss · Ptarget + CFA · PFA · (1 − Ptarget)
• Detection cost is normalized to lie between 0 and 1:
  (CDet)Norm = CDet / min{CMiss · Ptarget, CFA · (1 − Ptarget)}
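The two formulas can be computed directly. In this sketch the default constants (CMiss = 1, CFA = 0.1, Ptarget = 0.02) are illustrative stand-ins, not necessarily the official TDT3 settings.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Return (CDet, normalized CDet) per the formulas above.
    The default constants are illustrative, not official TDT3 values."""
    # Weighted sum of miss and false alarm probabilities.
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize by the cost of the trivial all-yes or all-no system.
    norm = c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det, norm
```

A perfect system (no misses, no false alarms) scores 0; a system that misses everything scores a normalized cost of 1.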
SLIDE 41

Example Performance Measures:

[Figure: normalized tracking cost (log scale, 0.01 to 1) for English and Mandarin – tracking results on newswire text (BBN).]
SLIDE 42

More on TDT

• Some slides from James Allan from the HICSS meeting in January 2005