Agenda
• Word indexes
• Web Search Engine
• Information Retrieval Systems
• Metasearch engines
• Questions
In search of extended dynamic memory
Word Index: Onomastics of Catalan names
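A word index of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not the method used for the Catalan onomastic index: the `build_word_index` function and the sample names are hypothetical, chosen to show how each word maps to the entries in which it appears.

```python
from collections import defaultdict


def build_word_index(entries):
    """Map each lowercased word to the sorted entry numbers that contain it."""
    index = defaultdict(set)
    for number, entry in enumerate(entries):
        for word in entry.lower().split():
            index[word].add(number)
    # Sort the words so the index reads alphabetically
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


# Hypothetical sample entries (illustrative only)
names = ["Jordi Vila", "Montserrat Puig", "Jordi Puig"]
word_index = build_word_index(names)
print(word_index["jordi"])  # entry numbers where "jordi" occurs
```

Storing entry numbers rather than the entries themselves keeps the index small; the original list is still needed to show the full name.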
Web Search Engine
• Programming language: Python
• High-RAM handling
• Shared storage
• Parallel processing
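As a rough sketch of how parallel processing and a shared store could fit together, the example below indexes pages concurrently and then merges the partial results into one index. It is an assumption-laden illustration, not the design from the slides: `index_page`, `merge_indexes`, and the sample pages are hypothetical names, and a thread pool stands in for whatever parallel backend a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def index_page(item):
    """Build a partial index {word: [url]} for a single (url, text) pair."""
    url, text = item
    return {word: [url] for word in set(text.split())}


def merge_indexes(partials):
    """Merge partial indexes into one shared index."""
    merged = {}
    for partial in partials:
        for word, urls in partial.items():
            merged.setdefault(word, []).extend(urls)
    return merged


# Hypothetical in-memory pages (illustrative only)
pages = [
    ("page1", "python search engine"),
    ("page2", "parallel search"),
]

# executor.map preserves input order, so the merged result is deterministic
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_indexes = list(executor.map(index_page, pages))

shared_index = merge_indexes(partial_indexes)
print(shared_index["search"])
```

Threads suit I/O-bound work such as fetching pages. For CPU-bound indexing in CPython, swapping in `ProcessPoolExecutor` from the same module would be the usual choice.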
Web Search Engine
http://nlp.stanford.edu/IR-book/pdf/19web.pdf, p. 434
Python Code: Web Search Engine
def crawl_web(seed):  # returns index, graph of outgoing links
    tocrawl = [seed]
    crawled = []
    graph = {}  # <url>, [list of pages it links to]
    index = {}
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # get_page returns None for pages missing from the cache
            content = get_page(page) or ""
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph

def get_next_target(page):
    # Find the next link tag and return its URL and the end position
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def union(a, b):
    # Append to a, in place, every element of b not already in a
    for e in b:
        if e not in a:
            a.append(e)

def add_page_to_index(index, url, content):
    words = content.split()
    pos = 0
    for word in words:
        # Character offset of the word, searching from the previous match
        pos = content.find(word, pos)
        add_to_index(index, word, url, pos)

def add_to_index(index, keyword, url, pos):
    if keyword in index:
        index[keyword].append([url, pos])
    else:
        index[keyword] = [[url, pos]]

def lookup(index, keyword):
    if keyword in index:
        return index[keyword]
    else:
        return None

cache = {
    'http://www.udacity.com/cs101x/final/multi.html': """<html>
<body>
<a href="http://www.udacity.com/cs101x/final/a.html">A</a><br>
<a href="http://www.udacity.com/cs101x/final/b.html">B</a><br>
</body>""",
    'http://www.udacity.com/cs101x/final/b.html': """<html><body>
Monty likes the Python programming language
Thomas Jefferson founded the University of Virginia
When Mandela was in London, he visited Nelson's Column.
</body></html>""",
    'http://www.udacity.com/cs101x/final/a.html': """<html><body>
Monty Python is not about a programming language
Udacity was not founded by Thomas Jefferson
Nelson Mandela said "Education is the most powerful weapon which you can
use to change the world."</body></html>""",
}

def get_page(url):
    if url in cache:
        return cache[url]
    else:
        print("Page not in cache: " + url)
        return None
http://www.udacity.com/cs101
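The `lookup` function above handles one keyword at a time. A possible extension, sketched here as an assumption rather than as part of the original code, is a query that returns only the pages containing every keyword. The `lookup_all` helper and the hand-built `sample_index` are hypothetical; the index uses the same keyword-to-`[url, pos]` shape that `add_to_index` produces.

```python
def lookup_all(index, keywords):
    """Return the sorted URLs that contain every keyword (AND query)."""
    result = None
    for keyword in keywords:
        urls = {url for url, _pos in index.get(keyword, [])}
        result = urls if result is None else result & urls
    return sorted(result or [])


# Hand-built index in the shape add_to_index produces:
# keyword -> [[url, position], ...]  (URLs and positions are illustrative)
sample_index = {
    "Monty": [["a.html", 13], ["b.html", 13]],
    "Python": [["a.html", 19], ["b.html", 29]],
    "likes": [["b.html", 19]],
}

print(lookup_all(sample_index, ["Monty", "Python"]))
print(lookup_all(sample_index, ["Monty", "likes"]))
print(lookup_all(sample_index, ["Monty", "Udacity"]))
```

Intersecting sets of URLs ignores word positions; a phrase query would also need to compare the stored offsets.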
Information Retrieval Systems
Metasearch Engines
• The union of searches (queries) run across several search engines, that is, across their search indexes
http://dg3rtljvitrle.cloudfront.net/slides/chap10.pdf
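The union of results from several engines can be sketched as follows. The slides only state that results are combined; the scoring rule used here (summing 1/rank across engines) is an assumption picked for illustration, and `metasearch`, `engine_a`, and `engine_b` are hypothetical stand-ins that return fixed lists instead of calling real search engines.

```python
def metasearch(query, engines):
    """Send one query to every engine and merge the ranked results.

    Each URL scores the sum of 1 / rank over the engines that returned it,
    so URLs found by several engines, or ranked high, come first.
    """
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Sort by descending score, then by URL for a stable order
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical stand-ins for real search engines (no network access)
def engine_a(query):
    return ["x.example/1", "x.example/2", "x.example/3"]


def engine_b(query):
    return ["x.example/2", "x.example/4"]


merged = metasearch("python", [engine_a, engine_b])
print(merged)
```

Because the scores live in a dictionary keyed by URL, a page returned by both engines appears once in the merged list, which is the deduplication a plain union needs.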