Friday, November 16, 2012

Week 12



The deep web stores content in searchable databases that respond only to specific queries

BrightPlanet allows for surface and deep web requests

Searching both the surface and the deep web is essential for the user to retrieve the maximum amount of information.

Current search engines retrieve only about 1 in 3,000 of the pages available.

The World Wide Web is only one part of the Internet, which also includes FTP, email, news, telnet, Gopher, and other services

Search engine dissatisfaction has increased steadily since 1997.

Search engines gather information by crawling (or "spidering"): they follow and record every hyperlink on the pages they visit

Authors can also submit their pages

Surface web has 2.5 billion documents

BrightPlanet is a directed query engine

The NEC found:

  • Surface Web coverage by individual, major search engines has dropped from a maximum of 32% in 1998 to 16% in 1999, with Northern Light showing the largest coverage.
  • Metasearching using multiple search engines can improve retrieval coverage by a factor of 3.5 or so, though combined coverage from the major engines dropped to 42% from 1998 to 1999.
  • More popular Web documents, that is, those with many link references from other documents, have up to an eight-fold greater chance of being indexed by a search engine than those with no link references.

  • Deep web documents are on average 27% smaller than surface web documents
  • Deep web is 500 times larger than surface web
  • Searching needs to include the whole web
  • Directed query technology is the only means to integrate deep and surface Web information.


The simplest crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining if it has already seen a URL.

The crawler makes an HTTP request to fetch a page; once it has the page, it scans it for links to other URLs
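The queue-plus-seen-set idea above can be sketched roughly as follows. This is only a minimal illustration, not how any real search engine crawls: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are made up for this example, and the `fetch` argument stands in for a real HTTP request so the sketch runs without a network connection. It ignores speed, politeness, exclusions, and duplicate content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order visited."""
    frontier = deque([seed])   # queue of URLs yet to be visited
    seen = {seed}              # set gives a fast "already seen?" check
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)      # stand-in for an HTTP GET
        if html is None:       # page missing or fetch failed
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical three-page site used instead of real HTTP responses.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/a">A</a>',
}

order = crawl("http://example.com/", FAKE_WEB.get)
print(order)
```

Each page is fetched once even though the pages link back to each other, because a URL is added to the seen set at the moment it is queued.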

Real crawlers must address:

Speed
Politeness
Excluded content
Duplicate content

Modern spammers create artificial web landscapes of domains, servers, links, and pages to inflate the link scores of the targets they have been paid to promote. Spammers also engage in cloaking, the process of delivering different content to crawlers than to site visitors.

An inverted file is a concatenation of the postings lists for each distinct term.

Scanning and inversion create an inverted file

To scale up, partial inverted files are built separately and then merged
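A rough sketch of scanning, inversion, and merging, under simplifying assumptions: documents are plain strings split on whitespace, postings lists hold only document ids, and everything fits in memory. The function names `build_partial_index` and `merge_indexes` and the sample batches are invented for illustration.

```python
from collections import defaultdict


def build_partial_index(docs):
    """Scan docs (doc_id -> text) and invert them into term -> sorted doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # scanning: pull terms from each doc
            postings[term].add(doc_id)      # inversion: term points to its docs
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_indexes(left, right):
    """Merge two partial indexes into one, keeping postings lists sorted."""
    merged = {}
    for term in sorted(set(left) | set(right)):
        ids = set(left.get(term, [])) | set(right.get(term, []))
        merged[term] = sorted(ids)
    return merged


# Two hypothetical batches, each indexed on its own and then merged.
batch_1 = {1: "deep web search", 2: "surface web"}
batch_2 = {3: "web crawler", 4: "deep crawler"}

index = merge_indexes(build_partial_index(batch_1), build_partial_index(batch_2))
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Writing the postings lists out one after another in term order would give the "concatenation of postings lists" form of the inverted file described above.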

Indexers use compression to reduce demands on disk space and memory
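The notes do not say which compression scheme is used. As one illustrative assumption, the sketch below stores the gaps between sorted document ids and encodes each gap with variable-byte coding, a common textbook approach; the function names are made up for this example.

```python
def to_gaps(doc_ids):
    """Turn sorted doc ids into gaps: [3, 7, 300] -> [3, 4, 293]."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # more bytes follow: high bit left clear
            n >>= 7
        out.append(n | 0x80)       # last byte of this number: high bit set
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode back into a list of ints."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:            # terminating byte for the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [3, 7, 300, 305]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded)
```

Small gaps fit in a single byte, so long postings lists for common terms shrink the most.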

Anchor text contributes strongly to the quality of search results.

Average query lengths are two to three words



 NO MUDDIEST POINT FOR THIS WEEK 
