The deep web stores content in searchable databases that return results only in response to a specific query.
BrightPlanet supports queries against both the surface web and the deep web.
Searching both is imperative if the user is to retrieve the maximum amount of information.
Current search engines retrieve only about 1 of every 3,000 pages available.
The World Wide Web is only one part of the Internet, which also includes FTP, email, news, Telnet, Gopher, and other services.
Search engine
dissatisfaction has increased steadily since 1997.
Search engines gather information by crawling (or spidering): following and recording every hyperlink on the pages they visit.
Authors can also submit
their pages
Surface web has 2.5
billion documents
BrightPlanet is a directed
query engine
The NEC study found:
· Surface Web coverage by individual, major search engines has dropped from a maximum of 32% in 1998 to 16% in 1999, with Northern Light showing the largest coverage.
· Metasearching using multiple search engines can improve retrieval coverage by a factor of 3.5 or so, though combined coverage from the major engines dropped to 42% from 1998 to 1999.
· More popular Web documents, that is, those with many link references from other documents, have up to an eight-fold greater chance of being indexed by a search engine than those with no link references.
- Deep web contents are on average 27% smaller than surface web contents
- The deep web is about 500 times larger than the surface web
- Searching needs to include the whole web
- Directed query technology is the only means to integrate deep and surface Web information.
The simplest crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining if it has already seen a URL.
The crawler makes an HTTP request to get a page; once it has the page, it scans it for links to other URLs.
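
The queue-plus-seen-set idea can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: FAKE_WEB, fetch, LinkParser, and crawl are made-up names, and the "web" here is a small in-memory dictionary standing in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" standing in for real HTTP responses.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP GET: returns the page HTML, or None if unknown."""
    return FAKE_WEB.get(url)


def crawl(seed):
    queue = deque([seed])  # URLs yet to be visited
    seen = {seed}          # fast "have I already seen this URL?" check
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)  # scan the page for links to other URLs
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("http://example.com/"))
```

Each URL enters the queue at most once because it is added to the seen set at the moment it is queued, so the crawl ends even though the pages link back to each other.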
Real crawlers must address:
Speed
Politeness
Excluded content
Duplicate content
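
Two of these concerns, excluded content and duplicate content, can be illustrated with Python's standard library. This is a rough sketch under assumptions of my own: the robots.txt rules, the "NotesBot" agent name, and the helper names allowed and is_duplicate are all invented, and comparing checksums is just one simple way to spot exact duplicates. Speed and politeness are not shown.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would download
# this file from the site before fetching anything else.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(ROBOTS_TXT)


def allowed(url, agent="NotesBot"):
    """Excluded content: respect the site's robots.txt rules."""
    return robots.can_fetch(agent, url)


seen_checksums = set()


def is_duplicate(page_text):
    """Duplicate content: has a page with identical text been seen before?"""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if digest in seen_checksums:
        return True
    seen_checksums.add(digest)
    return False
```

A crawler along these lines would call allowed(url) before fetching a page and is_duplicate(text) after fetching it, skipping the page when the first returns False or the second returns True.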
Modern spammers create
artificial web landscapes of domains, servers, links, and pages to
inflate the
link scores of the targets they have been paid to promote. Spammers also engage
in
cloaking, the process of delivering different content to crawlers than to
site visitors.
An inverted file is a
concatenation of the postings lists for each distinct term.
An inverted file is created in two phases: scanning the documents and inverting the result.
To scale up, the indexer builds partial inverted files and then merges them.
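
A toy version of scanning, inversion, merging, and concatenation might look like the sketch below. The function names (invert, merge, concatenate) and the sample documents are my own, and splitting text on whitespace is a simplifying assumption rather than how a real indexer tokenizes.

```python
from collections import defaultdict


def invert(docs):
    """Scan {doc_id: text} and build {term: sorted list of doc_ids}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def merge(partials):
    """Merge partial inverted files into one, with terms in sorted order."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


def concatenate(index):
    """Lay the postings lists end to end; the dictionary maps each
    term to the (offset, length) of its list in the flat file."""
    dictionary, postings_file = {}, []
    for term, ids in index.items():
        dictionary[term] = (len(postings_file), len(ids))
        postings_file.extend(ids)
    return dictionary, postings_file


# Two hypothetical batches of documents, indexed separately and then merged.
batch1 = {1: "deep web search", 2: "surface web"}
batch2 = {3: "deep web crawl"}
index = merge([invert(batch1), invert(batch2)])
dictionary, postings_file = concatenate(index)

print(index["web"])       # documents containing "web"
print(dictionary["web"])  # where that postings list sits in the flat file
```

Looking up a term then means reading its offset and length from the dictionary and slicing that range out of the postings file.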
Indexers use compression
to reduce demands on disk space and memory
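
The notes do not say which compression scheme indexers use. One common approach, shown here only as an assumed example, is to store the gaps between sorted document IDs and encode each gap in a variable number of bytes. All function names and the sample postings list are invented for illustration.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace sorted doc IDs with the differences between neighbours."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Undo to_gaps with a running sum."""
    return list(accumulate(gaps))


def vbyte_encode(numbers):
    """Variable-byte code: 7 bits per byte, low bits first;
    the high bit marks the last byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [1000, 1003, 1010, 1500]  # hypothetical postings list
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded)
```

Small gaps fit in a single byte, so a postings list for a frequent term, where document IDs sit close together, shrinks the most.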
Anchor text contributes
strongly to the quality of search results.
Average query lengths are
two to three words
NO MUDDIEST POINT FOR THIS WEEK