Friday, February 6, 2009

Web Crawler, Web Robot, or Web Spider: How It Works

A web spider, sometimes called a crawler or a robot, plays an important role as an essential part of the infrastructure of every search engine. It automatically discovers and collects resources, especially web pages, from the Internet. With the rapid growth of the Internet, a typical web spider design may not cope with the overwhelming number of web pages.

Search engines
A search engine is a program that searches through some dataset. In the context of the Web, the term "search engine" most often refers to search forms that query databases of HTML documents gathered by a robot. Robots are software agents.

Web Agent
The word "agent" is used for lots of meanings in computing these days. Specifically:
Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do. These can only travel between special servers and are currently not widespread in the Internet. Intelligent agents are programs that help users with things, such as choosing a product, or guiding a user through form filling, or even helping users find things. These have generally little to do with networking. User-agent is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agent like Qualcomm Eudora etc.

The process by which a robot traverses the Web is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code, and for gathering specific types of information from web pages, such as harvesting e-mail addresses (usually for spam).

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot. Normal web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case. A robot simply visits sites by requesting documents from them.

Basic Search Engine Architecture


[Figure: Basic search engine architecture]
Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on websites. When a spider is building its lists, the process is called web crawling. A web crawler is a program that automatically traverses the Web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing. Other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, worms, etc.



Crawlers are automated programs that follow the links found on web pages.


There is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server then compresses and stores the web pages into a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
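
As a rough illustration of the hit and forward-index idea (a sketch, not Google's actual format; the record fields and the barrel range below are simplified assumptions), the structures might look like this in Java:

import java.util.*;

// Simplified sketch of a "hit" and a forward-index "barrel" as described above.
// Field names and the barrel layout are illustrative assumptions, not Google's real format.
record Hit(int wordId, int position, int fontSizeApprox, boolean capitalized) {}

// One barrel of the forward index: for each doc ID, the hits whose word IDs
// fall into this barrel's word-ID range.
class ForwardBarrel {
    final int minWordId, maxWordId;
    final Map<Integer, List<Hit>> hitsByDocId = new HashMap<>();

    ForwardBarrel(int minWordId, int maxWordId) {
        this.minWordId = minWordId;
        this.maxWordId = maxWordId;
    }

    void add(int docId, Hit hit) {
        if (hit.wordId() >= minWordId && hit.wordId() <= maxWordId) {
            hitsByDocId.computeIfAbsent(docId, d -> new ArrayList<>()).add(hit);
        }
    }
}

public class ForwardIndexSketch {
    public static void main(String[] args) {
        ForwardBarrel barrel = new ForwardBarrel(0, 9999);
        barrel.add(42, new Hit(17, 3, 2, true)); // doc 42: word 17 at position 3, large font, capitalized
        System.out.println(barrel.hitsByDocId.get(42).size()); // prints 1
    }
}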


The URL Resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs. The links database is used to compute Page Ranks for all the documents. The sorter takes the barrels, which are sorted by doc ID and resorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of word IDs and offsets into the inverted index.


A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries (Brin and Page, 1998).

[Figure: Search engine architecture]

How a Web Crawler Works



Web crawlers are an essential component of search engines, and running a web crawler is a challenging task. There are tricky performance and reliability issues, and even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially when crawling pages from multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel. Despite the numerous applications for web crawlers, at the core they are all fundamentally the same. The following is the process by which web crawlers work (a minimal code sketch follows the steps):

1. Download the Web page.

2. Parse through the downloaded page and retrieve all the links.

3. For each link retrieved, repeat the process.
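
The following is a minimal sketch of that loop in Java, using only the standard library and a crude regular expression for link extraction. The seed URL, the page budget, and the regex are assumptions for illustration; a real crawler would use a proper HTML parser and respect robots.txt and politeness limits.

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    // Crude pattern for href="http..." links; a real crawler would use an HTML parser.
    private static final Pattern LINK =
            Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();   // crawl frontier (FIFO queue)
        Set<String> visited = new HashSet<>();         // avoid re-downloading pages
        frontier.add("https://example.com/");          // seed URL (assumption)

        int budget = 50;                               // stop after 50 pages (assumption)
        while (!frontier.isEmpty() && budget-- > 0) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // already seen this URL

            String html = download(url);               // 1. download the web page
            if (html == null) continue;

            Matcher m = LINK.matcher(html);            // 2. parse out the links
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) frontier.add(link); // 3. repeat for each link
            }
            System.out.println("Crawled " + url + ", frontier size " + frontier.size());
        }
    }

    private static String download(String url) {
        try (InputStream in = new URL(url).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null; // skip pages that fail to download
        }
    }
}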

Architecture of a web crawler


[Figure: Web crawler architecture]


The web crawler can be used for crawling through a whole site on the Internet or an intranet. You specify a start URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can be seen as a tree structure: the root is the start URL, all links in that root HTML page are direct children of the root, and subsequent links are children of the previous pages.


A single URL server serves lists of URLs to a number of crawlers. A web crawler starts by parsing a specified web page, noting any hypertext links on that page that point to other web pages. It then parses those pages for new links, and so on, recursively.


Web crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine and simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. To retrieve web pages at a fast enough pace, a crawler keeps many connections open at once (Google's early crawler kept roughly 300). All the crawler really does is automate the process of following links.


Web crawling can be regarded as processing items in a queue. When the crawler visits a web page, it extracts links to other web pages, puts those URLs at the end of the queue, and continues by crawling the URL it removes from the front of the queue.

Crawling policies

There are three important characteristics of the Web that generate a scenario in which Web crawling is very difficult:

· its large volume,

· its fast rate of change, and

· dynamic page generation,

which combine to produce a wide variety of possible crawlable URLs.


The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.


The recent increase in the number of pages generated by server-side scripting languages has also created difficulty, in that endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, specified through HTTP GET parameters. If there are four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight (4 × 3 × 2 × 2) different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
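
One common mitigation (a sketch, not a complete solution) is to canonicalise URLs before adding them to the frontier, for example by sorting the query parameters and dropping those believed not to affect the content. The parameter names below are purely hypothetical:

import java.util.*;

public class UrlCanonicalizer {
    // Hypothetical parameters assumed not to change the page content.
    private static final Set<String> IGNORED = Set.of("sessionid", "sort", "thumbsize");

    public static String canonicalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) return url;
        String base = url.substring(0, q);
        // Keep only meaningful parameters and sort them, so that different
        // orderings of the same query map to one canonical URL.
        TreeMap<String, String> params = new TreeMap<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            if (!IGNORED.contains(kv[0].toLowerCase())) {
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        if (params.isEmpty()) return base;
        StringBuilder sb = new StringBuilder(base).append('?');
        params.forEach((k, v) -> sb.append(k).append('=').append(v).append('&'));
        sb.setLength(sb.length() - 1); // drop trailing '&'
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("http://gallery.example/index.php?sort=date&album=2&thumbsize=small"));
        // prints http://gallery.example/index.php?album=2
    }
}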

The behavior of a web crawler is the outcome of a combination of policies:

· A selection policy that states which pages to download.

· A re-visit policy that states when to check for changes to the pages.

· A politeness policy that states how to avoid overloading websites.

· A parallelization policy that states how to coordinate distributed web crawlers.

Selection policy


Given the current size of the Web, even large search engines cover only a portion of the publicly available internet; a study by Lawrence and Giles (Lawrence and Giles, 2000) showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web.


This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.



Different types of crawling

Path-ascending crawling


Some crawlers intend to download as many resources as possible from a particular website. Cothey introduced a path-ascending crawler that ascends to every path in each URL it intends to crawl; for example, given a seed URL of http://example.org/a/b/page.html, it would also attempt to crawl /a/b/, /a/, and /.

Focused crawling
The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers.



Crawling the Deep Web


A vast number of web pages lie in the deep or invisible Web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find them if no links point to them. Google's Sitemaps protocol and mod_oai (Nelson et al., 2005) are intended to allow discovery of these deep-Web resources.

Re-visit policy


The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events may have happened, including creations, updates, and deletions of pages. Two simple re-visit policies, compared in the sketch below, are the following:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.
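
A toy comparison of the two policies, assuming the crawler keeps an estimated change rate per page (the base interval and the numbers are made up):

// Toy illustration of the two re-visit policies described above.
// The base interval and change-rate estimates are made-up numbers.
public class RevisitPolicy {
    static final double UNIFORM_INTERVAL_DAYS = 7.0;

    // Uniform policy: every page gets the same re-visit interval,
    // regardless of how often it is estimated to change.
    static double uniformIntervalDays(double estimatedChangesPerDay) {
        return UNIFORM_INTERVAL_DAYS;
    }

    // Proportional policy: visit frequency proportional to change frequency,
    // i.e. the interval shrinks as the estimated change rate grows.
    static double proportionalIntervalDays(double estimatedChangesPerDay) {
        return 1.0 / Math.max(estimatedChangesPerDay, 1.0 / 365.0); // cap at one visit per year
    }

    public static void main(String[] args) {
        double newsPage = 2.0;   // changes about twice a day
        double aboutPage = 0.01; // changes about every 100 days
        System.out.printf("news page:  uniform %.1f days, proportional %.1f days%n",
                uniformIntervalDays(newsPage), proportionalIntervalDays(newsPage));
        System.out.printf("about page: uniform %.1f days, proportional %.1f days%n",
                uniformIntervalDays(aboutPage), proportionalIntervalDays(aboutPage));
    }
}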

Politeness policy

Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler performs multiple requests per second and/or downloads large files, a server can have a hard time keeping up, especially with requests arriving from multiple crawlers at once. A common mitigation is a minimum delay between requests to the same host, sketched after the list of costs below.

As noted by Koster (Koster, 1995), the use of web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using web crawlers include:

· Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time.

· Server overload, especially if the frequency of accesses to a given server is too high.

· Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.

· Personal crawlers that, if deployed by too many users, can disrupt networks and web servers.
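
A minimal sketch of such a per-host delay (the two-second value is an arbitrary assumption; real crawlers also honour Crawl-delay hints where present):

import java.net.URI;
import java.util.*;

// Minimal per-host politeness: wait at least MIN_DELAY_MS between two
// requests to the same host. The two-second value is an arbitrary assumption.
public class PolitenessThrottle {
    private static final long MIN_DELAY_MS = 2000;
    private final Map<String, Long> lastRequest = new HashMap<>();

    public synchronized void waitForTurn(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        long now = System.currentTimeMillis();
        long earliest = lastRequest.getOrDefault(host, 0L) + MIN_DELAY_MS;
        if (now < earliest) {
            Thread.sleep(earliest - now);   // back off until this host's slot is free
        }
        lastRequest.put(host, System.currentTimeMillis());
    }
}

In the crawler sketch shown earlier, waitForTurn(url) would be called just before each download.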

Parallelization policy

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
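
One simple assignment policy (a sketch) is to hash the host name, so that every URL from a given site is always handled by the same crawling process; this avoids duplicate fetches and keeps per-host politeness bookkeeping local to one process:

import java.net.URI;

// Assign each discovered URL to one of N crawling processes by hashing its host.
// All URLs from the same site land on the same process, so no page is assigned twice
// and per-host politeness can be enforced locally.
public class UrlAssigner {
    static int assign(String url, int numCrawlers) {
        String host = URI.create(url).getHost();
        return Math.floorMod(host.hashCode(), numCrawlers);
    }

    public static void main(String[] args) {
        System.out.println(assign("https://example.com/a.html", 4));
        System.out.println(assign("https://example.com/b.html", 4)); // same process as a.html
    }
}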

In effect, crawling is the process that keeps the search engine's index synchronised with the Web that users actually see.

Web Robot Algorithms


Each robot uses different algorithms to decide where to visit. In general, they start from a historical list of URLs, especially the URLs of the most popular sites on the Web.




Starting at a location on the Web reveals a branching structure which, if cycles are avoided, is essentially a tree. Common traversal strategies are:

· Depth First Traversal

· Breadth First Traversal

· Heuristics search

In a heuristic search, each URL (web page) is evaluated by a heuristic function that estimates its importance, and the most important pages are visited first.
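
The three strategies differ only in the data structure used for the frontier. In the sketch below, the "importance" score for the heuristic frontier (shorter URLs first) is just a placeholder, not a real relevance measure:

import java.util.*;

// The traversal order is determined by how the frontier hands out the next URL.
public class Frontiers {
    public static void main(String[] args) {
        List<String> found = List.of("http://a.example/x/y", "http://b.example/", "http://c.example/z");

        Queue<String> bfs = new ArrayDeque<>(found);   // FIFO queue: breadth-first traversal
        Deque<String> dfs = new ArrayDeque<>();        // LIFO stack: depth-first traversal
        found.forEach(dfs::push);
        Queue<String> heuristic =                      // best-first: highest-scoring URL next
                new PriorityQueue<>(Comparator.comparingInt(String::length));
        heuristic.addAll(found);

        System.out.println("BFS next:       " + bfs.poll());       // http://a.example/x/y
        System.out.println("DFS next:       " + dfs.pop());        // http://c.example/z
        System.out.println("Heuristic next: " + heuristic.poll()); // http://b.example/ (shortest URL)
    }
}

Swapping the frontier in the crawler sketch above from a FIFO queue to a stack or a priority queue is all it takes to change the traversal strategy.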

Robots Exclusion Standard


The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive websites, or by webmasters to proofread source code. The standard complements Sitemaps, a robot inclusion standard for websites.


A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling the site. This might be, for example, out of a preference for privacy from search engine results, a belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data.



This example allows all robots to visit all files because the wildcard "*" specifies all robots:

User-agent: *

Disallow:

This example keeps Google's robot out:

User-agent: googlebot

Disallow: /

The next example tells all crawlers not to enter four directories of a website:

User-agent: *

Disallow: /cgi-bin/

Disallow: /images/

Disallow: /tmp/

Disallow: /private/

This example tells a specific crawler not to enter one specific directory:

User-agent: BadBot

Disallow: /private/
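
A deliberately simplified robots.txt checker is sketched below. It only understands User-agent and Disallow prefix rules and ignores Allow, wildcards, Sitemap and Crawl-delay lines, so it is an approximation of the real standard rather than a full parser:

import java.util.*;

// Very simplified robots.txt check: prefix-based Disallow rules only.
// Real parsers also handle Allow, wildcards, Sitemap and Crawl-delay lines.
public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    RobotsTxt(String robotsTxt, String userAgent) {
        boolean applies = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String agent = line.substring("user-agent:".length()).trim();
                applies = agent.equals("*") || userAgent.toLowerCase().contains(agent.toLowerCase());
            } else if (applies && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) disallowed.add(path); // an empty Disallow means "allow everything"
            }
        }
    }

    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n";
        RobotsTxt robots = new RobotsTxt(sample, "MyCrawler/1.0");
        System.out.println(robots.isAllowed("/index.html"));        // true
        System.out.println(robots.isAllowed("/private/notes.txt")); // false
    }
}

A polite crawler downloads /robots.txt once per host, caches the parsed rules, and consults them before fetching any page from that host.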


Will the /robots.txt standard be extended?

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.



What if I can't make a /robots.txt file?



Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.
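
For example, placing the following tag in a page's <head> section asks compliant robots not to index that page and not to follow its links:

<meta name="robots" content="noindex, nofollow">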

Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.



Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.

To improve search performance, Google ignores (doesn't index) common words, called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase, to improve Google's performance.
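
A toy version of such an index is sketched below; the stop-word list and the whitespace tokenisation are drastic simplifications of what a real indexer does:

import java.util.*;

// Toy inverted index: maps each term to the documents (and positions) it occurs in.
// The stop-word list and the crude tokenisation are drastic simplifications.
public class ToyIndex {
    private static final Set<String> STOP_WORDS =
            Set.of("the", "is", "on", "or", "of", "how", "why", "as", "a");

    // term -> (docId -> positions of the term in that doc); TreeMap keeps terms sorted
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    void indexDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+"); // lowercase, crude tokenisation
        for (int pos = 0; pos < tokens.length; pos++) {
            String term = tokens[pos];
            if (term.isEmpty() || STOP_WORDS.contains(term)) continue; // skip stop words
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of()).keySet();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.indexDocument(1, "How a web crawler works");
        index.indexDocument(2, "The crawler downloads the page");
        System.out.println(index.docsContaining("crawler")); // documents 1 and 2
    }
}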

Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
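
The sketch below shows the textbook PageRank computation by power iteration over a tiny link graph. This is the published formulation, not Google's production implementation; the damping factor of 0.85 and the example graph are conventional assumptions:

import java.util.*;

// Textbook PageRank by power iteration over a tiny link graph.
// The damping factor 0.85 and the example graph are conventional assumptions,
// not Google's actual data or implementation.
public class PageRank {
    public static void main(String[] args) {
        // links[i] = pages that page i links to
        int[][] links = { {1, 2}, {2}, {0} };
        int n = links.length;
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);
            for (int page = 0; page < n; page++) {
                double share = d * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share; // each page passes its rank along its out-links
                }
            }
            rank = next;
        }
        System.out.println(Arrays.toString(rank)); // higher values = more "important" pages
    }
}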

Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search Form and Using Search Operators (Advanced Operators).
Let’s see how Google processes a query.

[Figure: How a Google query is processed]



Demerits of web crawlers

· A crawler requires considerable bandwidth.

· Crawlers are sometimes used for spamming purposes, such as harvesting e-mail addresses.

· A crawler is unable to reach much of the deep Web, for example:

· Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form (especially if open-domain input elements, e.g., text fields, are used; such fields are hard to navigate without domain knowledge).

· Unlinked content: pages which are not linked to by other pages, which may prevent web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).

· Private Web: sites that require registration and login (password-protected resources).

· Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).

· Limited-access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or pragma:no-cache/cache-control:no-cache HTTP headers), prohibiting search engines from browsing them and creating cached copies.

· Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or AJAX solutions.

· Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.



· Poorly written web robots may damage files on a server.

· Certain robot implementations can (and have in the past) overloaded networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.

· Robots are operated by humans, who make mistakes in configuration or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.

· Web-wide indexing robots build a central database of documents, which doesn't scale well to millions of documents on millions of sites.



Guidelines for robot writers

To write a good web robot, you should try to avoid:

· Overloading the network

· Overloading a server with rapid requests for documents

· Servers that are unreachable

· Cycles in the web structure

Also:

· Be accountable

· Test locally

· Stay with it

· Don't hog resources

· Share results

If you are interested in writing your own crawler, please comment on this post and mention which language you are using (e.g., C#, Java); we will give you assistance through our blog. Also, if you want clarification on any part of this post, please feel free to comment.


Thanks.


References:

http://www.google.com/technology/

http://www.robotstxt.org/wc/robots.html

http://www.robotstxt.org/wc/exclusion.html

