Tuesday, February 17, 2009

Search Engine Optimization (SEO)

What is Search Engine Optimization?

Search engine optimization (SEO) is one of the most important parts of online marketing. Most companies have a web presence, and nowadays everything goes online; this is where SEO comes in. SEO helps you promote your website, blog, or any other kind of web presence with only a little basic HTML knowledge. So we can define SEO generally as a set of techniques that optimize your site so that it is easily accessible to search engines (search engine spiders) and obtains higher positions in search engine results.

Truth be told, even after following all these techniques we can't guarantee that you will reach the top of the search results. But we can assure you that your page will rank higher than it does now. The first thing to keep in mind is to set realistic goals, like "I will be in the first 40 Google results for the keyword 'search engine optimization'". That is the kind of motivation with which you can beat others at SEO. SEO can increase traffic to your website without any kind of advertisement (cool, isn't it?). By following SEO techniques you can get to the top of the search results purely on the strength of your keyword relevance. For example, assume you have a website that sells pizzas. You must then select keywords related to that type of food.

Throughout this article we will use an example website, say www.buypizza.com, to explain the SEO process.

Every search engine has a computer program, called a web crawler, web robot, or web spider, whose job is to index web pages into the search engine's servers. For more details on how a search engine works and its architecture, please look at the article on web crawlers here. So I will skip that portion and move on to the facts of SEO.

1. Keywords: The most important term in SEO

The most important thing in SEO is keyword selection: you must select relevant keywords for your website. This is because search engines can't "feel" the web pages they crawl. They look at the text and identifiers of a website to learn what the site is about and what it contains. When a user searches for particular information, the search engine processes the request by comparing the query text with the index on its servers and then calculating the relevance of each indexed page to that term. Different search engines employ different algorithms for calculating ranking and relevance; for example, Google uses the "PageRank" algorithm to rank pages. Search engines look for keywords in URLs, links, headings, META tags, and alt attributes. Different search engines use different strategies and update their algorithms frequently, which is why engines like Yahoo, Google, Ask, and MSN give different results for the same keyword.

Before selecting keywords you should thoroughly identify your competitors in the web market and, if you are lucky, choose a keyword that has not yet been taken. For example, when you select the domain name www.buypizza.com, use the "pizza" keyword plus your unique brand name in the industry (i.e. "ABC pizza"), along with keywords like "buy pizza", "ABC online pizza", and the corresponding synonyms of the chosen keywords. Keep in mind that search engines themselves match synonyms of the search query.

Avoid using large numbers of keywords that don't belong to, or are unfit for, your website's content. Use specific keywords that match your website's content and theme. The Google keyword suggestion tool provides details about the amount of traffic for a particular keyword, both for your website and for your competitors'.

The next thing that comes into play is keyword density. Keyword density tells the search engine how relevant your site is for a particular keyword. You can use an online keyword density checker to measure the density of different keywords on your website. Normally a density of 3-7% for a particular keyword in a document is acceptable. If you try to spam with a large number of keywords that don't match your website's content, you can get banned from search engines and the web crawler won't visit your website for indexing, so it's better not to spam. The position of keywords on a page is also important: keywords in URLs, the TITLE tag, headings, and the top of the page carry great weight. For example, if your pizza website has a page for ordering pizza, it's better to use Buypizza.com/orderABCpizza.html rather than Buypizza.com/ordernow.html as the URL.
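As an illustration (my own sketch, not any particular SEO tool), keyword density can be computed by counting how often a keyword appears relative to the total number of words on the page. The sample page text below is hypothetical:

```java
import java.util.Arrays;
import java.util.Locale;

public class KeywordDensity {
    // Density = occurrences of the keyword as a percentage of all words on the page.
    static double density(String text, String keyword) {
        String[] words = text.toLowerCase().split("\\W+");
        long hits = Arrays.stream(words).filter(w -> w.equals(keyword.toLowerCase())).count();
        return words.length == 0 ? 0.0 : 100.0 * hits / words.length;
    }

    public static void main(String[] args) {
        String page = "Buy pizza online. Our pizza is delicious and our pizza delivery is fast.";
        // 3 occurrences out of 13 words -> about 23.1%, well above the usual 3-7% guideline
        System.out.println(String.format(Locale.ROOT, "%.1f%%", density(page, "pizza")));
    }
}
```

A real checker would also weight keywords by position (title, headings), but the ratio above is the core of the metric.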

When selecting your domain name, try to get a keyword-rich domain name (at least three keywords). Use any domain name availability checker service.

Keywords in headings occupy an important position, but don't use long headings; it's not good practice. Also, don't use stop words. Stop words are common words and letters that the majority of search engines ignore.

Some of the stop words are:
· a
· are
· and
· as
· be
· at
· for
· he
· from
· his
· I
· is
· in
· it
· of
· that
· on
· the
· they
· to
· this
· was
Using fewer stop words helps reduce file size and increases the speed and relevancy of search results. As such, try to use keywords instead of stop words as much as possible. Obviously, avoiding stop words may make the text harder to read, but it is a worthwhile sacrifice in the quest for top search engine rankings.
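As a sketch of the idea (hypothetical code, using a hand-picked subset of the stop words listed above), here is how stop words could be filtered out when building a keyword-rich URL slug from a page title:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWords {
    // A small subset of the stop words above; real engines use larger lists.
    static final Set<String> STOP = Set.of("a", "and", "the", "of", "to", "in", "is", "for", "on", "that");

    // Build a keyword-rich URL slug by dropping stop words from a page title.
    static String slug(String title) {
        return Arrays.stream(title.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP.contains(w))
                .collect(Collectors.joining("-"));
    }

    public static void main(String[] args) {
        System.out.println(slug("How to Order a Pizza in Minutes")); // prints how-order-pizza-minutes
    }
}
```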

2. Links Optimization

The next important term in SEO is links. The web is woven out of web pages connected by links. If your website is not referenced by any well-known website, it may take time for search engines to find your site. One easy way to tell a search engine about your site is the "Submit URL" form (e.g. Google's Submit URL) that every search engine provides. You can submit your website to the major search engines and directories to get listed.

Inbound and Outbound Links

When calculating the ranking of your website, the search engine considers the inbound and outbound links from/to your site.

Outbound links are links that start from your site and lead to another one, while inbound links, or back links, come from an external site to yours.

For example:

Consider a website about different food items that puts a link to your site in its pizza category; for your site, that is an inbound link, or back link. If your site has a link to www.howtomakehamburger.com, that is an outbound link.

Try to increase your outbound links to same-themed websites; good inbound links from same-themed websites will also increase your ranking. If your site has many inbound links that come from websites with a different theme than yours, it will adversely affect your website's search position.

Using images for links might be prettier, but it is an SEO killer. Instead of buttons for links, use simple text links. Search engine spiders read the text on a page, so they can't see all the designer themes; avoid them, or provide a meaningful textual description in the alt attribute, as described next.

Anchor Text

Anchor text is the text you click on in a link, so use keyword-rich, descriptive anchor text.

For example:
For my blog, the back link anchor text looks like this: "Irshad cp's blog"

<a href="http://irshadcp.blogspot.com">Irshad cp's blog</a>

3. META Tags

During the infancy of search engines, the META tag was the only tool for search engine optimization. The META description tag is one way for you to write a description of your site; it points search engines to the themes and topics your website is relevant to. It comes after the TITLE tag in the HEAD section of the HTML source.




<TITLE>ABC Pizza's Website, Buy Pizza Online</TITLE>

<META Name="Description" Content="Buy delicious pizza from ABC">

<META Name="Keywords" Content="pizza online, buy pizza, delicious pizza, free pizza sticker">



You can specify multiple keywords by separating them with commas (",").

You can also use an HTML META tag to keep robots (web crawlers) indexing your document.

The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">

in your HTML document, that document will be indexed and the robot will parse the links in that document. (Use "NOINDEX, NOFOLLOW" instead to keep a page out of the index.)

4. Content is King

Just as "the customer is king" in the market, "content is king" on the web. Provide fresh and relevant content on your website; this will likely bring you high traffic and a good search engine ranking. Try to update your site with good content regularly. Even if you update the site daily, the search engine crawlers may only come every two days, four days, a week, or so, depending on the search engine's policy. These crawler visits will lead to high search engine traffic to your website. If you don't update, you will drop out of the top searches or your ranking will go down.

This is not necessarily the case for a company website. In most cases you don't need to update a company website, because it only contains company information. However, adding new product information or news to a company's website shows what has been happening there recently. Company officials' blogs are also good traffic grabbers.

If your website is a tutorial or magazine site, there should be no scarcity of fresh content. However, try to format the material on your website into paragraphs with search-engine-friendly headings. Highlight text properly: use bold and italic text to show the importance of the text.

Also, don't use other websites' content, which will push your website down the search engine rankings. Some webmasters use hidden text (invisible to humans but visible to search engine robots) to fool the search engines; this will get you banned from search engine results. At the end of this article I discuss what search engine spam is and the do's and don'ts of search engines.

Don't use images for navigation links; good practice is to use text links. If an image tag is used (an image is displayed on your page), use the image's ALT attribute with an appropriate description.

For example, an image of our pizza website's logo:

<img src="../images/logo.gif" alt="ABC pizza logo" />

This ALT text is displayed when the image is not shown. It is also helpful for screen readers (visually impaired people use screen reader software to read web pages), and the text is indexed for image search in search engines.

Use static links; they index better. Links with more than three parameters are not indexed by search engines. Also try to keep the page size under about 61 KB.

5. Promote your website

After doing the things stated above, it's time to submit your site to search engines and directories (like dmoz.org and the Yahoo directory) and to post articles related to your site's content to forums and discussion groups.

Also promote your site using paid and non-paid ads, which likewise drive traffic to your website. I will describe promoting websites with advertisements in my next article.

6. Things to be avoided

a. Remove all other META tags (author, date, etc.), except "description" and "keywords" unless you're sure they are absolutely necessary.

b. Avoid using the same Title tag throughout your site. Use a unique Title tag for each web page, with keyword phrases relevant to that page's theme.

c. Most major engines cannot read frames. If you must use frames, include the important body text within a <noframes> tag.

d. Avoid all-Flash designs. A majority of major engines will not index Flash sites, and editors may be critical of heavy or slow-loading Flash.

e. Avoid JavaScript links. Spiders cannot crawl links in JavaScript.

f. Never use keywords that do not apply to your site's content.

7. What constitutes search engine Spam?

Experts Say:

Any optimization method or practice employed solely to deceive the search engines for the purpose of increasing rankings is considered Spam. Some techniques are clearly considered as an attempt to Spam the engines. Where possible, you should avoid these:

* Keyword stuffing: This is the repeated use of a word to increase its frequency on a page. Search engines now have the ability to analyze a page and determine whether the frequency is above a "normal" level in proportion to the rest of the words in the document.

* Invisible text: Some webmasters stuff keywords at the bottom of a page and make their text color the same as that of the page background. This is also detectable by the engines.

* Tiny text: Same as invisible text but with tiny, illegible text.

* Page redirects: Some engines, especially Infoseek, do not like pages that take the user to another page without his or her intervention, e.g. using META refresh tags, CGI scripts, Java, JavaScript, or server-side techniques.

* Meta tags stuffing: Do not repeat your keywords in the Meta tags more than once, and do not use keywords that are unrelated to your site's content.

* Do not create doorway pages (extra entry pages to your website other than your home page).

* Do not submit the same page more than once on the same day to the same search engine. If the crawler doesn't index the page within a few days, do not simply resubmit it.

* Do not submit virtually identical pages, i.e. do not simply duplicate a web page, give the copies different file names, and submit them all. That will be interpreted as an attempt to flood the engine.

* Do not submit more than the allowed number of pages per engine per day or week. Each engine has a limit on how many pages you can manually submit to it using its online forms.

* Do not participate in link farms or link exchange programs. Search engines consider link farms and link exchange programs as spam, as they have only one purpose - to artificially inflate a site's link popularity, by exchanging links with other participants.

To test your knowledge of SEO, take the free SEO expert quiz at http://www.seomoz.org. Feel free to comment on this article and provide your valuable suggestions. If you want clarification on any area, please comment.

OK friends! Start optimizing your websites. Happy web life... cheers!!!


Sunday, February 15, 2009

AT command set for SONY ERICSSON

The GSM modem will respond in one of two ways: "ERROR" is returned if the AT command is not supported; if the command executes successfully, "OK" is returned after the response text. (You can communicate with the GSM modem via HyperTerminal, or over a cable using serial port communication from Java or .NET, e.g. for sending and receiving SMS programmatically.)

e.g. AT+CBC=?
The response text will be of the form:
+CBC: (0,2),(0-100)
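For example, the reply to a battery query can be parsed programmatically. The following is a minimal Java sketch (the class name and the sample reply string are my own illustration) that extracts the charge level from a "+CBC: <bcs>,<bcl>" reply:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CbcParser {
    // Returns the battery charge level <bcl> in percent, or -1 if the reply doesn't match.
    static int batteryLevel(String response) {
        Matcher m = Pattern.compile("\\+CBC:\\s*(\\d+)\\s*,\\s*(\\d+)").matcher(response);
        return m.find() ? Integer.parseInt(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical modem reply: <bcs>=0 (phone powered by its battery), <bcl>=60 (60% charge)
        System.out.println(batteryLevel("+CBC: 0,60") + "%"); // prints 60%
    }
}
```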

List of AT Commands
  1. AT Attention command
  2. AT* List all supported AT commands
  3. ATZ Restore to user profile (ver. 2)
  4. AT&F Set to factory-defined configuration (ver. 2)
  5. ATI Identification information (ver. 3)
  6. AT&W Store user profile
  7. AT+CLAC List all available AT commands
  8. AT+CGMI Request manufacturer identification (ver. 1)
  9. AT+CGMM Request model identification
  10. AT+CGMR Request revision identification
  11. AT+CGSN Request product serial number identification
  12. AT+GCAP Request modem capabilities list
  13. AT+GMI Request manufacturer information
  14. AT+GMM Request model identification
  15. AT+GMR Request revision identification
  16. ATA Answer incoming call command (ver. 2)
  17. ATH Hook control (ver. 2)
  18. ATD Dial command (ver. 5)
  19. ATO Return to online data mode
  20. AT+CVHU Voice hangup control
  21. AT+CLCC List current calls
  22. AT*CPI Call progress information
  23. ATE Command echo (ver. 2)
  24. ATSO Automatic answer control
  25. ATS2 Escape sequence character
  26. ATS3 Command line termination character (ver. 3)
  27. ATS4 Response formatting character (ver. 3)
  28. ATS5 Command line editing character (ver. 3)
  29. ATS7 Completion connection timeout
  30. ATS10 Automatic disconnect delay control
  31. ATQ Result code suppression (ver. 2)
  32. ATV DCE response mode (ver. 2)
  33. ATX Call progress monitoring control
  34. AT&C Circuit 109 (DCD) control
  35. AT&D Circuit 108 (DTR) response
  36. AT+IFC Cable interface DTE-DCE local flow control
  37. AT+ICF Cable interface character format (ver. 2)
  38. AT+IPR Cable interface port rate
  39. AT+ILRR Cable interface local rate reporting
  40. AT+DS Data compression (ver. 3)
  41. AT+DR Data compression reporting
  42. AT+WS46 Mode selection
  43. AT+FCLASS Select mode
  44. AT*ECBP CHF button pushed (ver. 2)
  45. AT+CMUX Switch to 07.10 multiplexer (ver. 2)
  46. AT*EINA Ericsson system interface active
  47. AT*SEAM Add menu item
  48. AT*SESAF SEMC show and focus
  49. AT*SELERT SEMC create alert (information text)
  50. AT*SESTRI SEMC create string Input
  51. AT*SELIST SEMC create list
  52. AT*SETICK SEMC create ticker
  53. AT*SEDATE SEMC create date field
  54. AT*SEGAUGE SEMC create gauge (bar graph/progress feedback)
  55. AT*SEGUP SEMC update gauge (bar graph/ progress feedback)
  56. AT*SEONO SEMC create on/off input
  57. AT*SEYNQ SEMC create yes/no question
  58. AT*SEDEL SEMC GUI delete
  59. AT*SESLE SEMC soft key label (ver. 1)
  60. AT*SERSK SEMC remove soft key
  61. AT*SEUIS SEMC UI session establish/terminate
  62. AT*EIBA Ericsson Internal Bluetooth address
  63. AT+BINP Bluetooth input
  64. AT+BLDN Bluetooth last dialled number
  65. AT+BVRA Bluetooth voice recognition activation
  66. AT+NREC Noise reduction and echo cancelling
  67. AT+VGM Gain of microphone
  68. AT+VGS Gain of speaker
  69. AT+BRSF Bluetooth retrieve supported
  70. AT+GCLIP Graphical caller ID presentation
  71. AT+CSCS Select TE character set (ver. 3)
  72. AT+CHUP Hang up call
  73. AT+CRC Cellular result codes (ver. 2)
  74. AT+CR Service reporting control
  75. AT+CV120 V.120 rate adaption protocol
  76. AT+VTS DTMF and tone generation
  77. AT+CBST Select bearer service type (ver. 3)
  78. AT+CRLP Radio link protocol (ver. 2)
  79. AT+CEER Extended error report (ver. 2)
  80. AT+CHSD HSCSD device parameters (ver. 2)
  81. AT+CHSN HSCSD non-transparent call configuration (ver. 2)
  82. AT+CHSC HSCSD current call parameters (ver. 2)
  83. AT+CHSR HSCSD parameters report (ver. 2)
  84. AT+CHSU HSCSD automatic user-initiated upgrade
  85. AT+CNUM Subscriber number (ver. 2)
  86. AT+CREG Network registration (ver. 2)
  87. AT+COPS Operator selection (ver. 2)
  88. AT+CLIP Calling line identification (ver. 2)
  89. AT+CLIR Calling line identification restriction
  90. AT+CCFC Calling forwarding number and conditions (ver. 2)
  91. AT+CCWA Call waiting (ver. 2)
  92. AT+CHLD Call hold and multiparty (ver. 1)
  93. AT+CSSN Supplementary service notification (ver. 2)
  94. AT+CAOC Advice of charge
  95. AT+CACM Accumulated call meter (ver. 2)
  96. AT+CAMM Accumulated call meter maximum
  97. AT+CDIP Called line identification presentation
  98. AT+COLP Connected line identification presentation
  99. AT+CPOL Preferred operator list
  100. AT+COPN Read operator names
  101. AT*EDIF Divert function (ver. 2)
  102. AT*EIPS Identify presentation set
  103. AT+CUSD Unstructured supplementary service data (ver. 2)
  104. AT+CLCK Facility lock (ver. 5)
  105. AT+CPWD Change password (Ver. 3)
  106. AT+CFUN Set phone functionality (ver. 2)
  107. AT+CPAS Phone activity status (ver. 3)
  108. AT+CPIN PIN control (ver. 2)
  109. AT+CBC Battery charge (ver. 2)
  110. AT+CSQ Signal quality (ver.1)
  111. AT+CKPD Keypad control (ver. 7)
  112. AT+CIND Indicator control (ver. 5)
  113. AT+CMAR Master reset
  114. AT+CMER Mobile equipment event reporting
  115. AT*ECAM Ericsson call monitoring (ver. 2)
  116. AT+CLAN Language
  117. AT*EJAVA Ericsson Java application function
  118. AT+CSIL Silence Command
  119. AT*ESKL Key-lock mode
  120. AT*ESKS Key sound
  121. AT*EAPP Application function (ver. 5)
  122. AT+CMEC Mobile equipment control mode
  123. AT+CRSM Restricted SIM access
  124. AT*EKSE Ericsson keystroke send
  125. AT+CRSL Ringer sound level (ver. 2)
  126. AT+CLVL Loudspeaker volume level
  127. AT+CMUT Mute control
  128. AT*EMEM Ericsson memory management
  129. AT+CRMP Ring melody playback (ver. 2)
  130. AT*EKEY Keypad/joystick control (ver. 2)
  131. AT*ECDF Ericsson change dedicated file
  132. AT*STKC SIM application toolkit configuration
  133. AT*STKE SIM application toolkit envelope command send
  134. AT*STKR SIM application toolkit command response
  135. AT+CMEE Report mobile equipment error
  136. AT+CSMS Select message service (ver.2)
  137. AT+CPMS Preferred message storage (ver. 4)
  138. AT+CMGF Message format (ver. 1)
  139. AT+CSCA Service centre address (ver. 2)
  140. AT+CSAS Save settings
  141. AT+CRES Restore settings
  142. AT+CNMI New messages indication to TE (ver. 4)
  143. AT+CMGL List message (ver. 2)
  144. AT+CMGR Read message (ver. 2)
  145. AT+CMGS Send message (ver. 2)
  146. AT+CMSS Send from storage (ver. 2)
  147. AT+CMGW Write message to memory (ver. 2)
  148. AT+CMGD Delete message
  149. AT+CMGC Send command (ver. 1)
  150. AT+CMMS More messages to send
  151. AT+CGDCONT Define PDP context (ver. 1)
  152. AT+CGSMS Select service for MO SMS messages
  153. AT+CGATT Packet service attach or detach
  154. AT+CGACT PDP context activate or deactivate
  155. AT+CGDATA Enter data state
  156. AT+CGEREP Packet domain event reporting (ver. 1)
  157. AT+CGREG Packet domain network registration status
  158. AT+CGPADDR Show PDP address
  159. AT+CGDSCONT Define secondary PDP context
  160. AT+CGTFT Traffic flow template
  161. AT+CGEQREQ 3G quality of service profile (requested)
  162. AT+CGEQMIN 3G quality of service profile (minimum acceptable)
  163. AT+CGEQNEG 3G quality of service profile (negotiated)
  164. AT+CGCMOD PDP context modify
  165. Extension of ATD – Request GPRS service
  166. Extension of ATD – Request packet domain IP service
  167. AT+CPBS Phonebook storage (ver. 3)
  168. AT+CPBR Phonebook read (ver. 2)
  169. AT+CPBF Phonebook find (ver. 2)
  170. AT+CPBW Phonebook write (ver. 4)
  171. AT+CCLK Clock (ver. 4)
  172. AT+CALA Alarm (ver. 3)
  173. AT+CALD Alarm delete
  174. AT+CAPD Postpone or dismiss an alarm (ver. 2)
  175. AT*EDST Ericsson daylight saving time
  176. AT+CIMI Request international mobile subscriber identity
  177. AT*EPEE PIN event
  178. AT*EAPS Active profile set
  179. AT*EAPN Active profile rename
  180. AT*EBCA Battery and charging algorithm (ver. 4)
  181. AT*ELIB Ericsson list Bluetooth devices
  182. AT*EVAA Voice answer active (ver. 1)
  183. AT*EMWS Magic word set
  184. AT+CPROT Enter protocol mode
  185. AT*EWDT WAP download timeout
  186. AT*EWBA WAP bookmark add (ver. 2)
  187. AT*EWCT WAP connection timeout
  188. AT*EIAC Internet account, create
  189. AT*EIAD Internet account configuration, delete
  190. AT*EIAW Internet account configuration, write general parameters
  191. AT*EIAR Internet account configuration, read general parameters
  192. AT*EIAPSW Internet account configuration, write PS bearer parameters
  193. AT*EIAPSR Internet account configuration, read PS bearer parameters
  194. AT*EIAPSSW Internet account configuration, write secondary PDP context parameters
  195. AT*EIAPSSR Internet account configuration, read secondary PDP context parameters
  196. AT*EIACSW Internet account configuration, write CSD bearer parameters
  197. AT*EIACSR Internet account configuration, read CSD bearer parameters
  198. AT*EIABTW Internet account configuration, write Bluetooth bearer parameters
  199. AT*EIABTR Internet account configuration, read Bluetooth bearer parameters
  200. AT*EIAAUW Internet account configuration, write authentication parameters
  201. AT*EIAAUR Internet account configuration, read authentication parameters
  202. AT*EIALCPW Internet account configuration, write PPP parameters – LCP
  203. AT*EIALCPR Internet account configuration, read PPP parameters – LCP
  204. AT*EIAIPCPW Internet account configuration, write PPP parameters – IPCP
  205. AT*EIAIPCPR Internet account configuration, read PPP parameters – IPCP
  206. AT*EIADNSV6W Internet account configuration, write DNS parameters – IPv6CP
  207. AT*EIADNSV6R Internet account configuration, read DNS parameters – IPv6CP
  208. AT*EIARUTW Internet account configuration, write routing table parameters
  209. AT*EIARUTD Internet account configuration, delete routing table parameters
  210. AT*EIARUTR Internet account configuration, read routing table parameters
  211. AT*SEACC Accessory class report
  212. AT*SEACID Accessory identification
  213. AT*SEACID2 Accessory identification (Bluetooth)
  214. AT*SEAUDIO Accessory class report
  215. AT*SECHA Charging control
  216. AT*SELOG SE read log
  217. AT*SEPING SE ping command
  218. AT*SEAULS SE audio line status
  219. AT*SEFUNC SE functionality status (ver. 2)
  220. AT*SEFIN SE flash Information
  221. AT*SEFEXP Flash auto exposure setting from ME
  222. AT*SEMOD Camera mode indicator to the flash
  223. AT*SEREDI Red eye reduction indicator to the flash
  224. AT*SEFRY Ready indicator to the ME
  225. AT*SEAUP Sony Ericsson audio parameters
  226. AT*SEVOL Volume level
  227. AT*SEVOLIR Volume indication request
  228. AT*SEBIC Status bar icon
  229. AT*SEANT Antenna identification
  230. AT*SESP Speakermode on/off
  231. AT*SETBC Text to bitmap converter
  232. AT*SEAVRC Sony Ericsson audio video remote control
  233. AT*SEMMIR Sony Ericsson multimedia information request
  234. AT*SEAPP Sony Ericsson application
  235. AT*SEAPPIR Sony Ericsson application indication request
  236. AT*SEJCOMM Sony Ericsson Java comm
  237. AT*SEDUC Sony Ericsson disable USB charge
  238. AT*SEABS Sony Ericsson accessory battery status
  239. AT*SEAVRCIR Sony Ericsson audio video remote control indication request
  240. AT*SEGPSA Sony Ericsson global positioning system accessory
  241. AT*SEAUDIO Accessory class report
  242. AT*SEGPSA Sony Ericsson global positioning system accessory
  243. AT*SEAUDIO Accessory Class Report
  244. AT*SEGPSA Sony Ericsson global positioning system accessory
  245. AT*SETIR Sony Ericsson time information request
  246. AT*SEMCM Sony Ericsson memory card management
  247. AT*SEAUDIO Accessory Class Report

Enable and disable Trigger in SQL Server

Disable triggers instead of dropping them.
Business rules in a table often expect your application to update the table one row at a time. Also, some triggers generate an error when the code in the trigger assigns a local variable from a column selected out of the inserted virtual table: the assignment fails when you update multiple rows, because the inserted table then contains more than one row and the subquery returns more than a single value. Multi-row updates need special handling in such scenarios. Developers often wind up dropping a trigger before a multi-row update and re-creating it later.

But this scenario can be handled by disabling the trigger instead:

ALTER TABLE MyTable DISABLE TRIGGER MyTrigger

You can re-enable the trigger using:
ALTER TABLE MyTable ENABLE TRIGGER MyTrigger

2 Simple steps to speed up your SQL Server execution

1. Replace "COUNT(*)" with "EXISTS" when checking for existence.

IF (SELECT COUNT(*) FROM Orders WHERE ShipVia = 3) > 0

The execution plan shows that SQL Server has to read every matching row in the Orders table. You can achieve the same result with:

IF EXISTS (SELECT * FROM Orders WHERE ShipVia = 3)

You will see a major speed improvement

2. "NOT IN" subqueries can be replaced with a LEFT OUTER JOIN.

SELECT * FROM Customers WHERE CustomerID NOT IN (SELECT CustomerID FROM Orders)
Replace it with an outer join:

SELECT c.* FROM Customers c LEFT OUTER JOIN Orders o ON o.CustomerID = c.CustomerID WHERE o.CustomerID IS NULL

Serial Port Communication in Java

Everyone searches Sun Microsystems' website for a Java serial communication library (javax.comm) for Windows, and alas! no results.

To access serial port with java use RXTX Serial port Library

Experts say:

"If you want to access the serial (RS232) or parallel port with Java, you need to install a platform/operating-system-dependent library. Install either javax.comm from Sun (Sun no longer offers the Windows platform binaries of javax.comm; however, javax.comm 2.0.3 can be used on Windows in conjunction with the Win32 implementation layer provided by the RxTx project), or, better, install the rxtxSerial and/or rxtxParallel library from rxtx.org (Windows, Linux, Mac OS X)."

RXTX supports all major platforms, including Windows, Linux, and Mac OS X. It can be downloaded from www.rxtx.org.

Downloads for all supported platforms are available here: http://rxtx.qbang.org/wiki/index.php/Download

Direct Download is here

After downloading the zip file, unzip it. It contains libraries for Windows, Linux, and Mac OS X in their respective directories. Copy the DLL from the Windows directory of your unzipped folder and paste it into "C:\Windows\System32".

It also contains the RXTXcomm.jar file; copy this into your Java application's lib directory (create the lib directory if it doesn't exist).

Add a reference to this jar file in your application.
Then try this sample program, which communicates using AT commands (you can use a mobile phone that accepts AT commands).

Sample Program (ListPortClass.java)
import gnu.io.*;
import java.io.*;

public class ListPortClass {

    public static void main(String[] s) {
        try {
            CommPortIdentifier portIdentifier = CommPortIdentifier.getPortIdentifier("COM1");
            if (portIdentifier.isCurrentlyOwned()) {
                System.out.println("Port in use!");
            } else {
                SerialPort serialPort = (SerialPort) portIdentifier.open("ListPortClass", 300);
                serialPort.setSerialPortParams(9600, SerialPort.DATABITS_8,   // 9600 baud is a common modem default
                        SerialPort.STOPBITS_1, SerialPort.PARITY_NONE);

                OutputStream mOutputToPort = serialPort.getOutputStream();
                InputStream mInputFromPort = serialPort.getInputStream();

                String mValue = "AT\r";
                System.out.println("Beginning to write.");
                mOutputToPort.write(mValue.getBytes()); // send the AT command to the modem
                mOutputToPort.flush();
                System.out.println("AT command written to port.");

                System.out.println("Waiting for reply...");
                byte[] mBytesIn = new byte[20];
                int len = mInputFromPort.read(mBytesIn); // blocks until the modem answers
                String value = new String(mBytesIn, 0, Math.max(len, 0));
                System.out.println("Response from serial device: " + value);

                serialPort.close();
            }
        } catch (Exception ex) {
            System.out.println("Exception : " + ex.getMessage());
        }
    }
}



Saturday, February 14, 2009

Installing Ruby on Rails on Windows


What is Ruby ?
Ruby is a dynamically typed, interpreted, reflective, object-oriented programming language.

What is Rails ?
Rails is an add-on to the Ruby programming language that contains a library, scripts for generating parts of an application, and much more. We don't add Ruby on top of Rails; rather, the Rails framework is an add-on to the Ruby programming language.

Download a recent version of the Ruby package that is marked stable (Download Here). After the download completes, install Ruby. The installation also includes an editor (SciTE). After installation, type "irb" at your command prompt, which gives you a prompt like the one below:
e.g.: >puts "hello" will give you the output hello
NB: type "exit" to exit from irb.

If you want to check whether the installation is correct, type "dir c:\ruby\bin\irb*" at the command prompt. In response you should get irb and irb.bat; if not, re-install Ruby.

Now comes the installation of Rails, which requires an internet connection. Open a command prompt and type:
> gem install rails -r -y

-r tells gem to install Rails remotely, and
-y tells gem to also install the supporting programs that Rails needs in order to work properly.

NB: You can get help using "ri" at the command prompt, e.g.: ri times

Tuesday, February 10, 2009

How to retrieve the processor ID, motherboard serial number, and MAC address of a PC in C#.NET

Add the System.Management namespace to your application (you can find it under
References -> Add Reference -> .NET tab -> System.Management), then add "using System.Management;" to your code.

The following code snippet will do the work for you.
//Code for retrieving the motherboard's serial number
ManagementObjectSearcher MOS = new ManagementObjectSearcher("SELECT * FROM Win32_BaseBoard");
foreach (ManagementObject getserial in MOS.Get())
    textBox1.Text = getserial["SerialNumber"].ToString();

//Code for retrieving the processor's ID
MOS = new ManagementObjectSearcher("SELECT * FROM Win32_Processor");
foreach (ManagementObject getPID in MOS.Get())
    textBox2.Text = getPID["ProcessorId"].ToString();

//Code for retrieving the MAC address from the network adapter configuration
MOS = new ManagementObjectSearcher("SELECT * FROM Win32_NetworkAdapterConfiguration");
foreach (ManagementObject mac in MOS.Get())
    if (mac["MACAddress"] != null)            // adapters without a MAC address return null
        textBox3.Text = mac["MACAddress"].ToString();


Sending SMS in C#.NET using a GSM Modem and AT (ATtention) Commands

SMS (Short Message Service), as specified by ETSI (documents GSM 03.38 and GSM 03.40), can be up to 160 characters long, with each character represented by the seven-bit default alphabet. 8-bit messages (maximum 140 characters) are usually used for sending data such as images and ringtones in smart messaging. 16-bit messages (70 characters) are used for UCS2 text messages; a 16-bit text message of class 0 will appear as a flash (alert) SMS on some phones. It is strongly recommended that you read the documents mentioned above.

There are two ways of sending and receiving SMS messages:

1. PDU (Protocol Data Unit) mode
2. Text mode (unavailable on some phones, e.g. the Sony Ericsson Cyber-shot K810i)

It should be noted that there are several encoding alternatives for displaying an SMS message, of which some common options are "PCCP437", "PCDN", "8859-1", "IRA" and "GSM". These are set with the AT command AT+CSCS when reading the message from a computer application; if you read the message on your phone, the phone will choose an appropriate encoding.
If text mode is used, the application is limited to a set of preset encoding options. If PDU mode is used, any encoding can be implemented, so we explain PDU mode in detail here.

PDU format

A PDU string is a collection of meta-information and payload: it contains the SMS service centre (SMSC) address, addressing and protocol fields, timestamps (for received messages), and the actual message itself. The string consists of hexadecimal octets and decimal semi-octets. The following is an example of a PDU string which sends the message "hellohello" to the phone number 9447537254:

07 91 194924909979 11 00 0C 91 194974352745 00 00 AA 0A E8329BFD4697D9EC37

The above PDU message consists of the following:

07 -> Length of the SMSC information, in this case 7 octets.

91 -> International number format (81 is used for an unknown format).

194924909979 -> Message centre address (919442099997) in decimal semi-octets.

11 -> First octet of the SMS-SUBMIT message.

00 -> TP-Message-Reference; the value 00 lets the phone set the message reference number itself.

0C -> Length of the destination phone number (12 digits).

91 -> International format of the destination phone number.

194974352745 -> Destination phone number (919447537254) in decimal semi-octets.

00 -> Protocol identifier (default).

00 -> Data coding scheme (default 7-bit alphabet).

AA -> TP-Validity-Period; AA means 4 days. Note that this octet is optional.

0A -> Length of the user data: 10 septets, in hexadecimal.

E8329BFD4697D9EC37 -> User data in hexadecimal, in this case "hellohello".


Decimal semi-octets

A number is encoded in decimal semi-octets by simply swapping each pair of digits. For example, consider the phone number 919447537254:

91 -> 19
94 -> 49
47 -> 74
53 -> 35
72 -> 27
54 -> 45

which gives 194974352745. (A number with an odd count of digits is padded with F before swapping.)
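The swap above is mechanical, so it is easy to sketch in a couple of lines of Python (the function name is our own):

```python
def semi_octets(number):
    """Encode a phone number as decimal semi-octets by swapping each digit pair.

    Per GSM 03.40, a number with an odd count of digits is padded with 'F'
    before swapping.
    """
    if len(number) % 2:
        number += "F"
    return "".join(number[i + 1] + number[i] for i in range(0, len(number), 2))

print(semi_octets("919447537254"))  # 194974352745
```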

Creating the user data

The message "hellohello" consists of 10 characters, each represented by a septet (7 bits). Before SMS transfer these septets must be packed into octets.


First, take the decimal equivalent of each character and convert it to its 7-bit binary equivalent.

Alphabet ------Decimal----------------Septet

h ---------------------->104 ----------------------------- >1101000
e ---------------------->101 ----------------------------- >1100101
l ---------------------->108 ----------------------------- >1101100
l ---------------------->108 ----------------------------- >1101100
o ---------------------->111 ----------------------------- >1101111
h ---------------------->104 ----------------------------- >1101000
e ---------------------->101 ----------------------------- >1100101
l ---------------------->108 ----------------------------- >1101100
l ---------------------->108 ----------------------------- >1101100
o ---------------------->111 ----------------------------- >1101111

Next, convert the 7-bit septets to 8-bit octets as follows:

The first septet (h) is turned into an octet by taking the rightmost bit of the second septet and inserting it on the left, which yields 1 + 1101000 = 11101000 (E8). The rightmost bit of the second septet is thereby consumed, so the second septet needs two bits from the third septet to make an 8-bit octet. This process continues, yielding the following octets. (Note that there are only 9 octets for 10 septets.)

1101000 ----------------------> 11101000 --------------> E8
1100101 ----------------------> 00110010 --------------> 32
1101100 ----------------------> 10011011 --------------> 9B
1101100 ----------------------> 11111101 --------------> FD
1101111 ----------------------> 01000110 --------------> 46
1101000 ----------------------> 10010111 --------------> 97
1100101 ----------------------> 11011001 --------------> D9
1101100 ----------------------> (fully consumed by its neighbouring octets)
1101100 ----------------------> 11101100 --------------> EC
1101111 ----------------------> 00110111 --------------> 37

Therefore the message "hellohello" is converted into the 9 octets "E8 32 9B FD 46 97 D9 EC 37".
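The same packing can be expressed compactly in Python. This is a sketch for messages that use only characters whose GSM 03.38 code equals their ASCII code (true for "hellohello"):

```python
def pack_7bit(text):
    """Pack 7-bit septets into octets as described above (GSM 03.38)."""
    septets = [ord(c) for c in text]
    octets = []
    shift = 0
    for i, septet in enumerate(septets):
        if shift == 7:
            # every 8th septet is fully absorbed by the preceding octets
            shift = 0
            continue
        octet = septet >> shift
        if i + 1 < len(septets):
            # borrow the low bits of the next septet to fill the octet
            octet |= (septets[i + 1] << (7 - shift)) & 0xFF
        octets.append(octet & 0xFF)
        shift += 1
    return "".join("%02X" % o for o in octets)

print(pack_7bit("hellohello"))  # E8329BFD4697D9EC37
```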

Useful AT commands for sending SMS:

NB: each command must be terminated with a carriage return ("\r") to be executed.

AT\r -> Basic attention command for communication between the phone and an accessory; determines the presence of a phone.

AT+CMGF=0 -> Tells the terminal adapter that PDU mode is used. AT+CMGF=? shows whether the command is supported by the phone; the response is of the form +CMGF: (list of supported modes, e.g. 0,1).
AT+CSCA? -> Retrieves the message centre address. The response is of the form +CSCA: followed by the message centre address.
AT+CMGS -> Sends a message from the terminal to the network (SMS-SUBMIT). The command specifies the length of the actual TPDU in octets (excluding the SMSC address octets), followed by the actual PDU string and Ctrl+Z. (e.g. AT+CMGS=<length in octets>\r<actual PDU><Ctrl+Z>; the decimal code of Ctrl+Z is 26, so simply append that character to the end of the PDU string.)
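As a quick sanity check on the AT+CMGS length rule, the calculation can be sketched in Python (the helper name is our own; the PDU is the "hellohello" example from above):

```python
def cmgs_command(pdu):
    """Build the AT+CMGS line and payload for a complete PDU string.

    The length argument counts TPDU octets only: the whole PDU minus the
    SMSC length byte and the SMSC octets it announces.
    """
    smsc_octets = int(pdu[0:2], 16)           # e.g. "07" -> 7 SMSC octets follow
    tpdu_octets = len(pdu) // 2 - 1 - smsc_octets
    command = "AT+CMGS=%d\r" % tpdu_octets
    payload = pdu + chr(26)                   # terminate the PDU with Ctrl+Z
    return command, payload

cmd, payload = cmgs_command(
    "079119492490997911000C911949743527450000AA0AE8329BFD4697D9EC37")
print(cmd)  # AT+CMGS=23
```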

Receiving SMS in PDU format

Receiving SMS is a little trickier: the incoming PDU string must be decoded by essentially reversing the encoding steps described above.

DOWNLOAD sample project here



Useful links:

Zion Plus Editor For Mobile

Zion Plus Beta 2 is an application that allows you to create text files in phone memory and on the memory card.

If you want to create a folder in your mobile through this application, edit the "directory" field shown when you choose the Save or Save As option. For example, if the directory field shows "file:///e:/folderx/", you can create a folder "NEWFOLDER" in that directory by changing the field to "file:///e:/folderx/NEWFOLDER/".

You can specify the file name in the filename textbox; the default name is newfile.

A Help menu assists you in using this application.

You can download this application by clicking the link below.

Download Zion Plus Beta

Sunday, February 8, 2009

Saving an image into a SQL Server database in a C# Windows application.

//Read the image into a FileStream from a physical location
FileStream fs = new FileStream(@"C:\image1.jpg", FileMode.Open, FileAccess.Read);

//Initialize a binary reader to read the binary data of the image
BinaryReader br = new BinaryReader(fs);

//Use ReadBytes (or any equivalent method) to read the image data into a byte array
byte[] photo = br.ReadBytes((int)fs.Length);

//Close the binary reader and the file stream
br.Close();
fs.Close();

//Next create a SQL command that uses SQL parameters
SqlCommand addEmp = new SqlCommand(
    "INSERT INTO Employees (FirstName, Photo) VALUES (@FirstName, @Photo)", conn);
addEmp.Parameters.Add("@FirstName", SqlDbType.NVarChar, 10).Value = pfirstName;

//Assign the byte array to the corresponding SQL parameter
addEmp.Parameters.Add("@Photo", SqlDbType.Image, photo.Length).Value = photo;

//Execute the query and finally you are done
addEmp.ExecuteNonQuery();


Saturday, February 7, 2009

Top 10 hacking incidents.

Top 10 hacking incidents of all time: instances where some of the most seemingly secure computer networks were compromised.

Early 1990s :
Kevin Mitnick, often (incorrectly) called the god of hackers, broke into the computer systems of the world's top technology and telecommunications companies, including Nokia, Fujitsu, Motorola, and Sun Microsystems. He was arrested by the FBI in 1995 and released on parole in 2000. He never termed his activity hacking; instead he called it social engineering.

November 2002:
Englishman Gary McKinnon was arrested in November 2002 following an accusation that he hacked into more than 90 US military computer systems in the UK. He is currently undergoing trial in a British court for a "fast-track extradition" to the US where he is a wanted man. The next hearing in the case is slated for today.

Russian computer geek Vladimir Levin effected what can easily be called The Italian Job online - he was the first person to hack into a bank to extract money. Early 1995, he hacked into Citibank and robbed $10 million. Interpol arrested him in the UK in 1995, after he had transferred money to his accounts in the US, Finland, Holland, Germany and Israel.

When a Los Angeles area radio station announced a contest that awarded a Porsche 944 S2 to the 102nd caller, Kevin Poulsen took control of the entire city's telephone network, ensured he was the 102nd caller, and took away the Porsche beauty. He was arrested later that year and sentenced to three years in prison. He is currently a senior editor at Wired News.

Kevin Poulsen again. A little-known incident in which Poulsen, then just a student, hacked into Arpanet, the precursor to the Internet. Arpanet was a global network of computers, and Poulsen took advantage of a loophole in its architecture to gain temporary control of the US-wide network.

US hacker Timothy Lloyd planted six lines of malicious software code in the computer network of Omega Engineering which was a prime supplier of components for NASA and the US Navy. The code allowed a "logic bomb" to explode that deleted software running Omega's manufacturing operations. Omega lost $10 million due to the attack.

Twenty-three-year-old Cornell University graduate Robert Morris unleashed the first Internet worm on to the world. Morris released 99 lines of code to the internet as an experiment, but realised that his program infected machines as it went along. Computers crashed across the US and elsewhere. He was arrested and sentenced in 1990.

The Melissa virus was the first of its kind to wreak damage on a global scale. Written by David Smith (then 30), Melissa spread to more than 300 companies across the world completely destroying their computer networks. Damages reported amounted to nearly $400 million. Smith was arrested and sentenced to five years in prison.

MafiaBoy, whose real identity has been kept under wraps because he is a minor, hacked into some of the largest sites in the world, including eBay, Amazon and Yahoo between February 6 and Valentine's Day in 2000. He gained access to 75 computers in 52 networks, and ordered a Denial of Service attack on them. He was arrested in 2000.

They called themselves Masters of Deception, targeting US phone systems. The group hacked into the National Security Agency, AT&T, and Bank of America. It created a system that let them bypass long-distance phone call systems, and gain access to private lines.

Friday, February 6, 2009

Web Crawler or WebRobot or Web Spider Working

A web spider, sometimes called a crawler or a robot, plays an important role as essential infrastructure for every search engine. It automatically discovers and collects resources, especially web pages, from the Internet. As the Internet grows rapidly, a typical web spider design may not cope with the overwhelming number of web pages.

Search engines.
A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot. Robots are software agents.

Web Agent
The word "agent" is used with many meanings in computing these days. Specifically:

· Autonomous agents are programs that travel between sites, deciding for themselves when to move and what to do. These can only travel between special servers and are currently not widespread on the Internet.

· Intelligent agents are programs that help users with things such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.

· User-agent is the technical name for a program that performs networking tasks on a user's behalf, such as Web user-agents (Netscape Navigator, Microsoft Internet Explorer) and email user-agents (Qualcomm Eudora).

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit, and spaces out requests over a long period of time, it is still a robot. Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.

Basic Search engine Architecture

Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. A Web crawler is a program, which automatically traverses the web by downloading documents and following links from page to page. They are mainly used by web search engines to gather data for indexing. Other possible applications include page validation, structural analysis and visualization, update notification, mirroring and personal web assistants/agents etc. Web crawlers are also known as spiders, robots, worms etc.

Crawlers are automated programs that follow the links found on the web pages.

There is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server then compresses and stores the web pages into a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URL Resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs. The links database is used to compute Page Ranks for all the documents. The sorter takes the barrels, which are sorted by doc ID and resorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of word IDs and offsets into the inverted index.
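The relative-to-absolute URL conversion that the URL Resolver performs is the same job urljoin does in Python's standard library (the URLs below are illustrative):

```python
from urllib.parse import urljoin

# Resolve links found on a page against that page's own URL
base = "http://example.com/docs/page.html"
print(urljoin(base, "../images/logo.png"))  # http://example.com/images/logo.png
print(urljoin(base, "about.html"))          # http://example.com/docs/about.html
```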

A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the PageRanks to answer queries. (Brin and Page, 1998).

Search Engine Architecture

How a Web Crawler Works

Web crawlers are an essential component of search engines, and running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially when crawling a site from multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel. Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. The following is the process by which Web crawlers work:

1. Download the Web page.

2. Parse through the downloaded page and retrieve all the links.

3. For each link retrieved, repeat the process.
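The three steps above can be sketched as a toy crawler. Here fetch() is a stand-in for an HTTP download (the "web" is an in-memory dict of made-up URLs), and urljoin resolves relative links:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(seed, fetch):
    """Breadth-first crawl: download, parse links, repeat for each new link."""
    frontier = deque([seed])        # URLs waiting to be visited
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)           # step 1: download the web page
        if page is None:
            continue
        parser = LinkParser()
        parser.feed(page)           # step 2: retrieve all the links
        for link in parser.links:   # step 3: repeat the process per link
            frontier.append(urljoin(url, link))
    return visited

# A tiny in-memory "web" standing in for real HTTP:
web = {
    "http://example.com/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}
print(sorted(crawl("http://example.com/", web.get)))
```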

Architecture of web crawler

Web Crawler Architecture

The Web crawler can be used for crawling through a whole site on the Inter-/Intranet. You specify a start-URL and the Crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can be seen as a tree-structure, the root is the start-URL; all links in that root-HTML-page are direct sons of the root. Subsequent links are then sons of the previous sons.

A single URL Server serves lists of URLs to a number of crawlers. Web crawler starts by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively.

Webcrawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is to automate the process of following links.

Web crawling can be regarded as processing items in a queue. When the crawler visits a web page, it extracts links to other web pages. So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue.

Crawling policies

There are three important characteristics of the Web that generate a scenario in which Web crawling is very difficult:

· its large volume,

· its fast rate of change, and

· dynamic page generation,

which combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages has also created difficulty in that endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided contents, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
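The URL-count arithmetic in that gallery example (4 × 3 × 2 × 2 = 48) can be checked directly; the option values below are made up for illustration:

```python
from itertools import product

sort_orders = ["name", "date", "size", "rating"]  # four ways to sort images
thumb_sizes = ["small", "medium", "large"]        # three thumbnail sizes
file_formats = ["jpeg", "png"]                    # two file formats
user_content = [True, False]                      # user-provided content on/off

urls = list(product(sort_orders, thumb_sizes, file_formats, user_content))
print(len(urls))  # 48 distinct URLs, all serving the same content
```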

The behavior of a web crawler is the outcome of a combination of policies:

· A selection policy that states which pages to download.

· A re-visit policy that states when to check for changes to the pages.

· A politeness policy that states how to avoid overloading websites.

· A parallelization policy that states how to coordinate distributed web crawlers

Selection policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available internet; a study by Lawrence and Giles (Lawrence and Giles, 2000) showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Different types of crawling.

Path-ascending crawling

Some crawlers intend to download as many resources as possible from a particular Web site. Cothey introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl.

Focused crawling
The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers.

Crawling the Deep Web

A vast number of Web pages lie in the deep or invisible Web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find them if there are no links that point to them. Google's Sitemap Protocol and mod_oai (Nelson et al., 2005) are intended to allow discovery of these deep-Web resources.

Re-visit policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened, including creations, updates and deletions of pages. Two simple re-visit policies exist:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

Politeness policy

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers.

As noted by Koster (Koster, 1995), the use of web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using web crawlers include:

Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.

Server overload, especially if the frequency of accesses to a given server is too high.

Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.

Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

Parallelization policy

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

Crawling is, in effect, the process by which the search engine keeps its index synchronised with the Web its users are searching.

Web Robot Algorithms

Each robot uses different algorithms to decide where to visit. In general, they start from a historical list of URLs, especially some of the most popular web sites on the Web.

Starting at a location on the web reveals a branching structure which, if cycles are avoided, is essentially a tree. Common traversal strategies are:

· Depth First Traversal

· Breadth First Traversal

· Heuristic search

For each URL (web page), a heuristic function evaluates its importance; the most important web pages are then visited first.

Robots Exclusion Standard

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard complements Sitemaps, a robot inclusion standard for websites.

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data.

This example allows all robots to visit all files, because the wildcard "*" applies to all robots and the Disallow field is empty:

User-agent: *
Disallow:


This example keeps Google's robot out:

User-agent: googlebot

Disallow: /

The next is an example that tells all crawlers not to enter into four directories of a website:

User-agent: *

Disallow: /cgi-bin/

Disallow: /images/

Disallow: /tmp/

Disallow: /private/

Example that tells a specific crawler not to enter one specific directory:

User-agent: BadBot

Disallow: /private/
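Python's standard library includes a parser for this convention, which makes rules like the ones above easy to verify (the hostname below is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /private/",
])

# Pages outside the disallowed directories may be fetched by any robot...
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))       # True
# ...but the listed directories are off limits.
print(rp.can_fetch("AnyBot", "http://example.com/private/x.html"))   # False
```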

Will the /robots.txt standard be extended?

Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.

What if I can't make a /robots.txt file?

Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.

Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.

Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.

To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.
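A toy version of such an index (lowercasing, stop-word removal, and per-term posting lists with word positions) might look like this; the stop-word list is only a small sample:

```python
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "as", "a", "by"}

def build_index(docs):
    """Build a tiny inverted index: term -> {doc_id: [word positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue  # stop words do little to narrow a search
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "The spider crawls the web",
    2: "The web is indexed by the spider",
}
index = build_index(docs)
print(index["spider"])  # {1: [1], 2: [6]}
```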

Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
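PageRank itself can be illustrated with a short power iteration over a tiny, made-up link graph (damping factor 0.85, as in the original Brin and Page paper):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a {page: [outgoing links]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)  # split rank among out-links
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
rank = pagerank(links)
# C collects links from both A and B, so it ends up with the highest rank
print(max(rank, key=rank.get))  # C
```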

Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search Form and Using Search Operators (Advanced Operators).
Let’s see how Google processes a query.

How a Google query travels

Demerits of web crawlers

It requires considerable bandwidth.

It is sometimes used for spamming.

It is unable to crawl the entire deep web. Deep-web content includes:

Dynamic content - dynamic pages which are returned in response to a submitted query or accessed only through a form (especially if open-domain input elements e.g. text fields are used; such fields are hard to navigate without domain knowledge).

Unlinked content - pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).

Private Web - sites that require registration and login (password-protected resources).

Contextual Web - pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence).

Limited access content - sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs or pragma:no-cache/cache-control:no-cache HTTP headers), prohibiting search engines from browsing them and creating cached copies.

Scripted content - pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or AJAX solutions.

Non-HTML/text content - textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
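The Robots Exclusion Standard mentioned above can be honored from crawler code with Python’s standard urllib.robotparser module. The robots.txt rules, bot name, and buypizza.com URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules directly from text; a real crawler would
# fetch http://www.buypizza.com/robots.txt before crawling the site.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "http://www.buypizza.com/menu.html"))       # True
print(rp.can_fetch("MyBot", "http://www.buypizza.com/private/a.html"))  # False
```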

  • Poorly written web robots may damage files on the server.

  • Certain robot implementations can overload networks and servers (and have in the past). This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.

  • Robots are operated by humans, who make mistakes in configuration or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.

  • Web-wide indexing robots build a central database of documents, which doesn't scale well to millions of documents on millions of sites.

Guidelines for robot writers

To write a good web robot, you should try to avoid:

· Overloading the network

· Overloading a server with rapid requests for documents

· Repeatedly retrying servers that are unreachable

· Getting caught in cycles in the web structure
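Several of these guidelines (don’t hammer servers, cap your work, don’t loop forever on cycles) can be combined in a minimal crawler skeleton. Here `get_links` is a stand-in for real fetching and parsing, and the tiny in-memory site map is invented for the demonstration.

```python
import time
from collections import deque

def crawl(start, get_links, delay=1.0, max_pages=100):
    """Breadth-first crawl skeleton that avoids cycles (visited set),
    rate-limits requests (delay), and caps total work (max_pages)."""
    visited, queue = set(), deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # cycle/duplicate guard
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
        time.sleep(delay)  # be polite: don't flood the server with requests
    return order

# Invented site map with a cycle: "/" -> "/a" -> "/" ...
site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/a"]}
print(crawl("/", lambda u: site.get(u, []), delay=0))
# Visits each page exactly once despite the cycle back to "/"
```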


· Be accountable

· Test locally

· Stay with it

· Don't hog resources

· Share results

If you are interested in writing your own crawler, please comment on this post mentioning which language you prefer (such as C# or Java) and we will give you assistance through our blog. Also, if you would like clarification on any area of this post, please feel free to comment.







