SEO Expert India: GOOGLE GUIDE - What is Google? - Google Technology

What is Google?

“Googol” is the mathematical term for a 1 followed by 100 zeros. The term was coined by Milton Sirotta, nephew of American mathematician Edward Kasner, and was popularized in the book, “Mathematics and the Imagination” by Kasner and James Newman. Google's play on the term reflects the company's mission to organize the immense amount of information available on the web.

Google Technology

Google.com began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
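As a rough check on those figures, the arithmetic can be written out as below. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes the peak rate could be sustained for a whole day, which the paper does not claim.

```python
# Figures quoted in the paragraph above.
SPIDERS = 4                    # spiders running at peak
CONNECTIONS_PER_SPIDER = 300   # open connections each spider can hold
PAGES_PER_SECOND = 100         # peak crawl rate
KILOBYTES_PER_SECOND = 600     # peak data rate

# Derived values (assumption: the peak rate holds for 24 hours).
open_connections = SPIDERS * CONNECTIONS_PER_SPIDER
avg_page_kb = KILOBYTES_PER_SECOND / PAGES_PER_SECOND
pages_per_day = PAGES_PER_SECOND * 60 * 60 * 24

print(f"Open connections at peak: about {open_connections}")
print(f"Average page size: about {avg_page_kb:.0f} KB")
print(f"Pages per day at peak rate: {pages_per_day:,}")
```

In other words, the quoted numbers imply an average page of about 6 KB.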

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

* Googlebot, a web crawler that finds and fetches web pages.
* The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
* The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let's take a closer look at each part.

Googlebot, Google's Web Crawler

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, and then handing it off to Google's indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it's capable of doing.
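The idea of deliberately slowing requests to each web server can be sketched with a small per-host scheduler. This is not Googlebot's actual code: the class name `PoliteScheduler`, its methods, and the delay value are all hypothetical and chosen only to illustrate the principle.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks when each host was last contacted and enforces a minimum gap."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # hypothetical per-host delay
        self.clock = clock                  # injectable so it can be tested
        self.next_allowed = {}              # host -> earliest time of next request

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be contacted again."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Note that a request to this URL's host is being made now."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay
```

A crawler built this way can still fetch from thousands of different hosts at once, because the delay applies to each host separately.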

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Google's Indexer

The indexer stores the words it finds on each page in an index, which allows rapid access to documents that contain user query terms.

To improve search performance, Google ignores (doesn't index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google's performance.
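The steps described above (lowercasing, dropping punctuation, skipping stop words) can be sketched as a tiny inverted index. This is a minimal illustration, not Google's indexer: the stop-word list is a small made-up sample, and `tokenize` and `build_index` are hypothetical names.

```python
import re
from collections import defaultdict

# A small, hypothetical stop-word list used only for this example.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "i"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each non-stop word to {page_id: [word positions on that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words are not indexed
            index[word].setdefault(page_id, []).append(position)
    return index


pages = {
    "page1": "The spider crawls the Web.",
    "page2": "Why is the web so big?",
}
index = build_index(pages)
print(sorted(index))   # indexed words, with stop words excluded
print(index["web"])    # pages (and word positions) that contain "web"
```

Looking up a query term is then a single dictionary access, which is why an index of this shape gives quick access to the matching documents.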

Google's Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. PageRank is Google's system for ranking web pages.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by the Advanced-Search page and search operators.
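One way to picture the preference for terms that are near each other and in query order is a toy proximity score over word positions. The scoring rule below (one divided by the gap) is an assumption made for illustration, and the function names are hypothetical; Google's real weighting is not public.

```python
def min_ordered_gap(positions_a, positions_b):
    """Smallest gap b - a over all pairs where term A comes before term B.

    Returns None when term B never follows term A on the page.
    """
    gaps = [b - a for a in positions_a for b in positions_b if b > a]
    return min(gaps) if gaps else None


def proximity_score(positions_a, positions_b):
    """Toy score: 1.0 for an exact two-word phrase, smaller as the gap grows."""
    gap = min_ordered_gap(positions_a, positions_b)
    return 0.0 if gap is None else 1.0 / gap


# Word positions of "web" and "crawler" on two hypothetical pages.
page_x = {"web": [3, 17], "crawler": [4]}   # "web crawler" appears as a phrase
page_y = {"web": [2], "crawler": [12]}      # the terms are ten words apart

print(proximity_score(page_x["web"], page_x["crawler"]))
print(proximity_score(page_y["web"], page_y["crawler"]))
```

Under this rule the page containing the exact phrase scores higher than the page where the two words are far apart, and a page with the words in the wrong order scores zero.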

Let's see how Google processes a query.

History of Site Ranking

In the early 1990s, when the web was emerging, several sites with industry-specific content were being added to the web each day. Web surfers, on the other hand, had very few tools to locate such sites: they believed the sites were out there but had no clue about their domain names or web addresses. The birth of Yahoo in 1994 offered surfers some relief. Yahoo classified each site it discovered in a neatly organized directory and also embedded a search engine in its site to search for sites based on 'keywords' in its database. Several other search engines, such as AltaVista, Excite, and Lycos, followed the trend and offered site search facilities to users. Most of these search engines relied heavily on Meta Tags, judging the relevance of websites by the keywords they found in the tags.

Things worked well until site owners and webmasters realized that they could 'embed' industry-specific keyword phrases in their Meta Tags and other site code, and so manipulate their way higher in the search results. Over time, search results became cluttered with sites that stuffed their pages with relevant keywords but offered poor content to the visitor. The credibility of search engines was now at stake, and they had to find a way to offer more refined search output to their users.

What is PageRank?

PageRank is an algorithm developed by Google founders Larry Page and Sergey Brin at Stanford University. It measures the importance of a web page on a scale from 0 to 10, where 10 is the highest. The main factor behind the PageRank algorithm is link popularity. If one site links to another site, Google interprets the link as a vote: the more votes cast for a page, the more important the page is taken to be. ...

From here on in, we'll occasionally refer to PageRank as “PR”.

Note:

Not all links are counted by Google. For instance, they filter out links from known link farms. Some links can cause a site to be penalized by Google. They rightly figure that webmasters cannot control which sites link to their sites, but they can control which sites they link out to. For this reason, links into a site cannot harm the site, but links from a site can be harmful if they link to penalized sites. So be careful which sites you link to. If a site has PR0, it is usually a penalty, and it would be unwise to link to it.

Emergence of Google PageRank

Google realized the problem conventional search engines faced in dealing with this situation. If the control of relevance remained with the webmasters, the ranking results would remain contaminated with sites artificially inflating their keyword relevance.

The web, by its very nature, is based on hyperlinks, where sites link to other prominent sites. If you follow the logic that you tend to link to sites you consider important, then in essence you are casting a vote in favor of the sites that you link to. When hundreds or thousands of sites link to a site, it is logical to assume that the site is good and important.

Taking this logic further, the Google founders, Sergey Brin and Larry Page, formulated a search engine algorithm that shifted the ranking weight to off-page factors. They developed a formula called PageRank (named after Larry Page), in which the algorithm counts the sites that link to a page and assigns the page an importance score on a scale of 0 to 10. The more sites that link to a page, the higher its PageRank.

The Google Toolbar

You can download the Google Toolbar (free) and install it in Internet Explorer within minutes. Among other useful features, it displays the PageRank of each web page you visit.

The Google Toolbar appears in the Internet Explorer window and can be used to search the web from any page. It displays the PageRank of each web page on a scale of 0 to 10. If you have the toolbar installed, you will be used to seeing each page's PageRank as you browse the web. Google does not display the PageRank of web pages that it has not indexed. Please note that the toolbar displays the PageRank of individual pages and not of the site as a whole.

PageRank in Google's own Words

Google explains PageRank as follows:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.

Relationship between Search Engine Ranking and PageRank

While the exact algorithm of each search engine is a closely guarded secret, search engine analysts believe that a page's ranking in the search results is some form of product of ‘Page Relevance’ and ‘PageRank’. Simply put, the formula would look something like:

Ranking = (Page Relevance) x (PageRank)

PageRank itself is calculated as follows:

PR(A) = (1 - d) + d * (PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That is the original equation, published while PageRank was being developed. Google probably uses a variation of it today but has not said what that variation is. The published equation is good enough for our purposes.

In the equation, t1 ... tn are the pages linking to page A, C(t) is the number of outbound links on page t, and d is a damping factor, usually set to 0.85.

We can think of it in a simpler way:

A page's PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it)

“share” = the linking page's PageRank divided by the number of outbound links on the page.

A page “votes” an amount of PageRank onto each page that it links to. The amount of PageRank that it has to vote with is a little less than its own PageRank value (its own value * 0.85). This value is shared equally between all the pages that it links to.
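Because every page's PageRank depends on the PageRank of the pages linking to it, the equation is usually solved by repeating the calculation until the values settle. The sketch below applies the published equation to a made-up three-page site. The function name, the starting guess of 1.0, the fixed number of iterations, and the example links are all assumptions for illustration, and it is not how Google computes PageRank at scale.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A.

    `links` maps each page to the list of pages it links out to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # arbitrary starting guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page contributes its PageRank divided by its
            # number of outbound links (its "share").
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# A hypothetical three-page site: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(links).items()):
    print(page, round(score, 3))
```

In this example page C ends up with the highest value because it receives votes from both A and B, while B, which receives only half of A's vote, ends up lowest.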

From this, we could conclude that a link from a page with PR4 and 5 outbound links is worth more than a link from a page with PR8 and 100 outbound links. The PageRank of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.

If the PageRank value differences between PR1, PR2 ...PR10 were equal then that conclusion would hold up, but many people believe that the values between PR1 and PR10 (the maximum) are set on a logarithmic scale, and there is very good reason for believing it. Nobody outside Google knows for sure one way or the other, but the chances are high that the scale is logarithmic, or similar.
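The difference between the two readings can be worked through with numbers. In the sketch below, the linear case treats the toolbar value as the real PageRank, and the logarithmic case assumes each toolbar step multiplies the real PageRank by a fixed base. The base of 5 is purely hypothetical, since nobody outside Google knows the real scale, and `vote_value` is an illustrative name.

```python
D = 0.85  # damping factor from the PageRank equation


def vote_value(pagerank_value, outbound_links):
    """PageRank passed along a single link from the linking page."""
    return D * pagerank_value / outbound_links


# Linear reading: the toolbar value is the actual PageRank.
linear_pr4 = vote_value(4, 5)      # PR4 page with 5 outbound links
linear_pr8 = vote_value(8, 100)    # PR8 page with 100 outbound links

# Logarithmic reading (assumption): real PageRank = BASE ** toolbar value.
BASE = 5  # hypothetical base, chosen only for illustration
log_pr4 = vote_value(BASE ** 4, 5)
log_pr8 = vote_value(BASE ** 8, 100)

print(linear_pr4 > linear_pr8)  # True: the PR4 link is worth more if linear
print(log_pr4 > log_pr8)        # False: the PR8 link is worth more if logarithmic
```

With these assumed numbers the conclusion flips: on a linear scale the PR4 page passes on ten times as much as the PR8 page, but on the logarithmic scale the PR8 page passes on far more, even though its vote is split 100 ways.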

Whichever scale Google uses, we can be sure of one thing. A link from another site increases our site's PageRank. Just remember to avoid links from link farms.

Source: Google.com
