Sriram's Blog: All about Google

From now on, I have decided to post some interesting techie articles on my blog, which I will write whenever I have leisure time. If this blog is only about me, it will be boring for readers. So I will update it with news about me and what I am doing these days, along with some good stuff (I hope) that I write…

I have always had a great passion for Google and its products. When Gmail was first introduced, I created an account at once (in May) and proudly announced it to our powerelex group. I thought many people would ask about it and be eager to get an account themselves. Unfortunately, no one cared, and I was badly disappointed. Then, just a few weeks back, many guys in our group were hunting like anything for a Gmail ID. People are creating one at least to boast to others, “I too have a Gmail ID”. Hahahah. Plenty of Gmail reviews are already available on the net.

If you have a good internet connection, don’t miss this one-hour lecture about Google given by Urs Hoelzle of Google Inc. at the University of Washington. You will really miss a lot if you fail to listen to it.

Click Here for Lecture

Google is a play on the word googol, which was coined by Milton Sirotta, nephew of American mathematician Edward Kasner, and was popularized in the book, "Mathematics and the Imagination". It actually refers to the number represented by the numeral 1 followed by 100 zeros. Google's use of the term reflects the company's mission to organize the immense, seemingly infinite amount of information available on the web.

The computing engine that powers Google is one of the largest clusters of Linux servers in the history of the world. They use some 15,000 machines, and the combined power of all these machines is what gives end users that lightning speed. If you happen to chat with a computer geek, he will talk more about the number of machines Google uses than about the number of web pages it has indexed. Nearly every web page in the world is indexed and cached by Google, and multiple copies of each are kept.

It is a great challenge for them to change something in the cluster, or to upgrade software or hardware, without disrupting service. Yet they achieve 100% uptime, which is serious stuff to consider. The software and hardware architecture that Google follows is as simple and elegant as its site.

Google was founded in 1998, and it now handles nearly 150 million queries a day with a peak load of 1000 per second! It was founded by Larry Page and Sergey Brin, both Stanford University graduate students. They first worked together on a search engine called BackRub, which had the unique ability to analyze the back links pointing to a given site. This technique is still used in the present Google search engine and is buzzed about as one of Google's supercapabilities.

Page Ranking :

The most talked-about thing in Google is its page ranking technique, and there is a reason why they built it into their engine. Suppose you are searching for Emacs Manual. There are a lot of such docs available on the Internet. If the engine simply fetched the documents containing the largest number of occurrences of the words Emacs and Manual, the search results would be useless. Someone might even have put up a page named Emacs Manual that contains the word Emacs some 1000 times, and the engine should not return that page as a result. So the engine must have some capability to analyse the pages and the info in them. A rank is assigned to each page accordingly, and the search results are ordered by it.

The main thing Google considers is the number of links pointing to a particular page for a particular piece of info. The quality of the referring page is also taken into account. Suppose my page is pointed to for some info by 5 outside sites, while another guy X's page is pointed to by Yahoo for the same info. Then more weightage is given to X's page, because Yahoo is a world-renowned site. So the number of pointers alone does not decide the rank; the quality of the referring site matters too. Who is pointing is more important than how many are pointing.

Page rank is just a query-independent measure of the goodness of each page on the Internet.

(I am bad at drawing, sorry!)

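So, if there is a page P which is pointed to by A and B, where A links out to 3 pages and B links out to 4 pages, then the weightage of P will be 1/3 of A + 1/4 of B. Each page splits its own weightage equally among the pages it points to. This is how Google calculates the rank of each page.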
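For the curious, here is a minimal Python sketch of that calculation. It is only my illustration, not Google's actual code. The function names, the example link graph (pages X, Y and Z are made up to give A three links and B four), the damping factor of 0.85 and the fixed number of iterations are all assumptions on my part.

```python
def incoming_weight(page, links, rank):
    """Sum rank[src] / (number of links on src) over every src that links to page."""
    return sum(
        rank[src] / len(targets)
        for src, targets in links.items()
        if page in targets
    )


def pagerank(links, damping=0.85, iterations=50):
    """Repeat the weight-sharing step until the ranks settle (simplified sketch)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base share (assumed damping).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if targets:
                # A page splits its rank equally among the pages it points to.
                for t in targets:
                    new_rank[t] += damping * rank[src] / len(targets)
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[src] / n
        rank = new_rank
    return rank


# Hypothetical graph: A has 3 outgoing links, B has 4, and both point to P.
links = {
    "A": ["P", "X", "Y"],
    "B": ["P", "X", "Y", "Z"],
}
weight_of_p = incoming_weight("P", links, {"A": 1.0, "B": 1.0})
ranks = pagerank(links)
```

With A and B each given a weightage of 1, `weight_of_p` works out to 1/3 + 1/4 = 7/12. In the iterated version, P ends up ranked above Z because two pages point to P and only one points to Z.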

SOFTWARE ARCHITECTURE :

As said previously, Google’s software architecture is very simple and efficient. An overview of the architecture is shown below. Each server shown in the diagram is not a single machine; thousands of machines are present in each block to handle all the traffic.

In the Index Servers, words are mapped to documents according to the rank assigned. In the Document Servers, almost all web pages are cached. Multiple copies are kept so that the high traffic can be handled; with only a single copy, many queries could not be looked up at the same time. In the figure above, word1 and word2 are searched. Word1 is in page1, page3 and page4. Word2 is in page2 and page4. As the user has searched for both words, ANDing the results for word1 and word2 yields page4 as the first result. Ads are also brought in from the Ad Servers according to the words searched. To get from the index to the documents, Google doesn’t use any lookup tables; it simply uses some mathematical modulo operations.
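The lookup described above can be sketched in a few lines of Python. This is just a toy model under my own assumptions: the `index` dictionary, the function names `search_all` and `pick_server`, the numeric document ids and the server count are all hypothetical, and the results are sorted by name here instead of by rank.

```python
# Toy index: each word maps to the pages that contain it.
index = {
    "word1": ["page1", "page3", "page4"],
    "word2": ["page2", "page4"],
}


def search_all(index, words):
    """Return the pages that contain every word (the AND of the page lists)."""
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


def pick_server(doc_id, num_servers):
    """Pick a document server with a modulo operation, without a lookup table."""
    return doc_id % num_servers


matches = search_all(index, ["word1", "word2"])
server = pick_server(10, 4)
```

Here `matches` contains only page4, the one page that has both words, and document 10 lands on server 2 out of 4 servers.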

HARDWARE ARCHITECTURE :

Hundreds and hundreds of such racks are present. TCP/UDP is mostly used for communication between the machines. So the real challenge for Google is the upgrading and maintenance of these systems. As the web grows, the index becomes larger, more replicas of the data are needed, and more document servers are required to cache the pages. In spite of all this, Google has very good throughput and efficiency, which it achieves using very cheap computers!

GOOGLE WEB SEARCH FEATURES :

Calculator: Google web search has a built-in calculator function. Try giving it some expressions, and don’t be astonished. Examples: 4^4, 10*10, sqrt(2).

Definition : To get the definition of a word, search for define followed by the word you want.

Froogle: Use Froogle for product-based search.

I’m Feeling Lucky Button : The "I'm Feeling Lucky™" button takes you directly to the first web page Google returned for your query. Have you ever tried this while searching?

Site Search : To restrict your search to a particular site, use the keyword site. For example, to find admission information on Stanford’s site, use Admission site:www.stanford.edu

Who links to you : To find the pages that point to a particular site use link:www.google.com

Google Compute : If you want to help out in some way, you can. You can donate your computer’s idle time to scientific research! Visit http://toolbar.google.com/dc/offerdc.html

There are lots more Google services like these, such as Blogger, the Picasa photo organiser, Desktop Search and the Translate tool.
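If you want to use the site and link keywords from your own scripts, the query can be built as an ordinary search URL. Below is a small Python sketch of that idea. The function name `google_query_url` and its parameters are hypothetical names I made up, and I am assuming the usual https://www.google.com/search?q=... address format.

```python
from urllib.parse import urlencode


def google_query_url(terms, site=None, link=None):
    """Build a search URL, optionally adding the site: or link: keyword."""
    parts = [terms] if terms else []
    if site:
        parts.append("site:" + site)  # restrict results to one site
    if link:
        parts.append("link:" + link)  # pages that point to this site
    # urlencode escapes spaces and colons so the query is safe in a URL.
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


admission_url = google_query_url("Admission", site="www.stanford.edu")
backlinks_url = google_query_url("", link="www.google.com")
```

Opening `admission_url` in a browser should run the same search as typing Admission site:www.stanford.edu into the search box.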

And don’t forget the Google Web API (built on the SOAP and WSDL standards), which is available as a free download. With this API, you can query more than 4 billion web pages directly from your own computer programs. If you are curious, try it out. To start programming, you first have to register and get a license key.

Happy Hacking with Google !

Copyright (c)  2004  Sriram.K
 Permission is granted to copy, distribute and/or modify this document
 under the terms of the GNU Free Documentation License, Version 1.2
 or any later version published by the Free Software Foundation;

Labels: Techie Talk

Friday, September 17, 2004
