Thursday, February 28, 2008

Web Crawler Research

I've recently been doing a lot of research on building a web crawler, and it's fairly interesting. Basically, a web crawler is a program that makes a map of the web. You start out with a single URL, and the application scans that page for all of its links and saves them in a list for later processing. Then the application reads the page itself. It can look at everything (pictures, layout, meta tags, etc.) or just a couple of things, like the text. It then takes that information and stores it, probably on a server.
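To make that concrete, here is a rough sketch of the "scan one page" step in Python, using only the standard library. The names PageScanner and scan_page are my own inventions for illustration, and I'm assuming "just the text" means everything outside of script and style tags. A real crawler would need to cope with far messier HTML than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute URLs found in <a href="...">
        self.text_parts = []   # chunks of visible text
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def scan_page(html, base_url):
    """Return (links, text) for one page of HTML (hypothetical helper)."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.links, " ".join(scanner.text_parts)
```

Calling scan_page(html, url) hands back the list of links to visit later and the text you would store. Where that text ends up (a database, flat files, whatever) is a separate decision I haven't made yet.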

Once it has done everything you want on that page, it moves on to the first link in the list it just created and starts all over: first it builds a list of new links (pages), and then it takes an inventory (also called indexing) of the items on the page.
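The loop described above might look something like the following in Python. This is only a sketch under a few assumptions of mine: crawl, LinkCollector, fetch and max_pages are all made-up names, the "index" is just a dictionary holding the raw HTML, and I've added a set of already-seen URLs so the crawler doesn't go around in circles when pages link back to each other. The fetch argument is any function that takes a URL and returns the page's HTML, or None if the page can't be read.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gathers the absolute URLs of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def crawl(start_url, fetch, max_pages=100):
    """Visit pages first-in, first-out, starting from start_url.

    fetch is a caller-supplied function: URL in, HTML string (or None) out.
    Returns a dict mapping each visited URL to its HTML.
    """
    frontier = deque([start_url])  # the list of links still to visit
    seen = {start_url}             # every URL ever added to the list
    index = {}                     # stand-in for real indexing and storage

    while frontier and len(index) < max_pages:
        url = frontier.popleft()   # take the first link on the list
        html = fetch(url)
        if html is None:
            continue               # unreachable page, move on

        collector = LinkCollector(url)
        collector.feed(html)
        collector.close()

        index[url] = html          # "take an inventory" of the page

        for link in collector.links:
            if link not in seen:   # skip links already queued or visited
                seen.add(link)
                frontier.append(link)

    return index
```

For a quick experiment, fetch can be the get method of a dictionary of fake pages, so nothing touches the network. For real pages it could wrap urllib.request.urlopen. The max_pages cap is there because the list of links grows much faster than the crawler can work through it.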

It's a fairly simple process, but it can get hairy fast. Just think of all the links on Yahoo! or Digg. The list of pages to visit grows very quickly. As a matter of fact, most web crawlers are estimated to cover no more than 16% of the web at any one time. The problem is that the application simply can't run fast enough to see everything out there, because pages are being added and changed too quickly.

Isn't that crazy? I've found that building a web crawler is pretty straightforward, but making one that works efficiently is more of a challenge. The big challenge I'm facing right now is deciding what language to use. Web crawlers can be written in just about anything: PHP, Perl, Python, Java or even C++. What's the best choice? Good question.

I guess the trick is to pick a language I'm reasonably comfortable with and start there. I should probably just try to make one with the understanding that it won't be optimized, but at least I'll have something to improve on.

We'll see if it works out. My personal goal is to have something working by the end of next week. Wish me luck.
