I will do search engine, Values of stupidity.

Posted by aggarat@sanook.com, 23 October 2008

Build a Search Engine with 10 Open Source Software Projects

Developing a large software system is always about standing on the shoulders of giants and dodging the proverbial reinvention of the wheel. By using the following open source projects in our search engine effort, we were able to do both. This is the first in a series of technical articles about the Cruxlux architecture in which we will explore how we use exceptional free technologies. Cruxlux is not a whole-web search engine like Google or MSN, but many of the development hurdles are common across any type of search. Without further ado, and in no particular order of importance, here is the list.

Operating System: Linux

In choosing a foundation OS, there really wasn’t a choice for us. Linux provides a solid architecture, and a lot of open source mindshare goes into constantly improving it. It is comfortable to develop on, with powerful tools such as wmii, vim, gcc, and valgrind as well as simple-to-use package managers such as Synaptic. As an added bonus, there is a wide variety of heavily tested images for Amazon EC2. If you are wondering which distros, we use a mix of Ubuntu and Gentoo.

Database: MySQL

Every web service needs some kind of datastore, so we went with our favorite open source database: MySQL. It has widely tested client libraries in many languages, and a lot of exciting features and development going into it. Guha had a stellar experience with it in Folding@Home, which generated terabytes of data. We are especially interested in tracking Drizzle, given that it will be specifically tailored toward high levels of concurrency and cloud computing. MySQL is used to store metadata that our backend leverages, our user data, as well as posts in our debate infrastructure.

Optional: PostgreSQL


HTTP Client: Curl

Every information extraction system needs a powerful way to grab data from the net. Curl has stood the test of time as the best HTTP networking client library out there. It provides us with very high performance, highly concurrent crawls that can easily fill our bandwidth pipe with fresh content. Its threading support is very clean, and its event-based support is something that may yield ludicrous speed. We are looking forward to exploring that more.
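
To make the idea of a concurrent crawl concrete, here is a minimal sketch. It is not libcurl and not the Cruxlux crawler: it uses Python's standard library as a stand-in, and the names `fetch`, `crawl`, `fetcher`, and `max_workers` are made up for illustration. It only shows the shape of the technique: hand a list of URLs to a pool of workers and collect each result, or failure, as it finishes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes (stdlib stand-in for libcurl)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetcher=fetch, max_workers=8):
    """Fetch many URLs concurrently; return {url: body, or the exception raised}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so the workers stay busy.
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad host should not stop the crawl
                results[url] = exc
    return results
```

Passing the fetch function in as a parameter keeps the pool logic separate from the networking, so a different client could be swapped in without touching `crawl`.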

Optional: Lynx

General Purpose Library: Boost

Boost is frankly amazing. The library is well thought out and the API usage is consistent throughout, so you don’t have to make a mental context switch every time you use a different Boost tool. Inside of Boost alone, we use Build, Date Time, Filesystem, Math, Pool, Regex, Serialization, Smart Ptr, String, Test, and, last but not least, bjam to build the whole shebang.

Optional: Some Python framework for text processing.


Networking Services: Libevent

Event-based programming has intrigued everyone with its scalability as well as how it allows developers to achieve concurrency while thinking in a single-threaded mindset. See the C10K problem. We utilize libevent in our heavily service-oriented architecture. Once you go non-blocking, you never go back.
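
For readers who have not seen the event-based style, here is a rough sketch of it. It does not use libevent; Python's standard `selectors` module stands in for it, and `make_echo_handler` and `run_once` are hypothetical names. One thread asks the operating system which sockets are ready and then runs the callback registered for each one, so nothing ever blocks waiting on a single slow connection.

```python
import selectors
import socket


def make_echo_handler(sel):
    """Build a callback that echoes whatever arrives on a ready socket."""
    def on_readable(conn):
        data = conn.recv(4096)
        if data:
            conn.sendall(data)   # fine for a sketch; a real server would buffer writes
        else:                    # an empty read means the peer closed the connection
            sel.unregister(conn)
            conn.close()
    return on_readable


def run_once(sel, timeout=1.0):
    """One turn of the event loop: wait for ready sockets, then dispatch callbacks."""
    events = sel.select(timeout)
    for key, _mask in events:
        key.data(key.fileobj)    # key.data holds the callback registered for this socket
    return len(events)
```

A real service would call `run_once` in a loop and register a listening socket whose callback accepts new connections; the dispatch pattern stays the same.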

Hash Table: Google Sparse/Dense Hash

Don’t leave home without your trusty hash table. Thankfully, Google released some of its well guarded secrets out to the world, because this library is really imperative to anyone wanting to deal with a lot of data in memory. We use this not only to cache certain data used by our web server, but also during calculation phases by our backend. Couple it with Paul Hsieh’s speedy string hash, and you have an elegant way to quickly address a great many websites on a single machine. If you know of any other hash functions, please let us know in the comments. Murmur Hash looks fun to play with, and we will explore it in a later article when we look into hash functions in the domain of URLs.
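
As a rough illustration of pairing a string hash with an in-memory table, here is a small sketch. Two loud assumptions: the hash is 64-bit FNV-1a, chosen only because it is short and well known (it is not Paul Hsieh's hash), and a plain Python dict stands in for Google's sparse/dense hash. The `UrlTable` class is hypothetical.

```python
FNV_OFFSET_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(text):
    """64-bit FNV-1a hash of a string (a stand-in, not Paul Hsieh's hash)."""
    h = FNV_OFFSET_64
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64  # keep the value within 64 bits
    return h


class UrlTable:
    """Store per-URL data keyed by a 64-bit hash instead of the full URL string.

    Assumption: collisions are rare enough to ignore in this sketch. Two URLs
    with the same hash would overwrite each other.
    """

    def __init__(self):
        self._rows = {}

    def put(self, url, value):
        self._rows[fnv1a_64(url)] = value

    def get(self, url, default=None):
        return self._rows.get(fnv1a_64(url), default)

    def __len__(self):
        return len(self._rows)
```

Keying on a fixed-size integer rather than the URL text is what keeps the per-entry memory cost small and predictable, at the price of having to decide what to do about collisions.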

Indexing Engine: Sphinx

Once you get data into a system, you gotta have a way to get it out again! We tried a lot of different indexing systems, including CLucene, Solr, and MySQL full-text search, but Sphinx won out because of the speed of indexing, powerful delta indexing, and a lightweight, scalable server. Each has its own strengths, but Sphinx fit our bill the best. Whatever you are looking for always seems to be at your fingertips in the documentation, and the community is top notch.
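
Delta indexing is easier to see in a toy example. The sketch below is not Sphinx and uses none of its API; `build_index` and `MainPlusDelta` are invented names. It only shows the general idea: keep a large main index that is rebuilt rarely, a small delta index that is rebuilt whenever new documents arrive, search both, and fold the delta into the main index from time to time.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index {term: set of doc ids} from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


class MainPlusDelta:
    """Toy main+delta scheme. It ignores updates to documents already in main."""

    def __init__(self, docs):
        self.main_docs = dict(docs)
        self.delta_docs = {}
        self.main = build_index(self.main_docs)  # large, rebuilt rarely
        self.delta = build_index({})             # small, rebuilt often

    def add(self, doc_id, text):
        """Index a new document by rebuilding only the small delta index."""
        self.delta_docs[doc_id] = text
        self.delta = build_index(self.delta_docs)

    def search(self, term):
        """Query both indexes and combine the matching document ids."""
        term = term.lower()
        return self.main.get(term, set()) | self.delta.get(term, set())

    def merge(self):
        """Fold the delta into the main index and start a fresh delta."""
        self.main_docs.update(self.delta_docs)
        self.main = build_index(self.main_docs)
        self.delta_docs = {}
        self.delta = build_index({})
```

The payoff is that adding a document costs a rebuild of the small delta only, while the expensive full rebuild happens on whatever schedule suits the data.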

Optional: Lucene (Java, but I hate Java. I think it’s the slowest compiled language.)

Web Server: Nginx

To continue the theme of using fast, lightweight, Russian open source projects, we went with nginx for our web server. Its solid proxying abilities let us use different app servers on the mid tier for various tasks, whether it be Mongrel, Merb, or a custom C search server.

Optional: Lighttpd+Varnish, Apache+Varnish

Web Framework: Ruby/Rails

Ruby has a vast number of libraries and has been a very powerful tool for prototyping a lot of the algorithms and search features we research before porting them to C/C++. Ruby golf has become one of our hobbies. We use Rails throughout our webapp to provide a lot of the structure and additional features around search.

Optional: PHP, Python

JavaScript Libraries: jQuery, Prototype, Scriptaculous

Core to our design is providing quick access to massive amounts of data, then using the power of modern clients to process and filter that data inside the browser. It at least gives us a chance of scaling. JavaScript, extended via jQuery, Prototype, and Scriptaculous, gives us the tools necessary to create a unique interface, which will get a nice face lift over the next few weeks. jQuery plugins in our posse include sparkline, Cycle Lite, and marquee, for example.

So that’s the short list, with plenty of other open source projects sprinkled in there that we will address in future articles. Is there a project we should be using in this mix? Feel free to let us know in the comments. We’ll be doing a series of posts in the coming weeks that focus on how we use these different projects in our own service, and we hope you find them useful in your own pursuits.

Categories: Google-Killer