Wednesday, May 11, 2011

Sphinx - An open source framework for search

Perhaps you have read my introduction to full-text search, or an article somewhere else, and decided to use full-text search in your next project, but you are still unsure which full-text search engine to use. One such engine is Sphinx, and this post is a short course on installing it.
Sphinx is a full-text search engine distributed under GPL version 2. It is fast at both searching and indexing your data. Currently, the Sphinx API has bindings for PHP, Python, Perl, Ruby, and Java.
Sphinx features

high indexing speed (up to 10 MB/sec on modern CPUs);
high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
provides distributed searching capabilities;
provides document excerpts generation;
provides searching from within MySQL through pluggable storage engine;
supports boolean, phrase, and word proximity queries;
supports multiple full-text fields per document (up to 32 by default);
supports multiple additional attributes per document (e.g. groups, timestamps, etc.);
supports stopwords;
supports both single-byte encodings and UTF-8;
supports English stemming, Russian stemming, and Soundex for morphology;
supports MySQL natively (MyISAM and InnoDB tables are both supported);
supports PostgreSQL natively.
So fire up your terminal or console, and let’s get things done.
Installing sphinxsearch

Download Sphinx from sphinxsearch.com. This tutorial uses Sphinx 0.9.8.1.
$wget http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
Extract the archive:
$tar -xzvf sphinx-0.9.8.1.tar.gz
Sphinx needs the MySQL client development files. On Ubuntu Linux, install them with:
$sudo apt-get install libmysqlclient15-dev
Then build and install Sphinx on your system:
$cd sphinx-0.9.8.1/
$./configure
$make
$sudo make install
Note: if you want to use Sphinx with PostgreSQL, run configure with the --with-pgsql argument:
$./configure --with-pgsql
Test your installation
$search
You should see output like this in your terminal:
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

Usage: search [OPTIONS] <word1> [word2 [word3 [...]]]

Options are:
-c, --config <file> use given config file instead of defaults
-i, --index <index> search given index only (default: all indexes)
-a, --any match any query word (default: match all words)
-b, --boolean match in boolean mode
-p, --phrase match exact phrase
-e, --extended match in extended mode
-f, --filter <attr> <v> only match if attribute attr value is v
-s, --sortby <clause> sort matches by 'CLAUSE' in sort_extended mode
-S, --sortexpr <expr> sort matches by 'EXPR' DESC in sort_expr mode
-o, --offset <offset> print matches starting from this offset (default: 0)
-l, --limit <count> print this many matches (default: 20)
-q, --noinfo dont print document info from SQL database
-g, --group <attr> group by attribute named attr
-gs,--groupsort <expr> sort groups by <expr>
--sort=date sort by date, descending
--rsort=date sort by date, ascending
--sort=ts sort by time segments
--stdin read query from stdin

This program (CLI search) is for testing and debugging purposes only;
it is NOT intended for production use.
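Besides running search directly, a short loop can confirm that the tools used in the rest of this tutorial (indexer, search, and searchd) are on your PATH. This is only a sketch: the function name check_sphinx_tools is made up for illustration, and a match only shows that a command with that name exists, not that it is the Sphinx binary.

```shell
#!/bin/sh
# Report whether each Sphinx command-line tool is available on PATH.
# check_sphinx_tools is a hypothetical helper name, not part of Sphinx.
check_sphinx_tools() {
    for tool in indexer search searchd; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found"
        fi
    done
}

check_sphinx_tools
```

If a tool is reported as not found right after installation, the install prefix is probably not on your PATH.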
Well done. You have Sphinx at your service. But before you can play with the full-text search engine you have just installed, you have to understand how Sphinx works.
Sphinx installs four programs in your environment, but most of the time we will only use indexer, search, and searchd. To begin, we have to create an index for our data source. Create a file named sphinx.conf; here is a sample of what it looks like.
source book
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = root
sql_db = library
sql_port = 3306 # optional, default is 3306
sql_query = SELECT id, title, summary, author FROM library_book
sql_query_info = SELECT * FROM library_book WHERE id=$id
}

index book
{
source = book
path = data/book
docinfo = extern
charset_type = sbcs
}

indexer
{
mem_limit = 32M
}

searchd
{
port = 3312
log = log/searchd.log
query_log = log/query.log
read_timeout = 5
max_children = 30
pid_file = log/searchd.pid
max_matches = 1000
}
For more information about Sphinx configuration, see the Sphinx documentation.
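Before indexing, it can help to confirm that the file really declares the four sections shown in the sample. The following is a rough sketch rather than a real config parser: the function name check_conf is hypothetical, and it assumes each section name sits alone on its own line, as in the sample above.

```shell
#!/bin/sh
# Check that a config file declares the four sections used in this tutorial.
# Assumption: each section name is on its own line (brace on the next line).
check_conf() {
    conf=$1
    rc=0
    for section in "source book" "index book" "indexer" "searchd"; do
        if grep -Eq "^[[:space:]]*${section}[[:space:]]*\$" "$conf"; then
            echo "ok: $section"
        else
            echo "missing: $section"
            rc=1
        fi
    done
    # Non-zero exit status when at least one section is missing.
    return "$rc"
}
```

Call it as check_conf sphinx.conf from the directory that holds the file. It does not validate the settings inside each section.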
Create a log folder for the searchd log files and another folder named data for the index data. Then run indexer to index the database.
$mkdir log
$mkdir data
$indexer --all
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file './sphinx.conf'...
indexing index 'book'...
collected 12 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 12 docs, 10319 bytes
total 0.018 sec, 571436.48 bytes/sec, 664.53 docs/sec
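The directory setup and indexing steps above can be wrapped in a small function that is safe to rerun. This is a minimal sketch, assuming sphinx.conf lives in the current directory; the function name prepare_and_index is hypothetical, while --config and --all are the indexer options for choosing a config file and indexing every index.

```shell
#!/bin/sh
# Create the directories referenced by sphinx.conf, then build all indexes.
# prepare_and_index is a hypothetical helper name, not part of Sphinx.
prepare_and_index() {
    conf=${1:-sphinx.conf}

    # mkdir -p does not fail when the directories already exist.
    mkdir -p log data || return 1

    if command -v indexer >/dev/null 2>&1; then
        indexer --config "$conf" --all
    else
        echo "indexer not found on PATH; skipping indexing" >&2
    fi
}
```

Run prepare_and_index sphinx.conf from the project directory. Because the paths in the sample config are relative, the working directory matters.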
You can use the search program to test the index you have just created. Assuming your database has a book whose title contains PHP, running search PHP will give you some results.
$search PHP
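The usage text printed by search lists options that narrow a query. Here is a sketch of two combinations that use only flags from that usage text. The run helper, the DRY_RUN variable, and the query words "PHP programming" are assumptions made for illustration; with DRY_RUN set, the commands are only printed, so the script can be tried without an index.

```shell
#!/bin/sh
# Dry-run helper: print each command, and execute it only when DRY_RUN is empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

DRY_RUN=1   # set to an empty value to actually execute the commands

# Search only the 'book' index and print at most 5 matches.
run search --config sphinx.conf --index book --limit 5 PHP

# Match the query words as an exact phrase instead of individual words.
run search --config sphinx.conf --index book --phrase PHP programming
```

Keep in mind that the search program itself states it is for testing and debugging only, so these commands suit quick checks rather than production use.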
Done. I really recommend that you play with the Sphinx configuration and tweak the settings to suit your needs.


With Regards,
Er. Animesh Nanda
Software Developer,
Manusis Technology Pvt. Ltd.
Bengaluru, Karnataka, INDIA.