Sphider


Sphider - a lightweight search engine in PHP

Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its back-end database. It is suitable for adding search functionality to small or medium sites (up to around 100,000 pages). It also works well as a site analysis tool, for example for finding broken links or gathering statistics about a site.

Sphider is licensed under the GNU General Public License.


Features

Spidering and indexing

  • Full text indexing.
  • Can index both static and dynamic pages.
  • Finds links in <a href=...>, <frame ...>, <area ...> and <meta ...> tags, and can also follow links given as strings in JavaScript via window.location and window.open.
  • Respects robots.txt protocol.
  • Follows server side redirections.
  • Allows spidering to be limited by depth (i.e. the maximum number of clicks from the starting page), by (sub)domain or by directory.
  • Supports indexing of PDF and DOC files (using external binaries for file conversion).
  • Allows resuming paused spidering.
  • Possibility to exclude common words from being indexed.
  • Sophisticated administrator interface.
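
The depth and domain limits listed above can be illustrated with a small breadth-first walk. This is only a sketch in Python, not Sphider's PHP code: the function name crawl, the in-memory SITE link map and the stay_on_host switch are all hypothetical, and a real spider would fetch pages rather than read links from a dictionary.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, links, max_depth, stay_on_host=True):
    """Breadth-first walk of `links` (a dict: url -> list of linked urls).

    Returns {url: depth} for every page reachable from `start` within
    `max_depth` clicks. Hypothetical helper, not part of Sphider.
    """
    start_host = urlparse(start).netloc
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # page is at the depth limit: do not follow its links
        for target in links.get(url, []):
            if target in depth_of:
                continue  # already seen
            if stay_on_host and urlparse(target).netloc != start_host:
                continue  # would leave the initial (sub)domain
            depth_of[target] = depth_of[url] + 1
            queue.append(target)
    return depth_of


# Toy link map standing in for fetched pages (assumed data).
SITE = {
    "http://www.domain.com/": ["http://www.domain.com/a.html", "http://other.example/"],
    "http://www.domain.com/a.html": ["http://www.domain.com/b.html"],
    "http://www.domain.com/b.html": ["http://www.domain.com/c.html"],
}
reached = crawl("http://www.domain.com/", SITE, max_depth=2)
```

With max_depth=2 the walk stops at b.html, so c.html (three clicks away) and the external host are never visited.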

Searching

  • Supports AND, OR and phrase searches.
  • Supports excluding words (by putting a '-' in front of a word, any page including the word will be omitted from the results).
  • Option to add and group sites into categories.
  • Possibility to limit searching to a given category and its subcategories.
  • "Did you mean" search suggestion on mistyped queries.
  • Context-sensitive auto-completion on search terms (à la Google Suggest).
  • Word stemming for English (searching for "run" finds "running", "runs" etc.).
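
To make the query syntax above concrete, here is a rough Python sketch of how an AND query with quoted phrases and '-' exclusions could be interpreted. It does not reflect how Sphider evaluates queries against its MySQL index; parse_query and matches are hypothetical names, and OR searches, stemming and suggestions are left out.

```python
import re

# A quoted phrase, or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a query into (required words, excluded words, phrases)."""
    include, exclude, phrases = [], [], []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            phrases.append(phrase.lower())
        elif word.startswith("-") and len(word) > 1:
            exclude.append(word[1:].lower())  # '-word' omits matching pages
        else:
            include.append(word.lower())
    return include, exclude, phrases


def matches(text, query):
    """True if `text` has every required word and phrase and no excluded word."""
    include, exclude, phrases = parse_query(query)
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))
    return (all(w in words for w in include)
            and not any(w in words for w in exclude)
            and all(p in lowered for p in phrases))
```

For example, the query "web spider" php -java matches a page containing "A web spider written in PHP" but not one that also mentions Java.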

Size and speed

Sphider uses regular expressions to extract links from web pages, so indexing is not particularly fast. Searching is quite fast as long as the database is of a reasonable size. The code base is very small, probably making it the smallest search engine with this range of functionality.
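
As an illustration of regex-based link extraction, the following Python sketch pulls href values out of <a> tags and resolves them against the page URL. The pattern is a simplified assumption rather than the expression Sphider uses, and it ignores <frame>, <area>, <meta> and JavaScript links; HREF and extract_links are hypothetical names.

```python
import re
from urllib.parse import urljoin

# Simplified pattern: an <a ...> tag with a quoted or unquoted href value.
HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in `html`."""
    return [urljoin(base_url, href) for href in HREF.findall(html)]


PAGE = '<p><a href="about.html">About</a> <A class="x" HREF=\'/docs/\'>Docs</A></p>'
links = extract_links(PAGE, "http://www.domain.com/test.html")
```

Scanning every page with patterns like this is simple to implement, but it is easily confused by unusual markup, which is one reason dedicated HTML parsers are often preferred.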

Installation

1. Unpack the files, and copy them to the server, for example to /home/youruser/public_html/sphider (later referred to as [path_of_sphider])

2. On the server, create a database in MySQL to hold Sphider data.

a) At the command prompt, log into MySQL by typing: mysql -u <your username> -p
Enter your password when prompted.

b) in MySQL, type: CREATE DATABASE sphider_db;

You can of course use a different name for the database instead of sphider_db.

c) Use exit to exit MySQL. For more information on how to create a database and give/get the necessary permissions, check MySQL.com

3. In the settings directory, edit the database.php file and change $database, $mysql_user, $mysql_password and $mysql_host to the correct values (if you don't know what $mysql_host should be, it should probably stay as it is: 'localhost').

4. Open the install.php script (in the admin directory) in your browser; it will create the tables necessary for Sphider to operate.

Alternatively, the tables can be created by hand using the tables.sql script in the sql directory of the Sphider distribution. At the prompt, type mysql -u <your username> -p sphider_db < [path_of_sphider]/sql/tables.sql

5. In the admin directory, edit auth.php to change the administrator user name and password (the default values are 'admin' and 'admin').

6. Open admin/admin.php in your browser and start indexing.

7. search.php is the default search page.

Command Line

php spider.php <options>

where <options> are

-all            Reindex everything in the database
-u <url>        Set the URL to index
-f              Set indexing depth to full (unlimited depth)
-d <num>        Set indexing depth to <num>
-l              Allow spider to leave the initial domain
-r              Set spider to reindex a site
-m <string>     Set the string(s) that a URL must include (use \n as a delimiter between multiple strings)
-n <string>     Set the string(s) that a URL must not include (use \n as a delimiter between multiple strings)

For example, for spidering and indexing http://www.domain.com/test.html to depth 2, use

php spider.php -u http://www.domain.com/test.html -d 2

If you want to reindex the same url, use

php spider.php -u http://www.domain.com/test.html -r
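
When the spider is launched from a script (for example a scheduled job), the argument list can be assembled programmatically. The sketch below is a hypothetical Python helper, not something shipped with Sphider; it uses only the flags documented above, leaves out -m and -n, and assumes that php is on the PATH and that the command is run from the directory containing spider.php.

```python
import shlex


def spider_command(url=None, depth=None, full=False,
                   leave_domain=False, reindex=False, reindex_all=False):
    """Build the argument list for `php spider.php` (hypothetical helper)."""
    args = ["php", "spider.php"]
    if reindex_all:
        return args + ["-all"]          # reindex everything in the database
    if url:
        args += ["-u", url]             # URL to index
    if full:
        args.append("-f")               # unlimited depth
    elif depth is not None:
        args += ["-d", str(depth)]      # limited depth
    if leave_domain:
        args.append("-l")               # allow leaving the initial domain
    if reindex:
        args.append("-r")               # reindex the site
    return args


cmd = spider_command(url="http://www.domain.com/test.html", depth=2)
printable = shlex.join(cmd)  # shell-quoted form, requires Python 3.8+
```

The resulting list can be passed to subprocess.run(cmd, check=True); keeping the arguments as a list avoids shell quoting problems with unusual URLs.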

http://www.cs.ioc.ee/~ando/sphider/
