Larbin

Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).

Installation Guide

wget http://download.huihoo.com/larbin/larbin-2.6.3.tar.gz or
wget http://nchc.dl.sourceforge.net/sourceforge/larbin/larbin-2.6.3.tar.gz 
tar zxvf larbin-2.6.3.tar.gz
cd larbin-2.6.3
./configure
make   # if compilation fails: edit ./adns/internal.h and comment out lines 568-571; the build then succeeds
./larbin                 # start the crawler
Ctrl+C                   # stop the crawler
http://localhost:8081    # open in a browser while larbin is running
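
The header fix mentioned next to the make step can be scripted instead of done by hand. The following is only a sketch: comment_out_lines is a hypothetical helper, not part of larbin, and it assumes that prefixing each line in the range with "//" is enough to disable it (no multi-line comments or macro continuations in that range). It is demonstrated on a scratch file so that it runs anywhere.

```shell
# Hypothetical helper: comment out an inclusive line range with "//".
# A backup of the original file is kept as <file>.bak.
comment_out_lines() {
    file=$1 first=$2 last=$3
    cp "$file" "$file.bak"
    sed "${first},${last}s|^|// |" "$file.bak" > "$file"
}

# Demonstration on a scratch file, so no real source file is touched.
demo=$(mktemp)
printf 'line1\nline2\nline3\nline4\n' > "$demo"
comment_out_lines "$demo" 2 3
cat "$demo"
```

Inside the larbin-2.6.3 directory the equivalent call would be comment_out_lines ./adns/internal.h 568 571, followed by make.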

Customization

In options.h, comment out the line "#define DEFAULT_OUTPUT // do nothing...", enable the line for the output module you want instead (each option is described below), and recompile.

The first thing you can define is the module you want to use for output. This defines what you want to do with the pages larbin gets. Here are the different options:

DEFAULT_OUTPUT : This module mainly does nothing, except statistics.
SIMPLE_SAVE : This module saves pages on disk. It stores 2000 files per directory (with an index).
MIRROR_SAVE : This module saves pages on disk with the hierarchy of the site they come from. It uses one directory per site.
STATS_OUTPUT : This module makes some stats on the pages. To see the results, open http://localhost:8081/output.html
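
Switching the output module can also be done with a small script. This is a sketch under assumptions: select_output is a hypothetical helper, and it assumes each module appears in options.h on its own line, either active ("#define NAME") or disabled ("//#define NAME"). The sample file below is a stand-in written for the demonstration; only its DEFAULT_OUTPUT line is taken from the text above, so check the real options.h before relying on the patterns.

```shell
# Hypothetical helper: leave exactly one output module active.
# A backup of the original file is kept as <file>.bak.
select_output() {
    file=$1 wanted=$2
    cp "$file" "$file.bak"
    for name in DEFAULT_OUTPUT SIMPLE_SAVE MIRROR_SAVE STATS_OUTPUT; do
        if [ "$name" = "$wanted" ]; then
            # Strip a leading "//" so this define takes effect.
            sed "s@^[[:space:]]*//[[:space:]]*\(#define[[:space:]]\{1,\}$name\)@\1@" "$file" > "$file.tmp"
        else
            # Prefix an active define with "//" to disable it.
            sed "s@^[[:space:]]*\(#define[[:space:]]\{1,\}$name\)@//\1@" "$file" > "$file.tmp"
        fi
        mv "$file.tmp" "$file"
    done
}

# Demonstration on a stand-in file (assumed layout, not the real options.h).
demo=$(mktemp)
cat > "$demo" <<'EOF'
#define DEFAULT_OUTPUT // do nothing...
//#define SIMPLE_SAVE
//#define MIRROR_SAVE
//#define STATS_OUTPUT
EOF
select_output "$demo" SIMPLE_SAVE
cat "$demo"
```

In the source tree the call would be select_output options.h SIMPLE_SAVE (or another module name), followed by a rebuild with make.
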
Edit the larbin.conf file

startUrl http://slashdot.org/ This is the site crawled by default; you can set it to whatever site you want to crawl.
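
Changing the start URL can be scripted in the same spirit. This is a sketch only: set_start_url is a hypothetical helper, http://example.org/ is a placeholder, and the code assumes the setting is written as "startUrl" followed by whitespace and the URL, as in the line above. It rewrites every startUrl line it finds, and the URL must not contain "@" or "&" because of how sed treats those characters here.

```shell
# Hypothetical helper: point larbin at a different start URL.
# A backup of the original file is kept as <file>.bak.
set_start_url() {
    file=$1 url=$2
    cp "$file" "$file.bak"
    sed "s@^startUrl[[:space:]].*@startUrl $url@" "$file.bak" > "$file"
}

# Demonstration on a scratch file containing only the default setting.
demo=$(mktemp)
printf 'startUrl http://slashdot.org/\n' > "$demo"
set_start_url "$demo" http://example.org/
cat "$demo"
```

Against the real configuration the call would be set_start_url larbin.conf followed by the URL to crawl.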
