Larbin

Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).

Installation Guide

wget http://download.huihoo.com/larbin/larbin-2.6.3.tar.gz
# or: wget http://nchc.dl.sourceforge.net/sourceforge/larbin/larbin-2.6.3.tar.gz
tar zxvf larbin-2.6.3.tar.gz
cd larbin-2.6.3
./configure
make                    # if the build fails, comment out lines 568-571 of ./adns/internal.h and run make again
./larbin                # start the crawler
Ctrl+C                  # stop the crawler
http://localhost:8081   # open in a browser while larbin is running to watch the crawl status
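
If make stops on errors inside the bundled adns resolver, the fix noted above is to comment out lines 568-571 of ./adns/internal.h. A minimal way to do that from the shell (GNU sed; the line numbers are those of the 2.6.3 tarball):

sed -i '568,571s|^|// |' adns/internal.h
make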

Customization

  • Comment out this line in options.h: #define DEFAULT_OUTPUT // do nothing..., enable whichever of the remaining output modules you want instead (the meaning of each is explained below), and recompile. A sketch of the edit appears at the end of this section.

The first thing you can define is the module you want to use for output. This defines what you want to do with the pages larbin gets. Here are the different options:

  • DEFAULT_OUTPUT : This module mainly does nothing, except statistics.
  • SIMPLE_SAVE : This module saves pages on disk. It stores 2000 files per directory (with an index).
  • MIRROR_SAVE : This module saves pages on disk with the hierarchy of the site they come from. It uses one directory per site.
  • STATS_OUTPUT : This module makes some stats on the pages. To see the results, see http://localhost:8081/output.html.
  • Edit the larbin.conf file.

startUrl http://slashdot.org/ is the default, so larbin crawls slashdot.org out of the box; change it to the site you want to crawl. A minimal larbin.conf sketch follows below.
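
The options.h edit described above, as a sketch (the exact surrounding comments vary by release, so treat the layout as illustrative): keep exactly one output module defined, for example switch from DEFAULT_OUTPUT to SIMPLE_SAVE, then run make again.

// options.h (excerpt) -- enable exactly one output module
//#define DEFAULT_OUTPUT  // do nothing except statistics (the shipped default)
#define SIMPLE_SAVE       // save pages on disk, 2000 files per directory
//#define MIRROR_SAVE     // save pages using each site's directory hierarchy, one directory per site
//#define STATS_OUTPUT    // per-page stats, viewable at http://localhost:8081/output.html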
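
And a minimal larbin.conf sketch to go with it; startUrl comes from the default noted above, while the other field names (From, UserAgent, httpPort) are assumed from a typical 2.6.3 configuration and should be checked against the file shipped in the tarball:

# larbin.conf (excerpt) -- field names other than startUrl are assumed, verify against the shipped file
From      someone@example.com    # contact address announced to the servers you crawl
UserAgent larbin_2.6.3           # name larbin reports to web servers
httpPort  8081                   # port of the status page used above
startUrl  http://slashdot.org/   # replace with the site you want to crawl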
