Larbin
From the Huihoo Open Encyclopedia
Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).
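Installation guide
wget http://download.huihoo.com/larbin/larbin-2.6.3.tar.gz
or
wget http://nchc.dl.sourceforge.net/sourceforge/larbin/larbin-2.6.3.tar.gz
tar zxvf larbin-2.6.3.tar.gz
cd larbin-2.6.3
./configure
make // if compilation fails, edit ./adns/internal.h and comment out lines 568-571; the build should then succeed
./larbin
While larbin is running, open http://localhost:8081 in a browser to watch it; press Ctrl+C to stop it.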
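The edit to ./adns/internal.h mentioned above can also be scripted. The following is a minimal sketch, not part of larbin itself: comment_out_lines is a hypothetical helper name, and the line range 568-571 is taken from the note above for larbin-2.6.3, so check that those lines are the ones the compiler complains about before running it on another version.

```shell
#!/bin/sh
# Hypothetical helper: comment out a range of lines in a C/C++ header
# by prefixing each of them with "//".
# Usage: comment_out_lines FILE FIRST LAST
comment_out_lines() {
    file=$1 first=$2 last=$3
    # Keep a backup copy (FILE.bak) so the change is easy to undo.
    cp "$file" "$file.bak" &&
    sed "${first},${last}s|^|//|" "$file.bak" > "$file"
}

# From inside the larbin-2.6.3 directory one would call:
#   comment_out_lines adns/internal.h 568 571
# and then run make again.
```

If the build still fails, copying adns/internal.h.bak back over adns/internal.h restores the original header.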
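Customization
- In options.h, comment out this line: #define DEFAULT_OUTPUT // do nothing..., then uncomment the line for the output module you want (the options are explained below) and recompile.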
The first thing you can define is the module you want to use for output. This defines what you want to do with the pages larbin gets. Here are the different options:
- DEFAULT_OUTPUT : This module mainly does nothing, except statistics.
- SIMPLE_SAVE : This module saves pages on disk. It stores 2000 files per directory (with an index).
- MIRROR_SAVE : This module saves pages on disk with the hierarchy of the site they come from. It uses one directory per site.
- STATS_OUTPUT : This module makes some stats on the pages. To see the results, open http://localhost:8081/output.html.
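Switching the output module amounts to moving a "//" in options.h, which can be sketched as a small shell function. This is an illustration only: select_output_module is a hypothetical name, and it assumes that the unused modules appear in options.h as commented-out lines of the form //#define SIMPLE_SAVE, which should be checked against your copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: disable DEFAULT_OUTPUT and enable one other module.
# Assumes the other modules are present as "//#define NAME" lines.
# Usage: select_output_module FILE MODULE
select_output_module() {
    file=$1 module=$2
    # Keep a backup copy (FILE.bak) so the change is easy to undo.
    cp "$file" "$file.bak" &&
    sed -e 's|^#define DEFAULT_OUTPUT|//#define DEFAULT_OUTPUT|' \
        -e "s|^//[[:space:]]*#define ${module}|#define ${module}|" \
        "$file.bak" > "$file"
}

# From inside the larbin-2.6.3 directory one would call, for example:
#   select_output_module options.h SIMPLE_SAVE
# and then recompile with make.
```

Only one output module should be active at a time, so the function enables exactly the one named and leaves the remaining lines commented out.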
- Edit the larbin.conf file
startUrl http://slashdot.org/ is the default setting, so larbin starts crawling from that site; change it to whichever site you want to crawl.
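Changing the start URL can be scripted as well. The sketch below assumes that larbin.conf holds the setting as a line whose first word is startUrl followed by the URL; set_start_url is a hypothetical helper name and http://example.org/ is a placeholder, neither comes from larbin.

```shell
#!/bin/sh
# Hypothetical helper: replace the first startUrl line in a larbin.conf.
# Usage: set_start_url FILE URL
set_start_url() {
    file=$1 url=$2
    # Keep a backup copy (FILE.bak) so the change is easy to undo.
    cp "$file" "$file.bak" &&
    awk -v url="$url" '
        # Rewrite only the first active startUrl line; copy the rest as-is.
        $1 == "startUrl" && !done { print "startUrl " url; done = 1; next }
        { print }
    ' "$file.bak" > "$file"
}

# From inside the larbin-2.6.3 directory one would call, for example:
#   set_start_url larbin.conf http://example.org/
```

Lines that are commented out are left alone because their first word is not startUrl.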