Web-Harvest

Web-Harvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation, such as XSLT, XQuery and regular expressions. Web-Harvest focuses mainly on HTML/XML-based web sites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.
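
To make the combination of XPath and regular expressions concrete, here is a minimal sketch in plain Java that uses only the standard library. It is not Web-Harvest's own API: the class ExtractionSketch, the helpers selectText and firstNumber, and the sample page are all hypothetical names chosen for illustration, and the input is assumed to be well-formed XML already (in Web-Harvest that cleanup is the job of the html-to-xml processor).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExtractionSketch {

    // Hypothetical helper: evaluates an XPath expression against well-formed XML
    // and returns the trimmed text content of every matching node.
    static List<String> selectText(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    // Hypothetical helper: returns the first decimal number found in the text,
    // or null when the text contains no digits.
    static String firstNumber(String text) {
        Matcher m = Pattern.compile("\\d+(?:\\.\\d+)?").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sample input, assumed to be already cleaned into well-formed XML.
        String page = "<html><body>"
                + "<div class=\"item\"><span class=\"price\">Price: 19.99 EUR</span></div>"
                + "<div class=\"item\"><span class=\"price\">Price: 5 EUR</span></div>"
                + "</body></html>";
        // Step 1: XPath narrows the document down to the interesting nodes.
        // Step 2: a regular expression pulls the value out of each node's text.
        for (String price : selectText(page, "//span[@class='price']")) {
            System.out.println(firstNumber(price));
        }
    }
}
```

The split mirrors the division of labour described above: a structural query language locates nodes, and a regular expression handles the free text inside them.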

Features

  • A graphical user interface has been introduced, providing an environment for easier development and testing of configurations.
  • The html-to-xml processor, which is based on HtmlCleaner, now exposes attributes for controlling the cleaner's behaviour.
  • Besides the BeanShell scripting engine, two others have been added: Groovy and JavaScript. It is now possible to choose a preferred scripting engine or even mix engines in a single Web-Harvest configuration. This option is supported by new attributes on the config, script and template processors.
  • Access to the HTTP client is supported through the new implicit context variable http. It is now possible to check important HTTP response values, such as http.mimeType, http.headers and http.statusCode, or even to obtain the instance of the org.apache.commons.httpclient.HttpClient class with http.client and manipulate it at runtime.
  • A new attribute, cookie-policy, has been added to the http processor; it specifies how HttpClient manages cookies.
  • Command-line use is improved by adding several new parameters.
  • To make Web-Harvest context variables more comfortable to use in the script engines' runtime scopes, several handy methods have been added to the class org.webharvest.runtime.variables.Variable (the interface IVariable in previous versions of Web-Harvest).
  • Several useful methods have been added to the implicit Web-Harvest context variable sys, such as sys.xpath(expression, xml), sys.isVariableDefined(varname) and sys.defineVariable(varName, varValue, [overwrite]).
  • An overwrite attribute has been added to the var-def processor, making it possible to specify whether an existing variable with the specified name will be overwritten or not.
  • A new processor, <exit condition=... message=.../>, has been introduced to support conditionally breaking off execution.
  • Encoding selection in the http processor has changed: if no encoding is explicitly specified with the charset attribute, the one given in the HTTP response is used to read the downloaded text content.
  • The NTLM proxy authentication scheme is supported.
  • Performance improvements and bug fixes.
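
The encoding selection rule from the list above (an explicit charset wins, otherwise the one announced in the HTTP response is used) can be sketched in plain Java as follows. This is only an illustration of the rule, not Web-Harvest's actual implementation: the class CharsetSelectionSketch and the method resolveCharset are hypothetical, and the fallback used when neither source names an encoding is an assumption passed in by the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSelectionSketch {

    // Matches the charset parameter of a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: picks the charset used to decode a downloaded body.
    // Order of precedence: explicit setting, then Content-Type header, then fallback.
    static Charset resolveCharset(String explicitCharset, String contentTypeHeader, Charset fallback) {
        if (explicitCharset != null && !explicitCharset.isEmpty()) {
            return Charset.forName(explicitCharset);
        }
        if (contentTypeHeader != null) {
            Matcher m = CHARSET_PARAM.matcher(contentTypeHeader);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name in the header: use the fallback.
                }
            }
        }
        return fallback; // assumption: the caller decides the default
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        // No explicit charset: the value from the response header is used.
        System.out.println(resolveCharset(null, "text/html; charset=ISO-8859-1", fallback));
        // Explicit charset: it takes precedence over the response header.
        System.out.println(resolveCharset("UTF-16", "text/html; charset=ISO-8859-1", fallback));
        // Neither source names a charset: the caller's fallback is used.
        System.out.println(resolveCharset(null, "text/html", fallback));
    }
}
```

Keeping the fallback as a parameter avoids hard-coding a default that the text above does not specify.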
