Crawler for indexing websites

Index a single web page

You can index a single web page, or an image or a PDF file on a web server. Images and PDF files go through automatic text recognition (OCR) if that is enabled in the connector config /etc/opensemanticsearch/connector-web. Start the indexing in one of the following ways:

Start indexing from the web interface

To start indexing a single web page via the web interface (e.g. http://localhost/search/admin/crawl), enter the URL in the uri field and submit the form.

Command line tool

Alternatively, run this command line tool directly or integrate it into a crontab or your own scripts: opensemanticsearch-index-web *http://www.opensemanticweb.org/*
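For integration into your own scripts, the command line tool can be wrapped in a few lines of Python. The following is only a minimal sketch: the helper names build_command and index_pages are hypothetical, and it assumes that opensemanticsearch-index-web is on the PATH and signals failure through a non-zero exit code, which the text above does not confirm.

```python
import subprocess

# Name of the command line tool as given in this documentation.
INDEX_COMMAND = "opensemanticsearch-index-web"


def build_command(page_url):
    """Return the argument list that indexes one URL (hypothetical helper)."""
    # Passing a list instead of a shell string avoids quoting problems
    # with characters such as "&" or "?" inside the URL.
    return [INDEX_COMMAND, page_url]


def index_pages(urls):
    """Run the indexer once per URL and return {url: exit code}.

    Assumption: a non-zero exit code means the indexing failed.
    """
    results = {}
    for url in urls:
        completed = subprocess.run(build_command(url), check=False)
        results[url] = completed.returncode
    return results
```

Calling index_pages(["http://www.opensemanticweb.org/"]) would then run the tool once for that URL. If the tool is not installed, subprocess.run raises FileNotFoundError.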

REST-API

Or send a request to the REST-API: http://127.0.0.1/search-apps/api/index-web?uri=*http://www.opensemanticsearch.org/*
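The REST-API request can also be built and sent from a script. The sketch below uses only the Python standard library. The helper names build_index_request and index_page are hypothetical, and it assumes that the endpoint accepts a plain GET request and that the HTTP status code is a useful success indicator. It percent-encodes the target URL so that a page address containing its own "?" or "&" is passed intact in the uri parameter.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in this documentation; adjust the host if the
# server does not run on the local machine.
API_ENDPOINT = "http://127.0.0.1/search-apps/api/index-web"


def build_index_request(page_url, endpoint=API_ENDPOINT):
    """Return the request URL that asks the API to index page_url."""
    # urlencode percent-encodes the value of the uri parameter.
    return endpoint + "?" + urlencode({"uri": page_url})


def index_page(page_url, timeout=30.0):
    """Send the request and return the HTTP status code.

    Assumption: the API is called with GET, as the example URL suggests.
    """
    with urlopen(build_index_request(page_url), timeout=timeout) as response:
        return response.status
```

For example, index_page("http://www.opensemanticsearch.org/") would send the request shown above to the local server.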

Crawl whole websites or parts of a website

You can index a whole website with the web crawler module of Apache ManifoldCF.

In its web interface you can set up a home page, a sitemap or an RSS feed as the starting point and set how deep the crawl should go.

It is also possible to set up rules that define which parts of a site to crawl and which to exclude.
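To illustrate how such include and exclude rules typically interact, here is a small sketch based on regular expressions. It is not ManifoldCF's implementation or configuration format. The function should_crawl, the patterns and the example.org URLs are hypothetical, and it assumes that a URL is crawled only if it matches at least one include pattern and no exclude pattern.

```python
import re


def should_crawl(url, include_patterns, exclude_patterns):
    """Decide whether a URL passes the include and exclude rules.

    Assumption: exclusions win over inclusions.
    """
    included = any(re.search(pattern, url) for pattern in include_patterns)
    excluded = any(re.search(pattern, url) for pattern in exclude_patterns)
    return included and not excluded


# Hypothetical rules: crawl only the /docs/ part of a site,
# but skip its archive and any ZIP downloads.
INCLUDE = [r"^https?://www\.example\.org/docs/"]
EXCLUDE = [r"/docs/archive/", r"\.zip$"]
```

With these rules, http://www.example.org/docs/intro.html would be crawled, while pages under /docs/archive/ and anything outside /docs/ would be skipped.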

Another tool for crawling websites is Scrapy.