Crawl and index files, file folders or file servers
How to index files like Word documents, PDF files and whole document folders to Apache Solr or Elastic Search?
This connector and command line tools crawl and index directories and files from your filesystem and index it to Apache Solr or Elastic Search for full text search and text mining.
If you use Linux that means you can crawl whatever is mountable to Linux into an Apache Solr or Elastic Search index or into a triplestore.
Index different file system types to Solr or Elastic Search
This can be a hard disk or partitions formated with fat, ext3, ext4 or a file server connected via ntfs, file shares like smb or even sshfs or sftp on servers, private file sharing services like Seafile or OwnCloud on own servers or Dropbox, Amazon or other storage services in the cloud.
Data enrichment by different data analytic tools
This connector integrates enhanced data enrichment and data analysis plugins like automatic text recognition (OCR) for images and photos (i.e. as files like PNG, JPG, GIF ...) or inside PDFs (i.e.scanned Documents) using Tesseract OCR.
Usage
Index a file or directory:
Web admin interface
Using the web admin interface * Open the page Files * Enter filename to the form * Press button "crawl"
Command line
Using the command line interface (CLI):
opensemanticsearch-index-file *filename*
API
Using the REST-API:
http://127.0.0.1/search-apps/api/index-file?uri=*/home/opensemanticsearch/readme.txt*
Config
Config file for indexing files: */etc/opensemanticsearch/connector-files*