title: Web scraping: Extract structured data from websites
authors:
- Markus Mandalka

Web scraping: Extract structured data from websites

Import structured data scraped from websites to the search server

This connector integrates the open-source scraping framework Scrapy, a Python framework for ETL (extract, transform, load), so that you can build a customized crawler, parser, data scraper and converter for extracting structured data from websites.

This is for websites which don't offer their data in structured open standard formats like RDF, XML, JSON or CSV, for which comfortable and easy-to-use connectors already exist. For such websites we have to extract all data from the HTML pages with a more complicated, individual scraper.

Where to write the data: Solr dynamic fields

If you want to use fields other than the standard fields like title, author and content, you don't have to change the config file schema.xml (which defines the fields of the Solr index).

Our Solr server is preconfigured with dynamic fields, so you can fill standard fields like title or content, or additional dynamic fields: yourfield_b for one boolean, yourfield_bs for several booleans, yourfield_s for one string, yourfield_ss for several strings, yourfield_t for one text and yourfield_tt for several texts.

Have a look at schema.xml or managed-schema to see all possible data types of dynamic fields, like boolean, string, dates, text and so on.
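The suffix convention described above can be sketched as a small helper that turns a scraped item into a document with Solr dynamic field names. This is only an illustration, not part of the connector: the function names dynamic_field_name and to_solr_document, the text_fields parameter, the example field names and the use of id as the document key are assumptions made for this sketch. It covers only the boolean, string and text suffixes named above; suffixes for other types would have to be taken from schema.xml or managed-schema.

```python
from typing import Any, Dict, Iterable

# Standard fields named in the text; they are written without a suffix.
STANDARD_FIELDS = {"title", "author", "content"}

# Suffixes taken from the text above. Other data types (dates, numbers, ...)
# are defined in schema.xml / managed-schema and are not guessed here.
SINGLE_SUFFIX = {bool: "_b", str: "_s"}
MULTI_SUFFIX = {bool: "_bs", str: "_ss"}


def dynamic_field_name(name: str, value: Any, as_text: bool = False) -> str:
    """Return the dynamic field name for a value (hypothetical helper)."""
    multi = isinstance(value, (list, tuple))
    values = list(value) if multi else [value]
    if not values:
        raise ValueError(f"cannot infer a field type for empty value of {name!r}")

    # Decide the type from the first value, then require all values to match.
    kind = bool if isinstance(values[0], bool) else str
    if not all(isinstance(v, kind) for v in values):
        raise TypeError(f"unsupported or mixed value types for {name!r}")

    # Strings can be stored as plain strings (_s/_ss) or as text (_t/_tt).
    if kind is str and as_text:
        return name + ("_tt" if multi else "_t")
    return name + (MULTI_SUFFIX if multi else SINGLE_SUFFIX)[kind]


def to_solr_document(doc_id: str, item: Dict[str, Any],
                     text_fields: Iterable[str] = ()) -> Dict[str, Any]:
    """Map a scraped item to a dict keyed by Solr field names.

    Assumption: the document key is called "id"; check your schema.
    """
    text_fields = set(text_fields)
    doc: Dict[str, Any] = {"id": doc_id}
    for name, value in item.items():
        if name in STANDARD_FIELDS:
            doc[name] = value
        else:
            field = dynamic_field_name(name, value, as_text=name in text_fields)
            doc[field] = value
    return doc
```

For example, an item with the made-up fields available (a boolean), category (a list of strings) and summary (marked as text) would end up with the fieldnames available_b, category_ss and summary_t, while title is written unchanged.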

Enable new fields or facets in the user interface

If you did not use preconfigured fields like tags (fieldname tag_ss) and want to use your own additional fields not only to find data but also as interactive filters (facets) for navigation, enable them as facets in the user interface: map the technical Solr fieldnames to user-friendly labels in the config of the user interface with the option $cfg[facet].
