When I first started working as a search engineer in 2008, I had the pleasure of working with Microsoft FAST ESP. It truly was a full-stack enterprise search platform that included, among other things, a powerful document processing pipeline engine. The document processing engine sits between your data sources and the indexer and is largely responsible for:
- Document routing
- Document merging
- Metadata enrichment
- Data transformation
- Data cleansing
Unfortunately, Microsoft's acquisition of FAST quickly led to its demise, and the industry started to look for alternatives in the open source space. Enter Solr. Most organizations that made the shift from FAST to Solr quickly realized that Solr, while flexible and powerful, lacked some enterprise platform features, such as a rich set of data source connectors and a document processing pipeline.
To be fair, Solr does provide a Data Import Handler for ingesting data from relational databases, has a close relationship with both Apache Nutch (a web crawler) and Apache ManifoldCF (a connector framework), and offers basic pre-indexing processing via update request processors. However, it is my humble opinion that Solr should focus on what it does best: being a first-class indexer and searcher, leaving content extraction and pre-indexing processing as external concerns.
So, what are our options for content extraction + document processing + Solr? Well, there are many, both commercial and open source. Over the years I experimented with a number of open source document processing tools that are now mostly defunct (e.g., OpenPipeline, OpenPipe, Pypes) or suffer from poor community support and low adoption. Instead, I shifted my focus to what a document processing platform really needs to do: 1) extract, 2) transform and 3) load. Bingo, an ETL!
OK, now we are getting somewhere. An ETL is great, but there are many common document processing patterns that I would rather not reinvent for each client. I don't want to be in the business of writing plumbing code for document routing, filtering, aggregation, enrichment and splitting; I want to focus on extracting meaning and user intent from my data sources. Eureka! This sounds a lot like Enterprise Integration Patterns, or EIPs.
Now that I know exactly what I am looking for (an enterprise integration platform or framework), what are my options? To keep things simple, my top three are:
Instead of diving into the selection process I went through for each of the above (it’s Sunday and I’m ready for brunch), I will simply tell you that my choice is Apache Camel for the following reasons:
- Expressive, easy-to-use, fluent Java DSL (see the first sketch after this list).
- Lightweight integration framework. At a minimum it's just one JAR.
- Implements just about every EIP, so developers do not need to worry about the plumbing and can focus on what matters.
- No constraints on runtime integration (e.g., it works well as a simple standalone Java application, embedded in a web application, or deployed in an OSGi runtime).
- Active development and user community.
- Extensive set of integration components (file system, HTTP/HTTPS, JMS, relational databases, and many more). And, yes, support for Solr and Elasticsearch (see the second sketch after this list)! The full component list is available on the Apache Camel website.
- Well-documented project.
- Frictionless development and simple API.
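To make the DSL and EIP bullets concrete, here is the first sketch: a minimal Camel route that filters, splits and routes incoming feed files. It is illustrative only; the drop directory, file-naming convention and XPath expression are hypothetical, and it assumes a Camel 2.x-era camel-core on the classpath.

```java
import org.apache.camel.builder.RouteBuilder;

// Illustrative sketch: the directory, file-name convention and XPath are made up.
public class DocumentProcessingRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("file:data/inbox?noop=true")                        // poll a drop folder for feed files
            .filter(header("CamelFileName").endsWith(".xml"))    // Message Filter EIP: XML feeds only
            .split(xpath("/feed/doc"))                           // Splitter EIP: one message per <doc> element
            .choice()                                            // Content-Based Router EIP
                .when(header("CamelFileName").contains("press"))
                    .to("log:press-releases")
                .otherwise()
                    .to("log:other-documents");
    }
}
```

Running a route like this as a plain standalone Java application needs nothing more than camel-core on the classpath, which is exactly the setup Part 2 walks through.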
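And for the Solr bullet, the second sketch: a rough look at pushing documents into Solr with the camel-solr component. The Solr URL, collection name, field names and the docId/docTitle headers are placeholders, and the header constants reflect the camel-solr API as documented for Camel 2.x, so double-check them against your version.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.solr.SolrConstants;

// Rough sketch: the Solr endpoint, core name, field names and incoming headers are placeholders.
public class SolrIngestRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("direct:index")
            // Ask the Solr producer to perform an insert for this message.
            .setHeader(SolrConstants.OPERATION, constant(SolrConstants.OPERATION_INSERT))
            // Map values onto Solr fields via the "SolrField." header prefix.
            .setHeader(SolrConstants.FIELD + "id", simple("${header.docId}"))
            .setHeader(SolrConstants.FIELD + "title", simple("${header.docTitle}"))
            .to("solr://localhost:8983/solr/collection1")
            // Follow up with a commit so the documents become searchable.
            .setHeader(SolrConstants.OPERATION, constant(SolrConstants.OPERATION_COMMIT))
            .to("solr://localhost:8983/solr/collection1");
    }
}
```

An upstream route (or a ProducerTemplate) would feed messages into direct:index; Part 2 builds this out into a complete Maven project.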
In my next article, Solr Document Processing with Apache Camel - Part 2, we will build a small Maven-based project for ingesting documents into Solr using a standalone Apache Camel application.
Until next time!