For those of you who are still following along, let's recap what we've accomplished since the last post, Solr Document Processing with Apache Camel - Part II. We started by deploying SolrCloud with the sample gettingstarted collection and then developed a simple standalone Camel application to index products from a handful of stub JSON files.
In this post, we will continue to work against the SolrCloud cluster we set up previously. If you haven't done this yet, refer to the Apache Solr Setup section in README.md. We will also start with a new Maven project available in GitHub called camel-dp-part3. This project is similar to the previous one, but with the following changes:
- We will be using a real data source. Specifically, Best Buy's movie product line.
- We will introduce property placeholders. This will allow us to specify environment-specific configurations within a Java properties file.
Note: For the detail-oriented reader, we will not be covering JAR assembly or enhancements to the Camel runtime, as mentioned in the conclusion of the previous post. These will be addressed in a later post.
Best Buy API
Before we get started, you will need to obtain an API key from the Best Buy Developer Site.
I decided to go with the Best Buy data source for the following reasons:
- It's free.
- The API supports bulk data downloads for all products as well as subsets of products. For example, we will be working with the movie product data in this post.
- It provides a wonderful set of structured, eCommerce product data that we can use to build a real-world product search experience.
Best Buy to Solr Indexer
Now that we have Solr running and our Best Buy API key, let's clone the GitHub project and index some products.
$ git clone https://github.com/GastonGonzalez/camel-to-solr.git
$ cd camel-to-solr/camel-dp-part3
Edit src/main/resources/movies.properties and set bestbuy.api.key to the value of your API key. Then, run the following Maven command to build and start the indexer.
$ mvn clean compile exec:java -Dexec.mainClass=com.gastongonzalez.blog.camel.App
The Camel Application
While the indexer is running, let's take a look at the main class: src/main/java/com/gastongonzalez/blog/camel/App.java.
package com.gastongonzalez.blog.camel;

import org.apache.camel.CamelContext;
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.properties.PropertiesComponent;
import org.apache.camel.component.solr.SolrConstants;
import org.apache.camel.dataformat.zipfile.ZipSplitter;
import org.apache.camel.impl.DefaultCamelContext;

public class App
{
    public static void main( String[] args ) throws Exception
    {
        CamelContext context = new DefaultCamelContext();

        // Load property placeholders from the classpath. System properties
        // (e.g., -Dbestbuy.api.key=...) override values from the file.
        PropertiesComponent propertiesComponent =
                context.getComponent("properties", PropertiesComponent.class);
        propertiesComponent.setLocation("classpath:movies.properties");
        propertiesComponent.setSystemPropertiesMode(PropertiesComponent.SYSTEM_PROPERTIES_MODE_OVERRIDE);

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() throws Exception
            {
                // Route #1: download the bulk movie feed once and save it to disk.
                from("timer://foo?repeatCount=1&delay=1000")
                    .to("http4://api.bestbuy.com/v1/subsets/productsMovie.json.zip?apiKey={{bestbuy.api.key}}")
                    .setHeader(Exchange.FILE_NAME, constant("productsMovie.json.zip"))
                    .to("file:data/zip?doneFileName=${file:name}.done");

                // Route #2: once the download is complete, unpack the ZIP into
                // individual JSON files.
                from("file:data/zip?noop=true&doneFileName=productsMovie.json.zip.done")
                    .split(new ZipSplitter())
                    .streaming()
                    .to("file:data/json?doneFileName=${file:name}.done");

                // Route #3: unmarshal each JSON file into Movie POJOs and index
                // them in Solr.
                from("file:data/json?noop=true&doneFileName=${file:name}.done")
                    .process(new JsonToProductProcessor())
                    .split().body()
                    .setHeader(SolrConstants.OPERATION, constant(SolrConstants.OPERATION_ADD_BEAN))
                    .to("solrCloud://{{solr.host}}:{{solr.port}}/solr/{{solr.collection}}?zkHost={{solr.zkhost}}&collection={{solr.collection}}");
            }
        });

        context.start();
        Thread.sleep(1000 * 60 * 15); // Keep the context alive for 15 minutes.
        context.stop();
    }
}
Property Placeholders
To use property placeholders, like {{bestbuy.api.key}}, in our DSL-based routes, we need to tell the Camel runtime (i.e., the CamelContext) where our properties file is located. We do this by obtaining a PropertiesComponent and setting the location of the properties file. In our case, movies.properties is stored under src/main/resources and is copied to the root of the classpath during the Maven build. As such, we can simply set the location to classpath:movies.properties.
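For reference, a minimal movies.properties might look like the following. The keys match the placeholders used in the routes; the values shown are examples that assume the SolrCloud setup from the earlier posts, so adjust them for your environment:

bestbuy.api.key=YOUR_API_KEY
solr.host=localhost
solr.port=8983
solr.collection=gettingstarted
solr.zkhost=localhost:9983

Also, because the PropertiesComponent is configured with SYSTEM_PROPERTIES_MODE_OVERRIDE, any of these values can be overridden at runtime with a Java system property. For example:

$ mvn clean compile exec:java -Dexec.mainClass=com.gastongonzalez.blog.camel.App -Dbestbuy.api.key=XXXX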
Camel Routes
In this application we have three different routes.
Route #1 - Fetch all Best Buy movies as JSON
The first route uses the Timer component as a means to trigger our HTTP request to the Best Buy API. It has been configured to wait for 1 second after the route has been initialized by the container and fire a message exactly once. We then use the HTTP4 component to invoke our bulk product request. This request effectively downloads a ~157 MB ZIP file containing a series of JSON files. Lastly, we use the File component to save the ZIP file to the filesystem (data/zip/productsMovie.json.zip).
Since we are using multiple routes together and each route is asynchronous, we need to ensure that our next route does not consume data/zip/productsMovie.json.zip until the download is complete. Therefore, we configure the File component to create a ".done" file when the download is complete using: doneFileName=${file:name}.done. This creates a file named data/zip/productsMovie.json.zip.done.
At this point, the data/zip directory looks as follows:
data/zip
data/zip/productsMovie.json.zip
data/zip/productsMovie.json.zip.done
Route #2 - Unpack the ZIP file
Our next route polls the data/zip folder and waits to process any file until it sees the file along with its ".done" file. Once it detects a complete write of our ZIP file, it uses the splitter EIP to break the ZIP into individual files: each JSON file in the archive becomes an individual message. Each message is then written to the file system using the File component.
This results in the data/json directory looking as follows:
data/json
data/json/products_0001_341632_to_3421383.json
data/json/products_0001_341632_to_3421383.json.done
data/json/products_0002_3421443_to_4798323.json
data/json/products_0002_3421443_to_4798323.json.done
data/json/products_0003_4798324_to_6265333.json
...
Route #3 - Unmarshal JSON and index POJOs
Our last route starts with our friend the File component. Like before, it is configured to poll for files and wait for ".done" files. It reads each JSON file and passes the message to a custom processor that is responsible for unmarshalling the JSON into a collection of Movie objects.
We are unable to use the Gson data format here, since the Best Buy JSON is an anonymous array of product objects. However, we can define a custom processor that calls the Gson library directly and uses the well-known TypeToken technique for mapping anonymous arrays. Here's a snippet from JsonToProductProcessor.java:
import java.io.Reader;
import java.lang.reflect.Type;
import java.util.Collection;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class JsonToProductProcessor implements Processor
{
    public void process(Exchange exchange) throws Exception
    {
        Reader reader = exchange.getIn().getBody(Reader.class);
        // The feed is an anonymous JSON array, so a TypeToken tells Gson
        // the target collection type.
        Gson gson = new Gson();
        Type collectionType = new TypeToken<Collection<Movie>>(){}.getType();
        Collection<Movie> movies = gson.fromJson(reader, collectionType);
        exchange.getIn().setBody(movies);
    }
}
The GSON library does the heavy lifting by giving us a collection of Movie objects for each JSON file. Then, we set the message body with our collection of Movie objects.
For those interested in the Movie object, refer to Movie.java. Since this is a demonstration, it has been kept intentionally light and only contains two fields: the SKU and the title of the movie. As in the previous post, the Movie object includes SolrJ @Field annotations to support indexing POJOs directly.
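As a reference, here is a minimal sketch of what such an annotated POJO can look like. The Solr field names used below are illustrative assumptions; see Movie.java in the project for the actual definitions.

import org.apache.solr.client.solrj.beans.Field;

public class Movie
{
    @Field("sku")   // Maps this POJO field to a Solr field (names here are assumed).
    private String sku;

    @Field("name")
    private String name;

    // Getters and setters omitted for brevity.
}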
Now that we are working with a collection of POJOs as our message body, we use the splitter EIP to send each POJO to Solr for indexing.
Conclusion
Altogether, we have come a long way since the first post. We now have a fairly lightweight way of connecting to a full product data source and ingesting it into Solr. With this structure in place, we can start thinking about our search concerns, such as data modeling, signal modeling, and tokenization.