Using custom schema with Riak Search 2.0


#key-value #distributed #riak #clojure


Riak 2.0

There are a lot of new features coming down the pipe with Riak 2.0, but the most important one (at least to me) is Riak Search 2.0.

What is Riak Search 2.0 exactly? Riak is a very simple key-value store with AP properties in CAP theorem land. It scales very well and, thanks to Erlang, it is extremely reliable. It has very few features (which is a great thing for a data store), so introducing Solr as the revamped search is kind of a big deal. I have never used Solr before, so watch me fail, or maybe succeed, at indexing some DBpedia documents.

Test data

Getting the test data

DBpedia provides a SPARQL query interface (SQL-like) to structured data extracted from Wikipedia. I am going to query it for all the cities on the planet with a population greater than 50,000. The query looks like this:

  SELECT ?city ?label ?abstract ?pop ?country WHERE {
    ?city rdf:type dbpedia-owl:City ;
          rdfs:label ?label ;
          dbpedia-owl:abstract ?abstract ;
          dbpedia-owl:populationTotal ?pop ;
          dbpedia-owl:country ?country .
    FILTER (lang(?abstract) = 'en' && lang(?label) = 'en' && ?pop > 50000)
  }
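To actually download the result set, a query like the one above can be sent to DBpedia's public SPARQL endpoint, which has the dbpedia-owl and rdfs prefixes predefined. This is just a sketch: the file name cities.sparql is an assumption (save the query there first), and the output file name matches the one used later in the Clojure snippet.

```shell
# run the SPARQL query against DBpedia and save the JSON result set
curl -G 'https://dbpedia.org/sparql' \
  --data-urlencode "query=$(cat cities.sparql)" \
  -H 'Accept: application/sparql-results+json' \
  -o cities.pp.json
```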


The data needs to be sliced up into individual JSON files so we can load them into Riak easily and have Solr index them using the custom schema. After removing some header information, an entry from DBpedia looks like the following:

        "abstract": {
            "type": "literal",
            "value": "Shawinigan is a city located...",
            "xml:lang": "en"
        "city": {
            "type": "uri",
            "value": ""
        "country": {
            "type": "uri",
            "value": ""
        "label": {
            "type": "literal",
            "value": "Shawinigan",
            "xml:lang": "en"
        "pop": {
            "datatype": "",
            "type": "typed-literal",
            "value": "50060"
        "another": "city"

You get the idea: it is a nested data structure, an array of smaller hashes. I processed it with Clojure to get one entry per file, using UUIDs as unique file names (keys).

(require '[clojure.data.json :as json]) ; read-str/write-str come from clojure.data.json

(defn uuid
  "Returns a new java.util.UUID as a string."
  []
  (str (java.util.UUID/randomUUID)))

(def cities (slurp "cities.pp.json"))
(def cities-json (json/read-str cities))

;; doseq instead of a lazy map: spit is called for its side effect
(doseq [city cities-json]
  (spit (str "t/" (uuid) ".json") (json/write-str city)))

This produces a bunch of JSON files that I can upload to Riak. Before we get there, let's start up and configure our Riak service.

Configuring the Riak cluster

I am using the Yokozuna release, version 0.14.0. After downloading the source we need to create a devrel with two nodes. I assume you have Erlang and the build tools installed.

$ make stagedevrel DEVNODES=2

There might be some libs missing, but I don't want to go too deep into the operating-system-specific part of the story. After the dev nodes are created we need to configure Riak and enable search. I prefer LevelDB as the persistent store, and I would like to make Riak listen on all of the available interfaces, which makes our lives easier in a virtualized environment. Let's do all of these:

$ for d in dev/dev?; do
    sed -e 's/storage_backend = bitcask/storage_backend = leveldb/' \
        -i.back $d/etc/riak.conf;
  done
$ for d in dev/dev?; do
    sed -e 's/search = off/search = on/' -i.back $d/etc/riak.conf;
  done
$ for d in dev/dev?; do
    # bind the listeners to all interfaces instead of the loopback address
    sed -e 's/127.0.0.1/0.0.0.0/' -i.back $d/etc/riak.conf;
  done

After the configuration part is done, start up the nodes and join them into one cluster; then we are almost ready to start shoving data in.

$ cd dev
$ for i in {1..2}; do
    dev$i/bin/riak start;
  done
$ dev2/bin/riak-admin cluster join dev1@
$ dev2/bin/riak-admin cluster plan
$ dev2/bin/riak-admin cluster commit

Check the member status to verify that the ring is evenly distributed among the nodes:

$ dev2/bin/riak-admin member-status

============================== Membership ===============================
Status     Ring    Pending    Node
valid      50.0%      --      'dev1@'
valid      50.0%      --      'dev2@'
Valid:2 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Preparing the bucket and loading the data

In this section we are going to create a Solr schema and an index so that we can index the documents. Think of the schema as a measure of how much Solr understands the data. It can be configured to reference individual elements in complex nested data structures, and the data types of fields can be configured too, which makes range queries possible for numeric data. I don't want to go too deep into Solr; it is worth spending a few hours on its documentation. I am just scratching the surface in this post.

Creating an index with a custom schema

First thing first, we need to create a schema that is used by the index. I am not really a Solr expert but here is what I came up with:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="default" version="1.5">
    <field name="abstract.value" type="text_general" indexed="true" stored="true" multiValued="true" />
    <field name="city.value"     type="string"       indexed="true" stored="true" />
    <field name="country.value"  type="string"       indexed="true" stored="true" />
    <field name="label.value"    type="string"       indexed="true" stored="true" />
    <field name="pop.value"      type="int"          indexed="true" stored="true" />

    <dynamicField name="*" type="text_general" indexed="true" stored="false" multiValued="true" />

    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" required="true"/>

    <!-- Same as the default schema from here: the remaining required _yz_*
         fields, the field type definitions, and so on..... -->
</schema>

Upload the schema to Riak and create the index:

# RIAK_HOST is an assumption: point it at the HTTP listener of one of the nodes
$ curl -XPUT \
  -d @sch_cities.xml \
  -H 'Content-Type: application/xml' \
  "$RIAK_HOST/search/schema/sch_cities"
$ curl -XPUT -i \
  -H 'Content-Type: application/json' \
  -d '{"schema":"sch_cities"}' \
  "$RIAK_HOST/search/index/idx_cities"

Creating a bucket type

In Riak 2.0 there is a new feature called bucket types that allows groups of buckets to share the same configuration details.

Creating a new bucket type called cities and activating it:

$ riak-admin bucket-type create cities '{"props":{"search_index":"idx_cities"}}'
$ riak-admin bucket-type activate cities
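To double-check that the type picked up the search_index property, the bucket-type status command prints the type's state and props (here run from a devrel node's bin directory):

```shell
# prints active/inactive plus the props; search_index: idx_cities should appear
$ dev1/bin/riak-admin bucket-type status cities
```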

Loading the data

# RIAK_HOST is an assumption again; the bucket name ("cities") is arbitrary,
# since the search_index prop comes from the bucket type
$ for i in *.json; do
    curl -XPUT -d @"$i" \
      -H 'Content-Type: application/json' \
      "$RIAK_HOST/types/cities/buckets/cities/keys/$i";
  done

Querying Solr

The search index can be queried in two ways: through Riak's own HTTP search endpoint, which proxies the request to Solr, or directly against the Solr HTTP interface that Yokozuna runs on each node.

The syntax is the same either way. Let's find the first 100 cities with a population between 51000 and 52000, display only the name and the population, and order the results by population. With Solr's very comprehensive query syntax you end up with something like this:
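A sketch of such a query through Riak's search endpoint; RIAK_HOST (a node's HTTP listener) is an assumption, idx_cities is the index created earlier, and q, fl, sort and rows are standard Solr query parameters:

```shell
# range query over the indexed pop.value field; only label.value and
# pop.value are returned, sorted by population, first 100 hits
curl -G "$RIAK_HOST/search/query/idx_cities" \
  --data-urlencode 'q=pop.value:[51000 TO 52000]' \
  --data-urlencode 'fl=label.value,pop.value' \
  --data-urlencode 'sort=pop.value asc' \
  -d wt=json \
  -d rows=100
```

The same parameters sent straight to Solr's own HTTP interface on a node (port 8093 by default) should return the same result set.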

Closing thoughts

I am a huge fan. Riak is my favorite simple key-value store (o hai GET/PUT), and with Solr it becomes really easy to index what is stored in your system. I think it might be a bad idea to run Solr on the same nodes as Riak, since tracking down bugs would be painful, but other than that I am pretty happy with the current state of Yokozuna.


The kudos go to #riak on freenode, especially to @coderoshi and @nikore.