MarkLogic magic in Jupyter Notebook

Jupyter Notebook (and its alternatives) are being seen more outside the confines of the data-science space. People have realised that you can do much more with them than Markdown, Python and MATLAB, though that's not to say those things aren't interesting! I've been looking at using Jupyter more as a way to capture documentation and code – largely through the use of 'magics', cell-level interpreters that, for instance, let you execute some R, run queries in a SQL database, plus a host of other things large and small. I found the bash magic to run commands and the dot magic to create diagrams particularly useful.


But wouldn't it be even more useful to be able to call out to a MarkLogic server via the REST interface – even better if the output could be captured for subsequent use in the Notebook? Of course, it's pretty easy in Python to POST out to somewhere with the requests library and get results back, but it's also far from elegant. Why not build a magic?
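For comparison, this is roughly what the raw call looks like with requests on its own – a minimal sketch, assuming a local server on port 8000, admin/admin credentials and the standard /v1/eval endpoint (the same endpoint the magic below uses):

import requests
from requests.auth import HTTPDigestAuth

# Hypothetical connection details – adjust for your own server
session = requests.session()
session.auth = HTTPDigestAuth('admin', 'admin')

# /v1/eval takes the code to run in a form field named after the language
r = session.post('http://localhost:8000/v1/eval',
                 data={'xquery': '1 + 1'})

print(r.status_code, r.headers.get('content-type'))
print(r.text)  # a multipart/mixed body you still have to pick apart by hand

It works, but the boilerplate and the multipart handling are exactly the sort of thing a cell magic can tidy away.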

A huge hat-tip to Nicolas Kruchten (@nicolaskruchten), whose fabulous PyCon 15 talk "Make Jupyter/IPython Notebook even more magical with cell magic extensions!" showed me how easy it was to make a magic.
Oh, and that Jupyter had its own editor (who knew?).

So, make a cup of tea, watch his video (it'll be 30 minutes well spent, but skip to about 17:15 in if you're impatient) and come back…

import requests
from requests.auth import HTTPDigestAuth
from requests_toolbelt.multipart import decoder
from IPython.core.magic import needs_local_scope
import json

def dispatcher(line, cell):
    # Split the connection string (the line after the magic name) into URI components
    r = requests.utils.urlparse(line)
    session = requests.session()
    session.auth = HTTPDigestAuth(r.username, r.password)
    # The scheme (xquery or javascript) tells /v1/eval what sort of code follows
    payload = {r.scheme: cell}
    uri = 'http://%s:%s/v1/eval' % (r.hostname, r.port)
    r = session.post(uri, data=payload)
    # Output is a list of dicts, one per part of the multipart response
    out = []
    if r.status_code == 200 and 'content-type' in r.headers:
        if r.headers['content-type'].startswith("multipart/mixed"):
            multipart_data = decoder.MultipartDecoder.from_response(r)
            for part in multipart_data.parts:
                ctype = part.headers['Content-Type']
                data = json.loads(part.content) if (ctype == 'application/json') else part.content
                out.append({'data': data, 'type': ctype})
    return out


def load_ipython_extension(ipython, *args):
    ipython.register_magic_function(dispatcher, 'cell', magic_name="marklogic")

def unload_ipython_extension(ipython):
    pass

Interesting, isn't it? Now you have a good idea how magics work (you watched the video, didn't you?) and the code above should make some sense.

Encouraged by his example and a read of the docs, it was pretty straightforward to create a basic magic for MarkLogic REST in about 30 lines of code. If you want to play along, use the built-in editor (still can't get over that), create a file called sample_ext.py in the same folder as your notebook and drop the code above in.

The meat is in the dispatcher method:

  • It takes the first line in the cell and then the rest of the cell as its arguments. It assumes the first line is a connection string and the rest of the cell is the code.
  • The connection string is in the format xquery://admin:admin@localhost:8000, which is then split up into URI components (see the sketch just after this list).
  • The requests library is used to construct the call, which is sent to the standard REST eval endpoint (using the default XDBC port 8000 in this case).
  • The 'scheme' part of the URI, either xquery or javascript, tells the eval endpoint what sort of code is being sent (sparql might be nice too, but I didn't get to it).
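Nothing clever is going on in that split – it's just urlparse, so you can see exactly what the dispatcher gets to work with (a quick sketch, with made-up credentials):

from requests.utils import urlparse

# The same split the dispatcher does on the connection string
r = urlparse('xquery://admin:admin@localhost:8000')
print(r.scheme)    # 'xquery'   – which language /v1/eval should run
print(r.username)  # 'admin'
print(r.password)  # 'admin'
print(r.hostname)  # 'localhost'
print(r.port)      # 8000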

There isn't anything special about the output; a couple of basic checks and then each part of the multipart response is made into a dictionary and added to the list (if it's JSON, it's parsed first, otherwise it's passed through as it comes). The list is returned as the output from the cell. Certainly not production grade, but good enough to play with.

[Screenshot: a %%marklogic cell running a trivial bit of XQuery against the local server, with the results appearing as the cell output]

Next you load or reload the magic and it's ready to use. Above you can see the results from a trivial bit of XQuery being run on the default REST port of my local MarkLogic server, with the results in the output of the cell. One of the reasons for using a list of dicts as the return format is that it makes it trivial to create a Pandas DataFrame out of the result, which in turn allows all sorts of subsequent data munging and charting shenanigans. Notice especially how 'the cell above' is referred to by "_". Both In[] and Out[] variables are maintained by the notebook for all the cells, so Out[196] could just as easily have been used.
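Loading and calling it takes three cells, something like this – a sketch that assumes the code was saved as sample_ext.py and that a local server is listening on port 8000 with admin/admin credentials (note the %%marklogic line has to be the first line of its cell):

%load_ext sample_ext

%%marklogic xquery://admin:admin@localhost:8000
for $i in 1 to 3 return <num>{$i}</num>

import pandas as pd
df = pd.DataFrame(_)    # or pd.DataFrame(Out[196]) using the explicit Out[] index

The DataFrame then has one row per part of the multipart response, with 'data' and 'type' columns.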

It works fine with javascript too, with the added ease of use that JSON brings to the table:

[Screenshot: the same magic running server-side JavaScript, with JSON results returned to the cell]

Now that it's possible to include output from a MarkLogic server, a few things come to mind where this capability might be handy: from server monitoring and project configuration (especially matched with cell input controls) to developing code, not to forget simply having access to all that data.

Now it probably isn’t best to start pulling millions of lines of data from your MarkLogic server into a Notebook as a DataFrame. However, what you might do is use MarkLogic to do the heavy lifting across your structured/unstructured data that Jupyter can’t do: search for instance, or BI analytics or semantic inference and then pull that resultant dataset forward into the Python or R space to do further statistical analysis, machine learning, fancy charting, dashboards etc.

XForms over REST with MarkLogic

Researching for an upcoming MarkLogic project, I've been looking at the REST interface. I'd set one up last month on the laptop so that I could test whether we could push data in from our Apache Camel engine. That worked nicely, and the thought occurred to me that I might be able to put an XForms interface directly over the top of the REST endpoints – a 'one-tier' application, so to speak.

For my demo, I took the idea of a herd book rather than the canonical name/address game. For those who don't keep animals, a herd book is simply a database of the animals you have and what happens to them in terms of breeding, medicines, etc. The very simplest 'toy' one you might need for alpacas might be as shown here:

[Screenshot: the toy alpaca herd book form]

Setup: The Same Origin issue.

To set up for my experiment, all I did was alter my working REST interface in the MarkLogic admin so that the 'modules-database' was on the file-system. That gave me somewhere to put my XForm and the xsltforms library to drive it. I could of course have created a new Application Server on a different port, but that would have put the REST endpoint in a different origin and hence triggered same-origin security.

In a real application you'd use a proxy, or CORS via headers, or some such, but for this it was simply easiest to run it all together (this came back to bite me later with searching, though).

Creating documents via POST

First stop, now that I had somewhere for my form to live, was to get it to store a document. In REST, the endpoint for this lives at /v1/documents, where you can PUT a document into the system at an address URI. However, that requires you to create URIs, and often you don't really care what a new document is actually called; you just need a unique name. MarkLogic has a nifty solution to this: you can POST a document instead and have your URI generated for you in a directory you provide (you can read about it here). In essence, you POST the data and you get a header back saying what it got called. Translated into XForms, that gives this sort of <submission/>:

 <xf:submission id="newanimal"
 resource="v1/documents?extension=xml&amp;directory=/my/herdbook/"
 method="post" replace="none" separator="&amp;" ref="instance('animal')">
 <xf:action ev:event="xforms-submit-done">
     <xf:setvalue ref="instance('control')/uri"
 value="substring-after(event('response-headers'),'=')" />
     <xf:message>
         <xf:output value="event('response-reason-phrase')" />
     </xf:message>
 </xf:action>
 </xf:submission>

Here I POST my form data, in the 'animal' instance, to REST and give it the target directory. I then use the 'xforms-submit-done' event to read the returned header out via the event mechanism and put it in a 'control' instance, so I know the URI for editing. Success!
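Outside XForms, the same REST interaction is easy to check from something like Python – a rough sketch, assuming a local REST instance on port 8000, admin/admin credentials, the /my/herdbook/ directory used above and a made-up animal record:

import requests
from requests.auth import HTTPDigestAuth

session = requests.session()
session.auth = HTTPDigestAuth('admin', 'admin')   # hypothetical credentials

animal = '<animal><name>Dolly</name><colour>fawn</colour></animal>'   # made-up record

# POST (rather than PUT) and let MarkLogic mint a URI in the given directory
r = session.post('http://localhost:8000/v1/documents',
                 params={'extension': 'xml', 'directory': '/my/herdbook/'},
                 data=animal.encode('utf-8'),
                 headers={'Content-Type': 'application/xml'})

print(r.status_code)              # 201 when the document has been created
print(r.headers.get('Location'))  # the generated URI comes back in a header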

Updating documents via PUT

The simplest way to update a document is often just to write over it with a PUT to the same URI. That's fine for documents if they're small enough, but it's worth saying that the MarkLogic REST interface also supports partial updates to both the document and the metadata, which might be better in some cases. Getting back to this form though, now that I have a URI, if I update a document I can simply replace the data with the following <submission/>:

<xf:submission id="upcase"
 method="put" replace="none" separator="&amp;" ref="instance('animal')">
<xf:resource value="concat('v1/documents?uri=',instance('control')/uri)"/>
<xf:action ev:event="xforms-submit-done">
    <xf:message>
        <xf:output value="'Updated'" />
    </xf:message>
</xf:action>
</xf:submission>

This is straightforward apart from the fact that I have to use an in-line <xf:resource/> element to construct the URL with the URI I stored earlier. [Note to self: AVT?]

Searching documents

The last step in this little demo is getting lists of animals, searching, etc. This bit took the most time, due to a problem with how search works. In MarkLogic, the modules-database gets used to store all the search options. Now this is great; you can pre-set a load of search templates, give them a name and put that name in the search URL as 'options'. However, if you've hijacked the modules-database for your code, well, you can't. This is a pain because, in my case, it doesn't allow you to override the snippet settings to get particular fields back. I thought that was that, until I found out that ML7 has the ability to send <options/> along with a Structured Query (thanks to @adamfowleruk).

It turns out all I needed for search was to POST a Structured Query format <instance/> using the accompanying <submission/>, like so:

<xf:instance id="squery">
    <search xmlns="http://marklogic.com/appservices/search">
        <query/>
        <options>
            <return-metrics>false</return-metrics>
            <transform-results apply="metadata-snippet">
                <preferred-elements>
                    <element name='name' ns=''/>
                    <element name='dob' ns=''/>
                    <element name='colour' ns=''/>
                </preferred-elements>
            </transform-results>
        </options>
    </search>
</xf:instance>

<xf:submission id="list-all" action="v1/search?directory=/my/herdbook"
    method="post"
    ref="instance('squery')"
    replace="instance"
    instance="results"
    omit-xml-declaration="yes" />

That gets me all the alpacas with their name, colour and dob so I can display them in a list. Note that in this example the <query/> section is empty, as I want every alpaca. In a real app I could of course add something like a range query to get a particular animal linked to a search box. A bit of judicious code in some triggers and I can paginate using header data from the reply as well.

[Screenshot: the search results – name, dob and colour for each alpaca – listed in the form]

Lastly, there's an <action/> in the list so that each row in the results triggers a fetch of that document, so I can edit it if I want. Given that the URI is in the search results, I simply need to copy it over to become my 'current' URI and send the instance using GET to /v1/documents. At the moment I send category=content (also the default) as I want the document, but I could easily expand that to category=metadata to start working with the metadata instead.

More to do?

That's pretty much it so far. I haven't had time to work this up into a full example yet, but it seems a very handy way to quickly put an XForms interface over REST, even if there are compromises, such as making the search a little harder.
Of course I've used MarkLogic for this, but in theory it's an approach that could work with any REST system, as long as it will take POST data in XML.
As to whether this is a viable way to write an application? I'd say probably yes for a prototype or something simple, especially as you get the REST endpoints thrown in for free.

Creating MarkLogic content with Apache Camel

MarkLogic with JMS?

A few weeks back we started looking at MarkLogic at work as a possible replacement for our mix of Cocoon and eXistDB systems. One of the side issues that's been raised is how we would get messages to/from our ActiveMQ EIP system. Now, MarkLogic doesn't have a JMS connector, although to be fair, it seems to have a pretty good system for slurping up content from URLs, files, etc. However, there is a Java API, which gave me the idea of using my old friend Apache Camel.

If I could get Camel to talk to MarkLogic then, not only could I talk to any sort of queue, I could also pull content into MarkLogic from the huge range of other things Camel will talk to, and I would get all the EIP magic thrown in for good measure.

The Camel Routes.

The easiest way to test this was of course to set up a simple Camel project to slurp up some data. The most expedient producer I could think of was to get an RSS feed from somewhere; JIRA being the most obvious, as it would reliably produce something new at reasonably short intervals. This would need transforming into XML and pushing into MarkLogic via a queue. The MarkLogic side would be handled by their Java API mounted as a Bean, in my case written in Groovy so I could work with it as a script. So much for the basic plan. Here's the start route as Camel sees it:

<route id="jira">
    <from uri="rss:https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?jqlQuery=&amp;tempMax=100&amp;delay=10s"/>
    <marshal><rss /></marshal>
    <setHeader headerName="ml_doc">
        <simple>/jira/${exchangeId}</simple>
    </setHeader>
    <to uri="seda:mlfeed"/>
</route>

The Camel RSS module calls a basic Jira RSS feed, in this case polling every 10 seconds. I've used the module defaults, so each entry is separated out of the feed and passed down the route one at a time. At this point the message body is a Java SyndFeed object, not XML, so it has to be 'marshalled'. Now the message body is an XML string ready for upload, but before I can send it I need to make a URI for MarkLogic to use. Each run of the route, or 'exchange', has a unique id, so I've used that via the inbuilt <simple/> language. Alternatively, I could have parsed something out of the content, like the Jira id, or made something up, like the current date-time. Finally, the message is dropped into a queue via the SEDA module.
Note, this in-memory queue type isn't persistent like JMS or ActiveMQ, but it's built into camel-core, so it was just handy.

There is another route to pull messages from the queue and into MarkLogic.

<route id="marklogic">
    <from uri="seda:mlfeed"/>
    <to uri="bean:marklogic?method=upload_xml"/>
    <!-- <to uri="file:outbox"/> -->
</route>

This route takes messages off the queue and passes them to a Bean written using Camel’s Groovy support. Lastly there’s an optional entry to put the message into a file in an /outbox folder. This is handy if you can’t get the MarkLogic bit working and want to look at the input: comment out the bean instead and just drop the data into files.

The Groovy Code.

The Groovy Bean is mounted in the configuration file, along with some parameters needed to connect to MarkLogic.
Note: to get this working, you'll need to supply your own parameters and have a MarkLogic REST server listening, as REST is the basis of their API. You can get instructions here.

<lang:groovy id="marklogic" script-source="classpath:groovy/marklogic.groovy">
<lang:property name="host" value="YOURHOST" />
<lang:property name="port" value="YOURPORT" />
<lang:property name="user" value="YOURNAME" />
<lang:property name="password" value="YOURPASSWORD" />
</lang:groovy>

Once the Bean is running, you simply call its methods in the route. You get the entire Exchange as input, so you have access to everything, as well as the ability to alter it as you like. In this case, I've simply written the data out and not altered the message at all. In real life it would probably be more complex. The salient bit of Groovy code (the getters for the parameters are not shown) is below. This is the basic MarkLogic example with a couple of mods to a) get the header that has the URI in it, and b) get the body of the input Message as an InputStream:

public void upload_xml(Exchange exchange) {
    // Get the doc url from Camel
    String docId = exchange.getIn().getHeader("ml_doc");
    if (docId == null) docId = "/twitter/" + exchange.getExchangeId();
    // create the client
    DatabaseClient client = DatabaseClientFactory.newClient(host, port,
            user, password, Authentication.DIGEST);

    // try to make use of the client connection
    try {
        XMLDocumentManager XMLDocMgr = client.newXMLDocumentManager();
        //Get an InputStream from the Message Body
        InputStreamHandle handle = new InputStreamHandle(exchange.getIn().getBody(InputStream.class));

        //Write out the XML Doc
        XMLDocMgr.write(docId, handle);
    } catch (Exception e) {
        System.out.println("Exception : " + e.toString());
    } finally {
        // release the client
        client.release();
    }
}

Note: I've connected to and disconnected from the MarkLogic database each time. I'm sure this can't be efficient in anything but a basic use case, but it will do for the present. There's nothing to stop me creating an init() method, called as the Bean starts, to create a persistent connection if that's better, but all the examples I could find seem to do it this way. [If I've made any MarkLogic Java API gurus out there wince, I'm sorry. Happy to do it a better way.]

Putting it all together.

If you’ve got a handy MarkLogic server, you can try this all out. I’ve put the code here on GitHub as a Maven project, and all you need to do is pull it and run “mvn compile exec:java”. Ten seconds or so after it starts, you should see something similar to this on the console:

[l-1) thread #1 – seda://mlfeed] DocumentManagerImpl INFO Writing content for /jira/ID-twiglet-53205-1398451322665-0-2
[l-1) thread #1 – seda://mlfeed] DatabaseClientImpl INFO Releasing connection

On the MarkLogic side, if you go to the Query Console you can use Explore to look at your database. You should see the files in the database – query them to your heart’s content.

  • I’m using MarkLogic 7 and Java API 2.0-2 with Camel 2.12.0.
  • If you want to change the routes, you’ll find them in src/resources/camel-context.xml.
  • The Groovy script is in resources/groovy/marklogic.groovy.
  • Remember, if you want to use modules outside of camel-core, you’ll need them in the pom.xml!

Bells and Whistles.

Now I've got the basic system working, there are a couple of other things I could do. As the MarkLogic component reads from the end of a queue, I could, for instance, add another route that puts messages into the same queue from another source – Twitter, say (for which there's a module) – assuming I had appropriate Twitter OAuth keys, like so:

<route id="twitter-demo">
    <from uri="twitter://search?type=polling&amp;delay=60&amp;keywords=marklogic&amp;consumerKey={{twitter.consumerKey}}&amp;consumerSecret={{twitter.consumerSecret}}&amp;accessToken={{twitter.accessToken}}&amp;accessTokenSecret={{twitter.accessTokenSecret}}" />
    <setHeader headerName="ml_doc">
        <simple>/twitter/${body.id}</simple>
    </setHeader>
    <log message="${body.user.screenName} tweeted: ${body.text}" />
    <to uri="seda:mlfeed"/>
</route>

Of course, once you start doing that, you need some way to throttle the speed at which things get added to the queue, to avoid overwhelming the consumer. Camel has several strategies for this, but my favourite is RoutePolicy. With this you can specify rules that allow the input routes to be shut down and restarted as necessary to throttle the number of in-flight exchanges. You simply add the Bean with an appropriate configuration, like so:

<bean id="myPolicy" class="org.apache.camel.impl.ThrottlingInflightRoutePolicy">
    <property name="scope" value="Context"/>
    <property name="maxInflightExchanges" value="20"/>
    <property name="loggingLevel" value="WARN"/>
</bean>

and then add this policy to any route you wish to control, like so:

<route routePolicyRef="myPolicy">

Once there are more than 20 messages in-flight (‘context’ means all routes/queues) the inbound routes will be suspended. Once the activity drops below 70% (you can configure this) they’ll start up again – neat.

This only really skims the surface. Camel is a marvellous system, and being able to use it to push content to MarkLogic is very handy (once I polish the code a bit). Wiring routes up in Camel is so much easier, more flexible and more maintainable than writing custom, one-off code.

Finally, of course, there's nothing to say you couldn't have a route that went away and got some key, sent it to MarkLogic via the Bean to retrieve some data instead, and then added that data to the message body (Content Enrichment, in EIP terms). That'll have to be the subject of another day.

Resources.

  • MarkLogic Developer. http://developer.marklogic.com
  • Apache Camel. http://camel.apache.org
  • Enterprise Integration Patterns (nice hardback book) http://www.eaipatterns.com