MarkLogic magic in Jupyter Notebook

Jupyter Notebook (and its alternatives) is being seen more and more outside the confines of the data-science space. People have realised that you can do much more with it than Markdown, Python and MATLAB, though that’s not to say those things aren’t interesting! I’ve been looking at using Jupyter more as a way to capture documentation and code, largely through the use of ‘magics’: cell-level interpreters that, for instance, let you execute some R, run queries against a SQL database, plus a host of other things large and small. I found the bash magic for running commands and the dot magic for creating diagrams particularly useful.


But wouldn’t it be even more useful to be able to call out to a MarkLogic server via the REST interface, and even better if the output could be captured for subsequent use in the Notebook? Of course, it’s pretty easy in Python to POST to a server with the requests library and get results back, but it’s far from elegant. Why not build a magic?

A huge hat-tip to Nicolas Kruchten (@nicolaskruchten), whose fabulous Pycon15 talk “Make Jupyter/IPython Notebook even more magical with cell magic extensions!” showed me how easy it is to make a magic.
Oh, and that Jupyter has its own editor (who knew?).

So, make a cup of tea, watch his video (it’ll be 30 min well spent, but skip to about +17:15 in if you’re impatient) and come back….

import json
from urllib.parse import urlparse

import requests
from requests.auth import HTTPDigestAuth
from requests_toolbelt.multipart import decoder


def dispatcher(line, cell):
    # Split the connection string (the rest of the %%marklogic line) into its parts
    conn = urlparse(line.strip())
    session = requests.Session()
    session.auth = HTTPDigestAuth(conn.username, conn.password)
    # The scheme (xquery or javascript) names the form field that carries the code
    payload = {conn.scheme: cell}
    uri = 'http://%s:%s/v1/eval' % (conn.hostname, conn.port)
    response = session.post(uri, data=payload)
    # Output is a list of dicts, one per part of the multipart response
    out = []
    if response.status_code == 200 and 'content-type' in response.headers:
        if response.headers['content-type'].startswith('multipart/mixed'):
            multipart_data = decoder.MultipartDecoder.from_response(response)
            for part in multipart_data.parts:
                # On Python 3, requests_toolbelt exposes part headers as bytes
                ctype = part.headers[b'Content-Type'].decode('utf-8')
                data = json.loads(part.content) if ctype == 'application/json' else part.content
                out.append({'data': data, 'type': ctype})
    return out


def load_ipython_extension(ipython, *args):
    # Register dispatcher as the %%marklogic cell magic
    ipython.register_magic_function(dispatcher, 'cell', magic_name='marklogic')


def unload_ipython_extension(ipython):
    pass

Interesting, isn’t it? Now you have a good idea of how magics work (you watched the video, didn’t you?) and the code above should make some sense.

Encouraged by his example and a read of the docs, it was pretty straightforward to create a basic magic for MarkLogic REST in about 30 lines of code. If you want to play along, use the built-in editor (still can’t get over that), create a file called sample_ext.py in the same folder as your notebook and drop the code above in.

The meat is in the dispatcher function:

  • It takes the first line of the cell and then the rest of the cell as its arguments. It assumes the first line is a connection string and the rest of the cell is the code.
  • The connection string is in the format xquery://admin:admin@localhost:8000, which is then split up into its URI components.
  • The requests lib is used to construct the call, which is sent to the standard REST eval endpoint (using the default App-Services port 8000 in this case).
  • The ‘scheme’ part of the URI, either xquery or javascript, tells the eval endpoint what sort of code is being sent (sparql might be nice too, but I didn’t get to it).
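As a rough illustration of the steps above, the sketch below pulls the connection-string handling out of the dispatcher so it can be tried without a MarkLogic server. The helper name build_eval_request is made up for this example, and the scheme check is an extra that the magic itself does not do.

```python
from urllib.parse import urlparse


def build_eval_request(line, cell):
    """Hypothetical helper: split the connection string and build the eval call.

    Nothing is sent over the network; the function only returns the pieces
    that the dispatcher would hand to requests.
    """
    conn = urlparse(line.strip())
    # Extra check (not in the magic): only the two schemes described above
    if conn.scheme not in ("xquery", "javascript"):
        raise ValueError("expected xquery:// or javascript://, got %r" % conn.scheme)
    uri = "http://%s:%s/v1/eval" % (conn.hostname, conn.port)
    # The scheme doubles as the name of the form field carrying the code
    payload = {conn.scheme: cell}
    credentials = (conn.username, conn.password)
    return uri, payload, credentials


uri, payload, credentials = build_eval_request(
    "xquery://admin:admin@localhost:8000", "1 + 1"
)
print(uri)
print(payload)
```

Keeping this step separate from the HTTP call makes it easy to see what a given connection string turns into before anything is posted.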

There isn’t anything special about the output: after a couple of basic checks, each part of the multipart response is made into a dictionary and added to a list (if it’s JSON, it’s parsed first; otherwise it’s kept as it comes). The list is returned as the output of the cell. Certainly not production grade, but good enough to play with.
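To make the shape of that output concrete, here is a minimal sketch of the same per-part logic working on plain (content type, bytes) pairs instead of a live multipart response. The helper name parts_to_records and the sample data are assumptions made for illustration only.

```python
import json


def parts_to_records(parts):
    """Hypothetical helper: turn (content_type, raw_bytes) pairs into dicts."""
    records = []
    for ctype, content in parts:
        # JSON parts are parsed into Python objects; everything else is kept as-is
        data = json.loads(content) if ctype == "application/json" else content
        records.append({"data": data, "type": ctype})
    return records


# Made-up sample parts, standing in for a decoded multipart/mixed response
sample_parts = [
    ("application/json", b'{"name": "example", "count": 3}'),
    ("text/plain", b"hello"),
]
records = parts_to_records(sample_parts)
print(records)
```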

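[Screenshot: loading the magic and running a trivial XQuery through it, with the results shown in the cell output]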

Next you load or reload the magic and it’s ready to use. Above you can see the results from a trivial bit of XQuery being run on the default REST port on my local MarkLogic server, with the results in the output of the cell. One of the reasons for using a list of dicts as the return format is that it makes it trivial to create a Pandas DataFrame out of the result, which in turn allows all sorts of subsequent data munging and charting shenanigans. Notice especially how ‘the cell above’ is referred to by “_”. Both In[] and Out[] variables are maintained by the notebook for all the cells, so Out[196] could just as easily have been used.
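A small sketch of that DataFrame step, assuming pandas is installed. The result list here is made-up sample data standing in for what the magic returns; in a notebook you would pass “_” or the relevant Out[] entry instead.

```python
import pandas as pd

# Made-up stand-in for the list of dicts that the magic returns
result = [
    {"data": {"uri": "/docs/a.json", "size": 120}, "type": "application/json"},
    {"data": {"uri": "/docs/b.json", "size": 85}, "type": "application/json"},
]

# One row per part, with 'data' and 'type' columns
df = pd.DataFrame(result)

# When every part is a JSON object, the 'data' dicts can become columns of their own
details = pd.DataFrame([row["data"] for row in result])
print(details)
```

Which of the two frames is more useful depends on the query: the first keeps the content type next to each part, while the second is handier for charting when all the parts share the same JSON shape.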

It works fine with JavaScript too, with the added ease of use that JSON brings to the table:

[Screenshot: the same magic running a JavaScript cell]

Now that it’s possible to include output from a MarkLogic server, a few places come to mind where this capability might be handy: from server monitoring and project configuration (especially when matched with cell input controls) to developing code, not to forget simply having access to all that data.

It probably isn’t best to start pulling millions of lines of data from your MarkLogic server into a Notebook as a DataFrame. However, what you might do is use MarkLogic to do the heavy lifting across your structured and unstructured data that Jupyter can’t do (search, for instance, or BI analytics or semantic inference) and then pull the resulting dataset forward into the Python or R space for further statistical analysis, machine learning, fancy charting, dashboards and so on.