This is a post written by our friend Jamie Grant from AquaQ. Enjoy!
Big Data technologies have been all over the technology press for what seems like a long time now. MongoDB is an established player in the space, and one of the first examples of what have come to be called ‘NoSQL’ databases. With its schema-less design, MongoDB is well suited to storing unstructured data, and it promises many features to make it easy to scale across multiple servers. MongoDB version 2.6 also introduced support for full-text indexing and search.
With our traditional expertise in kdb+, we decided to look at ways the two technologies (kdb+ and MongoDB) might complement each other. kdb+ excels at time-series analysis of structured data, while MongoDB offers text indexing and search functionality that doesn't yet have an equivalent in kdb+.
For this case study, we're using MongoDB as the text store for a kdb+ database: offloading the storage of large text documents from the main kdb+ server and taking advantage of MongoDB text indexing for searching, while retaining the speed of kdb+ for queries and reporting on the structured data.
mongoq
mongoq is a set of libraries built on top of the MongoDB C driver, allowing access to a MongoDB database directly from q. The C functions take JSON strings as parameters, which are converted to BSON documents by libbson before being passed to MongoDB. Result sets are converted from BSON back to JSON and returned from C to q as a list of JSON strings.
Some wrapper q functions are provided which take dictionary parameters, handling the JSON encoding and decoding using j.k.
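As a rough sketch of the pattern (mgfind and mgfindraw are hypothetical names for illustration; the real mongoq internals differ), such a wrapper looks something like this:

/ hypothetical wrapper: mgfindraw stands in for the raw C binding, which
/ takes and returns JSON strings
mgfind:{[coll;query]
  res:mgfindraw[coll;.j.j query];  / encode the q dictionary parameter as JSON
  .j.k each res}                   / decode each JSON result string to a q dictionary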
Source code, documentation and examples can be found on our GitHub site here. The interface is in a POC/alpha state at the moment: MongoDB C driver API coverage isn't complete yet, and there are performance improvements and better error handling planned for the near future.
Reddit Database
For our case study dataset, we found 1GB of reddit comments on this thread.
Loading
We used a q script to parse the data, keeping the metadata in a kdb+ splayed table, with any long strings (comment text, topic, comment HTML) pushed to MongoDB and referenced by a MongoDB id stored in the kdb+ table.
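As a minimal sketch of that split (the kdb+-side field names here are assumed for illustration; the loader code below shows the real MongoDB-bound columns):

/ illustrative only: structured fields stay in kdb+; long strings go to
/ MongoDB, and the kdb+ row keeps a 16-byte GUID reference in an mgid column
row:`author`score!(`someuser;42)                     / assumed kdb+-side fields
doc:`link_title`body!("Thread title";"Comment text") / text fields bound for MongoDB
/ .mg.add returns the _id that the loader stores alongside row as mgid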
The loader script outputs the following, including a breakdown of the timings:
> q examples/comments_load.q mydb comments.all.sql
...
Processing - 1020 of 1048 MB : 97.4%
Processing - 1029 of 1048 MB : 98.3%
Processing - 1039 of 1048 MB : 99.2%
Processing - 1047 of 1048 MB : 100%
#####################################
# reading    | 0D00:00:00.743248000 #
# parsing    | 0D00:02:31.920447000 #
# mongoinsert| 0D00:03:13.843763000 #
# kdbwrite   | 0D00:01:40.621209000 #
# mongoindex | 0D00:02:31.501243000 #
# TOTAL      | 0D00:09:58.629910000 #
#####################################
Free text data always presents a parsing challenge, and this data has been imported into a SQL database and then exported back to text, so the escaping of special characters is inconsistent. We've chosen to process the .sql file rather than the .csv because it's easier to tell genuine record breaks apart from newlines inside comments. We won't get a 100% success rate at parsing every post into the correct table structure, but the script should recover within a record or two if it's thrown off by the content of a post.
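To give a flavour of the kind of heuristic this involves (splitrecs and the leading-numeric-id assumption are ours, for illustration; the actual loader logic differs in its details), a candidate break inside an INSERT statement's VALUES list can be accepted only when what follows looks like the start of a new record:

/ illustrative heuristic only: accept "),(" as a record break only when the
/ next field looks like the start of a new record (a numeric id column)
splitrecs:{[s]
  ix:ss[s;"),("];                      / candidate boundaries in the VALUES list
  ix:ix where s[3+ix] in "0123456789"; / keep breaks followed by a digit
  (0,ix+2) _ s}                        / cut the statement into per-record chunks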
The following code is where the loader script writes to the two databases: MongoDB via the mongoq library, and kdb+ by appending to a splayed table on disk.
processdata:{[t]
  / columns holding long strings, destined for MongoDB
  mgcols:`link_title`subreddit`body`body_html`author_flair_text;
  / insert the text columns and capture the returned MongoDB _ids
  oid:.mg.add[`comments;mgcols#t];
  / append the structured columns, plus the mgid reference, to the splayed table
  (` sv db,`comments`) upsert .Q.en[db] (mgcols _ t),'([]mgid:oid);
 };
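Two details of the kdb+ write are worth noting: ` sv db,`comments` joins the database root, the table name and a trailing empty symbol into a path ending in a slash, which is the form upsert needs to append to a splayed table on disk, and .Q.en enumerates the remaining symbol columns against the database's sym file, as splayed tables require.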
The .mg.add function takes a table or a list of dictionaries as its parameter, and returns a vector of MongoDB _id values as 16-byte GUIDs.
These ids can then be used for retrieval:
q)id:.mg.add[`test;([]time:12 13t;sym:`xx`yyy;price:11.2 34.5)]
q)id
00000000-54f6-4410-dfa8-3258b51290e8 00000000-54f6-4410-dfa8-3258b51290e9
q).mg.find[`test;id;()] / all fields for these ids in 'test' collection
time           sym   price
--------------------------
"12:00:00.000" "xx"  11.2
"13:00:00.000" "yyy" 34.5
Text Index
MongoDB has a text indexing feature which supports stemming and relevance scoring. We’re going to use this feature as it provides functionality not readily available in q. The mongoq interface includes an ‘addindex’ function which accepts index parameters as a json string. To create a text index on the ‘body’ field in the comments records we’re storing, we do the following at the end of the loader script:
.mg.addindex[`comments;.j.j enlist[`body]!enlist `text]
To index across all text fields, MongoDB's wildcard field name "$**" (in q, `$"$**") can be used in place of the field name.
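Following the pattern of the addindex call above, creating the wildcard index would look like this (a sketch we haven't run against the dataset):

.mg.addindex[`comments;.j.j enlist[`$"$**"]!enlist `text]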
Queries
The following are some example queries and their execution times on our machine. Queries which ship large amounts of data to or from MongoDB will be slowed down by the extra step in the bson<->json<->k object conversion; this would ideally be replaced with a direct bson<->k object conversion library.
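In practice this means preferring the narrower call where possible: .mg.searchid returns only the matching _ids, while .mg.search pulls back whole documents. For example:

ids:.mg.searchid[`comments;"IBM"]  / only 16-byte ids cross the boundary
docs:.mg.search[`comments;"IBM"]   / whole documents are converted bson -> json -> k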
TOP 10 TOPICS FOR SEARCH TERM
q)topics:{[term] 10 sublist select score, subreddit, title:40 sublist' link_title from `score xdesc 0!select max score, first subreddit by link_title from .mg.search[`comments;term]}
q)\t topics "IBM"
63
q)topics "IBM"
score    subreddit          title
----------------------------------------------------------------------
0.963724 "todayilearned"    "TIL that Bill Gates told the creators of"
0.904494 "linux"            "Anybody know a bit about older keyboards"
0.803571 "starcraft"        "Why, and how much, do Blizzard invest in"
0.779412 "gamecollecting"   "Looking for a SEGA Saturn Keyboard."
0.765306 "Bitcoin"          "Seeking Alpha''s AJ Watkinson Predicts a"
0.761811 "buildapc"         "What is the major differance between Int"
0.760274 "linux"            "Linux Desktop''s Missed Opportunities"
0.75     "learnprogramming" "Is anyone here familiar with Assembly? I"
0.666667 "apple"            "Apple''s iPhone, iPad used to place over"
0.583333 "Games"            "FIFA Manager series cancelled"
THREADS AND COMMENTS AGGREGATED BY SUBFORUM FOR SEARCH TERM
q)threads
{[term]
  / index into the kdb+ table at rows whose mgid matches a search hit
  t:comments comments[`mgid] bin m:.mg.searchid[`comments;term];
  / pull only the subreddit field back from MongoDB and join it on
  t:(`$.mg.find[`comments;m;`subreddit]),'t;
  / aggregate thread and comment counts per subreddit
  r:`threads xdesc select threads:count distinct link_id, comments:count i by subreddit from t;
  r}
q)threads "\"ivy bridge\"" / search exact term
subreddit      | threads comments
---------------| ----------------
buildapc       | 9       9
gaming         | 2       3
pcmasterrace   | 2       2
AskReddit      | 1       1
WildStar       | 1       1
battlestations | 1       1
buildapcforme  | 1       1
buildapcsales  | 1       1
dayz           | 1       1
hardware       | 1       1
linux          | 1       1
mcservers      | 1       1
programming    | 1       1
techsupportgore| 1       1
Conclusions
- MongoDB presents an interesting proposition as a text store backing a kdb+ database, opening up search features that wouldn't be easily replicated in a kdb+-only database
- Performance of the adaptor is acceptable as long as the amount of data passed between MongoDB and kdb+ is minimised. Future versions of the adaptor will skip the expensive JSON translation step in favour of a direct bson<->k object translation library
- All the adaptor code used for this post is available as open source on our GitHub site. It should be considered alpha; please report bugs and feature requests through GitHub