Data Lab: June 2014

Recently I worked on a project that uses MongoDB as the source data system, R for analysis, and MongoDB again for output storage.

In this project we faced an unusual problem. We were using R to process source data stored in MongoDB, and when we passed a large number of documents to R for analysis, it slowed down and became a bottleneck. To avoid this, we had to process a fixed number of documents in R per batch.

To achieve this we needed some kind of record number in MongoDB, but because MongoDB is a distributed database, it does not provide a built-in sequential number. Our MongoDB source was also being populated by a distributed real-time stream, which made implementing such logic on the application side impractical.

To add a batchId field to a fixed number of documents in MongoDB, we implemented the following algorithm:
1. Find documents that do not have a batchId field.
2. Sort them by a timestamp field.
3. Limit the number of documents (say 10000).
4. Set the batchId field on those documents and save them (the batchId value comes from an audit table).
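
The four steps above can be sketched in plain Java on an in-memory list, without a MongoDB driver. This is only an illustration of the logic, not the project's code: the class name BatchTagger, the method assignBatch and the Map-based document representation are assumptions made for the example, and systemTime is assumed to hold a long timestamp.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: documents are modelled as simple maps.
class BatchTagger {

    // Tags up to `limit` of the oldest untagged documents with `batchId`
    // and returns how many documents were tagged.
    static int assignBatch(List<Map<String, Object>> docs, long batchId, int limit) {
        List<Map<String, Object>> selected = docs.stream()
            // Step 1: keep documents whose batchId is missing or null.
            .filter(d -> d.get("batchId") == null)
            // Step 2: sort by the timestamp field, oldest first.
            .sorted(Comparator.comparingLong(
                (Map<String, Object> d) -> (Long) d.get("systemTime")))
            // Step 3: limit the number of documents.
            .limit(limit)
            .collect(Collectors.toList());

        // Step 4: set batchId on the selected documents.
        for (Map<String, Object> d : selected) {
            d.put("batchId", batchId);
        }
        return selected.size();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (long t : new long[] {300L, 100L, 200L}) {
            Map<String, Object> d = new HashMap<>();
            d.put("systemTime", t);
            docs.add(d);
        }
        int tagged = assignBatch(docs, 1L, 2);
        System.out.println("tagged " + tagged + " documents: " + docs);
    }
}
```

Calling assignBatch again with the next batchId picks up whatever is still untagged, so repeated calls walk through the collection in timestamp order.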

The MongoDB shell command for this is:

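// Tag the 10000 oldest documents that do not yet have a batchId.
// {batchId: null} matches documents where the field is missing or null.
db['collection1'].find({batchId: null}).sort({systemTime: 1}).limit(10000).forEach(
    function (e) {
        // get value of batchId from audit table (hard-coded to 1 here)
        e.batchId = 1;
        db['collection1'].save(e);
    }
);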

Using the above code we added batchId to the MongoDB documents and picked only the current batchId for analysis in R.
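
Picking only the current batch amounts to a filter on batchId. The sketch below shows that selection in plain Java on in-memory maps. It is an illustration only: the class name BatchReader, the method selectBatch and the Map-based documents are hypothetical, and batchId is assumed to be stored as a long.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: documents are modelled as simple maps.
class BatchReader {

    // Returns the documents whose batchId equals the requested value.
    // Documents without a batchId are never returned.
    static List<Map<String, Object>> selectBatch(List<Map<String, Object>> docs, long batchId) {
        return docs.stream()
            .filter(d -> Long.valueOf(batchId).equals(d.get("batchId")))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Long id : new Long[] {1L, 2L, 1L, null}) {
            Map<String, Object> d = new HashMap<>();
            if (id != null) {
                d.put("batchId", id);
            }
            docs.add(d);
        }
        System.out.println("batch 1 holds " + selectBatch(docs, 1L).size() + " documents");
    }
}
```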

The Java code for the above MongoDB shell command is:

Friday, 27 June 2014

Update Fixed number of MongoDB Documents