Data Lab: September 2014

Tuesday 30 September 2014

What is Write Concern in MongoDB?

In MongoDB there are multiple guarantee levels available for reporting the success of a write operation, called Write Concerns. The strength of a write concern determines the level of guarantee. A weak Write Concern gives better performance at the cost of a weaker guarantee, while a strong Write Concern gives a stronger guarantee because clients wait for the write operation to be confirmed.

MongoDB provides different levels of write concern to better address the specific needs of applications. Clients may adjust write concern to ensure that the most important operations persist successfully to an entire MongoDB deployment. For other less critical operations, clients can adjust the write concern to ensure faster performance rather than ensure persistence to the entire deployment.

Write Concern Levels

MongoDB has the following levels of conceptual write concern, listed from weakest to strongest:

Unacknowledged
With an unacknowledged write concern, MongoDB does not acknowledge the receipt of write operations. Unacknowledged is similar to errors ignored; however, drivers will attempt to receive and handle network errors when possible. The driver’s ability to detect network errors depends on the system’s networking configuration.

Write operation to a mongod instance with write concern of unacknowledged. The client does not wait for any acknowledgment.

Acknowledged
With a receipt acknowledged write concern, the mongod confirms the receipt of the write operation. Acknowledged write concern allows clients to catch network, duplicate key, and other errors. This is the default write concern.

Write operation to a mongod instance with write concern of acknowledged. The client waits for acknowledgment of success or exception.

Journaled
With a journaled write concern, MongoDB acknowledges the write operation only after committing the data to the journal. This write concern ensures that MongoDB can recover the data following a shutdown or power interruption.
You must have journaling enabled to use this write concern.

Write operation to a mongod instance with write concern of journaled. The mongod sends acknowledgment after it commits the write operation to the journal.

Replica Acknowledged
Replica sets present additional considerations with regard to write concern. The default write concern only requires acknowledgment from the primary. With replica acknowledged write concern, you can guarantee that the write operation propagates to additional members of the replica set.

Write operation to a replica set with write concern level of w:2, that is, a write to the primary and at least one secondary.
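To make the four levels easier to compare, here is a minimal sketch in plain Java that maps each conceptual level to the w and j options carried by a write concern document. The enum ConceptualWriteConcern and its methods are names I made up for illustration; they are not classes from the MongoDB Java driver, and a real application would use the driver's own write concern type instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: NOT a MongoDB driver class. Each constant records the
// w (number of acknowledging members) and j (wait for journal) options
// that correspond to one conceptual level from the list above.
enum ConceptualWriteConcern {
    UNACKNOWLEDGED(0, false),       // w:0, client does not wait
    ACKNOWLEDGED(1, false),         // w:1, the default
    JOURNALED(1, true),             // w:1 plus j:true
    REPLICA_ACKNOWLEDGED(2, false); // w:2, primary and one secondary

    private final int w;
    private final boolean j;

    ConceptualWriteConcern(int w, boolean j) {
        this.w = w;
        this.j = j;
    }

    // Builds the option document, e.g. {w=1, j=true} for JOURNALED.
    Map<String, Object> toDocument() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("w", w);
        if (j) {
            doc.put("j", true);
        }
        return doc;
    }

    // True when the client waits for at least one acknowledgment.
    boolean clientWaits() {
        return w > 0;
    }
}

final class WriteConcernSketch {
    public static void main(String[] args) {
        for (ConceptualWriteConcern wc : ConceptualWriteConcern.values()) {
            System.out.println(wc + " -> " + wc.toDocument()
                    + ", client waits: " + wc.clientWaits());
        }
    }
}
```

Reading the table from top to bottom, each level asks the server to do more before it replies, which is where the extra latency of the stronger levels comes from.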

Hive UDF to get Latitude and Longitude

In my previous post I explained Hive GenericUDF.
In this post I will give an example of a Hive GenericUDF that gets the latitude and longitude of a given location using the Google Geocoding API. Let's call this Hive function GeoEncodeUDF. GeoEncodeUDF takes a String location and returns an array of Float containing the latitude and longitude.

For obtaining the latitude and longitude I am using the Google Geocoding API, which is available at http://maps.googleapis.com/maps/api/geocode/json?address=<address>. It returns a JSON object containing the matching places and their latitude and longitude. The response may contain multiple addresses, but for the sake of simplicity I take the latitude and longitude of the first one. I have created a helper method getLatLng in the class GeoLatLng, which takes a location string and returns the latitude and longitude in an array of float. This class is given below -
The GenericUDF is GeoEncodeUDF
I have overridden the initialize(), evaluate() and getDisplayString() methods, which I have already described in my previous post.
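As a rough illustration of the parsing step that a helper like getLatLng has to perform, here is a minimal sketch that pulls the first lat/lng pair out of a geocoding response. It is not the GeoLatLng class from this post: the class name GeoJsonSketch, the method firstLatLng and the trimmed-down sample JSON are my own assumptions, and a regular expression stands in for a proper JSON parser only to keep the example free of dependencies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the GeoLatLng class described in the post.
final class GeoJsonSketch {
    // Matches the first occurrence of: "location" : { "lat" : <num>, "lng" : <num>
    private static final Pattern LOCATION = Pattern.compile(
            "\"location\"\\s*:\\s*\\{\\s*\"lat\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)"
            + "\\s*,\\s*\"lng\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)");

    // Returns {lat, lng} of the first result, or null if none is found.
    static float[] firstLatLng(String json) {
        Matcher m = LOCATION.matcher(json);
        if (!m.find()) {
            return null;
        }
        return new float[] {
            Float.parseFloat(m.group(1)),
            Float.parseFloat(m.group(2))
        };
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed shape results[0].geometry.location.
        String sample = "{ \"results\" : [ { \"geometry\" : { \"location\" : "
                + "{ \"lat\" : 20.593683, \"lng\" : 78.96288 } } } ], "
                + "\"status\" : \"OK\" }";
        float[] latLng = firstLatLng(sample);
        System.out.println(latLng[0] + ", " + latLng[1]);
    }
}
```

Returning null when nothing matches lets the calling UDF emit a NULL for locations the service cannot resolve, instead of failing the whole query.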

Now to use this UDF in Hive we need to create a jar file of this UDF and add it to Hive. The commands to add this UDF to Hive are -

ADD JAR /path/to/HiveUDF.jar;
CREATE TEMPORARY FUNCTION geo_points AS 'com.rishav.hadoop.hive.ql.udf.generic.GeoEncodeUDF';

Now we can use geo_points function on any table having address string like this -

hive> select geo_points("india") from test.x limit 1;
[20.593683,78.96288]

This HQL returns an array containing the latitude and longitude. To get them as separate columns use -

hive> select latlng[0], latlng[1] FROM (select geo_points("india") as latlng from test.x) tmp limit 1;
20.593683 78.96288

Introduction to Hive UDFs

Apache Hive comes with a set of pre-defined User Defined Functions (UDFs). A complete listing of Hive UDFs is available here. Some common UDFs are unix_timestamp(), to_date(string timestamp), concat(string|binary A, string|binary B...), etc. However, sometimes a custom UDF is needed to solve a specific problem.

In this post I will go through the process of creating custom UDFs.

Difference between UDF and GenericUDF
Hive UDFs are written in Java. In order to create a Hive UDF you need to derive from one of two classes: UDF or GenericUDF. GenericUDF is a bit more complex to develop than UDF, but it offers better performance and it supports non-primitive types as input parameters and return types.

To write a custom UDF by extending GenericUDF we need to override 3 methods: initialize(), evaluate() and getDisplayString().

initialize()
This method only gets called once per JVM at the beginning to initialize the UDF. initialize() is used to assert and validate the number and types of the parameters that the UDF takes. It also returns an ObjectInspector corresponding to the return type of the UDF.

evaluate()
This method is called once for every row of data being processed. Here the actual logic for transformation/processing of each row is written. It will return an object containing the result of processing logic.

getDisplayString()
A simple method that returns the display string for the UDF when EXPLAIN is used.
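The call order of the three methods can be sketched without any Hive dependency. The interface MiniUdf below is a hypothetical, heavily simplified stand-in that I wrote for illustration; it is not the GenericUDF API (which works with ObjectInspector and DeferredObject types), and the function name upper_sketch is invented. The sketch only shows the contract described above: validate once in initialize(), then run evaluate() for every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the three GenericUDF methods; NOT the Hive API.
interface MiniUdf {
    String initialize(String[] argumentTypes);   // validate arguments, report return type
    Object evaluate(Object[] arguments);         // per-row processing logic
    String getDisplayString(String[] children);  // text shown in the query plan
}

final class UpperUdf implements MiniUdf {
    int initCalls = 0; // counters only exist to make the call pattern visible
    int evalCalls = 0;

    @Override
    public String initialize(String[] argumentTypes) {
        initCalls++;
        if (argumentTypes.length != 1 || !"string".equals(argumentTypes[0])) {
            throw new IllegalArgumentException(
                    "upper_sketch expects exactly one string argument");
        }
        return "string";
    }

    @Override
    public Object evaluate(Object[] arguments) {
        evalCalls++;
        Object value = arguments[0];
        return value == null ? null : value.toString().toUpperCase(Locale.ROOT);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "upper_sketch(" + String.join(", ", children) + ")";
    }
}

final class UdfLifecycleSketch {
    // Drives the UDF as described in the text: initialize once, evaluate per row.
    static List<Object> run(MiniUdf udf, List<String> rows) {
        udf.initialize(new String[] {"string"});
        List<Object> out = new ArrayList<>();
        for (String row : rows) {
            out.add(udf.evaluate(new Object[] {row}));
        }
        return out;
    }

    public static void main(String[] args) {
        UpperUdf udf = new UpperUdf();
        System.out.println(run(udf, List.of("delhi", "mumbai")));
        System.out.println(udf.getDisplayString(new String[] {"city"}));
        System.out.println("initialize calls: " + udf.initCalls
                + ", evaluate calls: " + udf.evalCalls);
    }
}
```

Because the validation lives in initialize(), a wrong argument count or type is rejected once, before any row is processed, rather than on every call to evaluate().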

Apart from these, we can also use the following annotations -

@UDFType(deterministic = true)

A deterministic UDF is one which always gives the same result when passed the same parameters. Examples of such UDFs are length(string input), regexp_replace(string initial_string, string pattern, string replacement), etc. A non-deterministic UDF, on the other hand, can return different results for the same set of parameters. For example, unix_timestamp() returns the current timestamp using the default time zone. Therefore, when unix_timestamp() is invoked with the same parameters (no parameters) at different times, different results are obtained, making it non-deterministic. This annotation allows Hive to perform some optimization if the UDF is deterministic.

@Description(name="my_udf", value="This will be the result returned by the describe function statement.", extended="This will be the result returned by the describe function extended statement.")

This annotation tells Hive the name of your UDF. It will also be used to populate the result of queries like `DESCRIBE FUNCTION MY_UDF` or `DESCRIBE FUNCTION EXTENDED MY_UDF`.

In my next post I will give an example of a GenericUDF that gets the latitude and longitude of a location.

Tuesday 16 September 2014

HBase: MapReduce On Multiple Input Table

Starting with version 0.94.5, HBase supports reading multiple tables as input to MapReduce jobs using the MultiTableInputFormat class.
In this post I am giving an example of a MapReduce job which reads from two HBase tables, performs some aggregation on one table, merges it (a SQL UNION ALL operation) with the content of the second table, and stores the result in an output table.

The first table is 'storeSales' table and it has store-wise sales for each date. The create statements are -

create 'storeSales', 'cf1'
put 'storeSales', '20130101#1', 'cf1:sSales', '100'
put 'storeSales', '20130101#2', 'cf1:sSales', '110'
put 'storeSales', '20130102#1', 'cf1:sSales', '200'
put 'storeSales', '20130102#2', 'cf1:sSales', '210'

The second table is 'onlineSales' table and it has online sales for each date. The create statements are -
create 'onlineSales', 'cf2'
put 'onlineSales', '20130101', 'cf2:oSales', '400'
put 'onlineSales', '20130102', 'cf2:oSales', '130'

Using a MapReduce job I am going to merge the store sales, aggregated at date level, with the online sales.
Let's create an output table for the same -
create 'totalSales', 'cf1'

The mapper class for this job is -

Note that in the mapper I get the table name of the current split and use a different context.write depending on that table name. If your source tables have rowkeys with different prefixes, you can use those prefixes to choose the context.write logic instead.

The reducer class for this job is -

In the reducer I aggregate the values based on the intermediate key.
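The keying and aggregation logic can be tried out in plain Java without a cluster. The sketch below is not the mapper and reducer of this job; the class UnionSketch and its methods are names I made up, and a TreeMap stands in for the shuffle and reduce phases. It assumes the intermediate keys are prefixed with s# for store sales and o# for online sales, which matches the row keys that appear in the totalSales output of this post.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the job's map and reduce logic.
final class UnionSketch {
    // Map step: derive the intermediate key from the source table and row key.
    static String intermediateKey(String table, String rowKey) {
        if ("storeSales".equals(table)) {
            // Row key is <date>#<storeId>; keep only the date so that all
            // stores of one day end up under the same key.
            return "s#" + rowKey.split("#")[0];
        }
        // onlineSales is already keyed by date.
        return "o#" + rowKey;
    }

    // Reduce step: sum all values that share an intermediate key.
    // Each input row is {table, rowKey, value}.
    static Map<String, Integer> union(String[][] rows) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String[] row : rows) {
            String key = intermediateKey(row[0], row[1]);
            totals.merge(key, Integer.parseInt(row[2]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Same sample data as the put statements above.
        String[][] rows = {
            {"storeSales", "20130101#1", "100"},
            {"storeSales", "20130101#2", "110"},
            {"storeSales", "20130102#1", "200"},
            {"storeSales", "20130102#2", "210"},
            {"onlineSales", "20130101", "400"},
            {"onlineSales", "20130102", "130"},
        };
        union(rows).forEach((key, total) -> System.out.println(key + " = " + total));
    }
}
```

For the sample data this gives 210 and 410 for the two store dates and leaves the online values 400 and 130 unchanged, since each online key receives only one value.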

Finally, the driver class for this job is -

In the driver there are 2 HBase Scans, one for each input table, and I pass these scans in a list to the TableMapReduceUtil.initTableMapperJob method.

Package the job into a jar file (hbase-union.jar) and execute the commands below to invoke the MapReduce job -
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar hbase-union.jar com.rishav.hbase.union.UnionJob

Once the job is complete, use the HBase shell to verify the output -
hbase(main):034:0> scan 'totalSales'
ROW                                        COLUMN+CELL
o#20130101                                column=cf1:tSales, timestamp=1410848221034, value=\x00\x00\x01\x90
o#20130102                                column=cf1:tSales, timestamp=1410848221034, value=\x00\x00\x00\x82
s#20130101                                column=cf1:tSales, timestamp=1410848221034, value=\x00\x00\x00\xD2
s#20130102                                column=cf1:tSales, timestamp=1410848221034, value=\x00\x00\x01\x9A
4 row(s) in 0.0410 seconds
hbase(main):035:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x01\x90".to_java_bytes)
=> 400
hbase(main):036:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x00\x82".to_java_bytes)
=> 130
hbase(main):037:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x00\xD2".to_java_bytes)
=> 210
hbase(main):038:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x01\x9A".to_java_bytes)
=> 410
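The values show up as escaped bytes because the job stores each total as a 4-byte big-endian integer. Outside the HBase shell the same decoding can be done with the standard library; the sketch below uses java.nio.ByteBuffer, and the class name SalesValueDecoder is just a placeholder for illustration.

```java
import java.nio.ByteBuffer;

// Decodes the 4-byte cell values shown in the scan output above.
final class SalesValueDecoder {
    // ByteBuffer reads in big-endian order by default, which is the
    // byte order of the values in the scan output.
    static int toInt(byte[] fourBytes) {
        return ByteBuffer.wrap(fourBytes).getInt();
    }

    public static void main(String[] args) {
        byte[][] cells = {
            {0x00, 0x00, 0x01, (byte) 0x90}, // o#20130101
            {0x00, 0x00, 0x00, (byte) 0x82}, // o#20130102
            {0x00, 0x00, 0x00, (byte) 0xD2}, // s#20130101
            {0x00, 0x00, 0x01, (byte) 0x9A}, // s#20130102
        };
        for (byte[] cell : cells) {
            System.out.println(toInt(cell));
        }
    }
}
```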

MultiTableInputFormat can also be used for HBase table joins; I shall try that some time.