In MongoDB there are multiple guarantee levels available for reporting the success of a write operation, called Write Concerns. The strength of a write concern determines the level of guarantee: a weak write concern gives better performance at the cost of weaker guarantees, while a strong write concern gives stronger guarantees because clients wait for MongoDB to confirm the write operation.
MongoDB provides different levels of write concern to better address
the specific needs of applications. Clients may adjust write concern to
ensure that the most important operations persist successfully to an
entire MongoDB deployment. For other less critical operations, clients
can adjust the write concern to ensure faster performance rather than
ensure persistence to the entire deployment.
Write Concern Levels
MongoDB has the following levels of conceptual write
concern, listed from weakest to strongest:
Unacknowledged
With an unacknowledged write concern, MongoDB does not acknowledge
the receipt of write operations. Unacknowledged is similar to
errors ignored; however, drivers will attempt to receive and handle
network errors when possible. The driver’s ability to detect network
errors depends on the system’s networking configuration.
Acknowledged
With a receipt acknowledged write concern, the mongod confirms the receipt of the write operation. Acknowledged write
concern allows clients to catch network, duplicate key, and other
errors. This is the default write concern.
Journaled
With a journaled write concern, MongoDB acknowledges the write
operation only after committing the data to the journal. This
write concern ensures that MongoDB can recover the data following a
shutdown or power interruption.
You must have journaling enabled to use this write concern.
Replica Acknowledged
Replica sets present additional considerations with regard to write concern. The default write concern only requires acknowledgement from the primary.
With replica acknowledged write concern, you can guarantee that the write operation propagates to additional members of the replica set. For example, a write operation issued against a replica set with a write concern level of w:2 is acknowledged only after the write has reached the primary and at least one secondary.
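To make these levels concrete, here is a minimal sketch using the MongoDB Java driver (the com.mongodb.client API); the connection string, database, collection and documents are just placeholders:

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernExample {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");

        // Unacknowledged: fire-and-forget, no acknowledgement from the server
        orders.withWriteConcern(WriteConcern.UNACKNOWLEDGED)
              .insertOne(new Document("sku", "A1"));

        // Acknowledged (the default): the primary confirms receipt, so network
        // and duplicate key errors surface to the client
        orders.withWriteConcern(WriteConcern.ACKNOWLEDGED)
              .insertOne(new Document("sku", "A2"));

        // Journaled: acknowledged only after the write is committed to the on-disk journal
        orders.withWriteConcern(WriteConcern.JOURNALED)
              .insertOne(new Document("sku", "A3"));

        // Replica acknowledged (w:2): the primary plus at least one secondary
        orders.withWriteConcern(WriteConcern.W2)
              .insertOne(new Document("sku", "A4"));

        client.close();
    }
}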
In my previous post I explained Hive GenericUDF.
In this post I will give an example of a Hive GenericUDF that gets the latitude and longitude of a given location using the Google Geocoding API. Let's call this Hive function GeoEncodeUDF. The GeoEncodeUDF function takes a String location and returns an array of Float containing the latitude and longitude.
For obtaining latitude and longitude information I am using the Google geocode API, available at http://maps.googleapis.com/maps/api/geocode/json?address=<address>, which returns a JSON object containing matching places and their latitude and longitude. This might return multiple addresses, but for the sake of simplicity I am taking the first address's latitude and longitude. I have created a helper method getLatLng in class GeoLatLng which takes a location string and returns the latitude and longitude in an array of float. This class is given below -
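A minimal sketch of this helper, assuming the org.json library and the results[0].geometry.location structure of the geocode response, looks like this -

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import org.json.JSONObject;

public class GeoLatLng {
    private static final String GEOCODE_URL = "http://maps.googleapis.com/maps/api/geocode/json?address=";

    // Returns {latitude, longitude} of the first matching address, or null if the lookup fails.
    public static float[] getLatLng(String location) {
        try {
            URL url = new URL(GEOCODE_URL + URLEncoder.encode(location, "UTF-8"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
            reader.close();
            // Take the first result's geometry.location from the JSON response
            JSONObject loc = new JSONObject(json.toString()).getJSONArray("results")
                    .getJSONObject(0).getJSONObject("geometry").getJSONObject("location");
            return new float[] { (float) loc.getDouble("lat"), (float) loc.getDouble("lng") };
        } catch (Exception e) {
            return null;
        }
    }
}

Any lookup failure is swallowed and reported as null, which keeps the UDF code simple at the cost of hiding errors.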
@Description(name = "GeoEncodeUDF", value = "Get Lat-Lng", extended = "fetches location co-ordinates for given location from Google geocode Api and returns an ARRAY of 2 floats [lat,lng]")
@UDFType(deterministic = true)
public class GeoEncodeUDF extends GenericUDF {
private ArrayList<FloatWritable> result;
// Verify the input is of the required type.
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
// Exactly one input argument
if( arguments.length != 1 ) {
throw new UDFArgumentLengthException(GeoEncodeUDF.class.getSimpleName() + " accepts exactly one argument.");
}
// Is the input a String
if (((PrimitiveObjectInspector)arguments[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING ) {
throw new UDFArgumentTypeException(0,"The single argument to " +GeoEncodeUDF.class.getSimpleName() + " should be String but " + arguments[0].getTypeName() + " is found");
Apache Hive comes with a
set of pre-defined User Defined Functions (UDFs). A
complete listing of Hive UDFs is available here. Some common UDFs are unix_timestamp(), to_date(string timestamp), concat(string|binary A, string|binary B...), etc. However, sometimes a custom UDF is needed to solve specific problems.
In this post I will go through the process of creating custom UDFs.
Difference between UDF and GenericUDF
Hive UDFs are written in Java. In order to create a Hive UDF you need to derive from one of two classes: UDF or GenericUDF. GenericUDF is a bit more complex to develop than UDF, but it offers better performance and supports all non-primitive parameters as input parameters and return types.
To write a custom UDF by extending GenericUDF we need to override three methods: initialize(), evaluate() and getDisplayString().
initialize() This method gets called only once per JVM, at the beginning, to initialize the UDF. initialize() is used to assert and validate the number and type of parameters that the UDF takes and the type of argument it returns. It also returns an ObjectInspector corresponding to the return type of the UDF.
evaluate() This method is called once for every row of data being processed. The actual logic for transforming/processing each row is written here. It returns an object containing the result of the processing logic.
getDisplayString() A simple method for returning the display string for the UDF when explain is used.
Apart from these, we can also use the following annotations -
@UDFType(deterministic = true)
A deterministic UDF is one which always gives the same result when passed the same parameters. Examples of such UDFs are length(string input), regexp_replace(string initial_string, string pattern, string replacement), etc. A non-deterministic UDF, on the other hand, can return different results for the same set of parameters. For example, unix_timestamp() returns the current timestamp using the default time zone; invoked with the same parameters (no parameters) at different times, it produces different results, making it non-deterministic. This annotation allows Hive to perform some optimizations if the UDF is deterministic.
@Description(name="my_udf", value="This will be the result
returned by explain statement.", extended="This will be result returned by the
explain extended statement.")
This annotation tells Hive the
name of your UDF. It will also be used to populate the result of queries
like `DESCRIBE FUNCTION MY_UDF` or `DESCRIBE FUNCTION EXTENDED MY_UDF`.
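Putting the three methods and the two annotations together, a minimal, purely illustrative GenericUDF that returns the length of a string might look like this (the my_strlen name and StrLenUDF class are placeholders):

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

@Description(name = "my_strlen", value = "Returns the length of a string.", extended = "Example: SELECT my_strlen(name) FROM users;")
@UDFType(deterministic = true)
public class StrLenUDF extends GenericUDF {
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Validate the number of arguments (a real UDF would also check the type)
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("my_strlen accepts exactly one argument.");
        }
        // Declare the return type: a single int
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Called once per row
        Object value = arguments[0].get();
        return value == null ? null : new IntWritable(value.toString().length());
    }

    @Override
    public String getDisplayString(String[] children) {
        // Shown in EXPLAIN output
        return "my_strlen(" + children[0] + ")";
    }
}

Once packaged in a jar, such a UDF is registered with ADD JAR and CREATE TEMPORARY FUNCTION, and the annotation text shows up in DESCRIBE FUNCTION output.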
In my next post I will give an example of a GenericUDF to get the latitude and longitude of a location.
Starting with version 0.94.5, HBase supports reading multiple tables as input to MapReduce jobs using the MultiTableInputFormat class.
In this post I am giving an example of a MapReduce job which reads from two HBase tables, performs some aggregation on one table, merges it (a SQL UNION ALL operation) with the content of the second table, and stores the result in an output table.
The first table is 'storeSales' table and it has store-wise sales for each date. The create statements are -
create 'storeSales', 'cf1'
put 'storeSales', '20130101#1', 'cf1:sSales', '100'
put 'storeSales', '20130101#2', 'cf1:sSales', '110'
put 'storeSales', '20130102#1', 'cf1:sSales', '200'
put 'storeSales', '20130102#2', 'cf1:sSales', '210'
The second table is the 'onlineSales' table and it has online sales for each date. The create statements are -
create 'onlineSales', 'cf2'
put 'onlineSales', '20130101', 'cf2:oSales', '400'
put 'onlineSales', '20130102', 'cf2:oSales', '130'
Using a MapReduce job I am going to merge aggregated (at date level) store sales with online sales.
Let's create an output table for the same -
create 'totalSales', 'cf1'
The mapper class for this job is -
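A sketch of such a mapper, with illustrative names (the SalesMapper class, and "S:"/"O:" prefixes to tag the source table), is -

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class SalesMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
            throws IOException, InterruptedException {
        // Which table does the current split belong to?
        String tableName = Bytes.toString(((TableSplit) context.getInputSplit()).getTableName());
        String row = Bytes.toString(rowKey.get());
        if ("storeSales".equals(tableName)) {
            // rowkey is <date>#<storeId>; emit the date with a tagged store-sales value
            String sales = Bytes.toString(columns.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("sSales")));
            context.write(new Text(row.split("#")[0]), new Text("S:" + sales));
        } else {
            // onlineSales rowkey is the date itself
            String sales = Bytes.toString(columns.getValue(Bytes.toBytes("cf2"), Bytes.toBytes("oSales")));
            context.write(new Text(row), new Text("O:" + sales));
        }
    }
}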
Note that in the mapper I am getting the table name of the current split and using a different context.write based on the table name. If your source tables have rowkeys with different prefixes, you can also use those prefixes to drive the context.write logic.
The reducer class for this job is -
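A sketch of the reducer along these lines is given below; the output layout (row key <date>#<source> and a cf1:tSales column in 'totalSales') is an illustrative choice for the UNION ALL-style result:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class SalesReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text date, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long storeTotal = 0;
        Long onlineTotal = null;
        for (Text value : values) {
            String[] parts = value.toString().split(":");
            if ("S".equals(parts[0])) {
                storeTotal += Long.parseLong(parts[1]); // aggregate store sales at date level
            } else {
                onlineTotal = Long.parseLong(parts[1]);
            }
        }
        // One output row per source for the given date (UNION ALL semantics)
        Put storePut = new Put(Bytes.toBytes(date.toString() + "#store"));
        storePut.add(Bytes.toBytes("cf1"), Bytes.toBytes("tSales"), Bytes.toBytes(String.valueOf(storeTotal)));
        context.write(null, storePut);
        if (onlineTotal != null) {
            Put onlinePut = new Put(Bytes.toBytes(date.toString() + "#online"));
            onlinePut.add(Bytes.toBytes("cf1"), Bytes.toBytes("tSales"), Bytes.toBytes(String.valueOf(onlineTotal)));
            context.write(null, onlinePut);
        }
    }
}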
Based on the intermediate key and value I am doing the aggregation in the reducer.
Finally, the driver class for this job is -
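A sketch of the driver, which tags each Scan with its source table name so that MultiTableInputFormat knows which table to read (the SalesDriver class name is illustrative), could be:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class SalesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "merge store and online sales");
        job.setJarByClass(SalesDriver.class);

        // One Scan per source table; the table name travels as a Scan attribute
        List<Scan> scans = new ArrayList<Scan>();
        Scan storeScan = new Scan();
        storeScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("storeSales"));
        scans.add(storeScan);
        Scan onlineScan = new Scan();
        onlineScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("onlineSales"));
        scans.add(onlineScan);

        // This overload wires in MultiTableInputFormat under the hood
        TableMapReduceUtil.initTableMapperJob(scans, SalesMapper.class, Text.class, Text.class, job);
        TableMapReduceUtil.initTableReducerJob("totalSales", SalesReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}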