Querying MongoDB via Apache Pig

This crisp post is on querying MongoDB for HDFS data transfer, via Pig.

Below are the steps involved for the same:

1. Install MongoDB on the box
a. Download the MongoDB binaries
b. Extract MongoDB and export the bin path to $PATH
c. Create a db directory for MongoDB
d. Start MongoDB with the command: mongod --dbpath <path-to-db-dir>

INFO: Mongo listens on port 27017 by default.

2. Start the Mongo shell
a. Go to the Mongo installation dir in a new terminal
b. $> ./bin/mongo
c. Type exit to exit the Mongo shell

3. Load data into Mongo
a. Create a JSON data file for importing data into Mongo
b. Call mongoimport to import the data file into Mongo:
mongoimport --db test_db --collection docs --file /data/yash-tests/data.json
INFO: test_db need not exist before executing the command. docs is the collection name.
c. Verify data import:
$> ./bin/mongo
mongo> show dbs
mongo> use test_db
mongo> show collections
mongo> db["docs"].find()
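For reference, a hypothetical data.json matching the records dumped later in this post — mongoimport expects one JSON document per line:

```json
{ "FirstName": "Bruce",  "LastName": "Wayne",   "Email": "bwayne@Wayneenterprises.com" }
{ "FirstName": "Lucius", "LastName": "Fox",     "Email": "lfox@Wayneenterprises.com" }
{ "FirstName": "Dick",   "LastName": "Grayson", "Email": "dgrayson@Wayneenterprises.com" }
```

Mongo adds the _id field automatically on import, so it need not appear in the file.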

4. PIG:

4.1. Get the Avro, JSON Simple and piggybank JARs.
INFO: Avro is a schema-based data serialization system.
JSON Simple provides JSON parsing capabilities, and piggybank contains several Pig utility UDFs.

REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

4.2. Register the Mongo JARs:
a. Download the Mongo Java driver (https://github.com/mongodb/mongo-java-driver/downloads) (or, http://central.maven.org/maven2/org/mongodb/mongo-java-driver/)
b. Download the Mongo hadoop-core & hadoop-pig JARs and build them with sbt:
– https://github.com/mongodb/mongo-hadoop/archive/master.zip
– $> ./sbt package

c. REGISTER /data/yash-tests/jars/mongo-hadoop-master/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

5. Load data in Pig
data = LOAD 'mongodb://localhost/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);
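Once loaded, the relation behaves like any other Pig relation. A sketch of a downstream step (the output path is a placeholder, not from the original post) that projects one column and writes it to HDFS as CSV:

```pig
-- Hypothetical follow-up: project the Email column from the loaded relation
-- and store it to HDFS via PigStorage with a comma delimiter.
emails = FOREACH data GENERATE Email;
STORE emails INTO '/data/yash-tests/out/emails' USING PigStorage(',');
```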

Final Script

REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

REGISTER /data/yash-tests/jars/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

data = LOAD 'mongodb://localhost:27017/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);

EXPLAIN data;
DUMP data;

-- OUTPUT:
(Bruce,Wayne,bwayne@Wayneenterprises.com)
(Lucius,Fox,lfox@Wayneenterprises.com)
(Dick,Grayson,dgrayson@Wayneenterprises.com)

-- Loading mongo data without providing schema info
raw = LOAD 'mongodb://localhost:27017/test_db.docs' USING com.mongodb.hadoop.pig.MongoLoader;

OUTPUT:

([FirstName#Bruce,LastName#Wayne,Email#bwayne@Wayneenterprises.com,_id#533521760d94ec446461a335])
([FirstName#Lucius,LastName#Fox,Email#lfox@Wayneenterprises.com,_id#533521760d94ec446461a336])
([FirstName#Dick,LastName#Grayson,Email#dgrayson@Wayneenterprises.com,_id#533521760d94ec446461a337])
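As the output above shows, loading without a schema gives each record as a single Pig map. A sketch (not from the original post) of pulling individual values out of that map with Pig's # dereference operator:

```pig
-- Each record is one map field; '#' dereferences a key by name.
-- Field names here match the sample documents imported earlier.
names = FOREACH raw GENERATE $0#'FirstName' AS FirstName, $0#'Email' AS Email;
DUMP names;
```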

Hope the post was helpful. Cheers \m/

Yash Sharma is a Big Data & Machine Learning Engineer, a newbie open-source contributor; he plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform designs.
