Querying MongoDB via Apache Pig

This crisp post is on querying MongoDB via Apache Pig, for transferring data into HDFS.

Below are the steps involved:

1. Install MongoDB on the box
a. Download the MongoDB binaries
b. Extract MongoDB and export the bin path to $PATH
c. Create a db directory for MongoDB
d. Start MongoDB with the command: mongod --dbpath <db-dir> (see the shell sketch after the INFO note below)

INFO: Mongo listens on port 27017 by default.
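
For reference, a minimal shell sketch of step 1, assuming a tarball install; the archive name, version and the /data/mongo-db directory are placeholders, not paths from the original setup:

$> tar -xzf mongodb-linux-x86_64-<version>.tgz                  # extract the downloaded binaries
$> export PATH=$PATH:$PWD/mongodb-linux-x86_64-<version>/bin    # put mongod/mongo on the PATH
$> mkdir -p /data/mongo-db                                      # db directory for MongoDB
$> mongod --dbpath /data/mongo-db                               # start the server (listens on 27017 by default)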

2. Start the Mongo shell
a. Go to the Mongo installation directory in a new terminal
b. $> ./bin/mongo
c. Type exit to leave the Mongo shell

3. Load data into Mongo
a. Create a JSON data file for importing data into Mongo (a sample data.json is sketched at the end of this step)
b. Call mongoimport to import the data file into Mongo:
mongoimport --db test_db --collection docs --file /data/yash-tests/data.json
INFO: test_db need not exist before executing the command; docs is the collection name.
c. Verify data import:
$> ./bin/mongo
mongo> show dbs
mongo> use test_db
mongo> show collections
mongo> db["docs"].find()
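
For reference, a sample data.json matching the records dumped later in this post, with one JSON document per line as mongoimport expects (the _id values are generated automatically on import):

{"FirstName": "Bruce", "LastName": "Wayne", "Email": "bwayne@Wayneenterprises.com"}
{"FirstName": "Lucius", "LastName": "Fox", "Email": "lfox@Wayneenterprises.com"}
{"FirstName": "Dick", "LastName": "Grayson", "Email": "dgrayson@Wayneenterprises.com"}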

4. PIG:

4.1. Get the Avro, JSON Simple and piggybank JARs.
INFO: Avro is a schema-based data serialization system.
JSON Simple provides JSON parsing capabilities, and piggybank contains several Pig utility UDFs.

REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

4.2. Register the Mongo JARs:
a. Download the Mongo Java driver (https://github.com/mongodb/mongo-java-driver/downloads, or http://central.maven.org/maven2/org/mongodb/mongo-java-driver/)
b. Download the mongo-hadoop core & pig sources and build them with sbt:
- https://github.com/mongodb/mongo-hadoop/archive/master.zip
- $> ./sbt package

c. REGISTER /data/yash-tests/jars/mongo-hadoop-master/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

5. Load the data in Pig
data = LOAD 'mongodb://localhost/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);
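
Since the goal here is moving the data onto HDFS, the loaded relation can then be written out to an HDFS path. A minimal sketch, using a placeholder output path and PigStorage for simplicity (the AvroStorage alias defined above is another option for writing Avro files):

-- write the relation pulled from MongoDB out to HDFS (the path below is a placeholder,
-- resolved against the default filesystem when Pig runs in mapreduce mode)
STORE data INTO '/user/yash/mongo_docs' USING PigStorage('\t');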

Final Script

 

REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

REGISTER /data/yash-tests/jars/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

data = LOAD 'mongodb://localhost:27017/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);

EXPLAIN data;
DUMP data;

-- OUTPUT:
(Bruce,Wayne,bwayne@Wayneenterprises.com)
(Lucius,Fox,lfox@Wayneenterprises.com)
(Dick,Grayson,dgrayson@Wayneenterprises.com)

-- Loading mongo data without providing schema info
raw = LOAD 'mongodb://localhost:27017/test_db.docs' USING com.mongodb.hadoop.pig.MongoLoader;
DUMP raw;

OUTPUT:

 ([FirstName#Bruce,LastName#Wayne,Email#bwayne@Wayneenterprises.com,_id#533521760d94ec446461a335])
 ([FirstName#Lucius,LastName#Fox,Email#lfox@Wayneenterprises.com,_id#533521760d94ec446461a336])
 ([FirstName#Dick,LastName#Grayson,Email#dgrayson@Wayneenterprises.com,_id#533521760d94ec446461a337])
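
When MongoLoader is used without a schema, each document comes back as a single map-typed field; a minimal sketch of projecting fields out of that map (the alias names below are illustrative):

-- the schema-less relation has one map field ($0); pull values out by key
names = FOREACH raw GENERATE $0#'FirstName' AS FirstName, $0#'Email' AS Email;
DUMP names;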

 

 

Hope the post was helpful. Cheers \m/

Yash Sharma is a Big Data & Machine Learning Engineer and a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform designs.
