This crisp post is on querying MongoDB for HDFS data transfer via Pig.
Below are the steps involved:
1. Install MongoDB on the box
a. Download the MongoDB binaries
b. Extract MongoDB and export its bin path to $PATH
c. Create a db dir for MongoDB
d. Start MongoDB with the command: mongod --dbpath <db-dir> (see the shell sketch below)
INFO: Mongo listens on port 27017 by default.
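As a reference, a minimal shell sketch of step 1 (the MongoDB version, download URL, and paths here are assumptions; substitute your own):
$> wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.9.tgz
$> tar -xzf mongodb-linux-x86_64-2.4.9.tgz -C /usr/local/
$> export PATH=$PATH:/usr/local/mongodb-linux-x86_64-2.4.9/bin
$> mkdir -p /data/db                # db dir for mongod
$> mongod --dbpath /data/db         # listens on 27017 by default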
2. Start Mongo Shell
a. Go to the Mongo installation dir in a new terminal
b. $> ./bin/mongo
c. type exit to exit mongo shell
3. Load data into Mongo
a. Create a JSON data file for importing data into Mongo (a sample is shown below)
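For example, a data.json matching the records used later in this post (one JSON document per line, which is what mongoimport expects):
{"FirstName": "Bruce", "LastName": "Wayne", "Email": "bwayne@Wayneenterprises.com"}
{"FirstName": "Lucius", "LastName": "Fox", "Email": "lfox@Wayneenterprises.com"}
{"FirstName": "Dick", "LastName": "Grayson", "Email": "dgrayson@Wayneenterprises.com"}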
b. Call mongoimport to import the data file into Mongo:
mongoimport --db test_db --collection docs --file /data/yash-tests/data.json
INFO: test_db need not exist before executing the command. docs is the collection name.
c. Verify data import:
$> ./bin/mongo
mongo> show dbs
mongo> use test_db
mongo> show collections
mongo> db["docs"].find()
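find() should list the imported documents, roughly like the following (the _id values are auto-generated ObjectIds and will differ on your box):
{ "FirstName" : "Bruce", "LastName" : "Wayne", "Email" : "bwayne@Wayneenterprises.com", "_id" : ObjectId("533521760d94ec446461a335") }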
4. PIG:
4.1. Get the Avro, JSON Simple, and piggybank JARs.
INFO: Avro is a schema-based data serialization system.
JSON Simple provides JSON parsing capabilities, and piggybank contains several Pig utility UDFs.
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
4.2. Register Mongo JARS:
a. Download the Mongo Java driver (https://github.com/mongodb/mongo-java-driver/downloads or http://central.maven.org/maven2/org/mongodb/mongo-java-driver/)
b. Download the mongo-hadoop source (containing the hadoop-core & hadoop-pig modules) and build it with sbt:
– https://github.com/mongodb/mongo-hadoop/archive/master.zip
– $> ./sbt package
c. REGISTER /data/yash-tests/jars/mongo-hadoop-master/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar
5. Load data in pig
data = LOAD 'mongodb://localhost/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);
Final Script
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/avro-1.7.4.jar
REGISTER /usr/local/pig-0.12.0/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /usr/local/pig-0.12.0/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
REGISTER /data/yash-tests/jars/mongo-2.10.1.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER /data/yash-tests/jars/mongo-hadoop-master/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar
data = LOAD 'mongodb://localhost:27017/test_db.docs'
USING com.mongodb.hadoop.pig.MongoLoader('FirstName:chararray, LastName:chararray, Email:chararray')
AS (FirstName, LastName, Email);
EXPLAIN data;
DUMP data;
-- OUTPUT:
(Bruce,Wayne,bwayne@Wayneenterprises.com)
(Lucius,Fox,lfox@Wayneenterprises.com)
(Dick,Grayson,dgrayson@Wayneenterprises.com)
-- Loading mongo data without providing schema info
raw = LOAD 'mongodb://localhost:27017/test_db.docs' USING com.mongodb.hadoop.pig.MongoLoader;
DUMP raw;
OUTPUT:
([FirstName#Bruce,LastName#Wayne,Email#bwayne@Wayneenterprises.com,_id#533521760d94ec446461a335])
([FirstName#Lucius,LastName#Fox,Email#lfox@Wayneenterprises.com,_id#533521760d94ec446461a336])
([FirstName#Dick,LastName#Grayson,Email#dgrayson@Wayneenterprises.com,_id#533521760d94ec446461a337])
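Finally, to land the data in HDFS, the loaded relation can be written out with the AvroStorage UDF defined above. A minimal sketch (the HDFS output path is illustrative; pick your own):
-- store the Mongo records into HDFS as Avro files (output path is illustrative)
STORE data INTO '/user/hadoop/mongo_docs_avro' USING AvroStorage();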
Hope the post was helpful. Cheers \m/