Indexing csv data in Solr via Python – PySolr

Here is a short post on indexing data in Solr using Python.

1. Install prerequisites
   – pip
   – PySolr

2. Python script

    #!/usr/bin/python
    import sys, getopt
    import pysolr
    import csv, json

    #SOLR_URL=http://54.254.192.149:8983/solr/feeddata/

    def main(args):
        solrurl = ''
        inputfile = ''
        try:
            opts, args = getopt.getopt(args, "hi:u:")
        except getopt.GetoptError:
            print 'index_data.py -i -u '
            sys.exit(2)
        for opt, arg in opts:
            if …
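
Since the excerpt above is cut off, here is a minimal sketch of the general idea: read a CSV file and send each row to Solr as a document. It is not the original script. It assumes Python 3, a CSV file whose first row holds the field names, and a Solr core that accepts those fields; `rows_to_docs`, `index_csv` and the sample data are hypothetical names introduced for illustration.

```python
import csv
import io


def rows_to_docs(csv_text):
    """Parse CSV text (first row = header) into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def index_csv(solr_url, csv_path):
    """Read a CSV file and send every row to Solr as one document."""
    import pysolr  # third-party; imported here so the parsing step works without it

    with open(csv_path, newline="") as handle:
        docs = rows_to_docs(handle.read())
    solr = pysolr.Solr(solr_url, timeout=10)
    solr.add(docs)
    solr.commit()
    return len(docs)


# Hypothetical sample data, only to show the shape of the documents.
sample = "id,title\n1,first post\n2,second post\n"
print(rows_to_docs(sample))
```

Calling `index_csv` needs a reachable Solr core and the `pysolr` package; the parsing step can be tried on its own, as in the last two lines.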

How to get Pig Logical plan (Execution DAG) from Pig Latin script

TLDR; A Pig logical plan is the plan DAG that is used to execute the chain of jobs on Hadoop. Here is the code snippet for obtaining a Pig Latin logical plan DAG from a Pig script: https://github.com/yssharma/pig-on-drill/blob/b2d8a23c11d03974e16eb2ff44e021b1e957f03f/exec/java-exec/src/main/java/org/apache/drill/exec/pigparser/parser/PigLatinParser.java#L53

Yash Sharma is a Big Data & Machine Learning Engineer, a newbie open-source contributor, plays guitar …

PySolr: How to boost a field for a Solr document

A quick note on boosting a field for a Solr document with PySolr.

Index-time boosting:

    conn.add(docs, boost={'author': '2.0',})

Query-time boosting:

    qf=title^5 content^2 comments^0.5

Read: http://java.dzone.com/articles/options-tune-document%E2%80%99s
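
The query-time form can be passed through PySolr's `search` call as extra keyword arguments. The sketch below assumes the (e)dismax query parser, which is the one that reads `qf`; `build_qf` and `boosted_search` are hypothetical helper names, not part of PySolr. Only the query-time variant is sketched because, as far as I know, index-time boosts were dropped in Solr 7.

```python
def build_qf(weights):
    """Turn {'title': 5, ...} into the 'title^5 ...' string Solr expects in qf."""
    return " ".join(f"{field}^{weight:g}" for field, weight in weights.items())


def boosted_search(solr_url, text, weights):
    """Run an edismax query with per-field boosts (needs pysolr and a live Solr)."""
    import pysolr  # third-party; imported here so build_qf works without it

    solr = pysolr.Solr(solr_url, timeout=10)
    return solr.search(text, defType="edismax", qf=build_qf(weights))


# Prints: title^5 content^2 comments^0.5
print(build_qf({"title": 5, "content": 2, "comments": 0.5}))
```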

JSolr Exception – Exception in thread “main” org.apache.solr.common.SolrException: Bad Request

Exception in thread "main" org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://54.254.192.149:8983/solr/feeddata/update?wt=javabin&version=2

Solution: Check the Solr logs.

    INFO - 2014-11-07 07:04:42.985; org.apache.solr.update.processor.LogUpdateProcessor; [feeddata] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
    ERROR - 2014-11-07 07:04:42.985; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

The cause is in the ERROR line: Document is missing mandatory uniqueKey field: id

Another instance:

    INFO - …
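
One way to catch this particular cause before the request reaches Solr is to check each document for the unique key on the client side. The post concerns the Java client, so the sketch below only illustrates the idea, written in Python. It assumes the schema's uniqueKey is `id`, as in the log line above; `missing_unique_key` and the sample documents are hypothetical.

```python
def missing_unique_key(docs, key="id"):
    """Return the positions of documents that lack a non-empty unique key."""
    return [i for i, doc in enumerate(docs) if doc.get(key) in (None, "")]


# Hypothetical documents: the second has no id, the third has an empty one.
docs = [{"id": "1", "rank": 3}, {"rank": 7}, {"id": "", "rank": 9}]
bad = missing_unique_key(docs)
if bad:
    print("documents missing 'id' at positions:", bad)
```

Documents flagged this way can be fixed or skipped before the update is sent, which avoids the bare "Bad Request" on the client.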

Indexing CSV data file in Solr – Using annotated Java POJOs

1. Java POJO: Add the Java POJO with the required fields.

    import org.apache.solr.client.solrj.beans.Field;

    /**
     * Created by yash on 18/11/14.
     */
    public class ProductBean {
        @Field
        private int id;
        @Field("rank")
        private int rank;
        @Field("prodid")
        private long prodid;
        @Field("cat")
        private int cat;
        @Field("subcat")
        private int subcat;

        public ProductBean(){} // Required by Solr to initialize bean.

        public …

Minimal Hadoop and Yarn installation

This is the best tutorial I have come across so far. Keeping a note of it 🙂 Check it out, you might love it too. https://raseshmori.wordpress.com/2012/09/23/install-hadoop-2-0-1-yarn-nextgen/

Yash Sharma is a Big Data & Machine Learning Engineer and a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby. Talk to Yash about distributed systems and data platform designs. http://www.confusedcoders.com

Minimal Spark hello world

1. Build with sbt: Create a build.sbt file. It manages all the dependencies and settings that would otherwise have been in your pom file.

    import AssemblyKeys._
    import sbtassembly.Plugin._

    name := "FeedSystem"

    version := "1.0"

    scalaVersion := "2.10.5"

    organization := "com.snapdeal"

    resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/"

    libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.1" % "provided",
      "org.apache.spark" % "spark-mllib_2.10" …

Apache Drill access via Java JDBC API

Here is a quick draft on accessing Apache Drill via the Java JDBC API.

1. Add the Drill dependency:

    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-jdbc</artifactId>
      <version>1.1.0</version>
    </dependency>

2. Java code to access Drill:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;
    import java.util.Properties;

    import org.apache.drill.exec.ExecConstants;
    import org.apache.drill.exec.client.DrillClient;
    import org.apache.drill.exec.proto.UserBitShared.QueryType;
    import org.apache.drill.exec.rpc.RpcException;
    import org.apache.drill.exec.rpc.user.QueryDataBatch;
    import org.apache.drill.jdbc.Driver;

    public class …
