Indexing CSV data in Solr via Python – PySolr

Here is a short post on indexing CSV data in Solr using Python.

1. Install Prerequisites

– pip
– PySolr
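
Before running the script, it can help to confirm that PySolr is importable from the Python environment you plan to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pysolr"):
    print("pysolr is available")
else:
    print("pysolr is missing; install it with: pip install pysolr")
```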

2. Python Script


import sys, getopt
import pysolr
import csv

USAGE = 'Usage: -i INPUTFILE -u SOLRURL'


def main(args):
    inputfile = None
    solrurl = None
    try:
        opts, args = getopt.getopt(args, "hi:u:")
    except getopt.GetoptError:
        print(USAGE)
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(USAGE)
            sys.exit()
        elif opt == '-i':
            inputfile = arg
        elif opt == '-u':
            solrurl = arg

    # both options are required
    if not inputfile or not solrurl:
        print(USAGE)
        sys.exit(2)

    # create a connection to a solr server
    s = pysolr.Solr(solrurl, timeout=10)
    keys = ("rank", "pogid", "cat", "subcat", "question_bucketid", "brand",
            "discount", "age_grp", "gender", "inventory", "last_updated")
    items = []
    with open(inputfile, 'r', newline='') as f:
        # one document per CSV row; the row number is used as the id
        for record_count, row in enumerate(csv.reader(f), start=1):
            # add record for indexing
            item = dict(zip(keys, row))
            item["id"] = record_count
            items.append(item)

    # send all documents in one request and commit
    s.add(items, commit=True)
    print('Done !!')


if __name__ == "__main__":
    main(sys.argv[1:])
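
The row-to-document mapping in the script can be tried out without a running Solr server. Below is a minimal sketch under a few assumptions: the helper names `rows_to_docs` and `index_in_batches`, the `batch_size` default, and the sample row are all invented for illustration. The `solr` argument is expected to be a `pysolr.Solr` object, which provides `add()` and `commit()`.

```python
import csv

# Field names taken from the script above, in CSV column order.
KEYS = ("rank", "pogid", "cat", "subcat", "question_bucketid", "brand",
        "discount", "age_grp", "gender", "inventory", "last_updated")


def rows_to_docs(lines, keys=KEYS):
    """Turn CSV lines into Solr documents, using the row number as the id."""
    docs = []
    for record_count, row in enumerate(csv.reader(lines), start=1):
        doc = dict(zip(keys, row))
        doc["id"] = record_count
        docs.append(doc)
    return docs


def index_in_batches(solr, docs, batch_size=1000):
    """Send docs to Solr in chunks, then commit once at the end."""
    for start in range(0, len(docs), batch_size):
        solr.add(docs[start:start + batch_size], commit=False)
    solr.commit()


# Made-up sample row, only to show the shape of the output.
sample = ["1,p100,shoes,sports,q7,acme,10,18-25,M,42,2015-01-01"]
docs = rows_to_docs(sample)
print(docs[0]["id"], docs[0]["brand"])  # prints: 1 acme
```

Sending the documents in chunks and committing once at the end is one way to keep each request small for large files; whether it is needed depends on the size of your CSV.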

3. Troubleshooting

You might run into a couple of errors like the ones below. Check the Solr logs for the root cause and solution.
– IP address error
– Undefined field error
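
An undefined field error typically means a document contains a field name that the Solr schema does not define. As a rough pre-flight check, you can compare the keys in your documents against the fields you expect the schema to have. This is only a sketch: `find_undefined_fields` is a made-up helper, and `schema_fields` is a hand-maintained set that you would have to keep in line with your own schema, not something read from Solr.

```python
def find_undefined_fields(docs, schema_fields):
    """Return the sorted field names used in docs but missing from schema_fields."""
    used = set()
    for doc in docs:
        used.update(doc.keys())
    return sorted(used - set(schema_fields))


# Hypothetical schema for illustration: "brand" is deliberately left out.
schema_fields = {"id", "rank", "pogid"}
docs = [{"id": 1, "rank": "1", "pogid": "p100", "brand": "acme"}]
print(find_undefined_fields(docs, schema_fields))  # prints: ['brand']
```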

Yash Sharma is a Big Data & Machine Learning Engineer and a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform design.
