Knowledge Graph Part 2: Modelling tabular data as a graph

In our previous post, we discussed getting started with knowledge graphs, where we only saw how to install Neo4j in Docker.

In this post, we will discuss modelling the data as a graph. We use the Stack Overflow 2018 Developer Survey from Kaggle to show how we can push tabular data into our graph database.

Let's look at the data

Let’s take a look at an overview of the data to work out which columns we can use for our knowledge graph.

Here are a few of the column names, with descriptions, from the dataset:

Modelling Data for graphical representation

Nodes represent the important entities/subjects/objects in our graph. By having multiple types of node we can take advantage of the connected nature of the graph. For example, if our dataset contains data about users and places, with some metadata/attributes for both, we create a User node and a Place node and connect them with a relationship/edge (e.g. User LIVES-IN Place).
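To make the idea concrete, here is a minimal sketch of that pattern as a parameterised Cypher statement built in Python. The User/Place/LIVES_IN names come from the example above; the property names and sample values are purely illustrative:

```python
# Hypothetical User LIVES_IN Place example; names and values are made up.
user = {"id": 1, "name": "Alice"}
place = {"name": "Berlin"}

# MERGE creates each node/relationship only if it does not already exist.
cypher = (
    "MERGE (u:User {id: $user_id, name: $user_name}) "
    "MERGE (p:Place {name: $place_name}) "
    "MERGE (u)-[:LIVES_IN]->(p)"
)
params = {
    "user_id": user["id"],
    "user_name": user["name"],
    "place_name": place["name"],
}
```

Passing values as parameters (rather than string-formatting them into the query) lets Neo4j cache the query plan and avoids injection issues.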

So for our dataset, we will do the same: identify the important nodes. In this project, we will select only a few columns for our knowledge graph.

Identifying Nodes for our knowledge graph

Person Node

In this node, we will see all the attributes related to a person like :

  • User_id
  • code_as_hobby
  • contributes_to_open_source
  • is_student
  • employment_status
  • company_size
  • total_years_of_coding_experience

Country Node

We have a country name in our dataset, which we will use as the id of the Country node. We will use the user id to create a connection/edge between the Person and Country nodes. Country doesn’t have any other attributes. The columns used are:

  • User_id
  • Country

Major Node

Similarly, for major, we will select the following columns to create the nodes and connections:

  • User_id
  • UndergradMajor


Work domain Node

Similarly, for work domain:

  • User_id
  • DevType

At the end of this activity, a single user’s Person node is connected to its Country, Major and Work Domain nodes.

Getting to the code

By this time, Neo4j should already be running in the background; refer to Part 1 for installation details. Here we will use py2neo for bulk insertion of our dataset into the graph database.

Configuring the graph database
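The original configuration snippet is not reproduced here, but with py2neo it amounts to connecting to the running Neo4j instance. A minimal sketch, assuming the default bolt port and placeholder credentials (replace both with your own setup from Part 1):

```python
def get_graph():
    # Imported inside the function so the rest of the pipeline can be
    # unit-tested without py2neo installed or a server running.
    from py2neo import Graph

    # Assumed defaults; replace the URI and credentials with your own.
    return Graph("bolt://localhost:7687", auth=("neo4j", "password"))
```

Calling `get_graph()` returns a `Graph` handle that the insertion functions below can share.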

Creating uniqueness constraints
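A sketch of what the constraint setup could look like, assuming one uniqueness constraint per node label keyed on the id columns described earlier (the property names here are assumptions, and the `ASSERT` form is Neo4j 3.x Cypher syntax):

```python
# One uniqueness constraint per node label, keyed on its id property.
# Property names (user_id, name) are assumptions for this sketch.
CONSTRAINTS = [
    "CREATE CONSTRAINT ON (p:Person) ASSERT p.user_id IS UNIQUE",
    "CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE",
    "CREATE CONSTRAINT ON (m:Major) ASSERT m.name IS UNIQUE",
    "CREATE CONSTRAINT ON (w:WorkDomain) ASSERT w.name IS UNIQUE",
]

def create_constraints(graph):
    # `graph` is a py2neo Graph. Note: Neo4j raises an error if a
    # constraint already exists, so run this only once per database.
    for stmt in CONSTRAINTS:
        graph.run(stmt)
```

Uniqueness constraints also create an index on the property, which speeds up the MERGE lookups during bulk insertion.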

Insert data into graph

Function : read_data()

A utility function that reads our dataset into a pandas data frame. We pass this data frame to the other functions that add data to the graph.
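A sketch of what read_data() might look like (the default file name is an assumption based on the Kaggle export; adjust the path to wherever you saved the survey CSV):

```python
import io

import pandas as pd

def read_data(path="survey_results_public.csv"):
    # Read the survey CSV into a data frame; low_memory=False avoids
    # mixed-dtype warnings on a file this wide.
    return pd.read_csv(path, low_memory=False)

# Tiny in-memory demo so the sketch runs without the Kaggle file.
demo = read_data(io.StringIO("User_id,Country\n1,India\n2,Germany\n"))
```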

Function : process_user_data()

This function takes the data frame as input and extracts the attributes/columns we are interested in. It also defines the Neo4j Cypher query for bulk insertion into the graph.

The Cypher query uses the UNWIND clause, which expects a list of dictionaries as its parameter, so we also transform our data into this format before running the query.
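Under those assumptions, process_user_data() might look like the following sketch. The column names mirror the attribute list given earlier (the real survey column names differ), and the exact SET clause is illustrative:

```python
import pandas as pd

def process_user_data(df):
    """Extract person attributes and build an UNWIND-based bulk-insert query."""
    cols = [
        "User_id", "code_as_hobby", "contributes_to_open_source",
        "is_student", "employment_status", "company_size",
        "total_years_of_coding_experience",
    ]
    # UNWIND expects a list of dictionaries, bound to the $rows parameter.
    rows = df[cols].to_dict("records")
    query = """
    UNWIND $rows AS row
    MERGE (p:Person {user_id: row.User_id})
    SET p.code_as_hobby = row.code_as_hobby,
        p.open_source = row.contributes_to_open_source,
        p.is_student = row.is_student,
        p.employment_status = row.employment_status,
        p.company_size = row.company_size,
        p.years_coding = row.total_years_of_coding_experience
    """
    return rows, query

# Single-row demo with made-up values, just to show the output shape.
demo = pd.DataFrame([{
    "User_id": 1, "code_as_hobby": "Yes", "contributes_to_open_source": "No",
    "is_student": "No", "employment_status": "Employed full-time",
    "company_size": "20 to 99 employees",
    "total_years_of_coding_experience": "3-5 years",
}])
rows, query = process_user_data(demo)
```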


Function : run_neo_query()

This utility function takes a data frame and a query as input and runs the query on Neo4j.
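A sketch of run_neo_query(). Passing the py2neo Graph in as an argument (rather than using a global) is a design choice made here so the function can be exercised with a stand-in object, without a live server:

```python
def run_neo_query(graph, rows, query):
    # `graph` is expected to behave like py2neo's Graph: .run(query, **params)
    # sends the query with `rows` bound to the $rows query parameter.
    return graph.run(query, rows=rows)

# Stand-in for py2neo's Graph so the sketch runs without a Neo4j server.
class FakeGraph:
    def run(self, query, **params):
        return query, params

result_query, result_params = run_neo_query(
    FakeGraph(), [{"User_id": 1}], "UNWIND $rows AS row RETURN row"
)
```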

Function : get_batches()

The data frame contains approximately 100k rows. Running all of them in a single query might crash before completion, so this small utility function breaks a larger list into smaller batches.
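get_batches() can be sketched as a simple generator (the batch size of 10,000 is an assumption; tune it to your memory and transaction limits):

```python
def get_batches(items, batch_size=10_000):
    """Yield successive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 25 items with batch_size=10 gives three batches: 10 + 10 + 5 items.
batches = list(get_batches(list(range(25)), batch_size=10))
```

Each batch (a list of dictionaries) is then passed to run_neo_query() in its own transaction.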

Other functions

Similarly, we apply the same logic to add the country, major and work domain nodes. The work domain column contains multiple values separated by semicolons (;), which we need to explode into multiple rows before insertion.
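The explode step for the semicolon-separated DevType column can be sketched with pandas (requires pandas >= 0.25 for `DataFrame.explode`; the function name here is hypothetical):

```python
import pandas as pd

def explode_dev_types(df):
    # Split "Back-end developer;Full-stack developer" into a list,
    # then give each value its own row, keeping the User_id.
    out = df.copy()
    out["DevType"] = out["DevType"].str.split(";")
    return out.explode("DevType").reset_index(drop=True)

demo = pd.DataFrame({
    "User_id": [1],
    "DevType": ["Back-end developer;Full-stack developer"],
})
exploded = explode_dev_types(demo)
```

Each exploded row then becomes one (Person)-[:WORKS_AS]->(WorkDomain) style connection in the bulk-insert query.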

Once we have inserted the data, we can see all the nodes and relationships in Neo4j. We can also run Cypher queries from the Neo4j web UI for data analysis.


A graph view of multiple users shows how their Person nodes link to shared Country, Major and Work Domain nodes.

All the code: https://github.com/nikkisharma536/knowledge_graph/tree/master/data_extraction

Hope you like the post. Stay tuned for the next post in the series, where we will discuss creating a REST API to expose this data to applications and users.

Check out my portfolio here: https://confusedcoders.com/nikita-sharma-greenhorn-data-science-student

I am a greenhorn Data Science student interested in finding patterns in data. My language of choice is Python, and I am starting to get my hands dirty with R.

I blog on Medium.com [1] and ConfusedCoders.com [2]. I share my code on Github.com [3].

  1.  https://medium.com/@nikkisharma536
  2. https://confusedcoders.com/author/nikita
  3. https://github.com/nikkisharma536
