In our previous post, we discussed getting started with knowledge graphs, where we covered installing Neo4j in Docker.
In this post, we will discuss modelling the data as a graph. We use the Stack Overflow 2018 Developer Survey from Kaggle to show how to push data into our graph database.
Let's look at the data
Let's take an overview of the data to understand which columns we can use for our knowledge graph.
Here are a few of the column names in the data, with descriptions:
Modelling Data for graphical representation
Nodes represent important entities/subjects/objects in our graph. By having multiple types of nodes, we can take advantage of the connected nature of the graph. For example, if our dataset contains data about users and places, with some metadata/attributes for both, we will create User and Place nodes and connect them with a relation/edge (e.g. User LIVES-IN Place).
We will do the same for our dataset by identifying its important nodes. In this project, we will select only a few columns for our knowledge graph.
Identifying Nodes for our knowledge graph
Person Node
In this node, we will see all the attributes related to a person like :
- User_id
- code_as_hobby
- contributes_to_open_source
- is_student
- employment_status
- company_size
- total_years_of_coding_experience
Country Node
We have a country name in our dataset, which we will use as the id for our Country node. We will use the user id to create a connection/edge between the Person and Country nodes. Country doesn't have any other attributes.
- User_id
- Country
Major Node
Similarly for the major, we will select the following columns to create the nodes and connection:
- User_id
- UndergradMajor
Work domain Node
Similarly for the work domain:
- User_id
- DevType
At the end of this activity, this is what the connections look like for a single user:
Getting to the code
By this time, Neo4j should already be running in the background (refer to part 1 for installation details). We will use py2neo to bulk-insert our dataset into the graph database.
Configuring Graph database
Creating uniqueness constraints
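As a minimal sketch, here is how the connection and constraints can be set up with py2neo. The Bolt URL, credentials, and the exact label/property names are assumptions based on a default Docker setup and the nodes described above; adjust them to match your own environment. The uniqueness constraints keep `MERGE` fast and prevent duplicate nodes.

```python
# Sketch: connect to the local Neo4j instance and create uniqueness
# constraints. URL and credentials are assumptions -- match them to
# your own Docker setup from part 1.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Label/property names mirror the nodes described above.
for label, key in [("Person", "user_id"), ("Country", "name"),
                   ("Major", "name"), ("WorkDomain", "name")]:
    graph.schema.create_uniqueness_constraint(label, key)
```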
Insert data into graph
Function : read_data()
A utility function that reads our dataset into a pandas data frame. We will pass this data frame to the other functions that add data to the graph.
Function : process_user_data()
This function takes a data frame as input and extracts the attributes/columns we are interested in. It also defines the Neo4j Cypher query for bulk insertion into the graph.
The Cypher query uses the UNWIND clause, which expects a list of dictionaries. We will also transform our data into this format before running the query.
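A sketch of this step is below. The mapping from survey column names ("Respondent", "Hobby", etc.) to our node attributes is an assumption based on the 2018 survey schema; rename the keys to match your actual data frame.

```python
import pandas as pd

def process_user_data(df):
    """Extract Person attributes and build an UNWIND-based Cypher query.

    Column names are assumptions based on the 2018 survey schema.
    """
    cols = {"Respondent": "user_id",
            "Hobby": "code_as_hobby",
            "OpenSource": "contributes_to_open_source",
            "Student": "is_student",
            "Employment": "employment_status",
            "CompanySize": "company_size",
            "YearsCoding": "total_years_of_coding_experience"}
    subset = df[list(cols)].rename(columns=cols)
    # UNWIND expects a list of dictionaries, one per row.
    rows = subset.to_dict("records")
    query = """
    UNWIND $rows AS row
    MERGE (p:Person {user_id: row.user_id})
    SET p.code_as_hobby = row.code_as_hobby,
        p.contributes_to_open_source = row.contributes_to_open_source,
        p.is_student = row.is_student,
        p.employment_status = row.employment_status,
        p.company_size = row.company_size,
        p.total_years_of_coding_experience = row.total_years_of_coding_experience
    """
    return rows, query
```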
Function : run_neo_query()
This utility function takes the prepared data and a query as input and runs the query against Neo4j.
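One way to sketch it: take a connected py2neo `Graph`, the list of row dictionaries, and the query, and run the query batch by batch. The signature and batch size are assumptions, not the post's exact code.

```python
def run_neo_query(graph, data, query, batch_size=1000):
    """Run an UNWIND-style Cypher query against Neo4j in batches.

    `graph` is a connected py2neo Graph; `data` is a list of
    dictionaries matching the query's $rows parameter.
    """
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # py2neo passes keyword arguments as Cypher parameters.
        graph.run(query, rows=batch)
```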
Function : get_batches()
The data frame contains approximately 100k rows. Running all of these in a single query might crash before completion, so this small utility function breaks a larger list into smaller batches.
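A minimal sketch of such a batching helper:

```python
def get_batches(items, batch_size=1000):
    """Yield successive batch_size-sized chunks from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```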
Other functions
Similarly, we apply the same logic to add the country, major, and work domain. The work domain column contains multiple values separated by semicolons (;), which we need to explode into multiple rows before insertion.
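The exploding step can be sketched with pandas as below; the column names ("Respondent", "DevType") are assumptions based on the 2018 survey schema, and `DataFrame.explode` requires pandas 0.25 or newer.

```python
import pandas as pd

def explode_dev_types(df):
    """Split semicolon-separated DevType values into one row each."""
    out = df[["Respondent", "DevType"]].dropna()
    # Turn "a;b;c" into ["a", "b", "c"] ...
    out = out.assign(DevType=out["DevType"].str.split(";"))
    # ... then give each list element its own row.
    out = out.explode("DevType")
    out["DevType"] = out["DevType"].str.strip()
    return out
```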
Once the data is inserted, we can see all the nodes and relationships in Neo4j. We can also run Cypher queries from the Neo4j web UI for data analysis.
The following graph shows the relationships of multiple users:
All the code : https://github.com/nikkisharma536/knowledge_graph/tree/master/data_extraction.
Hope you like the post. Stay tuned for the next post in the series, where we will discuss creating a REST API to expose this data to applications and users.