How to create custom NER in spaCy

Named Entity Recognition (NER)

NER is also known as entity identification or entity extraction. It is the process of identifying predefined entities in a text, such as person names, organisations and locations. A statistical model is trained on a labelled data set and then used to extract this information from new text.

Sometimes we want to extract information that is specific to our domain or industry. For example, in the medical domain we may want to extract diseases, symptoms or medications. In that case we need to create our own custom NER.

spaCy

It is an open source software library for advanced Natural Language Processing (NLP).

The spaCy NER system uses a word embedding strategy based on sub-word features and Bloom embeddings, together with a 1D Convolutional Neural Network (CNN).

  • Bloom Embedding : It is similar to a word embedding but is a more space-optimised representation. Each word is hashed to several rows of a compact table, so most words still end up with a distinct vector.
  • 1D CNN : It is applied over the input text to classify a sentence or word into a set of predetermined categories.
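To make the Bloom embedding idea more concrete, here is a simplified sketch of the hashing trick behind it. It is an illustration only and not spaCy's actual implementation: the table size, vector width, number of hashes and the use of MD5 are all assumptions chosen for readability, and bloom_embed is a hypothetical helper.

```python
import hashlib
import random

N_ROWS = 1000   # assumed table size, far smaller than a real vocabulary
WIDTH = 8       # assumed vector width
N_HASHES = 4    # assumed number of hash functions per word

# A small table of random vectors standing in for learned weights.
rng = random.Random(0)
TABLE = [[rng.uniform(-1, 1) for _ in range(WIDTH)] for _ in range(N_ROWS)]


def row_index(word, seed):
    """Hash a word, together with a seed, to a row of the table."""
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
    return int(digest, 16) % N_ROWS


def bloom_embed(word):
    """Sum the rows selected by several differently seeded hashes."""
    vector = [0.0] * WIDTH
    for seed in range(N_HASHES):
        row = TABLE[row_index(word, seed)]
        vector = [v + r for v, r in zip(vector, row)]
    return vector


print(bloom_embed("fever"))
```

Two words may share one row, but they rarely share all of their rows, which is why a small table can still tell a large vocabulary apart.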

How spaCy works

  1. It tokenises the text, i.e. breaks the input sentence up into words, each of which is mapped to a word embedding.
  2. Words are then broken up into features, which are aggregated into a representative number.
  3. This number is then fed to a fully connected neural network, which makes a classification based on the weights assigned to each feature within the text.

How to train spaCy

  • Training data : Annotated data containing both the texts and their labels
    • Text : The input text the model should predict a label for.
    • Label : The label the model should predict.
  • Gradient : Calculates how to change the weights to improve the predictions. (The model compares the predicted label with the actual label and adjusts its weights so that the correct action scores higher next time.)
  • Finally, save the model.

spaCy Training Data Format

spaCy expects annotated training data in a particular format: each example pairs a text with the character offsets and labels of the entities it contains.
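As a sketch of one commonly used layout, each training example can be written as a tuple of the text and a dictionary holding (start, end, label) character offsets. The sentences and the MEDICATION, SYMPTOM and DISEASE labels below are made up for illustration and are not taken from the original post.

```python
# Each example: (text, {"entities": [(start_char, end_char, label), ...]})
# The end offset is exclusive, as in Python slicing.
TRAIN_DATA = [
    (
        "The patient was prescribed ibuprofen for a headache.",
        {"entities": [(27, 36, "MEDICATION"), (43, 51, "SYMPTOM")]},
    ),
    (
        "She was diagnosed with diabetes.",
        {"entities": [(23, 31, "DISEASE")]},
    ),
]

# Sanity check: print the text span that each annotation points at.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])
```

Printing the spans like this is a cheap way to catch off-by-one offsets before training.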

Code walkthrough

Load the model, or create an empty model

We can create an empty model and train it with our annotated dataset, or we can use an existing spaCy model and re-train it with our annotated data.

  • We can create an empty model using spacy.blank("en"), or we can load an existing spaCy model using spacy.load("model_name").
  • We can check the list of pipeline component names with the nlp.pipe_names attribute.
  • If we don't have the entity recogniser in the pipeline, we need to create the ner pipeline component using nlp.create_pipe("ner") and add it to our model pipeline using the nlp.add_pipe method. (In spaCy v3 and later, nlp.add_pipe("ner") creates and adds the component in one call.)
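The steps above can be sketched as follows. This is a minimal example that assumes spaCy v3 is installed, so it uses the v3 form of nlp.add_pipe. It starts from a blank English pipeline rather than a downloaded model.

```python
import spacy

# Start from a blank English pipeline; spacy.load("model_name") would
# load an installed pretrained pipeline instead.
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Add an entity recogniser only if the pipeline does not have one yet.
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")  # spaCy v3 style: create and add in one call
else:
    ner = nlp.get_pipe("ner")

print(nlp.pipe_names)
```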

Adding Labels or entities

In order to train the model with our annotated data, we need to add the labels (entities) we want to extract from our text.

  1. We can add each new entity label from our annotated data to the entity recogniser using ner.add_label().
  2. As we are only focusing on entity extraction, we disable all other pipeline components so that only ner is trained, using nlp.disable_pipes() (renamed nlp.select_pipes() in spaCy v3).
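A minimal sketch of these two steps, again assuming spaCy v3. The training sentences and labels are invented for illustration, and the sentencizer is added only so that there is another component to disable.

```python
import spacy

# Made-up annotated examples in (text, {"entities": [...]}) form.
TRAIN_DATA = [
    (
        "The patient was prescribed ibuprofen for a headache.",
        {"entities": [(27, 36, "MEDICATION"), (43, 51, "SYMPTOM")]},
    ),
    (
        "She was diagnosed with diabetes.",
        {"entities": [(23, 31, "DISEASE")]},
    ),
]

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # an extra component, just to have one to disable
ner = nlp.add_pipe("ner")

# Register every label that appears in the annotations.
for _text, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)
print(ner.labels)

# Inside this block only the ner component is active (spaCy v3 API).
with nlp.select_pipes(enable=["ner"]):
    print(nlp.pipe_names)

# On leaving the block, the disabled components are restored.
print(nlp.pipe_names)
```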

Training and updating the model

  • We train our model for a number of iterations so that it can learn from the data effectively.
  • At each iteration, the training data is shuffled so that the model doesn't make any generalisations based on the order of the examples.
  • We update the model at each iteration using nlp.update().
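The loop described above might look like the following sketch, which assumes spaCy v3 (Example objects and nlp.initialize). The toy data, the 20 iterations, the batch size of 2 and the dropout of 0.2 are illustrative choices, not values from the original post.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Made-up annotated examples in (text, {"entities": [...]}) form.
TRAIN_DATA = [
    (
        "The patient was prescribed ibuprofen for a headache.",
        {"entities": [(27, 36, "MEDICATION"), (43, 51, "SYMPTOM")]},
    ),
    (
        "She was diagnosed with diabetes.",
        {"entities": [(23, 31, "DISEASE")]},
    ),
]

random.seed(0)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _text, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

# spaCy v3 trains on Example objects built from a Doc plus its annotations.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
optimizer = nlp.initialize(lambda: examples)

N_ITER = 20  # assumed number of iterations
for iteration in range(N_ITER):
    random.shuffle(examples)  # avoid learning from the order of examples
    losses = {}
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
    print(iteration, losses)

# nlp.to_disk("custom_ner_model") would then save the trained pipeline;
# the directory name here is a placeholder.
```

With only two sentences the model will simply memorise them, so a real training set needs many more examples per label.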

Evaluate the model
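
The evaluation code from the original post is not reproduced here. As one simple possibility, the predicted entities can be compared with the annotated ones as exact (start, end, label) matches and summarised as precision, recall and F1. The score_entities helper and the example spans below are hypothetical and are not part of spaCy.

```python
def score_entities(gold, predicted):
    """Return (precision, recall, f1) for exact (start, end, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Annotated entities for one text, and a made-up model prediction for it.
gold = [(27, 36, "MEDICATION"), (43, 51, "SYMPTOM")]
predicted = [(27, 36, "MEDICATION"), (37, 40, "SYMPTOM")]

precision, recall, f1 = score_entities(gold, predicted)
print(precision, recall, f1)
```

In spaCy v3, nlp.evaluate(examples) reports comparable entity scores under the keys ents_p, ents_r and ents_f.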

 

 

Check out my portfolio here: https://confusedcoders.com/nikita-sharma-greenhorn-data-science-student

I am a greenhorn Data Science student with interest in finding patterns in data. My language of choice is Python and I am starting to get my hands dirty with R.

I blog on Medium.com [1] and ConfusedCoders.com [2]. I share my code on Github.com [3].

  1.  https://medium.com/@nikkisharma536
  2. https://confusedcoders.com/author/nikita
  3. https://github.com/nikkisharma536
