How to create custom NER in Spacy

Named Entity Recognition (NER)

NER is also known as entity identification or entity extraction. It is a process of identifying predefined entities present in a text such as person name, organisation, location, etc. It is a statistical model which is trained on a labelled data set and then used for extracting information from a given set of data.

Sometimes we want to extract the information based on our domain or industry. For example : in medical domain, we want to extract disease or symptom or medication etc, in that case we need to create our own custom NER.

Spacy

It is an open source software library for advanced Natural Language Programming (NLP).

The Spacy NER environment uses a word embedding strategy using a sub-word features and Bloom embed and 1D Convolutional Neural Network (CNN).

  • Bloom Embedding : It is similar to word embedding and more space optimised representation.It gives each word a unique representation for each distinct context it is in.
  • 1D CNN : It is applied over the input text to classify a sentence/ word into a set of predetermined categories 

How Spacy works

  1. It tokenises the text, i.e. broken-up input sentence into words or word embedding
  2. Words are then broken-up into features and then aggregated to a representative number
  3. This number is then fed to fully connected neural structure, which makes a classification based on the weight  assigned to each features within the text.

How to train Spacy

  • Training data : Annotated data contain both text and their labels
    • Text : Input text the model should predict a label for.
    • Label : The label the model should predict.
  • Gradient : Calculate how to change the weights to improve the predictions. (Compare the prediction label with the actual label and adjusts its weights so that the correct action will score higher next time.)
  • Finally save the model

Spacy Training Data Format

Spacy needs a particular training/annotated data format :

Code walkthrough

Load the model, or create an empty model

We can create an empty model and train it with our annotated dataset or we can use existing spacy model and re-train with our annotated data.

  • We can create an empty model using spacy.black(“en”) or we can load the existing spacy model using spacy.load(“model_name”)
  • We can check the list of pipeline component names by using nlp.pipe_names() .
  • If  we don’t have the entity recogniser in  the pipeline, we will need to create the ner pipeline component using nlp.create_pipe(“ner”) and add that in our model pipeline by using nlp.add_pipe method.

Adding Labels or entities

In order to train the model with our annotated data, we need to add the labels (entities) we want to extract from our text.

  1. We can add the new entity from our annotated data to the entity recogniser using ner.add_label().
  2. As we are only focusing on entity extraction, we will disable all other pipeline components to train our model for ner only using nlp.disable_pipes().

Training and updating the model

  • We will train our model for a number of iterations so that the model can learn from it effectively.
  • At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations based on the order of examples.
  • We will update the model for each iteration using  nlp.update(). 

Evaluate the model

 

 

1 thought on “How to create custom NER in Spacy”

  1. Hi Nikita,

    This is very knowledgable article Thanks for that.I am also doing same thing for my project but I want to train the model for same TAG.
    I am facing some problem like when I trained my model on data corpus which have multiple entities on same line .
    For eg:- If we want to build NER for fruits label/tag —‘FRUIT”
    TRAIN DATA =[“Apple and Mango are fruits. “, entities : {[(0,4, “FRUIT”), (10,14, “FRUIT”)] }]
    Following error comes-
    ValueError: [E103] Trying to set conflicting doc.ents: ‘(0,4, ‘TAG_NAME’)’ and ‘(10,14 ‘TAG_NAME’)’. A token can only be part of one entity, so make sure the entities you’re setting don’t overlap.
    I tried overwrite_ents =TRUE and some other solutions search by oogle but didnt get result.
    How can I train such models having multiple entities with same label. I have a large corpus of data and I can not distinguish entities or make such cor[pus line by line.Is any way to do it in another way.
    Thanks for the help.

Leave a Reply

Your email address will not be published. Required fields are marked *