How to build a deep neural network for custom NER with Keras


In this post, we will learn how to create a simple neural network for named entity recognition (NER), extracting information from unstructured text data with Keras.

Named Entity Recognition (NER)

NER is also known as entity identification or entity extraction. It is the process of identifying predefined entities in a text, such as person names, organisations, and locations. An NER model is a statistical model trained on a labelled data set and then used to extract information from new text.

Sometimes we want to extract information specific to our domain or industry. For example, in the medical domain we may want to extract diseases, symptoms, or medications; in that case we need to create our own custom NER model.

Model Architecture

Here we will use BI-LSTM + CRF layers. The LSTM layers process the text as a sequence, filtering out unwanted information and keeping only the important features, while the CRF layer models the dependencies between neighbouring output tags.


The BI-LSTM is used to produce a vector representation of our words. It takes each word in a sentence as input and produces a representation of each word in both directions (i.e. forward and backward), where the forward pass has access to past context and the backward pass to future context. Its output is then fed into the CRF layer.

CRF Layer

The CRF layer is an optimisation on top of the BI-LSTM layer. It can be used to efficiently predict the current tag based on the tags attributed to neighbouring words. Here is a great post on why a CRF layer is useful on top of a BI-LSTM.



Data Preprocessing

Data Format

For this example I have used this Kaggle dataset. For our model, we need a data frame that contains a ‘Sentence_id’/‘Sentence’ column, a ‘word’ column, and a ‘tag’ column.

Wrapping input data in SentenceGetter

We will create a list of lists of tuples to organise our input data and to differentiate sentences from each other. After loading the data, we will use the SentenceGetter class to retrieve sentences with their labels.
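A minimal sketch of such a SentenceGetter, assuming the token-level data frame uses column names like those in the Kaggle dataset ("Sentence #", "Word", "Tag"); adapt the names to your own columns:

```python
import pandas as pd

class SentenceGetter:
    """Groups a token-level DataFrame into one list of (word, tag) tuples per sentence."""
    def __init__(self, data):
        agg = lambda s: list(zip(s["Word"].tolist(), s["Tag"].tolist()))
        self.sentences = data.groupby("Sentence #").apply(agg).tolist()

# Toy frame mimicking the Kaggle layout
df = pd.DataFrame({
    "Sentence #": ["s1", "s1", "s2"],
    "Word": ["John", "smiled", "London"],
    "Tag": ["B-per", "O", "B-geo"],
})
getter = SentenceGetter(df)
print(getter.sentences[0])  # [('John', 'B-per'), ('smiled', 'O')]
```

Each element of `getter.sentences` is now one full sentence, keeping every word paired with its label.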

Output :

Here is how three sentences look:

Dictionaries of words and tags

Keras (like most other ML frameworks) expects all inputs to be numeric IDs; this also saves memory. We will use a word2idx dictionary to convert each word to a corresponding integer ID and tag2idx to convert each tag to an integer ID.
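The two lookup dictionaries can be built with plain dictionary comprehensions. A sketch with a toy vocabulary (in the real pipeline, `words` and `tags` come from the full dataset); index 0 is kept free for padding, which is why the vocabulary size passed to the model later is n_words + 1:

```python
# Toy vocabulary and tag set; in practice these are collected from the data frame
words = ["John", "smiled", "London"]
tags = ["O", "B-per", "B-geo"]

# Reserve index 0 for the padding token, hence the +1 offset for words
word2idx = {w: i + 1 for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

print(word2idx["John"], tag2idx["B-geo"])  # 1 2
```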


Pad Sequence

The BI-LSTM layer expects all texts/sentences to be of the same length. We select the padding size to be the length of the longest sentence.
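With the integer-encoded sentences in hand, padding is one call to Keras' `pad_sequences` utility. A sketch (the IDs here are arbitrary; 0 is assumed to be the reserved padding index):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two integer-encoded sentences; the longest has 4 tokens
X = [[4, 12, 7], [3, 8, 9, 2]]
max_len = max(len(s) for s in X)

# Pad shorter sentences with 0 at the end so every row has length max_len
X_padded = pad_sequences(X, maxlen=max_len, padding="post", value=0)
print(X_padded.tolist())  # [[4, 12, 7, 0], [3, 8, 9, 2]]
```

The tag sequences are padded the same way, using the ID of a padding/outside tag as the fill value.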

Create Model (and understand layer parameters)

Let’s discuss in brief about different layers used to create our model.

Input Layer

The input layer takes a shape parameter that is a tuple that indicates the dimensionality of the input data.

Embedding Layer

It is basically a dictionary lookup that takes integers as input and returns the associated vectors.

It takes three parameters :

  • input_dim : Size of the vocabulary in the text data, i.e. n_words + 1
  • output_dim : Dimensionality of the embeddings
  • input_length : Length of the input sequences, i.e. the length of the longest sentence


BI-LSTM Layer

It takes five parameters :

  • units : Dimensionality of the output space
  • return_sequences : If return_sequences = True, it returns the full sequence of outputs; otherwise, it returns only the last output in the sequence.
  • dropout : Fraction of the units to drop for the linear transformation of the inputs. It lies between 0 and 1.
  • recurrent_dropout : Fraction of the units to drop for the linear transformation of the recurrent state. It lies between 0 and 1.
  • kernel_initializer : Initializer for the kernel weights matrix, used for the linear transformation of the inputs.


TimeDistributed Layer

It is a wrapper that allows us to apply a layer to every element of our sequence independently. It is used in sequence labelling to keep the one-to-one relation between inputs and outputs.

CRF Layer

We have not applied any customisation to the CRF layer; we simply pass it the number of output classes.

Code for model creation
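A runnable sketch of the architecture with plain Keras layers; all sizes (vocabulary, tags, sentence length, embedding and LSTM dimensions) are illustrative. The post's final layer is keras-contrib's CRF (e.g. `CRF(n_tags)`), which may not be installed everywhere, so a plain softmax head stands in for it here:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Bidirectional,
                                     LSTM, TimeDistributed, Dense)

n_words, n_tags, max_len = 1000, 10, 50  # illustrative sizes

# Input: one padded sentence of word IDs per row
inp = Input(shape=(max_len,))
# Embedding: vocabulary size is n_words + 1 because index 0 is the padding token
x = Embedding(input_dim=n_words + 1, output_dim=64)(inp)
# BI-LSTM over the sequence, returning one vector per word
x = Bidirectional(LSTM(units=50, return_sequences=True,
                       dropout=0.1, recurrent_dropout=0.1,
                       kernel_initializer="glorot_uniform"))(x)
# The post tops this with keras-contrib's CRF layer instead of this Dense head
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

model = Model(inp, out)
model.summary()
```

With the CRF variant, the last two lines become `crf = CRF(n_tags)` and `out = crf(x)`, keeping everything above unchanged.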

Fit and Evaluate Model

Compile model

We need to configure the learning process before training the model.

It takes three parameters :

  • optimizer : The algorithm that updates the model’s weights based on the data it sees and its loss function.
  • loss : The objective the model tries to minimise; it measures the model’s performance on the training data.
  • metrics :  List of metrics to be evaluated by the model during training and testing.
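A minimal compile call, shown on a tiny stand-in model; the BiLSTM-CRF model is compiled the same way. With keras-contrib's CRF layer, the loss and metric would instead be the layer's own `crf.loss_function` and `crf.accuracy`; this sketch uses plain categorical cross-entropy:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

# Tiny stand-in model just to demonstrate compile()
model = Sequential([Input(shape=(4,)), Dense(3, activation="softmax")])

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```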

Callbacks list

A ModelCheckpoint callback is used to save the model weights to the model file, if and only if the validation accuracy improves.

It takes five parameters :

  • filepath : path to the destination model file
  • monitor : monitors the model’s validation accuracy
  • verbose :  If verbose = 1, it will show both progress bar and one line per epoch, if verbose = 0, it will not show anything and if verbose = 2, it will only show one line per epoch i.e. epoch no./total no. of epochs.
  • save_best_only :  if save_best_only=True, the latest best model according to the quantity monitored will not be overwritten.
  • mode : We set mode=’min’ if the monitored value is val_loss as we want to minimise it and  mode = ‘max’ if the monitored value is val_acc as we want to maximise it.
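The parameters above map directly onto Keras' ModelCheckpoint. A sketch (the file name is arbitrary; recent Keras versions name the metric "val_accuracy", older ones "val_acc"):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save weights only when the monitored validation metric improves
checkpoint = ModelCheckpoint(filepath="ner_model.keras",
                             monitor="val_accuracy",
                             verbose=1,
                             save_best_only=True,
                             mode="max")
callbacks_list = [checkpoint]
```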

Fit model

It will train the model for a fixed number of epochs.

It takes seven parameters :

  • X: Input data
  • y : target data
  • batch_size : Number of samples per gradient update,  batch_size will default to 32.
  • epochs : An epoch is an iteration over the entire x and y data provided. Number of epochs to train the model.
  • validation_split : Fraction of the training data to be used as validation data.
  • verbose : If verbose = 1, it will show both progress bar and one line per epoch, if verbose = 0, it will not show anything and if verbose = 2, it will only show one line per epoch i.e. epoch no./total no. of epochs.
  • callbacks : List of callbacks to apply during training.
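Putting the parameters together, a runnable sketch of the fit call on a tiny stand-in model with random data; in the real pipeline, X and y are the padded sentence and tag arrays, and the checkpoint callback from the previous section goes in the callbacks list:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

# Tiny stand-in model and random data, just to demonstrate fit()
model = Sequential([Input(shape=(4,)), Dense(3, activation="softmax")])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

X = np.random.rand(20, 4)                          # 20 samples, 4 features
y = np.eye(3)[np.random.randint(0, 3, size=20)]    # one-hot targets

history = model.fit(X, y,
                    batch_size=8,
                    epochs=2,
                    validation_split=0.1,
                    verbose=0)
print(sorted(history.history))  # loss/accuracy plus their val_ counterparts
```

`history.history` records one value per epoch for each metric, which is what the checkpoint callback monitors.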

Full code link :

That’s all for this post. Stay tuned for more interesting blogs.

