Introduction
In this post, we will learn how to build a simple neural network for named entity recognition (NER) on unstructured text data with Keras.
Named Entity Recognition (NER)
NER is also known as entity identification or entity extraction. It is the process of identifying predefined entities in a text, such as person names, organisations, locations, etc. A statistical model is trained on a labelled dataset and then used to extract these entities from new data.
Sometimes we want to extract information specific to our domain or industry. For example, in the medical domain we may want to extract diseases, symptoms or medications; in that case we need to create our own custom NER model.
Model Architecture
Here we will use BI-LSTM + CRF layers. The BI-LSTM layer filters out unwanted information and keeps only the important features/context for each word, while the CRF layer models the sequential dependencies between the output tags.
BI-LSTM Layer
BI-LSTM is used to produce a vector representation for each word. It takes each word in a sentence as input and encodes it in both directions (i.e. forward and backward), where the forward direction has access to past context and the backward direction has access to future context. Its output is then combined with the CRF layer.
CRF Layer
The CRF layer is an optimisation on top of the BI-LSTM layer. Instead of predicting each tag independently, it predicts the current tag while taking the previously assigned tags into account. Here is a great post on why a CRF layer is useful on top of a BI-LSTM.
Data Preprocessing
Data Format
For this example I have used this Kaggle dataset. For our model, we need a data frame that contains a sentence-id column, a ‘word’ column and a ‘tag’ column.
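Here is a minimal loading sketch; the file name ner_dataset.csv and the latin1 encoding are assumptions based on the Kaggle dataset, so adjust them to your copy:

```python
import pandas as pd

# Load the word-level annotations (file name and encoding assumed from the Kaggle dataset)
data = pd.read_csv("ner_dataset.csv", encoding="latin1")

# Forward-fill the sentence id so every row knows which sentence it belongs to
data = data.fillna(method="ffill")

print(data.head())
```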
Wrapping input data in SentenceGetter
We will organise our input data as a list of sentences, where each sentence is a list of tuples, so that sentences are kept separate from each other. After loading the data, we will use the SentenceGetter class to retrieve the sentences with their labels, as sketched below.
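A sketch of the SentenceGetter class, assuming the ‘Sentence #’, ‘Word’, ‘POS’ and ‘Tag’ column names from the dataset above:

```python
class SentenceGetter:
    """Groups the flat word-level rows into per-sentence lists of (word, pos, tag) tuples."""

    def __init__(self, data):
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values,
                                                           s["POS"].values,
                                                           s["Tag"].values)]
        self.grouped = data.groupby("Sentence #").apply(agg_func)
        self.sentences = list(self.grouped)

getter = SentenceGetter(data)
sentences = getter.sentences
```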
Output: here is how three sentences would look, each represented as a list of (word, POS, tag) tuples.
Dictionaries of words and tags
Keras (like most other ML frameworks) expects all the inputs to be numeric; this is also an optimisation to save memory. We will use a word2idx dictionary to convert each word to a corresponding integer ID and a tag2idx dictionary to convert each tag to an integer ID.
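A minimal sketch of the two dictionaries; reserving index 0 for padding is an assumption you may want to adapt:

```python
# Build the vocabularies of words and tags
words = list(set(data["Word"].values))
tags = list(set(data["Tag"].values))
n_words, n_tags = len(words), len(tags)

# Reserve 0 for padding (an assumption; adapt to your padding scheme)
word2idx = {w: i + 1 for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
```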
Pad Sequence
The BI-LSTM layer expects all texts/sentences to be of the same length, so we pad every sentence. We select the padding size to be the length of the longest sentence.
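A padding sketch, assuming the labels are padded with the ‘O’ tag and that the CRF layer expects one-hot targets:

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Pad every sequence to the length of the longest sentence
max_len = max(len(s) for s in sentences)

X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)

y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])

# One-hot encode the tags (assumed; depends on the CRF implementation used)
y = [to_categorical(i, num_classes=n_tags) for i in y]
```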
Create Model (and understand layer parameters)
Let’s briefly discuss the different layers used to create our model.
Input Layer
The input layer takes a shape parameter, a tuple that indicates the dimensionality of the input data.
Embedding Layer
It is basically a dictionary lookup that takes integers as input and returns the associated vectors.
It takes three parameters :
- input_dim : Size of the vocabulary in the text data, i.e. n_words + 1
- output_dim : Dimensionality of the embeddings
- input_length : Length of the input sequences, i.e. the length of the longest sentence
BI-LSTM Layer
It takes five parameters :
- units : Dimensionality of the output space
- return_sequences : If return_sequences=True, it returns the full sequence of outputs; otherwise, it returns only the last output in the output sequence.
- dropout : Fraction of the units to drop for the linear transformation of the inputs. It lies between 0 and 1.
- recurrent_dropout : Fraction of the units to drop for the linear transformation of the recurrent state. It lies between 0 and 1.
- kernel_initializer : Initializer for the kernel weights matrix, used for the linear transformation of the inputs.
TimeDistributed Layer
It is a wrapper that allows us to apply one layer to every element of our sequence independently. It is used in sequence classification to keep a one-to-one relation between input and output time steps.
CRF Layer
We have not applied any customisation to the CRF layer. We have passed the number of output classes to the CRF layer.
Code for model creation
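Here is a sketch of the model. The CRF layer is assumed to come from keras_contrib, and the hyper-parameter values (embedding size 20, 50 LSTM units, etc.) are illustrative:

```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Bidirectional
from keras_contrib.layers import CRF

# Input layer: one integer word id per time step
input_layer = Input(shape=(max_len,))

# Embedding layer: maps each word id to a dense vector
model = Embedding(input_dim=n_words + 1, output_dim=20,
                  input_length=max_len)(input_layer)

# BI-LSTM layer: contextual representation of each word in both directions
model = Bidirectional(LSTM(units=50, return_sequences=True,
                           dropout=0.1, recurrent_dropout=0.1,
                           kernel_initializer="glorot_uniform"))(model)

# TimeDistributed Dense: applied independently to every time step
model = TimeDistributed(Dense(50, activation="relu"))(model)

# CRF layer: one output class per tag
crf = CRF(n_tags)
out = crf(model)

model = Model(input_layer, out)
```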
Fit and Evaluate Model
Compile model
We need to configure the learning process before training the model; see the compile sketch after this list.
It takes three parameters :
- optimizer : The algorithm that updates the model weights based on the data it sees and the loss function
- loss : The function the model tries to minimise; it measures the model's performance on the training data
- metrics : List of metrics to be evaluated by the model during training and testing.
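A compile sketch, assuming the keras_contrib CRF layer, which provides its own loss and accuracy functions:

```python
# The CRF layer from keras_contrib provides its own loss and accuracy
model.compile(optimizer="rmsprop",
              loss=crf.loss_function,
              metrics=[crf.accuracy])
model.summary()
```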
Callbacks list
We use a ModelCheckpoint callback to save the model weights to the model file, if and only if the validation accuracy improves; see the sketch after this list.
It takes five parameters :
- filepath : path to the destination model file
- monitor : The quantity to monitor, here the model's validation accuracy
- verbose : If verbose = 1, it will show both progress bar and one line per epoch, if verbose = 0, it will not show anything and if verbose = 2, it will only show one line per epoch i.e. epoch no./total no. of epochs.
- save_best_only : if save_best_only=True, the latest best model according to the quantity monitored will not be overwritten.
- mode : We set mode=’min’ if the monitored value is val_loss as we want to minimise it and mode = ‘max’ if the monitored value is val_acc as we want to maximise it.
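A sketch of the callbacks list; the file name and the monitored metric are illustrative:

```python
from keras.callbacks import ModelCheckpoint

# Save the weights only when the monitored validation accuracy improves
checkpointer = ModelCheckpoint(filepath="model.h5",
                               monitor="val_acc",
                               verbose=1,
                               save_best_only=True,
                               mode="max")
callbacks_list = [checkpointer]
```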
Fit model
It will train the model for a fixed number of epochs; see the fit sketch after this list.
It takes seven parameters :
- X : Input data
- y : Target data
- batch_size : Number of samples per gradient update, batch_size will default to 32.
- epochs : An epoch is an iteration over the entire x and y data provided. Number of epochs to train the model.
- validation_split : Fraction of the training data to be used as validation data.
- verbose : If verbose = 1, it will show both progress bar and one line per epoch, if verbose = 0, it will not show anything and if verbose = 2, it will only show one line per epoch i.e. epoch no./total no. of epochs.
- callbacks : List of callbacks to apply during training
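A fit sketch; the batch size, epoch count and validation split are illustrative values:

```python
import numpy as np

# Train the model; hyper-parameter values here are illustrative
history = model.fit(X, np.array(y),
                    batch_size=32,
                    epochs=5,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks_list)
```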
Full code link : https://www.kaggle.com/nikkisharma536/ner-with-bilstm-and-crf
That’s all for this post. Stay tuned for more interesting blogs.