How to apply Deep Learning on tabular data with FastAi

It’s a common sentiment that Deep Learning is only good for images and language models. This post is about using Deep Learning on tabular data, for both Regression and Classification problems. We will use the fastai library to create our deep learning models, and Kaggle competitions as benchmarks to see how our solutions compare to other solutions that use traditional ML models.

If you haven’t watched fastai tutorials already, please visit this link for the awesome and free tutorials.

Network architecture

Important concepts:

  1. Categorical embeddings: Similar to latent features; each category is embedded as a learned vector of N features.
  2. Continuous variables: Batch Normalisation is applied to the continuous variables.
  3. Hidden layers
  4. Output layer
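To make the second concept concrete, here is a minimal pure-Python sketch of what normalising one batch of a continuous column means. It is an illustration only, not fastai's implementation: a real batch normalisation layer also learns a scale and a shift, and the `ages` values are made up.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalise one batch of a continuous column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when every value in the batch is equal.
    return [(v - mean) / math.sqrt(var + eps) for v in values]

ages = [22.0, 38.0, 26.0, 35.0]  # hypothetical batch of a continuous column
normalised_ages = batch_norm(ages)
```

After this step every continuous column lives on a similar scale, which tends to make training more stable.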

 

Fastai figures out the default value for N for us here, but a lot of these settings can be customised while creating the Data Bunch and learner. To understand this more deeply: Embedding(8, 5) means that a categorical embedding was created with 8 input values and 5 output latent features.
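As a rough illustration, an embedding is just a lookup table with one row per category. The sketch below builds an 8 x 5 table in plain Python. The sizing function is an assumption on my part: it mirrors the heuristic I believe recent fastai releases use by default, so check your installed version before relying on it. The function and variable names here are hypothetical.

```python
import random

def emb_sz_rule(n_cat):
    # Assumed default heuristic for the embedding width; not confirmed by this post.
    return min(600, round(1.6 * n_cat ** 0.56))

def make_embedding(n_categories, n_features, seed=0):
    """Create a lookup table with one random vector per category."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_categories)]

n_categories = 8
n_features = emb_sz_rule(n_categories)  # 5 for 8 categories under this rule
embedding = make_embedding(n_categories, n_features)

# Looking up a category is just indexing a row; training adjusts these numbers.
category_index = 3
vector = embedding[category_index]
```

In a real network the rows start out random, as here, and are then updated by gradient descent together with the other weights.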

This is the internal network structure of a 2-layer model I created for one of the challenges. This is what our model looks like with fastai defaults:

 

 

Let’s have a look at the challenges and some code now.

Part 1 – Classification Problem: Titanic challenge

Here we use the Titanic dataset from Kaggle to show how fastai works on a classification problem.

Let’s look at the data

Let’s get an overview of the data and check the data type of each feature, to understand which features matter.

Load and analyse data

Fastai expects the data to be loaded as a Data Bunch, which a fastai Learner can then use to train models. Fastai takes care of a lot of feature engineering for us and prepares the data in a format that the neural net can understand.

A few things to keep in mind here:

  1. Data Bunch: The data format that fastai takes as input
  2. Dependent variable: The variable to predict
  3. Categorical columns: The text/label columns, or columns with low cardinality, e.g. gender, type, year etc.
  4. Continuous columns: Numeric value columns, usually with higher cardinality, e.g. salary, price, temperature.
  5. Transformations: Feature engineering and data handling, e.g. filling missing values, normalisation etc.
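The transformations in the last item can be sketched in plain Python. This is only an illustration of the idea, with hypothetical helper names and made-up values. Fastai ships its own preprocessing steps for this (named FillMissing, Categorify and Normalize), and the choices below, such as filling with the median and reserving code 0 for missing labels, are assumptions of this sketch.

```python
from statistics import mean, median, pstdev

def fill_missing(column):
    """Replace None with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in column]

def categorify(column):
    """Map each distinct label to an integer code, keeping 0 for missing."""
    labels = sorted({v for v in column if v is not None})
    codes = {label: i + 1 for i, label in enumerate(labels)}
    return [codes.get(v, 0) for v in column], codes

def normalise(column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

sex = ["male", "female", "female", None]  # categorical column
age = [22.0, 38.0, None, 35.0]            # continuous column

sex_codes, sex_vocab = categorify(sex)
age_ready = normalise(fill_missing(age))
```

The integer codes are what the embedding layers index into, and the normalised numbers are what the continuous inputs receive.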

We can then pass this Data Bunch to our Fastai learner for training on the data.

 

Fit the deep learning model

Now that we have our Data Bunch ready, we can go ahead and create our Deep Learning model. Fastai provides a number of learners for our convenience, inspired by state-of-the-art papers.

This typically consists of the following steps:

  1. Create a learner: Create an appropriate learner for the data. A learner creates a neural network for us.
  2. Find the learning rate: We need to find a suitable learning rate for our training.
  3. Fit the model
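To give a feel for step 2, here is a toy learning rate range test in plain Python. It is not fastai's learning rate finder, only a sketch of the same idea on a one-weight model y = w * x: train briefly while growing the learning rate, record the loss, and stop when it blows up. The "lowest loss divided by 10" choice is one common rule of thumb, and all names here are hypothetical.

```python
def lr_range_test(xs, ys, start_lr=1e-4, end_lr=10.0, steps=50):
    """Grow the learning rate geometrically and record (lr, loss) pairs."""
    w = 0.0
    factor = (end_lr / start_lr) ** (1 / (steps - 1))
    lr, history, best = start_lr, [], float("inf")
    for _ in range(steps):
        # Mean squared error of the model y = w * x.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if loss > 4 * best:  # stop once the loss clearly diverges
            break
        best = min(best, loss)
        history.append((lr, loss))
        # One gradient descent step with the current learning rate.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        lr *= factor
    return history

def suggest_lr(history):
    """Take the rate at the lowest recorded loss and back off by 10x."""
    best_lr, _ = min(history, key=lambda pair: pair[1])
    return best_lr / 10

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
history = lr_range_test(xs, ys)
suggested = suggest_lr(history)
```

Backing off from the lowest-loss point matters because by the time the loss bottoms out, the rate is usually already close to the value at which training diverges.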

We will create a tabular learner for our challenge.

 

Get Prediction

Let’s get the predictions and create the submission file to submit to Kaggle.
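As a hedged sketch of this last step, the snippet below turns per-class probabilities into 0/1 labels and writes them in the two-column format the Titanic competition expects (PassengerId, Survived). The probabilities and ids are made up for illustration, and the helper names are hypothetical. In practice the probabilities would come from the trained learner.

```python
import csv
import io

def to_labels(probabilities):
    """Pick the most likely class for each row of class probabilities."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

def write_submission(out, passenger_ids, labels):
    """Write the header and one (id, label) row per passenger."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(passenger_ids, labels))

probabilities = [[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]]  # made-up model output
passenger_ids = [892, 893, 894]                         # made-up ids

buffer = io.StringIO()  # swap for open("submission.csv", "w", newline="") to write a file
write_submission(buffer, passenger_ids, to_labels(probabilities))
submission_csv = buffer.getvalue()
```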

 

I found that I was able to get up to 0.8XX with an ensembled XGB + RF based model, but that involved a lot of feature engineering, cross validation and ensembling.

So there is scope for improving the deep learning model here. However, this is not bad at all for a model built without any feature engineering or network tuning.

Part 2 – Regression Problem: House price prediction

Let’s look at the data

Let’s apply the same concepts to a regression problem now.

Most of the code is very similar to the classification case. Here we use the house price prediction dataset from Kaggle to show how fastai works on a regression problem.

 

Load and analyse data

As we know, fastai expects the data to be loaded as a Data Bunch.
A few things to keep in mind for a regression model:

  • We have to pass label_cls as FloatList in our Data Bunch, so that the target values are treated as floats rather than as classes.
  • We also pass log as True in our Data Bunch, as the values in our target column are very large.
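The effect of the log option can be sketched in a few lines of plain Python. This assumes a natural logarithm and uses made-up prices. The key point is that a model trained on log targets predicts log prices, so predictions need to be passed through exp before they are submitted.

```python
import math

sale_prices = [208500.0, 181500.0, 223500.0]  # made-up target values

# Training targets: large prices become small, similarly scaled numbers.
log_prices = [math.log(p) for p in sale_prices]

# Pretend the model predicted the log targets perfectly, then undo the transform.
predicted_logs = log_prices
predicted_prices = [math.exp(v) for v in predicted_logs]
```

Working on the log scale also means that an error on a cheap house and the same relative error on an expensive house count roughly equally.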

 

Fit the deep learning model

We apply the same concepts as for the classification model. Here we pass ‘rmse’ as the metric. Note that the metric passed here doesn’t change how the network trains; it is only used to report training progress.
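For reference, root mean squared error can be written in plain Python as below. This is a sketch of the formula, not fastai's own metric code, and the numbers are made-up log-scale predictions and targets.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Each prediction is off by 0.1, so the score is approximately 0.1.
score = rmse([12.1, 11.9, 12.4], [12.0, 12.0, 12.5])
```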

 

Get Prediction

Let’s get the predictions and create the submission file to submit to Kaggle.

This actually performed better than my ensembled tree-based models. The network captured the features better with just the fastai defaults. This was impressive.

All the code

All the code for this task can be found here on Kaggle kernels:

 

Check out my portfolio here: https://confusedcoders.com/nikita-sharma-greenhorn-data-science-student

I am a greenhorn Data Science student with interest in finding patterns in data. My language of choice is Python and I am starting to get my hands dirty with R.

I blog on Medium.com [1] and ConfusedCoders.com [2]. I share my code on Github.com [3].

  1.  https://medium.com/@nikkisharma536
  2. https://confusedcoders.com/author/nikita
  3. https://github.com/nikkisharma536
