Engineering best practices for Data Science projects

Introduction

Data Science code usually starts life in notebooks and only later has to survive in production. This post walks through the engineering practices that make that transition easier: refactoring, unit and integration testing, linting, code coverage, branch protection, automated test execution, and monitoring & alerting.

Code Refactoring

This is the first step towards better code. Refactoring is the process of simplifying the design of existing code without changing its behavior.

Data science projects are written in Jupyter notebooks most of the time and can get out of control pretty easily. A code refactoring step is highly recommended before moving the code to production.

Issues addressed

  • Improved code readability – make the code easy for our teams to understand
  • Reduced complexity – smaller, more maintainable functions and modules

Action items

  • Break the code down into smaller functions (see the sketch after this list)
  • Comment the functions
  • Apply better naming standards
  • Remove unused bits of code
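
As a small illustration (the dataset, column names, and cleaning rules below are made up), a typical long notebook cell can be broken down into small, well-named, commented functions:

    import numpy as np
    import pandas as pd

    # Before: one long, anonymous notebook cell doing everything at once, e.g.
    #   df = pd.read_csv("sales.csv"); df = df[df.amount > 0]; df["amount_log"] = np.log(df.amount)

    # After: small functions with clear names and docstrings
    def load_sales(path: str) -> pd.DataFrame:
        """Read the raw sales extract."""
        return pd.read_csv(path)

    def drop_non_positive_amounts(df: pd.DataFrame) -> pd.DataFrame:
        """Remove refunds and bad rows before feature engineering."""
        return df[df["amount"] > 0]

    def add_log_amount(df: pd.DataFrame) -> pd.DataFrame:
        """Add the log-transformed amount used by the model."""
        return df.assign(amount_log=np.log(df["amount"]))

Each step is now small enough to name, comment, and, as the next section shows, unit test.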

Unit Tests

A unit test is a small piece of code that tests a single function in isolation. The purpose is to validate that each function in the code performs as expected.

Testing almost always gets ignored in Data Science projects. A few parts of your project might not need test cases, but many other components can easily be unit tested.

For example, model evaluation is done in the experimentation phase and we probably do not need to test it again in unit tests, but data cleaning and data transformations are parts that can definitely be unit tested.

Issues addressed

  • Helps catch and fix bugs early
  • Helps new starters understand what the code does
  • Enables quick code changes
  • Ensures bad code is not merged in

Action items

  • Create functions that accept all required parameters as arguments, rather than computing them inside the function. This makes them more testable.
  • If a function reads a Spark data frame internally, change it to accept the data frame as a parameter. We can then pass handcrafted data frames to test it, as shown in the sketch after this list.
  • We will write a bunch of unit tests for each function
  • We will use a Python framework like unittest or pytest for unit testing
  • Tests will be part of the code base and will ensure no bad code is merged
  • These tests will later be used by our CI/CD pipeline to block the deployment of bad code
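
As a rough sketch (the function, column names, and values are hypothetical), a data-cleaning function that accepts a data frame as a parameter can be tested with pytest and a handcrafted frame:

    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def drop_null_ids(df, id_col="id"):
        """Hypothetical cleaning step: remove rows with a null identifier."""
        return df.filter(F.col(id_col).isNotNull())

    @pytest.fixture(scope="session")
    def spark():
        # A small local Spark session is enough for unit tests
        return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

    def test_drop_null_ids_removes_null_rows(spark):
        # Handcrafted input frame instead of reading from storage
        df = spark.createDataFrame([(1, "a"), (None, "b")], ["id", "value"])
        result = drop_null_ids(df)
        assert result.count() == 1
        assert result.first()["id"] == 1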

Integration Tests 

Integration testing exercises the system as a whole, checking that all the functions work correctly when combined.

A lot of the time a project will depend on external systems; for example, your PySpark code might be reading from or writing to Cassandra. We can create integration tests to test the whole project as a single unit, or to test how the project behaves with its external dependencies.

Issues addressed 

  • It makes sure the whole project works properly.
  • It detects errors that only appear when multiple modules work together.

Action items

  • We will create local infrastructure to test the whole project
  • External dependencies can be run locally in Docker containers
  • A test framework like pytest or unittest will be used for writing the integration tests
  • The code will be run against the local infra and tested for correctness, as in the sketch after this list
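
As a hedged sketch, assuming a local Cassandra container is already running (for example started with docker run -p 9042:9042 cassandra) and with placeholder keyspace/table names, an integration test could look like this:

    import pytest
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    @pytest.fixture(scope="module")
    def session():
        # Connect to the Cassandra instance running in the local Docker container
        cluster = Cluster(["127.0.0.1"], port=9042)
        session = cluster.connect()
        yield session
        cluster.shutdown()

    def test_can_write_and_read_predictions(session):
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS test_ks "
            "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
        )
        session.execute(
            "CREATE TABLE IF NOT EXISTS test_ks.predictions (id int PRIMARY KEY, score double)"
        )
        session.execute("INSERT INTO test_ks.predictions (id, score) VALUES (1, 0.9)")
        row = session.execute("SELECT score FROM test_ks.predictions WHERE id = 1").one()
        assert row.score == pytest.approx(0.9)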

Code Linting  

Projects written in Jupyter notebooks don't always follow the best naming or programming patterns, since the focus of notebooks is speed. Linting helps us identify syntactical and stylistic problems in our Python code.

Issues addressed 

  • Helps detect styling errors
  • Encourages a better / more consistent writing style
  • Detects structural problems, like the use of an uninitialized or undefined variable
  • Makes the code more pleasant to work with

Action items

  • Flake8 will be used to detect both logical and code style problems, and black to format code consistently (an example of what flake8 flags follows this list)
  • As a next step, lint checks will be integrated into CI/CD to fail builds on bad writing style
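
For illustration, this is the kind of code flake8 flags (the error codes in the comments are flake8's own):

    import os                  # F401: 'os' imported but unused

    def score(x):
        total = x + threshold  # F821: undefined name 'threshold'
        return total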

Code Coverage

Code coverage tells us how much of our code is exercised by our test cases. It's a good quality indicator that shows which parts of the project need more testing.

Issues addressed 

  • Monitors how much of the code is covered by tests

Action items

  • Tools like coverage.py or pytest-cov will be used to measure how much of our code the tests cover, as in the sketch below
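
In practice this is usually just a command-line flag on the test run, but as a minimal sketch (the package and test directory names are placeholders), coverage.py can also be driven from Python:

    import coverage
    import pytest

    cov = coverage.Coverage(source=["my_package"])  # placeholder package name
    cov.start()
    pytest.main(["tests/"])          # run the test suite while coverage is recording
    cov.stop()
    cov.save()
    cov.report(show_missing=True)    # per-file coverage, including missed line numbers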

GitHub repo Branch Permission

We will set permissions to control who can read and update the code in a branch of our Git repo. This will keep our master (deployment) branch clean and force a Pull Request + build tests process to get code merged into master.

Enforcing a peer review process and automated testing also ensures we have fewer bugs merging into our codebase, and that other teammates are aware of the changes going into the project.

Issues addressed 

  • Master is always clean and ready to be deployed
  • Forces best practices – Pull Requests + automated build tests
  • Prevents the branch from being deleted accidentally
  • Prevents bad code from being merged into master

Action items

We will configure the branch settings as follows:

  • Rewriting branch history will not be allowed for the master branch
  • Code cannot be merged into master directly, without a Pull Request
  • At least 1 approval is needed to merge code into master
  • Code will only be merged once all automated test cases pass

Automated test execution on branches

When a pull request is created, it is a good idea to test it before merging so we do not break any code or tests.

Issues addressed 

  • Automated runs of the tests
  • Prevents bad code from being merged into master

Action items

  • CI/CD setup on GitHub
  • Automated tests should be triggered on every code push to a branch
  • Automated tests should be triggered when a Pull Request is created
  • Code is deployed to the production environment only if all tests are green

Monitoring  & Alerting

This is a very important step in the software engineering world, but it almost always gets skipped for Data Science projects. We will monitor our jobs and raise an alert if we hit runtime errors in our code.

If your project is only producing predictions, you might not need very extensive alerting; but if the project talks to a few systems and processes a lot of data/requests, having monitoring is going to make your life a lot easier in the long run.

Issues addressed 

  • More visibility, rather than black-box code executions
  • Monitor input and output processing stats
  • Monitor infra availability/dependencies
  • Trends of past run failures/successes
  • Alerts us when the ML pipeline fails/crashes

Action items

  • If you have a monitoring tool (highly recommended) – send it events with input/output stats to monitor
  • If no monitoring tool is available – log all the important stats in your log files
  • If no monitoring tool is available – we could also write the important stats of each run to a DB for future reference
  • Build a Slack/Microsoft Teams integration to alert us of the pipeline pass/fail status, as in the sketch after this list
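
A minimal sketch of the Slack side, assuming an incoming webhook is configured (the webhook URL and pipeline names below are placeholders):

    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def notify_slack(pipeline_name: str, status: str, details: str = "") -> None:
        """Post a pipeline status message to a Slack incoming webhook."""
        payload = {"text": f"ML pipeline '{pipeline_name}' finished with status {status}. {details}"}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()

    # Example usage at the end of a pipeline run:
    # notify_slack("churn-model-daily", "FAILED", "input row count dropped to 0")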

 

That’s all

That's all for this post. I hope these tips are useful. Please share your thoughts and the best practices you have applied to your own Data Science projects.

