Hands-on SQL guide for Data Science beginners – From databases to data lakes

Are you an aspiring Data Scientist, or a greenhorn Data Science student like me? Are you trying to get started with SQL but feeling lost among the available options?

This post might just be the thing for you!


My SQL Journey

I started my Data Science journey a few months back, and more recently I have been focusing on getting better at SQL. Here are my learnings, and a compilation of the things I did along the way.

Let’s get started.

SQL (Structured Query Language) is a language designed to manage data stored in relational databases. It is used to query, insert, update and delete data.
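To make those four operations concrete, here is a minimal sketch that runs them from Python. It assumes SQLite (which ships with Python as the sqlite3 module) as a stand-in for any relational database; the students table and its rows are made up purely for illustration.

```python
import sqlite3

# In-memory SQLite database: an assumption for illustration, nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, created only for this example.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")

# INSERT: add three rows.
cur.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Asha", 82), ("Ben", 67), ("Chen", 91)],
)

# UPDATE: change one existing row.
cur.execute("UPDATE students SET score = 70 WHERE name = ?", ("Ben",))

# DELETE: remove every row below a threshold.
cur.execute("DELETE FROM students WHERE score < ?", (75,))

# SELECT: query what is left, highest score first.
rows = cur.execute("SELECT name, score FROM students ORDER BY score DESC").fetchall()
print(rows)

conn.close()
```

The SQL statements themselves are standard, so they should carry over to other databases with little or no change.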


Understand SQL basics

Before digging deep, we need to understand the basics of SQL and its syntax.

You can use the following site to learn the basic ideas and syntax of SQL:

Don’t stress if you can’t remember all the syntax. Once you get hands-on with SQL, the concepts and syntax will stick.


Getting our hands dirty – try SQL online

Now that we know the basics of SQL, let’s try our skills on a few of the online SQL editors.

Here are some of the online editors I tried:

Make sure you try all the exercises to get comfortable with the syntax.


Install SQL on Local Box



We have learned the basics of SQL, and now it’s time to go deeper. Most online editors are limited to the challenges and exercises on their own website and don’t let us practice the full range of SQL syntax.

Let’s install SQL on our local box, so that we can get messy and are not limited by the online websites.

Here is the post where I walk through the installation of MySQL and MySQL Workbench on a local box. The post also talks about accessing MySQL via Python code.


Try a SQL Challenge

I came across this great GitHub repo, which has a range of SQL challenges along with their solutions. The repo also has sample data and DDLs (the CREATE TABLE statements), which you can use to create dummy tables. You can then use your local MySQL setup to practice the challenges.

Here is the link to the repo. Try not to look at the solution before attempting the challenges.
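If you are wondering what DDLs plus sample data look like in practice, here is a rough sketch. The tables, rows and the question are my own made-up example and are not taken from the repo, and I am assuming SQLite through Python's sqlite3 module so that it runs without any setup. The challenge here is: for each department, how many employees are there and what is the top salary?

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (CREATE TABLE) followed by sample data. Both tables are hypothetical.
conn.executescript("""
CREATE TABLE departments (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
CREATE TABLE employees (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    salary   INTEGER NOT NULL,
    dept_id  INTEGER REFERENCES departments(dept_id)
);
INSERT INTO departments VALUES (1, 'Analytics'), (2, 'Engineering');
INSERT INTO employees VALUES
    (1, 'Asha', 5200, 1),
    (2, 'Ben',  4800, 1),
    (3, 'Chen', 6100, 2),
    (4, 'Dina', 5900, 2);
""")

# One possible solution: join the two tables, then aggregate per department.
query = """
SELECT d.dept_name, COUNT(*) AS headcount, MAX(e.salary) AS top_salary
FROM employees AS e
JOIN departments AS d ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY d.dept_name
"""
result = conn.execute(query).fetchall()
print(result)

conn.close()
```

On MySQL the DDL would need small tweaks (data types, for example), but the SELECT is plain SQL.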


Let’s level up – try HackerRank



We have trained ourselves enough on basic SQL, and now it’s time to level up the game. Let’s move to HackerRank for more serious, advanced SQL.

You can select the intermediate or advanced level based on your comfort, and there is a lot to learn here.


Let’s go turbo – get a feel of Big Data

Now we have a lot of knowledge of SQL. It’s about time we acknowledged the scale of data. Here is a great tutorial on Kaggle that walks us through the basics of Google’s BigQuery.

At this point, the tutorial should look very familiar to you.

Kaggle has its own limits on BigQuery usage. If you would like to get around them, you can create your own BigQuery trial account here.


Get a feel of Data Lake – with S3 and Spark-SQL



In this post I have installed Apache Spark on an AWS EMR cluster and queried it via Apache Zeppelin using Spark SQL. The post also talks about saving data on Amazon S3 and creating external tables via Spark SQL.

Have a look at this post to get a feel of Spark-SQL.

Make sure you read about the optimization techniques for Big Data and data lakes. Data partitioning, which lets a query skip the data it does not need, and the choice of file format are big wins in the Big Data ecosystem.

At this point there is no turning back. Keep reading up on the concepts.
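To get an intuition for why partitioning helps, here is a toy sketch in plain Python. It is not Spark; it only mimics the key=value folder layout (for example year=2023) that Spark and similar engines use for partitioned tables. The records, the part-0.csv file name and the helper functions are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample records: (year, city, amount).
records = [
    (2022, "Pune", 120),
    (2022, "Delhi", 80),
    (2023, "Pune", 150),
    (2023, "Delhi", 95),
]

def write_partitioned(rows, root):
    """Write rows into one folder per year, e.g. root/year=2023/part-0.csv."""
    for year, city, amount in rows:
        folder = root / f"year={year}"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "part-0.csv", "a", newline="") as f:
            csv.writer(f).writerow([city, amount])

def read_partition(root, year):
    """Read only the files under the requested partition folder."""
    rows = []
    for path in sorted((root / f"year={year}").glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend((city, int(amount)) for city, amount in csv.reader(f))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(records, root)
    partitions = sorted(p.name for p in root.iterdir())
    # A query filtered on year = 2023 never opens the 2022 folder.
    rows_2023 = read_partition(root, 2023)

print(partitions)
print(rows_2023)
```

With four rows the saving is nothing, but the same idea applied to years of data on S3 means a filtered query reads a small fraction of the files.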

That’s all, folks

This is my journey with SQL so far, and I have learned heaps of new concepts. I am looking forward to more learning ahead and to getting my hands dirty with more complex real-world problems.

I hope this post is helpful for other greenhorn data scientists and analysts like me. I would love to hear your feedback. Please share your thoughts in the comments below.
