1. Essential Theoretical Knowledge of Statistics and Calculus
I think you kind of expected this to be the first one, but before you skip ahead to the next section or to another article entirely, let me tell you why it needs to be the first point mentioned. An okay data scientist learns how to use a bunch of tools like Power BI, Scikit-learn, etc. That will be fine for building baseline models, but you will soon find out that it’s not enough and you need to improve your model.
This brings us to reading ML research papers. And you have to trust me on this: you will not understand most ML papers if you don’t understand essential statistics, and if you can’t understand most of the papers, you probably won’t be able to implement or improve on them, which is a big issue.
I remember struggling to understand ML papers at university; it used to take me a few days, if not weeks, to fully grasp one. All of that changed when I spent a few weeks learning the fundamentals of statistics and calculus. Now I can easily digest those papers in an hour or two. If you haven’t dug into the papers yourself yet, you will not believe how much they rely on those foundations.
One very important point that I want to stress here is that I am not asking you to be an expert in these foundations. This is what most people struggled with in high school—being good enough at math and statistics to get through an exam. You don’t need this here. You just need to understand the foundations to digest the research papers. Understanding them is much easier than actually being good at solving theoretical math problems (which is a good skill to have, but a hard one to acquire).
Khan Academy is an excellent place to start. You can start by checking out their algebra course here and their stats one here.
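To give one concrete taste of what I mean, here is the kind of formula that shows up, in one form or another, in a huge fraction of ML papers: the gradient-descent update rule (this example is my own illustration, not something from a specific paper). Once you are comfortable with basic calculus, a line like this stops being intimidating:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t)$$

Here $\theta$ are the model parameters, $\eta$ is the learning rate, and $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss with respect to the parameters. That is exactly the level of understanding I am talking about: enough to read the notation, not to prove theorems about it.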
2. Essential Programming Basics
You now have your math and stats knowledge, so it’s time to move on to something more practical and hands-on. A lot of people get into data science from non-technical backgrounds (which is actually quite impressive). Believe me when I tell you this: the worst way to learn programming is to keep watching courses endlessly. I know there are tons of articles and videos about learning programming and I don’t want this to be just another duplicate. I do, however, want to give you the most important tips that will help you save a lot of time.
When I was learning programming basics I used to watch tons of tutorials, which was useful. But a lot of people (including me) think that watching more tutorials equals improvement in our skills as programmers. It does not! Tutorials only show you how to do something; you never learn until you actually do it yourself. Although this seems straightforward and obvious, it needs to be said: actually writing code is much harder than watching other people write it. So, simply put, here is the next tip:
For every few tutorials you watch or articles you read, make sure you implement at least one of them. If you aren’t doing this, you are wasting your time.
If you don’t believe me, feel free to check out articles by TraversyMedia and FreeCodeCamp that affirm this idea. A lot of programmers come to the same realization, but usually later than they should.
I am not going to point you to a course. Instead, I am going to point you to one of the best places to improve your programming skills and, more importantly, your problem-solving skills. I wish someone had given me this advice at university, because programming languages change all the time; problem-solving skills don’t. And when you actually start applying for jobs, a decent interviewer will be examining your problem-solving skills, not your syntax accuracy.
Start by integrating at least 2-3 hours of easy HackerRank or LeetCode problems into your schedule every week. If you are struggling, watch some tutorials, but attempt the problems first (not the other way around).
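If you want a feel for what “easy” means here, this is roughly the flavour of problem you would be solving in one of those weekly sessions: the classic two-sum exercise, written in Python. The function name and test values are just for illustration.

```python
def two_sum(nums, target):
    """Return the indices of the two numbers in `nums` that add up to `target`.

    A single pass with a dictionary gives O(n) time instead of the naive O(n^2)
    double loop -- exactly the kind of trade-off these sites train you to spot.
    """
    seen = {}  # value -> index where we saw it
    for i, value in enumerate(nums):
        complement = target - value
        if complement in seen:
            return [seen[complement], i]
        seen[value] = i
    return []  # no pair found


print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```

The point is not the puzzle itself; it is the habit of reasoning about a problem before reaching for a tutorial.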
3. Experience, experience, experience
At this point, you know your theory, you have good programming and problem-solving skills, and you are ready to start gaining data science skills. The best way to do this is to start developing end-to-end data science projects. From my experience, the best projects include at least a few of these components:
- Data gathering, filtering, and engineering: This can be as simple as an online search or as complex as building a web scraping server that aggregates certain websites and saves the required data into a database (see the scraping sketch after this list). This is actually the most significant stage, because if you don’t have data, you don’t have a data science project! It is also the reason why a lot of AI startups fail. Once I realized this, it was quite an eye-opener for me, even though it’s kind of obvious!
“Model training is only the tip of the iceberg. What most users and AI/ML companies overlook is the massive hidden cost of acquiring appropriate datasets and cleaning, storing, aggregating, labeling, and building reliable data flow and an infrastructure pipeline.” — The Single Biggest Reason Why AI/ML Companies Fail to Scale?
- Model Training (this is too obvious to explain)
- Gathering metrics & exploring model interpretability: One of the biggest mistakes I made in my first few ML projects was not giving this point due credit. I was extremely eager to learn, so I kept jumping from model to model too quickly. Don’t do this. When you train a model, fully evaluate it, explore its hyperparameters, check out interpretability techniques and, most importantly, figure out why it works well and why it doesn’t (see the evaluation sketch after this list). One of the best places to learn these concepts (except data gathering) is Kaggle; I can’t stress enough how much you will learn from doing a few Kaggle competitions.
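As promised in the first bullet, here is a minimal sketch of what the data-gathering stage can look like: fetch a page, pull out a couple of fields, and save them into a local SQLite database. The URL, tag names, and table schema are placeholders you would adapt to whatever site and data you actually need, and it assumes the `requests` and `beautifulsoup4` packages are installed.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- swap in the site you actually want to scrape
# (and check its robots.txt / terms of service first).
URL = "https://example.com/articles"

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical structure: assume each article sits in an <article> tag
# with an <h2> title and a <time> publication date.
rows = []
for article in soup.find_all("article"):
    title = article.find("h2")
    published = article.find("time")
    if title and published:
        rows.append((title.get_text(strip=True), published.get_text(strip=True)))

# Persist the scraped rows so the rest of the pipeline has data to work with.
conn = sqlite3.connect("articles.db")
conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, published TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)
conn.commit()
conn.close()
```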
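And for the training and evaluation bullets, here is a minimal scikit-learn sketch of the habit I’m describing: fit one model, cross-validate it, and look at a basic interpretability signal (here, feature importances) before jumping to the next model. The dataset and hyperparameters are purely illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A small built-in dataset so the sketch runs as-is; use your own data in practice.
data = load_breast_cancer()
X, y = data.data, data.target

# Illustrative hyperparameters -- the point is to evaluate them, not to trust them.
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)

# Cross-validation gives a far more honest picture than a single train/test split.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# A first look at interpretability: which features is the model leaning on?
model.fit(X, y)
top_features = sorted(
    zip(data.feature_names, model.feature_importances_), key=lambda t: -t[1]
)[:5]
for name, importance in top_features:
    print(f"{name}: {importance:.3f}")
```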
Model Deployment & Data Storage
A central piece of your data science project is selecting the correct data storage framework. Keep in mind that your production model will constantly be reading and updating this data. Choose the wrong storage framework and your whole app will face quality and performance issues.
One of the fastest-growing storage approaches is the data lake.
“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning—to guide better decisions.” — Amazon
Data lakes are widely used by top companies today to manage the insane amounts of data being generated. If you are interested, I suggest checking out this talk by Raji Easwaran, a manager at Microsoft Azure, on “Lessons Learned from Operating an Exabyte Scale Data Lake at Microsoft.”
There are also frameworks that sit on top of data lakes and make the data easier for machine learning models to consume. I used to think that adding these layers wasn’t worth much, but separating these operations into different layers saves you the time you would otherwise spend debugging your models in the long run. This layered approach is actually the backbone of most high-quality web applications and software projects.
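As a rough illustration of what that layering looks like in practice, here is a tiny Python sketch that reads raw files from a hypothetical data-lake path, cleans them in a separate curation step, and writes a model-ready table to a “curated” zone. The paths and column names are made up, and it assumes pandas with Parquet support (pyarrow) is installed; a real setup would point at an object store and a proper lake framework rather than local folders.

```python
import pandas as pd

# Hypothetical lake layout: raw CSV dumps land in one zone, curated,
# model-ready Parquet files live in another. In production these would
# typically be object-store paths (e.g. an S3 bucket), not local folders.
RAW_PATH = "lake/raw/events.csv"
CURATED_PATH = "lake/curated/events.parquet"

# Raw zone: data is stored as-is, exactly as it arrived.
raw = pd.read_csv(RAW_PATH)

# Curation layer: the cleaning lives here, not inside the model code,
# so a data problem can be debugged without touching the model.
curated = (
    raw.dropna(subset=["user_id"])                                   # drop rows without a user
       .assign(event_time=lambda df: pd.to_datetime(df["event_time"]))
       .drop_duplicates()
)
curated.to_parquet(CURATED_PATH, index=False)

# The model training code only ever sees the curated layer.
features = pd.read_parquet(CURATED_PATH)
print(features.head())
```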
Final Thoughts
The biggest misconception I had going into data science was that it’s all about model fitting and data engineering. Although that is, of course, an important part, it’s not the most difficult or significant one. As discussed above, multiple factors come into play when getting into data science and developing high-quality ML projects.