A long-term Data Science roadmap which WON’T help you become an expert in only several months
Some thoughts on becoming a data scientist. It isn’t easy or fast and it requires a lot of effort, but if you are interested in data science, it is worth it.
[Andrew Lukyanenko](https://artgor.medium.com/?source=post_page-----4436733e63ff--------------------------------) · Dec 2, 2018 · 7 min read
From time to time I am asked: how does one become a data scientist? What courses are necessary? How long will it take? How did you become a DS? I have answered these questions several times, so writing a post seemed like a good way to help aspiring data scientists.
About me
I got a master’s degree from the MSU Faculty of Economics (Moscow, Russia) and worked for ~4 years as an analyst/consultant in ERP system implementation. The work involved talking with clients, discussing and formalizing their needs, writing documentation, explaining tasks to programmers, testing the results, organizing projects, and many other things.
But it was a stressful job with lots of problems. More importantly, I did not really like it. Most of the work was not inspiring, though I liked working with data. So, in the spring and summer of 2016, I started looking for something else. I got a Green Belt in Lean Six Sigma, but there were no opportunities nearby. One day I found out about Big Data. After a couple of weeks of googling and reading numerous articles, I realized that this could be my dream career.
I left my job and 8 months later got my first position as a data scientist in a bank. Since then I have worked in a couple of companies, and my passion for data science is still strong. I have completed several courses in ML and DL, built several projects (such as a chat-bot and a digit recognizer app), taken part in many ML competitions and activities, won three silver medals on Kaggle, and so on. So I have some experience both with studying data science and with working as a data scientist. Of course, I still have a lot to learn and many skills to acquire.
Disclaimer
This article contains my opinions. Some people may disagree with them and I want to point out that I do not want to offend anyone. I think that anyone wishing to become a data scientist must invest a lot of time and effort in it or they will fail. Courses or MOOCs claiming that you can become an expert in ML/DL/DS in several weeks or months are not entirely truthful. You can get some knowledge and skills within weeks/months. But without extended practice (which is not a part of most courses) you will not prevail.
You do need internal motivation, but, more importantly, you need discipline, so that you keep working after the motivation has gone.
Let me repeat: you need to do things by yourself. If you ask the most basic questions without even trying Google/StackOverflow or thinking for a couple of minutes, you will never be able to catch up with the professionals.
In most of the courses which I took, only around 10–20% of people completed them. Most of those who dropped out did not have dedication or patience.
Who is a Data Scientist?
There are many pictures showing a data scientist’s core skills. For the purposes of this post any of them will do, so let us look at this one. It shows that you need Math & Stats, Programming & DevOps, Domain knowledge, and Soft skills.
That’s a lot! How is it possible to know all of this? Well, it really takes a lot of time. But here is the good news: it is not necessary to know everything.
There was an interesting talk at Yandex on 21 October 2018. The speaker said that there are many types of specialists who have different combinations of the aforementioned skills.
Data scientists are supposed to be in the middle, but in fact they can be in any part of the triangle, with a different level in each of the three spheres.
In this article I will talk about data scientists as they are usually assumed — those who can talk with customers, perform analysis, build models and deliver them.
Switching careers? This means you already have something!
Some people say that switching careers is quite difficult. While that is true, having a career to switch from means you already know something. Maybe you have experience with programming & DevOps, maybe you have worked in a math/stats-heavy field, or maybe you have honed your soft skills every day. At a bare minimum you have expertise in your domain. So always try to play to your strengths.
Roadmap from Reddit
In fact there will be two roadmaps :)
The first one is from Reddit, from this topic:
First, read fucking Hastie, Tibshirani, and whoever. Chapters 1–4 and 7–8. If you don’t understand it, keep reading it until you do.
You can read the rest of the book if you want. You probably should, but I’ll assume you know all of it.
Take Andrew Ng’s Coursera. Do all the exercises in python and R. Make sure you get the same answers with all of them.
Now forget all of that and read the deep learning book. Put tensorflow and pytorch on a Linux box and run examples until you get it. Do stuff with CNNs and RNNs and just feed forward NNs.
Once you do all of that, go on arXiv and read the most recent useful papers. The literature changes every few months, so keep up.
There. Now you can probably be hired most places. If you need resume filler, do some Kaggle competitions. If you have debugging questions, use StackOverflow. If you have math questions, read more. If you have life questions, I have no idea.
And from one of the comments:
Still not enough. Come up with a novel problem where there’s no training data and figure out how to collect some. Learn to write a scraper, then do some labeling and feature extraction. Install everything on EC2 and automate it. Write code to continuously retrain and redeploy your models in production as new data becomes available.
Although it is short, harsh, and very difficult, this guide is great and it will get you to a hireable level.
Of course there are many other paths into data science, so I will offer mine. It is not perfect, but it is based on my experience.
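Note: The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman), the book referenced in the Reddit roadmap above, is available at: https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print10.pdf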
My Roadmap
There is one skill which will get you very far. If you do not have it yet, I urge you to develop it. This skill is… formulating thoughts, searching for information, finding it, and understanding it. Seriously! Some people cannot formulate their thoughts, some are unable to find solutions to the most basic questions, and some do not know how to write a proper Google query. This is a basic and necessary skill, and you must perfect it!
- Choose a programming language and study it. Usually it would be Python or R. I highly recommend choosing Python. I won’t list the reasons, because there are plenty of R vs. Python arguments out there already; I personally think Python is more versatile and useful. Spend 2–4 weeks learning the language, so that you can do basic things. Get a general understanding of the commonly used libraries, such as pandas/matplotlib or tidyverse/ggplot2.
- Go through the ML course by Andrew Ng. It is old, but it gives a great foundation. It could be useful to complete the tasks in Python/R, but it is not necessary.
- Now take one more good course in ML (or a couple of them). For R users I recommend Analytics Edge; for Python users, mlcourse.ai. If you know Russian, this course on Coursera is also great. In my opinion mlcourse.ai is the best among these three. Why? It provides good theory and some tough assignments, which could already be enough. However, it also teaches people to take part in Kaggle competitions and build standalone projects. This makes it great for practice.
- Study SQL. In most companies data is kept in relational databases, so you will need to be able to retrieve it. Get comfortable with SELECT, GROUP BY, CTEs, joins, and so on.
- Try to work with raw data to get the experience of working with dirty datasets.
- While the previous point may not be necessary, this one is mandatory: finish at least one or two complete projects. Perform a detailed analysis and modelling of some dataset, or create an app, for example. The main thing to learn is how to come up with an idea, plan its implementation, get the data, work with it, and bring the project to completion.
- Go to Kaggle, study kernels and take part in competitions.
- Join a good community. I joined ods.ai, a community of 15k+ active Russian data scientists (by the way, this community is open to data scientists from any country), and it helped me a lot.
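To make the SQL item above a bit more concrete, here is a minimal sketch of the kind of query it has in mind: a CTE, a join, and a GROUP BY in one statement. It uses Python’s built-in sqlite3 module so it runs without any setup. The `customers` and `orders` tables, their columns, and the sample rows are all made up for illustration; a real company database will look different.

```python
import sqlite3

# Hypothetical toy schema: table names, columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Moscow'), (2, 'Moscow'), (3, 'Kazan');
    INSERT INTO orders VALUES
        (1, 1, 100.0), (2, 1, 50.0), (3, 2, 30.0), (4, 3, 70.0);
""")

query = """
    WITH customer_totals AS (          -- CTE: total spend per customer
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT c.city, COUNT(*) AS n_customers, SUM(t.total) AS revenue
    FROM customers AS c
    JOIN customer_totals AS t ON t.customer_id = c.id   -- join CTE to customers
    GROUP BY c.city                                     -- aggregate per city
    ORDER BY revenue DESC
"""

rows = conn.execute(query).fetchall()
for city, n_customers, revenue in rows:
    print(city, n_customers, revenue)

conn.close()
```

If you can read this query and predict what it prints before running it, you probably have enough SQL for a first job; if not, this is a good pattern to practice on your own data.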
The Russian-language Coursera course mentioned above: https://www.coursera.org/specializations/machine-learning-data-analysis
Studying Deep Learning is a completely different topic.
This is only the beginning. Following this roadmap (or doing something similar) will help you start your journey to becoming a data scientist. The rest is up to you!
Now you can also read this blog post on my website here.