A long-term Data Science roadmap which WON’T help you become an expert in only several months
Some thoughts on becoming a data scientist. It isn’t easy or fast and requires a lot of effort, but if you are interested in data science, it is worth it.
[Andrew Lukyanenko](https://artgor.medium.com/?source=post_page-----4436733e63ff--------------------------------)
Dec 2, 2018 · 7 min read
From time to time I am asked: how does one become a data scientist? What courses are necessary? How long will it take? How did you become a DS? I have answered these questions several times, so it seems to me that writing a post could be a good way to help aspiring data scientists.
About me
I got a master’s degree at the MSU Faculty of Economics (Moscow, Russia) and worked for ~4 years as an analyst/consultant in ERP system implementation. It involved talking with clients, discussing their needs and formalizing them, writing documentation, explaining tasks to programmers, testing the results, organizing projects and many other things.
But it was a stressful job, with lots of problems. More importantly, I did not really like it. Most of the work was not inspiring, though I liked working with data. So, in spring and summer 2016 I started looking for something else. I got a Green Belt in Lean Six Sigma, but there were no opportunities nearby. One day I found out about BigData. After a couple of weeks of googling and reading numerous articles I realized that this could be my dream career.
I left my job and 8 months later got my first position as a data scientist in a bank. Since then I have worked in a couple of companies, but my passion for data science is still strong. I have completed several courses in ML and DL, built several projects (such as a chat-bot or a digit recognizer app), taken part in many ML competitions and activities, won three silver medals on Kaggle and so on. Thus, I have some experience with studying data science and working as a data scientist. Of course, I still have a lot of things to learn and a lot of skills to acquire.
Disclaimer
This article contains my opinions. Some people may disagree with them and I want to point out that I do not want to offend anyone. I think that anyone wishing to become a data scientist must invest a lot of time and effort in it or they will fail. Courses or MOOCs claiming that you can become an expert in ML/DL/DS in several weeks or months are not entirely truthful. You can get some knowledge and skills within weeks/months. But without extended practice (which is not a part of most courses) you will not prevail.
You do need internal motivation, but, more importantly, you need discipline, so that you keep working after the motivation has gone.
Let me repeat: you need to do things by yourself. If you ask the most basic questions without even trying Google/StackOverflow or thinking for a couple of minutes, you will never be able to catch up with the professionals.
In most of the courses which I took, only around 10–20% of people completed them. Most of those who dropped out did not have dedication or patience.
Who is a Data Scientist?
There are many pictures showing a data scientist’s core skills. For the purposes of this post any of them will do, so let us look at this one. It shows that you need Math & Stats, Programming & DevOps, Domain knowledge and Soft skills.
That’s a lot! How is it possible to know all of this? Well, it really takes a lot of time. But here is the good news: it is not necessary to know everything.
There was an interesting talk on 21 October 2018 at Yandex. It argued that there are many types of specialists, with different combinations of the aforementioned skills.
Data Scientists are supposed to be in the middle, but in fact they can be in any part of the triangle, with different levels in each of the three spheres.
In this article I will talk about data scientists as they are usually assumed — those who can talk with customers, perform analysis, build models and deliver them.
Switching careers? This means you already have something!
Some people say that switching careers is quite difficult. While that is true, having a career to switch from means you already know something. Maybe you have experience with programming & devops, maybe you have worked in a math/stats-heavy sphere, or you have honed your soft skills every day. At a bare minimum you have expertise in your domain. So always try to use your strengths.
Roadmap from Reddit
In fact there will be two roadmaps :)
The first one is from Reddit, from this topic:
First, read fucking Hastie, Tibshirani, and whoever. Chapters 1–4 and 7–8. If you don’t understand it, keep reading it until you do.
You can read the rest of the book if you want. You probably should, but I’ll assume you know all of it.
Take Andrew Ng’s Coursera. Do all the exercises in python and R. Make sure you get the same answers with all of them.
Now forget all of that and read the deep learning book. Put tensorflow and pytorch on a Linux box and run examples until you get it. Do stuff with CNNs and RNNs and just feed forward NNs.
Once you do all of that, go on arXiv and read the most recent useful papers. The literature changes every few months, so keep up.
There. Now you can probably be hired most places. If you need resume filler, do some Kaggle competitions. If you have debugging questions, use StackOverflow. If you have math questions, read more. If you have life questions, I have no idea.
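To make the "make sure you get the same answers" advice concrete, here is a minimal sketch of what such a cross-check might look like. It is my own illustration, not part of the Reddit post or of any course: the toy data and the function names `fit_closed_form` and `fit_gradient_descent` are made up, and it compares two methods in plain Python rather than Python against R. The idea is the same, though: solve one small exercise in two independent ways and confirm that the results agree.

```python
# Hypothetical toy data (made up for this sketch): y is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]


def fit_closed_form(xs, ys):
    """Least-squares fit of y = w*x + b using the textbook formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w_cf, b_cf = fit_closed_form(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)
print("closed form:      w=%.4f b=%.4f" % (w_cf, b_cf))
print("gradient descent: w=%.4f b=%.4f" % (w_gd, b_gd))

# The two independent implementations should agree closely.
assert abs(w_cf - w_gd) < 1e-6
assert abs(b_cf - b_gd) < 1e-6
```

If the two answers disagree, one of the implementations has a bug or the optimizer has not converged, and tracking down which is where most of the learning happens.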
And from one of the comments:
Still not enough. Come up with a novel problem where there’s no training data and figure out how to collect some. Learn to write a scraper, then do some labeling and feature extraction. Ins