In my recent post, 5 Years of Data Science – Thoughts on hard starts, overrated comments, and garbage data, I talked about 5 things that I learned in my first 5 years of data science.
However, I was curious about what you learned. Hence, I turned to Reddit’s /r/datascience to ask others about their 5 major learnings from their first 5 years of data science.
The thread was fairly popular with over 50,000 views. I filtered through the many lists posted in the comments, and distilled my personal top 5 of your learnings.
A short note: I tried to reach out to the users mentioned in this post to be able to credit them properly, e.g., with a link to their Twitter or GitHub. If you know one of these users, or are one yourself, and want to be credited, please reach out: email@example.com.
1. When to fire your stakeholder
If your stakeholder cares about a project less than you, stop working on that project.
In corporate settings, you often find yourself working in a data science department, which does projects for specific stakeholders in the company. The same holds true for many other scenarios, e.g., in academia, or as an independent consultant.
You should only work on a project if your stakeholder cares about it at least as much as you do. We might condense this statement to: “Your stakeholder needs to have skin in the game”.
If that is not the case, bad things can happen. They might not provide you with adequate metadata, give you an inaccurate description of their needs, or simply waste resources on an ill-constructed problem.
Not every project is going to save the day. In fact, only a fraction of them do.
But working on something that even the project’s owner thinks is not worth their attention is a waste of everybody’s time.
In his podcast People I (Mostly) Admire, Steve Levitt often elaborates on the skill of quitting. This ability is particularly relevant for data scientists: add “quitting a project when its stakeholder does not care” to your repertoire.
A bonus: Don't build titanium bridges.
Interestingly, a user who identifies as a “stakeholder” adds:
“Don’t build titanium bridges”. Or in other words: Don’t overdo it.
Big problems require big solutions, but small problems only require small ones.
Don’t build a neural network if counting is enough.
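As a rough illustration of what “counting” can mean, the sketch below builds a baseline that always predicts the most frequent label. The labels and the function name `majority_baseline` are made up for this example. The idea is only that a fancier model should clearly beat such a count before it is worth building.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels, invented for illustration only.
labels = ["no_churn"] * 8 + ["churn"] * 2
label, accuracy = majority_baseline(labels)
print(label, accuracy)
```

If a complex model barely improves on this number, the small solution is probably the right one.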
2. Aristotle was a data scientist
Sending a Product Owner/Manager a notebook isn't a proper way to share your insights.
Aristotle would have never given a Jupyter Notebook to a PowerPoint audience.
According to the Greek philosopher, you persuade people with Logos, Ethos, and Pathos: Content, Credibility, and (Knowledge of your) Customer (poor alliteration is mine).
No matter how insightful your results (Logos), and how many accolades you collected (Ethos), when you hand a Jupyter Notebook to somebody who has never written code before, you disrespect Pathos.
Be more persuasive by meeting people where they are.
The great news is, with a tool like Quarto, you can transform your beloved Jupyter Notebook into PowerPoint slides for the marketing team, an interactive website for the customer, or into a print-out for your boss.
Great data scientists use their skill-set to honor Pathos and meet people where they are.
3. Truly understanding your data is hard
Training models is easy for many, cleaning and understanding the data is not, and I've seen many 'experienced' data scientists get tripped up with that.
This relates to a point from my original article: Exciting data makes for an exciting project. Acquiring data that is hard to acquire, and cleaning data that is not only dirty but filthy, sets you apart.
However, really understanding the data at hand adds a new dimension.
Tabular, image, video, audio, network, genomic, or text data all look vastly different, and it’ll take time just to figure out what the rows and columns of your data actually represent.
Going one step further and truly understanding how the data was collected, what was measured, and which information can be extracted takes time, but is invaluable.
4. How do we ask better questions?
Getting a mediocre answer to the right question >>>>> getting a great answer to the wrong question.
We are trying to answer a question with data analysis. If we assume that we ask the perfect question, all that is left to do is to give the most precise answer.
However, what if our question is actually wrong?
How do you ask better questions? On this topic, I found some articles, e.g., this one by Admond Lee on Medium. However, to me, it still seems to be an ongoing debate.
One starting point can be to note down the facts of your task and really think about your problem:
- What data do I have?
- Can I get more data?
- How much time do I have?
- What is the given goal of the project?
- What is the real goal of the project?
Down the road, our answers will get better and better anyway, as computing resources improve, and the field of machine learning invents even more “precise” methods.
The bottleneck is the problem itself: If we ask the wrong question, we end up with a precise, but wrong, answer.
Would we rather shoot at the right target imprecisely, or precisely at the wrong target?
The latter makes for a fancier presentation, but the former is the honest choice.
I highly recommend this post by Roger Peng, Tukey, Design Thinking, and Better Questions, if you want to learn more about the thought process behind “asking better questions”. His graph “Strength of Evidence vs. Quality of Question” is especially insightful.
5. Professionals do unit tests
Write as many tests as you can. Don't take things for granted, even if it looks simple.
In the original post, I talked about how coding makes you more professional: Unit tests make you professional at writing code.
What is a unit test?
A unit test checks that one small piece of code (a “unit”, such as a single function) does what you expect. Often overlooked in data science, unit tests become crucial when moving a model to the production stage, and can save your data analysis in a research context.
In my experience, data science departments shove this responsibility onto the “real programmers” who implement your prototype as a production version. However, there is no reason for that. Unit tests are not complicated, and are best implemented by the person writing the code in the first place.
assert a == b, "A is not equal to B: There is a problem!"
There you go, a unit test.
Of course, these can become more complex and there are even dedicated packages for implementing these, e.g., pytest. But ultimately, it comes down to a statement like the one above: Check that a condition is true, otherwise notify me.
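Here is a slightly larger sketch of the same idea, written the way pytest would collect it: a function whose name starts with `test_`. The function `drop_negative_ages` and its data are hypothetical and only serve as an example.

```python
def drop_negative_ages(ages):
    """Keep only plausible (non-negative) ages."""
    return [age for age in ages if age >= 0]


def test_drop_negative_ages():
    cleaned = drop_negative_ages([34, -1, 51, 0])
    assert cleaned == [34, 51, 0], "Negative ages were not removed!"


# pytest would discover and run the test function on its own;
# calling it directly works as well.
test_drop_negative_ages()
```

Saved in a file such as `test_cleaning.py` (a made-up name), running `pytest` in that folder should pick up the test and report a failure if the assertion breaks.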
Ideally, you write your unit test even before implementing the respective code:
- Which result do I want?
- Write a unit test that checks for this result
- Write code that passes this unit test
By doing so, you already have a unit test which catches many bugs in your code. As long as the test is “good”, your code will do what you want it to do.
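The three steps listed above might look like this for a hypothetical function `share_missing`, which should return the fraction of missing values in a column. The name, the use of `None` for missing values, and the example data are assumptions made for illustration.

```python
# Step 1: decide on the result. Two of four values are missing -> 0.5.
# Step 2: write the test first.
def test_share_missing():
    assert share_missing([1.0, None, 3.0, None]) == 0.5
    assert share_missing([]) == 0.0  # edge case: an empty column


# Step 3: write code that passes the test.
def share_missing(values):
    """Return the fraction of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


test_share_missing()
```

Writing the empty-column case into the test up front is the kind of decision that is easy to forget once the implementation already exists.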
Thank you very much for these submissions! They got me thinking a lot, and I hope that they point others towards new ideas as well.