
5 years of Data Science - Thoughts on hard starts, overrated comments, and garbage data

In this post, I look back at my first 5 years in data science and some of the most important things I learnt. I answer three questions: why learning a second programming language is easier than the first, how you can improve your code documentation, and why hard-to-get data is the real deal.

5 years ago, in 2017, I had my first contact with data science as a bachelor’s student in a beginner tutorial for R. Coincidentally, this was also the year that David Donoho published the paper that gave this post its name, “50 Years of Data Science”. Now, towards the end of my master’s in Data Science in Business and Economics at the University of Tübingen, it is a good time to look back and identify 5 important learnings from the past 5 years.

1. The start is hard

Learning a programming language is hard, at least when it is your first one. After slogging through the first hours of learning R, I got frustrated. I thought that learning a programming language was similar to learning a foreign language, and that you would need to start from scratch every time you tried a new one.

While you always have to read up on a new language’s syntax, studying the basic concepts is a one-time deal. Almost every programming language relies on similar constructs, such as for-loops, functions, and data types. In a way, learning to code is more similar to learning math.

When we learned math in elementary school, we had to go through painful worksheets of addition, subtraction, and division. The work of training our brain to perform these mathematical operations was hard. However, in high school, when we dealt with vector multiplication, we did not have to start from scratch: We already knew what it meant to multiply, we just had to learn how to apply this concept to vectors.

These transfers are even easier for programming languages: once you have figured out for-loops in R, you can also write them in Python, Julia, or Matlab. Learning your first programming language has a huge upfront cost, but this fixed cost gets lower with every additional language. After you have made this initial investment, improving your coding skills is a breeze, because the instant reward of getting it right is addictive.

2. Economies of scale

Automation is important. Catch yourself when you shy away from putting in the extra work, because it will benefit you down the line. View your computer as an employee, and let it take as much work off your shoulders as possible.

Start off with one “minimal” line of code: a single example that performs the operation with a minimal amount of flexibility, such as one hard-coded set of arguments on a data frame. In a second step, turn the hard-coded arguments into parameters (i.e., pass them in through variables). Then go one step further and automate the repetitive operations with a loop or a function.
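The three steps above can be sketched in a few lines. This is only an illustration in Python using the standard library: the toy data, the column names, and the helper `column_mean` are made up for this example, and a plain list of dictionaries stands in for a data frame.

```python
from statistics import mean

# Hypothetical toy data standing in for a data frame (invented for this sketch).
rows = [
    {"group": "a", "income": 10.0, "age": 30},
    {"group": "a", "income": 20.0, "age": 40},
    {"group": "b", "income": 30.0, "age": 50},
]

# Step 1: one "minimal" line with everything hard-coded.
mean_income = mean(row["income"] for row in rows)

# Step 2: turn the hard-coded argument into a variable.
column = "income"
mean_column = mean(row[column] for row in rows)


# Step 3: wrap the operation in a function and repeat it over several columns.
def column_mean(data, column):
    """Return the mean of one numeric column in a list of row dictionaries."""
    return mean(row[column] for row in data)


summary = {col: column_mean(rows, col) for col in ["income", "age"]}
print(summary)
```

Adding another column to the summary now means adding one name to the list, not copying and editing a line of code.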

However, there are some caveats. Keep in mind what the goal of your code is! Do you want to publish the code as documentation for your research? Then a well-commented “line-by-line” script, or even a notebook, might be more appropriate than a highly modularized version. In contrast, if you want to publish your code as an installable package, there is no way around modularization.

Keeping economic principles in mind when writing code is helpful. Marginal returns and network effects show up in data science as well. Think critically about whether a chunk of code is worth the effort, and whether others can benefit from it. Writing a whole package for a simple operation is overkill, while code that cannot be shared loses its appeal.

3. Documentation > Comments

A common rule for documentation refers to comments (green) and code (black): “Great code has as much green as black”.

I think this statement should be more general:
“Great code has as much documentation as necessary”.

Over the last 5 years, I learnt that “great documentation” goes beyond the text behind the “#” or “%”. It includes any attempt to make your code more understandable to future users (including yourself). Creating an online vignette, writing roxygen comments for your functions, or adding tooltips to your dashboard: documentation can take different shapes, and all of them can be necessary.
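To make the difference between a comment and documentation concrete, here is a small sketch. It uses a Python docstring as a rough counterpart to roxygen comments in R. The function `standardize` and its behaviour are invented for this example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale numbers to mean 0 and (population) standard deviation 1.

    Args:
        values: A non-empty list of numbers that are not all identical.

    Returns:
        A list of floats with the same length as ``values``.

    Raises:
        ValueError: If all values are identical, so the spread is zero.
    """
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("cannot standardize values that are all identical")
    return [(value - center) / spread for value in values]


# The docstring travels with the function and is available to every user.
print(standardize.__doc__)
```

Unlike an inline comment, the docstring is shown by `help(standardize)` and by most editors, so users never have to open the source file to learn what the function expects.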

4. Be a professional

Writing a script and working like a programmer are two different things.

Many social scientists know how to write some code in statistical software packages like Stata, SPSS, or R. However, only a small group of these people follows the workflows of “real” programmers.

Version control, vectorization, and cluster computing are all useful for researchers, even though they appear to be exclusive to “professionals”. For example, git allows you to work with colleagues on the same code, documents the changes to your code, and protects you against losing your work. Or take cluster computing: suddenly, you are able to run data analyses that would have melted your laptop. These tools improve your workflow and enable you to conduct research that would be impossible with a “standard” skillset.

Writing code goes beyond statistics. It also helps you to communicate like a professional. Moving away from Word, PowerPoint, and Excel can be the first step towards setting your work apart. I started using LaTeX for my CV, wrote my master’s thesis in R, and used scripts to generate online surveys.

Programming goes beyond statistics: it makes you more professional.

5. Interesting I/O

We often hear the mantra: "garbage in, garbage out". How do we deal with it?

Platforms like Kaggle offer large amounts of public datasets and are invaluable resources. However, I think that these datasets do not make for interesting research.
The reason is that the internet is a large and fast space: if there is something interesting in such a dataset, somebody has probably already written about it. Often, the posts below the dataset already contain a great analysis.

What we are looking for is data that does not cost money but is also not “free”.

In my experience, APIs and web scraping already pose a barrier that scares off many people. The harder it is to get the data (or the harder it is to clean it), the fewer people have gone through the pain of analyzing it, and the more interesting your project will be.

Hence, work on your skills in data wrangling, API querying, and web scraping, as they will give you an advantage when it comes to acquiring and transforming data.
Use data that evokes a “Wow, how did you even get that?”.
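As a rough illustration of the wrangling part, the sketch below flattens a nested API response into a table. Everything in it is an assumption made for this example: the payload, the field names, and the helper `flatten` are invented, and a hard-coded string stands in for the text that an actual HTTP request would return.

```python
import csv
import io
import json

# Hypothetical API response (invented for this sketch); in a real project
# this text would come from an HTTP request to the API.
payload = """
{"results": [
  {"id": 1, "name": "station_a", "reading": {"value": 3.5, "unit": "mm"}},
  {"id": 2, "name": "station_b", "reading": null}
]}
"""


def flatten(record):
    """Turn one nested record into a flat row, tolerating a missing reading."""
    reading = record.get("reading") or {}
    return {
        "id": record["id"],
        "name": record["name"],
        "value": reading.get("value"),
        "unit": reading.get("unit"),
    }


rows = [flatten(record) for record in json.loads(payload)["results"]]

# Write the flat rows as CSV; missing values become empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "value", "unit"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Most of the effort in real projects goes into cases like the second record, where a field is missing or empty, which is exactly the pain that keeps other people away from the data.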

In the spirit of this article, I hope that the next 5 years bear even more learnings, and that some of the points listed here resonate with you.

