Good data scientists care about the quality of the software they write. This is something my past experiences have led me to firmly believe. Coming from an undergraduate degree in Systems Design Engineering, I highly value building robust systems, especially software systems that answer tough data questions. Prior to joining Forecast, I worked in a variety of positions spanning data science, data engineering, and software engineering. To do all three of these well, you need to be able to program effectively.
Programming is a tool that solves a variety of challenges, and like any complex tool, how best to use it depends on the problem at hand. Still, some key ideas can make any software project more successful. Three undervalued ideas in data science are modular code, testing, and readability. Proficiency in these won't improve your model results, but it can multiply the impact of your work.
Data science work often starts as a single script trying to solve the entire problem. These projects very quickly become a mess if they lack organisation. I recommend breaking your code into logical, independent modules as soon as possible, and using existing frameworks where you can, as they often prescribe a good way to do this. Modular code takes marginally more time to write but helps you structure your thoughts, isolate errors, and reuse your software. Structuring your code from the start avoids time-consuming rewrites later.
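As a sketch of what this looks like in practice, here is a hypothetical analysis broken into small, independent functions instead of one monolithic script. The function names, the stubbed data, and the `sales.csv` path are all made up for illustration; the point is the structure, where each step can be tested, debugged, or reused on its own.

```python
def load_data(path: str) -> list[dict]:
    """Read raw records. A stub returning fixed rows stands in for real I/O here."""
    return [{"city": "Copenhagen", "sales": 120}, {"city": "Aarhus", "sales": 80}]

def clean(records: list[dict]) -> list[dict]:
    """Drop rows with missing or non-positive sales."""
    return [r for r in records if r.get("sales", 0) > 0]

def summarise(records: list[dict]) -> int:
    """Aggregate the cleaned records into a single total."""
    return sum(r["sales"] for r in records)

def run_pipeline(path: str) -> int:
    """Compose the independent steps into the full workflow."""
    return summarise(clean(load_data(path)))

print(run_pipeline("sales.csv"))  # → 200
```

If a number looks wrong in the output, you can now check each stage in isolation rather than rereading one long script top to bottom.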
Trust in your results is the most important thing you can have, and every piece of code between your raw data and your output is a place to break that trust. Inevitably you will write software with logical errors in it; the best way to safeguard against this is to write tests. Tests are small snippets of code that assert some truth about your software. Given a fixed input, your function should give a fixed result, and you can codify this. You can also test your data to make sure certain assumptions about it hold. The more aspects that are tested, the faster you can move while staying confident in your results.
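A minimal sketch of both kinds of check, using plain `assert` statements so it runs without any test framework. The `normalise` function and the sample `records` are hypothetical; the first assertion codifies a fixed input giving a fixed output, the second checks an assumption about the data itself.

```python
def normalise(values: list[float]) -> list[float]:
    """Scale values so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise an all-zero input")
    return [v / total for v in values]

# Unit test: a fixed input must produce a fixed, known output.
assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]

# Data test: assert an assumption about the raw data, not just the code.
records = [{"sales": 120}, {"sales": 80}]  # stand-in for real raw data
assert all(r["sales"] >= 0 for r in records), "sales should never be negative"

print("all checks passed")
```

In a real project these would live in a test file and run automatically, so a broken assumption fails loudly the moment it appears rather than silently corrupting a result downstream.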
Finally, your code should be readable, even if you think you will be the only person to use it. Rushed work often becomes unreadable after a break from writing it. Documentation, in the form of a readme and comments, is a good way to add context to your code. However, it takes additional time and should not be used as a stopgap for confusing logic. By using meaningful variable names and commonly used patterns, your code should be clearly legible to anyone. Modularisation in the form of functions helps again here, as it forces you to name your logical chunks. If you care about the continued maintenance of your code, whether by you or someone else, you should care about readability.
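To illustrate, here is the same (hypothetical) filtering logic written twice. Both versions behave identically; only the names differ, yet the second needs no comment to explain itself.

```python
# Hard to follow: cryptic names and an inline magic number.
def f(d):
    return [x for x in d if x[1] >= 18]

# Clear: the names themselves document the intent.
ADULT_AGE = 18

def filter_adults(people: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only (name, age) pairs at or above the adult age threshold."""
    return [person for person in people if person[1] >= ADULT_AGE]

people = [("Ada", 36), ("Linus", 12)]
print(filter_adults(people))  # → [('Ada', 36)]
```

Extracting the logic into a named function also pays the modularity dividend from earlier: `filter_adults` is now a named, testable chunk rather than an anonymous expression buried in a script.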
At Forecast we aim to deliver high quality work quickly and without errors. Given this, these ideas are even more important. The scope or goal of a project can change, so we need modular code to adapt. Scrutinizing every data point or intermediate result isn’t feasible, so we need tests to ensure we don’t make mistakes. Our clients are the ones that use our work, so our code needs to be understandable. Attempting to move fast by skimping on these principles will just slow things down in the long run.
Continuing at Forecast, I hope to apply these three ideas to my work, building on my past experiences to solve exciting data problems. There are many things a data scientist should care about. Software engineering principles should certainly be one of them.