Data scientists: What is your job?
It might not be what you expect it to be, depending on where you work.
Introduction
Being a data scientist has become one of the coolest jobs of our time. And it sounds pretty good, doesn’t it? Use cutting-edge algorithms to make sense of huge amounts of data. Build models that are highly valued and highly valuable.
But sometimes the reality does not live up to the hyperbole that we feed to new and aspiring data scientists. A quick DuckDuckGo search or a browse through data science blogs will lead you to articles with titles like ‘Why you’ll quit your data science job’ or ‘Why all your data scientists are leaving’. These articles offer helpful hints on why the actual roles do not match expectations. I’m going to dig deeper into some of these.
A data scientist’s view of the importance of a model compared with the data and other stuff.
Using the latest and greatest algorithms
There is a whole world of interesting algorithms out there and a plethora of potential neural network topologies to explore.
But your role as a data scientist in industry is not to find novel techniques to build models.
This is not your job.
If this is what you want, there are jobs that let you do this – as a researcher at a university, for example. Or you could do it entirely for fun and bragging rights, to make yourself more attractive to potential partners, or to build your personal feeling of self-worth by participating in Kaggle contests (not that there’s anything wrong with Kaggle contests). Go wild! Model from sun-up to sun-down! Contribute to open-source projects!
These are all good things. But they are not your job.
Working on a problem that will change the world
You actually might be fortunate enough to be working with an organisation where you can change the world.
I am happy you get that chance. Really. But many data scientists work in more pedestrian functions.
For many of you, changing the world is not your job.
If you do want to change the world, there are avenues that will help you do this. One example is the Good Data Institute. And again, you could contribute to open-source projects.
Working on interesting problems
Here, again, bright-eyed and eager data scientists might be disappointed to find that the problems they are working on are not particularly interesting. Should the box on this website be green or orange to get the best click-through rate from your target segment of stay-at-home dads? I put it to you that if you have this problem, the problem most likely lies with you as a data scientist.
Specifically, the problem is with what you call ‘interesting’. What interests you is not relevant.
Being interested is not your job.
If, however, what interests you is delivering valuable outcomes to your customers – irrespective of how you do it – you will find your interests better match the expectations of your role.
The data is a mess
I have read this more times than I have read ‘rinse and repeat’ on a bottle of shampoo. ‘Data scientists spend most of their time finding and cleaning the data rather than building models.’ Like it’s a bad thing. Like it is something to be avoided. Like it is something unexpected.
An alternative but advantageous viewpoint is this: one of the most valuable things a data scientist does is to develop predictive features from messy, unspecified and under-utilised data sources.
Cleaning and wrangling the data is your job.
Build these predictive features and get them out there. These are mini data products. Get people using them. Put them into your models. Send them downstream to be used in the AI modules of smart systems. Use them as event triggers for further action. Build them into your reporting and alerting tools.
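To make the idea of a mini data product concrete, here is a minimal sketch of turning messy records into one predictive feature and then reusing that feature as an event trigger. Everything in it is an assumption made for illustration: the purchase records, the field names, the two date formats, the helper functions and the 30-day threshold are all hypothetical and do not come from any real system.

```python
from datetime import date, datetime

# Hypothetical messy records: inconsistent date formats, stray whitespace,
# mixed-case IDs and unusable values. Invented purely for illustration.
RAW_PURCHASES = [
    {"customer": " C001 ", "purchased_on": "2023-01-15"},
    {"customer": "c001", "purchased_on": "15/03/2023"},
    {"customer": "C002", "purchased_on": "2023-02-01"},
    {"customer": "C002", "purchased_on": ""},
    {"customer": "C003", "purchased_on": "not recorded"},
]

# Assumed formats; a real source would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return a date for any known format, or None if the value is unusable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def days_since_last_purchase(records, as_of):
    """Build one feature per customer: days since the latest valid purchase."""
    latest = {}
    for record in records:
        customer = record["customer"].strip().upper()  # normalise the ID
        purchased = parse_date(record["purchased_on"])
        if purchased is None:
            continue  # skip rows we cannot interpret rather than guessing
        if customer not in latest or purchased > latest[customer]:
            latest[customer] = purchased
    return {customer: (as_of - day).days for customer, day in latest.items()}


features = days_since_last_purchase(RAW_PURCHASES, as_of=date(2023, 4, 1))

# The same feature doubles as an event trigger (hypothetical 30-day rule).
lapsed = sorted(c for c, days in features.items() if days > 30)
```

The cleaning steps are deliberately boring, and that is the point: the value sits in the feature that comes out, which can feed a model, a report or an alert without any change.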
Another way to look at the interestingness of problems
If you do all this, you will discover there are indeed interesting problems out there. But – you have to find the problems. And once you have found the problems, you can solve the problems.
This is a gap I see both in working as a data scientist in industry and in the content of machine learning and data science courses. There are plenty of discussions and tools for solving a well-defined problem, but far less discussion about finding the right problems to solve.
Once you solve the problem, you will need your business sponsor to start using your solution. To do that, they will need to make a change, so you will have to sell your solution to them. You will need to use techniques such as storytelling (with data), rapid prototyping and gathering user feedback.
You will need to understand, at a very fine level of detail, your sponsor’s business and technological processes. (Spending ‘a day in the life’ is an excellent way of doing this. Sit with the people doing the role where the change will happen. Ask them questions. If possible, help them or do part of their role. Everything you do will help your understanding. As a bonus, it will help you build rapport and trust.)
The real importance of a model needs to be considered in context.
If you stop and think about it, the messy data makes sense. If the problem you are working on has not been solved, then the components are not going to be in place. Of course the data is messy.
Where the problems have been solved, you may find that model building is automated – a bit like a sausage factory. Meat comes in, goes through the grinder and comes out as sausages. Business areas like marketing, credit risk and insurance pricing are often mature in their data science needs. Data comes in one end, goes through the grinder, and comes out as value-adding models. These business areas find value by incrementally improving what they already have. Their transformation has already happened.
Innovations in algorithms will not make a difference. Innovations in data will make a difference.
This is true even of the big tech firms like Amazon, Uber, Netflix and so on. (Not that I have worked at any of these.) Once the problem is defined, finding an algorithm to solve it is straightforward.
Just think: once the problem is well defined, you have structured data sets and a target against which you can model. All you need is to find a good-enough algorithm to relate the two and you have your model. You could just send it to Kaggle for a small prize. Competitors on Kaggle would build you an algorithm that solves the problem extremely well. (Although it might not be efficient, interpretable or implementable.) All for less money than an aspiring data scientist would like to be paid, I am sure.
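As a rough illustration of how little is left to do once the data is structured and the target is defined, here is a sketch of a good-enough algorithm in a few lines. It is ordinary least squares with a single feature, chosen only because it is short. The function name `fit_line` and the numbers (`visits`, `spend`) are made up, and a real problem would also need validation and more than one feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up structured data: one clean feature and one target.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(visits, spend)
predicted = slope * 6 + intercept  # prediction for an unseen value of the feature
```

The modelling step fits on one screen. The effort that made it possible, defining the problem and producing a clean feature and target, happened before any of this code was written.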
Conclusion
Data scientists, be clear on what your job is and what brings value. It is okay to want to be interested and engaged, but think about whether there are better avenues to fulfil your needs.
Expect to solve problems, expect to clean data and expect to sell your services to your business stakeholders.
That is your job.