If you do all this, you will discover there are indeed interesting problems out there. But you have to find them first. Only once you have found them can you solve them.
This is a gap I see both in my work as a data scientist in industry and in the content of machine learning and data science courses. There is a lot of discussion, and there are a lot of tools, for solving a well-defined problem, but far less discussion about finding the right problems to solve.
Once you solve the problem, you will need your business sponsor to start using your solution. To do that, they will need to make a change, so you will have to sell your solution to them. You will need techniques such as storytelling (with data), rapid prototyping and gathering user feedback.
You will need to understand, at a very fine level of detail, your sponsor’s business and technological processes. (Spending ‘a day in the life’ is an excellent way of doing this. Sit with the people doing the role where the change will happen. Ask them questions. If possible, help them or do part of their role. Everything you do will help your understanding. As a bonus, it will help you build rapport and trust.)
The real importance of a model needs to be considered in context.
If you stop and think about it, it makes sense. If the problem you are working on has not been solved, then the components are not going to be in place. Of course the data is messy.
Where the problems have been solved, you may find that model building is automated – a bit like a sausage factory. Meat comes in, goes through the grinder and comes out as sausages. Business areas like marketing, credit risk and insurance pricing are often mature in their data science needs. Data comes in one end, goes through the grinder, and comes out as value-adding models. These business areas find value by incrementally improving what they already have. Their transformation has already happened.
Innovations in algorithms will not make a difference. Innovations in data will.
This is true even of the big tech firms like Amazon, Uber, Netflix and so on. (Not that I have worked at any of these.) Once the problem is defined, finding an algorithm to solve it is straightforward.
Just think: once the problem is well defined, you have structured data sets and a target against which you can model. All you need is a good-enough algorithm to relate the two, and you have your model. You could just send it to Kaggle for a small prize. Competitors on Kaggle would build you an algorithm that solves the problem extremely well. (Although it might not be efficient, interpretable or implementable.) All for less money than an aspiring data scientist would like to be paid, I am sure.
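To make the point concrete, here is a minimal sketch of how mechanical the modelling step becomes once the data and target exist. Everything in it is an assumption for illustration: the numbers are made up, the helper names `fit_line` and `mean_abs_error` are hypothetical, and a one-feature least-squares line stands in for whatever "good-enough algorithm" you would really use. The error is measured on the same data the line was fitted to, so it shows only the mechanics, not a proper evaluation.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(preds, ys):
    """Average absolute gap between predictions and the target."""
    return mean(abs(p - y) for p, y in zip(preds, ys))


# Made-up illustrative data: one structured feature and a known target.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# The "model": relate the feature to the target.
slope, intercept = fit_line(feature, target)
model_preds = [slope * x + intercept for x in feature]

# A naive baseline that always predicts the average target.
baseline_preds = [mean(target)] * len(target)

model_mae = mean_abs_error(model_preds, target)
baseline_mae = mean_abs_error(baseline_preds, target)
print(f"baseline MAE: {baseline_mae:.2f}, fitted-line MAE: {model_mae:.2f}")
```

None of the hard work is in these lines. It went into deciding what `feature` and `target` should be in the first place.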