3 Reasons Data Science Products are Harder Than You Think They Are


“Can you throw together a quick model for me?”

Data Science is pretty hot right now. Consumer products like Facebook and Netflix have made exposure to data science fairly ubiquitous, and folks like myself love to talk about all the wonderful things data science can do.

While educating people about data science is a pursuit I see as being mostly positive, it does occasionally lead to horror stories like this one:

An executive once told me that they knew “just enough about data science to be dangerous.” It was the most terrifying thing I’ve ever heard.

Data science has the potential to create dynamic, personalized user experiences. Its power to tune output based on user behavior opens up a world of possibilities, especially since user behavior is often difficult for us to predict.

With data science, we are freed up from having to figure out what input will create our desired output. Instead, we can train an algorithm to systematically test which outputs produce optimal results for us.
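To make that idea concrete, here is a minimal sketch of letting an algorithm test which outputs work best, using a toy epsilon-greedy loop. Everything here is illustrative: the variant success rates are made up, and this is a simplified sketch of the approach, not a production recommender.

```python
import random

def epsilon_greedy(true_rates, rounds=10000, epsilon=0.1, seed=42):
    """Try each output variant, learn from observed rewards, and
    gradually concentrate on whichever one performs best."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)     # times each variant was shown
    rewards = [0.0] * len(true_rates)  # total reward per variant
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: show a random variant.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show the variant with the best estimate so far.
            arm = max(range(len(true_rates)),
                      key=lambda i: rewards[i] / counts[i] if counts[i] else 0.0)
        # Simulate the user's response (1 = success, 0 = no response).
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts, rewards

# Variant 2 has the best (hidden) success rate; the loop should find it
# without anyone ever telling it which variant is best.
counts, rewards = epsilon_greedy([0.02, 0.05, 0.11])
best = max(range(3), key=lambda i: counts[i])
print(best)
```

The point is that nobody hand-picked the winning variant: the algorithm discovered it by systematically testing outputs against real responses.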

Of course, in practicality, it’s not quite that simple.

While we all wish that these amazing innovations were just within arm's reach, requiring us only to decide to use them, often there's more to it than "Let's do some data science!"

Data science isn’t a silver bullet — it doesn’t always work


source: xkcd

This sort of interaction is incredibly common across the world of software. Things that sound simple often turn out to be far more complicated once you try to achieve them systematically.

Oftentimes, data science is striving to assess the relationship between behaviors and outcomes that logically should be related. Of course, sometimes those behaviors and outcomes turn out not to be as tightly tied as we think.

Predicting an outcome requires knowing a reasonable amount of the information that goes into making that decision. If you’re trying to predict how bad traffic is going to be tomorrow, and all you have is information about gas prices, you might find that you have only a very tiny piece of the puzzle.

Humans are incredibly complex decision engines. Many factors play into even our most simple decisions, and we often can’t accurately account for which factors those are. If we don’t know, it’s hard for our algorithms to know.

You’re limited by the quality of your data

Data quality is the bane of any ambitious data scientist’s life. The data is seriously never good enough.

What’s worse, by the time anyone realizes the data’s not good enough, it’s too late. You can’t turn back the clock and ask better questions, and so we’re often asked to make quality decisions with less-than-quality data.

Data quality is something that needs to be systematically measured and optimized, because great organizations realize that their decisions are only as good as their data. A few dimensions worth measuring:

Completeness — Is the data actually there, or are values missing?

Validity — Is the data that’s there the right kind of data?

Accuracy — Does the data reflect the real-world objects it represents?

Integrity — Are the relationships between entities consistent?

Consistency — Is data consistent across systems? Are there duplicates?

Timeliness — Is the data there when you need it?
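Several of these dimensions can be checked mechanically. Here is a minimal sketch of completeness, validity, and consistency checks over toy records; the field names, records, and rules are illustrative assumptions, not any standard schema.

```python
from datetime import date

# Toy user records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "signup": date(2020, 1, 5)},
    {"id": 2, "email": None,            "signup": date(2020, 2, 1)},
    {"id": 2, "email": "b@example.com", "signup": date(2020, 2, 1)},  # duplicate id
]

def completeness(records, field):
    """Share of records where the field is present at all."""
    return sum(r[field] is not None for r in records) / len(records)

def validity(records, field, rule):
    """Share of present values that pass a format rule."""
    present = [r[field] for r in records if r[field] is not None]
    return sum(rule(v) for v in present) / len(present)

def consistency(records, key):
    """True if the key is unique across records (no duplicates)."""
    keys = [r[key] for r in records]
    return len(keys) == len(set(keys))

print(completeness(records, "email"))                  # 2 of 3 emails present
print(validity(records, "email", lambda v: "@" in v))  # all present emails valid
print(consistency(records, "id"))                      # False: id 2 appears twice
```

Checks like these are cheap to run on every load, which is what turns "measure data quality" from an aspiration into a habit.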

There are countless reasons why your data isn’t as high quality as it could be. Perhaps you don’t have strict enough rules in place, resulting in tables being created without any consistent structure. Perhaps a lack of governance is leading to a bunch of crap making its way into your data. Or maybe you’ve been throwing stuff into a data lake for so long that you don’t know what anything means!

Data science depends on thorough testing

It's interesting how many companies invest in harnessing data science before they make a dedicated investment in their ability to test the results of their experiments. Data science is experimentation, and as a result, we need to be able to measure its results in order to really know when it's working.

Successful organizations are ones that have reached their "testing singularity": the point at which they begin to make testing a core competency. That shift often starts with a few stories of teams where a few days of testing saved months of work.

Data science can only thrive in an environment that pursues testing as a core competency. If you're attempting to create an algorithm that optimizes for clickthrough rate, you'll need to be able to measure that clickthrough rate, and you'll probably want to test multiple approaches, multiple versions of your algorithm, so that you can find the approach that best solves your problem.
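The measurement side of such a test can be sketched quite simply. Below is a toy comparison of the clickthrough rates of two algorithm variants using a two-proportion z-test; the click and view counts are made up for illustration, and real experiments need proper sample-size planning and randomization.

```python
import math

def two_proportion_ztest(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in clickthrough rate
    between a control (A) and a challenger (B)."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, p_value

# Hypothetical experiment: variant B's ranking algorithm vs. the control.
p_a, p_b, p = two_proportion_ztest(clicks_a=200, views_a=10000,
                                   clicks_b=260, views_b=10000)
print(f"CTR A={p_a:.3f}  CTR B={p_b:.3f}  p-value={p:.4f}")
```

Without a harness like this, "the new algorithm feels better" is the best anyone can say, and that is exactly the trap the testing singularity exists to avoid.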



Posted from my blog with SteemPress : https://selfscroll.com/3-reasons-data-science-products-are-harder-than-you-think-they-are/