Upstream Data Mining and Data Viz Go Hand in Hand

A worthwhile read by Enrico Bertini on “Why Visualization Cannot Afford Ignoring Data Mining and Vice Versa”.


Here are a few notable excerpts that we really enjoyed from his article (with a couple of quick illustrative sketches of our own after the list):

  • Data is full of rubbish: I repeated it several times in this blog. Data never comes for free; you have to manipulate it in order to accommodate the needs of your project. The most classical things you will need to deal with are missing values, outlier detection, normalization, aggregation, sampling, etc., but every project comes with its own bag of necessary data wrangling. Each one of these requires robust and solid techniques; it is not something you can improvise. And no matter how skilled a data visualization expert you are, you will need to borrow solid techniques from data miners; otherwise you are an amateur.

  • Humans don’t scale, machines do: There is no way to visualize a billion items. Really, believe me, there’s no way to do that effectively. If you assign every item to one single pixel (known as pixel-based visualization), which is the maximum scalability available, you will need either a huge screen or very tiny pixels. In both cases our body has limitations. With a huge screen your perception is hampered by the maximum field of view, that is, there’s no way to take in the whole screen with your eyes. With tiny pixels the human eye is limited by its maximum resolution. On the other hand, machines do scale and can crunch monstrous amounts of data. Add a number of machines to your cluster and you have more power.

  • You cannot trust black boxes. The issue of trust is very well known among data miners: the models data mining algorithms build are often arcane, and even if something seems to work, there’s no way to really understand why and how it works. Visualization has the power to narrow this gap and help model builders gain more confidence in the babies they build.

  • There’s no right answer. Data mining has a long tradition of providing tools to build models that give clear-cut answers automatically: “Should I give the loan to this customer or not?” This is fine and useful, and it has been a very successful model for data mining so far. But many of the modern inquiries on data are not so clear-cut. Data analysis is often exploratory and there’s no right answer. When mining is used for this purpose it necessarily needs a certain level of flexibility: ask a question, produce some initial results, visualize them, understand the problem better, change the parameters, use another algorithm, compare alternative results, etc. And how do you do that without visualization?
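
The first excerpt names the classic wrangling steps, so here is a purely illustrative sketch (ours, not from Bertini's article) of what they can look like in pandas on a toy table. The column names and the specific policies (median imputation, the 1.5 × IQR rule, min-max scaling, a 60% sample) are assumptions we picked for the example; real projects will choose their own.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset; the columns are invented for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales":  [120.0, np.nan, 95.0, 4000.0, 110.0],  # one missing value, one outlier
})

# Missing values: impute with the column median (one of many possible policies).
df["sales"] = df["sales"].fillna(df["sales"].median())

# Outlier detection: flag points outside 1.5 * IQR, the classic box-plot rule.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)

# Normalization: rescale sales to the [0, 1] range.
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: collapse rows into one summary per region.
by_region = df.groupby("region")["sales"].agg(["mean", "count"])

# Sampling: draw a reproducible subset when the full data is too big to plot.
subset = df.sample(frac=0.6, random_state=42)

print(df)
print(by_region)
print(subset)
```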

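To get a feel for the scale argument in the second excerpt, here is a quick back-of-the-envelope calculation. The ~100 pixels-per-inch density and the 4K panel resolution are our own assumed figures, not numbers from the article; swap in your display's specs and the conclusion barely changes.

```python
import math

# Back-of-the-envelope check on the "one item per pixel" idea.
items = 1_000_000_000      # one billion items, one pixel each
ppi = 100                  # assumed display density (pixels per inch)
inch_in_m = 0.0254         # metres per inch

area_m2 = items / ppi**2 * inch_in_m**2   # screen area needed, in square metres
side_m = math.sqrt(area_m2)               # side of a square screen with that area

pixels_4k = 3840 * 2160                   # pixels on one 4K display
displays = math.ceil(items / pixels_4k)   # how many 4K panels you would have to tile

print(f"~{side_m:.1f} m per side ({area_m2:.0f} m^2) at {ppi} PPI")
print(f"or roughly {displays} 4K displays tiled together")
```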