"Our lives are not our own. We are bound to others, past and present, and by each crime and every kindness, we birth our future." –from Cloud Atlas by David Mitchell
There’s a certain pattern most people exhibit before they arrive at a particular future. This can be the case in life, but also in purchasing decisions. And because the pattern is consistent, we can capture it and find others bound to the same path. For example, people who join a gym are also likely to purchase gym shoes, supplements, maybe a yoga mat or some limited in-home equipment; people who purchase a kitchen mixer will likely buy baking pans; people who purchase a boat may be interested in buying fishing gear; and so on. We are able to draw these assumptions about a consumer based on the mass of consumers before them who did the same thing.
But persona marketing can be a tricky science. Luckily, there’s a resource that can help us determine which attributes to select: data. And businesses have lots of it. But what if we don’t want to hand-craft everything that goes into a model? Let’s say we have hundreds of things we are trying to predict. And on top of that, let’s say there’s also a certain portion of the population with an unclear or unstructured past – a scattered purchase history. Maybe I joined a gym, bought a mixer and like to fish.
There will always be little drivers that steer us off the path. And depending on how much we steer, they can completely change the course of our future. In prediction, these little drivers are sometimes the most important attributes of a model. They might be that key, that detail, that sets us up to completely customize a customer’s experience. But because we don’t completely understand these little drivers, we can’t necessarily label them or organize them into usable, structured databases. Enter unsupervised learning. It’s the key to organizing and understanding the small drivers.
Unsupervised learning can be a powerful modeling method. It can find structure in unlabeled past and present data on its own. This saves time and energy on the human side, but at the cost of machine processing time. To stay accurate, an unsupervised model must be retrained whenever the underlying data changes. If something needs to be predicted up to the minute, then, unsupervised learning may not be the best solution. It can be time-consuming to re-collect, re-organize, and re-cleanse data – and that’s before re-fitting the model on the new data.
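To make this concrete, here’s a minimal sketch of k-means, one common unsupervised clustering algorithm, run on a hypothetical purchase-history table. The customers, the column meanings, and the cluster count are all invented for illustration – the point is that the model groups unlabeled customers on its own, and that any change to the data means running the whole loop again, which is the retraining cost described above.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: assign each point to its nearest centroid, then
    recompute the centroids, for a fixed number of rounds. If `points`
    changes, the whole loop must run again from scratch."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customers: (gym visits/week, baking purchases, fishing purchases)
customers = [(5, 0, 0), (4, 1, 0), (6, 0, 1),   # look like gym-goers
             (0, 7, 0), (1, 6, 1), (0, 8, 0),   # look like bakers
             (0, 1, 9), (1, 0, 8), (0, 0, 7)]   # look like anglers
centroids, clusters = kmeans(customers, k=3)
```

No one told the algorithm which customer is which; the grouping falls out of the purchase vectors alone. That’s the appeal – and the reason a scattered history (the gym-going, mixer-owning angler) can still land somewhere useful.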
Here are some things that can be done to facilitate retraining, often described as building inductive biases into the model:
- Label as much data as possible. This way, the machine learning from the data doesn’t have to start from scratch. For our gym-goer, for example, we’ll include everything in the first round – what time do they go to the gym? Do they eat first? Do they go with a friend?
- Get rid of useless features. Which features are useless won’t always be the same – there’s no silver bullet – but it is important to reduce your data points to a more manageable set. Once we have the larger set, we can drill down from there.
- Simplify your hypotheses. Following Occam’s razor, when multiple hypotheses fit, the simpler one is preferable. There are infinite ways to answer the question “who are these customers?” Scale it down – down even from “are they likely to buy these ten products in the next year?” Start with one question: “will someone who goes to the gym be likely to buy a yoga mat?” If you start with a simple question, it’s a lot easier to develop a model that predicts that one behavior first. Once we establish that, we move on: “of those who purchased a yoga mat, will their next purchase be gym shoes?” It’s taking your mass of data and scaling it down to one simple question to get started.
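The three practices above can be sketched together in a few lines of plain Python. The customer table, its columns, and the yoga-mat labels below are invented for illustration: we start from labeled rows, drop a feature that never varies, and fit the simplest possible hypothesis – a one-feature decision stump – for the single question “will this gym-goer buy a yoga mat?”

```python
def drop_low_variance(rows, min_var=0.01):
    """Inductive bias #2: discard features whose values barely change.
    Returns the reduced rows and the indices of the kept features."""
    keep = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        m = sum(col) / len(col)
        if sum((v - m) ** 2 for v in col) / len(col) >= min_var:
            keep.append(j)
    return [[r[j] for j in keep] for r in rows], keep

def fit_stump(rows, labels):
    """Inductive bias #3: the simplest hypothesis class -- one feature,
    one threshold. Returns (training accuracy, feature index, threshold)."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            preds = [r[j] >= t for r in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, j, t)
    return best

# Inductive bias #1: labeled data. Hypothetical gym-goers with columns
# (visits/week, classes attended, membership flag) and a known outcome.
rows = [(1, 0, 1), (2, 0, 1), (5, 3, 1), (4, 4, 1), (6, 5, 1), (1, 1, 1)]
bought_yoga_mat = [False, False, True, True, True, False]

reduced, kept = drop_low_variance(rows)   # the constant flag is dropped
acc, feat, thr = fit_stump(reduced, bought_yoga_mat)
```

In this toy table, the membership flag is identical for everyone, so it carries no signal and is discarded; the stump then answers the one simple question with a single rule of the form “visits per week ≥ threshold.”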
In the end, perhaps a cross between the two types of learning, known as semi-supervised learning, is the best approach. The greatest model may need to be formulated from the known alongside the unknown. It should be based on a hypothesis that is neither too simple nor too complex, but one that clearly gets to the root of the problem and yields a solution that is executable. There are many “little drivers” of life that we could focus on. And we might even be able to formulate better hypotheses based on these findings. But as long as we get to where we need to go, the small things might just be that – small things.
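As a closing sketch of what semi-supervised learning looks like in miniature (the customer points and labels below are invented): a couple of hand-labeled customers seed the process, and every unlabeled customer then inherits the label of its nearest labeled neighbor, closest pairs first. This is a toy version of label propagation, not a production algorithm – the known and the unknown formulated together.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def propagate(labeled, unlabeled):
    """Toy label propagation: repeatedly take the unlabeled point
    closest to any already-labeled point and give it that label,
    so labels spread outward from the few known examples."""
    labeled = dict(labeled)
    pending = list(unlabeled)
    while pending:
        _, p, lab = min((dist2(p, q), p, lab)
                        for p in pending
                        for q, lab in labeled.items())
        labeled[p] = lab
        pending.remove(p)
    return labeled

# Two hand-labeled customers seed the process; the rest are unlabeled.
# Columns: (gym visits/week, baking purchases).
seeds = {(5, 0): "gym", (0, 6): "baker"}
result = propagate(seeds, [(4, 1), (1, 5), (6, 1), (0, 7)])
```

Here the two labeled customers do the work that full labeling would otherwise require, and the unlabeled mass fills in around them – a small amount of supervision steering a mostly unsupervised process.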