A lot of what I do at Speedeon is focused on building predictive models that help our clients find people who would benefit from their product or service. As a part of that, I am often thinking about how to improve the predictions of a specific model or of our models in general. What I have learned is that the most impactful way of improving a model is to focus on the data.
If you will permit me a bit of oversimplification, there are three ways to improve a model:
- Get more data
- Get better data
- Use a better algorithm
And, in terms of importance, it is often in that order. Getting more data simply means training your model with more examples of the subject you are trying to predict. If the subject happens to be “response to a direct mail campaign,” you can get more data by running further test campaigns to collect more examples of people who respond and do not respond. If you have too few examples of responders, there is little chance you can build a model that learns how to differentiate them from non-responders.
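Before any modeling, it is worth checking whether you actually have enough positive examples to learn from. Here is a minimal sketch of that sanity check; the labels, counts, and the `min_positives` cutoff are made-up illustrations, not a real campaign or a Speedeon standard:

```python
# Hypothetical campaign outcomes: 1 = responded, 0 = did not respond.
from collections import Counter

def check_class_balance(labels, min_positives=100):
    """Summarize responder counts and flag when there are too few
    positive examples to train a useful response model."""
    counts = Counter(labels)
    responders = counts.get(1, 0)
    return {
        "responders": responders,
        "non_responders": counts.get(0, 0),
        "enough_data": responders >= min_positives,
    }

# 40 responders out of 9,640 people mailed -- a typical direct-mail rate,
# but likely too few positives; more test campaigns would add "rows".
labels = [1] * 40 + [0] * 9600
summary = check_class_balance(labels)
```

A check like this makes the "get more data" conversation concrete: if `enough_data` comes back false, the next step is another test campaign, not another algorithm.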
If getting more data means increasing the “number of rows” in your dataset, getting better data means improving the quantity and quality of the “columns”. For an analyst, the columns in a model training dataset are called variables or features. These features are the attributes that describe the subject (rows) you are analyzing. We might have enough data and use modern algorithms, but if we do not have the right attributes to differentiate responders from non-responders, we will be left with a weak model. It helps to build a mental model first, asking yourself: “What attributes should be important for this project?” You can then take inventory of the features you have available and determine whether you can proceed or should append additional attributes to your dataset.
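Appending attributes amounts to a left join: every row keeps its label, and new columns are attached by a shared key. The sketch below shows the idea in plain Python; the field names (`homeowner`, `age_range`) and ids are invented for illustration and do not reflect any actual schema:

```python
# One row per person: the "rows" of the training dataset.
subjects = [
    {"id": 1, "responded": 1},
    {"id": 2, "responded": 0},
]

# Appended attributes from another source, keyed by the same id.
demographics = {
    1: {"homeowner": True, "age_range": "35-44"},
    2: {"homeowner": False, "age_range": "25-34"},
}

def append_features(rows, extra, key="id"):
    """Left-join extra attributes onto each row; rows whose id is
    missing from `extra` simply gain no new fields."""
    return [{**row, **extra.get(row[key], {})} for row in rows]

enriched = append_features(subjects, demographics)
```

In practice this join would be done with a dataframe library or in the database, but the shape of the operation is the same: rows stay fixed, columns grow.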
Only after ensuring you have enough examples and an appropriate set of variables should you consider modifying the algorithm. There is no single algorithm that works best in all situations, so model development often requires iterating and testing different methods. However, most of the analytics problems we work on at Speedeon are classification problems on tabular (spreadsheet-like) datasets, where we need to predict the likelihood that someone with a given set of attributes belongs to a particular class (such as responder or non-responder). We have found that algorithms built from a collection of simpler decision-tree models work well on this type of data: they produce flexible models that capture details simpler models may miss, while still offering stability and the ability to generalize to new data.
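The "collection of simpler decision-tree models" idea can be shown in a few lines: several weak one-split rules ("stumps") each vote, and the majority vote is more stable than any single rule. This is a toy from-scratch sketch of the ensemble principle, with invented features and thresholds; a production model would use a mature tree-ensemble library, not hand-written stumps:

```python
def make_stump(feature_index, threshold):
    """A one-split 'decision tree': predict 1 (responder) if the
    chosen feature exceeds the threshold, else 0."""
    return lambda row: 1 if row[feature_index] > threshold else 0

def ensemble_predict(stumps, row):
    """Majority vote across all stumps."""
    votes = sum(stump(row) for stump in stumps)
    return 1 if votes * 2 > len(stumps) else 0

# Three weak rules over made-up features: (income, recency, tenure).
stumps = [
    make_stump(0, 50_000),  # income above $50k
    make_stump(1, 30),      # engaged in the last 30+ days metric
    make_stump(2, 2),       # more than 2 years as a customer
]

likely = [60_000, 45, 5]    # clears all three thresholds
unlikely = [30_000, 10, 1]  # clears none
```

Real algorithms in this family grow deeper trees and combine them more cleverly (bagging or boosting rather than a flat vote), but the payoff is the same: many simple models together are both flexible and stable.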
What does all this mean in practice? Our analytics team at Speedeon spends a lot of time thinking about data. We work with our clients to ensure we are collecting enough examples to build good models, and we work with our engineering teams to source or build new features that make our datasets better.