Think of buying a second-hand car: you have a particular make and model in mind, and a quick online search shows a variety of prices and conditions. In terms of coalitional game theory, the “game” is predicting the price of a specific car. The “players” are the feature values that you input into the model, and any combination of those features is called a “coalition”. The “gain” is the difference between the predicted price for that car and the average predicted price across all cars. The players work together to create the gain (or difference from the average value).
Say the average price of your desired car is $20,000. Several factors will move that price up or down for a given vehicle: age, trim level, condition, and mileage all influence the price of the vehicle. That’s why it can be difficult to tell whether a specific car is fairly priced, or sits above or below market, given all the variables.
Machine learning can solve this problem by building a model to predict what the price should be for a specific vehicle, taking all the variables into account. A SHAP analysis of that model will give you an indication of how significant each factor is in determining the final price prediction the model outputs. It does this by running a large number of predictions with and without each variable, in combination with different subsets of the other features, and comparing the results.
In our example, it’s easy to see that if I look at the prices of cars with varying mileage but the same model year, condition and trim level, I can ascertain the impact of mileage on the overall price. SHAP is a bit more complicated: the analysis runs against the varying ‘coalitions’, or combinations, of the other variables to get the average impact of the car's mileage across all possible combinations of features.
For the used car, we would end up running the machine learning model with mileage varied against every possible combination of trim level, model year and condition. This means running a lot of combinations through the model, because the number of combinations grows exponentially with the number of variables you are looking at.
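To make the exponential growth concrete, here is a minimal sketch that counts feature subsets. The function name `coalition_count` is made up for illustration, and the count includes the empty coalition (no features known), which stands for the average prediction.

```python
def coalition_count(n_features: int) -> int:
    """Number of possible feature subsets, including the empty one."""
    return 2 ** n_features


# Each extra feature doubles the number of coalitions to evaluate.
for n in (4, 10, 20, 30):
    print(f"{n:>2} features -> {coalition_count(n):,} coalitions")
```

With the four car features this is still small, but at a few dozen features exhaustive evaluation becomes impractical, which is why SHAP implementations generally rely on approximation or model-specific shortcuts.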
In the used car case, we would have the following coalitions:
- Trim Level
- Mileage
- Model Year
- Condition
- Trim Level + Mileage
- Trim Level + Model Year
- Trim Level + Condition
- Mileage + Model Year
- Mileage + Condition
- Model Year + Condition
- Trim Level + Mileage + Model Year
- Trim Level + Model Year + Condition
- Mileage + Trim Level + Condition
- Mileage + Model Year + Condition
- Mileage + Trim Level + Model Year + Condition
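A list like this can be generated rather than written by hand. The sketch below uses Python's standard `itertools.combinations`; the `FEATURES` list and the helper name `non_empty_coalitions` are assumptions made for this example.

```python
from itertools import combinations

FEATURES = ["Trim Level", "Mileage", "Model Year", "Condition"]


def non_empty_coalitions(features):
    """Return every non-empty subset of the features, smallest first."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


coalitions = non_empty_coalitions(FEATURES)
for combo in coalitions:
    print(" + ".join(combo))
print(f"{len(coalitions)} coalitions in total")
```

For four features this yields 4 single features, 6 pairs, 4 triples and 1 full set. The empty coalition is left out here because it corresponds to the average prediction itself.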
Take the output of all these predicted prices and compare it to the average of all the predictions, and you can calculate the overall average impact (or Shapley value) of a feature against the average price predicted by the model. A SHAP analysis presents these visually with an indication of the influence on the outcome: within each feature's row, higher feature values are shown in red and lower ones in blue.
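The averaging step can be sketched in a few lines of plain Python. Everything in this example is hypothetical: `predict_price` is a made-up pricing formula standing in for a trained model, `BASELINE` is an invented "average" car priced at $20,000, and `CAR` is the vehicle being explained. Features outside a coalition are filled in from the baseline car, which is a simplification of how the `shap` library handles missing features (it typically averages over a background dataset).

```python
from itertools import combinations
from math import factorial

# Hypothetical "average" car and the specific car we want to explain.
BASELINE = {"trim": 0, "mileage": 60_000, "year": 2018, "condition": 3}
CAR = {"trim": 1, "mileage": 30_000, "year": 2020, "condition": 4}


def predict_price(car):
    """Made-up stand-in for a trained model; returns 20,000 for BASELINE."""
    return (
        20_000
        + 1_500 * car["trim"]
        - 50 * (car["mileage"] - 60_000) / 1_000
        + 800 * (car["year"] - 2018)
        + 500 * (car["condition"] - 3)
        # Interaction: good condition is worth more on the higher trim.
        + 300 * car["trim"] * (car["condition"] - 3)
    )


def coalition_value(coalition):
    """Predict using CAR's values for coalition features, BASELINE otherwise."""
    mixed = {
        name: CAR[name] if name in coalition else BASELINE[name]
        for name in BASELINE
    }
    return predict_price(mixed)


def shapley_values():
    """Exact Shapley value per feature, averaging over all coalitions."""
    names = list(BASELINE)
    n = len(names)
    values = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = coalition_value(set(subset) | {name})
                without_feature = coalition_value(set(subset))
                total += weight * (with_feature - without_feature)
        values[name] = total
    return values


for name, value in shapley_values().items():
    print(f"{name:>10}: {value:+,.0f}")
```

The per-feature contributions add up to the gap between this car's predicted price and the baseline price, which is the "gain" being shared among the players. The interaction term is there to show why the averaging matters: the value of the trim level depends on whether condition is already in the coalition, so a single comparison would give a different answer than the average over all coalitions.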