Customer Discrimination is Essential and Beneficial – if Done in the Right Way
The term discrimination has obvious negative connotations, but in data science and in business in general it’s a critical tool and one that needs to be handled with care.
The Opportunity and the Challenge
With my predictive modeller’s hat on, discrimination is a term that I don’t shy away from. Discrimination on age, for example, can be legal as long as it meets the criteria set out in the Equality Act 2010. While discrimination is commonly associated with unfair or illegal practices, a predictive model’s purpose is to explain/account for variations observed in Y and attribute those to differences in X, i.e. discriminate. In the context of retail banking, I’d be looking for those factors that discriminate between those customers that do default on their loan and those that don’t. That said, the (not so abstract) problem arises when gender or other protected classes show predictive power but cannot be included for a variety of reasons. How can we address this?
Discrimination via Proxy
Proxy discrimination is the idea that, even without including any protected class directly, a model can be given enough correlated features that it effectively has access to the same information (variation), or a large portion of it, as it would with direct inclusion, along with the unwanted perceived discrimination that comes with it.
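As a rough illustration of how one might screen for this, the sketch below measures how strongly each candidate feature correlates with a protected attribute that is held back from the model. Everything in it is an assumption made for illustration: the data is synthetic, the feature names are hypothetical, and the 0.3 threshold is arbitrary. A pairwise check like this only catches single-feature proxies; a combination of individually weak features can still encode the same information.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def flag_proxies(protected, features, threshold=0.3):
    """Return {feature name: correlation} for features whose absolute
    correlation with the protected attribute reaches the threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(protected, values)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Synthetic example: a binary protected attribute that is NOT a model input.
rng = random.Random(0)
protected = [rng.randint(0, 1) for _ in range(2000)]

# Hypothetical candidate features. The first is built to depend on the
# protected attribute; the second is independent noise.
features = {
    "niche_category_spend": [2.0 * p + rng.gauss(0, 1) for p in protected],
    "account_age_months": [rng.gauss(36, 12) for _ in protected],
}

flagged = flag_proxies(protected, features)
for name, r in flagged.items():
    print(f"{name}: correlation with protected attribute = {r:.2f}")
```

A flagged feature is not automatically illegitimate. The check only tells you where to ask whether the feature is a proxy or a driver.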
It’s at this point that it starts to get more difficult. In reality, ‘discrimination’ is more nuanced than this. Often, the features (i.e. covariates) are not meant to be strictly causal and are proxies for what we really mean. Access to the underlying, hypothesized, causal drivers is what we’re really after, but often they’re impossible to collect or too abstract in nature to capture. Some examples of this:
- Including an Experian credit score in a loan default model makes sense, but a higher score itself does not lead to a lower probability of default; it’s no more than a sensible proxy for behaviour that is indicative of someone’s ability and willingness to pay.
- Telematics in car insurance quotes is a great example of a proxy variable that arises from self-selection. While insurers differ in how they reward (adjust) you for taking the ‘black box’, the fact that you are willing to have your driving behaviour monitored and scrutinised is itself indicative, at the time of acquisition, of a presumably safer driving style, which is associated with lower expected claims and, in theory, a lower premium. Telematics is a great objective tool for assessing driver quality, but exercising the option to take it is indicative of certain behavioural traits in its own right.
- In some rapid-response modelling to forecast adverse customer behaviour when the pandemic hit, we noticed that older customers were a lot less likely to take up the voluntary offer to (temporarily) reduce their bill. Is age a driving factor here, or is age simply a proxy for behaviours such as ‘cannot be bothered to change it’, ‘I don’t have the digital skills to access my online profile’ or even ‘I don’t look at emails’? Note that we’re not passing a value judgement on whether the communication here was ‘reasonable’ across age groups but are simply observing a (reasonable) proxy effect.
Legal and Moral Considerations
Legally and morally there is a grey area in how we handle such proxies. In the absence of certainty around what actually discriminates, one might be tempted to adopt gender, ethnicity or religion as a vehicle to capture significant variance. A savvy modeller might try to proxy for those classes, in turn capturing the underlying drivers. Identifying religion by collecting data on ‘how many times they go to church’ may be too on-the-nose, but a media giant could correlate more indirect features from their customers’ viewing behaviour, or an e-commerce player might look at their customers’ interest in particular (niche) product categories. What is a proxy, and what is a driver?
Sometimes, something as simple as gender might turn out not to be a proxy for the behaviour we would typically associate (observe) with one gender over another, but to be more directly related to the underlying driver, biologically. Take the example of pricing personal loans, where a small portion of the expected losses on a loan is associated with the risk of the customer passing away, leaving the principal very hard to recover. While for the average customer this risk might be so small that it could safely be ignored, for the over-60 age group the numbers start to become more meaningful, often with a ~5% chance of death over the lifetime of a loan. In the absence of incredibly granular data that might account for some of the variability here (general health, history of smoking, family history (itself a proxy for your health!) and other medical information), it’s safest to rely on a standard mortality table, such as those published by the Institute and Faculty of Actuaries. These tables distinguish between males and females using a relatively objective view of the world: women, on average, have longer life expectancies, a statistic that few people would deem ‘unfair’ if it were used. Yet using broad-brush mortality figures based solely on gender would not be allowed and would be perceived as morally suspect.
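To make the arithmetic behind a figure like ~5% concrete, here is a minimal sketch of how annual mortality rates compound over a loan term: the chance of dying at some point during the term is one minus the product of the yearly survival probabilities. The function name is hypothetical and the flat 1% annual rate is a placeholder chosen purely for illustration; it is not taken from any published mortality table.

```python
def cumulative_mortality(annual_rates):
    """Probability of dying at some point during the loan term.

    annual_rates[t] is the probability of dying in year t of the loan,
    given survival to the start of that year.
    """
    survival = 1.0
    for q in annual_rates:
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"mortality rate out of range: {q}")
        survival *= 1.0 - q
    return 1.0 - survival


# Placeholder input: a flat 1% annual rate over a five-year loan.
# Illustrative only, NOT a figure from a real mortality table.
illustrative_rates = [0.01] * 5
p_death = cumulative_mortality(illustrative_rates)
print(f"Chance of death over the loan term: {p_death:.2%}")  # 4.90%
```

In practice the yearly rates would rise with age rather than stay flat, and the amount at risk would fall as the loan amortises, so this is a sketch of the compounding step only.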
So far this has not touched on fairness at all, and for good reason. Fairness is not a consideration that is only relevant for features that are associated with unwanted or prejudiced model discrimination. It applies to all features. This is especially true of personalised pricing, particularly in financial services. For a lot of financial services products, the cost of servicing the customer is related to expected losses, as opposed to a ‘simple’ cost of goods, as in telecoms or e-commerce. For instance, the pricing work for an insurance product would consist of working out, based on historical claims and a lot of collected data points, an expected loss (price). This is a no-margin, ‘fair’ price, compensating for exactly the cost of delivering the service, based on your personal characteristics, choices and circumstances. To this we would add not only some margin, but also an element of pricing that looks purely at your willingness to pay. Is that fair? Is it fair if the willingness to pay (negatively) correlates with existing metrics of customer vulnerability?
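The price build-up described above can be sketched as three separate components. This is a simplified illustration under stated assumptions, not any insurer's actual method: expected loss is taken as claim frequency times average claim cost, and both the margin and the willingness-to-pay element are modelled as simple percentages of expected loss. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    expected_loss: float  # the no-margin, 'fair' price
    margin: float         # loading added on top of the expected loss
    wtp_uplift: float     # element driven purely by willingness to pay

    @property
    def price(self) -> float:
        return self.expected_loss + self.margin + self.wtp_uplift


def build_quote(claim_frequency, average_claim_cost, margin_rate, wtp_uplift_rate):
    """Assemble a quote from its components (all inputs are assumptions)."""
    expected_loss = claim_frequency * average_claim_cost
    return Quote(
        expected_loss=expected_loss,
        margin=expected_loss * margin_rate,
        wtp_uplift=expected_loss * wtp_uplift_rate,
    )


# Hypothetical customer: 8% annual claim frequency, 2,500 average claim cost,
# 10% margin and a 5% willingness-to-pay uplift.
quote = build_quote(0.08, 2500.0, 0.10, 0.05)
print(f"expected loss {quote.expected_loss:.2f}, price {quote.price:.2f}")
```

Keeping the willingness-to-pay element as its own field, rather than folding it into one final number, is what makes the fairness question testable: the uplift can be compared directly against whatever vulnerability metrics are available.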
The Path Forward
These considerations highlight that the criteria for a successful model build include a well-specified model, an abundance of high-quality data and, critically, a lot of common sense.
To watch Paul and Neil (Director of Forecast UK) dig deeper into this topic, please see the video below.