RC W3D4 - Applying ML to flag risky payments

I initially intended to write this a week after the post on payments and chargebacks, but the advice to my younger self was hard to push back.

I'll add that it was Impossible Project Day at RC. Change of plans here too - I wanted to build a static site generator with Idris but ended up creating a basic WebAssembly compiler. It clearly qualified as impossible, since I had planned to do it at my last RC batch and hadn't looked at it in the 2.5 years since.

In any case, risky payments. Imagine you start a payments processing startup. Everything goes well at first, but at some point you notice fraudsters processing stolen credit cards. You hire someone to lead an operations team that goes through every payment to make sure it's legitimate. Over time your payment volume grows, so you hire a software engineer to build a payment review queue around a simple rule: say, review any payment above $1,000 with an IP address outside of the US.
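As a rough sketch, the rule might look like this (the field names are made up for illustration):

```python
# A minimal sketch of the rule-based review queue (hypothetical fields).
def needs_review(payment: dict) -> bool:
    """Flag a payment for manual review under the hard-coded rule."""
    return payment["amount_usd"] > 1_000 and payment["ip_country"] != "US"

print(needs_review({"amount_usd": 1500, "ip_country": "RO"}))  # True
print(needs_review({"amount_usd": 999, "ip_country": "RO"}))   # False - just under the threshold
```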

This decision tree is simple to understand and set up, but also easy to get around. Fraudsters may figure out the $1,000 threshold and start processing $999 payments instead. In addition, what we think of as a fraud signal may not actually be one. For example, instead of the IP address being the red flag, it may actually be the language setting on the browser.

This is where machine learning (ML) models can help. First, instead of a binary 'review' or 'no review' determination, the model returns a score between 0 (not fraud) and 1 (fraud). We can set up the queue so that payments scoring above a certain threshold get reviewed, with the threshold set based on review capacity and risk tolerance. Second, the model can take all available features as input and learn higher weights for more predictive features from historical data. This is often called a (binary) classification model.
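Here's a minimal sketch of that workflow with scikit-learn, using synthetic data as a stand-in for real payment history; the threshold value is a made-up number you'd tune to your own queue:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical payments: X holds features
# (amount, IP country, browser language, ...), y marks known fraud.
# weights=[0.97] makes fraud rare, as it is in practice.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.97], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Score each payment between 0 (not fraud) and 1 (fraud) ...
scores = model.predict_proba(X_test)[:, 1]

# ... and route high scorers to the review queue. The threshold is a
# hypothetical value set by review capacity and risk tolerance.
REVIEW_THRESHOLD = 0.7
to_review = scores >= REVIEW_THRESHOLD
print(f"{to_review.sum()} of {len(scores)} payments queued for review")
```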

Different models come with different trade-offs. Logistic regression is the more 'explainable' option: there's a direct relationship between each feature's weight and the likelihood of fraud. At the other end are 'black box' models like random forests, which take a 'wisdom of the crowds' approach by training a 'forest' of decision trees and aggregating their votes. These are harder to understand but tend to have much better predictive performance.
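Continuing the sketch above, the contrast looks something like this: the logistic regression's coefficients can be read off directly, while the forest is queried for scores without an equally simple story per feature:

```python
from sklearn.ensemble import RandomForestClassifier

# Logistic regression: each coefficient relates a feature to the
# log-odds of fraud, so the weights are easy to read off directly.
for i, coef in enumerate(model.coef_[0][:5]):
    print(f"feature_{i}: {coef:+.2f}")

# Random forest: many decision trees voting together. Harder to
# interpret, but often stronger on tabular data like payments.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
forest_scores = forest.predict_proba(X_test)[:, 1]
```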

How do we think about performance metrics? The 'precision' of the queue is the percentage of cases reviewed by the ops team that end up being blocked. The 'recall' is the total blocked dollar amount over the total dollar amount across all cases. Tracking precision alone may bias the queue toward risky but low-dollar payments, so we'll want to optimize for both.
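Under these definitions both metrics are simple ratios over the review queue; a toy example with made-up outcomes:

```python
import numpy as np

# Hypothetical review outcomes: which queued cases the ops team
# blocked, plus the dollar amount of each case.
blocked = np.array([1, 0, 1, 1, 0], dtype=bool)          # per reviewed case
amounts = np.array([950.0, 40.0, 2_000.0, 120.0, 15.0])  # per reviewed case

precision = blocked.mean()                       # blocked cases / cases reviewed
recall = amounts[blocked].sum() / amounts.sum()  # blocked dollars / total dollars

print(f"precision: {precision:.0%}, dollar-weighted recall: {recall:.0%}")
```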

Another consideration is 'lagging' vs 'leading' indicators. Chargebacks are our 'ground truth' but can arrive up to 3 months after the payment; this is our lagging indicator. Having the ops team review payments provides a leading indicator, especially against fraud rings, but what the ops team blocks may not necessarily have ended up as chargebacks.
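This shapes how training labels get built. One hedged sketch, with hypothetical column names: trust chargebacks only once a payment is old enough for them to have arrived, and fall back to the ops team's call for newer payments.

```python
import pandas as pd

# Hypothetical payments table with both signals: chargebacks (lagging)
# and ops decisions (leading).
payments = pd.DataFrame({
    "paid_at": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-05-01"]),
    "charged_back": [True, False, False],
    "ops_blocked": [True, True, False],
})

# Payments older than ~90 days get the mature 'ground truth' label;
# newer ones use the ops decision, which may not match the eventual
# chargeback outcome.
cutoff = pd.Timestamp("2024-08-01") - pd.Timedelta(days=90)
mature = payments["paid_at"] <= cutoff
payments["label"] = payments["charged_back"].where(mature, payments["ops_blocked"])
print(payments)
```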

Once we have an effective baseline workflow, we can potentially layer on a process to automatically block payments that are extremely likely to be fraudulent. This helps free up ops team capacity. In addition, if we think of high-scoring payments as 'bad', then low-scoring payments can highlight 'good' customers we want to do more business with. In other words, we get a 'credit-like' model for free.
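Putting it together, the score can drive a three-band policy; both thresholds here are hypothetical and would be tuned on historical data:

```python
# A sketch of a three-band policy on top of the model score.
AUTO_BLOCK = 0.95  # extremely likely fraud: block without review
REVIEW = 0.70      # risky enough to spend ops capacity on

def route(score: float) -> str:
    if score >= AUTO_BLOCK:
        return "block"
    if score >= REVIEW:
        return "review"
    return "approve"  # low scorers double as a list of 'good' customers

for s in (0.99, 0.80, 0.05):
    print(s, "->", route(s))
```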