For my first data science project, I trained a machine learning model using Lending Club data. The hypothesis is I can use that data to ‘cherry pick’ loans in a way that outperforms the average return. More specifically, all loans within a certain grade pay the same interest rate and if I can stack rank to select the top 5% say, then I beat the average.
What I didn’t quite expect is to later work for a startup that did exactly this.
I digress. The point I’m trying to get to relates to the loan data itself. The model I built mainly used the numerical features like the borrower’s FICO score and income. There was a text blob column which is what the borrower wrote as their intention for the loan proceeds. This text blob became simple features like number of characters, number of words, number of sentences.
With text embeddings, this column can now be a vector and the whole model be trained on both the numerical features and the semantic meaning of the text. It’s one use case of many that language models now unlock.