RC W6D3 - What did I miss?

I switched my focus from ML to data infra about 3 years ago. The natural question to ask as I start looking into ML again is: what did I miss?

For sequential modeling, LSTMs were the state of the art. Now they’re seen as difficult to optimize. Andrej Karpathy shared his thoughts on recent developments in his 3rd video on `makemore`: residual connections, normalization layers (batch, layer, and group), and better optimizers (like RMSProp and Adam).
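To make two of those ideas concrete, here is a minimal sketch of a residual connection wrapped around a layer-normalized transformation, in plain Python. The function names, the `tanh` nonlinearity, and the pre-norm ordering are illustrative assumptions, not code from the video, and the layer norm omits the learned scale and shift that real implementations usually include.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance.
    Simplified: no learned scale/shift parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, w):
    """Matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def residual_block(x, w):
    """Assumed pre-norm layout: output = x + tanh(W @ layer_norm(x))."""
    h = linear(layer_norm(x), w)
    return [xi + math.tanh(hi) for xi, hi in zip(x, h)]

random.seed(0)
dim = 4
x = [random.gauss(0, 1) for _ in range(dim)]
w = [[random.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)] for _ in range(dim)]

y = residual_block(x, w)

# With all-zero weights the transformed branch contributes nothing,
# so the block reduces to the identity via the skip path.
zero_w = [[0.0] * dim for _ in range(dim)]
identity_out = residual_block(x, zero_w)
```

The zero-weight case hints at why residual connections help optimization: the skip path always carries the input (and its gradient) through unchanged, so the block only has to learn a correction on top of the identity.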

What I also found interesting was his approach to diagnostics; he’s very deliberate about making sure viewers develop an awareness of how things can go wrong and what to do about it. Note to self: spend a bit more time on backprop from first principles.