RC W6D5 - Never graduate

Time flies when you’re having fun. It’s hard to believe it’s been 6 weeks already. I extended my stay at RC to a full batch, but I’ll start layering on intro calls and interview prep. It’s unclear how long the job search will take, so this is a reminder to pace myself.

I presented on Prolog at Friday presentations. I started by introducing declarative programming as a 3rd paradigm alongside imperative and functional programming, talked about a math puzzle that Prolog is particularly well-suited for, highlighted the relationship with relational algebra and Datalog, then went into a demo. The key takeaway I emphasized was how much cleaner Datalog is as a query language for writing recursive queries. Over the weekend I came across Eric Zhang’s senior thesis, which will be a treasure trove... when I get to it.

I finished watching the last video in Andrej Karpathy’s series, where he builds GPT from scratch. It’s still not entirely clear to me what makes a neural network a transformer. My takeaway is the use of attention, where instead of giving equal weight to each token in a sequence, the weighting itself is ‘trained’. For example, the nouns in a sentence may have a clearer meaning when coupled with nearby adjectives rather than with other nouns, so that link is given a higher weight.
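
To make that weighting concrete, here’s a minimal sketch of scaled dot-product attention in PyTorch. This is my own toy example rather than code from the video, and it leaves out the causal masking a language model would also need: each token gets a query, key and value, and the learned query/key projections decide how much weight each token places on every other token.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, C = 5, 16               # sequence length, embedding size
head_size = 8

x = torch.randn(T, C)      # toy token embeddings

# Learned projections (randomly initialized here); training these is what shapes the weighting.
query = torch.nn.Linear(C, head_size, bias=False)
key = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)    # (T, head_size) each
scores = q @ k.T / head_size ** 0.5     # how much each token 'cares' about the others
weights = F.softmax(scores, dim=-1)     # each row sums to 1: a non-uniform, learnable weighting
out = weights @ v                       # weighted sum of values, one row per token

print(weights)                          # the attention weights, not a fixed 1/T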

It’s actually a great single video for ‘catching up’ on recent developments in training and optimizing neural network models. With regard to attention, he discusses (1) having multiple heads of attention, since tokens have a lot to ‘talk’ about, as well as (2) alternating between communication and computation (talking and thinking). With regard to deep networks, the innovations here are (1) residual connections, which act as a temporary ‘highway’ until the deeper layers come online, and (2) layer norm, which is similar to batch norm but normalizes rows instead of columns.
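
As a rough illustration of that last point, here’s a sketch of my own (not from the video), with activations laid out as rows = examples and columns = features: batch norm computes its statistics down each column, layer norm along each row.

import torch

torch.manual_seed(0)
x = torch.randn(4, 3)      # 4 examples (rows) x 3 features (columns)

# Batch norm: normalize each feature across the batch (down each column).
bn = (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-5)

# Layer norm: normalize each example across its features (along each row).
ln = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-5)

print(bn.mean(dim=0))      # roughly zero per column
print(ln.mean(dim=1))      # roughly zero per row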

On Friday we also reflected on the batch. How did the batch turn out compared to what I wrote in my application (snippets here)?

I ended up doing both functional programming and generative modeling. The functional programming discussions were super fun explorations; I discovered a world much bigger than I thought was possible, especially with effect systems. There’s a lot of underlying theory to functional programming - lambda calculus, type theory, category theory. In the words of Richard Eisenberg, “I’ll never be bored again”.

What did I do in batch?

I led the functional programming study group. I learned Haskell, Prolog and Idris. I read a lot of posts on the motivations for each language and a little bit of theory. I discovered that out of all the CS topics, I really enjoy learning about programming languages (it’s also the only project where I’m at the 6th iteration). The next time Python or Rust or TypeScript gets a new feature, I’ll be motivated to dig up papers for more context on its intellectual provenance.

I started with ML when I picked up programming and I feel I’ve come full circle now with generative models.

I wrote a blog post every day in batch.

What were the highs and lows?

The high point was discovering functional programming as a deep well of ideas that I can always draw on.

The low point was wondering if my excursion into functional programming is escapism.

Maybe it’s worth reflecting a little more on the low point. I felt sad thinking that I’ve been jumping around different interests and roles, which makes it difficult for gains to compound. Now I’m a little more reassured that experimenting with different things helps provide a broader foundation to build on. The advice from David Beazley and Peter Norvig to those who want to become Python experts is to learn Python and then everything other than Python. This is the same advice but extended to life.

Never graduate.

RC W6D4 - Asking the right questions

The final video in the `makemore` series involves making the previous model deeper, with a tree-like structure. The end result is a convolutional neural network architecture inspired by WaveNet. What’s particularly interesting is the convolutional aspect, where a technique more commonly associated with images is used so that the ‘for’ loop effectively runs inside the CUDA kernel rather than in Python. Emphasis is also placed on building out an experimental harness to track how training and validation loss change over time.
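
Here’s a rough sketch of the tree-like part, a simplification of my own rather than the video’s code: instead of squashing all the context characters into one hidden layer at once, consecutive pairs of embeddings are fused, then pairs of those, and so on, halving the sequence length at each level.

import torch
import torch.nn as nn

B, T, E = 32, 8, 10        # batch size, context length, embedding size
x = torch.randn(B, T, E)   # toy character embeddings

# Each level fuses two adjacent groups into one.
levels = nn.ModuleList([
    nn.Sequential(nn.Linear(2 * E, E), nn.Tanh()) for _ in range(3)
])

h = x
for level in levels:
    b, t, e = h.shape
    h = h.view(b, t // 2, 2 * e)   # concatenate consecutive pairs
    h = level(h)

print(h.shape)   # torch.Size([32, 1, 10]): one vector per example, built up tree-style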

Today we also had the final functional programming study group for the half-batch with the topic “What is functional programming?”. What I’ll continue to ponder beyond the session is the notion of ‘what is ergonomic’ vs ‘what is possible’.

For example, in Idris you can set a maximum size for a list and have that check enforced without any additional ‘ceremony’, whereas in Python you’d have to create either a custom class or a custom function to enforce it. Do all Turing-complete languages let you do the ‘same thing’ at varying levels of ease, or are there things that are possible in one language but impossible in another? Perhaps what’s worth reflecting on is whether that’s even the right question.
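
For the Python side of that comparison, here’s the kind of ‘ceremony’ I have in mind, using a hypothetical BoundedList class (the name and design are mine): the size limit has to be written out and is only enforced at runtime, whereas Idris can carry it in the type and check it before the program ever runs.

class BoundedList:
    """A list that refuses to grow beyond max_size, checked at runtime."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = []

    def append(self, item):
        if len(self._items) >= self.max_size:
            raise ValueError(f"already holding {self.max_size} items")
        self._items.append(item)

    def __len__(self):
        return len(self._items)


xs = BoundedList(max_size=2)
xs.append(1)
xs.append(2)
# xs.append(3)  # raises ValueError, but only when this line actually runs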

RC W6D3 - What did I miss?

I switched my focus from ML to data infra about 3 years ago. The natural question to ask as I started looking into ML again is, what did I miss?

For sequential modeling, LSTMs were the state of the art. Now they’re seen as difficult to optimize. Andrej Karpathy shared his thoughts on recent developments in his 3rd video on `makemore` - residual connections, normalization layers (batch, layer and group norm), as well as better optimizers (like RMSProp and Adam).

What I also found interesting was his approach to diagnostics; he’s very deliberate about making sure viewers develop an awareness of how things can go wrong and what to do about it. Note to self to spend a bit more time on backprop from first principles.

RC W6D2 - Be careful what you wish for

The next video I watched in Andrej Karpathy’s Neural Networks: Zero to Hero series is where he starts building `makemore`, a model that takes in words and ‘makes more’ words like it. In the video, the model uses human names as training data and generates words that sound like human names.

It’s a bigram character-level language model, in which a single character is used to predict the next character. For example, with the name Emma, the model would use E to predict M, M to predict M, and M to predict A. It starts out with a Markov transition-probability model as a baseline, and progresses to a 1-layer neural network model. The model architecture gets more complex in later videos.
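
Here’s a minimal sketch of the transition-probability baseline, built from simple counts. It’s my own rather than Karpathy’s code, and the handful of names stands in for the real training set: tally how often each character follows another, using '.' to mark the start and end of a name, then sample from those counts.

from collections import Counter, defaultdict
import random

names = ["emma", "olivia", "ava", "isabella", "sophia"]   # stand-in training data

# Count character-to-character transitions, with '.' marking start and end.
counts = defaultdict(Counter)
for name in names:
    chars = ["."] + list(name) + ["."]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample_name():
    out, ch = [], "."
    while True:
        next_chars, weights = zip(*counts[ch].items())
        ch = random.choices(next_chars, weights=weights)[0]
        if ch == ".":
            return "".join(out)
        out.append(ch)

print([sample_name() for _ in range(5)])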

His approach emphasizes the simple building blocks of neural network models. From an engineering perspective, it appears that a lot of the heavy lifting in libraries and frameworks is about getting the calculations to run fast at scale. I haven’t used TensorFlow and PyTorch at length enough to make a comparison, but I get the impression the latter is loved for having better UX. I admit it’s also cute seeing Karpathy do YouTube influencer poses on the video covers.

As with functional programming, it’s fun to read around the topic. I enjoyed Lex Fridman’s framing of deep learning as the extraction of useful patterns from data, and the analogy to heliocentrism vs geocentrism in forming simpler and simpler representations of ideas.

What’s interesting about the broader topic of AI is that, unlike functional programming, everyone has a say. This is understandable given the broader societal implications. According to Politico, the release of ChatGPT has pushed the EU back to the drawing board when it comes to regulation. When asked if it's dangerous to release ChatGPT to the public before we fully understand the risks, Sam Altman responded by saying it’s even more dangerous to develop in secret and release GPT-7 to the world.

This highlights how difficult it is to make predictions about the future, as in this podcast interview with Sam Altman.

I think it’s interesting that if you ask people 10 years ago about how AI was going to have an impact, with a lot of confidence from most people, you would’ve heard, first, it’s going to come for the blue collar jobs working in the factories, truck drivers, whatever. Then it will come for the low skill white collar jobs. Then the very high skill, really high IQ white collar jobs, like a programmer or whatever. And then very last of all and maybe never, it’s going to take the creative jobs. And it’s going exactly the other direction.

Later that day I chatted with a friend who wants to find use cases for GPT but hasn’t been motivated to dig up the more boring tasks at work. I responded by saying how my compass at RC is “is this fun or does this feel like work”, and the dream is to have work be fun! The conversation continued as follows.

RC W6D1 - Yes to ankiGPT, No to pyraftGPT

The first draft of this post was handwritten on paper. This was by design.

I decided to switch gears to learning how ChatGPT works under the hood, following a conversation with a friend last week. I also took the opportunity to update my LinkedIn profile. This was something I was initially hesitant to do; partly because I wanted to concentrate on functional programming, partly because I wanted to spend a bit more time deliberately thinking about what I’d like to do next.

At the start of RC, I wasn't sure whether to learn functional programming or spend time on Andrej Karpathy's videos. Why Andrej Karpathy? I really enjoyed the material he put together for the Stanford CS231n course on computer vision, which then inspired me to come up with a 1-hour intro on neural networks for SquareU.

I spent a bit of time going through the video on a minimal backprop engine; it’s deceptively simple. On a related note, the intro above had a graphic on how the weights change as they go through stochastic gradient descent, but only for a 1-layer network.
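
For what it’s worth, that graphic boils down to something like this sketch (my own, with made-up numbers): a forward pass, a backward pass, and a small step against the gradient, repeated a few times for a single-layer network.

import torch

torch.manual_seed(0)
x = torch.randn(4, 3)       # 4 examples, 3 features
y = torch.randn(4, 1)       # toy targets

W = torch.randn(3, 1, requires_grad=True)   # the 1-layer network's weights
lr = 0.1

for step in range(3):
    pred = x @ W                        # forward pass
    loss = ((pred - y) ** 2).mean()     # mean squared error
    loss.backward()                     # backward pass fills in W.grad
    with torch.no_grad():
        W -= lr * W.grad                # the weights shift slightly against the gradient
        W.grad.zero_()
    print(step, loss.item())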

The rest of the day I used ChatGPT as a consumer. The first use case was creating Anki flashcards; I did a lot of courses at Bradfield but I’ve also forgotten a lot of what I learned. GPT-4 is very good at creating flashcards. For the final version I would feed text from OSTEP into the prompts, but the question-and-answer sets it came up with when I tried prompting without any OSTEP text (hence completely made up) were impressive!

What I wasn’t able to get GPT-4 to do was implement Raft in Python. I copied the tests from my implementation into the prompt, and tried to get GPT-4 to write code that would pass them. If errors were raised, I would paste the error in full as a prompt to get the next iteration. This failed on the same test 10 times in a row. Next time I’ll try helping it along after repeated failures, to see how far we get.

The amusing part was how, after a few failed iterations, ChatGPT tried to change the function’s signature from 4 parameters to 5 and asked me to change the tests accordingly! I shared this with a friend, who responded as follows.

A candidate tried that with me during an interview. Human level AI.

RC W5D5 - Switching gears

I met up with a friend over coffee last week. He’s a founding engineer at a hybrid workplace platform. I was keen to hear about the job market. All he wanted to talk about was GPT.

OK. The path is clear.

I also extended RC from 6 to 12 weeks. It’s on!

RC W5D4 - The extensive world of logic programming

Events at RC usually start a few minutes past the hour, and it’s always more fun to spend the time in between on ice breaker questions. This week we had “What’s your favorite programming language mascot?”. I was going to go for Rust’s Ferris, but given this list, my vote has to go to the wonderfully apt Docker whale (which I learned is called Moby Dock).

The topic this week at the functional programming study group was logic programming. We started with a quick demo of Prolog. The idea of declarative programming - tell the program what you want, rather than what to do - impressed our group; it certainly impressed Joe Armstrong.

What I was impressed by was the extensive use cases of Datalog. Datalog is a subset of Prolog, and since Alain Colmerauer created Prolog to solve problems in computational linguistics, it’s less of a surprise that it extends nicely to ADTs. Recursive rules are easy to do, ditto graph algorithms. Program analysis involves reducing programs to a graph-like form, so we also get this for free. Then there’s theorem proving, CRDTs and incremental computing. If you haven’t noticed, all the links point to the excellent resource by Philip Zucker.

This excerpt on Frank McSherry's differential dataflow is a signpost for future reading. In particular, he notes how Datalog programs map very nicely onto incremental computation primitives.

Differential dataflow is a data-parallel programming framework designed to efficiently process large volumes of data and to quickly respond to arbitrary changes in input collections.

Thematically-related is Flix, a general-purpose programming language which is distinctive for its first-class Datalog support.

My takeaway from the week is that, having spent a lot of time writing SQL, it would be nice to try out Datalog as a query language. Writing recursive queries is a lot more ergonomic in Datalog. What I also learned is that SQL is closer to natural language by design, which perhaps explains the hard-to-read SQL queries that correspond to simple logic programs.

RC W5D3 - Excursions with Datalog

My first time at RC, I would go down rabbit holes for as long as they were fun. This time at RC, I wanted to be more deliberate. In any case, extremes are bad. At least this is how I justified going from working through a series of Prolog exercises to thinking more broadly about how things fit together.

In a previous post, the sample solution to the Dinesman puzzle highlights the declarative programming approach. In other words, the program describes the what, not the how. If you’re thinking “this sounds like SQL”, yes, it is very much like SQL. When you run a query in PostgreSQL, the query engine figures out the scans and the joins, and the optimizer speeds things up. You just supply the query.

In fact, this helpful diagram shows how Prolog relates to relational algebra (the theoretical basis for SQL). It’s from an Introduction to Datalog post, which is a highly recommended read.


How does Datalog compare to Prolog? Prolog is a general-purpose (Turing-complete) language, whereas Datalog is a subset oriented around querying. Datalog seems limited as a subset, but it benefits in two ways. First, the order of clauses doesn’t matter. Second, Prolog works top-down, returning one result at a time, whereas Datalog works with whole sets of data. Rich Hickey talks about Datalog as the query language for Datomic here.

How does Datalog compare to SQL? SQL is an implementation of relational algebra. Relational algebra does not support recursion, though certain SQL dialects extend it by allowing recursion through Common Table Expressions (CTEs). Rules in Datalog (as in Prolog) can be recursive, and writing them is much cleaner, or perhaps, in the term I’ve become rather fond of, much more ergonomic.

RC W5D2 - Making commitments on Pi Day

I recall committing to specific goals on Pi Day in the past, but perhaps I’ve forgotten what they were because I haven’t shared them publicly. This Pi Day I commit to thinking more deliberately about my long-term career trajectory and to making choices in a way that lets the gains compound.

This is inspired by Richard Artoul’s post. He hasn’t written many posts, which makes it all the more impressive that this one has so many insights.

On learning how things work under the hood.

Fight this urge whenever possible. Know your tools. Accept the abstractions, but only once you’ve studied their implementation and understand their limitations. You’ll never have enough time to do this for every tool that you use, but if you do it for even a small fraction of them you’ll reap massive benefits.

On thinking long-term.

At some point in your career you’ll have to start taking responsibility for your career trajectory. The key insight is that you should make decisions about what projects to work on and which teams and companies to join strategically, not tactically. Think long term.

On accelerating your growth.

Nothing will accelerate your growth faster than spending all day working with other very good engineers. The moment you start to feel like you’re not learning from the engineers around you is the moment you should start looking for a new team.

GPT-4 makes you wonder even more how the future will unfold.

RC W5D1 - First day with Prolog

On occasion I start working on something but then forget the reason I got started in the first place. This time at RC, I find it helpful to go back to SICP to answer questions like “why learn Prolog?”.

Section 4.3 of SICP is on non-deterministic computing, which involves building an ‘automatic search’ into the evaluator. Here expressions can have more than one possible value, so the evaluator chooses a possible value and checks whether the requirements are met. If not, the evaluator tries out new choices or backtracks to an earlier state where choices remain. The evaluation either ends successfully or fails when there are no more choices.

What’s new here is the notion of ‘requirements being met’. In other words, declaring the desired state and getting the evaluator to find it rather than declaring the exact steps. It’s Day 1, but what comes to mind is querying a database. Instead of instructions, the user declares relations.
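
As a toy sketch of that choose-and-check search (mine, not SICP’s amb evaluator): a recursive generator picks a value, checks the requirement so far, and abandons the branch, i.e. backtracks, when the check fails.

def search(choices, requirement, partial=()):
    """Extend partial one choice at a time; prune any branch that violates the requirement."""
    if not requirement(partial):
        return                          # dead end: backtrack
    if len(partial) == 2:               # looking for pairs in this example
        yield partial
        return
    for c in choices:
        yield from search(choices, requirement, partial + (c,))

# Requirement: the values chosen so far are strictly increasing.
increasing = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))

print(list(search([1, 2, 3], increasing)))   # [(1, 2), (1, 3), (2, 3)]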

Prolog first appeared in 1972. I became more intrigued when I learned Datalog is a successor (how they relate will have to be a different post).

Perhaps it’s helpful to look at an example, here from section 4.3.2.

Baker, Cooper, Fletcher, Miller, and Smith live on different floors of an apartment house that contains only five floors. Baker does not live on the top floor. Cooper does not live on the bottom floor. Fletcher does not live on either the top or the bottom floor. Miller lives on a higher floor than does Cooper. Smith does not live on a floor adjacent to Fletcher’s. Fletcher does not live on a floor adjacent to Cooper’s. Where does everyone live?

The Prolog solution on Rosetta Code is compact. Symbols are denoted with quotes, variables are uppercased. Chelsea Troy has a modified (to be non-deterministic) Scheme solution here.

%% select/2 assigns each person a distinct floor by picking, without repetition, from the list of floors.
select([A|As],S) :- select(A,S,S1), select(As,S1).
select([],_).

dinesmans(X) :-
    %% Baker, Cooper, Fletcher, Miller, and Smith on different floors 
    %% of an apartment house with five floors. 
    select([Baker,Cooper,Fletcher,Miller,Smith],[1,2,3,4,5]),

    %% Baker does not live on the top floor. 
    Baker =\= 5,

    %% Cooper does not live on the bottom floor.
    Cooper =\= 1,

    %% Fletcher does not live on either the top or the bottom floor.
    Fletcher =\= 1, Fletcher =\= 5,

    %% Miller lives on a higher floor than does Cooper. 
    Miller > Cooper,

    %% Smith does not live on a floor adjacent to Fletcher's.
    1 =\= abs(Smith - Fletcher),

    %% Fletcher does not live on a floor adjacent to Cooper's.
    1 =\= abs(Fletcher - Cooper),

    %% Where does everyone live?
    X = ['Baker'(Baker), 'Cooper'(Cooper), 'Fletcher'(Fletcher), 
         'Miller'(Miller), 'Smith'(Smith)].

main :-  bagof( X, dinesmans(X), L ) 
         -> maplist( writeln, L), nl, write('No more solutions.') 
         ;  write('No solutions.').

Verse and Mercury are functional logic languages; Mercury even calls itself ‘Prolog meets Haskell’. Details will follow in later posts.