RC W1D4 - Advice for data scientists

Twitter API

It's very convenient that an HTML/CSS/JS project on repl.it is automatically hosted, so there's no need to explicitly set up a back end (Hello World example here, viewable here).

I've been wanting to play around with Twitter's API, and thought having the client-side JavaScript make the API call would keep things simple. Annoyingly, .env files in HTML/CSS/JS projects are exposed to the client, and the more interesting parts of the API are behind authentication.

OK, so I'll need my own back end. It turns out this is not too complicated. I stitched together a minimal Flask web server that dumps the response into the HTML (project here, viewable here).
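Something along these lines works as the core of the server (a minimal sketch, not the project's exact code; the search endpoint and the TWITTER_BEARER_TOKEN variable name are my assumptions):

```python
# Minimal sketch: fetch tweets server-side so the token stays out of the client.
# The endpoint and env variable name are assumptions, not the project's exact code.
import os

import requests
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": "recurse center"},
        headers={"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"},
    )
    # Dump the raw JSON response straight into the page.
    return f"<pre>{resp.text}</pre>"

if __name__ == "__main__":
    app.run()
```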

Content: Data science

Today I also presented on applying machine learning to payments use cases - this flows nicely into the content of the day.

https://multithreaded.stitchfix.com/blog/2015/03/31/advice-for-data-scientists

I came across this Stitch Fix blog post when I was first interviewing for a data science role. The advice to choose a company (or perhaps, give considerable weight to one) based on whether data science makes or breaks its business is fantastic - if there's any one article I'd recommend on the topic, it's this one.

RC W5D3 - Why doctors hate their computers

WebAssembly

The primary use case for WebAssembly in recent discussions is enabling languages other than JavaScript to run in the browser. The .wasm format, however, could itself become the standard for portable binaries.

Suppose we wanted to speed up Python code with Rust bindings. This requires the Rust code to be compiled to a dynamic library, i.e. a .dylib file on Mac, a .so on Linux, or a .dll on Windows, which can then be imported in Python with a standard Python import (discussion here).
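As a sketch of the Python side, suppose the Rust crate builds a cdylib exposing an extern "C" function add(i64, i64) -> i64 (the crate name and function are hypothetical). ctypes can load the library directly; tools like PyO3 go a step further and make the library importable with a plain import statement:

```python
# Load a hypothetical libadder dynamic library via ctypes. The Rust side is
# assumed to expose: #[no_mangle] pub extern "C" fn add(a: i64, b: i64) -> i64.
import ctypes
import sys

suffix = ".dylib" if sys.platform == "darwin" else ".so"  # adder.dll on Windows
lib = ctypes.CDLL(f"./target/release/libadder{suffix}")

lib.add.argtypes = (ctypes.c_int64, ctypes.c_int64)
lib.add.restype = ctypes.c_int64

print(lib.add(2, 3))  # 5
```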

Now suppose the Rust code was compiled to a .wasm file instead. We can similarly import the .wasm file into Python with wasmtime (example here). Since the .wasm format is OS-independent, I believe we now have a portable version of the dynamic library.
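A sketch with wasmtime, assuming the same hypothetical add function compiled to add.wasm (the wasmtime-py API has shifted across versions; this follows a recent one):

```python
# Load add.wasm and call its exported `add` function. File and function
# names are assumptions carried over from the sketch above.
from wasmtime import Instance, Module, Store

store = Store()
module = Module.from_file(store.engine, "add.wasm")
instance = Instance(store, module, [])

add = instance.exports(store)["add"]
print(add(store, 2, 3))  # 5
```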

Neural networks

At Square, I ran workshops on machine learning and (separately) on neural networks. I presented on the latter today at RC.

Starting with the most basic representation of a neural network as matrix operations, we build it up in stages - reducing loss systematically, capturing non-linear behavior, and introducing regularization with dropout layers. By the end of the session, we train a convolutional neural network, highlighting the improved performance when the model architecture incorporates the 2-dimensional structure of image data.
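A condensed sketch of the final stage in Keras (not the repo's exact code; the layer sizes here are illustrative):

```python
# Small CNN with dropout on MNIST, condensed from the workshop's progression.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # add channel dim, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),  # the regularization step in the progression
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```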

The GitHub repo has been updated to Python 3 and the latest Keras API. Then and now, I couldn't resist reiterating Andrej Karpathy's advice to follow model training best practices, i.e. "don't be a hero".

Content: Why doctors hate their computers

At the start of my coffee chat with SengMing, I told him I was choosing between New Yorker articles to feature and had just opened a browser tab on Atul Gawande's article. SengMing then shared how much he loved the article; this made the decision easy.

It was published in 2018, but I enjoyed reading it so much more this time around. The references to pain points in the user experience were all too familiar - too many clicks, too difficult to find relevant information, too complicated to use. The issues pertaining to software development are universal, except in this case, to use SengMing's words, the quality "literally affects life and death".

https://www.newyorker.com/magazine/2018/11/12/why-doctors-hate-their-computers

Consider that, in recent years, one of the fastest-growing occupations in health care has been medical-scribe work, a field that hardly existed before electronic medical records. Medical scribes are trained assistants who work alongside physicians to take computer-related tasks off their hands. This fix is, admittedly, a little ridiculous. We replaced paper with computers because paper was inefficient. Now computers have become inefficient, so we’re hiring more humans. And it sort of works.

RC W5D4 - Engineers shouldn’t write ETLs

Classification of audio data

At yesterday's workshop on neural networks, we trained a convolutional neural network (CNN) on the MNIST dataset. The MNIST dataset consists of hand-drawn digits from zero to nine, 28 pixels by 28 pixels in size. A traditional ML model would treat each of the 784 pixels as simply another feature (and hence is agnostic to column ordering). A CNN takes 'snapshots' of the image, retaining its 2-dimensional structure, and (in line with intuition) outperforms the traditional ML model.
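One way to see the column-ordering point: apply a single fixed permutation to the 784 pixels of every image. A model that treats pixels as interchangeable features fits the shuffled data just as well, while a CNN loses the spatial structure it depends on. A sketch of the shuffling step (my illustration, not from the workshop):

```python
# Shuffle pixel columns with one fixed permutation shared across all images.
import numpy as np
from tensorflow import keras

(x_train, _), _ = keras.datasets.mnist.load_data()

rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)

x_flat = x_train.reshape(-1, 784)        # traditional view: 784 features
x_shuffled = x_flat[:, perm]             # column order scrambled, features intact
x_img = x_shuffled.reshape(-1, 28, 28)   # the 2-d structure a CNN needs is gone
```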

There's a team at RC participating in Kaggle's Birdcall Recognition competition. The competition involves identifying the name of the bird in a birdcall audio file. I learned that converting the audio into an image and then training a CNN on the image is a common technique. The visual representation of the spectrum of frequencies over time is called a spectrogram.
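The conversion itself is only a few lines with a library like librosa (a sketch; the file name and default parameters are placeholders):

```python
# Turn an audio file into a mel spectrogram image a CNN can train on.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("birdcall.mp3", sr=None)     # raw waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr)  # frequency content over time
mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale for contrast

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.savefig("birdcall.png")
```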

The question that arose was how to deal with metadata. I won't go into too much detail since the competition is still ongoing; we explored a number of possibilities with multi-label models. The basic premise is that we have a set of models that perform equally well on the main performance metric, and we use the other labels to select the best (or perhaps, the most robust) among them. Excited to see how this plays out!

Content: Data engineering

A common functional split in a company is between data producers and data consumers. Data producers tend to be engineering team(s) that create and maintain ETLs to move and transform data from one source to another. Data consumers tend to be data science team(s) that use the data for analysis and/or modeling.

The problem with this setup relates to handoffs. Data consumers have more context on the data, so when there are data quality issues, it's the consumers who raise them with data producers. There's usually a bottleneck: data consumers care about the data but are blocked pending a fix, while data producers don't have as much insight into the use cases but have to prioritize inbound requests.

The following Stitch Fix post proposes a different model, reducing instances where the problem is passed "over the wall". In this model, data producers create tools that make it easy for consumers to create and maintain ETLs, i.e. "design new Lego blocks". Data consumers own the process end-to-end, i.e. "assemble [the Lego blocks] in creative ways to create new data science".

https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

This way each team can take ownership of the things they care about, iterate autonomously and move faster together.