RC W5D4 - Engineers shouldn’t write ETLs

Classification of audio data

At yesterday's workshop on neural networks, we trained a convolutional neural network (CNN) on the MNIST dataset. The MNIST dataset consists of handwritten digits from zero to nine, 28 pixels by 28 pixels in size. A traditional ML model treats each of the 784 pixels as simply another feature (and is hence agnostic to column ordering). A CNN instead takes 'snapshots' of the image, retaining its 2-dimensional structure, and (in line with intuition) outperforms the traditional model.
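To make the difference concrete, here's a minimal sketch in Keras: a dense model that flattens the pixels versus a small CNN that keeps the 28x28 grid. The architecture and hyperparameters are illustrative, not the workshop's exact setup.

```python
# Sketch only: compare a flat dense model with a small CNN on MNIST.
import tensorflow as tf
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Traditional-style model: flattens the image, so pixel order carries no spatial meaning.
dense_model = tf.keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# CNN: convolutions slide over local patches, preserving the 2-D structure.
cnn_model = tf.keras.Sequential([
    layers.Reshape((28, 28, 1), input_shape=(28, 28)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

for model in (dense_model, cnn_model):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, verbose=0)
    print(model.evaluate(x_test, y_test, verbose=0))
```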

There's a team at RC participating in Kaggle's Birdcall Recognition competition, which involves identifying the bird species in a birdcall audio file. I learned that converting the audio into an image and then training a CNN on that image is a common technique. The visual representation of the spectrum of frequencies over time is called a spectrogram.
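Here's a minimal sketch of that audio-to-image step, assuming librosa and matplotlib are available; the file name and mel-spectrogram parameters are placeholders, not the team's actual settings.

```python
# Sketch only: turn an audio clip into a mel spectrogram image a CNN could train on.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("birdcall.wav", sr=32000)            # load (and resample) the clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)             # log scale, as usually fed to a CNN

img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(img, format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.savefig("birdcall_spectrogram.png")                   # the 'image' representation
```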

The question that arose was how to deal with metadata. I won't go into too much detail since the competition is still ongoing, but we explored a number of possibilities with multi-label models. The basic premise is that we have a set of models that perform equally well on the main performance metric, and we use the other labels to select the best (or perhaps the most robust) among them. Excited to see how this plays out!
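Purely as an illustration of that selection idea (the model names and scores below are made up, not competition results): among candidates tied on the primary metric, break the tie with a score computed from the other labels.

```python
# Illustrative tie-breaking between models that score equally on the primary metric.
candidates = {
    "cnn_mel_128":   {"primary_f1": 0.61, "secondary_f1": 0.54},
    "cnn_mel_256":   {"primary_f1": 0.61, "secondary_f1": 0.58},
    "cnn_augmented": {"primary_f1": 0.60, "secondary_f1": 0.59},
}

best_primary = max(m["primary_f1"] for m in candidates.values())
tied = {name: m for name, m in candidates.items()
        if abs(m["primary_f1"] - best_primary) < 1e-9}

# Among the tied models, prefer the one that also does well on the secondary labels.
chosen = max(tied, key=lambda name: tied[name]["secondary_f1"])
print(chosen)  # -> "cnn_mel_256"
```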

Content: Data engineering

A common functional split in a company is between data producers and data consumers. Data producers tend to be engineering team(s) that create and maintain ETLs to move and transform data from one source to another. Data consumers tend to be data science team(s) that use the data for analysis and/or modeling.

The problem with this setup relates to handoffs. Data consumers have more context on the data, so when there are data quality issues, data consumers raise them with data producers. This usually creates a bottleneck: data consumers care about the data but are blocked pending a fix, while data producers don't have as much insight into the use cases but have to prioritize inbound requests.

The following Stitch Fix post proposes a different model, reducing the instances where a problem is passed "over the wall". In this model, data producers create tools that make it easy for consumers to create and maintain ETLs, i.e. "design new Lego blocks". Data consumers own the process end-to-end, i.e. "assemble [the Lego blocks] in creative ways to create new data science".

https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/
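The post argues for a division of labour rather than a specific API, but a hypothetical sketch of the idea might look like this: the platform team owns a generic, reliable runner (the Lego blocks), while a data scientist owns the extract/transform/load pieces of their own pipeline.

```python
# Hypothetical sketch, not an API from the Stitch Fix post.
from typing import Any, Callable, Iterable


def run_etl(extract: Callable[[], Iterable[Any]],
            transform: Callable[[Any], Any],
            load: Callable[[Iterable[Any]], None]) -> None:
    """Platform-owned runner: scheduling, retries, and monitoring would live here."""
    load(transform(row) for row in extract())


# Data-scientist-owned pieces: they have the context on the data and the use case.
def extract():
    yield from [{"order_id": 1, "amount": "12.50"}, {"order_id": 2, "amount": "7.00"}]


def transform(row):
    return {**row, "amount": float(row["amount"])}


def load(rows):
    for row in rows:
        print("writing", row)  # in practice: write to a warehouse table


run_etl(extract, transform, load)
```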

This way each team can take ownership of the things they care about, iterate autonomously and move faster together.