
How to Search Anything

October 2024


A linguistic system is a series of differences of sound combined with a series of differences of ideas. — Ferdinand de Saussure, 1916

Last December I wrote about how you can use OpenAI's text embedding model, Ada, to create vectors of emojis. These can be added, subtracted, and searched like any other vector. Ever since, I've been obsessed with the idea that we have a burgeoning modality for searching any domain of media. As long as you have a model to match, you can scale it to any type of content.
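To make the arithmetic concrete, here's a minimal sketch using OpenAI's Python SDK; the model name and emoji choices are illustrative, not a record of the original experiment:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    # one embedding per input string; Ada-era model named for illustration
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# add and subtract meanings like any other vector
query = embed("🔥") + embed("💧")

# rank candidate emojis by similarity to the combined vector
candidates = ["🌋", "♨️", "🌊", "🧊"]
ranked = sorted(candidates, key=lambda c: cosine(query, embed(c)), reverse=True)
print(ranked)
```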

I've always wanted to search my local music collection sonically. I naively used to think it would be impossible until I found the discogs-effnet model, used by the cosine.club music search engine. It's trained to contrast the raw audio signal of songs against Discogs' genre and style labelling, so it works well for ranking songs by similarity.
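If you want to reproduce the embedding step, Essentia ships the model weights; the snippet below follows the usage from Essentia's model documentation, assuming you've downloaded the discogs-effnet-bs64-1.pb graph file:

```python
import numpy as np
from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs

# load the track as 16 kHz mono, as the model expects
audio = MonoLoader(filename="song.mp3", sampleRate=16000, resampleQuality=4)()

# graph file downloaded from Essentia's models page
model = TensorflowPredictEffnetDiscogs(
    graphFilename="discogs-effnet-bs64-1.pb",
    output="PartitionedCall:1",  # the embedding layer
)

# one embedding per audio patch; average them into a single track vector
patches = model(audio)
track_vector = np.mean(patches, axis=0)
```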

To play with this, I wrote a CLI tool called vecdb that lets you create and search embedding databases locally, using optimised SIMD instructions. It's very barebones, but it lets you search vectors pretty fast. As part of being local-first, it uses mmap() to take advantage of your operating system's filesystem cache between runs. I'm using it below to find the closest possible song to one in my collection. Closest here means the smallest Euclidean distance to the input song's vector.

Finding the closest song to one created by Charli XCX.
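Brute-force nearest-neighbour search over a memory-mapped file is only a few lines; here's a sketch of the idea (vecdb's actual on-disk format isn't shown here, so the flat-float32 layout is an assumption):

```python
import numpy as np

DIM = 1280  # assuming 1280-dimensional discogs-effnet embeddings

# np.memmap wraps mmap(), so the OS filesystem cache persists between runs
vectors = np.memmap("tracks.f32", dtype=np.float32, mode="r").reshape(-1, DIM)

def nearest(query: np.ndarray, k: int = 5) -> np.ndarray:
    # Euclidean distance from the query to every stored vector
    dists = np.linalg.norm(vectors - query, axis=1)
    # indices of the k closest tracks, smallest distance first
    return np.argsort(dists)[:k]
```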

Search is really just the traversal of ordered spaces. The hard part is the creation and mapping of the spaces themselves. Luckily, this is what machine learning is all about. The process of training is really sorting messy input data into a continuous and meaningful manifold: creating organisation from chaos. In his famous book Deep Learning with Python, François Chollet describes it far better than I can:

Imagine two sheets of colored paper: one red and one blue. Put one on top of the other. Now crumple them together into a small ball. That crumpled paper ball is your input data, and each sheet of paper is a class of data in a classification problem. What a neural network is meant to do is figure out a transformation of the paper ball that would uncrumple it...

Uncrumpling paper balls is what machine learning is about: finding neat representations for complex, highly folded data manifolds in high-dimensional spaces (a manifold is a continuous surface, like our crumpled sheet of paper).

Under the banner of our ongoing project vroomai, my friend and collaborator Barney Hill created field*. Starting with a dataset meticulously compiled by world-leading music archivist hurfyd, he reduced a collection of over one million track embeddings into a digestible map. Under the hood it uses UMAP (uniform manifold approximation and projection), an algorithm chiefly used in computational biology to compress huge, complex datasets into a lower-dimensional (2D) space.

Exploring music-space with field*.
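The reduction itself is only a few lines with the umap-learn package; a minimal sketch, assuming the track embeddings are already stacked in a NumPy array (the filename and parameters are illustrative, not necessarily what field* uses):

```python
import numpy as np
import umap

# hypothetical file: one row per track, e.g. ~1M x 1280 embeddings
embeddings = np.load("track_embeddings.npy")

# project to 2D; n_neighbors trades local detail against global structure
reducer = umap.UMAP(n_components=2, n_neighbors=15, metric="euclidean")
coords = reducer.fit_transform(embeddings)  # shape: (n_tracks, 2)
```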

This enables a profoundly different approach to cultural exploration. If you had never seen a map of the real world, you could be forgiven for thinking little exists outside of a handful of major cities. Seeing a globe would shock you; there would be entire continents you hardly knew anything about — this is what field* unlocks for audio.

We're on the cusp of being able to digitally index and explore all forms and spaces of human culture. We have a few questions, though:

If you know the answers, let me know.