Embeddings, Vectors, and Arithmetic

December 2023


An embedding is a numerical representation of a piece of text, such as a word, sentence, or paragraph. Traditionally this value takes the form of a mathematical vector: a point in space. You can think of it as analogous to coordinates on a map that just happens to have many, many dimensions.

Once you have generated these embeddings, you can run all the computationally cheap operations that work on any set of vectors. Lilian Weng's project, for example, ranks the emojis closest to your search query in meaning-space.

Lilian Weng's Emoji Search

In this example, the vector representation of every emoji has been computed in advance using OpenAI's Ada model. When you execute a search, only your new query is translated into an embedding. The results are simply the closest few emoji vectors, calculated using Euclidean distance or cosine similarity. The semantic nature of the ranking is a byproduct of the AI model's intuitive association between these related concepts.
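As a rough sketch of that lookup, here is a nearest-neighbour ranking over tiny made-up 3-dimensional vectors. Real Ada embeddings have far more dimensions and would come from the model. The names `cosine_similarity`, `closest`, and `EMOJI_VECTORS` are illustrative assumptions, not taken from the emoji search project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest(query_vec, index, k=3):
    """Return the k names in `index` whose vectors are most similar to `query_vec`."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical precomputed vectors; in practice these would be stored
# model embeddings, computed once ahead of time.
EMOJI_VECTORS = {
    "🍔": [0.9, 0.1, 0.0],
    "🍕": [0.8, 0.2, 0.1],
    "🚗": [0.0, 0.9, 0.2],
    "🐶": [0.1, 0.1, 0.9],
}

# Stand-in for the embedding of a new search query.
query = [1.0, 0.1, 0.0]
print(closest(query, EMOJI_VECTORS, k=2))
```

Only the query needs embedding at search time; everything else is a sort over stored vectors, which is why this kind of search is cheap.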

Intrigued by this, Barney Hill and I have started to explore the idea of using arithmetic on language vectors. What does semantic addition look like? We weren't sure, so we built a simple app that lets you add two emojis and see the closest known emoji to that result.

UK + Burger = America.
The emoji arithmetic project by Barney Hill and me.
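The addition itself can be sketched in a few lines. This is a minimal illustration and not the app's actual code: the vectors are invented so that the example works, the helper `add_and_find` is hypothetical, and excluding the two inputs from the candidates is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def add_and_find(left, right, vectors):
    """Add two emoji vectors element-wise and return the nearest other emoji."""
    total = [x + y for x, y in zip(vectors[left], vectors[right])]
    # Assumption: the two inputs are not allowed to be their own answer.
    candidates = {
        name: vec for name, vec in vectors.items() if name not in (left, right)
    }
    return max(candidates, key=lambda name: cosine_similarity(total, candidates[name]))


# Made-up 3-d vectors; the axes loosely stand for "nation", "fast food", "tea".
TOY_VECTORS = {
    "🇬🇧": [0.9, 0.0, 0.4],
    "🍔": [0.0, 1.0, 0.0],
    "🇺🇸": [0.8, 0.6, 0.0],
    "🇫🇷": [0.9, 0.1, 0.1],
    "🍕": [0.1, 0.9, 0.0],
}

print(add_and_find("🇬🇧", "🍔", TOY_VECTORS))
```

With real embeddings no axis has a tidy meaning like this, but the mechanics are the same: add the vectors, then search for the nearest known vector.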

It worked remarkably well for the most part, although the model reflects many stereotypes and flaws present in the training data. At Prodia, we've started to investigate building safety systems by checking whether input prompts are within a distance threshold of known adult or illegal concepts. The CompVis group have created a widely used safety filter that does something similar, after first processing an image via OpenAI's CLIP model.
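A distance-threshold check of this kind might look roughly like the sketch below. It is not Prodia's system or the CompVis filter: the concept vectors, the `is_flagged` helper, and the threshold of 0.1 are all placeholder assumptions, and a real threshold would need tuning against labelled examples.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction, larger means less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_flagged(prompt_vec, blocked_vecs, threshold=0.1):
    """True if the prompt embedding lies within `threshold` of any blocked concept."""
    return any(cosine_distance(prompt_vec, vec) <= threshold for vec in blocked_vecs)


# Hypothetical embeddings of concepts we want to block (toy 3-d values).
BLOCKED_CONCEPTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

near_prompt = [0.95, 0.1, 0.0]  # points almost the same way as the first concept
far_prompt = [0.1, 0.1, 0.9]    # points somewhere else entirely

print(is_flagged(near_prompt, BLOCKED_CONCEPTS))
print(is_flagged(far_prompt, BLOCKED_CONCEPTS))
```

The trade-off sits in the threshold: too tight and rephrased prompts slip through, too loose and harmless prompts get blocked.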

It is not hard to imagine a radically different, fuzzier future now that machines can reason about meaning deep within text. Who needs files and organisation when semantic search is deeply and widely integrated? Embeddings can be representations of many more things than text, like audio or video. Multi-modal search across many media types may not be far away.

To be continued.