
Galaxies of words, with text models

August 05, 2025

Cyprien

(work in progress)

Explanation of embeddings

If you already know what embeddings are, skip this section.

Machine learning models don't deal with raw text the way we do. They deal with numerical representations of text, also known as embeddings.

Why?

Why embeddings? Because, for a computer, seeing text as a string of characters is not very informative.

Consider two words: image and picture.

As a human, you can tell that these two words are closely related and similar in meaning.

But how can a computer know that these two words are semantically similar? You cannot really reason from the characters alone. This is one kind of problem that embeddings try to solve.

What do embeddings look like?

Embeddings are vectors, in the mathematical sense.

For example, the embedding of the word "image" might look like this vector:

[ 0.2, 0.1, 0.6 ]

And we might expect that a related word, like "picture", will have an embedding very close to this one, like:

[ 0.3, 0.1, 0.5 ]

And, an unrelated word, like "elephant":

[ 0.9, 2.2, -1.5 ]
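
To get a feel for what "very close" means: the usual way to compare embeddings is cosine similarity. Here is a minimal sketch using the toy vectors above (the values are illustrative, as before):

```python
# Compare the toy 3-dimensional vectors above with cosine similarity:
# related words should score close to 1, unrelated words much lower.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image    = [0.2, 0.1, 0.6]
picture  = [0.3, 0.1, 0.5]
elephant = [0.9, 2.2, -1.5]

print(cosine_similarity(image, picture))   # ~0.98, very similar
print(cosine_similarity(image, elephant))  # ~-0.28, unrelated
```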

Of course, real-world embeddings are more complicated than this. They are not 3-dimensional like my examples above; real models produce embedding vectors with much higher dimensionality. OpenAI's most capable text embedding model, text-embedding-3-large, produces 3072-dimensional embeddings.
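
If you have an OpenAI API key, you can check that number yourself. A small sketch using the official Python client:

```python
# Fetch one embedding from text-embedding-3-large and check its size
# (requires the openai package and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-large", input="image")
print(len(response.data[0].embedding))  # 3072
```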

Does that mean I have to write a vector for each word in the English dictionary?

No. Embeddings are produced in a number of different ways. For neural networks, embeddings are learned during training! Some architectures are designed specifically to generate embeddings that are useful for a particular task.
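
As a small illustration of "learned during training": in a framework like PyTorch, an embedding layer is just a table of vectors that gets updated by backpropagation like any other weight. A minimal sketch (3-dimensional only to match the toy examples above):

```python
# An embedding layer is a lookup table of vectors, one per vocabulary entry.
# Its values start out random and are adjusted during training.
import torch
import torch.nn as nn

vocab = {"image": 0, "picture": 1, "elephant": 2}
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

ids = torch.tensor([vocab["image"], vocab["picture"]])
print(embedding_layer(ids))  # two 3-dimensional vectors, meaningless until trained
```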

My visualization of embeddings

I really wanted to visualize the space spanned by the embedding vectors of the current top embedding models.

The Embedding Leaderboard on HuggingFace shows the performance of many models relative to their parameter count.

For my visualization, I wanted to try two really good models: intfloat/multilingual-e5-large-instruct and Qwen/Qwen3-Embedding-0.6B.

Methodology

My methodology was very simple:

  1. Embed the entire English dictionary.
  2. Reduce the embeddings' dimensionality.
  3. Cluster the reduced-dimensionality embeddings.
  4. Reduce the dimensionality again, down to 3D.

The first dimensionality reduction step was necessary for my poor laptop to be able to cluster the embedding vectors at all. The second reduction is obviously necessary for us poor humans, who can only see in at most three dimensions.
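
Here is a rough sketch of what that pipeline looks like in code. PCA and KMeans below are just stand-ins for whichever reduction and clustering algorithms you prefer, and the tiny word list stands in for a full dictionary:

```python
# Sketch of the four steps: embed, reduce, cluster, reduce again to 3D.
# PCA / KMeans are placeholder choices; swap in your favorite algorithms.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

words = ["image", "picture", "photo", "elephant", "lion", "tiger"]  # stand-in for a full word list

# 1. Embed every word (the e5-instruct family recommends an instruction
#    prefix for queries; see the model card for the exact format).
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
embeddings = model.encode(words, normalize_embeddings=True)

# 2. First reduction, so clustering stays tractable on a laptop
#    (use something like 50 components for a real dictionary).
reduced = PCA(n_components=4).fit_transform(embeddings)

# 3. Cluster the reduced embeddings.
labels = KMeans(n_clusters=2).fit_predict(reduced)

# 4. Second reduction, down to the 3 dimensions we can actually look at.
coords_3d = PCA(n_components=3).fit_transform(reduced)

for word, label, xyz in zip(words, labels, coords_3d):
    print(f"{word:10s} cluster={label} xyz={xyz.round(2)}")
```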

Plotting

Initially, I just wanted to plot the final 3D clustered embeddings using Plotly. You can see the Plotly demo on the Embedding space 3d plot page.
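
The kind of Plotly figure I mean is only a few lines; this sketch assumes the coords_3d, labels and words variables from the pipeline sketch above:

```python
# A 3D scatter of the reduced embeddings, colored by cluster,
# showing the word itself on hover.
import plotly.express as px

fig = px.scatter_3d(
    x=coords_3d[:, 0],
    y=coords_3d[:, 1],
    z=coords_3d[:, 2],
    color=[str(label) for label in labels],  # one color per cluster
    hover_name=words,                        # show the word on hover
)
fig.update_traces(marker_size=3)
fig.show()
```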

But quickly, I realized that navigating the embeddings didn't really feel... fun? So I really wanted to be able to "explore" the embeddings. That's why I created Flying in embedding space: a first-person view inside the embedding space, where you can move around using WASD and the mouse. I wanted the embedding space to feel like a galaxy of word "stars", where hopefully the clusters of semantically related words become apparent.

I built the flying demo using Three.js. It was a fun learning experience; I had never done a game-like 3D thing before. Choosing the controls turned out to be harder than I expected. In the end, I opted for a familiar solution: I wanted to control the view just like a person moving around in Minecraft creative mode. It felt natural to me.

Beyond

The small demos are nice and all, but I think they could become really cool with some of the following ideas I want to try out: