Module 6 Lesson 1: What are Embeddings?
The Math of Meaning. How to turn human words into a list of numbers that represent their semantic soul.
Embeddings: Turning Words into Vectors
Computers do not understand the word "Dog." They understand numbers. An embedding is a mathematical vector (a long list of numbers) that represents the meaning of a piece of text.
1. Meaning in High-Dimensional Space
Imagine a 3D graph.
- The word "Puppy" is very close to "Dog."
- The word "Cat" is somewhat close to "Dog" (both are pets).
- The word "Toaster" is very far from "Dog."
An embedding model (like OpenAI's text-embedding-3-small) scales this idea up, packing hundreds or thousands of these "meaning dimensions" into a single vector.
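To build intuition for the 3D picture above, here is a minimal sketch using made-up 3-dimensional "meaning" vectors (the coordinates are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math

# Toy 3-D "meaning" vectors -- invented values, not real embeddings
dog     = [0.90, 0.80, 0.10]
puppy   = [0.85, 0.90, 0.15]
toaster = [0.05, 0.10, 0.95]

def euclidean(a, b):
    """Straight-line distance between two points in space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(dog, puppy))    # small: "Dog" and "Puppy" are close
print(euclidean(dog, toaster))  # large: "Dog" and "Toaster" are far apart
```

Words with similar meanings end up with similar coordinates, so the distance between them is small.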
2. Similarity Search
Because each piece of text is now a point in space, we can use simple math (cosine similarity or Euclidean distance) to find which chunks of text are most similar to a user's question.
- User Question: "How do I feed my pet?"
- Math Result: The chunk about "Dog food" has a cosine similarity of 0.95.
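The retrieval step above can be sketched in plain Python. The vectors and chunk names here are made up for illustration; in a real system they would come from an embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
question = [0.8, 0.7, 0.1]  # "How do I feed my pet?"
chunks = {
    "Dog food guide": [0.9, 0.8, 0.2],
    "Toaster manual": [0.1, 0.0, 0.9],
}

# Pick the chunk whose vector points in the most similar direction
best = max(chunks, key=lambda name: cosine_similarity(question, chunks[name]))
print(best)  # Dog food guide
```

This ranking step is exactly what a vector store does at scale, just over millions of chunks instead of two.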
3. Top Embedding Providers
- OpenAI: Reliable, high-performance, and standard.
- HuggingFace: Thousands of open-source models (Great for Local AI).
- Cohere: Highly optimized for enterprise and multi-lingual.
4. Visualizing the Vector Space
graph TD
A[Dog] --- B[Puppy]
A --- C[Cat]
D[Toaster] --- E[Microwave]
C -.-|Far apart| D
A -.-|Far apart| E
5. Basic Code Example
from langchain_openai import OpenAIEmbeddings

# Initialize the embedding model (requires an OpenAI API key)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Convert a single sentence into a vector (a list of floats)
vector = embeddings.embed_query("The dog is happy.")

print(f"Vector Dimensions: {len(vector)}")
# Output: 1536 (the default dimension for text-embedding-3-small)
Key Takeaways
- Embeddings map text to mathematical meanings.
- Similar concepts are "Closer" together in the vector space.
- Similarity Search is the engine that powers RAG.
- The Dimension count (e.g., 1536) must match between your search and your storage.
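The last takeaway is worth seeing concretely: vectors from models with different dimension counts cannot be compared, so a guard like this (a sketch, with hypothetical 1536- and 3072-dimension vectors) fails fast instead of silently returning nonsense:

```python
def dot(a, b):
    """Dot product, refusing vectors of mismatched dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

query_vec  = [0.1] * 1536  # e.g. from a 1536-dimension model
stored_vec = [0.1] * 3072  # e.g. from a larger model (hypothetical mismatch)

try:
    dot(query_vec, stored_vec)
except ValueError as e:
    print(e)  # dimension mismatch: 1536 vs 3072
```

In practice this means: if you re-embed your documents with a new model, you must re-embed your queries (and usually rebuild the index) with the same model.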