Module 6 Lesson 1: What are Embeddings?
The Math of Meaning. How to turn human words into a list of numbers that represent their semantic soul.
Embeddings: Turning Words into Vectors
Computers do not understand the word "Dog." They understand numbers. An embedding is a mathematical vector (a long list of numbers) that represents the meaning of a piece of text.
1. Meaning in High-Dimensional Space
Imagine a 3D graph.
- The word "Puppy" is very close to "Dog."
- The word "Cat" is somewhat close to "Dog" (both are pets).
- The word "Toaster" is very far from "Dog."
An embedding model (like OpenAI's text-embedding-3-small) scales this idea up, packing hundreds or thousands of these "meaning dimensions" into a single vector.
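To build intuition for the 3D picture above, here is a minimal sketch using made-up 3-dimensional "meaning" vectors (the coordinates are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math

# Toy 3-D "meaning" vectors -- invented values, not real embeddings
dog     = [0.90, 0.80, 0.10]
puppy   = [0.85, 0.90, 0.15]
toaster = [0.05, 0.10, 0.95]

def euclidean(a, b):
    """Straight-line distance between two points in space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(dog, puppy))    # small: "Dog" and "Puppy" are close
print(euclidean(dog, toaster))  # large: "Dog" and "Toaster" are far apart
```

Words with similar meanings end up with similar coordinates, so the distance between them is small.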
2. Similarity Search
Because each piece of text is now a point in space, we can use simple math (cosine similarity or Euclidean distance) to find which chunks of text are most similar to a user's question.
- User Question: "How do I feed my pet?"
- Math Result: The chunk about "Dog food" has a cosine similarity of 0.95.
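The retrieval step above can be sketched in plain Python. The vectors and chunk names here are made up for illustration; in a real system they would come from an embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
question = [0.8, 0.7, 0.1]  # "How do I feed my pet?"
chunks = {
    "Dog food guide": [0.9, 0.8, 0.2],
    "Toaster manual": [0.1, 0.0, 0.9],
}

# Pick the chunk whose vector points in the most similar direction
best = max(chunks, key=lambda name: cosine_similarity(question, chunks[name]))
print(best)  # Dog food guide
```

This ranking step is exactly what a vector store does at scale, just over millions of chunks instead of two.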
3. Top Embedding Providers
- OpenAI: Reliable, high-performance, and standard.
- HuggingFace: Thousands of open-source models (Great for Local AI).
- Cohere: Highly optimized for enterprise and multi-lingual.
4. Visualizing the Vector Space
graph TD
A[Dog] --- B[Puppy]
A --- C[Cat]
D[Toaster] --- E[Microwave]
C -.-|Far apart| D
A -.-|Far apart| E
5. Basic Code Example
from langchain_openai import OpenAIEmbeddings

# Initialize the embedding model (requires an OpenAI API key)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Convert a single sentence into a vector (a list of floats)
vector = embeddings.embed_query("The dog is happy.")

print(f"Vector Dimensions: {len(vector)}")
# Output: 1536 (the default dimension for text-embedding-3-small)
Key Takeaways
- Embeddings map text to mathematical meanings.
- Similar concepts are "Closer" together in the vector space.
- Similarity Search is the engine that powers RAG.
- The Dimension count (e.g., 1536) must match between your search and your storage.
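The last takeaway is worth seeing concretely: vectors from models with different dimension counts cannot be compared, so a guard like this (a sketch, with hypothetical 1536- and 3072-dimension vectors) fails fast instead of silently returning nonsense:

```python
def dot(a, b):
    """Dot product, refusing vectors of mismatched dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

query_vec  = [0.1] * 1536  # e.g. from a 1536-dimension model
stored_vec = [0.1] * 3072  # e.g. from a larger model (hypothetical mismatch)

try:
    dot(query_vec, stored_vec)
except ValueError as e:
    print(e)  # dimension mismatch: 1536 vs 3072
```

In practice this means: if you re-embed your documents with a new model, you must re-embed your queries (and usually rebuild the index) with the same model.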