Identifying Entities from Raw Data: The Extraction Phase


Learn the high-precision art of entity extraction. Discover how to identify key actors in unstructured text and transform them into unique, resolvable graph nodes using LLMs and NLP.


We've moved from "Graph Theory" to "Graph Design." The first—and most critical—step in building a Knowledge Graph is Entity Extraction. If you fail to identify an entity, it won't exist in your graph. If it doesn't exist in your graph, your AI agent is "Blind" to its relationships.

In this lesson, we will look at the techniques for finding the "Nouns" in your raw data. We will compare Statistical Named Entity Recognition (NER) with LLM-Based Extraction, and we'll see why "Entity Discovery" is a balance between precision (getting it right) and recall (getting everything).


1. The Entity Discovery Workflow

Extracting entities from a 500-page PDF isn't a single step. It is a pipeline.

  1. Normalization: Standardizing the text (e.g., removing whitespace, fixing OCR errors).
  2. Candidate Selection: Identifying potential entities (e.g., "Apple," "Project Titan," "Cupertino").
  3. Entity Resolution (Disambiguation): Deciding if "Apple" refers to the fruit or the company.
  4. Labeling: Assigning a type to the entity (e.g., ORG, PERSON, LOC).
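The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a real extraction library: the capitalization heuristic and the `KNOWN` lookup table are assumptions standing in for real candidate selection and resolution.

```python
import re

def normalize(text: str) -> str:
    """Step 1: collapse whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def candidates(text: str) -> list[str]:
    """Step 2: naive heuristic -- runs of capitalized words are candidates."""
    return re.findall(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\b", text)

# Toy disambiguation table; a real system would use context or a knowledge base.
KNOWN = {"Apple": "ORG", "Cupertino": "LOC"}

def resolve_and_label(names: list[str]) -> list[tuple[str, str]]:
    """Steps 3-4: keep only candidates we can resolve, and attach a type."""
    return [(n, KNOWN[n]) for n in names if n in KNOWN]

text = normalize("Apple   opened a new\ncampus in Cupertino.")
print(resolve_and_label(candidates(text)))  # [('Apple', 'ORG'), ('Cupertino', 'LOC')]
```

Note that "opened a new campus" is silently dropped at step 3: anything the resolver cannot place is discarded, which is exactly where precision and recall start to trade off.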

2. Statistical NER: The "Fast" Baseline

Before LLMs, we used libraries like spaCy, NLTK, or Stanford CoreNLP. These models are small, fast, and run locally.

  • Pros: Inexpensive, works at thousands of tokens per second.
  • Cons: Rigid. They are great at "Person" and "Date," but terrible at custom entities like "Internal Code Name" or "Chemical Compound" unless you retrain them.
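To make the rigidity concrete, here is a toy gazetteer-style matcher in the spirit of these fast baselines (the lookup table and labels are illustrative assumptions, not any library's API). It runs instantly, but it is blind to anything outside its fixed vocabulary.

```python
# Fixed vocabulary: the only entities this "model" will ever find.
GAZETTEER = {
    "Tim Cook": "PERSON",
    "Cupertino": "LOC",
}

def baseline_ner(text: str) -> list[tuple[str, str]]:
    """Fast but rigid: exact matches against a fixed vocabulary only."""
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(baseline_ner("Tim Cook announced 'Project Titan' in Cupertino."))
# 'Project Titan' is missed -- the custom entity isn't in the vocabulary,
# which is exactly the retraining problem described above.
```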

3. LLM-Based Extraction: The "Smart" Specialist

LLMs (like GPT-4 or Gemini 1.5) are the gold standard for entity extraction in Graph RAG.

Why? Because an LLM understands Context.

  • Text: "The lead designer stayed in the hotel."
  • Classic NER: Finds nothing; there is no capitalized proper noun to tag.
  • LLM Extraction: "I found an entity 'Lead Designer' (Role) linked to 'Hotel' (Location)."

The Prompting Pattern: You don't just ask "Extract entities." You ask: "Identify all Entities in this text that fall into these categories: MISSION, INSTRUMENT, SCIENTIST. Return them in a JSON list."

graph TD
    Raw[Raw Text] --> P[Prompt Agent]
    P -->|Extract| J[JSON List]
    J -->|Resolution| R[Entity Node]
    
    subgraph "High Precision Filter"
    R
    end
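The prompting pattern above can be captured in a small helper that assembles the constrained prompt. The template wording and sample text here are illustrative assumptions; the point is that the categories and the JSON output format are stated explicitly.

```python
def build_extraction_prompt(text: str, categories: list[str]) -> str:
    """Assemble a constrained extraction prompt: explicit categories,
    explicit output format, then the raw text."""
    return (
        "Identify all Entities in this text that fall into these categories: "
        + ", ".join(categories) + ".\n"
        "Return them as a JSON list of objects with 'name' and 'type' keys.\n\n"
        "TEXT:\n" + text
    )

prompt = build_extraction_prompt(
    "The probe's magnetometer was calibrated by the mission scientist.",
    ["MISSION", "INSTRUMENT", "SCIENTIST"],
)
print(prompt)
```

Constraining the categories up front is what turns the LLM from a chatty summarizer into a high-precision filter: anything outside the list is simply not returned.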

4. Entity Granularity: How Small is Too Small?

One of the biggest mistakes in Graph RAG is Over-Extraction.

  • If you create a node for every "Meeting" and every "Email Subject Line," your graph becomes a "Hairball."
  • If you only create nodes for "People," your graph is too sparse.

The Golden Rule: If an entity has its own Lifecycle or Properties, it should be a node. If it is just a piece of metadata, it should be a Property on another node.
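The Golden Rule can be written down as a simple decision function. The `has_lifecycle` and `properties` fields are hypothetical names for illustration; the test they encode is the rule above.

```python
def should_be_node(entity: dict) -> bool:
    """Golden Rule as code: an entity with its own lifecycle or properties
    becomes a node; bare metadata stays a property on another node."""
    return entity.get("has_lifecycle", False) or bool(entity.get("properties"))

meeting = {"name": "Q3 Planning", "has_lifecycle": False, "properties": {}}
project = {"name": "Valkyrie", "has_lifecycle": True,
           "properties": {"status": "active"}}

print(should_be_node(meeting))  # False: stays a property, avoids the hairball
print(should_be_node(project))  # True: gets its own node
```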


5. Implementation: Context-Aware Extraction with Python

We will use an LLM-based pattern (simulated) to show how to extract specific, complex entities.

import json

# The Prompt we would send to an LLM
prompt_template = """
EXTRACT ENTITIES FROM THE TEXT BELOW.
CATEGORIES: [PROJECT, TECHNOLOGY, RISK]
FORMAT: JSON list of objects with 'name' and 'type'.

TEXT:
The 'Valkyrie' initiative will replace our legacy 'COBOL' systems 
but faces a significant 'Cloud Latency' risk.
"""

# The Simulated Output
llm_output = [
    {"name": "Valkyrie", "type": "PROJECT"},
    {"name": "COBOL", "type": "TECHNOLOGY"},
    {"name": "Cloud Latency", "type": "RISK"}
]

def map_to_graph_nodes(extracted_list):
    nodes = []
    for item in extracted_list:
        # Escape single quotes so a name like "O'Brien" doesn't break the query.
        name = item["name"].replace("'", "\\'")
        nodes.append(f"CREATE (n:{item['type']} {{name: '{name}'}})")
    return nodes

# These Cypher-like strings are what we use to build the graph.
print(map_to_graph_nodes(llm_output))
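In production you would avoid splicing names into query strings at all and pass them as query parameters instead. A hedged sketch of the same mapping in parameterized form (the driver call that would execute these pairs is omitted):

```python
def map_to_parameterized(extracted_list):
    """Build (query, params) pairs so entity names never get spliced
    into the Cypher text. Note: Cypher labels cannot be parameters,
    so item['type'] should still be validated against an allow-list."""
    return [
        (f"CREATE (n:{item['type']} {{name: $name}})", {"name": item["name"]})
        for item in extracted_list
    ]

pairs = map_to_parameterized([{"name": "Valkyrie", "type": "PROJECT"}])
print(pairs)  # [('CREATE (n:PROJECT {name: $name})', {'name': 'Valkyrie'})]
```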

6. Summary and Exercises

Entity extraction is the "Sensing" layer of Graph RAG.

  • Standard NER is fast but rigid.
  • LLMs are slower and costlier, but contextual and flexible.
  • Granularity is a design decision: Nouns vs. Concepts.
  • Output should always be structured (JSON) for easy consumption.

Exercises

  1. Manual Extraction: Take a page from a manual or textbook. List every noun. Now, filter them. Which ones "Matter" enough to be a node?
  2. The "Mercury" Resolution: If you have two documents, one about "The Planet Mercury" and one about "Mercury Records," what extra information do you need to extract from those documents to ensure they don't get merged into the same node?
  3. Category Design: Pick a hobby (e.g., Cooking). What are the 5 core Entity Categories you would use to model recipes and ingredients?

In the next lesson, we will look at how to connect these entities: Defining Relationships That Matter.
