Unstructured vs Semi-Structured vs Structured Knowledge: The Data Spectrum

To build a Graph RAG system, you must first become a "Data Archeologist." You need to understand the material you are working with. Not all information is created equal; it exists on a spectrum from "Chaos" (Unstructured) to "Order" (Structured).

In traditional Vector RAG, we treat all data as if it were Unstructured: we just turn everything into a blob of text. In Graph RAG, we treat data as something to be transformed: we take the chaos of a PDF and extract the rigid structure of a graph. In this lesson, we will explore the three levels of knowledge representation and why the "Semi-Structured" middle ground is where the most valuable AI insights live.

1. Unstructured Data: The Ocean of Text

Examples: Blog posts, emails, Slack messages, voice transcripts, PDF whitepapers.

Unstructured data is information that has no predefined data model and is not organized in a predefined way. It is typically heavy in natural-language text.

Pros: It preserves nuances, tone, and context. It is easy for humans to read.
Cons: It is extremely hard for a machine to query precisely. You cannot "Sum" the values in a blog post without an LLM parsing it first.
The Vector Trap: This is where most RAG systems start and end. They treat a legal contract as just another collection of tokens.
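
To make the "you cannot Sum a blog post" point concrete, here is a minimal sketch that tries to total the amounts in a piece of free text using a regular expression. The sample sentence and the function name `sum_dollar_amounts` are made up for illustration. The pattern only catches amounts written as `$` followed by digits, so the amount spelled out in words is silently missed.

```python
import re

# Hypothetical free-text snippet; the amounts are invented for illustration.
text = (
    "We paid $120 for hosting and $30.50 for the domain. "
    "Last year the domain cost twelve dollars."
)

def sum_dollar_amounts(raw: str) -> float:
    """Add up every amount written as $<digits>; everything else is ignored."""
    matches = re.findall(r"\$(\d+(?:\.\d+)?)", raw)
    return sum((float(m) for m in matches), 0.0)

total = sum_dollar_amounts(text)
print(total)  # "twelve dollars" never reaches the sum
```

The brittleness is the point: any phrasing the pattern did not anticipate is lost, which is why unstructured text usually needs an LLM or a dedicated parser before it can be queried precisely.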

2. Semi-Structured Data: The Bridge

Examples: JSON files, XML, HTML, Markdown, CSVs.

Semi-structured data is information that doesn't reside in a relational database but has some organizational properties that make it easier to analyze. It usually contains tags or markers to separate semantic elements and enforce hierarchies.

The Importance of Markdown: Markdown is a widely used format for LLM context because it is both human-readable and machine-parseable. Headers (#, ##) and lists (*, 1.) provide "Anchor Points" for the model's reasoning.
The Metadata Advantage: Semi-structured data often has a "Header" (e.g., Author, Date, Category). This allows for simple filtering before the vector search even starts.
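
The filtering idea above can be sketched as a two-step search: narrow the corpus by metadata first, then rank only the survivors by vector similarity. Everything here is an assumption for illustration: the `docs` records, the toy two-dimensional vectors, and the `search` helper. A real system would use an embedding model and a vector store instead.

```python
# Hypothetical corpus: header metadata plus a toy 2-D embedding per document.
docs = [
    {"id": "a", "author": "Ana", "date": "2025-01-10", "category": "finance", "vec": [0.9, 0.1]},
    {"id": "b", "author": "Ben", "date": "2024-06-02", "category": "legal", "vec": [0.8, 0.2]},
    {"id": "c", "author": "Ana", "date": "2025-03-05", "category": "finance", "vec": [0.1, 0.9]},
]

def dot(u, v):
    """Dot product, used here as a stand-in for a similarity score."""
    return sum(x * y for x, y in zip(u, v))

def search(query_vec, category, top_k=1):
    # Step 1: cheap metadata filter, applied before any vector math.
    candidates = [d for d in docs if d["category"] == category]
    # Step 2: rank only the remaining documents by similarity to the query.
    ranked = sorted(candidates, key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

result = search([1.0, 0.0], category="finance")
print(result)
```

Document "b" scores well on similarity alone, but it never enters the ranking because the category filter removes it first.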

3. Structured Data: The Rigid Skeleton

Examples: SQL Databases (Postgres, MySQL), Excel tables (when used correctly), System Logs (when every entry follows a fixed field layout).

Structured data is information formatted for easy entry, search, and analysis. It follows a strict "Schema." If the schema says a column is an Integer, you cannot put the word "Twelve" into it.

Pros: Total precision. You can calculate the exact average of 1 million rows in milliseconds.
Cons: High "Brittleness." If the user asks a question that wasn't planned for in the database schema, the system returns nothing. It lacks the "Intuition" of an LLM.
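
Both sides of this trade-off can be shown with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `orders` table and its `quantity` column are hypothetical, and because SQLite is loosely typed by default, the example adds a `CHECK` constraint to get the strict integer behavior that engines such as Postgres enforce on their own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table. The CHECK constraint makes SQLite reject non-integer
# values, mimicking the strict column typing of other SQL engines.
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " quantity INTEGER NOT NULL CHECK (typeof(quantity) = 'integer'))"
)
conn.executemany("INSERT INTO orders (quantity) VALUES (?)", [(10,), (12,), (14,)])

# The schema refuses the word "Twelve" in an integer column.
try:
    conn.execute("INSERT INTO orders (quantity) VALUES (?)", ("Twelve",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Precise aggregation is a single query once the data is structured.
average = conn.execute("SELECT AVG(quantity) FROM orders").fetchone()[0]
print(rejected, average)
```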

graph TD
    A[Raw Text - Unstructured] -->|Parsing| B[JSON/Markdown - Semi-Structured]
    B -->|Entity Extraction| C[Relational Table / Graph - Structured]
    
    subgraph "The Graph RAG Value Add"
    B
    C
    end
    
    style C fill:#4285F4,color:#fff
    style B fill:#34A853,color:#fff

4. The Transformation: Turning Text into Vertices

The "Magic" of Graph RAG is the ability to turn Unstructured inputs into Structured knowledge.

Scenario: A 5-page legal agreement.

Unstructured: "The party of the first part, hereafter referred to as 'The Vendor'..."
Semi-Structured: {"parties": [{"role": "Vendor", "name": "Acme Corp"}]}
Structured (Graph): (Acme Corp)-[:HAS_ROLE]->(Vendor)

By moving data "Up the Spectrum," we enable Logical Queries. We can now ask: "Which of our vendors have roles in contracts that expire in Q3?" This is nearly impossible to answer with raw text alone.
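
As a rough sketch of that move up the spectrum, the snippet below turns semi-structured contract records into (Subject, Relationship, Object) triples and then answers the Q3 question over them. The contract IDs, expiry dates, the second vendor, and the `EXPIRES_ON` and `PARTY_TO` relationship names are assumptions added for illustration; only the `parties` shape and `HAS_ROLE` come from the scenario above. Roles are treated as global rather than per contract to keep the example short.

```python
from datetime import date

# Hypothetical extraction output for two contracts.
contracts = [
    {"contract_id": "C-1", "expires": "2026-08-15",
     "parties": [{"role": "Vendor", "name": "Acme Corp"}]},
    {"contract_id": "C-2", "expires": "2026-11-30",
     "parties": [{"role": "Vendor", "name": "Globex"}]},
]

def to_triples(contract):
    """Flatten one contract record into (subject, relationship, object) triples."""
    cid = contract["contract_id"]
    triples = [(cid, "EXPIRES_ON", contract["expires"])]
    for party in contract["parties"]:
        triples.append((party["name"], "HAS_ROLE", party["role"]))
        triples.append((party["name"], "PARTY_TO", cid))
    return triples

triples = [t for c in contracts for t in to_triples(c)]

def vendors_expiring_in_q3(all_triples, year):
    """Vendors that are party to a contract expiring in July to September."""
    vendors = {s for s, p, o in all_triples if p == "HAS_ROLE" and o == "Vendor"}
    expiring = set()
    for s, p, o in all_triples:
        if p == "EXPIRES_ON":
            d = date.fromisoformat(o)
            if d.year == year and d.month in (7, 8, 9):
                expiring.add(s)
    return sorted(
        s for s, p, o in all_triples
        if p == "PARTY_TO" and s in vendors and o in expiring
    )

q3_vendors = vendors_expiring_in_q3(triples, 2026)
print(q3_vendors)
```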

5. Implementation: A JSON-Driven Context Builder

Let's look at a simple Python example of how we can use semi-structured data to help an agent be more precise.

# A semi-structured representation of a document
doc_metadata = {
    "doc_id": "TRANS-2026",
    "subject": "Q1 Revenue",
    "entities": [
        {"name": "Sudeep", "type": "CEO"},
        {"name": "London Office", "type": "Location"}
    ],
    "content": "Sudeep visited the London office to announce the Q1 revenue increase of 15%."
}

# Build the "Structured Context" we send to the LLM
def build_prompt(data, question):
    entity_names = ", ".join(e["name"] for e in data["entities"])
    # Joining a list of lines keeps stray indentation out of the prompt
    lines = [
        f"CONTEXT ID: {data['doc_id']}",
        f"SUBJECT: {data['subject']}",
        f"ENTITIES MENTIONED: {entity_names}",
        "",
        f"TEXT: {data['content']}",
        "",
        f"QUESTION: {question}",
    ]
    return "\n".join(lines)

# Listing the entities explicitly gives the model an anchor and reduces
# the chance that it 'Hallucinates' a visitor such as John Doe.
print(build_prompt(doc_metadata, "Who visited the London office?"))

6. Why "Native Graphs" are the Ultimate Structure

While a SQL table is structured, it is "Flat." It stores information in rows. To find a relationship, you must perform a "JOIN," which becomes computationally expensive as the "Hops" increase.

A Knowledge Graph stores information natively as relationships. In a graph, the "Structure" IS the "Relationship." Following a connection is a direct traversal rather than a table lookup, which makes graphs a natural fit for domains where the links between things matter as much as the things themselves.
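
A minimal sketch of what "hops" look like when relationships are stored directly: the graph below is a plain Python dictionary of outgoing edges, and a breadth-first search collects everything reachable within a given number of hops. The node names, relationship labels, and the `within_hops` helper are hypothetical. In SQL, each extra hop would typically require another self-join, while here it is one more dictionary lookup per node.

```python
from collections import deque

# Hypothetical graph: node -> list of (relationship, neighbor) edges.
graph = {
    "Sudeep": [("VISITED", "London Office")],
    "London Office": [("LOCATED_IN", "London")],
    "London": [("CAPITAL_OF", "United Kingdom")],
    "United Kingdom": [],
}

def within_hops(start, max_hops):
    """Breadth-first search returning {node: hop_count} within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relationship, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # report only the nodes that were reached
    return seen

two_hops = within_hops("Sudeep", 2)
print(two_hops)
```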

7. Summary and Exercises

Data exists on a spectrum of Order.

Unstructured data is easy to collect but hard to reason about.
Structured data is easy to reason about but hard to create.
Graph RAG bridges the gap by programmatically turning the Unstructured into the Structured.

Exercises

Categorization Task: Look at a receipt from a coffee shop. What parts are Unstructured (e.g., the logo, the "Thank you" note)? What parts are Semi-Structured (e.g., the itemized list)? What parts are Structured (e.g., the Total Price, the Date)?
Schema Design: If you were to turn this lesson into a JSON object, what keys would you use (e.g., Title, Module, Keywords, Key takeaways)?
Transformation Thinking: Take a short news article (3 paragraphs). Manually extract 5 "Facts" from it in the format: (Subject) (Relationship) (Object). (e.g., (The President) (Visited) (France)).

In the next lesson, we will look at the building blocks of these facts: Documents, Chunks, Entities, and Facts.