Audit Logging: Tracking the AI's Reading Habits

Learn how to record every interaction with your vector data. Master the art of 'Retrieval Auditing' for compliance and security.

In traditional databases, we log who deleted a record. In vector databases, we must also log who read a record. If an employee queries for "Internal financial projections" every day, your security team needs to know, even if that employee never actually downloads a file.

In this lesson, we learn how to implement Retrieval Auditing.


1. The Audit Log Schema

A good vector audit log doesn't just store the query; it stores what the database decided to return.

Key Fields:

  • timestamp: When did the search happen?
  • user_id: Who made the request?
  • query_text: What was the literal string searched?
  • retrieved_ids: A list of the IDs that were returned as Top-K results.
  • similarity_scores: How "close" was each match? (A very high score on a sensitive query means the user's wording closely matched protected content, which is a red flag worth reviewing.)
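
The fields above could be captured in a small record type. This is a minimal sketch, not a required format: the class name RetrievalAuditRecord and the to_json helper are hypothetical, and the field names simply mirror the list above.

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class RetrievalAuditRecord:
    """One audit entry per search (hypothetical type; fields follow the list above)."""
    user_id: str
    query_text: str
    retrieved_ids: list[str]
    similarity_scores: list[float]
    # ISO 8601 UTC timestamp, filled in when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # One JSON object per line is easy to ship to a log pipeline.
        return json.dumps(asdict(self), sort_keys=True)


# Example values are made up for illustration.
record = RetrievalAuditRecord(
    user_id="user_44",
    query_text="How do I quit?",
    retrieved_ids=["resignation_template", "hr_severance_policy_2024"],
    similarity_scores=[0.91, 0.87],
)
print(record.to_json())
```

Keeping retrieved_ids and similarity_scores as parallel lists means position i in one list describes position i in the other; a list of (id, score) pairs would work equally well.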

2. Real-Time Security Alerts

You can use your audit log to trigger alerts for suspicious behavior:

  • Broad Scanning: One user searching for many different sensitive topics in a short time.
  • Top-K Explosion: A user requesting top_k=500 to "Exfiltrate" as much data as possible in one go.
  • Access Violations: A query that targeted a namespace the requesting user was not authorized to access.
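
The three alert types above can be sketched as simple rules evaluated on each request. Everything here is an assumption made for illustration: the AlertEngine class, the threshold values, and the idea that an upstream step has already tagged the query with a sensitive topic (or None). Real thresholds would need tuning against your own traffic.

```python
from collections import defaultdict, deque

# Hypothetical thresholds; tune them for your own system.
MAX_TOP_K = 50              # anything above this counts as a Top-K Explosion
SCAN_WINDOW_SECONDS = 300   # look-back window for Broad Scanning
SCAN_TOPIC_LIMIT = 3        # distinct sensitive topics allowed inside the window


class AlertEngine:
    def __init__(self):
        # user_id -> deque of (timestamp, topic) for recent sensitive queries
        self._recent = defaultdict(deque)

    def check(self, user_id, top_k, namespace, allowed_namespaces,
              sensitive_topic, now):
        """Return the list of alert names raised by a single request."""
        alerts = []
        if top_k > MAX_TOP_K:
            alerts.append("top_k_explosion")
        if namespace not in allowed_namespaces:
            alerts.append("access_violation")
        if sensitive_topic is not None:
            window = self._recent[user_id]
            window.append((now, sensitive_topic))
            # Drop entries that have aged out of the look-back window.
            while window and now - window[0][0] > SCAN_WINDOW_SECONDS:
                window.popleft()
            if len({topic for _, topic in window}) >= SCAN_TOPIC_LIMIT:
                alerts.append("broad_scanning")
        return alerts


engine = AlertEngine()
allowed = {"hr"}
first = engine.check("u1", 5, "hr", allowed, "payroll", now=0)
second = engine.check("u1", 5, "hr", allowed, "mergers", now=60)
third = engine.check("u1", 500, "finance", allowed, "layoffs", now=120)
print(first, second, third)
```

In this example the third request trips all three rules at once: top_k is above the limit, the namespace is not in the allowed set, and it is the third distinct sensitive topic inside five minutes.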

3. Implementation: The Logging Middleware

Since we are using the Middleman Pattern (Module 16.1), we log the interaction in our Python API.

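import json
import logging
from datetime import datetime, timezone

# Assumes `index` (vector DB client) and `model` (embedding model) are
# initialized elsewhere in the API.
# The default log level is WARNING, so INFO entries must be enabled explicitly.
logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("vector_audit")

def secure_vector_search(user_id, query_str, top_k=5):
    # 1. Perform Search
    results = index.query(vector=model.encode(query_str), top_k=top_k)

    # 2. Extract Document IDs and scores for the log
    #    (assumes each match carries a 'score' alongside its 'id')
    matches = results['matches']
    doc_ids = [m['id'] for m in matches]
    scores = [float(m['score']) for m in matches]

    # 3. Write to Audit Log (or ELK stack / Datadog) as one JSON line
    audit_logger.info(json.dumps({
        "event": "retrieval",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "query": query_str,
        "docs": doc_ids,
        "scores": scores
    }))

    return results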

4. Compliance: Proof of Deletion

Under laws like GDPR, you must be able to prove that a user's data was deleted. Your audit logs are the evidence.

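If a user submits a "Forget Me" request (the right to erasure under GDPR), you run a delete operation on the vector DB. Your audit log should record the event, for example: {"event": "deletion", "subject": "user_xyz_data"}, ideally with a timestamp and the IDs that were removed. This log entry is evidence you can present during an audit.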
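
One way to keep the delete and its log entry together is to wrap both in a single function. This is a sketch under stated assumptions: delete_and_log and its parameters are hypothetical, delete_fn stands in for whatever delete call your vector DB client provides, and the extra fields (deleted_ids, requested_by, timestamp) are suggestions beyond the two shown above.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("vector_audit")


def delete_and_log(delete_fn, subject, vector_ids, requested_by):
    """Delete vectors via `delete_fn`, then record a deletion event."""
    # If the delete raises, no "deletion" entry is written, so the log
    # never claims a deletion that did not happen.
    delete_fn(vector_ids)
    event = {
        "event": "deletion",
        "subject": subject,
        "deleted_ids": sorted(vector_ids),
        "requested_by": requested_by,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_logger.info(json.dumps(event))
    return event


# In-memory stand-in for a vector store, for illustration only.
store = {"v1": [0.1], "v2": [0.2], "v3": [0.3]}


def fake_delete(ids):
    for vector_id in ids:
        del store[vector_id]


event = delete_and_log(fake_delete, "user_xyz_data", ["v2", "v1"], "privacy_team")
print(event["deleted_ids"], list(store))
```

Note that the entry stores identifiers only, not the deleted content itself, so the audit log does not become a second copy of the data the user asked you to erase.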


5. Summary and Key Takeaways

  1. Log the Output: Record which document IDs were retrieved, not just the query.
  2. Watch the Scores: High-similarity matches on sensitive topics require investigation.
  3. Immutable Logs: Send your audit logs to a separate system (like CloudWatch or Splunk) where they cannot be deleted by the database administrator.
  4. Visibility: Use dashboards to visualize "Most Retrieved Documents." (If a "Secret" doc is in the Top 10, your system has a leakage risk).
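
The "Most Retrieved Documents" view from takeaway 4 can be computed directly from the audit entries. A minimal sketch, assuming each entry is a dict with a "docs" list like the one written by the logging middleware in section 3; the function names and the sample data are hypothetical.

```python
from collections import Counter


def most_retrieved(log_entries, n=10):
    """Return the n most frequently retrieved document IDs with their counts."""
    counts = Counter()
    for entry in log_entries:
        counts.update(entry["docs"])
    return counts.most_common(n)


def flagged_in_top(log_entries, sensitive_ids, n=10):
    """Return sensitive document IDs that appear among the top n."""
    return [doc for doc, _ in most_retrieved(log_entries, n) if doc in sensitive_ids]


# Made-up audit entries for illustration.
entries = [
    {"user": "u1", "docs": ["a", "b"]},
    {"user": "u2", "docs": ["a", "secret_plan"]},
    {"user": "u3", "docs": ["secret_plan", "a"]},
]
top_two = most_retrieved(entries, n=2)
leaks = flagged_in_top(entries, {"secret_plan"}, n=2)
print(top_two, leaks)
```

A dashboard would typically run the same aggregation inside the log platform itself; the point of the sketch is only that the retrieved IDs must be present in every entry for this view to be possible.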

Exercise: The Security Auditor

  1. You see a log entry: User_44: query='How do I quit?', docs=['resignation_template', 'hr_severance_policy_2024'].
  2. Is this a security risk?
  3. The Question: If the docs list also included ceo_private_payroll_spreadsheet, what would be your first 3 steps to fix the system?

Congratulations on completing Module 16! You are now a guardian of vector data integrity.
