Enterprise Agentic Platform¶
OLAV is not just an AI chat tool -- it is a complete Enterprise Agentic Platform with multi-layer memory and caching, multi-user isolation, audit-driven LLM fine-tuning, and a federated Specialist Agent architecture. This page systematically introduces these production-grade features along with measured performance data.
Feature Claims
| ID | Claim | Status |
|---|---|---|
| C-L2-38 | Per-Agent model assignment (agent_overrides) | ✅ v0.10.0 |
| C-L2-22 | Audit data export in training formats (trajectory/sft/atif) | ✅ v0.10.0 |
| C-L2-12 | Multi-user concurrent audit with no write conflicts | ✅ v0.10.0 |
| C-L2-21 | Knowledge base semantic search (vector + BM25 hybrid) | ✅ v0.10.0 |
Multi-Layer Agentic Architecture¶
User request
|
+---------------------------------------------+
| Tier-0 SemanticCache (LanceDB) | <- Vector similarity hit, 10ms response
| Identical semantic queries never trigger |
| any LLM call |
+---------------------------------------------+
| miss
+---------------------------------------------+
| Tier-1 LLM SQLiteCache | <- Exact prompt hit, <1ms response
| Same conversation context avoids |
| redundant API calls |
+---------------------------------------------+
| miss
+---------------------------------------------+
| Tier-2 LLM API Call (OpenAI / OpenRouter) | <- Actual network request, ~2-30s
+---------------------------------------------+
| result
+---------------------------------------------+
| Episodic Memory (LanceDB) | <- Run results written to long-term memory
| LangGraph Checkpoint (DuckDB) | <- Conversation state persisted
| Audit Log (audit.duckdb) | <- Fully auditable trace
+---------------------------------------------+
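The tiers above compose into a simple lookup chain: check the semantic cache, fall back to the exact-match cache, and only then hit the provider API, back-filling both caches on the way out. A minimal sketch of that pattern (class and method names are illustrative, not OLAV's actual API):

```python
import hashlib


class TieredLLMClient:
    """Illustrative three-tier lookup: semantic cache -> exact cache -> API."""

    def __init__(self, semantic_cache, exact_cache, api_call):
        self.semantic_cache = semantic_cache   # Tier-0: vector similarity
        self.exact_cache = exact_cache         # Tier-1: exact prompt hash (dict-like)
        self.api_call = api_call               # Tier-2: real network request

    def complete(self, prompt: str) -> str:
        # Tier-0: semantically identical queries never reach the LLM
        hit = self.semantic_cache.lookup(prompt)
        if hit is not None:
            return hit
        # Tier-1: exact prompt repeats are served from the local store
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact_cache:
            return self.exact_cache[key]
        # Tier-2: fall through to the provider API, then populate both tiers
        result = self.api_call(prompt)
        self.exact_cache[key] = result
        self.semantic_cache.store(prompt, result)
        return result
```

With this shape, a repeated prompt is answered from Tier-1 without incrementing the API call count, which is exactly the behavior the measured numbers below demonstrate.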
Multi-Layer Caching in Detail¶
Tier-1: LLM SQLiteCache (Exact Match)¶
Each user has an independent SQLite cache at ~/.olav/cache/{username}/llm_cache.db. When an Agent sends the exact same prompt, the response is returned directly from the local database with zero token consumption.
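The per-user, prompt-keyed layout can be sketched with Python's standard sqlite3 module. Note this is a hypothetical helper mirroring the directory layout described above; OLAV's real table schema (full_llm_cache) may differ:

```python
import hashlib
import sqlite3
from pathlib import Path


def open_user_cache(username: str, root: Path) -> sqlite3.Connection:
    """Open (or create) the per-user cache DB at root/cache/{username}/llm_cache.db."""
    db = root / "cache" / username / "llm_cache.db"
    db.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db)
    # Simplified schema; the real full_llm_cache table may store more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache (prompt_hash TEXT PRIMARY KEY, response TEXT)"
    )
    return conn


def cached_response(conn: sqlite3.Connection, prompt: str):
    """Return the cached response for an exact prompt, or None on a miss."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    row = conn.execute(
        "SELECT response FROM llm_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def store_response(conn: sqlite3.Connection, prompt: str, response: str) -> None:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    conn.execute("INSERT OR REPLACE INTO llm_cache VALUES (?, ?)", (key, response))
    conn.commit()
```

Because the key is a hash of the exact prompt bytes, any change to the prompt (even whitespace) is a miss; that is what makes Tier-1 fast but strict.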
Measured Performance (2026-04-03, model: x-ai/grok-4.1-fast):
| Scenario | Response Time | Token Consumption | Speedup |
|---|---|---|---|
| Cold call (no cache) | 2.78s | 323 tokens (in: 171 + out: 152) | baseline |
| Hot call (SQLiteCache hit) | 0.001s | 0 tokens | 2259x |
# Cache file location
ls ~/.olav/cache/$(whoami)/llm_cache.db
# View cache statistics
sqlite3 ~/.olav/cache/$(whoami)/llm_cache.db \
"SELECT COUNT(*) as entries FROM full_llm_cache"
Running the same Agent query multiple times
Overall Agent acceleration (including LangGraph reasoning and tool calls):
| Run Number | Duration | Speedup |
|---|---|---|
| 1st (cold start) | ~28s | 1x |
| 2nd | ~19s | 1.5x |
| 3rd+ | ~14s | 2x |
Because LangGraph message history contains dynamic IDs, later prompts are not byte-identical to earlier ones, so the exact-match cache only partially applies; the acceleration effect therefore grows over the first few runs and stabilizes at approximately 2x.
Tier-0: SemanticCache (Vector Similarity Match)¶
SemanticCache is stored in .olav/databases/memory.lance (LanceDB) and provides semantic-level caching for vector retrieval operations such as Knowledge Base queries and Hybrid Search.
| Parameter | Default | Description |
|---|---|---|
| cache_similarity_threshold | 0.02 | Cosine distance threshold (<=0.02 = 99%+ similarity) |
| cache_ttl_hours | 24 | Cache entry expiration time |
| cache_max_entries | 500 | Maximum entries (evicted by LRU when exceeded) |
# Adjust threshold to cover more semantically similar queries (recommended 0.10~0.20)
export OLAV_MEMORY_CACHE_SIMILARITY_THRESHOLD=0.15
Threshold tuning advice
The default threshold 0.02 only hits near-duplicate queries. To cover semantically similar but differently worded queries like "Show BGP settings for R1" vs. "What is the BGP config of R1?" (measured distance ~ 0.78), increase the threshold to 0.15~0.20.
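The hit test behind this threshold is plain cosine distance between the query embedding and a cached entry's embedding; a self-contained sketch (the function names are illustrative, not OLAV's internals):

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def is_cache_hit(query_vec, cached_vec, threshold=0.02):
    # The default 0.02 only accepts near-duplicate embeddings; raising the
    # threshold widens the net to paraphrased queries at the risk of serving
    # a cached answer for a genuinely different question.
    return cosine_distance(query_vec, cached_vec) <= threshold
```

The trade-off is visible here: a larger threshold means more hits (and fewer LLM calls) but a higher chance of a stale or mismatched answer.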
Anthropic Prompt Caching (Enterprise Exclusive)¶
When using Anthropic models, OLAV automatically adds cache_control markers to system prompts and static context via deepagents.AnthropicPromptCachingMiddleware:
Anthropic Prompt Caching pays the full input price for the system prompt (typically thousands of tokens) only once; subsequent cache reads are billed at roughly 10% of the normal input-token price. This is especially effective for large static_context (such as Network Schema reference documents).
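A back-of-envelope calculation shows why this matters for repeated calls. The 10% cache-read rate is the published ratio; the token count and price below are made-up inputs, and the one-time cache-write premium is ignored for simplicity:

```python
def cached_vs_uncached(system_tokens: int, calls: int, price_per_token: float):
    """Compare total system-prompt cost with and without prompt caching.

    Hypothetical inputs; assumes cache reads cost 10% of the input price
    and ignores the cache-write premium on the first call.
    """
    uncached = system_tokens * calls * price_per_token
    # First call pays full price; the remaining calls pay 10% each.
    cached = system_tokens * price_per_token * (1 + 0.1 * (calls - 1))
    return uncached, cached


# 5,000-token system prompt, 100 calls, $3 per million input tokens:
uncached, cached = cached_vs_uncached(5000, 100, 3e-6)
# uncached = $1.50, cached ≈ $0.16 -- roughly 89% saved on the static portion
```

The savings grow with both prompt size and call count, which is why large static_context blocks benefit most.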
Multi-Layer Memory System¶
OLAV uses a LanceDB vector database to store two types of persistent memory:
Episodic Memory¶
Knowledge automatically accumulated by the Agent during execution, stored in .olav/databases/memory.lance (memory table).
| Category | Description | Example |
|---|---|---|
| fact | Environmental facts | "R1@192.168.100.101 is a Juniper border router" |
| decision | Decision records | "Chose OSPF over BGP for intra-DC routing" |
| preference | User preferences | "User prefers JSON format output" |
| audit | Constraint lessons | "execute_sql must use schema.table format" |
# memory table structure
id, text, vector (384/1536-dim), category, scope (global|agent),
metadata, timestamp, access_count, weight (time-decay)
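The weight (time-decay) field implies that older memories rank lower over time. One common formulation is exponential decay with a fixed half-life; this is an illustrative sketch, not OLAV's actual scoring code, and the one-week half-life is an assumed parameter:

```python
import math


def decayed_weight(base_weight: float, age_hours: float,
                   half_life_hours: float = 168.0) -> float:
    """Exponential time decay: the weight halves every half_life_hours.

    Illustrative only -- the half-life and the decay shape are assumptions.
    """
    return base_weight * math.exp(-math.log(2) * age_hours / half_life_hours)
```

Under this scheme a week-old fact counts half as much as a fresh one, so recent observations dominate retrieval without old knowledge ever being hard-deleted.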
LangGraph Checkpoint (Session State)¶
Each Agent of each user has its own DuckDB checkpoint file, enabling cross-session conversation continuity:
After restarting OLAV, the Agent can resume the previous conversation context without the user needing to re-explain the background.
Multi-User Isolation (Enterprise Security)¶
OLAV is designed for multi-user concurrent environments with strict data isolation between users:
User Alice User Bob
~/.olav/cache/alice/ ~/.olav/cache/bob/
~/.olav/checkpoints/alice/ ~/.olav/checkpoints/bob/
~/.olav/token ~/.olav/token
| |
+------------+---------------+
|
.olav/databases/ <- Read-only: globally shared
.olav/logs/ <- Centralized: all user audits
.olav/workspace/ <- Read-only: Agent definitions
| Data Type | Isolation Level | Storage Location |
|---|---|---|
| LLM response cache | User-private | ~/.olav/cache/{user}/ |
| Conversation checkpoint | User-private | ~/.olav/checkpoints/{user}/ |
| Audit logs | Centralized | .olav/databases/audit.duckdb |
| Agent definitions | Globally shared | .olav/workspace/ |
| Business data | Globally shared | .olav/databases/main.duckdb |
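The isolation boundaries in the table reduce to a path convention: user-private state lives under the user's home directory, shared state under the project's .olav/ tree. An illustrative path-resolution sketch (the file name for checkpoints is a hypothetical placeholder; only the directory layout comes from the table above):

```python
from pathlib import Path


def user_cache_dir(username: str, home: Path) -> Path:
    """User-private: LLM response cache, never visible to other users."""
    return home / ".olav" / "cache" / username


def user_checkpoint_dir(username: str, home: Path) -> Path:
    """User-private: per-agent conversation checkpoints."""
    return home / ".olav" / "checkpoints" / username


def shared_audit_db(project_root: Path) -> Path:
    """Centralized: every user's audit events land in the same database."""
    return project_root / ".olav" / "databases" / "audit.duckdb"
```

Because private and shared data never share a directory tree, filesystem permissions alone are enough to enforce the isolation levels listed above.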
For details on authentication modes, role permissions, and user management, see Security Model ->.
Federated Specialist Agent Architecture¶
OLAV uses deepagents.SkillsMiddleware to dynamically bind Skills to Agents, forming a federated specialist system:
User request
|
OLAVAgent (Semantic Router)
+-- quick -- Fast SQL/CLI queries
+-- ops -- Deep operations (routing, topology, log analysis, diff)
+-- config -- Inventory sync, snapshot collection, API registration
+-- core -- Python/SQL/Shell code execution, web search
+-- sandbox (SubAgent) -- High-concurrency compute isolation environment
Each Agent's toolset declares a required-information check via the required_params metadata in SKILL.md: when required parameters are missing (such as the target device or credentials), the Agent stops and asks the user instead of executing blindly.
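The check itself is simple to express. This sketch assumes a metadata dict with a required_params list, which is an illustration of the protocol rather than the exact SKILL.md schema:

```python
def missing_params(required: list, provided: dict) -> list:
    """Return the required parameter names that are absent or empty."""
    return [p for p in required if not provided.get(p)]


def run_skill(skill_meta: dict, user_args: dict) -> dict:
    """Gate execution on the required-information check described above."""
    missing = missing_params(skill_meta.get("required_params", []), user_args)
    if missing:
        # Stop and ask the user instead of executing blindly.
        return {"action": "ask_user", "missing": missing}
    return {"action": "execute", "args": user_args}
```

The payoff is that a skill invoked with only a device name, say, comes back with an explicit question about credentials rather than a half-configured run.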
Enterprise LLM Fine-Tuning¶
OLAV's audit database is not just a log -- it is a continuously growing fine-tuning dataset. For the three export formats and specific commands, see Audit and Logs ->.
Fine-Tuning Objectives¶
| Fine-Tuning Type | Data Source | Purpose |
|---|---|---|
| Tool Call (Tool Call FT) | Successful tool call trajectories | Improve the LLM's ability to directly generate correct SQL/CLI |
| Domain Knowledge (Domain FT) | Device information, network topology, alert rules | Reduce RAG queries by internalizing business knowledge into the LLM |
| Constraint Learning (Constraint FT) | Failure lessons extracted via /trace-review |
Prevent known tool misuse patterns |
The fine-tuned model can replace the llm.model field in api.json without modifying any Agent logic:
{
  "llm": {
    "provider": "custom",
    "model": "your-org/olav-finetuned-v1",
    "base_url": "https://your-inference-server/v1",
    "api_key": "..."
  },
  "agent_overrides": {
    "ops": {
      "model": "your-org/olav-ops-specialist-v1"
    }
  }
}
Per-Agent Model Assignment¶
Different Agents can use different fine-tuned models, achieving a fine-grained balance of cost and capability:
{
  "llm": {
    "model": "gpt-4o-mini"
  },
  "agent_overrides": {
    "ops": { "model": "your-ops-specialist" },
    "config": { "model": "claude-3-5-sonnet-20241022" },
    "quick": { "model": "gpt-4o-mini" },
    "core": { "model": "gpt-4o" }
  }
}
The high-frequency quick Agent uses a low-cost model; the ops Agent that requires deep reasoning uses a dedicated fine-tuned model -- overall token cost can be reduced by 40-70%.
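The resolution rule implied by agent_overrides is "per-agent model if present, otherwise the global llm.model default". A sketch of how that lookup could work against the config structure shown above (illustrative; not OLAV's actual loader):

```python
def resolve_model(config: dict, agent_id: str) -> str:
    """Return the model for an agent: its override if set, else the default."""
    overrides = config.get("agent_overrides", {})
    agent_cfg = overrides.get(agent_id, {})
    return agent_cfg.get("model", config["llm"]["model"])


cfg = {
    "llm": {"model": "gpt-4o-mini"},
    "agent_overrides": {"ops": {"model": "your-ops-specialist"}},
}
# resolve_model(cfg, "ops")   -> "your-ops-specialist"
# resolve_model(cfg, "quick") -> "gpt-4o-mini" (falls back to the default)
```

Agents without an override silently inherit the default, so adding a specialist model for one agent never disturbs the others.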
Observability¶
All Agent activity is written to .olav/databases/audit.duckdb and can be queried directly with DuckDB:
# Query token consumption over the last 24h
duckdb .olav/databases/audit.duckdb "
SELECT
agent_id,
COUNT(*) AS runs,
SUM(CAST(json_extract(payload, '$.tokens_total') AS INT)) AS total_tokens
FROM audit_events
WHERE event_type = 'chain_end'
AND timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY agent_id
ORDER BY total_tokens DESC
"
# View cache hit rate (semantic_cache_hit events)
duckdb .olav/databases/audit.duckdb "
SELECT
DATE_TRUNC('hour', timestamp) AS hour,
COUNT(*) AS cache_hits
FROM audit_events
WHERE event_type = 'semantic_cache_hit'
GROUP BY 1
ORDER BY 1 DESC
LIMIT 24
"
This lets you answer precisely:

- Which Agent consumed the most tokens?
- What is the cache hit rate trend?
- Which types of queries are most frequent?
- Which run failed, and why?
Next Steps¶
- Agent Harness -> -- Sandbox execution, HITL approval, injection protection
- Self-Improving Loop -> -- How to continuously improve Agents using audit data
- Audit and Logs -> -- Detailed log querying and export guide
- Configuration Reference -> -- Complete api.json configuration reference
- Users and Roles -> -- RBAC permission controls