# Token Compression
Vernis compresses prior-turn tool results before sending them back to the LLM, reducing token consumption without losing context. This page covers the compression pipeline, updated tool defaults, optimized unanswered-contacts queries, and the request telemetry system.
## History compression

When Vernis rebuilds conversation history for a new turn, it replaces raw tool-call JSON from prior turns with compact one-line summaries. The current turn’s results are always sent in full.
1. Turn 1: the user asks "search contacts for maria" → `search_contacts` returns full JSON (500+ chars) → the full result is sent to the LLM and persisted to the DB.
2. Turn 2: the user asks "show her messages" → Turn 1’s `search_contacts` result is compressed to: "Tool search_contacts returned 3 contacts: Maria Lopez, Maria Ruiz, Maria Santos." → the current turn’s `get_contact_messages` result is sent in full.

### Key files

| File | Purpose |
|---|---|
| `tool_result_formatter.rs` | Per-tool summarizers that extract key metadata (counts, names, IDs) into one-liners |
| `tool_context_compression_service.rs` | Builds compressed replay context from prior-turn `StoredToolCall` records |
| `assistant_ws_api.rs` | Uses compressed context when rebuilding AI history |
### How summarization works

`summarize_tool_result()` receives the tool name, arguments JSON, and raw result string. It classifies the tool and dispatches to a specialized summarizer:
- Collection tools (e.g. `search_contacts`, `get_tasks`) → extracts count and key identifiers
- History tools (e.g. `get_contact_messages`, `get_recent_calls`) → extracts message/call count and date range
- Write tools (e.g. `create_task`, `send_message`) → extracts action and status
- Analytics tools → extracts metric summaries
- Unknown tools → falls back to compact truncation (~60 chars)
Large text fields like `body`, `transcription`, `analysis_text`, and `prompt` are excluded from summaries entirely.
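As a rough illustration, the dispatch described above might be shaped like the following. This is a minimal sketch, not the actual implementation: the classification lists are abbreviated, and the naive item counting via a `"name"` key is an assumption.

```rust
/// Illustrative sketch of the summarizer dispatch in tool_result_formatter.rs.
/// Produces a one-line summary of a prior-turn tool result.
fn summarize_tool_result(tool_name: &str, _args_json: &str, raw_result: &str) -> String {
    match tool_name {
        // Collection tools: report a count and (in the real code) key identifiers.
        "search_contacts" | "get_tasks" => summarize_collection(tool_name, raw_result),
        // History tools: the real summarizer also extracts the date range.
        "get_contact_messages" | "get_recent_calls" => summarize_collection(tool_name, raw_result),
        // Write tools: report the action and status.
        "create_task" | "send_message" => format!("Tool {tool_name} completed a write action."),
        // Unknown tools: fall back to compact truncation (~60 chars).
        _ => {
            let truncated: String = raw_result.chars().take(60).collect();
            format!("Tool {tool_name} returned: {truncated}…")
        }
    }
}

fn summarize_collection(tool_name: &str, raw_result: &str) -> String {
    // Naive item count: occurrences of a "name" key in the raw JSON.
    // The real code presumably parses the JSON properly.
    let count = raw_result.matches("\"name\"").count();
    format!("Tool {tool_name} returned {count} items.")
}
```

The important property is that every branch, including the fallback, returns a bounded one-liner rather than the raw payload.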
### What stays unchanged

- DB storage: full tool results persist in `tool_calls_json` for history display
- Client replay: `WsServerMessage::HistoryMessage` still contains full `tool_calls` for the UI
- Current-turn results: always sent uncompressed to the LLM
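The resulting split — full result for the current turn, one-liner for prior turns — can be sketched as follows. The `StoredToolCall` field names and the `replay_text` helper are illustrative assumptions, not the actual types.

```rust
/// Hypothetical shape of a persisted tool call record.
struct StoredToolCall {
    tool_name: String,
    result_json: String, // full result, as persisted in tool_calls_json
    summary: String,     // one-liner produced by summarize_tool_result()
}

/// Choose what to replay to the LLM for one tool call: prior turns get the
/// compact summary, the current turn keeps the full result.
fn replay_text(call: &StoredToolCall, is_current_turn: bool) -> String {
    if is_current_turn {
        call.result_json.clone()
    } else {
        call.summary.clone()
    }
}
```

Because the full `result_json` is always persisted, compression is purely a replay-time decision and can change without touching stored history.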
## Lower tool defaults

Four tools had their default result limits reduced from 50 to 20:
| Tool | Previous default | New default |
|---|---|---|
| `get_contact_messages` | 50 | 20 |
| `search_contacts` | 50 | 20 |
| `get_recent_calls` | 50 | 20 |
| `get_tasks` | 50 | 20 |
Users can still request more results explicitly. The LLM tool descriptions reflect the updated defaults.
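Applying a lowered default with an explicit override might look like this minimal sketch; the `effective_limit` helper and the upper cap of 100 are assumptions, not the actual code.

```rust
/// Resolve the result limit for a tool call: default to 20 when the caller
/// omits it, honor explicit requests, and cap to keep payloads bounded.
fn effective_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(20).min(100)
}
```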
## Unanswered contacts optimization

The `get_unanswered_contacts` tool now supports SQL-level time filtering when called by the assistant.
### Time range values

The assistant parses natural-language time ranges into datetime bounds via `parse_assistant_time_range()`:
| Value | Period |
|---|---|
| `today` | Current day |
| `yesterday` | Previous day |
| `last_3_days` | 3 days back |
| `last_7_days` | 7 days back |
| `last_14_days` | 14 days back |
| `last_30_days` | 30 days back |
| `last_90_days` | 90 days back |
| `all_time` | No bounds |
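A hedged sketch of that mapping is below. The real `parse_assistant_time_range()` presumably uses calendar-aware day boundaries (e.g. via chrono); this std-only version approximates them with whole-day offsets from `now`.

```rust
use std::time::{Duration, SystemTime};

/// Map an accepted time-range string to optional (start, end) bounds.
/// Returns None for unrecognized values; `all_time` yields no bounds at all.
fn parse_assistant_time_range(
    value: &str,
    now: SystemTime,
) -> Option<(Option<SystemTime>, Option<SystemTime>)> {
    const DAY: u64 = 86_400;
    let days_back = |n: u64| Some((Some(now - Duration::from_secs(n * DAY)), Some(now)));
    match value {
        "today" => days_back(1), // approximation: last 24h, not midnight-to-now
        "yesterday" => Some((
            Some(now - Duration::from_secs(2 * DAY)),
            Some(now - Duration::from_secs(DAY)),
        )),
        "last_3_days" => days_back(3),
        "last_7_days" => days_back(7),
        "last_14_days" => days_back(14),
        "last_30_days" => days_back(30),
        "last_90_days" => days_back(90),
        "all_time" => Some((None, None)),
        _ => None,
    }
}
```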
### SQL-level vs. in-memory filtering

When the assistant provides datetime bounds, the service uses a CTE with `COUNT(*) FILTER (WHERE ...)` window functions and SQL-level pagination. The regular messaging API path continues using in-memory post-filtering with unfiltered badge counts.
The `UnansweredContactsFilter` struct replaced its opaque `time_range: Option<TimeRange>` field with explicit `created_at_start` / `created_at_end` datetime bounds.
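The reshaped filter and the general shape of the query might look like the following sketch. The SQL text only illustrates a CTE using `COUNT(*) FILTER` as a window function with SQL-level pagination; table and column names are assumptions, not the actual query.

```rust
use std::time::SystemTime;

/// Explicit datetime bounds, replacing the old `time_range: Option<TimeRange>`.
struct UnansweredContactsFilter {
    created_at_start: Option<SystemTime>,
    created_at_end: Option<SystemTime>,
}

/// Illustrative query shape only (hypothetical schema): filtered counts are
/// computed per contact in SQL, and pagination happens in SQL as well.
const UNANSWERED_QUERY_SHAPE: &str = r#"
WITH unanswered AS (
    SELECT c.id,
           COUNT(*) FILTER (WHERE m.created_at >= $1 AND m.created_at < $2)
               OVER (PARTITION BY c.id) AS filtered_count
    FROM contacts c
    JOIN messages m ON m.contact_id = c.id
)
SELECT * FROM unanswered ORDER BY id LIMIT $3 OFFSET $4
"#;
```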
## Request telemetry

Every assistant turn now tracks a detailed breakdown of token and character usage.
### Schema: `assistant_request_telemetry`

| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `organization_id` | UUID | Owning organization |
| `conversation_id` | UUID? | Linked conversation |
| `model` | text | Model identifier |
| `user_message_chars` | int | Characters in the user’s message |
| `conversation_message_count` | int | Total messages in conversation |
| `conversation_user_chars` | int | Total user character count |
| `conversation_assistant_chars` | int | Total assistant character count |
| `conversation_total_chars` | int | Combined character count |
| `system_prompt_chars` | int | System prompt size |
| `capabilities_chars` | int | Capabilities section size |
| `tool_count` | int | Number of tools registered |
| `tool_payload_chars` | int | Total tool definition size |
| `tool_breakdown_json` | jsonb | Per-tool payload sizes |
| `input_tokens` | int? | LLM input tokens (post-stream) |
| `output_tokens` | int? | LLM output tokens |
| `cached_tokens` | int? | Cached input tokens |
| `reasoning_tokens` | int? | Reasoning tokens used |
### How it works

- `build_system_prompt_service` returns an `AssistantPromptBuild` struct with per-section character counts
- `assistant_request_telemetry_service` builds a draft before the LLM call with prompt/conversation/tool metrics
- After streaming completes, token usage from the LLM response is attached and the row is inserted asynchronously
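A minimal sketch of that two-phase flow, under the assumption that the draft is a plain struct filled before the call and completed afterward; the field subset, function names, and synchronous style here are simplifications (the real services are async and persist to the DB).

```rust
/// Abbreviated telemetry draft; the real row carries the full schema above.
#[derive(Default)]
struct TelemetryDraft {
    user_message_chars: i32,
    system_prompt_chars: i32,
    tool_payload_chars: i32,
    input_tokens: Option<i32>,  // unknown until the stream completes
    output_tokens: Option<i32>, // unknown until the stream completes
}

/// Phase 1: build the draft before the LLM call from known character counts.
fn build_draft(user_message: &str, system_prompt: &str, tool_payload: &str) -> TelemetryDraft {
    TelemetryDraft {
        user_message_chars: user_message.chars().count() as i32,
        system_prompt_chars: system_prompt.chars().count() as i32,
        tool_payload_chars: tool_payload.chars().count() as i32,
        ..Default::default()
    }
}

/// Phase 2: attach post-stream token usage before the async insert.
fn attach_usage(mut draft: TelemetryDraft, input_tokens: i32, output_tokens: i32) -> TelemetryDraft {
    draft.input_tokens = Some(input_tokens);
    draft.output_tokens = Some(output_tokens);
    draft
}
```

Splitting the build this way means character metrics are captured even if the LLM call fails before any token usage is reported.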
### Admin dashboard

`admin_assistant_telemetry_service` aggregates telemetry data for the Admin → AI Costs tab:
- Average and p95 request sizes
- Top tools by payload size
- Top prompt sections by character count
- Conversation growth buckets
- Token usage statistics