Token Compression

Vernis compresses prior-turn tool results before sending them back to the LLM, reducing token consumption without losing context. This page covers the compression pipeline, updated tool defaults, optimized unanswered-contacts queries, and the request telemetry system.

When Vernis rebuilds conversation history for a new turn, it replaces raw tool-call JSON from prior turns with compact one-line summaries. The current turn’s results are always sent in full.

Turn 1: user asks "search contacts for maria"
→ search_contacts returns full JSON (500+ chars)
→ Full result sent to LLM, persisted to DB
Turn 2: user asks "show her messages"
→ Turn 1's search_contacts result compressed to:
"Tool search_contacts returned 3 contacts: Maria Lopez, Maria Ruiz, Maria Santos."
→ Current turn's get_contact_messages result sent in full
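The one-line summary in this example can be sketched as a simple formatter. This is an illustrative sketch, not the actual tool_result_formatter.rs implementation; the function name is hypothetical, but the output wording matches the example above:

```rust
// Hypothetical sketch of a collection-tool summarizer: turns a list of
// matched contact names into the compact one-liner shown above.
fn summarize_search_contacts(names: &[&str]) -> String {
    format!(
        "Tool search_contacts returned {} contacts: {}.",
        names.len(),
        names.join(", ")
    )
}
```

For instance, `summarize_search_contacts(&["Maria Lopez", "Maria Ruiz", "Maria Santos"])` produces the Turn 2 replacement string shown in the example.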
| File | Purpose |
| --- | --- |
| tool_result_formatter.rs | Per-tool summarizers that extract key metadata (counts, names, IDs) into one-liners |
| tool_context_compression_service.rs | Builds compressed replay context from prior-turn StoredToolCall records |
| assistant_ws_api.rs | Uses compressed context when rebuilding AI history |

summarize_tool_result() receives the tool name, arguments JSON, and raw result string. It classifies the tool and dispatches to a specialized summarizer:

  • Collection tools (e.g. search_contacts, get_tasks) → extracts count and key identifiers
  • History tools (e.g. get_contact_messages, get_recent_calls) → extracts message/call count and date range
  • Write tools (e.g. create_task, send_message) → extracts action and status
  • Analytics tools → extracts metric summaries
  • Unknown tools → falls back to compact truncation (~60 chars)

Large text fields like body, transcription, analysis_text, and prompt are excluded from summaries entirely.
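The classification step above can be sketched as a name-based dispatch. This is a minimal sketch, assuming classification happens by tool name; the tool lists and the ~60-char fallback mirror the description above, but the real summarize_tool_result() may classify differently:

```rust
// Illustrative tool categories matching the dispatch described above.
#[derive(Debug, PartialEq)]
enum ToolCategory {
    Collection,
    History,
    Write,
    Analytics,
    Unknown,
}

// Hypothetical classifier keyed on tool name.
fn classify_tool(name: &str) -> ToolCategory {
    match name {
        "search_contacts" | "get_tasks" => ToolCategory::Collection,
        "get_contact_messages" | "get_recent_calls" => ToolCategory::History,
        "create_task" | "send_message" => ToolCategory::Write,
        n if n.contains("analytics") => ToolCategory::Analytics,
        _ => ToolCategory::Unknown,
    }
}

// Fallback for unknown tools: compact truncation to roughly 60 chars.
fn truncate_summary(raw: &str) -> String {
    raw.chars().take(60).collect()
}
```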

Compression affects only the context replayed to the LLM; everything else keeps full fidelity:

  • DB storage: full tool results persist in tool_calls_json for history display
  • Client replay: WsServerMessage::HistoryMessage still contains full tool_calls for the UI
  • Current-turn results: always sent uncompressed to the LLM

Four tools had their default result limits reduced from 50 to 20:

| Tool | Previous default | New default |
| --- | --- | --- |
| get_contact_messages | 50 | 20 |
| search_contacts | 50 | 20 |
| get_recent_calls | 50 | 20 |
| get_tasks | 50 | 20 |

Users can still request more results explicitly. The LLM tool descriptions reflect the updated defaults.
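The default-versus-explicit behavior amounts to a simple fallback. The helper name here is hypothetical, not from the codebase:

```rust
// Hypothetical helper: an explicit limit from the caller wins;
// otherwise the new default of 20 applies.
fn effective_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(20)
}
```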

The get_unanswered_contacts tool now supports SQL-level time filtering when called by the assistant.

The assistant parses natural time ranges into datetime bounds via parse_assistant_time_range():

| Value | Period |
| --- | --- |
| today | Current day |
| yesterday | Previous day |
| last_3_days | 3 days back |
| last_7_days | 7 days back |
| last_14_days | 14 days back |
| last_30_days | 30 days back |
| last_90_days | 90 days back |
| all_time | No bounds |
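A sketch of the mapping above, assuming the function returns optional start/end bounds and None for unrecognized values. The real parse_assistant_time_range() presumably uses a proper datetime library and calendar-day boundaries; this stdlib-only sketch simplifies each period to "now minus N days":

```rust
use std::time::{Duration, SystemTime};

// Sketch of parse_assistant_time_range(): maps an assistant-provided
// value to (start, end) datetime bounds. Day boundaries are simplified
// to fixed offsets from now, unlike a real calendar-aware version.
fn parse_assistant_time_range(
    value: &str,
) -> Option<(Option<SystemTime>, Option<SystemTime>)> {
    let now = SystemTime::now();
    let day = |d: u64| Duration::from_secs(d * 86_400);
    let days_back = |d: u64| Some((Some(now - day(d)), Some(now)));
    match value {
        "today" => days_back(1),
        "yesterday" => Some((Some(now - day(2)), Some(now - day(1)))),
        "last_3_days" => days_back(3),
        "last_7_days" => days_back(7),
        "last_14_days" => days_back(14),
        "last_30_days" => days_back(30),
        "last_90_days" => days_back(90),
        "all_time" => Some((None, None)), // no bounds
        _ => None, // unrecognized value
    }
}
```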

When the assistant provides datetime bounds, the service uses a CTE with COUNT(*) FILTER (WHERE ...) window functions and SQL-level pagination. The regular messaging API path continues using in-memory post-filtering with unfiltered badge counts.
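The assistant-path query shape might look something like the following. Table and column names here are assumptions for illustration, not the actual schema:

```sql
-- Illustrative only: a CTE computes filtered counts with
-- COUNT(*) FILTER as a window aggregate, and pagination is
-- pushed down to SQL rather than done in memory.
WITH unanswered AS (
    SELECT
        contact_id,
        COUNT(*) FILTER (
            WHERE created_at >= $1 AND created_at < $2
        ) OVER (PARTITION BY contact_id) AS unanswered_count,
        ROW_NUMBER() OVER (ORDER BY created_at DESC) AS rn
    FROM messages
    WHERE answered_at IS NULL
)
SELECT contact_id, unanswered_count
FROM unanswered
WHERE rn BETWEEN $3 AND $4;
```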

The UnansweredContactsFilter struct replaced its opaque time_range: Option<TimeRange> field with explicit created_at_start / created_at_end datetime bounds.

Every assistant turn now tracks a detailed breakdown of token and character usage.

| Column | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| organization_id | UUID | Owning organization |
| conversation_id | UUID? | Linked conversation |
| model | text | Model identifier |
| user_message_chars | int | Characters in the user’s message |
| conversation_message_count | int | Total messages in conversation |
| conversation_user_chars | int | Total user character count |
| conversation_assistant_chars | int | Total assistant character count |
| conversation_total_chars | int | Combined character count |
| system_prompt_chars | int | System prompt size |
| capabilities_chars | int | Capabilities section size |
| tool_count | int | Number of tools registered |
| tool_payload_chars | int | Total tool definition size |
| tool_breakdown_json | jsonb | Per-tool payload sizes |
| input_tokens | int? | LLM input tokens (post-stream) |
| output_tokens | int? | LLM output tokens |
| cached_tokens | int? | Cached input tokens |
| reasoning_tokens | int? | Reasoning tokens used |
  1. build_system_prompt_service returns an AssistantPromptBuild struct with per-section character counts
  2. assistant_request_telemetry_service builds a draft before the LLM call with prompt/conversation/tool metrics
  3. After streaming completes, token usage from the LLM response is attached and the row is inserted asynchronously
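The draft-then-attach flow in the steps above can be sketched as follows. The struct and method names are hypothetical and cover only a few of the columns listed above:

```rust
// Hypothetical telemetry draft: character metrics are known before the
// LLM call; token fields stay None until streaming completes.
#[derive(Default)]
struct TelemetryDraft {
    system_prompt_chars: i32,
    conversation_total_chars: i32,
    tool_payload_chars: i32,
    input_tokens: Option<i32>,
    output_tokens: Option<i32>,
}

impl TelemetryDraft {
    // Attach token usage reported in the LLM response post-stream,
    // after which the completed row can be inserted asynchronously.
    fn attach_usage(&mut self, input: i32, output: i32) {
        self.input_tokens = Some(input);
        self.output_tokens = Some(output);
    }
}
```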

admin_assistant_telemetry_service aggregates telemetry data for the Admin → AI Costs tab:

  • Average and p95 request sizes
  • Top tools by payload size
  • Top prompt sections by character count
  • Conversation growth buckets
  • Token usage statistics
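The p95 aggregation above could be computed in-process along these lines; this is a sketch using the nearest-rank method, and the actual service may well compute percentiles in SQL instead:

```rust
// Illustrative p95 over request sizes using the nearest-rank method:
// the smallest value with at least 95% of observations at or below it.
fn p95(mut values: Vec<u64>) -> Option<u64> {
    if values.is_empty() {
        return None;
    }
    values.sort_unstable();
    let rank = ((values.len() as f64) * 0.95).ceil() as usize;
    Some(values[rank - 1])
}
```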