Token Compression

Vernis compresses prior-turn tool results before sending them back to the LLM, reducing token consumption without losing context. This page covers the compression pipeline, updated tool defaults, optimized unanswered-contacts queries, and the request telemetry system.

When Vernis rebuilds conversation history for a new turn, it replaces raw tool-call JSON from prior turns with compact one-line summaries. The current turn’s results are always sent in full.

Turn 1: user asks "search contacts for maria"
→ search_contacts returns full JSON (500+ chars)
→ Full result sent to LLM, persisted to DB
Turn 2: user asks "show her messages"
→ Turn 1's search_contacts result compressed to:
"Tool search_contacts returned 3 contacts: Maria Lopez, Maria Ruiz, Maria Santos."
→ Current turn's get_contact_messages result sent in full
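The one-line summary in this example can be sketched as a simple formatter. This is an illustrative sketch, not the actual tool_result_formatter.rs implementation; the function name is hypothetical, but the output wording matches the example above:

```rust
// Hypothetical sketch of a collection-tool summarizer: turns a list of
// matched contact names into the compact one-liner shown above.
fn summarize_search_contacts(names: &[&str]) -> String {
    format!(
        "Tool search_contacts returned {} contacts: {}.",
        names.len(),
        names.join(", ")
    )
}
```

For instance, `summarize_search_contacts(&["Maria Lopez", "Maria Ruiz", "Maria Santos"])` produces the Turn 2 replacement string shown in the example.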
| File | Purpose |
| --- | --- |
| tool_result_formatter.rs | Per-tool summarizers that extract key metadata (counts, names, IDs) into one-liners |
| tool_context_compression_service.rs | Builds compressed replay context from prior-turn StoredToolCall records |
| assistant_ws_api.rs | Uses compressed context when rebuilding AI history |

summarize_tool_result() receives the tool name, arguments JSON, and raw result string. It classifies the tool and dispatches to a specialized summarizer:

  • Collection tools (e.g. search_contacts, get_tasks) → extracts count and key identifiers
  • History tools (e.g. get_contact_messages, get_recent_calls) → extracts message/call count and date range
  • Write tools (e.g. create_task, send_message) → extracts action and status
  • Analytics tools → extracts metric summaries
  • Unknown tools → falls back to compact truncation (~60 chars)

Large text fields like body, transcription, analysis_text, and prompt are excluded from summaries entirely.
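The classification step above can be sketched as a name-based dispatch. This is a minimal sketch, assuming classification happens by tool name; the tool lists and the ~60-char fallback mirror the description above, but the real summarize_tool_result() may classify differently:

```rust
// Illustrative tool categories matching the dispatch described above.
#[derive(Debug, PartialEq)]
enum ToolCategory {
    Collection,
    History,
    Write,
    Analytics,
    Unknown,
}

// Hypothetical classifier keyed on tool name.
fn classify_tool(name: &str) -> ToolCategory {
    match name {
        "search_contacts" | "get_tasks" => ToolCategory::Collection,
        "get_contact_messages" | "get_recent_calls" => ToolCategory::History,
        "create_task" | "send_message" => ToolCategory::Write,
        n if n.contains("analytics") => ToolCategory::Analytics,
        _ => ToolCategory::Unknown,
    }
}

// Fallback for unknown tools: compact truncation to roughly 60 chars.
fn truncate_summary(raw: &str) -> String {
    raw.chars().take(60).collect()
}
```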

Compression affects only the context replayed to the LLM; everything else keeps full fidelity:

  • DB storage: full tool results persist in tool_calls_json for history display
  • Client replay: WsServerMessage::HistoryMessage still contains full tool_calls for the UI
  • Current-turn results: always sent uncompressed to the LLM

Four tools had their default result limits reduced from 50 to 20:

| Tool | Previous default | New default |
| --- | --- | --- |
| get_contact_messages | 50 | 20 |
| search_contacts | 50 | 20 |
| get_recent_calls | 50 | 20 |
| get_tasks | 50 | 20 |

Users can still request more results explicitly. The LLM tool descriptions reflect the updated defaults.
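The default-versus-explicit behavior amounts to a simple fallback. The helper name here is hypothetical, not from the codebase:

```rust
// Hypothetical helper: an explicit limit from the caller wins;
// otherwise the new default of 20 applies.
fn effective_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(20)
}
```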

The get_unanswered_contacts tool now supports SQL-level time filtering when called by the assistant.

The assistant parses natural time ranges into datetime bounds via parse_assistant_time_range():

| Value | Period |
| --- | --- |
| today | Current day |
| yesterday | Previous day |
| last_3_days | 3 days back |
| last_7_days | 7 days back |
| last_14_days | 14 days back |
| last_30_days | 30 days back |
| last_90_days | 90 days back |
| all_time | No bounds |
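A sketch of the mapping above, assuming the function returns optional start/end bounds and None for unrecognized values. The real parse_assistant_time_range() presumably uses a proper datetime library and calendar-day boundaries; this stdlib-only sketch simplifies each period to "now minus N days":

```rust
use std::time::{Duration, SystemTime};

// Sketch of parse_assistant_time_range(): maps an assistant-provided
// value to (start, end) datetime bounds. Day boundaries are simplified
// to fixed offsets from now, unlike a real calendar-aware version.
fn parse_assistant_time_range(
    value: &str,
) -> Option<(Option<SystemTime>, Option<SystemTime>)> {
    let now = SystemTime::now();
    let day = |d: u64| Duration::from_secs(d * 86_400);
    let days_back = |d: u64| Some((Some(now - day(d)), Some(now)));
    match value {
        "today" => days_back(1),
        "yesterday" => Some((Some(now - day(2)), Some(now - day(1)))),
        "last_3_days" => days_back(3),
        "last_7_days" => days_back(7),
        "last_14_days" => days_back(14),
        "last_30_days" => days_back(30),
        "last_90_days" => days_back(90),
        "all_time" => Some((None, None)), // no bounds
        _ => None, // unrecognized value
    }
}
```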

When the assistant provides datetime bounds, the service uses a CTE with COUNT(*) FILTER (WHERE ...) window functions and SQL-level pagination. The regular messaging API path continues using in-memory post-filtering with unfiltered badge counts.
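The assistant-path query shape might look something like the following. Table and column names here are assumptions for illustration, not the actual schema:

```sql
-- Illustrative only: a CTE computes filtered counts with
-- COUNT(*) FILTER as a window aggregate, and pagination is
-- pushed down to SQL rather than done in memory.
WITH unanswered AS (
    SELECT
        contact_id,
        COUNT(*) FILTER (
            WHERE created_at >= $1 AND created_at < $2
        ) OVER (PARTITION BY contact_id) AS unanswered_count,
        ROW_NUMBER() OVER (ORDER BY created_at DESC) AS rn
    FROM messages
    WHERE answered_at IS NULL
)
SELECT contact_id, unanswered_count
FROM unanswered
WHERE rn BETWEEN $3 AND $4;
```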

The UnansweredContactsFilter struct replaced its opaque time_range: Option<TimeRange> field with explicit created_at_start / created_at_end datetime bounds.

Every assistant turn now tracks a detailed breakdown of token and character usage.

| Column | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| organization_id | UUID | Owning organization |
| conversation_id | UUID? | Linked conversation |
| model | text | Model identifier |
| user_message_chars | int | Characters in the user’s message |
| conversation_message_count | int | Total messages in conversation |
| conversation_user_chars | int | Total user character count |
| conversation_assistant_chars | int | Total assistant character count |
| conversation_total_chars | int | Combined character count |
| system_prompt_chars | int | System prompt size |
| capabilities_chars | int | Capabilities section size |
| tool_count | int | Number of tools registered |
| tool_payload_chars | int | Total tool definition size |
| tool_breakdown_json | jsonb | Per-tool payload sizes |
| input_tokens | int? | LLM input tokens (post-stream) |
| output_tokens | int? | LLM output tokens |
| cached_tokens | int? | Cached input tokens |
| reasoning_tokens | int? | Reasoning tokens used |
  1. build_system_prompt_service returns an AssistantPromptBuild struct with per-section character counts
  2. assistant_request_telemetry_service builds a draft before the LLM call with prompt/conversation/tool metrics
  3. After streaming completes, token usage from the LLM response is attached and the row is inserted asynchronously
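The draft-then-attach flow in the steps above can be sketched as follows. The struct and method names are hypothetical and cover only a few of the columns listed above:

```rust
// Hypothetical telemetry draft: character metrics are known before the
// LLM call; token fields stay None until streaming completes.
#[derive(Default)]
struct TelemetryDraft {
    system_prompt_chars: i32,
    conversation_total_chars: i32,
    tool_payload_chars: i32,
    input_tokens: Option<i32>,
    output_tokens: Option<i32>,
}

impl TelemetryDraft {
    // Attach token usage reported in the LLM response post-stream,
    // after which the completed row can be inserted asynchronously.
    fn attach_usage(&mut self, input: i32, output: i32) {
        self.input_tokens = Some(input);
        self.output_tokens = Some(output);
    }
}
```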

admin_assistant_telemetry_service aggregates telemetry data for the Admin → AI Costs tab:

  • Average and p95 request sizes
  • Top tools by payload size
  • Top prompt sections by character count
  • Conversation growth buckets
  • Token usage statistics
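The p95 aggregation above could be computed in-process along these lines; this is a sketch using the nearest-rank method, and the actual service may well compute percentiles in SQL instead:

```rust
// Illustrative p95 over request sizes using the nearest-rank method:
// the smallest value with at least 95% of observations at or below it.
fn p95(mut values: Vec<u64>) -> Option<u64> {
    if values.is_empty() {
        return None;
    }
    values.sort_unstable();
    let rank = ((values.len() as f64) * 0.95).ceil() as usize;
    Some(values[rank - 1])
}
```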