Client extractors

Client extractors let the widget pull live data from the host page — form state, selected text, DOM values, route state — and forward it with each user message turn. The LLM receives the extracted content as explicitly delimited, untrusted context, never as instructions.

Extractors are the client-side mirror image of tools:

| Aspect | Tools | Client extractors |
| --- | --- | --- |
| Where they run | Host PHP, server-side | Host JS, in the browser |
| When | Mid-turn, when LLM emits a tool_call | At user-message-send time, every turn |
| Trust | Output is host-vetted (you wrote the handler) | Output is untrusted page material |
| Authorization | authorize() + threaded actor | Never used for auth; output is data only |
| History | Replayed back into the prompt within freshness window | Stripped from history on subsequent turns |

Register an extractor (JavaScript)

After the widget mounts, call registerClientExtractor(name, fn, options) on the widget element:

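```js
// Look up the mounted widget element and register an extractor on it.
document.querySelector('chatbot-widget').registerClientExtractor(
  'form_state',
  () => {
    const form = document.getElementById('checkout-form')
    // Serialize the current field values; '' skips the extractor on this turn.
    return form ? new URLSearchParams(new FormData(form)).toString() : ''
  },
  { description: 'Checkout form fields' },
)
```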

The extractor function:

  • Receives no arguments. Pull whatever you need from the live DOM / app state.
  • Must return a string. Return '' to skip on this turn.
  • May be async; the widget awaits it up to extractor_timeout_ms.
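
The await-with-timeout behaviour in the last bullet could look roughly like the following. This is a minimal sketch, not the widget's actual code: `runExtractorWithTimeout` is a hypothetical helper, and resolving to `''` on timeout is an assumption that matches the "return '' to skip" convention. An extractor that rejects before the deadline simply propagates its error here.

```javascript
// Hypothetical helper: resolves with the extractor's result, or with ''
// if the extractor has not settled within timeoutMs.
function runExtractorWithTimeout(fn, timeoutMs) {
  let timer
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(''), timeoutMs)
  })
  // Promise.resolve().then(fn) accepts sync and async extractors alike.
  const result = Promise.resolve().then(fn)
  // Whichever settles first wins; the timer is always cleaned up.
  return Promise.race([result, timeout]).finally(() => clearTimeout(timer))
}

// Usage: a fast extractor resolves with its own string.
runExtractorWithTimeout(async () => 'quantity=2', 250).then((text) => {
  console.log(text)
})
```

The extractor itself is not cancelled when the deadline passes; its late result is just ignored.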

The widget displays a transparency chip after each send ("Read from page: Checkout form fields") so end users can see what page data was forwarded.

Allowlist per channel (PHP config)

The widget will not run an extractor whose name is not listed in the channel's allowlist. The allowlist is signed into the page envelope at render time, so the server can reject any inbound block whose name is not allowed:

```php
// config/chatbot.php
'channels' => [
    'support' => [
        'allowed_extractors'       => ['form_state', 'selected_text'],
        'extractor_timeout_ms'     => 500,    // default: 250 ms
        'extractor_size_cap_bytes' => 16384,  // default: 8192 bytes
    ],
],
```

If the widget reports an extractor name that the envelope did not allow, the server rejects the request with HTTP 422.
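
On the widget side, the "will not run an extractor whose name is not listed" rule amounts to filtering the registry against the allowlist before anything executes. The sketch below assumes a registry shaped as a `Map` of name to function and a hypothetical helper `selectAllowedExtractors`; neither is the package's real internals. The client-side filter is a convenience only: the signed envelope check on the server is what actually rejects a disallowed name.

```javascript
// Hypothetical registry shape: Map of extractor name -> zero-argument function.
function selectAllowedExtractors(registry, allowedNames) {
  const allowed = new Set(allowedNames)
  const runnable = []
  for (const [name, fn] of registry) {
    // Registered but not allowlisted extractors are never invoked.
    if (allowed.has(name)) runnable.push({ name, fn })
  }
  return runnable
}

const exampleRegistry = new Map([
  ['form_state', () => 'quantity=2'],
  ['cart_total', () => '42.00'], // registered, but not allowlisted below
])
const exampleRunnable = selectAllowedExtractors(
  exampleRegistry,
  ['form_state', 'selected_text'],
)
console.log(exampleRunnable.map((e) => e.name)) // only form_state is selected
```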

Where extracted content lands in the prompt

Each allowed extractor's output is attached to that turn's user message (not the system prompt) as a clearly delimited, name-labelled block:

```text
<client-extractor name="form_state" trust="untrusted-page-content">
quantity=2&shipping=express&promo=
</client-extractor>

What's my total with promo NEWUSER?
```

A system-prompt rule instructs the model to treat the contents as data, not instructions.
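
To make the layout concrete, here is a minimal sketch of how a user turn could be assembled from extractor output. `buildUserTurn` is a hypothetical function, where the assembly happens (widget or server) is not specified here, and the sketch makes no attempt to neutralise a literal closing tag inside the content, which a real implementation would need to consider.

```javascript
// Hypothetical formatter mirroring the block layout shown above.
function buildUserTurn(blocks, userMessage) {
  const rendered = blocks
    .filter((b) => b.content !== '') // '' means "skip on this turn"
    .map((b) =>
      `<client-extractor name="${b.name}" trust="untrusted-page-content">\n` +
      `${b.content}\n` +
      `</client-extractor>`)
  // Blocks come first, then the user's own text, separated by blank lines.
  return [...rendered, userMessage].join('\n\n')
}

console.log(buildUserTurn(
  [{ name: 'form_state', content: 'quantity=2&shipping=express&promo=' }],
  "What's my total with promo NEWUSER?",
))
```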

Blade snapshot directive

For "ship a chunk of the rendered view to the LLM" cases — order summaries, article bodies, tabular data interpolated from request-time records — the package ships a paired Blade directive that rides this same pipeline:

```blade
@chatbotSnapshot('article')
    <h1>{{ $post->title }}</h1>
    {!! $post->body_html !!}
@endChatbotSnapshot
```

Under the hood the directive is syntactic sugar over a single reserved extractor name, blade-snapshot. Hosts opt in per channel by listing that one name — no config edit per directive use:

```php
'channels' => [
    'support' => [
        'allowed_extractors' => ['blade-snapshot', 'form_state'],
    ],
],
```

Mechanics:

  • The directive wraps its body in a marker span. A built-in widget extractor reads each marker's innerText at send-time and submits the concatenation as the blade-snapshot block.
  • The 'label' argument is required; same-label sections inside one page are concatenated in document order. Use it for loops: @foreach ($rows as $row) @chatbotSnapshot('rows') … @endChatbotSnapshot @endforeach.
  • Captured content is innerText, not HTML — tags stripped, display:none respected, whitespace collapsed the way the browser shows it.
  • Same trust posture, size cap, timeout, history-stripping, and transparency chip as a hand-written extractor.
  • The name blade-snapshot is reserved: registering an extractor under it on either the PHP ClientExtractorRegistry or the JS widget registry throws.
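
For intuition, the built-in snapshot extractor described in these bullets behaves roughly like the sketch below. The marker attribute `data-chatbot-snapshot`, the newline separator, and the factory `makeSnapshotExtractor` are all assumptions made for illustration; the real marker and joining rules may differ. It is not meant to be registered, since the `blade-snapshot` name is reserved.

```javascript
// Hypothetical marker attribute; the directive's real marker may differ.
const MARKER_SELECTOR = '[data-chatbot-snapshot]'

// root is anything with querySelectorAll, e.g. `document` in the browser.
function makeSnapshotExtractor(root) {
  return () =>
    // querySelectorAll returns matches in document order.
    Array.from(root.querySelectorAll(MARKER_SELECTOR))
      .map((el) => el.innerText.trim()) // innerText: rendered text, no tags
      .filter((text) => text !== '')
      .join('\n') // separator is an assumption
}
```

In a browser, `makeSnapshotExtractor(document)()` would return the concatenated visible text of every marker on the page at the moment it is called.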

Snapshots freeze at page render. SPA navigations that swap content without a full reload will keep replaying the old snapshot — use a hand-written extractor when live page state matters. See ADR-0005 for the full rationale.

What makes extractors safe

| Property | Mechanism |
| --- | --- |
| Allowlisting | Channel must opt-in by name. Per-channel; no global default. |
| Tamper detection | Allowlist is signed into the envelope. Mismatched name → HTTP 422. |
| Size cap | Per-extractor extractor_size_cap_bytes. Oversize output is truncated and logged. |
| Time cap | Per-extractor extractor_timeout_ms. Slow extractors are skipped (not failed). |
| History stripping | Extracted blocks are stripped from history replay; only fresh blocks appear on each turn. |
| Identity-shape blocking | Extractor names matching identity patterns (user_id, account_id, etc.) are rejected at config load. |
| Indirection signal | Each block is wrapped in <client-extractor> tags with a trust="untrusted-page-content" marker. |
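
Because the size cap is expressed in bytes rather than characters, truncation has to avoid cutting a multi-byte UTF-8 character in half. The following is a minimal sketch of one way to do that; `truncateToByteCap` is a hypothetical helper, and whether the package truncates in the widget, on the server, or both is not stated here.

```javascript
// Hypothetical helper: cut text to at most capBytes of UTF-8 without
// splitting a multi-byte character.
function truncateToByteCap(text, capBytes) {
  const bytes = new TextEncoder().encode(text)
  if (bytes.length <= capBytes) return { text, truncated: false }
  let end = Math.max(0, capBytes)
  // Back off while the first excluded byte is a continuation byte (10xxxxxx),
  // so the cut always lands on a character boundary.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--
  return {
    text: new TextDecoder().decode(bytes.subarray(0, end)),
    truncated: true, // caller can log the truncation
  }
}

// 'é' is two bytes in UTF-8, so a 2-byte cap keeps only 'h'.
console.log(truncateToByteCap('héllo', 2))
```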

What extractors are NOT for

  • Authorization. Never use extracted values to decide whether a user can do something. Use the threaded actor for that.
  • Persisting state. Extracted blocks are stripped from history. If you need long-lived state, that's context or a tool.
  • Untrusted-to-trusted bridging. Don't write a server-side tool that reads extracted content from history and treats it as ground truth.

Residual risk

When a channel allowlists extractors and mutating tools simultaneously, an indirect prompt injection in the extracted content (for example, a hostile string injected into a customer note) could coerce the model into calling those tools with attacker-chosen arguments.

Soft defences (the wrapping tags, the system-prompt rule, modern model alignment) reduce this risk but do not eliminate it. The hard mitigation is to keep the extractor allowlist and the mutating-tool allowlist disjoint per channel — for example, a read-only support channel can safely allowlist selected_text, while an admin channel with reset_password should not.
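
The disjointness rule lends itself to an automated check. The sketch below lints a channel map and reports every channel that enables both extractors and mutating tools. It is an illustration only: `allowed_extractors` is the key documented above, while `mutating_tools` and `findOverlappingChannels` are invented names, and the real config lives in PHP rather than JavaScript.

```javascript
// Hypothetical config shape: allowed_extractors is documented above;
// mutating_tools is an invented key used only for this illustration.
function findOverlappingChannels(channels) {
  return Object.entries(channels)
    .filter(([, c]) =>
      (c.allowed_extractors ?? []).length > 0 &&
      (c.mutating_tools ?? []).length > 0)
    .map(([name]) => name)
}

const exampleChannels = {
  support: { allowed_extractors: ['selected_text'], mutating_tools: [] },
  admin: { allowed_extractors: ['form_state'], mutating_tools: ['reset_password'] },
}
console.log(findOverlappingChannels(exampleChannels)) // flags only "admin"
```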

See ADR-0004 for the rationale and Security for the full residual-risk discussion.

Released under the MIT License.