Hallucination guard - Beamdesk Docs

Three-layer protection

Hallucination guard runs three checks on every AI-generated draft before it reaches your customer. Citation enforcement ensures claims are grounded in the knowledge base, refund-policy quarantine blocks financial claims that lack verification, and the KB faithfulness eval scores the draft against retrieved evidence.

If any layer fails, the draft is blocked and escalated according to your persona's escalation path, unless you configure a different failure action for that layer. Every guarded decision is logged to the audit trail for review and improvement.

Citation enforcement

Requires every factual claim to be backed by a knowledge source that meets a minimum confidence score.

Refund quarantine

Blocks drafts mentioning refunds, credits, or free months without explicit verification.

KB faithfulness

Scores the draft against retrieved knowledge and flags low-faithfulness responses.
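The way the three layers combine can be sketched in a few lines of Python. This is an illustration only, not Beamdesk's implementation: `GuardResult`, `run_guard`, `is_blocked`, and the three stand-in check functions are hypothetical names, and the stand-in logic is deliberately trivial.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GuardResult:
    layer: str
    passed: bool
    reason: str = ""


# A layer is modeled as any function from draft text to a GuardResult.
Layer = Callable[[str], GuardResult]


def run_guard(draft: str, layers: List[Layer]) -> List[GuardResult]:
    """Run every layer so all results can be logged, not just the first failure."""
    return [layer(draft) for layer in layers]


def is_blocked(results: List[GuardResult]) -> bool:
    """A single failing layer is enough to hold the draft back."""
    return any(not r.passed for r in results)


# Hypothetical stand-ins for the three real checks.
def citation_check(draft: str) -> GuardResult:
    passed = "24/7" not in draft
    return GuardResult("citation_check", passed, "" if passed else "uncited claim")


def refund_quarantine(draft: str) -> GuardResult:
    return GuardResult("refund_quarantine", "refund" not in draft.lower())


def kb_faithfulness(draft: str) -> GuardResult:
    return GuardResult("kb_faithfulness", True)


LAYERS = [citation_check, refund_quarantine, kb_faithfulness]
results = run_guard("Yes, we offer 24/7 phone support.", LAYERS)
print(is_blocked(results))
```

Running all layers rather than stopping at the first failure matches the audit entry format, which records a result for each layer.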

Enable and configure

Hallucination guard is configured per persona. Enable each layer independently and set thresholds that match your risk tolerance. Default settings are conservative for production use.

Persona guardrail configuration:
Citation enforcement:
  enabled: true
  min_confidence: 0.8
  require_citations_for: ['feature_claim', 'pricing', 'availability', 'policy']

Refund quarantine:
  enabled: true
  keywords: ['refund', 'credit', 'free month', 'compensation', 'write-off']
  require_verification: true

KB faithfulness:
  enabled: true
  min_faithfulness: 0.7
  low_faithfulness_action: 'human_review'

require_citations_for specifies which claim types must have evidence. If the AI makes a claim about pricing or a feature but provides no citation, the draft is blocked.
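As a rough sketch of the rule above, the check can be thought of as filtering claims by type and then looking for at least one citation at or above min_confidence. The `Claim` class and `uncited_claims` function are hypothetical, and treating min_confidence as a floor on the citation's score is an assumption about how the threshold is applied.

```python
from dataclasses import dataclass, field
from typing import List

# Values copied from the persona configuration shown earlier.
REQUIRE_CITATIONS_FOR = {"feature_claim", "pricing", "availability", "policy"}
MIN_CONFIDENCE = 0.8


@dataclass
class Claim:
    claim_type: str
    text: str
    # Scores of the knowledge sources cited for this claim (empty if none).
    citation_scores: List[float] = field(default_factory=list)


def uncited_claims(claims: List[Claim]) -> List[Claim]:
    """Return guarded claims that have no citation scoring at least MIN_CONFIDENCE."""
    return [
        c
        for c in claims
        if c.claim_type in REQUIRE_CITATIONS_FOR
        and not any(score >= MIN_CONFIDENCE for score in c.citation_scores)
    ]


claims = [
    Claim("pricing", "The Pro plan is $10/month", [0.9]),  # cited, passes
    Claim("pricing", "Annual billing saves 20%"),          # no citation
    Claim("greeting", "Thanks for reaching out"),          # type not guarded
    Claim("policy", "Trials last 30 days", [0.5]),         # citation too weak
]
failing = uncited_claims(claims)
print([c.text for c in failing])
```

A draft with a non-empty result from `uncited_claims` would fail the citation layer in this model.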

Refund quarantine uses a keyword list to detect financial claims. When enabled, any draft that matches a keyword is blocked unless the persona has explicit refund tool access and the citation includes policy verification.
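A minimal sketch of the quarantine rule, assuming case-insensitive matching anchored at the start of a word so that "refunds" and "credits" also trigger. The matching strategy and the names `is_triggered` and `quarantine_passes` are assumptions for illustration; the docs do not specify how keywords are matched.

```python
import re

# Keyword list copied from the persona configuration.
KEYWORDS = ["refund", "credit", "free month", "compensation", "write-off"]

# \b only on the left, so plural and inflected forms still match.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_triggered(draft: str) -> bool:
    """True if the draft mentions any quarantined keyword."""
    return _KEYWORD_RE.search(draft) is not None


def quarantine_passes(draft: str, has_refund_tool: bool, policy_verified: bool) -> bool:
    """A triggered draft passes only with refund tool access AND policy verification."""
    if not is_triggered(draft):
        return True
    return has_refund_tool and policy_verified


print(quarantine_passes("We'll issue a refund today.", has_refund_tool=False, policy_verified=False))
print(quarantine_passes("We'll issue a refund today.", has_refund_tool=True, policy_verified=True))
```

Keyword matching of this kind will also trigger on phrases such as "credit card", which is one source of the false positives discussed under failure actions.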

Guard failure actions

When a layer fails, Beamdesk can block the draft, escalate to human review, or log without action. Choose the action per layer based on your tolerance for false positives versus false negatives.

Failure actions:
Citation failure:
  action: 'human_review'
  notify: ['billing_team']
  reason_template: 'Missing citation for claim: {{claim_type}}'

Refund quarantine failure:
  action: 'abstain'
  notify: ['manager']
  escalate_to_team: 'risk_review'

KB faithfulness failure:
  action: 'human_review'
  auto_flag: true
  threshold: 0.6  # If faithfulness < 0.6, always human review

abstain means the draft is not sent and the customer sees a placeholder message. human_review queues the draft for agent approval. The original customer message is preserved for manual response.
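The effect of each action, and the way a reason_template might be filled in, can be sketched as follows. `Outcome`, `apply_action`, and `render_reason` are hypothetical names. Whether the customer sees a placeholder while a draft waits in human review is not stated above, so the sketch assumes they do not.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    draft_sent: bool
    queued_for_agent: bool
    customer_sees_placeholder: bool


def apply_action(action: str) -> Outcome:
    """Map a configured failure action to what happens to the draft."""
    if action == "abstain":
        # Draft is withheld; the customer sees a placeholder message.
        return Outcome(draft_sent=False, queued_for_agent=False, customer_sees_placeholder=True)
    if action == "human_review":
        # Draft waits for agent approval (placeholder behavior is an assumption).
        return Outcome(draft_sent=False, queued_for_agent=True, customer_sees_placeholder=False)
    raise ValueError(f"unknown action: {action}")


def render_reason(template: str, **values: str) -> str:
    """Substitute {{name}} placeholders with the supplied values."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


print(apply_action("human_review"))
print(render_reason("Missing citation for claim: {{claim_type}}", claim_type="pricing"))
```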

Audit logging

Every guarded decision is written to the audit log with full context: customer message, draft, evidence, guard results, and final action. Export audit logs for compliance review, guardrail tuning, and model performance analysis.

Audit entry structure:
{
  "event_id": "evt_abc123",
  "timestamp": "2025-05-01T14:30:00Z",
  "persona": "Support",
  "business_id": "550e8400-e29b-41d4-a716-446655440000",
  "customer_message": "Do you offer phone support?",
  "draft": "Yes, we offer 24/7 phone at 1-800...",
  "evidence": [
    { "type": "kb", "id": "kb-123", "snippet": "Email only support 9am-6pm", "score": 0.92 }
  ],
  "guard_results": {
    "citation_check": {
      "passed": false,
      "reason": "Claim '24/7 phone' has no citation"
    },
    "refund_quarantine": {
      "passed": true,
      "triggered": false
    },
    "kb_faithfulness": {
      "score": 0.15,
      "passed": false,
      "reason": "Draft contradicts knowledge source"
    }
  },
  "final_action": "abstain",
  "escalation_target": "human_review"
}

Query audit logs via the API or export them from workspace settings. Use audit data to identify recurring hallucination patterns and refine guard thresholds.
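Once entries are exported, a short script can summarize which layers fail most often. The sketch below assumes entries have already been loaded as Python dictionaries in the structure shown above; `failed_layers` and `failure_counts` are hypothetical helpers, and how you obtain the export is not covered here.

```python
from collections import Counter
from typing import Dict, List


def failed_layers(entry: Dict) -> List[str]:
    """Names of the guard layers that did not pass for one audit entry."""
    return [
        name
        for name, result in entry["guard_results"].items()
        if not result["passed"]
    ]


def failure_counts(entries: List[Dict]) -> Counter:
    """Count failures per layer across many audit entries."""
    counts: Counter = Counter()
    for entry in entries:
        counts.update(failed_layers(entry))
    return counts


# A trimmed copy of the sample entry from this page.
sample = {
    "event_id": "evt_abc123",
    "guard_results": {
        "citation_check": {"passed": False, "reason": "Claim '24/7 phone' has no citation"},
        "refund_quarantine": {"passed": True, "triggered": False},
        "kb_faithfulness": {"score": 0.15, "passed": False, "reason": "Draft contradicts knowledge source"},
    },
    "final_action": "abstain",
}

print(failure_counts([sample]).most_common())
```

Grouping the same counts by persona or by failure reason is a natural next step when looking for recurring patterns.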

Custom claim patterns

Extend guardrail detection with custom claim patterns for your domain. Add patterns for SLA commitments, uptime guarantees, feature availability, and any claim that could damage trust if incorrect.

Custom claim patterns:
claim_patterns:
  - id: 'sla_response_time'
    regex: '(respond|get back|contact) within (\d+\s?)?'
    type: 'sla_claim'
    require_citation: true
    min_confidence: 0.9

  - id: 'feature_availability'
    regex: '(have|offer|support) (\w+) (integration|feature|api)'
    type: 'feature_claim'
    require_citation: true
    min_confidence: 0.85

  - id: 'uptime_guarantee'
    regex: '\d+(\.\d+)?% (uptime|availability|sla)'
    type: 'sla_claim'
    require_citation: true
    min_confidence: 0.95
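Before deploying patterns, it can help to test them against sample drafts. The sketch below runs the three regexes above through Python's `re` module. Case-insensitive matching and the `detect_claims` helper are assumptions; Beamdesk's regex engine and flags may differ.

```python
import re
from typing import List, Tuple

# Regexes copied verbatim from the configuration above.
CLAIM_PATTERNS = [
    ("sla_response_time", r"(respond|get back|contact) within (\d+\s?)?"),
    ("feature_availability", r"(have|offer|support) (\w+) (integration|feature|api)"),
    ("uptime_guarantee", r"\d+(\.\d+)?% (uptime|availability|sla)"),
]


def detect_claims(draft: str) -> List[Tuple[str, str]]:
    """Return (pattern id, matched text) for every pattern found in the draft."""
    hits = []
    for pattern_id, regex in CLAIM_PATTERNS:
        match = re.search(regex, draft, re.IGNORECASE)
        if match:
            hits.append((pattern_id, match.group(0)))
    return hits


for draft in [
    "We guarantee 99.9% uptime.",
    "Yes, we offer Slack integration.",
    "We will respond within 24 hours.",
    "We will contact within the week.",
]:
    print(draft, detect_claims(draft))
```

In Python, the optional trailing group in sla_response_time means the pattern also matches phrases with no number at all, such as "contact within the week". If only numeric commitments should be flagged, the digit group would need to be required.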

Best practices

Start with all three layers enabled and default thresholds
Review audit logs weekly for recurring false positives
Adjust thresholds gradually: 0.8 confidence → 0.75 → 0.7
Add custom patterns for domain-specific claims after analyzing real tickets
Use refund quarantine even if you don't have a formal refund policy
Keep claim patterns simple: complex regex is harder to review and can match legitimate text unexpectedly
Monitor KB faithfulness scores to identify knowledge gaps
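To make gradual threshold changes less of a guess, you can replay historical scores against candidate thresholds and see how many drafts each would flag. This is a minimal sketch: `flagged_at` is a hypothetical helper, and the scores are made-up illustration values rather than real audit data.

```python
from typing import Dict, Iterable, List


def flagged_at(scores: List[float], thresholds: Iterable[float]) -> Dict[float, int]:
    """For each candidate threshold, count the scores that would fall below it."""
    return {t: sum(1 for s in scores if s < t) for t in thresholds}


# Illustrative scores only; in practice, pull them from exported audit entries.
historical_scores = [0.15, 0.72, 0.78, 0.90]

print(flagged_at(historical_scores, [0.8, 0.75, 0.7]))
```

Comparing the counts at each step shows how much review volume a lower threshold would remove, which can then be weighed against the drafts that would slip through.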