Engineering

The 17 New Rules in ATR v2.0.x to v2.1.1

KUAN-HSIN LINMay 10, 20266 min

ATR v2.1.1 adds 16 new rules (10 natural-language attack patterns + 6 skill-compromise extensions) and an engine upgrade that benefits all existing rules. 98% recall on 650 samples preserved.

ATR v2.1.1 is out on npm as [email protected]. The headline: 314 rules at v2.0.17, 336 rules at v2.1.1, a net add of 16 (or 17 if you count one minor reclassification). 98% recall on 650 samples for NVIDIA Garak inthewild_jailbreak_llms held.

Here is what is new, what it catches, and what changed under the hood.

Ten New Natural-Language Attack Rules

The new rules split into three clusters, all targeting attacks expressed in natural-language imperative form rather than code or shell patterns. NL attacks are harder to detect because there is no syntax to anchor on — you are matching intent expressed in prose.

Cluster A — NL imperative instruction exfil (4 rules):

●ATR-2026-00421 NL covert conversation exfiltration
●ATR-2026-00422 NL credential disclosure
●ATR-2026-00423 NL sensitive file disclosure
●ATR-2026-00424 NL system prompt leak

Cluster B — NL persistence (3 rules):

●ATR-2026-00425 NL persistent covert hook
●ATR-2026-00426 NL output-injection credential leak
●ATR-2026-00427 NL fake-error bypass

Cluster C — NL execution and escalation (3 rules):

●ATR-2026-00428 NL covert shell
●ATR-2026-00429 NL skill self-modification
●ATR-2026-00430 NL trust escalation

Each rule has explicit true-positive and true-negative cases, runs against the 432-skill labelled benign corpus at zero FP, and is scored on 650 NVIDIA Garak samples for recall measurement.

Six Skill-Compromise Extensions

The remaining new rules extend existing skill-compromise coverage:

●Fork-impersonation patterns: variants that copy a legitimate skill's name and metadata but inject a malicious prompt
●Dangerous-script bundling: skills that pull in shell scripts whose behavior is not declared in the manifest
●Rugpull setup variants: skills that present clean on first install but escalate behavior on update

Engine Improvements (Benefits All 320+ Existing Rules)

The 16 new rules are the visible change. The invisible one matters more: a rewrite of the code-block range detection in the matcher.

The old non-greedy regex (\\\[\s\S]?\`\`\``) had pathological behavior on nested or unterminated code fences. The new implementation uses a line-state machine* that walks the document once, tracking whether each line is inside or outside a fenced block, with explicit handling for:

●Unclosed code fences (treated as code-block to end-of-document)
●Indented code blocks (4-space prefix)
●Inline triple-backticks

The state machine also honors suppress_in_code_blocks when expressed as an array in rule metadata, not just a boolean. Some rules want to suppress in lang=bash blocks but match in lang=text. That works now.

A third change: the eval suite now suppresses test-case matching for table-row quoted patterns. Several of the 17 NL rules have natural-language phrasing that appears in benchmark documentation tables. The old eval flagged those as FPs; the new eval recognizes them as documentation context.

Why This Matters

The benchmarks tell you whether the rule pack works. The engine work is what makes the rule pack maintainable across the next few thousand rules. Adding 10 NL rules in a single release would have been infeasible if every rule had to fight the code-block regex.

npm install [email protected] to upgrade. Microsoft Agent Governance Toolkit's weekly auto-sync will pick it up on the next run.

npm · GitHub release · Garak benchmark