Skip to main content

Find what breaks your AI system before adversarial users do.

Prompt injection is the most actively exploited vulnerability class in LLM-based systems right now. Indirect injection — malicious instructions embedded in documents, emails, or web pages that an agent retrieves and processes — is particularly dangerous because it bypasses all direct input validation. We test for it systematically, not opportunistically.

Start a ConversationAll Services
AI Safety & Red Teaming
The Challenge

AI systems have attack surfaces that traditional software security testing does not cover. An LLM-based customer support agent can be manipulated via prompt injection to ignore its system prompt and respond as if it has no restrictions. A fine-tuned classifier can be fooled by adversarial examples — inputs crafted to produce a specific misclassification. A RAG system can be attacked via indirect prompt injection: malicious instructions embedded in retrieved documents that the LLM processes as content.

OWASP Top 10 for LLM Applications formalizes the most common vulnerabilities. Prompt injection (LLM01) is the most widely exploited: attacker input that overrides or supplements system prompt instructions. Insecure output handling (LLM02): downstream systems execute LLM output without validation — SQL injection via LLM-generated queries, HTML injection in rendered output. Training data poisoning (LLM03) and model denial of service (LLM04) round out the critical categories. NeMo Guardrails (NVIDIA) and Guardrails AI provide output filtering and policy enforcement — but they need to be configured against the actual exploits that work against your system, not generic categories.

AI attack surface categories we test
  • Direct prompt injection: user input that overrides or supplements system prompt instructions
  • Indirect prompt injection: malicious instructions embedded in retrieved or processed content
  • Jailbreaking: multi-turn, encoded, or role-playing inputs that bypass content filtering
  • Data extraction via inference: using model responses to reconstruct training data or system prompts
  • Adversarial examples: inputs crafted to produce specific misclassifications in detection models
  • Output handling vulnerabilities: LLM-generated content reaching security-relevant code paths without validation
Our Approach

Red teaming exercises follow a structured methodology adapted from offensive security practices. We establish scope (systems, attack categories, attacker personas), develop an attack playbook specific to your architecture, execute testing, and document every successful exploit with reproduction steps and impact assessment.

The output is not a theoretical vulnerability checklist — it is a prioritized set of actual findings from testing your specific system, with concrete remediation guidance including Guardrails AI and NeMo Guardrails configurations where they apply. What goes into your backlog is things that actually broke in testing, not things that might theoretically break.

Red team engagement process

01
Scope and threat model

Define systems in scope, worst-case outcomes (data exposure, unauthorized agent actions, compliance violations), and attacker personas most relevant to your threat model — internal users, external users, and automated attackers.

02
Attack playbook development

Develop a playbook of techniques relevant to your architecture: prompt injection variants, indirect injection via RAG retrieval, jailbreak attempts, adversarial input generation, data extraction probes, and workflow abuse scenarios specific to your agent's tool surface.

03
Adversarial testing execution

Execute the playbook against your systems. Document every successful exploit with reproduction steps, attack complexity, and impact severity. Run attacks multiple times to establish exploit rates — non-deterministic systems require statistical testing.

04
Findings report with remediation

Prioritized findings with CVSS-style severity ratings adapted for AI vulnerabilities. Each finding: description, exploit demonstration, business impact, remediation guidance including specific Guardrails AI or NeMo Guardrails configuration where applicable.

05
Remediation validation

Optional follow-up to validate that implemented remediations are effective and have not introduced new vulnerabilities. Re-test exploits from the original report.

What Is Included
  1. 01

    Prompt injection testing (direct and indirect)

    We test direct injection using role-playing attacks, delimiter injection, instruction override, and context manipulation. Indirect injection is tested against every content path the LLM processes — documents, emails, retrieved chunks — using payloads crafted to survive chunking and embedding. For agentic systems, we specifically test whether injected instructions can propagate across multi-step reasoning chains.

  2. 02

    Agentic system adversarial testing

    Agents with tool access — file writes, API calls, outbound communications — have a blast radius that scales with autonomy. We test whether prompt injection can cause agents to invoke tools with crafted parameters, access out-of-scope data, or trigger unauthorized actions. Severity is mapped to each tool's actual capabilities, not a generic risk score.

  3. 03

    Guardrails configuration

    NeMo Guardrails and Guardrails AI are effective — but only when configured against the specific exploit patterns that work on your system. We configure output filtering and policy enforcement rules based on confirmed attack vectors from the red team exercise, not against generic categories that may not reflect your architecture's actual exposure.

  4. 04

    Adversarial input generation for classifiers

    For classification and detection models, we generate adversarial examples using black-box techniques — no model internals required. We measure robustness across input regions and identify where adversarial examples are most effective, giving you specific surfaces to harden via adversarial training, confidence thresholding, or input preprocessing.

  5. 05

    OWASP LLM Top 10 coverage

    Our methodology maps findings to the OWASP Top 10 for LLM Applications, covering prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, and sensitive information disclosure. Every finding is tagged to its OWASP category for compliance reporting and to support prioritization conversations with non-technical stakeholders.

Deliverables
  • Threat model covering system architecture and attacker personas
  • Attack playbook specific to your tool surface and retrieval pipeline
  • Adversarial testing execution across all in-scope attack categories
  • Findings report with exploits, reproduction steps, and business impact
  • Per-finding remediation guidance including guardrails configuration
  • Optional re-test to validate remediation effectiveness
Projected Impact

A confirmed prompt injection in an agent with write access to external systems can exfiltrate data, trigger unintended actions, or leak system prompts. The cost of discovering this in a structured red team exercise is a fraction of the incident response, reputational damage, and regulatory scrutiny that follows a breach.

FAQ

Frequently
asked questions

Is AI red teaming different from traditional penetration testing?

Yes. Traditional pentesting looks for binary vulnerabilities in infrastructure, code, and protocols. AI red teaming tests probabilistic systems for failure modes that are often not binary: a prompt injection that works 30% of the time is still a documented vulnerability with a specific severity profile. The techniques — adversarial examples, jailbreaks, indirect injection — are specific to AI systems and require different methodology.

How do you handle non-determinism in LLM red teaming?

We run attacks multiple times to establish exploit rates rather than just presence or absence of vulnerability. A jailbreak that works one in ten attempts is a documented finding — with a different severity profile than one that works reliably. We use temperature 0 for reproducibility testing and document attack success rates across repeated attempts at production temperature.

How do we defend against prompt injection in agentic systems?

Defense in depth: input sanitization (flag or strip suspicious instruction patterns in user input), privilege separation (agent tool permissions are minimally scoped to what the task requires), output validation (tool call parameters are validated against schemas before execution), and content isolation (retrieved documents processed in a context separate from user instructions where possible). NeMo Guardrails or Guardrails AI configured against the specific injection patterns that succeed in testing. No single defense is sufficient.

Do you need model weights or just API access for testing?

For LLM-based system red teaming, API access is sufficient — we test the deployed system as an attacker would. For adversarial example testing of custom-trained classification models, access to model architecture and weights enables gradient-based attacks. We design engagement scope based on available access and the threat actors you are defending against.

Ready to get started?

Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.

Start a Conversation

Free 30-min scoping call