# ToxicityScanner
Detect toxic, hateful, or offensive language in text.
## When to Use This
ToxicityScanner protects your LLM application from users who submit harmful, hateful, or offensive input — and catches any such language that might appear in model responses. Community platforms, public chatbots, customer-facing assistants, and any product accessible to a broad audience should screen for toxicity before content reaches the model or the end user.
Common use cases:

- Blocking hate speech in chat applications
- Filtering toxic user prompts before LLM processing
- Auditing LLM outputs for offensive language before display
- Building safer public-facing products
- Enforcing community guidelines at the API layer
## Quick Example
```python
from meshulash_guard import Guard, Action, Condition
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY, ToxicityLabel.HATE_SPEECH],
    action=Action.BLOCK,
    condition=Condition.ANY,
)

result = guard.scan_input(
    "I hate people like you, you're all worthless.",
    scanners=[toxicity],
)

print(result.status)          # "blocked"
print(result.processed_text)  # original text unchanged (Action.BLOCK keeps text)
```
Expected output:

```
blocked
I hate people like you, you're all worthless.
```
## Labels
ToxicityScanner classifies text into four toxicity categories.
| Label | What It Detects |
|---|---|
| `ToxicityLabel.TOXICITY` | General toxic content: rude, disrespectful, or harmful language intended to cause distress |
| `ToxicityLabel.HATE_SPEECH` | Language that attacks individuals or groups based on protected characteristics (race, religion, gender, sexual orientation, nationality, disability) |
| `ToxicityLabel.NON_HATE` | Content classified as non-hateful; use to verify clean content or in combination with other labels |
| `ToxicityLabel.OFFENSIVE_LANGUAGE` | Profanity, slurs, and language that is offensive without necessarily targeting a group |
| `ToxicityLabel.ALL` | Shorthand to include all four toxicity labels |
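To target a subset of categories, pass only the labels you care about; `ToxicityLabel.ALL` is the shorthand for listing all four. A minimal sketch using only the constructors shown above:

```python
from meshulash_guard import Action
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

# Narrow scanner: flags profanity and slurs, ignores the other categories
profanity_only = ToxicityScanner(
    labels=[ToxicityLabel.OFFENSIVE_LANGUAGE],
    action=Action.BLOCK,
)

# Broad scanner: shorthand for listing all four labels explicitly
catch_all = ToxicityScanner(
    labels=[ToxicityLabel.ALL],
    action=Action.BLOCK,
)
```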
## Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `labels` | `list[ToxicityLabel]` | required | Toxicity categories to detect. Cannot be empty. Use `ToxicityLabel.ALL` to detect all categories. |
| `action` | `Action` | `Action.BLOCK` | Action to take when toxicity is detected. |
| `condition` | `Condition` | `Condition.ANY` | Gating condition that determines when the scanner triggers. |
| `threshold` | `float` | `None` | Confidence threshold (0.0–1.0). Detections below this score are ignored. Useful for tuning false-positive rates. |
| `allowlist` | `list[str]` | `None` | Values to allow through even when detected. |
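The optional parameters compose with the required ones. As a hedged sketch (the allowlist phrases below are invented examples of benign text that might otherwise be flagged, not special SDK values):

```python
from meshulash_guard import Action, Condition
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

# Ignore detections the classifier scores below 0.8, and always let
# the allowlisted phrases through even when they are detected.
# The allowlist entries here are purely illustrative.
tuned = ToxicityScanner(
    labels=[ToxicityLabel.ALL],
    action=Action.BLOCK,
    condition=Condition.ANY,
    threshold=0.8,
    allowlist=["kill the process", "that bug is driving me crazy"],
)
```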
## Actions and Conditions
ToxicityScanner defaults to Action.BLOCK to immediately reject harmful input. Set threshold to tune sensitivity — lower values (e.g., 0.5) catch borderline content, while higher values (e.g., 0.9) only block clearly toxic text.
For logging and monitoring without blocking, use Action.LOG. This records detections server-side and returns status="clean" so your application continues normally. Useful for analytics dashboards tracking toxicity trends over time.
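A minimal observe-only setup, reusing the API shown above, might look like this:

```python
from meshulash_guard import Guard, Action
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

# Observe-only: detections are recorded server-side for analysis,
# but the scan still returns status "clean" and nothing is blocked.
monitor = ToxicityScanner(
    labels=[ToxicityLabel.ALL],
    action=Action.LOG,
)

result = guard.scan_input("some user message", scanners=[monitor])
print(result.status)  # "clean" even if toxicity was detected
```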
See the Concepts page for the full reference on Actions and Conditions.
## scan_input Example
Screening user messages in a community platform before passing them to an LLM moderator:
```python
from meshulash_guard import Guard, Action, Condition
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

# Block the most serious categories, with a threshold to reduce false positives
toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY, ToxicityLabel.HATE_SPEECH],
    action=Action.BLOCK,
    condition=Condition.ANY,
    threshold=0.75,
)

messages = [
    "Can you help me write a professional cover letter?",
    "All members of that group should be eliminated from our platform.",
]

for msg in messages:
    result = guard.scan_input(msg, scanners=[toxicity])
    print(f"[{result.status}] {msg[:60]}...")
```
Expected output:

```
[clean] Can you help me write a professional cover letter?...
[blocked] All members of that group should be eliminated from our plat...
```
## scan_output Example
Auditing LLM responses before returning them to users to catch any harmful content the model generated:
```python
from meshulash_guard import Guard, Action
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

toxicity = ToxicityScanner(
    labels=[ToxicityLabel.ALL],
    action=Action.BLOCK,
)

# Simulate LLM responses
responses = [
    "Here is a step-by-step guide to setting up your development environment.",
    "Sure, I can help with that. Those people are absolute idiots anyway.",
]

for response in responses:
    result = guard.scan_output(response, scanners=[toxicity])
    if result.status == "blocked":
        print("Response blocked — returning generic error to user.")
    else:
        print(f"Safe response: {response[:60]}...")
```
Expected output:

```
Safe response: Here is a step-by-step guide to setting up your development ...
Response blocked — returning generic error to user.
```
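Putting both directions together, a guarded request path might look like the sketch below. Here `call_llm` is a hypothetical placeholder for your actual model client, and the fallback messages are just examples:

```python
from meshulash_guard import Guard, Action
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")
toxicity = ToxicityScanner(labels=[ToxicityLabel.ALL], action=Action.BLOCK)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your actual model client."""
    return "Model response goes here."

def guarded_chat(user_message: str) -> str:
    # Screen the user's message before it reaches the model
    inbound = guard.scan_input(user_message, scanners=[toxicity])
    if inbound.status == "blocked":
        return "Your message could not be processed."

    # Screen the model's response before it reaches the user
    response = call_llm(user_message)
    outbound = guard.scan_output(response, scanners=[toxicity])
    if outbound.status == "blocked":
        return "Sorry, something went wrong. Please try again."

    return response
```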