ToxicityScanner

Detect toxic, hateful, or offensive language in text.

When to Use This

ToxicityScanner protects your LLM application from users who submit harmful, hateful, or offensive input — and catches any such language that might appear in model responses. Community platforms, public chatbots, customer-facing assistants, and any product accessible to a broad audience should screen for toxicity before content reaches the model or the end user.

Common use cases: blocking hate speech in chat applications, filtering toxic user prompts before LLM processing, auditing LLM outputs for offensive language before display, building safer public-facing products, and enforcing community guidelines at the API layer.

Quick Example

from meshulash_guard import Guard, Action, Condition
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY, ToxicityLabel.HATE_SPEECH],
    action=Action.BLOCK,
    condition=Condition.ANY,
)

result = guard.scan_input(
    "I hate people like you, you're all worthless.",
    scanners=[toxicity],
)

print(result.status)          # "blocked"
print(result.processed_text)  # original text unchanged (Action.BLOCK keeps text)

Expected output:

blocked
I hate people like you, you're all worthless.

Labels

ToxicityScanner classifies text into four toxicity categories.

ToxicityLabel.TOXICITY: General toxic content; rude, disrespectful, or harmful language intended to cause distress
ToxicityLabel.HATE_SPEECH: Language that attacks individuals or groups based on protected characteristics (race, religion, gender, sexual orientation, nationality, disability)
ToxicityLabel.NON_HATE: Content classified as non-hateful; use to verify clean content or in combination with other labels
ToxicityLabel.OFFENSIVE_LANGUAGE: Profanity, slurs, and language that is offensive without necessarily targeting a group
ToxicityLabel.ALL: Shorthand to include all four toxicity labels

Parameters

labels (list[ToxicityLabel], required): Toxicity categories to detect. Cannot be empty. Use ToxicityLabel.ALL to detect all categories.
action (Action, default Action.BLOCK): Action taken when toxicity is detected.
condition (Condition, default Condition.ANY): Gating condition that determines when the scanner triggers.
threshold (float, default None): Confidence threshold (0.0–1.0). Detections below this score are ignored. Useful for tuning false-positive rates.
allowlist (list[str], default None): Values to allow through even when detected.
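To make the threshold and allowlist parameters concrete, here is a minimal plain-Python sketch of how a confidence threshold and an allowlist conceptually gate detections. It mirrors the documented behavior of `threshold` and `allowlist` above; the function and data shapes are illustrative assumptions, not the SDK's internals.

```python
def gate_detections(detections, threshold=None, allowlist=None):
    """Keep only detections that clear the threshold and are not allowlisted.

    `detections` is a list of (text, score) pairs, where `score` is the
    classifier's confidence between 0.0 and 1.0. Illustration only; this is
    not how the SDK is implemented internally.
    """
    allow = set(allowlist or [])
    kept = []
    for text, score in detections:
        if threshold is not None and score < threshold:
            continue  # below the confidence threshold: ignored
        if text in allow:
            continue  # explicitly allowlisted: passes through
        kept.append((text, score))
    return kept

detections = [
    ("you're all worthless", 0.92),
    ("darn it", 0.55),          # borderline, below a 0.75 threshold
    ("scunthorpe", 0.80),       # classic false positive, allowlisted
]
print(gate_detections(detections, threshold=0.75, allowlist=["scunthorpe"]))
# [("you're all worthless", 0.92)]
```

Raising the threshold and allowlisting known false positives are the two main levers for reducing noise without loosening the label set.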

Actions and Conditions

ToxicityScanner defaults to Action.BLOCK to immediately reject harmful input. Set threshold to tune sensitivity — lower values (e.g., 0.5) catch borderline content, while higher values (e.g., 0.9) only block clearly toxic text.

For logging and monitoring without blocking, use Action.LOG. This records detections server-side and returns status="clean" so your application continues normally. Useful for analytics dashboards tracking toxicity trends over time.
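The block-versus-log distinction can be sketched in a few lines of plain Python. Here a local list stands in for the server-side log; the action names and return shape are illustrative assumptions based on the behavior described above, not the SDK's implementation.

```python
audit_log = []  # stand-in for server-side detection records

def apply_action(detected: bool, action: str, text: str) -> dict:
    """Illustrate BLOCK vs LOG semantics: LOG records the detection
    but still reports a clean status, so the caller proceeds normally."""
    if detected and action == "BLOCK":
        return {"status": "blocked", "processed_text": text}
    if detected and action == "LOG":
        audit_log.append(text)  # recorded for analytics, not blocked
        return {"status": "clean", "processed_text": text}
    return {"status": "clean", "processed_text": text}

r = apply_action(True, "LOG", "borderline message")
print(r["status"], len(audit_log))  # clean 1
```

The key point is that a LOG-mode detection is invisible to the request path: only the audit trail changes.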

See the Concepts page for the full reference on Actions and Conditions.

scan_input Example

Screening user messages in a community platform before passing them to an LLM moderator:

from meshulash_guard import Guard, Action, Condition
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

# Block the most serious categories; the threshold reduces false positives
toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY, ToxicityLabel.HATE_SPEECH],
    action=Action.BLOCK,
    condition=Condition.ANY,
    threshold=0.75,
)

messages = [
    "Can you help me write a professional cover letter?",
    "All members of that group should be eliminated from our platform.",
]

for msg in messages:
    result = guard.scan_input(msg, scanners=[toxicity])
    print(f"[{result.status}] {msg[:60]}...")

Expected output:

[clean] Can you help me write a professional cover letter?...
[blocked] All members of that group should be eliminated from our plat...

scan_output Example

Auditing LLM responses before returning them to users to catch any harmful content the model generated:

from meshulash_guard import Guard, Action
from meshulash_guard.scanners import ToxicityScanner, ToxicityLabel

guard = Guard(api_key="sk-your-api-key", tenant_id="your-tenant-id")

toxicity = ToxicityScanner(
    labels=[ToxicityLabel.ALL],
    action=Action.BLOCK,
)

# Simulate LLM responses
responses = [
    "Here is a step-by-step guide to setting up your development environment.",
    "Sure, I can help with that. Those people are absolute idiots anyway.",
]

for response in responses:
    result = guard.scan_output(response, scanners=[toxicity])
    if result.status == "blocked":
        print("Response blocked — returning generic error to user.")
    else:
        print(f"Safe response: {response[:60]}...")

Expected output:

Safe response: Here is a step-by-step guide to setting up your development ...
Response blocked — returning generic error to user.