Normalizers
Normalizers pre-process text before scanning, defeating evasion techniques like Unicode tricks, invisible characters, and homoglyph substitution. All normalization runs server-side — the SDK has zero processing dependencies.
What Normalizers Do
Attackers often try to slip harmful content past security scanners by encoding it in unusual ways: using lookalike characters (Cyrillic "а" instead of Latin "a"), injecting invisible Unicode characters, or double-encoding with HTML entities or URL encoding. Normalizers clean up these obfuscations before the scanners ever see the text.
Quick Example
from meshulash_guard import Guard, Action
from meshulash_guard import InvisibleTextNormalizer, UnicodeNormalizer
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

result = guard.scan_input(
    text="Contact me at s\u200barah@company.com",  # zero-width space injected
    scanners=[pii],
    normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer()],
)

print(result.status)          # "secured"
print(result.processed_text)  # "Contact me at [EMAIL_ADDRESS-A1B2]"
The zero-width space is stripped, the normalized text "Contact me at sarah@company.com" reaches the scanner, and the email is detected and replaced.
Available Normalizers
| Normalizer | What It Does | Config |
|---|---|---|
| InvisibleTextNormalizer() | Strips zero-width characters, private-use area code points, and Unicode tag blocks | None |
| UnicodeNormalizer(form="NFKC") | Applies Unicode normalization (NFC, NFKC, NFD, or NFKD) | form parameter |
| EncodingNormalizer() | Decodes base64, URL encoding, and HTML entities | None |
| HomoglyphNormalizer() | Maps visually similar characters (e.g., Cyrillic lookalikes) to their ASCII equivalents | None |
| StripHtmlNormalizer() | Removes HTML tags from text, leaving only the text content | None |
| StripMarkdownNormalizer() | Removes Markdown formatting symbols, leaving plain text | None |
| LowercaseNormalizer() | Converts all text to lowercase | None |
| CollapseWhitespaceNormalizer() | Collapses runs of whitespace (spaces, tabs, newlines) to a single space | None |
Using Normalizers with Scanning
Pass normalizers as a list to scan_input() or scan_output(). Normalizers are applied in list order before the scanners run.
from meshulash_guard import Guard, Action
from meshulash_guard import (
    InvisibleTextNormalizer,
    UnicodeNormalizer,
    HomoglyphNormalizer,
)
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

result = guard.scan_input(
    text="Email: \uff53\uff41\uff52\uff41\uff48@example.com",  # fullwidth characters
    scanners=[pii],
    normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer(), HomoglyphNormalizer()],
)

print(result.status)          # "secured"
print(result.processed_text)  # "Email: [EMAIL_ADDRESS-A1B2]"
Using "all" Shorthand
Pass normalizers="all" to apply all 8 normalizers in the canonical security order. This is the simplest option when you want comprehensive coverage.
The canonical security order is: invisible text → unicode → encoding → homoglyph → strip HTML → strip Markdown → lowercase → collapse whitespace.
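To see why the order matters, here is a minimal local sketch using only the Python standard library. It does not call the SDK. The two helper functions and the one-entry homoglyph table are hypothetical stand-ins for the encoding and homoglyph steps, and the real server-side normalizers may behave differently.

```python
from urllib.parse import unquote

# Hypothetical one-entry homoglyph table: Cyrillic "а" (U+0430) -> Latin "a".
HOMOGLYPHS = str.maketrans({"\u0430": "a"})


def decode_url(text: str) -> str:
    """Stand-in for the encoding step (URL percent-decoding only)."""
    return unquote(text)


def map_homoglyphs(text: str) -> str:
    """Stand-in for the homoglyph step."""
    return text.translate(HOMOGLYPHS)


# "%D0%B0" is the UTF-8 percent-encoding of Cyrillic "а".
payload = "%D0%B0dmin"

# Canonical order: decode first, then map lookalikes.
encoding_first = map_homoglyphs(decode_url(payload))

# Reversed order: the lookalike is still hidden inside the percent-encoding
# when the homoglyph step runs, so it survives.
homoglyph_first = decode_url(map_homoglyphs(payload))

print(encoding_first == "admin")   # True
print(homoglyph_first == "admin")  # False
```

Decoding before homoglyph mapping exposes the Cyrillic character in time for it to be mapped; the reversed order leaves it in place.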
Auto-Ordering
When you pick a specific subset of normalizers, they run in the order you list them. If you want the recommended security ordering applied automatically, set auto_order=True:
from meshulash_guard import Guard, Action
from meshulash_guard import HomoglyphNormalizer, InvisibleTextNormalizer
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

user_input = "..."  # placeholder: replace with the untrusted text to scan

# Listed in the "wrong" order; auto_order corrects it to invisible → homoglyph
result = guard.scan_input(
    text=user_input,
    scanners=[pii],
    normalizers=[HomoglyphNormalizer(), InvisibleTextNormalizer()],
    auto_order=True,
)
With auto_order=True, the SDK reorders the normalizers into the canonical security sequence regardless of your list order. Useful when you want a specific subset without having to remember the exact ordering.
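Conceptually, this reordering amounts to sorting the selected normalizers by their position in the canonical sequence. The sketch below illustrates that idea with plain strings. The short names and the `auto_order` helper are hypothetical labels chosen for this example, not SDK identifiers, and the SDK's actual implementation may differ.

```python
# Canonical order as listed in this document; the names are hypothetical labels.
CANONICAL_ORDER = [
    "invisible_text",
    "unicode",
    "encoding",
    "homoglyph",
    "strip_html",
    "strip_markdown",
    "lowercase",
    "collapse_whitespace",
]

# Position of each normalizer in the canonical sequence.
RANK = {name: index for index, name in enumerate(CANONICAL_ORDER)}


def auto_order(selected):
    """Return the selected normalizer names sorted into canonical order."""
    return sorted(selected, key=lambda name: RANK[name])


print(auto_order(["homoglyph", "invisible_text"]))
# ['invisible_text', 'homoglyph']
```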
Normalize Without Scanning
Call guard.normalize() to normalize text without running any scanners. Returns a plain str.
from meshulash_guard import Guard
from meshulash_guard import EncodingNormalizer, HomoglyphNormalizer

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")

normalized = guard.normalize(
    text="S%61r%61h%40example.com",  # URL-encoded email
    normalizers=[EncodingNormalizer(), HomoglyphNormalizer()],
)

print(normalized)  # "Sarah@example.com"
This is useful for preprocessing pipelines where you want to normalize before sending text to other systems, not just to the scanner.
Normalizer Details
InvisibleTextNormalizer
Strips characters that have no visible representation but can disrupt pattern matching: zero-width spaces (U+200B), zero-width non-joiners, private-use area characters (U+E000–U+F8FF), and Unicode tag blocks (U+E0000–U+E007F).
When to use: Always include this normalizer in security-sensitive pipelines. Invisible character injection is a common evasion technique.
Constructor: InvisibleTextNormalizer() — no parameters.
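For intuition, the stripping described above can be approximated locally with a regular expression. This is only a sketch: the `strip_invisible` helper is hypothetical, it covers just the code points named in this section, and the server-side normalizer may handle more.

```python
import re

# Character class covering the code points named in this section.
INVISIBLE = re.compile(
    "["
    "\u200b"                 # zero-width space
    "\u200c"                 # zero-width non-joiner
    "\ue000-\uf8ff"          # private-use area
    "\U000e0000-\U000e007f"  # Unicode tag block
    "]"
)


def strip_invisible(text: str) -> str:
    """Remove the invisible code points matched by INVISIBLE."""
    return INVISIBLE.sub("", text)


print(strip_invisible("s\u200barah@example.com"))  # sarah@example.com
```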
Example:
| Before | After |
|---|---|
| "sarah@example.com" (zero-width space after "s") | "sarah@example.com" |
| "hello world" (multiple invisible chars) | "hello world" |
UnicodeNormalizer
Applies Unicode normalization to reduce text to a canonical form. The default form NFKC is recommended for security use — it maps compatibility characters (e.g., fullwidth letters, ligatures, circled digits) to their standard equivalents.
When to use: Use when input may contain fullwidth or compatibility characters, math-style letters, or other visually similar Unicode variants.
Constructor: UnicodeNormalizer(form="NFKC")
| Parameter | Type | Default | Description |
|---|---|---|---|
| form | str | "NFKC" | Normalization form: "NFC", "NFKC", "NFD", or "NFKD". "NFKC" is recommended for security. |
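The four forms correspond to the standard Unicode normalization forms, which Python exposes through `unicodedata.normalize`. The sketch below shows the effect locally. The `normalize_unicode` wrapper is a hypothetical helper, and whether the server uses this exact routine is an assumption.

```python
import unicodedata


def normalize_unicode(text: str, form: str = "NFKC") -> str:
    """Apply a Unicode normalization form using the standard library."""
    return unicodedata.normalize(form, text)


fullwidth = "\uff53\uff41\uff52\uff41\uff48"  # fullwidth "sarah"

print(normalize_unicode(fullwidth))                       # sarah
print(normalize_unicode("\u2160"))                        # I
print(normalize_unicode(fullwidth, "NFC") == fullwidth)   # True
```

The last line shows why NFKC is the recommended form here: NFC applies only canonical mappings, so fullwidth letters pass through unchanged.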
Example:
| Before | After (NFKC) |
|---|---|
| "\uff53\uff41\uff52\uff41\uff48" (fullwidth "sarah") | "sarah" |
| "\u2160" (Roman numeral I) | "I" |
EncodingNormalizer
Decodes common encoding schemes used to obfuscate text: base64, URL percent-encoding (%XX), and HTML entities (&amp;amp;, &amp;#65;, etc.).
When to use: Use when processing content from web forms, APIs, or any user-controlled input channel where encoding tricks are a concern.
Constructor: EncodingNormalizer() — no parameters.
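The URL and HTML entity decoding can be approximated locally with the standard library, as sketched below. The `decode_url_and_entities` helper is hypothetical. Base64 is shown only as a round trip on a known token, because how the server decides which substrings are base64 is not documented here.

```python
import base64
import html
from urllib.parse import unquote


def decode_url_and_entities(text: str) -> str:
    """Percent-decode first, then resolve HTML entities."""
    return html.unescape(unquote(text))


print(decode_url_and_entities("S%61r%61h%40example.com"))  # Sarah@example.com
print(decode_url_and_entities("sarah&#64;example.com"))    # sarah@example.com

# Base64 round trip on a known token (no detection heuristic is shown).
token = base64.b64encode(b"sarah@example.com").decode("ascii")
decoded = base64.b64decode(token).decode("utf-8")
print(decoded)  # sarah@example.com
```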
Example:
| Before | After |
|---|---|
| "S%61r%61h%40example.com" | "Sarah@example.com" |
| "sarah&amp;#64;example.com" | "sarah@example.com" |
HomoglyphNormalizer
Maps visually similar characters from non-Latin scripts to their ASCII equivalents. For example, Cyrillic "а" (U+0430) looks identical to Latin "a" (U+0061) but is a different code point that would bypass most pattern-based detections.
When to use: Use when users may substitute Cyrillic, Greek, or other lookalike characters for ASCII letters.
Constructor: HomoglyphNormalizer() — no parameters.
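A lookalike mapping of this kind can be pictured as a character translation table. The sketch below uses a deliberately tiny, hypothetical table of six entries. The server-side normalizer presumably covers far more characters, and its actual table is not documented here.

```python
# Hypothetical, deliberately small lookalike table for illustration.
HOMOGLYPH_MAP = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u03bf": "o",  # Greek ο (omicron)
})


def map_homoglyphs(text: str) -> str:
    """Replace known lookalike characters with their ASCII equivalents."""
    return text.translate(HOMOGLYPH_MAP)


print(map_homoglyphs("\u0430dmin"))             # admin
print(map_homoglyphs("\u0440\u0430ypal.com"))   # paypal.com
```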
Example:
| Before | After |
|---|---|
| "аdmin" (Cyrillic "а") | "admin" |
| "раypal.com" (Cyrillic "р" and "а") | "paypal.com" |
StripHtmlNormalizer
Removes HTML tags from text, leaving only the text content. Useful when user input or LLM output may contain HTML markup that could conceal content from scanners.
When to use: Use when processing HTML-formatted input (e.g., rich text editors, email bodies, web scraping output).
Constructor: StripHtmlNormalizer() — no parameters.
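As a rough local approximation, tag stripping can be built on Python's `html.parser`. The sketch below keeps text nodes and drops the contents of `script` and `style` elements. Dropping those contents is an assumption about the server's behavior, and `strip_html` is a hypothetical helper rather than an SDK function.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(text: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_html("<a href='x'>sarah@example.com</a>"))  # sarah@example.com
```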
Example:
| Before | After |
|---|---|
| "<b>Hello</b> <script>alert(1)</script>world" | "Hello world" |
| "<a href='x'>sarah@example.com</a>" | "sarah@example.com" |
StripMarkdownNormalizer
Removes Markdown formatting symbols (headers #, bold **, italic _, code backticks, links [text](url), etc.), leaving plain text.
When to use: Use when processing Markdown-formatted content such as LLM responses or documentation snippets.
Constructor: StripMarkdownNormalizer() — no parameters.
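A few regular expressions are enough to illustrate the idea locally. This sketch handles only headers, links, bold, and inline code. It skips underscore italics on purpose, because a naive rule would mangle email addresses that contain underscores. `strip_markdown` is a hypothetical helper, and the server-side rules are likely more complete.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a small subset of Markdown syntax, keeping the visible text."""
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # headers
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)           # links -> link text
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                   # bold
    text = re.sub(r"`([^`]*)`", r"\1", text)                       # inline code
    return text


print(strip_markdown("**Contact**: sarah@example.com"))
# Contact: sarah@example.com
```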
Example:
| Before | After |
|---|---|
| "**Contact**: sarah@example.com" | "Contact: sarah@example.com" |
| "# Title\n\nBody text" | "Title\n\nBody text" |
LowercaseNormalizer
Converts all text to lowercase. Useful for case-insensitive detection when combined with other normalizers.
When to use: Use when your scanners or banned-substring lists are case-insensitive and you want to normalize casing before scanning.
Constructor: LowercaseNormalizer() — no parameters.
Example:
| Before | After |
|---|---|
| "SARAH@EXAMPLE.COM" | "sarah@example.com" |
| "Hello World" | "hello world" |
CollapseWhitespaceNormalizer
Collapses runs of multiple whitespace characters (spaces, tabs, newlines) into a single space, and trims leading/trailing whitespace.
When to use: Use to normalize whitespace-separated evasion attempts, or to clean up text before passing it to scanners that might otherwise miss multi-space patterns.
Constructor: CollapseWhitespaceNormalizer() — no parameters.
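The collapse-and-trim behavior described above matches a common Python idiom, sketched here with a hypothetical `collapse_whitespace` helper. Whether the server treats every Unicode whitespace character the same way `str.split` does is an assumption.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


print(collapse_whitespace("hello\t\t\tworld"))     # hello world
print(collapse_whitespace("  s   a  r a\nh  "))    # s a r a h
```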
Example:
| Before | After |
|---|---|
| "s    a    r    a    h" | "s a r a h" |
| "hello\t\t\tworld" | "hello world" |