Normalizers

Normalizers pre-process text before scanning, defeating evasion techniques like Unicode tricks, invisible characters, and homoglyph substitution. All normalization runs server-side, so the SDK itself has no text-processing dependencies.


What Normalizers Do

Attackers often try to slip harmful content past security scanners by encoding it in unusual ways: using lookalike characters (Cyrillic "а" instead of Latin "a"), injecting invisible Unicode characters, or double-encoding with HTML entities or URL encoding. Normalizers clean up these obfuscations before the scanners ever see the text.
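
To make the problem concrete, here is a minimal standalone sketch (standard library only, no SDK calls) of how a naive substring check misses three obfuscated variants of the same term. The blocked term and the helper name naive_match are illustrative assumptions, not part of meshulash_guard.

```python
# Hypothetical blocked term and helper; neither comes from the SDK.
BLOCKED_TERM = "paypal.com"

OBFUSCATED = {
    "homoglyph": "p\u0430ypal.com",    # Cyrillic "а" (U+0430) instead of Latin "a"
    "invisible": "pay\u200bpal.com",   # zero-width space injected mid-word
    "url_encoded": "p%61ypal.com",     # "a" written as %61
}


def naive_match(text: str) -> bool:
    """Return True if the blocked term appears verbatim in the text."""
    return BLOCKED_TERM in text


# Every obfuscated variant slips past the verbatim check.
results = {name: naive_match(text) for name, text in OBFUSCATED.items()}
print(results)
```

Each variant looks the same (or decodes to the same text) for a human reader, yet none matches until the text is normalized.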


Quick Example

from meshulash_guard import Guard, Action
from meshulash_guard import InvisibleTextNormalizer, UnicodeNormalizer
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")

pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

result = guard.scan_input(
    text="Contact me at s\u200barah@company.com",  # zero-width space injected
    scanners=[pii],
    normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer()],
)

print(result.status)          # "secured"
print(result.processed_text)  # "Contact me at [EMAIL_ADDRESS-A1B2]"

Expected output:

secured
Contact me at [EMAIL_ADDRESS-A1B2]

The zero-width space is stripped, the normalized text "Contact me at sarah@company.com" reaches the scanner, and the email is detected and replaced.


Available Normalizers

| Normalizer | What It Does | Config |
| --- | --- | --- |
| InvisibleTextNormalizer() | Strips zero-width characters, private-use area code points, and Unicode tag blocks | None |
| UnicodeNormalizer(form="NFKC") | Applies Unicode normalization (NFC, NFKC, NFD, or NFKD) | form parameter |
| EncodingNormalizer() | Decodes base64, URL encoding, and HTML entities | None |
| HomoglyphNormalizer() | Maps visually similar characters (e.g., Cyrillic lookalikes) to their ASCII equivalents | None |
| StripHtmlNormalizer() | Removes HTML tags from text, leaving only the text content | None |
| StripMarkdownNormalizer() | Removes Markdown formatting symbols, leaving plain text | None |
| LowercaseNormalizer() | Converts all text to lowercase | None |
| CollapseWhitespaceNormalizer() | Collapses runs of whitespace (spaces, tabs, newlines) to a single space | None |

Using Normalizers with Scanning

Pass normalizers as a list to scan_input() or scan_output(). Normalizers are applied in list order before the scanners run.

from meshulash_guard import Guard, Action
from meshulash_guard import (
    InvisibleTextNormalizer,
    UnicodeNormalizer,
    HomoglyphNormalizer,
)
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

result = guard.scan_input(
    text="Email: \uff53\uff41\uff52\uff41\uff48@example.com",  # fullwidth characters
    scanners=[pii],
    normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer(), HomoglyphNormalizer()],
)

print(result.status)          # "secured"
print(result.processed_text)  # "Email: [EMAIL_ADDRESS-A1B2]"

Expected output:

secured
Email: [EMAIL_ADDRESS-A1B2]

Using "all" Shorthand

Pass normalizers="all" to apply all 8 normalizers in the canonical security order. This is the simplest option when you want comprehensive coverage.

result = guard.scan_input(
    text=user_input,
    scanners=[pii],
    normalizers="all",
)

The canonical security order is: invisible text → unicode → encoding → homoglyph → strip HTML → strip Markdown → lowercase → collapse whitespace.
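
A short local sketch can show why the order matters. It uses standard-library stand-ins (urllib.parse.unquote for decoding and a one-entry translation table for homoglyphs). These are assumptions chosen for illustration and are not the service's implementation.

```python
from urllib.parse import unquote

# Illustrative one-entry homoglyph table: Cyrillic "а" (U+0430) -> Latin "a".
HOMOGLYPHS = str.maketrans({"\u0430": "a"})


def decode(text: str) -> str:
    """Stand-in for an encoding step: undo URL percent-encoding."""
    return unquote(text)


def fold(text: str) -> str:
    """Stand-in for a homoglyph step: map lookalikes to ASCII."""
    return text.translate(HOMOGLYPHS)


# A Cyrillic "а" hidden behind percent-encoding (UTF-8 bytes D0 B0).
payload = "%D0%B0dmin"

decode_then_fold = fold(decode(payload))  # encoding step first, as in the canonical order
fold_then_decode = decode(fold(payload))  # homoglyph step runs too early

print(decode_then_fold)             # admin
print(fold_then_decode == "admin")  # False: the Cyrillic letter survives
```

Running the homoglyph step first finds nothing to map, because the lookalike only appears once the percent-encoding is decoded.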


Auto-Ordering

When you pick a specific subset of normalizers, they run in the order you list them. If you want the recommended security ordering applied automatically, set auto_order=True:

from meshulash_guard import Guard, Action
from meshulash_guard import HomoglyphNormalizer, InvisibleTextNormalizer
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

# Listed in "wrong" order — auto_order corrects it to invisible → homoglyph
result = guard.scan_input(
    text=user_input,
    scanners=[pii],
    normalizers=[HomoglyphNormalizer(), InvisibleTextNormalizer()],
    auto_order=True,
)

With auto_order=True, the SDK reorders the normalizers into the canonical security sequence regardless of your list order. This is useful when you want a specific subset without having to remember the exact ordering.
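
Conceptually, this kind of reordering amounts to sorting the chosen subset by its position in the canonical sequence. The sketch below illustrates the idea with plain string labels. The labels and the auto_order function are hypothetical and say nothing about how the SDK implements the flag internally.

```python
# Hypothetical labels for the eight normalizers, in the documented canonical order.
CANONICAL_ORDER = [
    "invisible_text",
    "unicode",
    "encoding",
    "homoglyph",
    "strip_html",
    "strip_markdown",
    "lowercase",
    "collapse_whitespace",
]
RANK = {name: position for position, name in enumerate(CANONICAL_ORDER)}


def auto_order(selected: list[str]) -> list[str]:
    """Return the selected normalizer labels sorted into canonical order."""
    return sorted(selected, key=lambda name: RANK[name])


print(auto_order(["homoglyph", "invisible_text"]))
# ['invisible_text', 'homoglyph']
```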


Normalize Without Scanning

Call guard.normalize() to normalize text without running any scanners. It returns a plain str.

from meshulash_guard import Guard
from meshulash_guard import EncodingNormalizer, HomoglyphNormalizer

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")

normalized = guard.normalize(
    text="S%61r%61h%40example.com",  # URL-encoded email
    normalizers=[EncodingNormalizer(), HomoglyphNormalizer()],
)

print(normalized)  # "Sarah@example.com"

Expected output:

Sarah@example.com

This is useful for preprocessing pipelines where you want to normalize before sending text to other systems, not just to the scanner.


Normalizer Details

InvisibleTextNormalizer

Strips characters that have no visible representation but can disrupt pattern matching: zero-width spaces (U+200B), zero-width non-joiners, private-use area characters (U+E000–U+F8FF), and Unicode tag blocks (U+E0000–U+E007F).

When to use: Always include this normalizer in security-sensitive pipelines. Invisible character injection is a common evasion technique.

Constructor: InvisibleTextNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "s​arah@example.com" (zero-width space after "s") | "sarah@example.com" |
| "hello​​​ world" (multiple invisible chars) | "hello world" |
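
For intuition, the stripping step can be approximated locally with a regular expression over the code point ranges listed above. This is a rough sketch, not the service's implementation. The function name strip_invisible is hypothetical, and the character class covers only the ranges this page names.

```python
import re

# Character class built from the ranges named in this section.
INVISIBLE_RE = re.compile(
    "["
    "\u200b\u200c"            # zero-width space, zero-width non-joiner
    "\ue000-\uf8ff"           # private-use area
    "\U000e0000-\U000e007f"   # Unicode tag block
    "]"
)


def strip_invisible(text: str) -> str:
    """Remove the invisible code points matched by INVISIBLE_RE."""
    return INVISIBLE_RE.sub("", text)


print(strip_invisible("s\u200barah@example.com"))  # sarah@example.com
```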

UnicodeNormalizer

Applies Unicode normalization to reduce text to a canonical form. The default form NFKC is recommended for security use — it maps compatibility characters (e.g., fullwidth letters, ligatures, circled digits) to their standard equivalents.

When to use: Use when input may contain fullwidth or compatibility characters, math-style letters, or other visually similar Unicode variants.

Constructor: UnicodeNormalizer(form="NFKC")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| form | str | "NFKC" | Normalization form: "NFC", "NFKC", "NFD", or "NFKD". "NFKC" is recommended for security. |

Example:

| Before | After (NFKC) |
| --- | --- |
| "\uff53\uff41\uff52\uff41\uff48" (fullwidth "sarah") | "sarah" |
| "\u2160" (Roman numeral I) | "I" |
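
The same transformations can be reproduced locally with Python's standard unicodedata module, which is handy for checking what a given form does to a string. This only illustrates Unicode normalization itself and is not a claim about how the service implements UnicodeNormalizer.

```python
import unicodedata

fullwidth = "\uff53\uff41\uff52\uff41\uff48"  # fullwidth "sarah"

# NFKC folds compatibility characters into their standard equivalents.
print(unicodedata.normalize("NFKC", fullwidth))  # sarah
print(unicodedata.normalize("NFKC", "\u2160"))   # I

# NFC performs only canonical composition, so fullwidth letters are left untouched.
print(unicodedata.normalize("NFC", fullwidth) == fullwidth)  # True
```

The last line shows why the compatibility forms (NFKC, NFKD) matter here: the canonical forms do not fold fullwidth letters.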

EncodingNormalizer

Decodes common encoding schemes used to obfuscate text: base64, URL percent-encoding (%XX), and HTML entities (&amp;, &#65;, etc.).

When to use: Use when processing content from web forms, APIs, or any user-controlled input channel where encoding tricks are a concern.

Constructor: EncodingNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "S%61r%61h%40example.com" | "Sarah@example.com" |
| "sarah&#64;example.com" | "sarah@example.com" |
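
As a local approximation, two of these decodings map directly onto standard-library functions. The sketch below chains urllib.parse.unquote and html.unescape. The function name decode_layers is hypothetical, and base64 is left out on purpose because deciding which substrings are base64 is heuristic, and this page does not describe how the service does it.

```python
import html
from urllib.parse import unquote


def decode_layers(text: str) -> str:
    """Undo URL percent-encoding, then HTML entities (illustrative order)."""
    return html.unescape(unquote(text))


print(decode_layers("S%61r%61h%40example.com"))  # Sarah@example.com
print(decode_layers("sarah&#64;example.com"))    # sarah@example.com
```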

HomoglyphNormalizer

Maps visually similar characters from non-Latin scripts to their ASCII equivalents. For example, Cyrillic "а" (U+0430) looks identical to Latin "a" (U+0061) but is a different code point that would bypass most pattern-based detections.

When to use: Use when users may substitute Cyrillic, Greek, or other lookalike characters for ASCII letters.

Constructor: HomoglyphNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "аdmin" (Cyrillic "а") | "admin" |
| "раypal.com" (Cyrillic "р" and "а") | "paypal.com" |
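
The mapping idea can be sketched with a small translation table. The five-entry table and the function name fold_homoglyphs below are illustrative assumptions. A production mapping would be far larger, and the service's actual table is not documented here.

```python
# Tiny illustrative table: Cyrillic lookalikes -> Latin letters.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
})


def fold_homoglyphs(text: str) -> str:
    """Replace known lookalike characters with their ASCII equivalents."""
    return text.translate(HOMOGLYPHS)


print(fold_homoglyphs("\u0430dmin"))             # admin
print(fold_homoglyphs("\u0440\u0430ypal.com"))   # paypal.com
```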

StripHtmlNormalizer

Removes HTML tags from text, leaving only the text content. Useful when user input or LLM output may contain HTML markup that could conceal content from scanners.

When to use: Use when processing HTML-formatted input (e.g., rich text editors, email bodies, web scraping output).

Constructor: StripHtmlNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "<b>Hello</b> <script>alert(1)</script>world" | "Hello world" |
| "<a href='x'>sarah@example.com</a>" | "sarah@example.com" |
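
A comparable result can be obtained locally with the standard html.parser module. The sketch below keeps text nodes and drops the contents of script and style elements, which matches the first example row. Whether the service uses the same rule is an assumption, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = ("script", "style")

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(text: str) -> str:
    """Return only the visible text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_html("<b>Hello</b> <script>alert(1)</script>world"))  # Hello world
```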

StripMarkdownNormalizer

Removes Markdown formatting symbols (headers #, bold **, italic _, code backticks, links [text](url), etc.), leaving plain text.

When to use: Use when processing Markdown-formatted content such as LLM responses or documentation snippets.

Constructor: StripMarkdownNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "**Contact**: sarah@example.com" | "Contact: sarah@example.com" |
| "# Title\n\nBody text" | "Title\n\nBody text" |
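
For a rough local equivalent, a handful of regular expressions covers the constructs listed above. This is a deliberately simplified sketch with a hypothetical function name. It does not handle nested emphasis, code fences, or tables, and it is not the service's implementation.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common Markdown markers, keeping the visible text."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)             # [text](url) -> text
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)    # leading # headers
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)                  # **bold** / __bold__
    text = re.sub(r"(?<!\w)([*_])(.+?)\1(?!\w)", r"\2", text)        # *italic* / _italic_
    text = re.sub(r"`([^`]*)`", r"\1", text)                         # `code`
    return text


print(strip_markdown("**Contact**: sarah@example.com"))  # Contact: sarah@example.com
```

The italic pattern requires non-word characters around the markers, so identifiers such as snake_case_name and addresses containing underscores are left intact.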

LowercaseNormalizer

Converts all text to lowercase. Useful for case-insensitive detection when combined with other normalizers.

When to use: Use when your scanners or banned-substring lists are case-insensitive and you want to normalize casing before scanning.

Constructor: LowercaseNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "SARAH@EXAMPLE.COM" | "sarah@example.com" |
| "Hello World" | "hello world" |

CollapseWhitespaceNormalizer

Collapses runs of multiple whitespace characters (spaces, tabs, newlines) into a single space, and trims leading/trailing whitespace.

When to use: Use to defeat evasion attempts that pad text with extra spaces, tabs, or newlines, or to clean up text before passing it to scanners that might otherwise miss multi-space patterns. Single spaces between characters are preserved, not removed.

Constructor: CollapseWhitespaceNormalizer() — no parameters.

Example:

| Before | After |
| --- | --- |
| "s   a   r   a   h" | "s a r a h" |
| "hello\t\t\tworld" | "hello world" |
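
The described behavior (collapse runs, then trim) corresponds to a one-line regular expression in Python. The function name below is hypothetical, and the snippet is a local illustration, not the service's code.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace("hello\t\t\tworld"))    # hello world
print(collapse_whitespace("s   a   r   a   h"))   # s a r a h
```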