File Parsing

Extract text from files and scan it without installing any parsing libraries in your application. All text extraction runs server-side.

What File Parsing Does

guard.file_parser.parse() accepts a file (as a string path, a Path object, raw bytes, or a binary file-like object), sends it to the Meshulash security server, and returns the extracted text as a plain str. You can then pass that text to any of the usual scan methods.

The SDK itself has no parsing dependencies, so your application does not need PDF or DOCX libraries installed.

Quick Example

from meshulash_guard import Guard, Action
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS, PIILabel.PHONE_NUMBER], action=Action.REPLACE)

# Extract text from a PDF, then scan it
parsed_text = guard.file_parser.parse("report.pdf")
result = guard.scan_input(parsed_text, scanners=[pii])

print(result.status)          # "secured" if PII was found
print(result.processed_text)  # text with PII replaced

Expected output:

secured
Report prepared by [EMAIL_ADDRESS-A1B2]. For inquiries call [PHONE_NUMBER-C3D4].

Supported Formats

Format	Extension	Notes
PDF	`.pdf`	Text extraction; scanned pages use OCR as fallback
Plain text	`.txt`	Encoding auto-detected
CSV	`.csv`	All cell contents extracted
JSON	`.json`	All string values extracted recursively
DOCX	`.docx`	Body text including table contents
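
To make the JSON row concrete, here is a rough local approximation of what "all string values extracted recursively" could look like. It is only an illustration: iter_string_values is a hypothetical helper, not part of the SDK, and the server's actual output (ordering, separators, whether keys are included) is not specified on this page.

```python
import json


def iter_string_values(node):
    """Yield every string value found anywhere inside a decoded JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        # Only values are visited; keys are skipped in this sketch.
        for value in node.values():
            yield from iter_string_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_string_values(item)
    # Numbers, booleans, and null carry no text and are ignored.


document = json.loads(
    '{"user": {"email": "dana@example.com", "age": 41},'
    ' "notes": ["call back", {"phone": "555-0100"}]}'
)
extracted = list(iter_string_values(document))
print(extracted)  # ['dana@example.com', 'call back', '555-0100']
```

Nested values such as the phone number inside the notes list are exactly the kind of content a scanner would miss if only top-level fields were read.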

File Input Types

parse() accepts four input types:

from pathlib import Path

# 1. String path — format auto-detected from extension
text = guard.file_parser.parse("documents/report.pdf")

# 2. Path object — same as string path
text = guard.file_parser.parse(Path("documents/report.pdf"))

# 3. Bytes — format must be specified explicitly (no extension to detect from)
with open("report.pdf", "rb") as f:
    file_bytes = f.read()
text = guard.file_parser.parse(file_bytes, format="pdf")

# 4. File-like object (BinaryIO) — format auto-detected from name attribute if available
with open("report.pdf", "rb") as f:
    text = guard.file_parser.parse(f)

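When passing raw bytes, the format parameter is required because there is no filename from which to detect the extension. Omitting it raises UnsupportedFormatError. A file-like object without a name attribute (such as io.BytesIO) likewise carries no extension to detect from.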
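
For uploads that arrive as bytes in memory, one option is to derive the format from the upload's original filename and pass it explicitly. The following is a minimal sketch, not SDK code: format_from_filename and parse_upload are hypothetical helpers, and the format strings are assumed to be the lowercase extensions from the Supported Formats table (only "pdf" is confirmed by the example above).

```python
from pathlib import PurePath

# Assumption: valid format strings match the extensions in the Supported Formats table.
SUPPORTED_FORMATS = {"pdf", "txt", "csv", "json", "docx"}


def format_from_filename(filename):
    """Return the lowercase extension without the dot, or None if it is missing or unsupported."""
    suffix = PurePath(filename).suffix.lower().lstrip(".")
    return suffix if suffix in SUPPORTED_FORMATS else None


def parse_upload(parse, filename, data):
    """Parse in-memory bytes, taking the format from the original filename.

    `parse` is any callable that behaves like guard.file_parser.parse
    when given bytes and a format= keyword.
    """
    fmt = format_from_filename(filename)
    if fmt is None:
        raise ValueError(f"cannot determine a supported format from {filename!r}")
    return parse(data, format=fmt)
```

With a real Guard instance, this would be called as parse_upload(guard.file_parser.parse, upload_name, upload_bytes). Passing the parser in as an argument keeps the helper easy to test with a stand-in function.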

Size Limits

The default maximum file size is 10 MB. The SDK checks the file size before sending anything to the server, so an oversized file raises FileParseError immediately, without a network call.
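
If you want to reject oversized uploads with your own message before calling parse(), a similar check is easy to run in application code. This is a sketch under stated assumptions: payload_size and within_size_limit are hypothetical helpers, "10 MB" is read as 10 * 1024 * 1024 bytes, and the limit is treated as inclusive. The SDK's own threshold may be defined differently.

```python
import os

# Assumption: "10 MB" means 10 * 1024 * 1024 bytes; the SDK may count differently.
MAX_FILE_BYTES = 10 * 1024 * 1024


def payload_size(source):
    """Return the size in bytes of raw bytes, or of the file at a str / os.PathLike path."""
    if isinstance(source, (bytes, bytearray)):
        return len(source)
    return os.path.getsize(source)


def within_size_limit(source, limit=MAX_FILE_BYTES):
    """Return True when the payload is no larger than the limit (inclusive, by assumption)."""
    return payload_size(source) <= limit
```

A check like this does not replace the SDK's own validation. It only lets the application fail early with a message of its choosing.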

Error Handling

Exception	When Raised
`FileParseError`	Base class for all file parsing errors. Also raised when the file exceeds the size limit or cannot be read.
`PasswordProtectedError`	The PDF or DOCX file is password-protected and cannot be parsed.
`UnsupportedFormatError`	The file format is not supported, or the format cannot be determined (e.g., raw bytes with no `format=` argument).

PasswordProtectedError and UnsupportedFormatError both inherit from FileParseError, so except FileParseError catches all three.
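
Because of this inheritance, the order of except clauses matters: the specific subclasses have to be listed before FileParseError, or the base-class clause will catch everything first. The sketch below demonstrates this with stand-in classes that only mirror the hierarchy described here. In real code, import the actual exceptions from meshulash_guard instead; describe is a hypothetical helper.

```python
# Stand-in classes mirroring the documented hierarchy (not the SDK's own classes).
class FileParseError(Exception):
    pass


class PasswordProtectedError(FileParseError):
    pass


class UnsupportedFormatError(FileParseError):
    pass


def describe(exc):
    """Map an exception to a short label, checking subclasses before the base class."""
    try:
        raise exc
    except PasswordProtectedError:
        return "password-protected"
    except UnsupportedFormatError:
        return "unsupported format"
    except FileParseError:
        # Reached only for errors that are not one of the subclasses above,
        # such as an oversized or unreadable file.
        return "other parse failure"


print(describe(PasswordProtectedError()))     # password-protected
print(describe(UnsupportedFormatError()))     # unsupported format
print(describe(FileParseError("too large")))  # other parse failure
```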

from meshulash_guard import Action, Guard, FileParseError, PasswordProtectedError, UnsupportedFormatError
from meshulash_guard.scanners import PIIScanner, PIILabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)

try:
    parsed_text = guard.file_parser.parse("confidential.pdf")
    result = guard.scan_input(parsed_text, scanners=[pii])
    print(result.status)

except PasswordProtectedError:
    print("File is password-protected — cannot extract text")

except UnsupportedFormatError as e:
    print(f"Unsupported file format: {e}")

except FileParseError as e:
    print(f"File parsing failed: {e}")

Expected output (if the file is clean):

clean

Integration with Scanning

A complete workflow that parses a file, then scans the extracted text with normalizers applied:

from meshulash_guard import Guard, Action
from meshulash_guard import InvisibleTextNormalizer, UnicodeNormalizer
from meshulash_guard import FileParseError, PasswordProtectedError, UnsupportedFormatError
from meshulash_guard.scanners import PIIScanner, PIILabel, ToxicityScanner, ToxicityLabel

guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")

pii = PIIScanner(
    labels=[PIILabel.EMAIL_ADDRESS, PIILabel.PHONE_NUMBER],
    action=Action.REPLACE,
)
toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY],
    action=Action.BLOCK,
)

try:
    # Step 1: Extract text from file
    parsed_text = guard.file_parser.parse("user-upload.docx")

    # Step 2: Scan with normalizers to defeat encoding evasion
    result = guard.scan_input(
        parsed_text,
        scanners=[pii, toxicity],
        normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer()],
    )

    if result.status == "blocked":
        print("File content rejected — contains toxic content")
    elif result.status == "secured":
        print("File scanned — PII redacted")
        print(result.processed_text)
    else:
        print("File is clean")
        print(result.processed_text)

except PasswordProtectedError:
    print("Cannot process password-protected files")
except UnsupportedFormatError:
    print("Unsupported file format")
except FileParseError as e:
    print(f"Could not parse file: {e}")

For the full list of exceptions the SDK can raise during scanning, see Concepts → Exceptions.