File Parsing
Extract text from files and scan it — no parsing libraries needed in your application. All text extraction runs server-side.
What File Parsing Does
guard.file_parser.parse() accepts a file (as a path, bytes, or file-like object), sends it to the Meshulash security server, and returns the extracted text as a plain str. You can then pass that text to any of the usual scan methods.
The SDK has zero parsing dependencies — your application does not need PDF or DOCX libraries installed. All heavy parsing work happens server-side.
Quick Example
from meshulash_guard import Guard, Action
from meshulash_guard.scanners import PIIScanner, PIILabel
guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS, PIILabel.PHONE_NUMBER], action=Action.REPLACE)
# Extract text from a PDF, then scan it
parsed_text = guard.file_parser.parse("report.pdf")
result = guard.scan_input(parsed_text, scanners=[pii])
print(result.status) # "secured" if PII was found
print(result.processed_text) # text with PII replaced
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text extraction; scanned pages use OCR as fallback |
| Plain text | .txt | Encoding auto-detected |
| CSV | .csv | All cell contents extracted |
| JSON | .json | All string values extracted recursively |
| DOCX | .docx | Body text including table contents |
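When handling a batch of uploads, it can help to set aside files whose extension is not in the table above before calling parse(). The following is a minimal sketch, assuming the five listed extensions are the complete set. SUPPORTED_EXTENSIONS and split_by_support are hypothetical names, not part of the SDK, and the case-insensitive comparison is a choice made for this sketch rather than documented SDK behavior.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table (assumed to be complete).
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".csv", ".json", ".docx"}


def split_by_support(paths):
    """Partition paths into (supported, unsupported) lists by file extension."""
    supported, unsupported = [], []
    for raw in paths:
        path = Path(raw)
        # Compare case-insensitively so "notes.TXT" counts as plain text.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, skipped = split_by_support(["report.pdf", "notes.TXT", "photo.png", "README"])
print([p.name for p in ok])       # ['report.pdf', 'notes.TXT']
print([p.name for p in skipped])  # ['photo.png', 'README']
```

Files in the second list can be reported to the user up front instead of waiting for UnsupportedFormatError.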
File Input Types
parse() accepts four input types:
from pathlib import Path
# 1. String path — format auto-detected from extension
text = guard.file_parser.parse("documents/report.pdf")
# 2. Path object — same as string path
text = guard.file_parser.parse(Path("documents/report.pdf"))
# 3. Bytes — format must be specified explicitly (no extension to detect from)
with open("report.pdf", "rb") as f:
    file_bytes = f.read()
text = guard.file_parser.parse(file_bytes, format="pdf")
# 4. File-like object (BinaryIO) — format auto-detected from name attribute if available
with open("report.pdf", "rb") as f:
    text = guard.file_parser.parse(f)
When passing raw bytes, the format parameter is required because there is no filename from which to detect the extension.
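The detection rules described above can be pictured with a small stand-alone function. This is only an illustration of the documented behavior, not the SDK's actual implementation: resolve_format is a hypothetical name, and it raises ValueError where the SDK would raise UnsupportedFormatError.

```python
import io
from pathlib import Path


def resolve_format(source, format=None):
    """Mimic the documented rules for deciding a file's format."""
    if format is not None:
        # An explicit format always wins.
        return format.lower().lstrip(".")
    if isinstance(source, (str, Path)):
        name = str(source)
    elif isinstance(source, (bytes, bytearray)):
        # Raw bytes carry no filename, so there is nothing to inspect.
        raise ValueError("format is required when passing raw bytes")
    else:
        # File-like object: fall back to its name attribute, if it has one.
        name = getattr(source, "name", None)
        if not isinstance(name, str):
            raise ValueError("file-like object has no usable name; pass format")
    suffix = Path(name).suffix
    if not suffix:
        raise ValueError(f"cannot detect format from {name!r}")
    return suffix.lower().lstrip(".")


print(resolve_format("documents/report.pdf"))      # pdf
print(resolve_format(Path("data.JSON")))           # json
print(resolve_format(b"%PDF-1.7", format="pdf"))   # pdf

# An in-memory buffer has no name by default; one can be attached by hand.
buffer = io.BytesIO(b"a,b\n1,2\n")
buffer.name = "table.csv"
print(resolve_format(buffer))                      # csv
```

The last case matters in practice: an io.BytesIO buffer has no name attribute unless one is set, so in-memory streams generally need either a name or an explicit format.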
Size Limits
The default maximum file size is 10 MB. The SDK checks the file size before sending it to the server — oversized files raise FileParseError immediately without making a network call.
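The SDK enforces this limit itself, but an application may want to check earlier, for example to reject an upload with its own message. The following is a minimal sketch, assuming "10 MB" means 10 × 1024 × 1024 bytes (the exact byte count is not stated here). MAX_FILE_SIZE, size_in_bytes, and within_limit are hypothetical names, not SDK members.

```python
import io
import os
from pathlib import Path

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_FILE_SIZE = 10 * 1024 * 1024


def size_in_bytes(source):
    """Return the size of a path, a bytes object, or a seekable binary stream."""
    if isinstance(source, (str, Path)):
        return os.path.getsize(source)
    if isinstance(source, (bytes, bytearray)):
        return len(source)
    # Seekable stream: jump to the end to measure, then restore the position.
    position = source.tell()
    size = source.seek(0, os.SEEK_END)
    source.seek(position)
    return size


def within_limit(source, limit=MAX_FILE_SIZE):
    """True if the source is no larger than the limit."""
    return size_in_bytes(source) <= limit


print(within_limit(b"x" * 1024))                  # True
print(within_limit(b"x" * (MAX_FILE_SIZE + 1)))   # False

stream = io.BytesIO(b"abcdef")
stream.read(2)                                    # move the read position
print(size_in_bytes(stream), stream.tell())       # 6 2
```

Restoring the stream position matters because the same object is usually handed to parse() right after the check.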
Error Handling
| Exception | When Raised |
|---|---|
| FileParseError | Base class for all file parsing errors. Also raised when the file exceeds the size limit or cannot be read. |
| PasswordProtectedError | The PDF or DOCX file is password-protected and cannot be parsed. |
| UnsupportedFormatError | The file format is not supported, or the format cannot be determined (e.g., raw bytes with no format= argument). |
PasswordProtectedError and UnsupportedFormatError both inherit from FileParseError, so except FileParseError catches all three.
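Because Python tries except clauses in order, the two subclasses have to be listed before FileParseError or their handlers will never run. The sketch below shows this with local stand-in classes that mirror the documented hierarchy, so it runs without the SDK installed. The describe helper is hypothetical; real code would import the exception classes from meshulash_guard instead of defining them.

```python
# Local stand-ins that mirror the documented hierarchy (not the SDK classes).
class FileParseError(Exception):
    pass


class PasswordProtectedError(FileParseError):
    pass


class UnsupportedFormatError(FileParseError):
    pass


def describe(error):
    """Map a parsing error to a short label, most specific handler first."""
    try:
        raise error
    except PasswordProtectedError:
        return "password-protected"
    except UnsupportedFormatError:
        return "unsupported format"
    except FileParseError:
        # Catches the base class and any subclass not handled above.
        return "parse failure"


print(describe(PasswordProtectedError()))        # password-protected
print(describe(UnsupportedFormatError()))        # unsupported format
print(describe(FileParseError("too large")))     # parse failure
print(issubclass(PasswordProtectedError, FileParseError))  # True
```

If the FileParseError clause came first, all three calls would return "parse failure".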
from meshulash_guard import Action, FileParseError, Guard, PasswordProtectedError, UnsupportedFormatError
from meshulash_guard.scanners import PIIScanner, PIILabel
guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(labels=[PIILabel.EMAIL_ADDRESS], action=Action.REPLACE)
try:
    parsed_text = guard.file_parser.parse("confidential.pdf")
    result = guard.scan_input(parsed_text, scanners=[pii])
    print(result.status)
except PasswordProtectedError:
    print("File is password-protected; cannot extract text")
except UnsupportedFormatError as e:
    print(f"Unsupported file format: {e}")
except FileParseError as e:
    print(f"File parsing failed: {e}")
Integration with Scanning
A complete workflow — parse a file, normalize the text, and scan it:
from meshulash_guard import Guard, Action
from meshulash_guard import InvisibleTextNormalizer, UnicodeNormalizer
from meshulash_guard import FileParseError, PasswordProtectedError, UnsupportedFormatError
from meshulash_guard.scanners import PIIScanner, PIILabel, ToxicityScanner, ToxicityLabel
guard = Guard(api_key="your-api-key", tenant_id="your-tenant-id")
pii = PIIScanner(
    labels=[PIILabel.EMAIL_ADDRESS, PIILabel.PHONE_NUMBER],
    action=Action.REPLACE,
)
toxicity = ToxicityScanner(
    labels=[ToxicityLabel.TOXICITY],
    action=Action.BLOCK,
)
try:
    # Step 1: Extract text from file
    parsed_text = guard.file_parser.parse("user-upload.docx")
    # Step 2: Scan with normalizers to defeat encoding evasion
    result = guard.scan_input(
        parsed_text,
        scanners=[pii, toxicity],
        normalizers=[InvisibleTextNormalizer(), UnicodeNormalizer()],
    )
    if result.status == "blocked":
        print("File content rejected: contains toxic content")
    elif result.status == "secured":
        print("File scanned: PII redacted")
        print(result.processed_text)
    else:
        print("File is clean")
        print(result.processed_text)
except PasswordProtectedError:
    print("Cannot process password-protected files")
except UnsupportedFormatError:
    print("Unsupported file format")
except FileParseError as e:
    print(f"Could not parse file: {e}")
For the full list of exceptions the SDK can raise during scanning, see Concepts → Exceptions.