PDF Reader Skill
Extract and process content from PDF documents.
Basic Text Extraction
python3 -c "
import subprocess
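# The trailing '-' makes pdftotext write the text to stdout instead of a file
result = subprocess.run(['pdftotext', '{file}', '-'], capture_output=True, text=True)
# Print only the first 5000 characters to keep the output manageable
print(result.stdout[:5000])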
"
Page-by-Page Extraction
python3 -c "
import subprocess
# -f and -l set the first and last page to convert (page numbers start at 1)
result = subprocess.run(['pdftotext', '-f', '{first_page}', '-l', '{last_page}', '{file}', '-'], capture_output=True, text=True)
print(result.stdout)
"
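The page-range command above can also be assembled by a small helper that validates the range before calling pdftotext. This is a minimal sketch: `build_pdftotext_cmd` and its parameter names are hypothetical and not part of the skill; only the `-f`, `-l`, and `-layout` flags come from the commands in this document.

```python
from typing import List, Optional


def build_pdftotext_cmd(
    pdf_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    layout: bool = False,
) -> List[str]:
    """Build a pdftotext argument list (hypothetical helper, for illustration)."""
    if first_page is not None and first_page < 1:
        raise ValueError("first_page must be >= 1 (pdftotext pages are 1-based)")
    if first_page is not None and last_page is not None and last_page < first_page:
        raise ValueError("last_page must be >= first_page")

    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # '-' sends the text to stdout
    return cmd
```

The returned list can be passed straight to `subprocess.run(cmd, capture_output=True, text=True)`, which avoids shell quoting problems with file names that contain spaces.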
PDF Metadata
python3 -c "
import subprocess
# pdfinfo prints document metadata such as title, author, and page count
result = subprocess.run(['pdfinfo', '{file}'], capture_output=True, text=True)
print(result.stdout)
"
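Because pdfinfo prints one `Key: value` pair per line, its output can be turned into a dictionary, which makes the page count easy to read before choosing a page range. The sketch below assumes that line format; `parse_pdfinfo` is a hypothetical helper and the sample text is illustrative, not output from a real file.

```python
from typing import Dict


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Parse 'Key: value' lines from pdfinfo output into a dict."""
    info: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates stay intact
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            info[key.strip()] = value.strip()
    return info


# Illustrative sample, not taken from a real document
sample = (
    "Title:          Annual Report\n"
    "Pages:          12\n"
    "Page size:      612 x 792 pts (letter)\n"
)
info = parse_pdfinfo(sample)
page_count = int(info["Pages"])
```

In practice `sample` would be replaced by `result.stdout` from the pdfinfo command above.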
Table Extraction
python3 -c "
import subprocess
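# -layout keeps the physical layout of the page, so table columns stay aligned
result = subprocess.run(['pdftotext', '-layout', '{file}', '-'], capture_output=True, text=True)
# Print only the first 5000 characters to keep the output manageable
print(result.stdout[:5000])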
"
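Layout-preserved text keeps columns apart with runs of spaces, so a rough way to recover table cells is to split each line on two or more consecutive spaces. This is a heuristic sketch, not a full table parser: `split_layout_rows` is a hypothetical helper, and it will misread rows with empty cells or columns separated by a single space.

```python
import re
from typing import List


def split_layout_rows(text: str) -> List[List[str]]:
    """Split layout-preserved text into rows of cells (heuristic)."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between rows
        # Two or more whitespace characters are treated as a column gap
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


# Illustrative sample, not taken from a real document
sample = (
    "Item         Qty    Price\n"
    "Blue widget    2     9.50\n"
    "\n"
    "Gasket        10     1.25\n"
)
rows = split_layout_rows(sample)
```

Single spaces inside a cell, as in "Blue widget", are kept, which is why the split requires at least two whitespace characters.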
Guidelines
- Try pdftotext first: it's fast and handles most PDFs well
- Use the -layout flag to preserve table formatting
- For scanned PDFs, note that OCR is needed (not supported by pdftotext)
- Extract specific page ranges for large documents
- Summarize extracted content rather than dumping entire documents
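Two of the guidelines above lend themselves to small helpers: splitting a large document into page ranges, and flagging a PDF that is probably scanned because almost no text came back. The sketch below is one possible approach; the function names, the chunk size of 10, and the threshold of 20 characters per page are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def page_ranges(total_pages: int, chunk_size: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based (first_page, last_page) pairs covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def looks_scanned(text: str, page_count: int, min_chars_per_page: int = 20) -> bool:
    """Heuristic: very little extractable text suggests image-only pages."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars_per_page * max(page_count, 1)
```

Each pair from `page_ranges` maps onto the `-f` and `-l` arguments of pdftotext. When `looks_scanned` returns True, the right response is to report that OCR is needed instead of returning the near-empty text.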