# ArXiv Paper Processor

Use this skill for per-paper manual summarization, with optional batch artifact download.

Single-paper mode: process one paper directory (e.g. <run_dir>/<arxiv_id>/).
Batch predownload mode: process many paper directories under one run dir before writing summaries.

## Language Parameter

Use a workflow language parameter (for example English or Chinese) and apply it manually.
The per-paper summary.md must be written in the selected language.
If download scripts are called directly, pass --language <LANG> for traceability.

## Core Principle

Scripts only fetch artifacts. The model performs reading and writing.

## Non-negotiable Constraint

Do not generate summary.md by script-based snippet extraction, regex harvesting, or template autofill.
Do not use Python/shell scripts to auto-compose section text from abstract/introduction fragments.
Scripts in this skill are only for artifact download (source/pdf) and trace logs.
The final summary.md must come from model-side reading and synthesis of the paper content.

## Optional Batch Artifact Download (Many Papers)

Use this first when Stage B has many papers:

python3 scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language English

Key behavior:

Supports --artifact source, --artifact pdf, or --artifact source_then_pdf (default).
Supports concurrency (--max-workers) and safe throttling/retry (--min-interval-sec, retry args).
Uses run-local throttle state by default (<run_dir>/.runtime/arxiv_download_state.json) to reduce 429 risk.
Skips papers that already have usable source/source_extract/*.tex or existing source/paper.pdf (unless --force).
Resume-friendly: if a paper already has a completed summary.md, you can skip that paper's summary-writing step.
Writes batch log to <run_dir>/download_batch_log.json by default.

## Step 1: Download Source (Preferred)

python3 scripts/download_arxiv_source.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English

This writes:

source/source_bundle.bin
source/source_extract/
source/download_source_log.json

If usable source already exists and --force is not set, the script reuses local artifacts.

## Step 2: If Needed, Download PDF

python3 scripts/download_arxiv_pdf.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English

This writes:

source/paper.pdf
source/download_pdf_log.json

If PDF already exists and --force is not set, the script reuses local artifacts.

## Step 3: Model Reads and Summarizes

1. If summary.md already exists and follows the required format, skip this paper and mark it complete.

2. Read metadata.md first.

3. If source/source_extract/ already exists with readable .tex files, use it directly.

4. Otherwise, if source/paper.pdf already exists, use PDF directly.

5. If neither exists, run download scripts (single-paper scripts or batch script) first.

6. Manually write summary.md in the same paper directory, in the selected language.

Do not rely on rule-based auto summarization.

Do not rely on auto-extracted snippets as the primary writing basis.

## Quality Requirement

Every section should include paper-specific details that are traceable to full-text reading.
Section 4/5/10 should reflect concrete method and evaluation details, not generic wording.
If key details are unclear in the source, explicitly note uncertainty instead of guessing.
Match the detail level shown in references/summary-example-en.md and references/summary-example-zh.md.
If your draft is clearly shorter or less specific than the examples, expand it before finishing.

## Required Output

<paper_dir>/summary.md in fixed section format.
Pay special attention to section ## 10. Brief Conclusion: write a 3-4 sentence mini-conclusion that covers contribution, method, evaluation setup, and results with paper-specific details.
In section ## 1. Paper Snapshot, use exact keys: ArXiv ID, Title, Authors, Publish date, Primary category, Reading basis.
Do not use key variants such as Reading source, Author list, Published on, or lowercase key names.

See references/summary-format.md for exact section requirements.

## Related Skills

This skill is a sub-skill of arxiv-summarizer-orchestrator.

Pipeline position:

1. Step 1 (upstream): arxiv-search-collector produces the selected paper directories and metadata.

2. Step 2 (this skill): arxiv-paper-processor downloads artifacts and writes one summary.md per paper.

3. Step 3 (downstream): arxiv-batch-reporter uses these per-paper summaries to generate the final collection report.

Use this skill together with Step 1 and Step 3 for full end-to-end execution.

More Research Skills