search

Arxiv Summarizer Orchestrator

Verified

by xukp20

Run the full pipeline by composing three sub-skills. 1. `arxiv-search-collector` 2. `arxiv-paper-processor` 3. `arxiv-batch-reporter` - `language`: manual language parameter used by all stages. Default is English when omitted. - `paper_processing_mode`: `subagent_parallel` or `serial`. - `max_parallel_papers`: default `5` when `paper_processing_mode=subagent_parallel`. 1. Initialize one run with `arxiv-search-collector/scripts/init_collection_run.py`. 2. Model generates multiple focused queries

View on GitHub

# ArXiv Summarizer Orchestrator

Run the full pipeline by composing three sub-skills.

## Sub-skill Order

1. arxiv-search-collector

2. arxiv-paper-processor

3. arxiv-batch-reporter

## Workflow Parameters

  • language: manual language parameter used by all stages. Default is English when omitted.
  • paper_processing_mode: subagent_parallel or serial.
  • max_parallel_papers: default 5 when paper_processing_mode=subagent_parallel.

## Workflow

### Stage A: Collection Setup + Query Retrieval

1. Initialize one run with arxiv-search-collector/scripts/init_collection_run.py.

2. Model generates multiple focused queries from original topic and writes a minimal query_plan.json (label + query only).

3. Run arxiv-search-collector/scripts/fetch_queries_batch.py with the plan file (recommended).

4. (Optional fallback) call arxiv-search-collector/scripts/fetch_query_metadata.py manually for one-by-one fetch.

5. Model reads each indexed query list and decides keep indexes.

6. Merge selected items with arxiv-search-collector/scripts/merge_selected_papers.py.

7. If relevance/coverage is still not good, iterate Stage A:

- generate another query plan with new labels,

- fetch again,

- re-merge with --incremental and updated selection-json.

- set weak labels to empty keep list ([]) to explicitly drop them.

Pass --language <LANG> to collector scripts so all generated markdown files in Stage A follow the selected language.

Use serial query fetch in Stage A with conservative controls (for example --min-interval-sec 5, --retry-max 4).

Default collector settings already include retries/backoff and run-local throttle state (<run_dir>/.runtime/arxiv_api_state.json), so manual tuning is usually unnecessary.

Prefer cache reuse (no --force) unless query parameters changed or data refresh is required.

Output: one run directory with per-paper metadata subdirectories.

### Stage B: Per-paper Artifact Download + Manual Summary

For each paper directory, invoke sub-skill arxiv-paper-processor once and let that skill produce <paper_dir>/summary.md.

Recommended pre-step for many papers:

1. Run one batch artifact download before per-paper reading:

python3 arxiv-paper-processor/scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language <LANG>

Per-paper execution steps (inside arxiv-paper-processor):

1. If <paper_dir>/summary.md already exists and is complete, skip this paper.

2. If usable source (source/source_extract/*.tex) or PDF (source/paper.pdf) already exists, skip download.

3. If artifacts are missing, download source with arxiv-paper-processor/scripts/download_arxiv_source.py.

4. If source is unusable, download PDF with arxiv-paper-processor/scripts/download_arxiv_pdf.py.

5. Model reads content and manually writes <paper_dir>/summary.md by reference format, in language.

Parallel strategy for many papers:

  • Default: paper_processing_mode=subagent_parallel with max_parallel_papers=5.
  • Optional: paper_processing_mode=serial to process one paper at a time.
  • In parallel mode, run multiple arxiv-paper-processor instances in batches; concurrent papers must not exceed max_parallel_papers.
  • Wait for one batch to finish before starting the next batch.
  • In serial mode, run exactly one arxiv-paper-processor instance at a time.
  • Subagent workers should only own one paper directory each to avoid file conflicts.
  • Do not use scripts to auto-compose summary text; scripts are download-only tools.

Output: all paper directories contain summary.md.

### Stage C: Bundle + Final Hierarchical Report

1. Run arxiv-batch-reporter/scripts/collect_summaries_bundle.py --language <LANG>.

2. Model reads summaries_bundle.md and writes collection_report_template.md in base dir.

3. In template, each paper leaf entry must include one standalone placeholder line: {{ARXIV_BRIEF:<arxiv_id>}}.

4. Run arxiv-batch-reporter/scripts/render_collection_report.py to generate final collection_report.md.

5. Do not manually paraphrase per-paper conclusion lines in final report; they must come from per-paper summary.md section 10 via script injection.

If language is non-English (for example Chinese), all intermediate markdown files and final reports should follow that language.

## Periodic Scheduling

This orchestrator is suitable for cron/scheduled execution in OpenClaw:

  • Frequency examples: daily, weekly, monthly.
  • For rolling windows, use lookback (1d, 7d, 30d) when initializing runs.

## Output Layout

<output-root>/<topic>-<timestamp>-<range>/

  • task_meta.json, task_meta.md
  • query_results/, query_selection/
  • <arxiv_id>/metadata.md + downloaded source/pdf + summary.md
  • summaries_bundle.md
  • collection_report_template.md
  • final rendered collection report (e.g. collection_report.md)

Use references/workflow-checklist.md as execution checklist.

## Related Skills

This is the top-level orchestration skill.

Before using it, install and enable these three sub-skills:

  • arxiv-search-collector
  • arxiv-paper-processor
  • arxiv-batch-reporter

Execution order inside this orchestrator:

1. arxiv-search-collector (Stage A)

2. arxiv-paper-processor (Stage B)

3. arxiv-batch-reporter (Stage C)