Arxivkb

by camopel

ArXivKB: Science Knowledge Base

Why This Skill?

🏠 100% local: crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 Semantic search on paper content: FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 arXiv category-based: tracks official arXiv categories (155 available, 8 groups). No free-text queries.

🧹 Auto-cleanup: configurable expiry deletes old papers, PDFs, and chunks.

Install

python3 scripts/install.py

Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.

Prerequisites

  • Ollama: must be installed and running (ollama serve)
  • Python 3.10+
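
One way to confirm the Ollama prerequisite before installing is to ask the local server which models it has. The sketch below is not part of this skill; it assumes Ollama's default endpoint and its `/api/tags` model listing route, and the helper name `ollama_has_model` is hypothetical.

```python
import json
import urllib.request


def ollama_has_model(base_url: str = "http://localhost:11434",
                     model: str = "nomic-embed-text",
                     timeout: float = 2.0) -> bool:
    """Return True if Ollama answers and lists `model` among its local models."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply:
        # treat all of these as "not ready".
        return False
    names = [m.get("name", "") for m in payload.get("models", [])]
    # Ollama reports names with a tag suffix such as "nomic-embed-text:latest".
    return any(n == model or n.startswith(model + ":") for n in names)


if __name__ == "__main__":
    print("ready" if ollama_has_model() else "start Ollama and pull nomic-embed-text")
```

A `False` result means either the server is not running or the model has not been pulled yet; the installer handles the pull itself.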

Quick Start

# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats

Categories

akb categories list                    # Show enabled categories
akb categories browse                  # Browse all 155 arXiv categories
akb categories browse robotics         # Filter by keyword
akb categories add cs.AI cs.RO         # Enable categories
akb categories delete cs.AI            # Disable a category

Categories are official arXiv codes (e.g. cs.AI, eess.IV, q-fin.ST). The full taxonomy is built in.
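
The keyword filter behind `akb categories browse robotics` can be pictured as a case-insensitive substring match over codes and names. This is a minimal sketch, not the skill's actual code: the `TAXONOMY` dict holds only three entries for illustration and the `browse` function is a hypothetical stand-in.

```python
# Three real arXiv codes with their official names; the built-in taxonomy has 155.
TAXONOMY = {
    "cs.AI": "Artificial Intelligence",
    "cs.CV": "Computer Vision and Pattern Recognition",
    "cs.RO": "Robotics",
}


def browse(keyword: str = "", taxonomy: dict[str, str] = TAXONOMY) -> list[tuple[str, str]]:
    """Return (code, name) pairs whose code or name contains `keyword`."""
    needle = keyword.casefold()
    return sorted(
        (code, name)
        for code, name in taxonomy.items()
        if needle in code.casefold() or needle in name.casefold()
    )


print(browse("robotics"))  # [('cs.RO', 'Robotics')]
```

An empty keyword matches everything, which mirrors the unfiltered `browse` command.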

Ingestion

akb ingest                    # Crawl, download PDFs, chunk, embed
akb ingest --days 14          # Look back 14 days
akb ingest --dry-run          # Preview only
akb ingest --no-pdf           # Index abstracts only (faster)

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
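
To make the chunking step concrete, here is a rough sketch of a sliding window with 500-token chunks and a 50-token overlap. It works on any token sequence, so the whitespace split in the demo is only a stand-in for tiktoken, and `chunk_tokens` is a hypothetical name rather than the function in pdf_processor.py.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 500, overlap: int = 50) -> list[Sequence[T]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts 450 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end; avoid a redundant tail
    return chunks


words = ("token " * 1200).split()  # stand-in for tiktoken output
print([len(c) for c in chunk_tokens(words)])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of roughly 10% more embeddings.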

Paper Details

akb paper 2401.12345    # Show title, abstract, categories, PDF status

Statistics

akb stats   # Papers, chunks, categories, DB size

Expiry & Cleanup

akb expire               # Delete papers older than 90 days (default)
akb expire --days 30     # Override: delete papers older than 30 days
akb expire --days 30 -y  # Skip confirmation

Configuration

No config file needed. Defaults:

| Setting | Default | Override |
|---------|---------|----------|
| Data directory | ~/workspace/arxivkb | ARXIVKB_DATA_DIR env or --data-dir |
| Ollama endpoint | http://localhost:11434 | None (hardcoded) |
| Embedding model | nomic-embed-text (768d) | None (hardcoded) |
| Chunk size | 500 tokens, 50 overlap | None |
| Expiry | 90 days | --days flag |

Data Layout

~/workspace/arxivkb/
├── arxivkb.db           # SQLite: papers, chunks, translations, categories
├── pdfs/                # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss    # FAISS IndexFlatIP (chunk embeddings)
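
IndexFlatIP performs exact inner-product search, and on L2-normalized vectors the inner product equals cosine similarity. The sketch below shows that idea in plain Python as a brute-force stand-in; it does not use FAISS, and `build_index` and `search` are hypothetical names.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale `vec` to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(vectors: list[list[float]]) -> list[list[float]]:
    """Normalize every stored vector once, at insert time."""
    return [l2_normalize(v) for v in vectors]


def search(index: list[list[float]], query: list[float], k: int = 3) -> list[tuple[int, float]]:
    """Return the k (position, score) pairs with the highest inner product."""
    q = l2_normalize(query)
    scored = [(i, sum(a * b for a, b in zip(q, v))) for i, v in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print([i for i, _ in search(index, [2.0, 0.0], k=2)])  # [0, 2]
```

The position of a vector in the index plays the role that `faiss_id` plays in the chunks table: it is the link back from a search hit to its text.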

DB Schema

  • papers: id, arxiv_id, title, abstract, categories, published, status, created_at
  • chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
  • translations: paper_id, language, abstract, created_at (PK: paper_id+language)
  • categories: code, description, group_name, enabled, added_at (155 entries)
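
The schema above could be expressed as SQLite DDL roughly as follows. Only the table names, column names, and the composite key on translations come from the list; the column types, defaults, foreign keys, and cascade rules are assumptions, and `open_db` is a hypothetical helper rather than the function in db.py.

```python
import sqlite3

# Column types and constraints are assumed; only the names come from the schema list.
SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    arxiv_id TEXT UNIQUE NOT NULL,
    title TEXT, abstract TEXT, categories TEXT, published TEXT, status TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    section TEXT, chunk_index INTEGER, text TEXT, faiss_id INTEGER,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE translations (
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    language TEXT NOT NULL, abstract TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (paper_id, language)
);
CREATE TABLE categories (
    code TEXT PRIMARY KEY, description TEXT, group_name TEXT,
    enabled INTEGER DEFAULT 0, added_at TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the four tables and enable foreign-key enforcement."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn
```

With cascading deletes, removing a paper row also removes its chunks and translations, which would keep an expiry pass to a single DELETE on the SQLite side.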

💬 Chat Commands (OpenClaw Agent)

When this skill is installed, the agent recognizes /akb as a shortcut:

| Command | Action |
|---------|--------|
| /akb list | Show enabled categories |
| /akb add cs.AI cs.RO | Enable categories for crawling |
| /akb remove cs.AI | Disable a category |
| /akb browse | Browse all 155 arXiv categories |
| /akb browse robotics | Filter categories by keyword |
| /akb stats | Show paper/chunk/category counts |
| /akb help | Show available commands |

The agent runs these via the akb CLI internally.

📱 PrivateApp Dashboard

A companion PWA dashboard is available. It provides:

  • Semantic search across paper content
  • Paper detail with abstract translation (on-demand via LLM)
  • Inline PDF viewing
  • Category browser
  • Stats (papers, chunks, categories)

Architecture

scripts/
├── cli.py             # CLI: categories, ingest, paper, stats, expire
├── db.py              # SQLite schema + CRUD
├── arxiv_crawler.py   # arXiv API search + PDF download
├── arxiv_taxonomy.py  # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py   # PDF text extraction + tiktoken chunking
├── embed.py           # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py     # FAISS IndexFlatIP manager
├── search.py          # Semantic search: query → FAISS → group by paper
└── install.py         # One-command installer
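
The "group by paper" step in search.py matters because several chunks of the same paper can match one query. One plausible way to collapse them, shown here as a sketch, is to keep each paper's best-scoring chunk. The `Hit` dataclass and `group_by_paper` function are hypothetical, and the real module may aggregate scores differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    paper_id: int
    chunk_index: int
    score: float  # inner-product similarity, higher is better


def group_by_paper(hits: list[Hit], limit: int = 10) -> list[Hit]:
    """Keep the best-scoring chunk per paper, ordered by descending score."""
    best: dict[int, Hit] = {}
    for hit in hits:
        current = best.get(hit.paper_id)
        if current is None or hit.score > current.score:
            best[hit.paper_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:limit]


raw = [Hit(1, 0, 0.80), Hit(2, 3, 0.90), Hit(1, 4, 0.95)]
print([(h.paper_id, h.chunk_index) for h in group_by_paper(raw)])  # [(1, 4), (2, 3)]
```

Keeping the winning chunk, rather than only the paper ID, lets a result list show the passage that matched.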