ArXivKB: Science Knowledge Base
Why This Skill?
100% local: crawls arXiv's free API, embeds with Ollama (nomic-embed-text), and indexes in FAISS + SQLite. No cloud cost.
Semantic search on paper content: FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.
arXiv category-based: tracks official arXiv categories (155 available, 8 groups). No free-text queries.
Auto-cleanup: configurable expiry deletes old papers, PDFs, and chunks.
Install
python3 scripts/install.py
Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.
Prerequisites
- Ollama: must be installed and running (ollama serve)
- Python 3.10+
Quick Start
# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG
# 2. Browse all available categories
akb categories browse
# 3. Ingest recent papers (last 7 days)
akb ingest
# 4. Check stats
akb stats
Categories
akb categories list # Show enabled categories
akb categories browse # Browse all 155 arXiv categories
akb categories browse robotics # Filter by keyword
akb categories add cs.AI cs.RO # Enable categories
akb categories delete cs.AI # Disable a category
Categories are official arXiv codes (e.g. cs.AI, eess.IV, q-fin.ST). The full taxonomy is built in.
Ingestion
akb ingest # Crawl, download PDFs, chunk, embed
akb ingest --days 14 # Look back 14 days
akb ingest --dry-run # Preview only
akb ingest --no-pdf # Index abstracts only (faster)
Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
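The chunking step can be illustrated with a minimal sketch. It assumes a sliding window of 500 tokens that advances by 450 tokens, so consecutive chunks share 50 tokens. The function name `chunk_tokens` is hypothetical, and plain integers stand in for the token IDs that a tiktoken encoding would produce; the actual pdf_processor.py may differ.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token sequence into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, each window starts 450 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Stand-in token IDs (assumption): the real pipeline would encode PDF text first.
tokens = list(range(1200))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.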
Paper Details
akb paper 2401.12345 # Show title, abstract, categories, PDF status
Statistics
akb stats # Papers, chunks, categories, DB size
Expiry & Cleanup
akb expire # Delete papers older than 90 days (default)
akb expire --days 30 # Override: delete papers older than 30 days
akb expire --days 30 -y # Skip confirmation
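As a rough illustration of what expiry involves on the SQLite side, the sketch below deletes papers older than a cutoff together with their chunks. It is an assumption-laden sketch: the function name `expire_papers` is hypothetical, age is assumed to be measured from created_at stored as an ISO 8601 UTC string, and the removal of PDF files and FAISS vectors is left out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_papers(conn, days=90, now=None):
    """Delete papers (and their chunks) older than `days`; return how many papers were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat()
    # Assumption: created_at holds ISO 8601 UTC strings, so string comparison orders by time.
    old_ids = [row[0] for row in conn.execute(
        "SELECT id FROM papers WHERE created_at < ?", (cutoff,))]
    for paper_id in old_ids:
        conn.execute("DELETE FROM chunks WHERE paper_id = ?", (paper_id,))
        conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    conn.commit()
    return len(old_ids)


# Demo on an in-memory database with a reduced, illustrative schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT, created_at TEXT)")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, paper_id INTEGER, text TEXT)")
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for paper_id, age_days in [(1, 120), (2, 10)]:
    created = (now - timedelta(days=age_days)).isoformat()
    conn.execute("INSERT INTO papers VALUES (?, ?, ?)", (paper_id, f"paper-{paper_id}", created))
    conn.execute("INSERT INTO chunks (paper_id, text) VALUES (?, ?)", (paper_id, "example chunk"))

removed = expire_papers(conn, days=90, now=now)
remaining = [row[0] for row in conn.execute("SELECT id FROM papers")]
print(removed, remaining)  # 1 [2]
```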
Configuration
No config file needed. Defaults:
| Setting | Default | Override |
|---------|---------|----------|
| Data directory | ~/workspace/arxivkb | ARXIVKB_DATA_DIR env or --data-dir |
| Ollama endpoint | http://localhost:11434 | None (hardcoded) |
| Embedding model | nomic-embed-text (768d) | None (hardcoded) |
| Chunk size | 500 tokens, 50 overlap | None |
| Expiry | 90 days | --days flag |
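One way the data directory override could be resolved is sketched below. The precedence (flag first, then ARXIVKB_DATA_DIR, then the default) is an assumption, since the table lists both overrides without ranking them, and `resolve_data_dir` is a hypothetical helper rather than a function from the akb CLI.

```python
import argparse
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/workspace/arxivkb"


def resolve_data_dir(argv=None, environ=None):
    """Return the data directory: --data-dir, then ARXIVKB_DATA_DIR, then the default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--data-dir")
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: the command-line flag wins over the environment variable.
    chosen = args.data_dir or environ.get("ARXIVKB_DATA_DIR") or DEFAULT_DATA_DIR
    return Path(chosen).expanduser()


print(resolve_data_dir([], {}))
print(resolve_data_dir([], {"ARXIVKB_DATA_DIR": "/tmp/akb"}))
print(resolve_data_dir(["--data-dir", "/srv/akb"], {"ARXIVKB_DATA_DIR": "/tmp/akb"}))
```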
Data Layout
~/workspace/arxivkb/
├── arxivkb.db          # SQLite: papers, chunks, translations, categories
├── pdfs/               # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss   # FAISS IndexFlatIP (chunk embeddings)
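FAISS IndexFlatIP is an exact, brute-force index that ranks stored vectors by inner product with the query. When vectors are normalized to unit length, that inner product equals cosine similarity. The pure-Python sketch below mimics the computation on toy 3-dimensional vectors; it does not call FAISS, and the helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_ip_search(index, query, k=3):
    """Score every stored vector by inner product and return the top-k (score, id) pairs."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), i) for i, vec in enumerate(index)]
    return sorted(scored, reverse=True)[:k]


# Toy vectors stand in for the 768-dimensional chunk embeddings.
index = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0])]
results = flat_ip_search(index, [1.0, 0.05, 0.0], k=2)
print([i for _score, i in results])  # [0, 2]
```

A flat index scans every vector on each query, which is simple and exact but grows linearly with the number of chunks.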
DB Schema
- papers: id, arxiv_id, title, abstract, categories, published, status, created_at
- chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
- translations: paper_id, language, abstract, created_at (PK: paper_id+language)
- categories: code, description, group_name, enabled, added_at (155 entries)
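The schema above could be expressed in SQLite roughly as follows. Table and column names come from the list; the column types, the foreign-key references, and the default for enabled are assumptions, so the DDL in db.py may well differ.

```python
import sqlite3

# Column types and REFERENCES clauses are assumptions; names follow the schema list.
SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    arxiv_id TEXT, title TEXT, abstract TEXT, categories TEXT,
    published TEXT, status TEXT, created_at TEXT
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    paper_id INTEGER REFERENCES papers(id),
    section TEXT, chunk_index INTEGER, text TEXT,
    faiss_id INTEGER, created_at TEXT
);
CREATE TABLE translations (
    paper_id INTEGER REFERENCES papers(id),
    language TEXT, abstract TEXT, created_at TEXT,
    PRIMARY KEY (paper_id, language)
);
CREATE TABLE categories (
    code TEXT PRIMARY KEY,
    description TEXT, group_name TEXT,
    enabled INTEGER DEFAULT 0, added_at TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['categories', 'chunks', 'papers', 'translations']
```

The composite primary key on translations allows one stored translation per paper and language.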
Chat Commands (OpenClaw Agent)
When this skill is installed, the agent recognizes /akb as a shortcut:
| Command | Action |
|---------|--------|
| /akb list | Show enabled categories |
| /akb add cs.AI cs.RO | Enable categories for crawling |
| /akb remove cs.AI | Disable a category |
| /akb browse | Browse all 155 arXiv categories |
| /akb browse robotics | Filter categories by keyword |
| /akb stats | Show paper/chunk/category counts |
| /akb help | Show available commands |
The agent runs these via the akb CLI internally.
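The translation from chat commands to CLI invocations might look like the sketch below. The mapping is inferred by matching the chat table against the CLI commands documented earlier (for example, /akb remove corresponding to akb categories delete); it is an assumption, not the agent's actual dispatch code, and `to_cli_argv` is a hypothetical name.

```python
import shlex

# Assumed mapping from chat subcommands to CLI arguments. "/akb help" is left out
# because no CLI equivalent is documented for it.
CHAT_TO_CLI = {
    "list": ["categories", "list"],
    "add": ["categories", "add"],
    "remove": ["categories", "delete"],
    "browse": ["categories", "browse"],
    "stats": ["stats"],
}


def to_cli_argv(message):
    """Translate a chat message such as '/akb add cs.AI' into an akb argument list."""
    parts = shlex.split(message)
    if len(parts) < 2 or parts[0] != "/akb" or parts[1] not in CHAT_TO_CLI:
        raise ValueError(f"unsupported command: {message!r}")
    return ["akb", *CHAT_TO_CLI[parts[1]], *parts[2:]]


print(to_cli_argv("/akb add cs.AI cs.RO"))
print(to_cli_argv("/akb remove cs.AI"))
```

Returning an argument list rather than a shell string lets the caller pass it to `subprocess.run` without shell quoting concerns.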
PrivateApp Dashboard
A companion PWA dashboard is available. It provides:
- Semantic search across paper content
- Paper detail with abstract translation (on-demand via LLM)
- Inline PDF viewing
- Category browser
- Stats (papers, chunks, categories)
Architecture
scripts/
├── cli.py              # CLI: categories, ingest, paper, stats, expire
├── db.py               # SQLite schema + CRUD
├── arxiv_crawler.py    # arXiv API search + PDF download
├── arxiv_taxonomy.py   # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py    # PDF text extraction + tiktoken chunking
├── embed.py            # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py      # FAISS IndexFlatIP manager
├── search.py           # Semantic search: query → FAISS → group by paper
└── install.py          # One-command installer
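The "group by paper" step in search.py can be pictured as collapsing chunk-level hits into paper-level results. The sketch below keeps each paper's best chunk score. That aggregation rule, the function name `group_by_paper`, and the shape of the inputs are assumptions made for illustration.

```python
def group_by_paper(hits, chunk_to_paper, top_n=5):
    """Collapse (faiss_id, score) hits into (paper_id, best_score) pairs, best first."""
    best = {}
    for faiss_id, score in hits:
        paper_id = chunk_to_paper.get(faiss_id)
        if paper_id is None:
            continue  # skip hits whose chunk has no matching database row
        if paper_id not in best or score > best[paper_id]:
            best[paper_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical search hits and a faiss_id -> paper lookup, as the chunks table could provide.
hits = [(10, 0.91), (11, 0.88), (20, 0.90), (30, 0.42)]
chunk_to_paper = {10: "paper-A", 11: "paper-A", 20: "paper-B", 30: "paper-C"}
print(group_by_paper(hits, chunk_to_paper, top_n=2))  # [('paper-A', 0.91), ('paper-B', 0.9)]
```

Grouping prevents a single long paper with many matching chunks from filling the whole result list.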