
arxiv-database

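A preprint search tool built on the arXiv Atom API. It supports queries by keyword, author, category, date range, and arXiv ID, and returns titles, abstracts, authors, categories, and links as JSON. It is suited to building literature review datasets, tracking new submissions in a subfield, or batch-downloading PDFs for full-text analysis.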

Search and retrieve preprints from arXiv via the Atom API. Use this skill when searching for papers in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, or economics by keywords, authors, arXiv IDs, date ranges, or categories.

Repo: Chanw-research/claude-code-paper-writing
skills.sh: /Chanw-research/claude-code-paper-writing/arxiv-database
Slug: arxiv-database


SKILL.md

arXiv Database

Overview

This skill provides Python tools for searching and retrieving preprints from arXiv.org via its public Atom API. It supports keyword search, author search, category filtering, arXiv ID lookup, and PDF download. Results are returned as structured JSON with titles, abstracts, authors, categories, and links.

When to Use This Skill

Use this skill when:

Searching for preprints in CS, ML, AI, physics, math, statistics, q-bio, q-fin, or economics
Looking up specific papers by arXiv ID (e.g., 2309.10668)
Tracking an author's recent preprints
Filtering papers by arXiv category (e.g., cs.LG, cs.CL, stat.ML)
Downloading PDFs for full-text analysis
Building literature review datasets for AI/ML research
Monitoring new submissions in a subfield

Consider alternatives when:

Searching for biomedical literature specifically -> Use pubmed-database or biorxiv-database
You need citation counts or impact metrics -> Use openalex-database
You need peer-reviewed journal articles only -> Use pubmed-database

Core Search Capabilities

1. Keyword Search

Search for papers by keywords in titles, abstracts, or all fields.

python scripts/arxiv_search.py \
  --keywords "sparse autoencoders" "mechanistic interpretability" \
  --max-results 20 \
  --output results.json

With category filter:

python scripts/arxiv_search.py \
  --keywords "transformer" "attention mechanism" \
  --category cs.LG \
  --max-results 50 \
  --output transformer_papers.json

Search specific fields:

# Title only
python scripts/arxiv_search.py \
  --keywords "GRPO" \
  --search-field ti \
  --max-results 10

# Abstract only
python scripts/arxiv_search.py \
  --keywords "reward model" "RLHF" \
  --search-field abs \
  --max-results 30
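
To make the mapping from these flags to an arXiv query more concrete, here is a minimal sketch of how keywords, a search field, and a category could be combined into one query string. The helper name `build_keyword_query` and the choice to join keywords with AND are assumptions for illustration; the actual logic in scripts/arxiv_search.py may differ.

```python
def build_keyword_query(keywords, field="all", category=None):
    """Combine keywords into an arXiv query string (hypothetical helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    clauses = []
    for kw in keywords:
        kw = kw.strip()
        # Quote multi-word keywords so they are matched as a phrase.
        term = f'"{kw}"' if " " in kw else kw
        clauses.append(f"{field}:{term}")
    query = " AND ".join(clauses)
    if category:
        # Parenthesize the keyword part so the category filter applies to all of it.
        query = f"({query}) AND cat:{category}"
    return query


print(build_keyword_query(["transformer", "attention mechanism"], category="cs.LG"))
# (all:transformer AND all:"attention mechanism") AND cat:cs.LG
```

The resulting string can be passed as the `query` argument of `ArxivSearcher.search`, which accepts free-form arXiv query syntax.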

2. Author Search

python scripts/arxiv_search.py \
  --author "Anthropic" \
  --max-results 50 \
  --output anthropic_papers.json

python scripts/arxiv_search.py \
  --author "Ilya Sutskever" \
  --category cs.LG \
  --max-results 20

3. arXiv ID Lookup

Retrieve metadata for specific papers:

python scripts/arxiv_search.py \
  --ids 2309.10668 2406.04093 2310.01405 \
  --output sae_papers.json

Full arXiv URLs are also accepted:

python scripts/arxiv_search.py \
  --ids "https://arxiv.org/abs/2309.10668"
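
Accepting both bare IDs and URLs implies some normalization step. The sketch below shows one way this could be done with a regular expression; `normalize_arxiv_id` is a hypothetical helper, not necessarily what the script does internally.

```python
import re

# New-style IDs look like 2309.10668 (optionally with a version such as v2);
# old-style IDs look like hep-th/9901001 or math.GT/0309136.
_ID_PATTERN = re.compile(
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?P<version>v\d+)?"
)


def normalize_arxiv_id(value, keep_version=False):
    """Extract a bare arXiv ID from an ID string or an arxiv.org URL."""
    text = value.strip()
    # Drop a trailing ".pdf" so PDF links are handled too.
    if text.endswith(".pdf"):
        text = text[:-4]
    match = _ID_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a recognizable arXiv ID: {value!r}")
    arxiv_id = match.group("id")
    if keep_version and match.group("version"):
        arxiv_id += match.group("version")
    return arxiv_id


print(normalize_arxiv_id("https://arxiv.org/abs/2309.10668"))  # 2309.10668
```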

4. Category Browsing

List recent papers in a category:

python scripts/arxiv_search.py \
  --category cs.AI \
  --max-results 100 \
  --sort-by submittedDate \
  --output recent_cs_ai.json
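
For reference, category browsing corresponds to a plain request against the arXiv API. The sketch below builds such a URL with the standard library. `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` are the API's documented query parameters, while the helper name `build_api_url` is hypothetical and not part of this skill.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_api_url(search_query, start=0, max_results=10,
                  sort_by="submittedDate", sort_order="descending"):
    """Return the request URL for an arXiv API search (illustrative helper)."""
    params = {
        "search_query": search_query,
        "start": start,              # offset of the first result, for paging
        "max_results": max_results,  # page size
        "sortBy": sort_by,           # relevance, lastUpdatedDate, or submittedDate
        "sortOrder": sort_order,     # ascending or descending
    }
    # urlencode percent-encodes characters such as ":" and spaces.
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_api_url("cat:cs.AI", max_results=100))
```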

5. PDF Download

python scripts/arxiv_search.py \
  --ids 2309.10668 \
  --download-pdf papers/

Batch download from search results:

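import os
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Search first (the phrase is quoted so it is matched as a phrase in the title)
results = searcher.search(query='ti:"sparse autoencoder"', max_results=5)

# Make sure the target directory exists, then download each PDF
os.makedirs("papers", exist_ok=True)
for paper in results:
    arxiv_id = paper["arxiv_id"]
    # Old-style IDs (e.g. hep-th/9901001) contain "/", which is unsafe in file names
    searcher.download_pdf(arxiv_id, f"papers/{arxiv_id.replace('/', '_')}.pdf")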

arXiv Categories

Computer Science (cs.*)

Category	Description
`cs.AI`	Artificial Intelligence
`cs.CL`	Computation and Language (NLP)
`cs.CV`	Computer Vision
`cs.LG`	Machine Learning
`cs.NE`	Neural and Evolutionary Computing
`cs.RO`	Robotics
`cs.CR`	Cryptography and Security
`cs.DS`	Data Structures and Algorithms
`cs.IR`	Information Retrieval
`cs.SE`	Software Engineering

Statistics & Math

Category	Description
`stat.ML`	Machine Learning (Statistics)
`stat.ME`	Methodology
`math.OC`	Optimization and Control
`math.ST`	Statistics Theory

Other Relevant Categories

Category	Description
`q-bio.BM`	Biomolecules
`q-bio.GN`	Genomics
`q-bio.QM`	Quantitative Methods
`q-fin.ST`	Statistical Finance
`eess.SP`	Signal Processing
`physics.comp-ph`	Computational Physics

Full list: see references/api_reference.md.

Query Syntax

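The arXiv API uses prefix-based field searches combined with Boolean operators. Wrap multi-word phrases in double quotes (e.g., ti:"sparse autoencoder") so they are matched as a phrase within the field.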

Field prefixes:

ti: - Title
au: - Author
abs: - Abstract
cat: - Category
all: - All fields (default)
co: - Comment
jr: - Journal reference
id: - arXiv ID

Boolean operators (must be UPPERCASE):

ti:transformer AND abs:attention
au:bengio OR au:lecun
cat:cs.LG ANDNOT cat:cs.CV

Grouping with parentheses:

(ti:sparse AND ti:autoencoder) AND cat:cs.LG
au:anthropic AND (abs:interpretability OR abs:alignment)
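
Date ranges can be expressed in the same syntax. The arXiv API accepts a `submittedDate:[START TO END]` filter with timestamps in `YYYYMMDDHHMM` form (GMT). A minimal sketch of composing such a clause follows; `with_date_range` is a hypothetical helper and is not part of this skill.

```python
from datetime import date


def with_date_range(query, start, end):
    """Append a submittedDate filter covering the days start..end (inclusive)."""
    if end < start:
        raise ValueError("end date must not precede start date")
    # Timestamps are YYYYMMDDHHMM; cover each day from 00:00 to 23:59.
    clause = f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
    return f"({query}) AND {clause}" if query else clause


print(with_date_range("cat:cs.LG", date(2024, 1, 1), date(2024, 1, 31)))
# (cat:cs.LG) AND submittedDate:[202401010000 TO 202401312359]
```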

Examples:

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Papers about SAEs in ML (multi-word phrases are quoted)
results = searcher.search(
    query='ti:"sparse autoencoder" AND cat:cs.LG',
    max_results=50,
    sort_by="submittedDate"
)

# Specific author in specific field
results = searcher.search(
    query='au:"neel nanda" AND cat:cs.LG',
    max_results=20
)

# Complex boolean query
results = searcher.search(
    query='(abs:RLHF OR abs:"reinforcement learning from human feedback") AND cat:cs.CL',
    max_results=100
)

Output Format

All searches return structured JSON. The field values below are illustrative:

{
  "query": "ti:\"sparse autoencoder\" AND cat:cs.LG",
  "result_count": 15,
  "results": [
    {
      "arxiv_id": "2309.10668",
      "title": "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning",
      "authors": ["Trenton Bricken", "Adly Templeton", "..."],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "primary_category": "cs.LG",
      "published": "2023-09-19T17:58:00Z",
      "updated": "2023-10-04T14:22:00Z",
      "doi": "10.48550/arXiv.2309.10668",
      "pdf_url": "http://arxiv.org/pdf/2309.10668v1",
      "abs_url": "http://arxiv.org/abs/2309.10668v1",
      "comment": "42 pages, 30 figures",
      "journal_ref": ""
    }
  ]
}
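
Since the API itself returns an Atom feed, each JSON record has to be derived from an Atom `<entry>` element. The sketch below shows how that mapping could look with the standard library. The sample feed, including its ID `0000.00000`, is made up for illustration, and `parse_entry` is a hypothetical function rather than the script's actual parser.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Made-up feed with the same element layout the arXiv API uses.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An  Example
      Title</title>
    <summary>Example abstract.</summary>
    <published>2024-01-01T00:00:00Z</published>
    <updated>2024-01-02T00:00:00Z</updated>
    <author><name>Ada Example</name></author>
    <author><name>Bob Sample</name></author>
    <arxiv:primary_category term="cs.LG"/>
    <category term="cs.LG"/>
    <category term="cs.AI"/>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def _text(node, path):
    """Return whitespace-normalized text of a child element, or ''."""
    child = node.find(path, NS)
    if child is None or not child.text:
        return ""
    return " ".join(child.text.split())


def parse_entry(entry):
    """Map one Atom <entry> to a dict shaped like the JSON record above."""
    abs_url = _text(entry, "atom:id")
    versioned_id = abs_url.split("/abs/", 1)[-1]
    pdf_links = [link.get("href", "") for link in entry.findall("atom:link", NS)
                 if link.get("title") == "pdf"]
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxiv_id": re.sub(r"v\d+$", "", versioned_id),  # strip version suffix
        "title": _text(entry, "atom:title"),
        "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
        "abstract": _text(entry, "atom:summary"),
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "primary_category": primary.get("term") if primary is not None else "",
        "published": _text(entry, "atom:published"),
        "updated": _text(entry, "atom:updated"),
        "doi": _text(entry, "arxiv:doi"),
        "pdf_url": pdf_links[0] if pdf_links else "",
        "abs_url": abs_url,
        "comment": _text(entry, "arxiv:comment"),
        "journal_ref": _text(entry, "arxiv:journal_ref"),
    }


root = ET.fromstring(SAMPLE_FEED)
papers = [parse_entry(e) for e in root.findall("atom:entry", NS)]
print(papers[0]["arxiv_id"], "|", papers[0]["title"])
```

Optional elements such as `arxiv:doi`, `arxiv:comment`, and `arxiv:journal_ref` are absent from many entries, which is why the sketch falls back to empty strings.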

Common Usage Patterns

Literature Review Workflow

import json

import pandas as pd

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# 1. Broad search (the phrase is quoted so it is matched as a phrase)
results = searcher.search(
    query='abs:"mechanistic interpretability" AND cat:cs.LG',
    max_results=200,
    sort_by="submittedDate"
)

# 2. Save results
with open("interp_papers.json", "w") as f:
    json.dump({"result_count": len(results), "results": results}, f, indent=2)

# 3. Filter and analyze
df = pd.DataFrame(results)
print(f"Total papers: {len(df)}")
print(f"Date range: {df['published'].min()} to {df['published'].max()}")
print("\nTop categories:")
print(df["primary_category"].value_counts().head(10))
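
A literature review dataset is often assembled from several overlapping searches, so the same paper can appear more than once. The sketch below merges result lists and keeps one record per `arxiv_id`, relying only on the fields shown in the output format. `merge_results` is a hypothetical helper, not part of the skill.

```python
def merge_results(*result_lists):
    """Merge result lists, keeping one record per arxiv_id (the latest 'updated')."""
    merged = {}
    for results in result_lists:
        for paper in results:
            key = paper["arxiv_id"]
            current = merged.get(key)
            # ISO 8601 UTC timestamps in the same format sort correctly as strings.
            if current is None or paper.get("updated", "") > current.get("updated", ""):
                merged[key] = paper
    # Newest submissions first.
    return sorted(merged.values(), key=lambda p: p.get("published", ""), reverse=True)


# Made-up records, just to show the behaviour.
batch_a = [{"arxiv_id": "0000.00001", "published": "2024-01-01T00:00:00Z",
            "updated": "2024-01-01T00:00:00Z"}]
batch_b = [{"arxiv_id": "0000.00001", "published": "2024-01-01T00:00:00Z",
            "updated": "2024-02-01T00:00:00Z"},
           {"arxiv_id": "0000.00002", "published": "2024-03-01T00:00:00Z",
            "updated": "2024-03-01T00:00:00Z"}]
unique = merge_results(batch_a, batch_b)
print([p["arxiv_id"] for p in unique])
```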

Track a Research Group

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Note: au: matches author-name strings, so an organization name only
# matches papers that list it in the author field.
groups = {
    "anthropic": "au:anthropic AND (cat:cs.LG OR cat:cs.CL)",
    "openai": "au:openai AND cat:cs.CL",
    "deepmind": "au:deepmind AND cat:cs.LG",
}

for name, query in groups.items():
    results = searcher.search(query=query, max_results=50, sort_by="submittedDate")
    print(f"{name}: {len(results)} recent papers")
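
Loops like this issue several API requests back to back. arXiv asks API clients to leave roughly three seconds between requests, and it is not stated here whether `ArxivSearcher` already throttles internally. Assuming it does not, a small hypothetical `Throttle` class could space out calls:

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage sketch: call wait() before each request.
throttle = Throttle(min_interval=0.05)  # short interval for demonstration only
for name in ["anthropic", "openai", "deepmind"]:
    throttle.wait()
    print(f"would query group: {name}")
```

In real use, `min_interval` would stay at its default of 3.0 and `throttle.wait()` would precede each `searcher.search(...)` call.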

Monitor New Submissions

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Most recent ML papers
results = searcher.search(
    query="cat:cs.LG",
    max_results=50,
    sort_by="submittedDate",
    sort_order="descending"
)

for paper in results[:10]:
    print(f"[{paper['published'][:10]}] {paper['title']}")
    print(f"  {paper['abs_url']}\n")
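
When this check runs on a schedule, it helps to remember which papers were already reported. The sketch below keeps seen IDs in a small JSON file and returns only unseen records. `filter_unseen` and the file name `seen_ids.json` are assumptions for illustration, not part of the skill.

```python
import json
from pathlib import Path


def filter_unseen(results, seen_path="seen_ids.json"):
    """Return papers whose arxiv_id is not yet recorded, then record them."""
    path = Path(seen_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new_papers = [p for p in results if p["arxiv_id"] not in seen]
    seen.update(p["arxiv_id"] for p in new_papers)
    # Persist the updated set so the next run skips these papers.
    path.write_text(json.dumps(sorted(seen)))
    return new_papers
```

A run would then pass the search output through it, for example `for paper in filter_unseen(results): ...`, so that each paper is printed only once across runs.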

Python API

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher(verbose=True)

# Free-form query (uses arXiv query syntax)
results = searcher.search(query="...", max_results=50)

# Lookup by ID
papers = searcher.get_by_ids(["2309.10668", "2406.04093"])

# Download PDF
searcher.download_pdf("2309.10668", "paper.pdf")

# Build query from components
query = ArxivSearcher.build_query(
    title="sparse autoencoder",
    author="anthropic",
