Pharma Wiki — Knowledge Graph Skill

Build a persistent, compounding pharmaceutical knowledge base with structured entity/relation extraction, normalization, and interactive graph visualization.

Architecture

Three layers under a project root:

raw/ — Immutable source documents (markdown)
wiki/ — Human-readable compiled articles + index + log
graph/ — Structured knowledge graph (JSON):
- entities.json — All entities with aliases, types, descriptions
- relations.json — All binary relations (triples)
- viz.html — Auto-generated interactive D3.js visualization

Entity Types (12 + Other)

Drug, Target, Disease, Gene, Protein, Pathway, AdverseEvent, ClinicalTrial, Biomarker, Dosage, Contraindication, Mechanism

Relation Types (15)

inhibits, activates, treats, causes, associated_with, metabolized_by, contraindicated_with, indicated_for, resistant_to, sensitive_to, biomarker_for, combined_with, precedes, interacts_with, participates_in

Operations

1. Init (scaffold)

python3 scripts/scaffold.py /path/to/project

Creates raw/, wiki/, graph/ with initial files. Safe to re-run (won't overwrite).

2. Ingest (core pipeline)

Fetch → Extract → Normalize → Merge → Update graph.

# From stdin JSON
echo '{"input_text":"..."}' | python3 scripts/ingest.py /path/to/project

# From file
python3 scripts/ingest.py /path/to/project --file doc.md --topic drug-discovery

# From URL
python3 scripts/ingest.py /path/to/project --url https://... --topic clinical

# Mock mode (no LLM API key needed)
echo '{"input_text":"..."}' | python3 scripts/ingest.py /path/to/project --mock

Pipeline steps:

Save raw document to raw/<topic>/YYYY-MM-DD-slug.md
LLM extracts entities + relations (structured JSON with aliases)
Normalize: check aliases → merge with existing entities
Merge new entities/relations into graph/entities.json and graph/relations.json
Update wiki/log.md

Entity normalization works by:

Exact name match (case-insensitive)
Alias match across all entities
Built-in known alias map (Gleevec → Imatinib, CML → Chronic myeloid leukemia, etc.)

3. Query

python3 scripts/query.py /path/to/project "imatinib 的靶点有哪些？"
python3 scripts/query.py /path/to/project "从 ABL1 到 CML 有什么路径？" --mock

Query types detected automatically:

Entity lookup → all relations for mentioned entities
Path query → BFS between two entities
Compare → find shared relations
General → LLM answers using graph as context

4. Visualize

python3 scripts/visualize.py /path/to/project
python3 scripts/visualize.py /path/to/project --output /tmp/my-graph.html

Generates graph/viz.html with:

D3.js force-directed layout
Nodes colored by entity type (Drug=red, Target=blue, Disease=green, etc.)
Drag to reposition, scroll to zoom
Hover → highlight neighborhood + tooltip
Click → detail panel with all relations
Search bar for filtering
Legend click to toggle entity types

5. Lint

python3 scripts/lint.py /path/to/project

Checks:

Entity reference integrity (relation endpoints exist)
Alias conflicts (same alias on different entities)
Orphan entities (no relations)
Index consistency
Duplicate relations
Missing descriptions

Data Format

graph/entities.json

{
  "E_imatinib": {
    "id": "E_imatinib",
    "name": "Imatinib",
    "aliases": ["Gleevec", "STI-571", "格列卫", "Glivec"],
    "type": "Drug",
    "description": "Tyrosine kinase inhibitor targeting BCR-ABL, KIT, PDGFRB",
    "sources": ["raw/general/2026-01-15-imatinib-review.md"],
    "first_seen": "2026-01-15",
    "last_updated": "2026-01-15"
  }
}

graph/relations.json

{
  "R_001": {
    "id": "R_001",
    "source": "E_imatinib",
    "target": "E_bcr_abl",
    "relation": "inhibits",
    "description": "Imatinib inhibits BCR-ABL tyrosine kinase activity",
    "sources": ["raw/general/2026-01-15-imatinib-review.md"]
  }
}

Agent Workflow

When the user asks to add content to the knowledge base:

Get source content (URL, file, or pasted text)
Run ingest: echo '{"input_text":"..."}' | python3 scripts/ingest.py <project> [--mock]
Run visualize to update the graph view
Report: how many new entities, merged entities, new relations

When the user asks a question:

Run query: python3 scripts/query.py <project> "<question>" [--mock]
Present answer with citations from the graph

When the user asks for health check:

Run lint: python3 scripts/lint.py <project>
Report issues and suggest fixes

Dependencies

Zero pip dependencies — all scripts use Python stdlib only
LLM calls use urllib.request (no openai library needed)
D3.js loaded from CDN in visualization (no install needed)
Requires OPENAI_API_KEY for real extraction; use --mock for testing

File Reference

references/extraction-prompt.md — LLM extraction prompts (system + user + few-shot)
references/entity-types.md — Entity/relation type definitions and colors
references/raw-template.md — Raw document template
references/article-template.md — Wiki article template
references/index-template.md — Index template
scripts/scaffold.py — Initialize project structure
scripts/ingest.py — Core pipeline: extract → normalize → merge
scripts/query.py — Graph traversal + LLM Q&A
scripts/visualize.py — D3.js interactive visualization generator
scripts/lint.py — Health check