Phase 2B: Caption Embeddings

Extracts Sentence-BERT (MiniLM) text embeddings from memory captions.

Run

source venv/bin/activate
pip install "sentence-transformers>=2.2.0"
python pipeline/extract_caption_embeddings.py

Output

File                                          Description
data/processed/text_embeddings.npy            (N, 384) Sentence-BERT embeddings
data/processed/caption_embedding_index.json   Record ID to embedding index mapping
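
A minimal sketch of how the two output files might be used together. It assumes the JSON index is a flat object mapping each record ID (a string) to an integer row number in the array; the exact schema is not documented here, so treat that as an assumption. The helper names `load_outputs` and `lookup_embedding` are hypothetical.

```python
import json

import numpy as np


def load_outputs(embeddings_path="data/processed/text_embeddings.npy",
                 index_path="data/processed/caption_embedding_index.json"):
    """Load the (N, 384) embedding matrix and the record-ID index from disk."""
    embeddings = np.load(embeddings_path)
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    return embeddings, index


def lookup_embedding(record_id, embeddings, index):
    """Return the embedding row for record_id.

    Assumes `index` maps record ID -> integer row number in `embeddings`.
    """
    row = index[record_id]  # raises KeyError for unknown record IDs
    return embeddings[row]
```

After running the pipeline, `embeddings, index = load_outputs()` followed by `lookup_embedding(some_id, embeddings, index)` should return a single 384-dimensional vector, provided the index really has the assumed shape.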

Preprocessing Rules

  • Unicode normalization (NFC)
  • Strip leading/trailing whitespace
  • Preserve punctuation and casing
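
The three rules above can be sketched with the standard library alone. This is an illustration, not necessarily how `pipeline/extract_caption_embeddings.py` implements them; the function name `preprocess_caption` is hypothetical.

```python
import unicodedata


def preprocess_caption(text: str) -> str:
    """Apply NFC normalization and trim surrounding whitespace.

    Punctuation and casing are deliberately left untouched.
    """
    normalized = unicodedata.normalize("NFC", text)
    return normalized.strip()
```

For example, `preprocess_caption("  Cafe\u0301 at Noon!  ")` returns `"Café at Noon!"`, with the `e` and the combining accent composed into the single code point U+00E9 and the capitalization and exclamation mark preserved.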

Model Details

  • Model: all-MiniLM-L6-v2
  • Embedding Dimension: 384
  • Normalization: L2-normalized vectors
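
A hedged sketch of how embeddings with these properties could be produced using the `sentence-transformers` library. The function names `embed_captions` and `is_unit_norm` are hypothetical, and the actual script may batch, cache, or configure the model differently.

```python
import numpy as np


def embed_captions(captions, model_name="all-MiniLM-L6-v2"):
    """Encode a list of captions into L2-normalized 384-dimensional vectors."""
    # Imported lazily so is_unit_norm can be used without loading the model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True asks the library to L2-normalize each vector.
    return model.encode(captions, normalize_embeddings=True, convert_to_numpy=True)


def is_unit_norm(vectors, atol=1e-5):
    """Check that every row has Euclidean length 1 within a tolerance."""
    norms = np.linalg.norm(np.asarray(vectors), axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))
```

Because the vectors are unit length, cosine similarity between two captions reduces to a plain dot product, so a full similarity matrix can be computed as `embeddings @ embeddings.T`. `is_unit_norm` is a quick sanity check to run on a saved array before relying on that shortcut.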