

LLMemory Testing Guide

Testing Philosophy: Integration-First TDD

This project uses integration-first TDD - we write integration tests that verify real workflows, not unit tests that verify implementation details.

Core Principles

1. Integration Tests Are Primary

Why:

  • Tests real behavior users/agents will experience
  • Less brittle (survives refactoring)
  • Higher confidence in system working correctly
  • Catches integration issues early

Example:

// GOOD: Integration test
test('store and search workflow', () => {
  // Test the actual workflow
  storeMemory(db, { content: 'Docker uses bridge networks', tags: 'docker' });
  const results = searchMemories(db, 'docker');
  expect(results[0].content).toContain('Docker');
});

// AVOID: Over-testing implementation details
test('parseContent returns trimmed string', () => {
  expect(parseContent('  test  ')).toBe('test');
});
// ^ This is probably already tested by integration tests

2. Unit Tests Are Rare

Only write unit tests for:

  • Complex algorithms (Levenshtein distance, trigram extraction)
  • Pure functions with many edge cases
  • Critical validation logic

Don't write unit tests for:

  • Database queries (test via integration)
  • CLI argument parsing (test via integration)
  • Simple utilities (tag parsing, date formatting)
  • Anything already covered by integration tests

Rule of thumb: Think twice before writing a unit test. Ask: "Is this already tested by my integration tests?"

3. Test With Realistic Data

Use real SQLite databases:

beforeEach(() => {
  db = new Database(':memory:');  // Fast, isolated
  initSchema(db);
  
  // Seed with realistic data
  seedDatabase(db, 50);  // 50 realistic memories
});

Generate realistic test data:

// test/helpers/seed.js
export function generateRealisticMemory() {
  const templates = [
    { content: 'Docker Compose requires explicit subnet config', tags: ['docker', 'networking'] },
    { content: 'PostgreSQL VACUUM FULL locks tables', tags: ['postgresql', 'performance'] },
    { content: 'Git worktree allows parallel branches', tags: ['git', 'workflow'] },
    // 50+ realistic templates
  ];
  return randomChoice(templates);
}

Why: Tests should reflect real usage, not artificial toy data.

4. Watch-Driven Development

Workflow:

# Terminal 1: Watch mode (always running)
npm run test:watch

# Terminal 2: Manual testing
node src/cli.js store "test memory"

Steps:

  1. Write integration test (red/failing)
  2. Watch test fail
  3. Implement feature
  4. Watch test pass (green)
  5. Verify manually with CLI
  6. Refine based on output

TDD Workflow Example

Example: Implementing Store Command

Step 1: Write Test First

// test/integration.test.js
describe('Store Command', () => {
  let db;
  
  beforeEach(() => {
    db = new Database(':memory:');
    initSchema(db);
  });
  
  test('stores memory with tags', () => {
    const result = storeMemory(db, {
      content: 'Docker uses bridge networks',
      tags: 'docker,networking'
    });
    
    expect(result.id).toBeDefined();
    
    // Verify in database
    const memory = db.prepare('SELECT * FROM memories WHERE id = ?').get(result.id);
    expect(memory.content).toBe('Docker uses bridge networks');
    
    // Verify tags linked correctly
    const tags = db.prepare(`
      SELECT t.name FROM tags t
      JOIN memory_tags mt ON t.id = mt.tag_id
      WHERE mt.memory_id = ?
      ORDER BY t.name
    `).all(result.id);
    
    expect(tags.map(t => t.name)).toEqual(['docker', 'networking']);
  });
  
  test('rejects content over 10KB', () => {
    expect(() => {
      storeMemory(db, { content: 'x'.repeat(10001) });
    }).toThrow('Content exceeds 10KB limit');
  });
});

Step 2: Run Test (Watch It Fail)

$ npm run test:watch

FAIL test/integration.test.js
  Store Command
    ✕ stores memory with tags (2 ms)
    ✕ rejects content over 10KB (1 ms)

  ● Store Command > stores memory with tags
    ReferenceError: storeMemory is not defined

Step 3: Implement Feature

// src/commands/store.js
export function storeMemory(db, { content, tags, expires_at, entered_by }) {
  // Validate content
  if (content.length > 10000) {
    throw new Error('Content exceeds 10KB limit');
  }
  
  // Insert memory
  const result = db.prepare(`
    INSERT INTO memories (content, entered_by, expires_at)
    VALUES (?, ?, ?)
  `).run(content, entered_by, expires_at);
  
  // Handle tags
  if (tags) {
    const tagList = tags.split(',').map(t => t.trim().toLowerCase());
    linkTags(db, result.lastInsertRowid, tagList);
  }
  
  return { id: result.lastInsertRowid };
}
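storeMemory calls a linkTags helper that this guide never shows. A minimal sketch of what it might look like — the table names follow the schema queried in the tests above, but the helper itself is an assumption, not code from the repo:

```javascript
// Hypothetical linkTags helper (exported from the app code in the real
// project): upsert each tag by name, then connect it to the memory
// through the memory_tags join table.
function linkTags(db, memoryId, tagList) {
  const insertTag = db.prepare('INSERT OR IGNORE INTO tags (name) VALUES (?)');
  const getTagId = db.prepare('SELECT id FROM tags WHERE name = ?');
  const link = db.prepare(
    'INSERT OR IGNORE INTO memory_tags (memory_id, tag_id) VALUES (?, ?)'
  );

  for (const name of tagList) {
    insertTag.run(name);               // no-op if the tag already exists
    const { id } = getTagId.get(name); // look up its id either way
    link.run(memoryId, id);            // idempotent link row
  }
}
```

INSERT OR IGNORE keeps the helper idempotent, so re-storing a memory with the same tags never duplicates rows.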

Step 4: Watch Test Pass

PASS test/integration.test.js
  Store Command
    ✓ stores memory with tags (15 ms)
    ✓ rejects content over 10KB (3 ms)

Tests: 2 passed, 2 total

Step 5: Verify Manually

$ node src/cli.js store "Docker uses bridge networks" --tags docker,networking
Memory #1 stored successfully

$ node src/cli.js search "docker"
[2025-10-29 12:45] docker, networking
Docker uses bridge networks

Step 6: Refine

// Add more test cases based on manual testing
test('normalizes tags to lowercase', () => {
  storeMemory(db, { content: 'test', tags: 'Docker,NETWORKING' });
  
  const tags = db.prepare('SELECT name FROM tags ORDER BY name').all();
  expect(tags).toEqual([
    { name: 'docker' },
    { name: 'networking' }
  ]);
});

Test Organization

Directory Structure

test/
├── integration.test.js       # PRIMARY - All main workflows
├── unit/
│   ├── fuzzy.test.js        # RARE - Only complex algorithms
│   └── levenshtein.test.js  # RARE - Only complex algorithms
├── helpers/
│   ├── seed.js              # Realistic data generation
│   └── db.js                # Database setup helpers
└── fixtures/
    └── realistic-memories.js # Memory templates

Integration Test Structure

// test/integration.test.js
import { describe, test, expect, beforeEach, afterEach } from 'vitest';
import Database from 'better-sqlite3';
import { storeMemory, searchMemories } from '../src/commands/index.js';
import { initSchema } from '../src/db/schema.js';
import { seedDatabase } from './helpers/seed.js';

describe('Memory System Integration', () => {
  let db;
  
  beforeEach(() => {
    db = new Database(':memory:');
    initSchema(db);
  });
  
  afterEach(() => {
    db.close();
  });
  
  describe('Store and Retrieve', () => {
    test('stores and finds memory', () => {
      storeMemory(db, { content: 'test', tags: 'demo' });
      const results = searchMemories(db, 'test');
      expect(results).toHaveLength(1);
    });
  });
  
  describe('Search with Filters', () => {
    beforeEach(() => {
      seedDatabase(db, 50); // Realistic data
    });
    
    test('filters by tags', () => {
      const results = searchMemories(db, 'docker', { tags: ['networking'] });
      results.forEach(r => {
        expect(r.tags).toContain('networking');
      });
    });
  });
  
  describe('Performance', () => {
    test('searches 100 memories in <50ms', () => {
      seedDatabase(db, 100);
      
      const start = Date.now();
      searchMemories(db, 'test');
      const duration = Date.now() - start;
      
      expect(duration).toBeLessThan(50);
    });
  });
});

Unit Test Structure (Rare)

Only for complex algorithms:

// test/unit/levenshtein.test.js
import { describe, test, expect } from 'vitest';
import { levenshtein } from '../../src/search/fuzzy.js';

describe('Levenshtein Distance', () => {
  test('calculates edit distance correctly', () => {
    expect(levenshtein('docker', 'dcoker')).toBe(2);
    expect(levenshtein('kubernetes', 'kuberntes')).toBe(1); // one deletion
    expect(levenshtein('same', 'same')).toBe(0);
  });
  
  test('handles edge cases', () => {
    expect(levenshtein('', 'hello')).toBe(5);
    expect(levenshtein('a', '')).toBe(1);
    expect(levenshtein('', '')).toBe(0);
  });
  
  test('handles unicode correctly', () => {
    expect(levenshtein('café', 'cafe')).toBe(1);
  });
});
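The levenshtein function under test isn't shown in this guide. A classic two-row dynamic-programming implementation — a sketch of what src/search/fuzzy.js might contain, not the repo's actual code — looks like:

```javascript
// Levenshtein distance: minimum number of single-character inserts,
// deletes, and substitutions to turn string a into string b.
function levenshtein(a, b) {
  // Iterate over code points so multi-byte characters count as one unit
  const s = Array.from(a);
  const t = Array.from(b);

  // prev[j] = distance between s[0..i-1] and t[0..j-1] from the last row
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);

  for (let i = 1; i <= s.length; i++) {
    const curr = [i]; // distance from s[0..i-1] to the empty prefix of t
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (free on a match)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshtein('docker', 'dcoker')); // 2: a swap costs two substitutions
```

The two-row variant keeps memory at O(min-row-length) instead of materializing the full matrix, which matters once fuzzy matching runs against every stored memory.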

Test Data Helpers

Realistic Memory Generation

// test/helpers/seed.js
// (assumes the app's linkTags helper is in scope, imported from the app code)
const REALISTIC_MEMORIES = [
  { content: 'Docker Compose uses bridge networks by default. Custom networks require explicit subnet config.', tags: ['docker', 'networking'] },
  { content: 'PostgreSQL VACUUM FULL locks tables and requires 2x disk space. Use VACUUM ANALYZE for production.', tags: ['postgresql', 'performance'] },
  { content: 'Git worktree allows working on multiple branches without stashing. Use: git worktree add ../branch branch-name', tags: ['git', 'workflow'] },
  { content: 'NixOS flake.lock must be committed to git for reproducible builds across machines', tags: ['nixos', 'build-system'] },
  { content: 'TypeScript 5.0+ const type parameters preserve literal types: function id<const T>(x: T): T', tags: ['typescript', 'types'] },
  // ... 50+ more realistic examples
];

export function generateRealisticMemory() {
  return { ...randomChoice(REALISTIC_MEMORIES) };
}

export function seedDatabase(db, count = 50) {
  const insert = db.prepare(`
    INSERT INTO memories (content, entered_by, created_at)
    VALUES (?, ?, ?)
  `);
  
  const insertMany = db.transaction((memories) => {
    for (const memory of memories) {
      const result = insert.run(
        memory.content,
        randomChoice(['investigate-agent', 'optimize-agent', 'manual']),
        Date.now() - randomInt(0, 90 * 86400000) // Random within 90 days
      );
      
      // Link tags
      if (memory.tags) {
        linkTags(db, result.lastInsertRowid, memory.tags);
      }
    }
  });
  
  const memories = Array.from({ length: count }, () => generateRealisticMemory());
  insertMany(memories);
}

function randomChoice(arr) {
  return arr[Math.floor(Math.random() * arr.length)];
}

function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

Running Tests

# Watch mode (primary workflow)
npm run test:watch

# Run once
npm test

# With coverage
npm run test:coverage

# Specific test file
npm test -- integration.test.js

# Run in CI (no watch)
npm test -- --run

Coverage Guidelines

Target: >80% coverage, but favor integration over unit

What to measure:

  • Are all major workflows tested? (store, search, list, prune)
  • Are edge cases covered? (empty data, expired memories, invalid input)
  • Are performance targets met? (<50ms search for Phase 1)

What NOT to obsess over:

  • 100% line coverage (diminishing returns)
  • Testing every internal function (if covered by integration tests)
  • Testing framework code (CLI parsing, DB driver)

Check coverage:

npm run test:coverage

# View HTML report
open coverage/index.html
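If the >80% target should be enforced rather than just eyeballed, recent Vitest versions can fail the run when coverage drops below a threshold. A possible config — this file is an assumption, not taken from the repo:

```javascript
// vitest.config.js — hypothetical coverage setup matching the guidance above
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      reporter: ['text', 'html'],
      // ~80% target; deliberately not 100% (diminishing returns)
      thresholds: { lines: 80, functions: 80 },
    },
  },
});
```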

Examples of Good vs Bad Tests

Good: Integration Test

test('full workflow: store, search, list, prune', () => {
  // Store memories
  storeMemory(db, { content: 'Memory 1', tags: 'test' });
  storeMemory(db, { content: 'Memory 2', tags: 'test', expires_at: Date.now() - 1000 });
  
  // Search finds both (the expired memory hasn't been pruned yet)
  const results = searchMemories(db, 'Memory');
  expect(results).toHaveLength(2);
  
  // List shows both
  const all = listMemories(db);
  expect(all).toHaveLength(2);
  
  // Prune removes expired
  const pruned = pruneMemories(db);
  expect(pruned.count).toBe(1);
  
  // Search now finds only active
  const afterPrune = searchMemories(db, 'Memory');
  expect(afterPrune).toHaveLength(1);
});

Bad: Over-Testing Implementation

// AVOID: Testing internal implementation details
test('parseTagString splits on comma', () => {
  expect(parseTagString('a,b,c')).toEqual(['a', 'b', 'c']);
});

test('normalizeTag converts to lowercase', () => {
  expect(normalizeTag('Docker')).toBe('docker');
});

// These are implementation details already covered by integration tests!

Good: Unit Test (Justified)

// Complex algorithm worth isolated testing
test('levenshtein distance edge cases', () => {
  // Empty strings
  expect(levenshtein('', '')).toBe(0);
  expect(levenshtein('abc', '')).toBe(3);
  
  // Unicode
  expect(levenshtein('café', 'cafe')).toBe(1);
  
  // Long strings
  const long1 = 'a'.repeat(1000);
  const long2 = 'a'.repeat(999) + 'b';
  expect(levenshtein(long1, long2)).toBe(1);
});

Debugging Failed Tests

1. Use .only to Focus

test.only('this specific test', () => {
  // Only runs this test
});

2. Inspect Database State

test('debug search', () => {
  storeMemory(db, { content: 'test' });
  
  // Inspect what's in DB
  const all = db.prepare('SELECT * FROM memories').all();
  console.log('Database contents:', all);
  
  const results = searchMemories(db, 'test');
  console.log('Search results:', results);
  
  expect(results).toHaveLength(1);
});

3. Use Temp File for Manual Inspection

test('debug with file', () => {
  const db = new Database('/tmp/debug.db');
  initSchema(db);
  
  storeMemory(db, { content: 'test' });
  db.close(); // flush and release the lock before inspecting
  
  // Now inspect with: sqlite3 /tmp/debug.db
});

Summary

DO:

  • Write integration tests for all workflows
  • Use realistic data (50-100 memories)
  • Test with :memory: database
  • Run in watch mode (npm run test:watch)
  • Verify manually with CLI after tests pass
  • Think twice before writing unit tests

DON'T:

  • Test implementation details
  • Write unit tests for simple functions
  • Use toy data (1-2 memories)
  • Mock database or CLI (test the real thing)
  • Aim for 100% coverage at expense of test quality

Remember: Integration tests that verify real workflows are worth more than 100 unit tests that verify implementation details.


Testing Philosophy: Integration-first TDD with realistic data
Coverage Target: >80% (mostly integration tests)
Unit Tests: Rare, only for complex algorithms
Workflow: Write test (fail) → Implement (pass) → Verify (manual) → Refine