The Hitch-Hiker's Guide to Vibe Engineering

A humorous yet practical guide to AI-assisted development. DON'T PANIC.


Testing

Risk Level: 🟢 Essential


Guide Entry

TESTING (n.): The practice of verifying that code works. With AI-generated code, testing is your primary defense against confident wrongness. The AI will never tell you something is broken. Tests will.


The Testing Paradox

AI can generate tests. But:

- The same model that wrote the code writes the tests, so both can share the same misunderstanding
- Tests can assert what the code happens to do, not what the spec says it should do
- A suite can look thorough while skipping the cases that actually matter

Solution: Generate tests, but verify they test what matters. The sketch below shows why.
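
A hypothetical sketch, using the averaging function from the workflow below:

# The implementation quietly returns None for empty lists...
def average(numbers):
    numeric = [n for n in numbers if isinstance(n, (int, float))]
    if not numeric:
        return None
    return sum(numeric) / len(numeric)

# ...and the AI-generated test enshrines the same mistake, so the suite
# passes even though the spec says empty lists should return 0
def test_average_empty_list():
    assert average([]) is None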


Testing AI Code Workflow

Step 1: Specify First

Before generating code:

"I need a function that:
- Takes a list of numbers
- Returns the average
- Returns 0 for empty lists
- Ignores non-numeric values"
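
You can also pin the spec down in code as a stub with a docstring before anything gets generated, so the tests have a concrete target. A minimal sketch (the name average is just the running example from this spec):

def average(numbers):
    """Return the average of the numeric values in `numbers`.

    Ignores non-numeric values; returns 0 for an empty list.
    """
    raise NotImplementedError  # the implementation comes after the tests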

Step 2: Generate Tests First

"Write tests for that function before implementing it.
Use pytest.
Include edge cases."

Step 3: Review Tests

Do the tests match your specification?

Do the tests make sense?

# Good test
def test_average_ignores_strings():
    assert average([1, "two", 3]) == 2.0

# Suspicious test - why would this be expected?
def test_average_returns_none_for_empty():
    assert average([]) is None
# Wait, spec said return 0, not None...

Step 4: Generate Implementation

"Now implement the function to pass these tests."

Step 5: Run Tests

pytest

If tests fail, you learned something. Fix and repeat.
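
pytest with no arguments discovers and runs every test it can find. To see each test name as it runs, -v and an explicit file work too (the filename here is illustrative):

pytest -v test_average.py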


What to Test

The Happy Path

def test_average_normal_case():
    assert average([1, 2, 3, 4, 5]) == 3.0

Edge Cases

def test_average_empty_list():
    assert average([]) == 0

def test_average_single_element():
    assert average([42]) == 42

Error Cases

def test_average_all_invalid_input():
    # Per the spec, non-numeric values are ignored rather than raising an error,
    # so a list with nothing numeric behaves like an empty list
    assert average(["one", "two"]) == 0

Boundaries

def test_average_very_large_numbers():
    assert average([10**100, 10**100]) == 10**100
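
Exact equality at this scale is a strict check: an implementation that computes sum(nums) / len(nums) goes through Python's float division, loses precision at 10**100, and will fail it. If close-enough is what you actually care about, a tolerance-based variant is an option:

import math

def test_average_very_large_numbers_approx():
    # math.isclose uses a relative tolerance, so float rounding at 1e100 still passes
    assert math.isclose(average([10**100, 10**100]), 10**100)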

Testing AI-Specific Concerns

Hallucination Testing

def test_uses_real_library():
    # If this import fails, AI hallucinated the library
    from actual_library import actual_function

Scope Testing

def test_function_only_does_what_asked():
    result = process_data(input_data)
    # Verify it didn't add extra fields
    assert set(result.keys()) == {"expected", "fields", "only"}

Integration Testing

def test_works_with_existing_code():
    # Test that AI code integrates with your codebase
    existing_result = existing_function()
    ai_result = ai_generated_function(existing_result)
    assert ai_result is not None

AI Test Generation Prompts

Good Prompt

"Generate pytest tests for this function.
Include:
- 3 happy path tests
- 3 edge case tests
- 2 error case tests

Follow this format:
def test_descriptive_name():
    '''What this tests.'''
    # Arrange
    input = ...
    # Act
    result = function(input)
    # Assert
    assert result == expected"
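
Following that format, a generated test for the average spec might come back looking something like this (illustrative output, not a canonical answer):

def test_average_ignores_non_numeric_values():
    '''Non-numeric entries are skipped when computing the average.'''
    # Arrange
    values = [10, "skip me", 20]
    # Act
    result = average(values)
    # Assert
    assert result == 15.0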

Review the Output

AI might generate:

- Tests that can never fail (asserting a constant, or the function's own output against itself)
- Tests whose expected values quietly contradict your spec
- A happy-path-only suite that skips edge and error cases

You catch these by reading the tests.
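
A hypothetical example of the kind of test that passes without proving anything:

def test_average_works():
    # Vacuous: compares the function's output to itself, so it can never fail
    assert average([1, 2, 3]) == average([1, 2, 3])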


The Testing Safety Net

Level 0: No tests
         AI says it works, you hope it works
         🔴 Dangerous

Level 1: AI-generated tests, unreviewed
         Better than nothing
         🟡 Risky

Level 2: AI-generated tests, reviewed
         You verified they test the right things
         🟢 Good

Level 3: Spec-first tests, then implementation
         Tests define correctness, code follows
         🟢 Better

Level 4: Test + review + CI/CD
         Automated verification on every commit
         🟢 Best
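
The CI/CD half of Level 4 depends on your platform, but the core idea is simply: run the suite on every commit and fail the build when it fails. A minimal local sketch, assuming your tests live in tests/, that you could wire into a pre-commit hook or a CI job:

# check.py
import sys

import pytest

if __name__ == "__main__":
    # pytest.main returns a non-zero exit code when any test fails,
    # which is exactly what a commit hook or CI job needs
    sys.exit(pytest.main(["-q", "tests/"]))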

The Street Rule

“The AI cannot test its own correctness. Tests are how you test the AI.”


Move to Make

For your next AI-generated function:

  1. Write the spec
  2. Generate tests from spec
  3. Review tests manually
  4. Generate implementation
  5. Run tests
  6. Note what the tests caught

Build the muscle memory of test-first AI development.