Claude generates better test cases than I write manually — here's the prompt
I used to write tests for the happy path and a few obvious error cases. Claude writes tests I would never think of — negative numbers that overflow, Unicode in unexpected places, concurrent calls to the same function, empty arrays where I assumed at least one item. The quality jump comes from the prompt, not just from using AI. Here is what I figured out.
The naive approach (and why it produces mediocre tests)
Asking "write tests for this function" produces tests that mirror your implementation. Claude sees that you handle the positive case and writes a test for the positive case. It sees your null check and writes a test for null. This is marginally better than nothing but misses the creative edge cases that find real bugs.
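To make the contrast concrete, here is a minimal sketch of the two prompt shapes (the strings are my own paraphrase, not taken from any particular project):

```python
# Naive: the model has nothing to reason about except the implementation itself,
# so its tests tend to mirror the branches it can see.
naive_prompt = "Write tests for this function:\n\n{code}"

# Structured: an explicit checklist forces the model to consider failure
# categories the implementation never mentions.
structured_prompt = (
    "Write tests for this function:\n\n{code}\n\n"
    "Cover: happy path, boundary values (empty, zero, min/max), "
    "type coercion, None handling, very large inputs, and special characters."
)
```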
The structured test generation prompt
import anthropic
client = anthropic.Anthropic()
TEST_SYSTEM = """You are a QA engineer known for finding edge cases.
When generating tests, think through:
1. Happy path — standard inputs, expected output
2. Boundary values — empty, zero, min/max for numbers, single char strings
3. Type coercion — what if a string is passed where a number is expected
4. Null/undefined/None handling — each optional parameter separately
5. Concurrent access — what if two callers hit this simultaneously
6. Large inputs — very long strings, arrays with thousands of items
7. Special characters — Unicode, newlines, SQL injection strings, HTML
8. State mutations — does the function modify its inputs?
9. Error propagation — what exceptions can this throw and are they caught
10. Integration boundary — what if the external dependency (DB, API) fails
Generate tests that would find bugs that real users would actually hit.
Use the project's existing test framework and patterns from the examples provided."""
def generate_tests(
    source_code: str,
    existing_test_example: str,
    function_name: str,
    framework: str = "pytest",
) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=3000,
        system=TEST_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"""Generate comprehensive tests for the function `{function_name}`.

Source code:
```python
{source_code}
```

Existing test example (follow this style):
```python
{existing_test_example}
```

Framework: {framework}

Generate at minimum:
- 3 happy path tests with different valid inputs
- 5+ edge case tests
- Error case tests for each expected exception
- At least one test that would catch a common off-by-one or type error""",
        }],
    )
    return response.content[0].text
Example: generating tests for a price calculator
source = """
def calculate_discount(
    price: float,
    discount_percent: float,
    min_price: float = 0.0,
) -> float:
    if not 0 <= discount_percent <= 100:
        raise ValueError(f"Discount must be 0-100, got {discount_percent}")
    discounted = price * (1 - discount_percent / 100)
    return max(discounted, min_price)
"""
existing_test = """
def test_no_discount():
    assert calculate_discount(100.0, 0) == 100.0
"""
tests = generate_tests(source, existing_test, "calculate_discount")
print(tests)
Claude generated 14 tests including ones I would not have written:
import pytest

def test_no_discount():
    assert calculate_discount(100.0, 0) == 100.0

def test_full_discount():
    assert calculate_discount(100.0, 100) == 0.0

def test_partial_discount():
    assert calculate_discount(200.0, 25) == 150.0

def test_min_price_enforced():
    # Even with 100% discount, min_price is respected
    assert calculate_discount(100.0, 100, min_price=10.0) == 10.0

def test_min_price_not_applied_when_unnecessary():
    # min_price should not artificially raise the price
    assert calculate_discount(100.0, 10, min_price=5.0) == 90.0

def test_floating_point_precision():
    # Classic floating point: 19.99 * 0.9 = 17.991000000000003
    result = calculate_discount(19.99, 10)
    assert abs(result - 17.991) < 0.001

def test_very_small_price():
    result = calculate_discount(0.01, 50)
    assert result == 0.005

def test_large_price():
    result = calculate_discount(999_999.99, 50)
    assert result == 499_999.995

def test_discount_exactly_100_is_valid():
    assert calculate_discount(50.0, 100) == 0.0

def test_discount_exactly_0_is_valid():
    assert calculate_discount(50.0, 0) == 50.0

def test_discount_above_100_raises():
    with pytest.raises(ValueError, match="100"):
        calculate_discount(100.0, 100.001)

def test_discount_below_0_raises():
    with pytest.raises(ValueError):
        calculate_discount(100.0, -1)

def test_zero_price_with_discount():
    assert calculate_discount(0.0, 50) == 0.0

def test_price_type_coercion_with_int():
    # Ensure int inputs work (Python allows int where float expected)
    assert calculate_discount(100, 10) == 90.0
The TypeScript version
# Shell function for TypeScript projects
test-gen() {
    local file="$1"
    local func="$2"
    local framework="${3:-vitest}"
    cat "$file" | claude --print --append-system-prompt "You are a QA engineer. Generate comprehensive tests including edge cases, boundary values, null handling, and error cases. Use $framework." "Generate tests for function: $func. Include tests I would never think to write."
}
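If you prefer to stay in Python for TypeScript targets too, the structured prompt works unchanged; only the code-fence language and framework line differ. A sketch of the prompt-assembly step (the helper name `build_test_prompt` is mine, not part of the code above):

```python
def build_test_prompt(
    source_code: str,
    example: str,
    function_name: str,
    framework: str = "vitest",
) -> str:
    """Assemble a structured test-generation prompt for a TypeScript target."""
    fence = "`" * 3  # build the code fence so this listing stays self-contained
    return (
        f"Generate comprehensive tests for the function `{function_name}`.\n\n"
        f"Source code:\n{fence}typescript\n{source_code}\n{fence}\n\n"
        f"Existing test example (follow this style):\n{fence}typescript\n{example}\n{fence}\n\n"
        f"Framework: {framework}"
    )
```

The resulting string drops into the same `messages.create` call as the pytest version.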
Making it part of your workflow
The pattern I use: write the function, run the test generator, skim the output, delete the tests that are not adding value, and add the surprising ones. Takes about 5 minutes and consistently surfaces 2-3 tests I would not have written. Those 2-3 tests have caught real bugs in production code on several occasions.
The key insight: Claude is not replacing your judgment about which tests matter — it is expanding your imagination about what to test. You still decide what to keep.