Pricing Experiments: The Right Box at the Right Price

Greenbox has just over two hundred subscribers. The Business Model Canvas workshop was earlier in the week, and something Lee said is still rattling around in Maya’s head: “You’ve validated almost everything on that canvas. The one box you haven’t tested is the revenue model.”

Lee is at Maya’s kitchen table with a coffee and the canvas printed on A3. He’s drawn a circle around the Revenue Streams box and written three words next to it: “untested, but working.”

“The prices are working,” Maya says. “People are paying them. Nobody’s complained.”

“Nobody who signed up has complained. You don’t know anything about the people who looked at the site, saw $25, and closed the tab. You don’t know whether you could charge $30 and still get the same conversion. You don’t know whether the small box is mispriced relative to the large box. You’ve got one data point – the current price – and you’ve decided it’s right because people are buying.”

Maya frowns. She’d been congratulating herself, a little, that the pricing felt settled. It was one less thing to worry about.

“So what are you suggesting?”

“Pricing experiments. Not once. As a habit. You should be testing your pricing the way you test your features.”

Why pricing feels different

Pricing is one of the few decisions in a startup that feels genuinely scary to get wrong. A bad feature can be rolled back. A bad copy tweak can be un-tweaked. A bad price, once published, anchors every subscriber’s expectation of what the product costs. Raise it and people feel cheated. Lower it and the people who paid the old price feel stupid.

This is why most founders pick a number that feels right, ship it, and never touch it again. It’s not laziness. It’s loss-aversion. The downside of changing pricing feels enormous, and the upside feels uncertain.

Lee has seen this pattern many times. His view is that it’s the wrong way to think about it.

“The risk isn’t changing your price. The risk is being wrong about your price and not knowing. If you’re charging $25 for a small box that people would happily pay $30 for, you’re not being nice. You’re giving away five dollars per subscriber per week. At two hundred and ten subscribers, that’s $1,050 per week you’re leaving on the table. That’s a farm partner you could pay. That’s Sam’s hours. That’s runway.”

Maya does the maths in her head. The number is uncomfortably large.

What to test

The team sits down on a Wednesday afternoon to work out what they actually want to learn. Lee uses the same approach they’ve used for product discovery: write down the assumptions and figure out which ones are worth testing.

The assumptions on the wall, in Maya’s handwriting:

The small box at $25 and the large box at $45 are both priced correctly relative to cost.
The $20 gap between small and large reflects the actual value difference to subscribers.
Subscribers would not pay more than $25 for the small box.
A third, larger box option would not expand the market.
Weekly delivery is the only frequency that makes sense.
Free delivery is a deal-breaker if removed.

Lee reads down the list and then sets the marker down. “Which of these would you be most embarrassed to find out you’d been wrong about for six months?”

Maya doesn’t hesitate. “Three. If people would happily pay $30 for the small box, I’d feel sick.”

“Good. Let’s test three first.”

Designing the experiment

This is where it gets interesting. You can’t just change the price on the website and see what happens, because if the new price is wrong you’ve damaged the brand. You also can’t ask people “would you pay $30?” in a survey, because people lie about pricing in surveys – not deliberately, but because stated preference and revealed preference are different things.

Lee proposes a simple framework:

Test the price before the commitment. New visitors who haven’t yet chosen a box see a landing page with the price. Half see $25. Half see $30. Everybody who clicks through can choose to continue at the price they saw. Measure two things: the click-through rate (do fewer people continue at $30?) and the conversion rate (of the people who continue, how many actually subscribe?).

Be honest about what you’re doing. If a subscriber asks why they saw $30 and their friend saw $25, don’t pretend it was a glitch. Explain that you were testing pricing and wanted to understand what the right number was. Offer them whichever price they would have preferred.

Lee pauses there. “That’s the design. Now the part I can’t do for you. I’ve watched plenty of these experiments and I know what they cost when they’re set up wrong – but the actual sample-size maths, how big and how long and how confident, isn’t my world. I can tell you a pricing test is worth running. I can’t tell you whether your traffic will let you read the result. You need that worked out before you agree on a duration.”

He looks across the table. Priya is already on it.

“Give me a minute. I want to be sure we can actually read a result before we agree to run this.”

She pulls her notepad towards her.

“The thing we’re worried about with $30 is people seeing the higher number and not subscribing. So the question I have to size up is: how many visitors per arm before we’d reliably notice a drop in conversion at $30, if there’s one to notice? The maths is symmetric – the sample size comes out the same whether we frame the change we’re looking for as a fall or a lift – but the fall is what would actually hurt us, so that’s the version I’ll plug in.”

She writes a few lines.

“Before I pick a number for ‘how big a drop’, let me work out what would actually count as bad for us. At $25 with seven percent conversion we make $1.75 a visitor. We’re testing $30 – so the question becomes: how far would conversion have to fall at $30 before the price hike is a loss instead of a win? Revenue per visitor matches when $30 × p equals $1.75, so p = $1.75 ÷ $30 – five point eight three percent. Anything below that and $30 is making us less money than $25, not more. So the drop we’d care about catching is conversion falling from seven percent to about five point eight – roughly a seventeen percent relative drop. That’s what I’ll size for.

“Two arms, $25 and $30. Conversion seven percent today. Target drop pinned at break-even – seventeen percent relative. Standard knobs on confidence and power. The back-of-envelope number is roughly seven thousand visitors per arm. We’re getting maybe a hundred and fifty visitors a week. That’s nearly a year per arm.”

Tom holds up a hand. “Nearly a year? Hold on – before you tell me the rest, can you walk us through where that number actually comes from? I hear ‘seven thousand per arm’ and I’ve got no idea what’s in the calculation.”

Priya nods. “Worth doing. Five things go into it, and they’re all things we have to commit to before we run the experiment, not after.”

She turns to a fresh page.

“First, the baseline. The conversion rate we’re comparing against. For us that’s seven percent – the fraction of visitors today who actually subscribe at $25.

“Second, the shift we want to be able to detect. We have to commit up front to a size of change that matters to us. I picked the break-even drop – seven percent falling to about five point eight, a roughly seventeen percent relative shift. If we’d only cared about catching a much bigger collapse – twenty-five percent, conversion all the way down to five point three – we’d need fewer visitors because the gap is bigger and easier to see through the noise. If we wanted to spot a subtler ten percent drop, seven sliding to six and a bit, we’d need a lot more. Small differences look a lot like noise, and the maths punishes us for asking it to see through them.

“Third, the null hypothesis. The ‘nothing’s actually happening here’ assumption – the default we’d hold to until the data pushes us off it. In our case: ‘$25 and $30 convert identically.’ The whole game is asking whether the gap we observe between the two arms could plausibly have come from chance, even if the prices really do convert identically. If yes, we shrug. If no, we’ve learned something.

“Fourth, alpha. How often we’re willing to be fooled by chance into thinking there’s an effect when there isn’t. Standard number is five percent – one experiment in twenty where we cry ‘effect!’ when actually nothing’s there.

“Fifth, power. The flip side. If there really is an effect, how likely are we to catch it? Standard number is eighty percent. So even with a perfectly designed experiment, one time in five we’d miss a real difference and write it off as noise.”

Tom nods slowly. “So alpha and power are us deciding how cautious we want to be in each direction.”

“Exactly. Once you’ve fixed all four – baseline, target shift, alpha, power – there’s a formula that turns them into a sample size. Two pieces of vocabulary inside that formula are worth pinning down before I show the working. The first is standard deviation. Think of it as the typical size of the wobble. Flip a fair coin a hundred times and you expect fifty heads, but you don’t get exactly fifty every time – sometimes forty-six, sometimes fifty-three. Standard deviation puts a number on that everyday wobble: the noise you get from a process even when nothing interesting is going on.

“The second is the z-score. That’s our signal measured in wobble-units: how many standard deviations away from boring-and-noisy a result sits. The bigger the z, the harder it gets to explain the result away as noise. Alpha and power feed into the formula as z-scores – five percent alpha and eighty percent power are roughly z = 1.96 and z = 0.84, numbers worth recognising on sight.”

She turns the notepad landscape to give herself more room and starts writing as she talks.

“Here’s the formula, for two proportions like ours. Sample size per arm:

Three pieces. The cautiousness term – (z₁ + z₂)² in blue – is alpha and power expressed as z-scores, added then squared. The noise term – the amber piece – has each arm contributing its own wobble, p(1−p), summed across the two arms. The signal term – (p₁ − p₂)² in green underneath – is the gap we want to detect, squared. Big signal, small sample needed. Small signal, the sample balloons – and the squaring on the gap is what makes it balloon.

“Plug ours in. Cautiousness: (1.96 + 0.84)² is about 7.84. Noise: 0.07 × 0.93 is about 0.065 in the $25 arm; 0.0583 × 0.9417 is about 0.055 in the $30 arm; sum is about 0.12. Signal: (0.07 − 0.0583)² is 0.0117², which is about 0.000137. Multiply cautiousness by noise: 7.84 × 0.12 is about 0.94. Divide by signal: 0.94 ÷ 0.000137 is just under seven thousand. Per arm. That’s where the number comes from.”

She underlines the result. “About seven thousand visitors per arm to spot the break-even drop. We get a hundred and fifty a week. Forty-six weeks per arm – nearly a year. If we’d settle for spotting a much bigger collapse – twenty-five percent or so – we could call it in about four months. For a subtler ten percent drop, we’re back over two years per arm. For a five percent drop, well inside what a small price tweak might plausibly move, the better part of a decade.”

Tom sits back. “So what does two weeks of data buy us?”

“Worth working out. Let me run the formula backwards. Two weeks gives us roughly three hundred visitors per arm. If I fix the cautiousness and noise where they were and solve for the gap instead – what’s the smallest difference we’d reliably catch at n = 300? – the maths gives me about six percentage points. So a real conversion rate would have to drop from seven percent to about one percent before our two-week test would reliably notice. Anything subtler than that, we miss most of the time.”

She thinks for a second.

“More usefully: take the gap we actually care about – the break-even drop, seven down to five point eight – and ask how often we’d correctly call it. At three hundred per arm the power drops from eighty percent to about eight percent. So even if $30 genuinely tips us over to break-even, our two-week test would flag it only about one time in twelve. The other eleven times we’d shrug and write it off as noise.

“And the flip side is just as ugly. Even if $25 and $30 convert identically, we’ll see a gap that looks significant about one time in twenty. That’s the alpha we picked, and alpha doesn’t get kinder when the sample’s small. So a ‘significant’ result at this scale could be a real effect we got lucky enough to spot – or it could be pure chance. We can’t tell which from the data alone.”

She caps the pen.

“At our volume the test is statistically blind. Whatever number comes out, we can’t separate signal from noise. What we can do is read the direction the gap leans – and decide in advance whether that’s enough to act on.”

Maya looks at Lee. “So the experiment can’t really prove anything in any sane window.”

“Not at your volume. Which means the choice you have isn’t ‘run it until it’s statistically valid’. It’s ‘run it long enough to see the shape of the signal, then act on the direction’. You’ll be moving from one defensible price to another defensible price with a lean, not a proof. That’s the only kind of pricing decision you can make at this stage.”

Maya thinks about it. “If we’re wrong by five percent in either direction, we can adjust. We won’t be wrong by fifty percent.”

“That’s the right framing. And it tells you what the time limit is for.”

Set a time limit. Two weeks, then stop – not because two weeks will give you certainty (it won’t), but because the time box is what limits your exposure. A pricing experiment is a temporary act of price discrimination, and the longer it runs the more it corrodes trust. The time limit is containment, not measurement.

Know what you’ll do with each outcome. Before starting, write down what decision you’ll make if revenue per visitor goes up, goes down, or barely moves. “The data is inconclusive” needs to be one of the outcomes you’ve planned for – because at this volume it’s the most likely one. Decide in advance whether a directional lean is enough to act on. If it isn’t, don’t run the experiment yet; save it for when you’ve got the traffic.

Priya adds one more thing. “And we write a note in the wiki – ‘small box price, revisit when weekly visitors exceed five thousand’. Future us deserves to know we acted on a lean, not on a proof.”

Tom has a concern. “What if conversion drops by ten percent at $30? Does that mean $30 is the wrong price?”

Lee thinks about it. “Not necessarily. If conversion drops ten percent but revenue per converted subscriber goes up twenty percent, you’re still ahead. The question isn’t ‘does conversion drop’ – it’s ‘does total revenue go up or down.’ And you have to weight that against the long-term effects on word-of-mouth, retention, and brand.”

Running the experiment

They run it for two weeks.

The setup is deliberately simple. Priya writes a small piece of code that assigns each new visitor to one of two groups at random and shows them the appropriate landing page. The price on the page is the price they’d pay if they subscribed. Nothing else about the site changes.

Over two weeks, 312 visitors see the $25 page. 298 visitors see the $30 page. The split is even enough to compare – and, as the team already knows going in, well short of what statistical confidence would require.

The results:

Small box pricing experiment: two weeks

Variant	Visitors	Clicked through	Subscribed	Revenue/week
$25 (control)	312	184 (59%)	22 (7.1%)	$550
$30 (test)	298	164 (55%)	19 (6.4%)	$570

Click-through drops from 59% to 55%. Conversion drops from 7.1% to 6.4%. But revenue per visitor goes up, because the people who subscribe at $30 are paying more than the people who subscribe at $25.

The total revenue from the $30 cohort is slightly higher than the $25 cohort, despite slightly fewer subscribers.

Reading the numbers

The team gathers round Priya’s laptop.

“This is roughly the shape we expected,” Priya says. “Twenty-two subscribers versus nineteen, out of about three hundred visitors each. The gap is well inside the noise – if we’d run the experiment in a different fortnight, those numbers could easily have flipped. The data isn’t conclusive. We knew going in it wouldn’t be.”

“Then what is it telling us?” Maya asks.

“Direction. Click-through is slightly lower at $30. Conversion is slightly lower. Revenue per visitor is slightly higher. That’s the shape you’d expect if $30 is closer to the right price than $25, and roughly the opposite of what you’d see if $30 were too high. It’s not proof. It’s a lean.”

Lee picks it up. “And the team agreed before we ran it that a lean is what we’d act on. The alternative was waiting years for a confidence we don’t actually need to make a five-dollar decision.”

The decision

Maya looks at the numbers again. “So we raise the price.”

“On a directional signal,” Lee says. “Not because the data proved anything, but because we said we’d act on direction and the direction is up. If it had pointed the other way we’d be having a much shorter meeting. The five percent more revenue per visitor isn’t huge – but the people who said yes at $30 are telling you they value the box at $30. The people who said no were probably never going to be great subscribers anyway. They would have subscribed for a month and cancelled.”

“Before you change anything – how do you feel about the subscribers who paid $25?”

Maya thinks. “I don’t want to raise the price on them. They signed up at $25 and that was the deal.”

“Good instinct. Honour the original price for existing subscribers indefinitely. New subscribers sign up at $30. Your existing subscribers feel looked after. Your new subscribers feel fairly treated. The only people who lose are the ones who would have subscribed at $25 but won’t at $30 – and the experiment tells you that’s a small group, and a group that probably wouldn’t have stuck around.”

It’s a clean decision, but it’s only clean because they measured first – and only honest because they were clear, before measuring, about what kind of evidence they’d accept.

What gets tested next

Maya and Lee work through the rest of the list. The $20 gap between small and large is the obvious next target – assumption 2 from the wall. After that, the mixed-sourcing pilot that’s been on the whiteboard for weeks. Each one a separate test. Each one starting the same way: write down the question, design the test, decide what you’d do with each outcome before you run it, measure, decide.

Pricing experiments. Not once. As a habit.

The team doesn’t know it yet, but the question on the wall is about to change. The discipline they’ve just learned will get its first real test on a deadline they didn’t pick. That’s a story for another week.

Lee writes a single sentence at the top of the pricing page in the team wiki:

“Price is an assumption until you’ve tested it. Test the assumptions you’d be most embarrassed to be wrong about first.”

Maya reads it, nods, and goes back to the kitchen to think about what to test next.