The Question Your Observability Vendor Won't Answer

This year marks a decade for me in observability.

I left my engineering job in 2016 to start Timber.io, a hosted logging platform, because I thought logs could be simple and great. Timber became Vector. Vector got mass adoption. It got acquired, and I stayed for three years.

And somewhere along the way, the optimism curdled.

I'm not a cynical person. I believed observability could make engineers' lives better. But after a decade, after hundreds of conversations with teams bleeding money across every major vendor, after hearing firsthand how their vendors strong-armed them instead of helping, I've seen enough. The whole industry has lost the plot.

Does any of this sound familiar?

You run observability at your company. But really, you're the cost police. You wake up to a log line in a hot path, or a metric tag that exploded cardinality. You chase down the engineer. They didn't do anything wrong; they're just disconnected from what any of this costs. The renewal is always in the back of your mind because mismanaging it reflects poorly on you.

Sometimes you catch these mistakes. Sometimes you don't. When you don't, you crawl to your rep asking for forgiveness. Maybe they help the first time, even the second. By the fourth or fifth, they stop. "It's your data."

Even with the mistakes, if you're diligent, checking dashboards and staying on top of things, you manage to stay under your commit and avoid an early renewal. But the renewal still gives you a black eye: 40% higher than last year. Your budget didn't grow that much. So you consider switching vendors, but asking your engineers to frantically migrate dashboards and alerts and change their workflows is a distraction that also reflects poorly on you. You're in a lose-lose situation.

So you go back to your vendor and ask them to help. You championed them internally and brought them six- or seven-figure business. Surely they'd return the favor: a slightly bigger discount, or help cutting costs by showing you what data is safe to drop. But they don't budge. They could help; they don't.

Case Taintor, Director of Engineering at Klarna, put it all too well:

"The most frustrating part of watching your money burn is knowing your supplier could help if they only cared about your long term success."

So why has this gone on for over a decade? Something is deeply wrong if after ten years these same problems not only exist, but have gotten worse.

But what's wrong, exactly? Should your vendor help you? It is your data. They didn't create it. You sent it to them under their pricing model. For years I accepted that framing too. Maybe this is just how it works.

Then I bumped into a question that changed my thinking.

How much of my observability data is waste?

You've asked it. Your vendor has asked it. You know the answer isn't zero. But what is it? 10%? 20%? 40%? At what point does "that's just how it works" stop being an acceptable answer?

You see, anyone who's been in this space knows that cost is far and away the biggest problem. You can take all of the other problems, bundle them together, multiply them by 100, and they still would not surpass cost. It shows up everywhere. All of the "innovation" in observability can be traced back to cost in some way. Pipelines? Cost. Fancy new storage engines? Cost. OpenTelemetry? Yes, cost.

So in that context, this seems like a pretty important question. Maybe the most important question in observability. Which means it must be unanswerable, right? Because if someone could answer it and let you keep paying for garbage anyway, that would be unconscionable.

Put it to the test. Ask your vendor what percentage of your data is waste. They'll play ignorant. "It's your data." They don't understand it well enough to tell you what's worth keeping. But they understand it well enough to sell you an AI SRE that can "root cause in minutes."

It's this willful ignorance that gets me. Everyone knows what's right but plays the quarterly earnings game instead. Except it's not a game for the people on the other side. I got a front row seat with Vector users. Vector wasn't deployed for fun; it was often deployed in crisis, usually around renewal time when the cost of this game came due. I watched people lose their jobs for "mismanaging" the observability budget. I saw the stress on their faces, the lost sleep.

So when I first bumped into this question while helping a Vector user, and wanted to answer it but couldn't, that's when my optimism curdled.

So I answered it

After I left Vector, the question stayed with me. I took a year off, but Vector users still found me with questions. One user in particular stood out because they were impossible to ignore: emails, LinkedIn messages, people in my network pinging me on their behalf. I wasn't annoyed. I knew exactly what was going on. So I agreed to help. Except this time, no roadmaps, no one telling me what to do. In exchange, they'd give me access to their data so I could try to answer the question, which I suspected was their actual problem anyway.

So I signed all the docs, got access to their Vector environment, and took a look at their Vector config. It was the mother of all configs (sorry guys, no offense). Dozens of components connected into a complex DAG. Every cost reduction trick in the book: sampling, aggregating, storage tiering, archiving, and a massive list of regexes to match and drop waste. But I wasn't appalled; I respected it. They weren't being careless; they were doing everything they possibly could.

One trick in particular intrigued me: the regex list. It was the bottleneck, but it was also something else: an expression of understanding. Every pattern represented an engineer who understood their service well enough to say "this is waste." My first instinct was to optimize it. I stumbled on Hyperscan. Turns out you can compile tens of thousands of patterns and still match at line rate. That flipped my thinking: what if I took this to the extreme and automated that understanding to produce thousands of patterns?
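To make the regex-list idea concrete, here is a minimal sketch of dropping log lines that match any pattern in a list. It is not Hyperscan and not the config described above: it uses Python's standard `re` module with all patterns joined into one alternation, which is fine for a handful of patterns but will not hold line rate at tens of thousands. The patterns and the names `DROP_PATTERNS`, `compile_drop_matcher`, and `filter_waste` are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Hypothetical drop patterns. In practice each one would encode an
# engineer's judgment that a given kind of line is waste.
DROP_PATTERNS = [
    r"^DEBUG ",
    r"health ?check (ok|passed)",
    r"connection pool stats: \d+ idle",
]


def compile_drop_matcher(patterns: Iterable[str]) -> re.Pattern:
    """Join all patterns into a single compiled alternation.

    Each pattern is wrapped in a non-capturing group so that joining
    with '|' cannot change what an individual pattern means.
    """
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def filter_waste(lines: Iterable[str], matcher: re.Pattern) -> Iterator[str]:
    """Yield only the lines that match none of the drop patterns."""
    for line in lines:
        if matcher.search(line) is None:
            yield line


if __name__ == "__main__":
    sample = [
        "DEBUG cache warm",
        "ERROR payment failed",
        "GET /healthz healthcheck ok",
        "connection pool stats: 12 idle",
    ]
    for kept in filter_waste(sample, compile_drop_matcher(DROP_PATTERNS)):
        print(kept)
```

The point of the sketch is the shape of the problem: the list is the knowledge and the matcher is only plumbing. A multi-pattern engine like Hyperscan changes how long the list can get, not what the list means.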

So I built a system to do exactly that. It compressed billions of logs into thousands of semantic events, each one evaluated with the context it needed: the service, the failure scenarios, the patterns, how it all fits together. (The deep details are outside the scope of this post, but if you're curious, here's how it works today.)
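For a rough feel of what "compressing" logs into events can mean, here is a deliberately naive sketch. It masks variable tokens (UUIDs and numbers) so that lines differing only in those values collapse into one template, then counts the templates. This is a stand-in technique chosen for illustration, not the system described above, which evaluates each event with far more context. The names `to_template` and `summarize` and the masking rules are assumptions.

```python
import re
from collections import Counter
from typing import Iterable

# Masking rules are applied in order: UUIDs first, so their digits are
# not picked up by the generic number rule afterwards.
_MASKS = [
    (
        re.compile(
            r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
            re.IGNORECASE,
        ),
        "<uuid>",
    ),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<num>"),
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens with placeholders."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many raw lines collapse into each template."""
    return Counter(to_template(line) for line in lines)


if __name__ == "__main__":
    logs = [
        "user 42 logged in",
        "user 7 logged in",
        "job 123e4567-e89b-12d3-a456-426614174000 took 12.5 ms",
    ]
    for template, count in summarize(logs).most_common():
        print(count, template)
```

Even this crude version shows why the numbers shrink so dramatically: most log volume is a small set of statements repeated with different values.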

I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.
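One note on reading per-service numbers next to an overall figure: an overall waste percentage is naturally weighted by volume, so it need not equal the plain mean of the per-service percentages. The sketch below shows that calculation. The service names and volumes are made up for illustration and are not the data from this engagement; only the three waste fractions echo the figures above.

```python
from typing import Dict, Tuple


def overall_waste_fraction(services: Dict[str, Tuple[float, float]]) -> float:
    """Volume-weighted waste fraction.

    `services` maps a service name to (volume, waste_fraction), where
    volume is in any consistent unit (GB/day, lines/day, ...).
    """
    total_volume = sum(volume for volume, _ in services.values())
    if total_volume <= 0:
        raise ValueError("total volume must be positive")
    wasted = sum(volume * fraction for volume, fraction in services.values())
    return wasted / total_volume


if __name__ == "__main__":
    # Hypothetical volumes; waste fractions mirror the ~40%, ~60%, ~30% above.
    example = {
        "service-a": (500.0, 0.40),
        "service-b": (100.0, 0.60),
        "service-c": (400.0, 0.30),
    }
    print(f"{overall_waste_fraction(example):.0%}")
```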

I knew the number wasn't zero, but I wasn't expecting 40%. So I pressure tested it. Went through hundreds of lines manually. Checked it against their existing patterns. It checked out. With that confidence, I brought it to them.

They laughed. "We can't just drop half of our logs." Fair. But that's not what I was asking. I showed them: this wasn't anything new. It was the same analysis they were already doing, just at scale, more complete, more accurate. Most of their hand-written patterns were already represented in my set, often simpler and faster. They could tweak the analysis, roll it out slowly, push it to teams to take action in their own code.

And that's what happened. The knowledge stopped the bleeding. Over time, services cleaned up their logging. Pipelines got simpler. Bills went down. Not because anyone dropped data recklessly, but because they finally knew what was worth keeping.

Why observability feels broken

The answer to this question isn't just a number. It's the answer to why observability feels broken despite being more expensive than ever. Think about it.

On the surface: at ~40% waste, you're paying roughly 1.7 times what you should. Cut the waste, cut the bill. Simple.
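A quick check of that arithmetic, under the simplifying assumption that the bill scales linearly with ingested volume (real pricing has commits, tiers, and per-product SKUs, so treat this as a rough guide). If a fraction w of the data is waste, the useful data costs (1 - w) of the bill, so you pay 1 / (1 - w) times what the useful data alone would cost. The function name `overpayment_factor` is hypothetical.

```python
def overpayment_factor(waste_fraction: float) -> float:
    """How many times more you pay than the useful data alone would cost.

    Assumes cost is proportional to volume, which is a simplification.
    """
    if not 0.0 <= waste_fraction < 1.0:
        raise ValueError("waste_fraction must be in [0, 1)")
    return 1.0 / (1.0 - waste_fraction)


if __name__ == "__main__":
    for w in (0.30, 0.40, 0.50, 0.60):
        print(f"{w:.0%} waste -> paying {overpayment_factor(w):.2f}x")
```

Under this assumption, 40% waste works out to about 1.67x, and the factor reaches 2x at 50% waste.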

Go deeper: the cost policing, the weekly dashboard checks, the monthly exercises, the begging your rep for forgiveness when someone's log blows up the bill, the pipelines. All of that exists because you're managing garbage. Half the complexity you've built is dedicated to noise.

Go deeper still: your engineers complain that observability doesn't help them debug faster despite costing millions. Of course it doesn't. They're drowning in noise and calling it data. The alerts fire on garbage. The dashboards are cluttered with garbage. The AI can't find the signal because there's too much garbage in the way.

And underneath all of it: this number wouldn't exist if your vendor were aligned with you.

Take a look around the market. $65M bills. $170M bills. Entire roles for cost control. "Observe without limits." "Stop sampling." "More data, more insight." Dozens of products. It's all backwards. The goal isn't more data, more products, or more complexity.

The goal is understanding with less.

And how do you prove understanding? The question. Either you understand the data well enough to answer it or you don't.

There's a future where you're not the cost cop. Where observability just works. Where your vendor's success depends on yours.

That's the future we're building at Tero.

Get your number.