The signs are already clear: the free era has ended. Two years ago we lived in a beautiful illusion where compute felt like water from a tap that could run forever. Now? Every token has a price, and that price has skyrocketed.

What’s interesting is how it all started. When API calls were still dirt cheap, everyone used them recklessly. We threw thousands of words into prompts without thinking, and asked the most advanced models to do trivial tasks like capitalizing the first letter of sentences. Why? Because it was nearly free, subsidized by giant investors. That subsidy has now ended.

This change isn’t just about higher prices on the dashboard. It’s a fundamental shift in how we have to think about AI infrastructure. Token consumption, once ignored, is now a critical line item in any cost center. A single API call can be worth thousands of rupiah once volume is high. For a startup handling millions of requests per day, this is no longer an optional concern; it’s a survival issue.

There are three places where we bleed tokens without realizing it. First, overly long system prompts. We like to write ultra-detailed instructions for output stability, but every instruction is a paid token, and every conversation has to re-send those thousands of tokens. Second, out-of-control RAG. The ideal of RAG is to retrieve the three most relevant sentences and hand them to the model. The reality? The retrieval step pulls ten long PDFs with thousands of words each and dumps them all into the context. We think we’re asking a simple question; the model is being asked to read half a library. Third, agents stuck in infinite loops. If the control logic is poor and an upstream API is down, the agent can keep spinning, with every iteration burning expensive output tokens.
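The loop problem in particular is cheap to guard against. Here is a minimal sketch in Python, assuming hypothetical `step_fn` and `is_done` callables standing in for your agent step and stop check, with hard caps on both iterations and tokens spent (both budget numbers are illustrative):

```python
MAX_ITERATIONS = 5          # hard cap on agent steps (assumed budget)
MAX_TOKENS_SPENT = 20_000   # abort once the run has burned this many tokens

def run_agent(step_fn, is_done):
    """Run an agent loop with hard iteration and token budgets.

    step_fn() -> (output_text, tokens_used); is_done(output) -> bool.
    Both callables are placeholders for a real agent step and stop check.
    """
    tokens_spent = 0
    for i in range(MAX_ITERATIONS):
        output, tokens = step_fn()
        tokens_spent += tokens
        if is_done(output):
            return output
        if tokens_spent >= MAX_TOKENS_SPENT:
            raise RuntimeError(f"token budget exhausted after {i + 1} steps")
    raise RuntimeError("iteration cap reached without a final answer")
```

The point is not the specific numbers but that both limits exist: a broken tool or a down API then costs at most a bounded amount instead of an open-ended bill.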

Now comes the interesting part: how do we get out of this hole? There are three weapons that are now essential, no longer optional. A semantic cache can be a game changer because user questions tend to repeat. If users ask “how to reset password” over and over, we can cache the answer and return it directly without hitting the large model: seconds become milliseconds, and the token cost is zero. Prompt compression with entropy-based algorithms can squeeze 1,000 tokens into 300 without losing meaning. The compressed prompt looks like gibberish to a human, machines talking to machines, but the model’s attention mechanism is strong enough to parse it. That alone can save around 70% of cost.
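A semantic cache can be sketched in a few lines. This toy version uses a bag-of-words embedding and cosine similarity purely for illustration; in production you would swap in a real embedding model and a vector store, and the 0.8 similarity threshold is an assumption you would tune:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; replace with a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []          # list of (embedding, cached answer)
        self.threshold = threshold

    def get(self, question):
        q = embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]         # cache hit: zero model tokens spent
        return None                # cache miss: caller falls through to the LLM

    def put(self, question, answer):
        self.entries.append((embed(question), answer))
```

The design choice that makes this “semantic” rather than a plain key-value cache is the similarity match: “how to reset my password” hits an entry stored under “how to reset password” even though the strings differ.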

But the most sophisticated weapon is model routing. Don’t send every task to the most expensive model. Simple entity extraction? Route it to Llama 3 8B or Claude Haiku, which are very cheap. Complex reasoning and coding? Use GPT-4o or Claude Sonnet. It’s like a well-run company: receptionists don’t bother the CEO with simple matters. Whoever implements this routing smoothly can cut token costs by up to a third compared to competitors.
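A routing layer doesn’t need to be fancy to pay off. A minimal sketch, where the model names, prices, and task heuristics are illustrative assumptions rather than a real price sheet:

```python
# Hypothetical price table (cost per 1M input tokens); real prices change often.
MODELS = {
    "small":  {"name": "llama-3-8b",   "cost": 0.05},
    "medium": {"name": "claude-haiku", "cost": 0.25},
    "large":  {"name": "gpt-4o",       "cost": 2.50},
}

# Task types assumed cheap enough for the smallest tier.
SIMPLE_TASKS = {"entity_extraction", "classification", "formatting"}

def route(task_type, prompt):
    """Pick the cheapest model tier that can plausibly handle the task.

    The heuristics here (task-type buckets, a crude prompt-length cutoff)
    are placeholders; real routers often use a small classifier model.
    """
    if task_type in SIMPLE_TASKS and len(prompt) < 2000:
        return MODELS["small"]
    if task_type in {"summarization", "qa"}:
        return MODELS["medium"]
    return MODELS["large"]   # reasoning, coding, and anything unrecognized
```

Note the default: unknown tasks fall through to the strongest model, so routing mistakes degrade cost, never quality.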

If we look at leading agent frameworks like OpenClaw and Hermes, they’re already ahead of the curve. OpenClaw is obsessive about token control. Instead of stacking the full context, it forces the model to output strict JSON schemas or other compact formats: not “talk freely” but “fill in the form.” That’s an elegant way to save data in an era of compute scarcity. Hermes takes a different approach: a dynamic memory mechanism. Working memory holds only the last three to five turns; when that limit is exceeded, a lightweight model summarizes the older turns into key points and stores them in a vector database. This isn’t throwing out the trash; it’s memory surgery. Fine-grained context management like this drastically reduces compute costs at the macro level.
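The dynamic-memory idea is easy to sketch independently of any particular framework. The following is not Hermes’ actual implementation, only an illustration of the pattern described above: a fixed-size working memory whose overflow is compressed by a summarizer (stubbed here; in practice a cheap model call) and kept as a compact summary:

```python
WORKING_MEMORY_LIMIT = 4   # keep only the last N turns verbatim (the article says 3-5)

def summarize(turns):
    """Stub for a lightweight-model summarizer call.

    Here we just keep a truncated first line of each turn; a real system
    would call a cheap LLM and store the result in a vector database.
    """
    return "SUMMARY: " + " | ".join(t.split("\n")[0][:40] for t in turns)

class DynamicMemory:
    def __init__(self):
        self.summary = ""   # compressed long-term memory
        self.recent = []    # verbatim working memory

    def add(self, turn):
        self.recent.append(turn)
        if len(self.recent) > WORKING_MEMORY_LIMIT:
            overflow = self.recent[:-WORKING_MEMORY_LIMIT]
            if self.summary:
                overflow = [self.summary] + overflow   # fold old summary back in
            self.summary = summarize(overflow)
            self.recent = self.recent[-WORKING_MEMORY_LIMIT:]

    def context(self):
        """Build the prompt context: compact summary plus recent verbatim turns."""
        parts = ([self.summary] if self.summary else []) + self.recent
        return "\n".join(parts)
```

The token saving comes from `context()`: no matter how long the conversation runs, the prompt carries a bounded number of verbatim turns plus one short summary, instead of the whole history.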

But beyond all these technical fixes, there’s a more fundamental mindset shift. In the cheap era we treated tokens with a consumer mindset: anything on discount went straight into the cart. Companies wired LLMs into internal systems at random, gave access to every employee, and even asked the AI to generate cafeteria menus. The result? Bill shock at the end of the month.

Now it has to be an investment mindset. Every token spent is an investment that must justify its ROI. If the tokens are used up, what comes back? A higher ticket-closure rate? Shorter bug-fix times? Or just a “haha, funny AI” reaction? If a rule-engine feature costs 0.1 yuan per request but the LLM version costs 1 yuan and lifts conversion by only 2%, just cut it. Don’t chase grand AI fantasies; switch to targeted, precise deployments. Every token should be treated like gold to be forged.
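The arithmetic in that example is worth making explicit. Assuming a hypothetical value of 30 yuan per extra conversion (a number you would replace with your own), the LLM version loses money on every request:

```python
# Per-request economics from the rule-engine vs. LLM example above.
rule_engine_cost = 0.10      # yuan per request
llm_cost = 1.00              # yuan per request
conversion_lift = 0.02       # extra conversions per request from the LLM
value_per_conversion = 30.0  # ASSUMED for illustration; plug in your own number

extra_revenue = conversion_lift * value_per_conversion   # 0.60 yuan per request
extra_cost = llm_cost - rule_engine_cost                 # 0.90 yuan per request
roi_per_request = extra_revenue - extra_cost             # negative -> cut the feature
```

Under these assumptions the LLM feature loses 0.30 yuan per request, which is exactly the “just cut it” case.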

Finally, this cost increase isn’t a crisis; it’s a purification. It pops the bubble created by unlimited subsidies and forces everyone back to reality. It weeds out superficial players who can only write prompts and wander around, passing the torch to core teams who truly understand architecture, model routing, and how to squeeze the most out of edge compute. When the tide goes out, we’ll see who was swimming naked. This time, the ones who survive and thrive will be those who treat every token as a precious resource and can confidently get back more than they spend. They will dominate the next era of AI infrastructure.