
Each GPU cluster has dead time. Training jobs end, workloads shift, and hardware sits idle while power and cooling costs keep accruing. For neocloud operators, those empty cycles are lost margin.
The obvious solution is GPU spot markets: renting excess capacity to whoever needs it. But spot deals mean the cloud provider is still just leasing hardware, and the engineers who buy that capacity are still paying for raw compute with no inference stack attached.
FriendliAI’s answer is different: run inference directly on the unused hardware, optimize token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became the basis for vLLM, the open-source inference engine used in most production deployments today.
Chun spent more than a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique admits and retires inference requests at every decoding iteration rather than waiting for a fixed batch to finish before scheduling the next one. It is now an industry standard and the core mechanism inside vLLM.
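The idea behind continuous batching can be sketched in a few lines. The simulation below is illustrative only — the `Request` class and the step accounting are invented for this example, not vLLM's or FriendliAI's actual scheduler. Each decode step produces one token per active request; finished requests free their slot immediately, and waiting requests join mid-batch.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    max_new_tokens: int   # tokens this request wants to generate
    generated: int = 0    # tokens generated so far

def continuous_batching(incoming, max_batch: int = 4) -> int:
    """Iteration-level scheduling: requests join and leave the batch
    at every decode step instead of waiting for the whole batch to
    drain. Returns the number of decode steps needed."""
    waiting = deque(incoming)
    running: list[Request] = []
    steps = 0
    while waiting or running:
        # Admit waiting requests the moment a slot frees up.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: one new token per running request.
        for req in running:
            req.generated += 1
        # Retire finished requests right away, freeing slots mid-batch.
        running = [r for r in running if r.generated < r.max_new_tokens]
        steps += 1
    return steps
```

With static batching and a batch size of 2, requests needing 2, 5, and 3 tokens would take 8 steps (5 for the first pair, then 3 for the straggler); continuous batching finishes in 5, because the short request's slot is reused as soon as it completes.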
This week, FriendliAI launches a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a portion of the token revenue. The operator’s own jobs always take priority: the moment an operator reclaims a GPU, InferenceSense steps aside.
"What we offer is that instead of letting the GPUs sit idle, by running inference they can monetize those idle GPUs," Chun told VentureBeat.
How a Seoul National University lab built the engine inside vLLM
Chun founded FriendliAI in 2021, before most of the industry shifted its focus from training to inference. The company’s core product is a dedicated inference endpoint service for AI startups and companies running open models. FriendliAI is also listed as a deployment option on Hugging Face alongside Azure, AWS, and GCP, and currently supports over 500,000 open models on the platform.
InferenceSense now extends that inference engine to the capacity problem faced by GPU operators between workloads.
How it works
InferenceSense runs on top of Kubernetes, which most neocloud operators already use for resource orchestration. An operator assigns a group of GPUs to a Kubernetes cluster managed by FriendliAI, declaring which nodes are available and under what conditions they can be reclaimed. Idle detection happens through Kubernetes itself.
"We have our own orchestrator running on the GPUs of these neocloud (or simply cloud) providers," Chun said. "We definitely leverage Kubernetes, but the software running on top of it is a really highly optimized inference stack."
When GPUs sit idle, InferenceSense spins up isolated containers that serve paid inference workloads on open-weight models, including DeepSeek, Qwen, Kimi, GLM, and MiniMax. When the operator’s scheduler needs the hardware back, the inference workloads are evicted and the GPUs are returned. FriendliAI says the handoff takes seconds.
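The reclaim path can be sketched as follows. Everything here is a hypothetical illustration — the class and method names are invented, not FriendliAI's API, and a real system would drain or re-queue in-flight requests rather than simply drop them:

```python
import time

class InferenceWorker:
    """Hypothetical sketch of the reclaim path: operator jobs always
    win, so on a reclaim signal the worker stops admitting requests,
    evicts what is in flight, and hands the GPU back."""

    def __init__(self) -> None:
        self.admitting = True
        self.in_flight: list[str] = []

    def submit(self, request_id: str) -> bool:
        """Accept a paid inference request only while the node is free."""
        if not self.admitting:
            return False              # rejected: node is being reclaimed
        self.in_flight.append(request_id)
        return True

    def reclaim(self) -> float:
        """Release the GPU; returns elapsed seconds for the handoff."""
        start = time.monotonic()
        self.admitting = False        # 1. stop taking new work
        self.in_flight.clear()        # 2. evict (a real system re-queues)
        # 3. container teardown and GPU release would happen here
        return time.monotonic() - start
```

The key design point is the ordering: admission stops before eviction, so no new paid work can land on a node the operator has already claimed back.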
Demand is aggregated through FriendliAI’s direct clients and through inference aggregators such as OpenRouter. The operator supplies the capacity; FriendliAI handles the demand pipeline, model optimization, and serving stack. There are no upfront fees or minimum commitments. A real-time dashboard shows operators which models are running, tokens processed, and revenue accrued.
Why token yield outperforms raw capacity rental
GPU spot markets from vendors like CoreWeave, Lambda Labs, and RunPod involve the cloud provider renting its own hardware out to third parties. InferenceSense runs on hardware the neocloud operator already owns, and the operator defines which nodes participate and agrees on scheduling terms with FriendliAI in advance. The distinction matters: spot markets monetize capacity; InferenceSense monetizes tokens.
Token throughput per GPU-hour determines how much InferenceSense can actually earn during idle windows. FriendliAI claims its engine delivers two to three times the performance of a standard vLLM deployment, though Chun notes the figure varies by workload type. Most competing inference stacks are built on open-source, Python-based frameworks. FriendliAI’s engine is written in C++ and uses custom GPU kernels instead of Nvidia’s cuDNN library. The company has built its own layer for partitioning and running models across hardware, with its own implementations of speculative decoding, quantization, and KV cache management.
Because the FriendliAI engine processes more tokens per GPU-hour than a standard vLLM stack, operators should earn more revenue per idle cycle than they could by running an inference service themselves.
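The economics reduce to simple arithmetic. All the numbers below are illustrative assumptions, not FriendliAI's actual pricing, throughput, or revenue split:

```python
def idle_revenue_usd(idle_gpu_hours: float,
                     tokens_per_gpu_hour: float,
                     price_per_m_tokens: float,
                     operator_share: float) -> float:
    """Operator payout from filling idle GPU-hours with paid inference.

    All inputs are hypothetical; real throughput, token prices, and
    the revenue split are set by the platform and the market.
    """
    tokens_served = idle_gpu_hours * tokens_per_gpu_hour
    gross_usd = tokens_served / 1e6 * price_per_m_tokens
    return gross_usd * operator_share

# Example: 100 idle GPU-hours at 1M tokens/GPU-hour, $0.50 per 1M
# tokens, with a hypothetical 60% operator share.
payout = idle_revenue_usd(100, 1_000_000, 0.50, 0.60)
```

The payout scales linearly with engine throughput: a 2x faster engine doubles the revenue extracted from the same idle window, which is why tokens per GPU-hour, not raw capacity, is the metric that matters here.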
What AI engineers should consider when evaluating inference costs
For AI engineers evaluating where to run inference workloads, the choice between neoclouds and hyperscalers has typically come down to price and availability.
InferenceSense adds a new consideration: if neoclouds can monetize idle capacity through inference, they have more economic incentives to keep token prices competitive.
That’s not a reason to change infrastructure decisions today: it’s still early. But engineers tracking the total cost of inference should watch to see whether the adoption of platforms like InferenceSense in the neocloud puts downward pressure on API prices for models like DeepSeek and Qwen over the next 12 months.
"When we have more efficient suppliers, the overall cost will go down," Chun said. "With InferenceSense we can help make these models cheaper."





