
Real-time, Batch, and Micro-Batching Inference Explained

Arseny
Jun 23, 2025 7:27:38 AM

When you put a machine learning model into production, it needs to process new data and return results, whether that’s classifying an image, recommending a product, or detecting potential fraud. This step is called inference, and the way you run it can vary depending on your system’s needs.

Sometimes results need to be delivered immediately, such as when a user is waiting for a response. Other times, it's fine to process a large number of requests in the background. Choosing between real-time and batch inference affects performance, cost, and system design.

In this post, we’ll explore the differences between these approaches, when to use each, and how micro-batching can help balance speed and efficiency.

Real-Time Inference

[Diagram: end user requests are routed via a load balancer to an autoscaling cluster of GPU servers.]

Use when:

  • low latency is critical to the user experience

Pros:

  • fast response times
  • simpler architecture

Cons:

  • more expensive to run (especially with GPU-backed models)
  • inefficient use of compute (lack of batching)
  • harder to scale with unpredictable or spiky traffic

In a real-time system, requests are typically routed through a load balancer to a cluster of workers running the model. Each request is handled individually as it arrives, so the system must remain responsive at all times.

To deal with changes in traffic, especially sudden spikes, the worker cluster often needs to autoscale. This can lead to higher infrastructure costs because the system has to be prepared for peak traffic.

Common implementations use lightweight APIs such as FastAPI or Flask, or dedicated model-serving tools like TensorFlow Serving or TorchServe.
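
To make this concrete, here is a minimal sketch of what a real-time endpoint could look like with FastAPI. The model loader and request schema are placeholders, not a real model.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def load_model():
    # Placeholder: load your real model (PyTorch, sklearn, etc.) once at startup.
    return lambda features: sum(features) / max(len(features), 1)

model = load_model()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Each request is scored individually as soon as it arrives, with no batching.
    return {"score": float(model(req.features))}
```

You would typically run several replicas of this service behind the load balancer and scale the replica count with traffic.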

Batch Inference

[Diagram: end user requests are accepted by a lightweight application server, buffered in a queue, and processed asynchronously by a cluster of GPU servers; the results are written to a database and served by the same application server.]

Use when:

  • immediate results are not required
  • data comes in at set times

Pros:

  • more efficient use of compute (requests are processed in batches)
  • no need to handle sudden spikes in traffic

Cons:

  • more complex to set up
  • higher delay before results are ready

Batch inference involves collecting incoming data into a buffer, such as a queue or database, and processing it all at once. This lets you start servers only when needed, which helps save on infrastructure costs. It works best when new data arrives at regular intervals, like hourly, daily, or weekly, and when immediate results are not required.

Because the buffer evens out the load, the system does not need to react instantly to spikes in traffic. This often means cluster autoscaling is less critical and can be simpler to manage.

Processing in batches also increases efficiency. For example, an LLM can achieve up to four times higher throughput when running batched requests compared to processing them one at a time[1]. Similarly, image generation models like Stable Diffusion can generate several images much faster[2] when run in batches rather than individually.
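
As a rough illustration, a scheduled batch job might drain the buffer and score it in fixed-size chunks. The queue reader, batched model call, and result writer below are stand-ins for whatever storage and model you actually use.

```python
BATCH_SIZE = 32

def fetch_buffered_inputs():
    # Stand-in for reading buffered requests from a queue or database table.
    return [{"id": i, "features": [float(i), float(i) * 2]} for i in range(100)]

def batched_model(list_of_features):
    # Stand-in for one batched model call per chunk.
    return [sum(features) for features in list_of_features]

def save_results(results):
    # Stand-in for writing predictions back to a database for later serving.
    print(f"saved {len(results)} predictions")

def run_batch_job():
    rows = fetch_buffered_inputs()
    for start in range(0, len(rows), BATCH_SIZE):
        chunk = rows[start:start + BATCH_SIZE]
        scores = batched_model([row["features"] for row in chunk])
        save_results(list(zip([row["id"] for row in chunk], scores)))

if __name__ == "__main__":
    run_batch_job()  # in production, triggered on a schedule (cron, Airflow, etc.)
```

In a setup like the one in the diagram above, a scheduler kicks off this job at set times, and the application server only reads precomputed results from the database.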

Micro-Batching Inference

[Diagram: micro-batching architecture, where requests are briefly batched in a queue but responses are still returned in real time.]

Use when:

  • you need real-time responses but want better compute efficiency
  • the system sees a consistently high request load

Pros:

  • more efficient than pure real-time by processing small batches
  • keeps latency low enough for many user-facing cases

Cons:

  • adds some complexity compared to pure real-time
  • introduces slight delay due to buffering

Micro-batching works by briefly collecting incoming requests, typically for a few milliseconds, to accumulate a small batch before processing. This buffering can be done at the load balancer or within an intermediate queue.
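
Here is a toy asyncio sketch of that buffering logic; the batch size, time window, and doubling "model" are made up for illustration. Serving frameworks such as TorchServe offer this kind of dynamic batching out of the box, so you rarely have to write it yourself.

```python
import asyncio

MAX_BATCH = 8    # flush once this many requests are buffered...
WINDOW_MS = 10   # ...or after this many milliseconds, whichever comes first

def batched_model(inputs):
    # Stand-in for a single batched forward pass over all buffered inputs.
    return [x * 2 for x in inputs]

async def batcher(queue):
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = batched_model([x for x, _ in batch])  # one model call per batch
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)  # each caller receives its own answer

async def predict(queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut  # still looks like a real-time call to the client

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(predict(queue, i) for i in range(20))))

asyncio.run(main())
```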

A good example is DeepSeek, which uses micro-batching by collecting incoming requests within a short time window, typically between 5 and 200 milliseconds, and combining them into a single batch for processing. They also separate requests that need fast responses from those optimized for throughput and route them through different parts of the model[3].

Micro-batching works best in environments with a high, steady flow of requests.

Conclusion

There’s no one-size-fits-all solution when it comes to running inference in production. Real-time is great when speed matters, batch is ideal when you can wait and want to save resources, and micro-batching helps you get the best of both when you’re handling a high volume of requests. Understanding the trade-offs between them can help you build a system that is efficient, reliable, and suited to your real-world needs.
