Real-time, Batch, and Micro-Batching Inference Explained

When you put a machine learning model into production, it needs to process new data and return results, whether that’s classifying an image, recommending a product, or detecting potential fraud. This step is called inference, and the way you run it can vary depending on your system’s needs.
Sometimes results need to be delivered immediately, such as when a user is waiting for a response. Other times, it’s fine to process a large number of requests in the background. Choosing between real-time and batch inference affects performance, cost, and system design.
In this post, we’ll explore the differences between these approaches, when to use each, and how micro-batching can help balance speed and efficiency.
Real-Time Inference
Use when:
- low latency is critical to the user experience
Pros:
- fast response times
- simpler architecture
Cons:
- more expensive to run (especially with GPU-backed models)
- inefficient use of compute (lack of batching)
- harder to scale with unpredictable or spiky traffic
In a real-time system, requests are typically routed through a load balancer to a cluster of workers running the model. Each request is handled individually as it arrives, so the system must remain responsive at all times.
To deal with changes in traffic, especially sudden spikes, the worker cluster often needs to autoscale. This can lead to higher infrastructure costs because the system has to be prepared for peak traffic.
Common implementations use lightweight APIs such as FastAPI or Flask, or dedicated model-serving tools like TensorFlow Serving or TorchServe.
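As a rough illustration, here is a minimal endpoint of this kind built with FastAPI. It is a sketch, not a production setup: the model file name, request schema, and feature layout are placeholders, assuming a scikit-learn-style model loaded once at startup.

```python
# Minimal real-time inference endpoint (sketch).
# Assumes a pre-trained scikit-learn-style model saved as "model.pkl";
# the file name and feature layout are illustrative placeholders.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # loaded once at startup, shared by all requests

class PredictRequest(BaseModel):
    features: list[float]  # one feature vector per request

@app.post("/predict")
def predict(request: PredictRequest):
    # Each request is handled individually as it arrives (no batching).
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()[0]}
```

Running this with uvicorn gives you a single worker; the load balancer in front would spread incoming requests across several such workers.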
Batch Inference
Use when:
- immediate results are not required
- data comes in at set times
Pros:
- more efficient use of compute, since requests are processed in large batches
- no need to handle sudden spikes in traffic
Cons:
- more complex to set up
- higher delay before results are ready
Batch inference involves collecting incoming data into a buffer, such as a queue or database, and processing it all at once. This lets you start servers only when needed, which helps save on infrastructure costs. It works best when new data arrives at regular intervals, like hourly, daily, or weekly, and when immediate results are not required.
Because the buffer evens out the load, the system does not need to react instantly to spikes in traffic. This often means cluster autoscaling is less critical and can be simpler to manage.
Processing in batches also increases efficiency. For example, an LLM can achieve up to four times higher throughput when running batched requests compared to processing them one at a time[1]. Similarly, image generation models like Stable Diffusion can generate several images much faster[2] when run in batches rather than individually.
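As a sketch of how such a job might look, the snippet below reads an accumulated buffer of requests from a file, runs the model over it in fixed-size chunks, and writes the results back out. The file paths, chunk size, and model format are illustrative assumptions; in practice the buffer is often a queue or database table, and the job is triggered by a scheduler such as cron or Airflow.

```python
# Batch inference job (sketch) meant to be run on a schedule.
# The input/output paths, chunk size, and model file are assumptions.
import pickle

import pandas as pd

CHUNK_SIZE = 1024  # number of rows sent to the model at once

def run_batch_job(input_path: str = "pending_requests.parquet",
                  output_path: str = "predictions.parquet") -> None:
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    df = pd.read_parquet(input_path)  # the accumulated buffer of requests
    predictions = []

    # Process the buffered data in fixed-size chunks to keep memory bounded
    # while still benefiting from batched model calls.
    for start in range(0, len(df), CHUNK_SIZE):
        chunk = df.iloc[start:start + CHUNK_SIZE]
        predictions.extend(model.predict(chunk.values))

    df["prediction"] = predictions
    df.to_parquet(output_path)

if __name__ == "__main__":
    run_batch_job()
```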
Micro-Batching Inference
Use when:
- you need real-time-like responsiveness but want better compute efficiency
- the system has constantly high load
Pros:
- more efficient than pure real-time by processing small batches
- keeps latency low enough for many user-facing cases
Cons:
- adds some complexity compared to pure real-time
- introduces slight delay due to buffering
Micro-batching works by briefly collecting incoming requests, typically for a few milliseconds, to accumulate a small batch before processing. This buffering can be done at the load balancer or within an intermediate queue.
A good example is DeepSeek, which uses micro-batching by collecting incoming requests within a short time window, typically between 5 and 200 milliseconds, and combining them into a single batch for processing. They also separate requests that need fast responses from those optimized for throughput and route them through different parts of the model[3].
Micro-batching works best in environments with a high, steady flow of requests.
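To make the idea concrete, here is a sketch of a micro-batching loop built on asyncio. The window length, batch cap, and model call are assumptions for illustration, not any particular framework's API.

```python
# Micro-batching loop (sketch): requests are buffered for a short window
# and then run through the model as one batch. Window length, batch cap,
# and the model interface are illustrative assumptions.
import asyncio

MAX_WAIT_MS = 10      # how long to keep collecting once a batch has started
MAX_BATCH_SIZE = 32   # upper bound on requests per batch

request_queue: asyncio.Queue = asyncio.Queue()

async def submit(features) -> asyncio.Future:
    # Called from the request handler: enqueue the input together with a
    # Future that will eventually carry this request's result.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((features, future))
    return future

async def batching_loop(model):
    while True:
        # Wait for the first request, then gather more for up to
        # MAX_WAIT_MS or until the batch is full.
        batch = [await request_queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break

        inputs = [features for features, _ in batch]
        # Single batched model call (a real service would offload this
        # blocking call to a worker thread or GPU worker).
        results = model.predict(inputs)
        for (_, future), result in zip(batch, results):
            future.set_result(result)
```

In a real service this loop runs as a background task alongside the request handlers, each of which awaits the Future returned by submit(). Dedicated serving tools such as TorchServe and TensorFlow Serving offer built-in dynamic batching that works along the same lines.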
Conclusion
There’s no one-size-fits-all solution when it comes to running inference in production. Real-time is great when speed matters, batch is ideal when you can wait and want to save resources, and micro-batching helps you get the best of both when you’re handling a high volume of requests. Understanding the trade-offs between them can help you build a system that is efficient, reliable, and suited to your real-world needs.