Efficient LLM Serving Systems: A Survey of Model Placement, Batching, and Resource Optimization Techniques
Abstract
Large Language Models (LLMs) have become a fundamental component of modern artificial intelligence applications because of their capabilities in natural language understanding, generation, reasoning, and decision support. However, the increasing scale and complexity of these models have introduced significant challenges in inference efficiency, memory consumption, latency, and resource utilization. This paper presents a comprehensive survey of efficient LLM serving systems, focusing on model placement strategies, batching techniques, and resource optimization approaches. The study reviews key deployment methods, including distributed serving, model parallelism, edge–cloud architectures, dynamic and continuous batching, KV-cache management, quantization, and intelligent scheduling. It also examines major challenges in scalability, energy consumption, load balancing, and memory management. The reviewed literature indicates that combining system-level and algorithmic optimizations can substantially improve throughput, reduce latency, enhance resource utilization, and lower operational costs. The findings highlight that efficient memory management, adaptive scheduling, and resource-aware deployment strategies are essential for building scalable, reliable, and high-performance infrastructure for large-scale LLM serving.

This work is licensed under a Creative Commons Attribution 4.0 International License.