Efficient LLM Serving Systems: A Survey of Model Placement, Batching, and Resource Optimization Techniques

Prachi Rajput 

Abstract

Large Language Models (LLMs) have become a fundamental component of modern artificial intelligence applications due to their remarkable capabilities in natural language understanding, generation, reasoning, and decision support. However, the increasing scale and complexity of these models have introduced significant challenges related to inference efficiency, memory consumption, latency, and resource utilization. This paper presents a comprehensive survey of efficient LLM serving systems, focusing on model placement strategies, batching techniques, and resource optimization approaches. The study reviews key deployment methods, including distributed serving, model parallelism, edge–cloud architectures, dynamic and continuous batching, KV-cache management, quantization, and intelligent scheduling. In addition, major challenges associated with scalability, energy consumption, load balancing, and memory management are examined. The analysis of recent literature demonstrates that the integration of system-level and algorithmic optimizations can substantially improve throughput, reduce latency, enhance resource utilization, and lower operational costs. The findings highlight that efficient memory management, adaptive scheduling, and resource-aware deployment strategies are essential for building scalable, reliable, and high-performance infrastructures for large-scale LLM serving.
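
To make the memory-management concern concrete, the following is a minimal sketch of the standard back-of-the-envelope estimate of KV-cache size for a decoder-only transformer. It is not taken from the surveyed systems. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128) are hypothetical values chosen for illustration only.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_element: int = 2) -> int:
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    Each layer stores one key vector and one value vector per token and
    per KV head, hence the leading factor of 2.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


# Hypothetical model configuration (an assumption, not from the paper).
fp16 = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_element=2)
int8 = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_element=1)

print(f"FP16 KV cache: {fp16 / 1024**3:.1f} GiB per sequence")
print(f"INT8 KV cache: {int8 / 1024**3:.1f} GiB per sequence")
```

Under these assumed dimensions, a single 4096-token sequence needs about 2 GiB of cache at 16-bit precision and about 1 GiB when the cache is stored in 8 bits. Because the estimate scales linearly with batch size and sequence length, it suggests why batching decisions and KV-cache management are closely coupled in serving systems.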

Article Details

Section

Review Article

How to Cite

Efficient LLM Serving Systems: A Survey of Model Placement, Batching, and Resource Optimization Techniques. (2026). Journal of Global Research in Electronics and Communications (JGREC), 2(5), 43-47. https://doi.org/10.5281/
