Efficient LLM Serving Systems: A Survey of Model Placement, Batching, and Resource Optimization Techniques

Prachi Rajput

doi:10.5281/

Published: 2026-05-29

Keywords:

Large Language Models (LLMs), KV-Cache Optimization, Model Parallelism, Resource Scheduling, Batching, Quantization

Prachi Rajput

University, Vadodara

Abstract

Large Language Models (LLMs) have become a fundamental component of modern artificial intelligence applications due to their remarkable capabilities in natural language understanding, generation, reasoning, and decision support. However, the increasing scale and complexity of these models have introduced significant challenges related to inference efficiency, memory consumption, latency, and resource utilization. This paper presents a comprehensive survey of efficient LLM serving systems, focusing on model placement strategies, batching techniques, and resource optimization approaches. The study reviews key deployment methods, including distributed serving, model parallelism, edge–cloud architectures, dynamic and continuous batching, KV-cache management, quantization, and intelligent scheduling. In addition, major challenges associated with scalability, energy consumption, load balancing, and memory management are examined. The analysis of recent literature demonstrates that the integration of system-level and algorithmic optimizations can substantially improve throughput, reduce latency, enhance resource utilization, and lower operational costs. The findings highlight that efficient memory management, adaptive scheduling, and resource-aware deployment strategies are essential for building scalable, reliable, and high-performance infrastructures for large-scale LLM serving.

Issue

Vol. 2 No. 5 (2026): May-2026

Section

Review Article

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Efficient LLM Serving Systems: A Survey of Model Placement, Batching, and Resource Optimization Techniques. (2026). Journal of Global Research in Electronics and Communications (JGREC), 2(5), 43-47. https://doi.org/10.5281/
