Search Topic:

Systematic quantitative analyses of large language model performance optimization in a server inference setting, with a focus on how to navigate latency/throughput tradeoffs when processing large numbers of incoming requests.

Additional Context Provided:

Efficiently serving transformer-based large language models to many users at scale poses complicated system design challenges: load balancing of requests across machines, batching of requests to improve throughput, management of the so-called “KV cache” in the attention mechanism, and use of advanced techniques like speculative decoding are all considerations when designing a high-performance LLM inference server. I am interested in any papers that have systematically analyzed the latency/throughput tradeoffs involved in serving transformer-based large language models when there are a large number of incoming requests, with discussion of which implementation techniques are required to achieve different points on the tradeoff curve.