
Search Topic:

Systematic quantitative analyses of large language model performance optimization in a server inference setting, with a focus on how to navigate latency/throughput tradeoffs when processing large numbers of incoming requests.

Additional Context Provided:

Efficiently serving transformer-based large language models to many users at scale poses complex system design challenges: load balancing of requests across machines, batching of requests to improve throughput, management of the so-called “KV cache” in the attention mechanism, and use of advanced techniques such as speculative decoding are all considerations when designing a high-performance LLM inference server. I am interested in papers that systematically analyze the latency/throughput tradeoffs involved in serving transformer-based large language models under a large volume of incoming requests, with discussion of which implementation techniques are required to reach different points on the tradeoff curve.
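
As a rough sketch of the tradeoff in question, the toy Python model below (all constants and the linear step-time cost model are assumptions for illustration, not taken from any particular paper or serving system) shows how increasing the decode batch size raises aggregate throughput while slowing down every request in the batch:

```python
# Toy cost model (illustrative only): decode step time grows roughly linearly
# with batch size, with a large fixed component (loading model weights, kernel
# launches) and a smaller per-request component (attention over each request's
# KV cache). The constants below are assumptions, not measured values.

FIXED_STEP_COST_S = 0.015     # assumed per-step overhead, seconds
PER_REQUEST_COST_S = 0.0004   # assumed marginal cost per extra request in the batch


def step_time(batch_size: int) -> float:
    """Seconds to advance every request in the batch by one token."""
    return FIXED_STEP_COST_S + PER_REQUEST_COST_S * batch_size


def tradeoff_point(batch_size: int) -> tuple[float, float]:
    """Return (aggregate throughput in tokens/s, per-token latency in ms)
    for a decode loop that always runs at the given batch size."""
    t = step_time(batch_size)
    throughput_tok_per_s = batch_size / t
    per_token_latency_ms = 1000.0 * t   # each request waits this long per generated token
    return throughput_tok_per_s, per_token_latency_ms


if __name__ == "__main__":
    print(f"{'batch':>5} {'throughput (tok/s)':>20} {'per-token latency (ms)':>24}")
    for b in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        tput, lat = tradeoff_point(b)
        print(f"{b:>5} {tput:>20.1f} {lat:>24.2f}")
    # Larger batches amortize the fixed step cost, so throughput rises steeply
    # at first and then saturates near 1/PER_REQUEST_COST_S, while every
    # request in the batch generates tokens more slowly.
```

Real servers add further dimensions to this curve, such as queueing delay before a request is admitted to a batch, KV-cache memory limits that cap the feasible batch size, and techniques like continuous batching or speculative decoding that shift the achievable operating points; the papers sought here are those that quantify these effects systematically.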