Recent advances in LLM capabilities have broadened their usability by enabling them to perform a wider range of general tasks autonomously. Yet the existing methods for expressing and running LM programs, although widely used, remain inefficient. There are two main obstacles to effective use of LM programs. First, the non-deterministic character of LLMs makes writing LM programs tedious and complex: incorporating parallelism mechanisms, handling multiple input modalities, brittle output parsing, experimental tuning of prompts, and heavy string manipulation are commonplace in LM software development, and this complexity greatly diminishes the readability of even the most basic applications. Second, and most crucially, LM program execution wastes memory and computational resources on redundant calculations.
A group of researchers from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University introduced SGLang, a Structured Generation Language for LLMs, to address these problems. The basic premise is to exploit the multi-call structure of LM programs in a systematic way to speed up their execution. The system comprises a front-end language and a back-end runtime: the front end makes writing LM programs easier, while the runtime accelerates their execution. The two components can operate separately or together for best performance. SGLang provides primitives for controlling parallelism (fork and join) and generation (extend, gen, and select). Because SGLang is embedded in Python and works with its libraries and control flow, users can build sophisticated prompting workflows with the language's natural syntax, as the sketch below illustrates.
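To make the front end concrete, here is a minimal sketch of what an LM program built from these primitives might look like. It assumes the Python-embedded interface described above (an `sgl.function` decorator, `gen`, `select`, and `fork`/`join` on the prompt state); exact names and signatures may differ from the released library.

```python
import sglang as sgl

# Hedged sketch of an LM program using the primitives described above.
# The decorator, gen/select helpers, and fork/join calls follow the paper's
# description; exact signatures may differ in the released SGLang package.
@sgl.function
def tool_use(s, question):
    s += "Question: " + question + "\n"                      # extend the prompt state
    s += "Tool to use: " + sgl.select("tool", choices=["calculator", "search"]) + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=128)       # one generation call

@sgl.function
def two_summaries(s, document):
    s += "Document: " + document + "\n"
    forks = s.fork(2)                                          # run two branches in parallel
    forks[0] += "Summarize for experts: " + sgl.gen("expert", max_tokens=64)
    forks[1] += "Summarize for a general reader: " + sgl.gen("simple", max_tokens=64)
    forks.join()                                               # wait for both branches

state = tool_use.run(question="What is 24 * 7?")
print(state["tool"], state["answer"])
```

Because the program is ordinary Python, loops, conditionals, and library calls can be mixed freely with these primitives.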
The team also presented an interpreter and a compiler for SGLang. The interpreter manages the prompt state as a stream and submits primitive operations to it for asynchronous execution, while correctly handling synchronization and intra-program parallelism. Further optimizations can be achieved by tracing and compiling SGLang programs. On the runtime side, the researchers propose several new optimizations to speed up the execution of SGLang programs. The first technique, RadixAttention, enables automatic KV cache reuse across multiple generation calls. Current inference engines discard a request's KV cache once processing finishes, which prevents reuse across subsequent calls and drastically slows down execution. Instead, the SGLang runtime retains the KV cache of all requests within a radix tree and evicts entries with an LRU policy. The approach treats the KV cache like a conventional cache, using the radix tree for efficient matching, insertion, and eviction, and a cache-aware scheduling policy lets the runtime handle a wide range of reuse patterns efficiently.
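The idea is easiest to see in a toy data structure. The sketch below is not the SGLang runtime (which manages paged GPU memory and uses a true radix tree with variable-length edges); it is a simplified per-token trie that illustrates the same three operations RadixAttention relies on: prefix matching for reuse, insertion when a request finishes, and LRU eviction when memory runs low.

```python
import time

# Conceptual sketch of the KV-cache index behind RadixAttention. The real
# runtime uses a radix tree (edges labeled with token sequences) over paged
# GPU memory; this toy version is an uncompressed per-token trie with LRU
# timestamps, which shows the same reuse/eviction logic.
class Node:
    __slots__ = ("children", "last_access")
    def __init__(self):
        self.children = {}                  # token id -> Node
        self.last_access = time.monotonic()

class KVCacheIndex:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Length of the longest cached prefix, i.e. KV entries we can reuse."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = time.monotonic()
            matched += 1
        return matched

    def insert(self, tokens):
        """Keep a finished request's KV entries instead of discarding them."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_access = time.monotonic()

    def evict_one(self):
        """Evict the least recently used leaf when memory runs low."""
        lru_parent, lru_token, lru_time = None, None, float("inf")
        stack = [self.root]
        while stack:
            node = stack.pop()
            for tok, child in node.children.items():
                if not child.children and child.last_access < lru_time:
                    lru_parent, lru_token, lru_time = node, tok, child.last_access
                stack.append(child)
        if lru_parent is not None:
            del lru_parent.children[lru_token]
```

In this scheme, requests that share a system prompt or few-shot examples share a path in the tree, so later calls skip recomputing the KV cache for that common prefix.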
The second method is a compressed finite state machine, which speeds up constrained decoding of structured outputs. Current systems enforce the constraints only for the next token by masking the probabilities of forbidden tokens, so they can decode just one token at a time. Instead, this approach analyzes the constraints and builds a compressed finite-state machine, merging multi-token deterministic paths into a single shorter transition whenever feasible. This allows multiple tokens to be decoded simultaneously.
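A small character-level example illustrates why this helps. The sketch below is a simplification (real systems compile a regex or JSON schema into an FSM over tokenizer tokens and must handle tokens that span FSM states), but it shows the core compression step: chains of states with exactly one legal continuation collapse into a single multi-character jump.

```python
# Hedged, character-level sketch of FSM compression for constrained decoding.
# Real systems operate on tokenizer tokens; this toy works on characters.
def compress_fsm(transitions):
    """transitions: dict state -> {char: next_state}.
    Returns {state: (forced_text, landing_state)} for deterministic chains."""
    compressed = {}
    for state in transitions:
        chars, s = [], state
        while len(transitions.get(s, {})) == 1:   # exactly one legal continuation
            ch, nxt = next(iter(transitions[s].items()))
            chars.append(ch)
            s = nxt
        if len(chars) > 1:                        # only chains longer than one step
            compressed[state] = ("".join(chars), s)
    return compressed

# The JSON prefix '{"name": "' is fully determined by the schema, so the
# decoder can emit it in one step and only call the model where the grammar
# actually allows a choice (state 10: free-form string content).
fsm = {0: {'{': 1}, 1: {'"': 2}, 2: {'n': 3}, 3: {'a': 4}, 4: {'m': 5},
       5: {'e': 6}, 6: {'"': 7}, 7: {':': 8}, 8: {' ': 9}, 9: {'"': 10},
       10: {}}
print(compress_fsm(fsm)[0])   # ('{"name": "', 10)
```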
Finally, SGLang can also optimize multi-call programs for API-only models, such as OpenAI's GPT-4, using a third technique called API speculative execution. The LLM applications built with SGLang span agent control, reasoning, retrieval-augmented generation pipelines, JSON decoding, multi-turn chat, multi-modality processing, and few-shot learning benchmarks.
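For API-only models, the gain comes from collapsing several sequential generation calls into one request. The sketch below is a hypothetical illustration of the idea, not SGLang's actual interface: `call_api` and the template representation are stand-ins. The first call is allowed to run past its stop string, and if the surplus text matches the template that follows, later slots are filled without extra API calls.

```python
# Hedged sketch of API speculative execution. `call_api` is a hypothetical
# stand-in for an OpenAI-style completion endpoint: call_api(prompt, max_tokens)
# returns the generated continuation as a string.
def speculative_fill(template_parts, call_api, max_tokens=256):
    """template_parts alternates literal text and None slots to be generated."""
    filled, surplus = [], ""
    for i, part in enumerate(template_parts):
        if part is not None:                        # literal prompt text
            if surplus.startswith(part):
                surplus = surplus[len(part):]       # already covered by speculation
            else:
                surplus = ""                        # speculation failed; fall back
            filled.append(part)
        else:                                       # a gen() slot
            stop = next((p for p in template_parts[i + 1:] if p is not None), None)
            if not surplus:
                # Ask for more tokens than one slot needs and do NOT stop early,
                # so the answer may also cover later slots.
                surplus = call_api(prompt="".join(filled), max_tokens=max_tokens)
            value, sep, rest = surplus.partition(stop) if stop else (surplus, "", "")
            filled.append(value)
            surplus = sep + rest
    return "".join(filled)
```

For a template like `["Name: ", None, "\nJob: ", None, "\n"]`, a single call can fill both slots whenever the model's continuation happens to follow the template; if it does not, the code falls back to issuing a normal second call.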
On NVIDIA A10G and A100 GPUs, the team evaluated performance with various models, including Llama-7B/70B, Mixtral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video). The experimental results show that SGLang outperforms existing programming and inference systems such as Guidance, vLLM, and LMQL, achieving up to 6.4× higher throughput across various workloads, models, and hardware configurations.
Even though SGLang has come a long way, certain limitations still point to interesting directions for future research. These include adding support for more output modalities, extending RadixAttention across multiple levels of the memory hierarchy (e.g., DRAM and disk), enabling RadixAttention to work with fuzzy semantic matching, providing higher-level primitives in SGLang, fixing the starvation problem in cache-aware scheduling, and improving the SGLang compiler's scheduling and memory planning, among other advanced static optimizations.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies across the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.