Overview

Large language models (LLMs) have demonstrated remarkable capabilities in tasks such as natural language understanding, generation, and complex reasoning. However, they demand enormous hardware resources, which has driven the development of techniques that make them more efficient. This technology-trend overview organizes such techniques into several categories and surveys recent work on efficient large language models in each of them.

Model Compression

Weight-Only Quantization (PTQ)

  • GPTQ: Accurate Quantization for Generative Pre-trained Transformers,  [Paper] [Code] ICLR, 2023

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees,  [Paper] [Code] arXiv, 2023

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,  [Paper] [Code] arXiv, 2023

  • OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models,  [Paper] [Code] arXiv, 2023

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression,  [Paper] [Code] arXiv, 2023

  • FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs,  [Paper] NeurIPS-ENLSP, 2023

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,  [Paper] [Code] NeurIPS, 2022

  • Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning,  [Paper] [Code] NeurIPS, 2022
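
To make the category concrete, below is a minimal sketch of the starting point these post-training methods share: quantize only the weights, per output channel, with a round-to-nearest (RTN) baseline. It is not GPTQ or AWQ themselves (those add error compensation and activation-aware scaling on top), just an illustration of weight-only PTQ in PyTorch.

```python
import torch

def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 4):
    """Per-output-channel round-to-nearest (RTN) weight-only quantization.

    w: [out_features, in_features] full-precision weight matrix.
    Returns integer codes plus one scale per output channel for dequantization.
    """
    qmax = 2 ** (n_bits - 1) - 1                                 # 7 for symmetric 4-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_rtn(w)
w_hat = q.float() * scale                                        # dequantize before the matmul
print((w_hat - w).abs().mean())                                  # average per-weight quantization error
```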

Weight-Activation Co-Quantization (PTQ)

  • Intriguing Properties of Quantization at Scale,  [Paper] NeurIPS, 2023

  • ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation,  [Paper] [Code] arXiv, 2023

  • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats,  [Paper] [Code] NeurIPS-ENLSP, 2023

  • OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization,  [Paper] [Code] ISCA, 2023

  • RPTQ: Reorder-based Post-training Quantization for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling,  [Paper] [Code] arXiv, 2023

  • QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models,  [Paper] arXiv, 2023

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,  [Paper] [Code] ICML, 2023

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers,  [Paper] NeurIPS, 2022
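
Most of the entries above wrestle with activation outliers. Below is a simplified sketch of the equivalent-transformation trick popularized by SmoothQuant (and refined by Outlier Suppression+): migrate quantization difficulty from activations to weights with a per-channel scale, so the product stays numerically identical while both factors quantize more easily. The calibration statistics here are random placeholders.

```python
import torch

def smoothing_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """SmoothQuant-style per-input-channel smoothing scales.

    act_absmax: [in_features] max |activation| per channel from calibration data.
    weight:     [out_features, in_features] of the following linear layer.
    """
    w_absmax = weight.abs().amax(dim=0)                    # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

weight = torch.randn(4096, 4096)
act_absmax = torch.rand(4096) * 50 + 1                     # stand-in calibration statistics
s = smoothing_scales(act_absmax, weight)

x = torch.randn(8, 4096) * act_absmax                      # activations with outlier channels
y_ref = x @ weight.T
y_smoothed = (x / s) @ (weight * s).T                      # same output, smoother activations
print(((y_ref - y_smoothed).abs().max() / y_ref.abs().max()).item())  # tiny: mathematically identical
```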

Quantization-Aware Training (QAT)

  • BitNet: Scaling 1-bit Transformers for Large Language Models,  [Paper] arXiv, 2023

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Compression of Generative Pre-trained Language Models via Quantization,  [Paper] ACL, 2022
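
Quantization-aware training keeps a full-precision copy of the weights, fake-quantizes them in the forward pass, and lets gradients bypass the rounding via a straight-through estimator (STE). A minimal STE sketch follows; BitNet and LLM-QAT build their 1-bit and low-bit training on this mechanism plus many additional details.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round weights in the forward pass, pass gradients straight through (STE)."""

    @staticmethod
    def forward(ctx, w, n_bits=8):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                 # straight-through: ignore the rounding op

w = torch.randn(16, 16, requires_grad=True)
loss = FakeQuant.apply(w).sum()
loss.backward()
print(w.grad.abs().sum() > 0)                    # gradients flow despite the non-differentiable round
```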

Pruning: Structured Pruning

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery,  [Paper] arXiv, 2023

  • LLM-Pruner: On the Structural Pruning of Large Language Models,  [Paper] [Code] NeurIPS, 2023

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning,  [Paper] [Code] NeurIPS-ENLSP, 2023

  • LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning,  [Paper] arXiv, 2023
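
Structured pruning removes whole units (attention heads, FFN neurons, layers) so the remaining matrices stay dense and need no sparse kernels. A tiny sketch of dropping intermediate FFN neurons by a simple norm-based importance score follows; LLM-Pruner and Sheared LLaMA above use more sophisticated, gradient-informed importance estimates and recover accuracy with additional training.

```python
import torch

def prune_ffn_neurons(w_in: torch.Tensor, w_out: torch.Tensor, keep_ratio: float = 0.75):
    """Drop whole intermediate FFN neurons, shrinking both projection matrices.

    w_in:  [d_ff, d_model] up projection; w_out: [d_model, d_ff] down projection.
    Importance here is just the product of weight norms per neuron (an illustrative heuristic).
    """
    importance = w_in.norm(dim=1) * w_out.norm(dim=0)
    keep = importance.topk(int(w_in.size(0) * keep_ratio)).indices.sort().values
    return w_in[keep], w_out[:, keep]

w_in, w_out = torch.randn(11008, 4096), torch.randn(4096, 11008)
w_in_p, w_out_p = prune_ffn_neurons(w_in, w_out)
print(w_in_p.shape, w_out_p.shape)               # [8256, 4096] and [4096, 8256]
```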

Pruning: Unstructured Pruning

  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,  [Paper] [Code] ICML, 2023

  • A Simple and Effective Pruning Approach for Large Language Models,  [Paper] [Code] arXiv, 2023

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models,  [Paper] arXiv, 2023
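
For reference, a bare magnitude-pruning baseline for the unstructured setting is sketched below; SparseGPT and the other one-shot methods above improve on this with layer-wise reconstruction and activation statistics, so the sketch only shows the mechanics of producing a sparse weight mask.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5):
    """Unstructured magnitude pruning: zero out the smallest-|w| fraction of weights."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    return w * mask, mask

w = torch.randn(1024, 1024)
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(mask.float().mean())                       # ≈ 0.5 of the weights remain
```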

Pruning: Low-Rank Approximation

  • TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition,  [Paper] arXiv, 2023

  • LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation,  [Paper] [Code] ICML, 2023

White-Box KD

  • Towards the Law of Capacity Gap in Distilling Language Models,  [Paper] [Code] arXiv, 2023

  • Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with no Performance Penalty,  [Paper] arXiv, 2023

  • Knowledge Distillation of Large Language Models,  [Paper] [Code] arXiv, 2023

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models,  [Paper] arXiv, 2023

  • Propagating Knowledge Updates to LMs Through Distillation,  [Paper] [Code] arXiv, 2023

  • Less is More: Task-aware Layer-wise Distillation for Language Model Compression,  [Paper] ICML, 2023

  • Token-Scaled Logit Distillation for Ternary Weight Generative Language Models,  [Paper] arXiv, 2023
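
White-box KD assumes access to the teacher's logits. The snippet below shows the baseline temperature-scaled forward-KL distillation loss on those logits; several papers above (e.g., Knowledge Distillation of Large Language Models, GKD) argue that reverse-KL or on-policy variants work better for generative LLMs, so treat this only as the standard form.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 2.0):
    """Temperature-scaled forward-KL distillation on next-token logits.

    Both tensors have shape [batch * seq, vocab]; the T^2 factor keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

print(kd_loss(torch.randn(32, 32000), torch.randn(32, 32000)).item())
```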

Black-Box KD

  • Zephyr: Direct Distillation of LM Alignment,  [Paper] arXiv, 2023

  • Instruction Tuning with GPT-4,  [Paper] [Code] arXiv, 2023

  • Lion: Adversarial Distillation of Closed-Source Large Language Model,  [Paper] [Code] arXiv, 2023

  • Specializing Smaller Language Models towards Multi-Step Reasoning,  [Paper] [Code] ICML, 2023

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes,  [Paper] ACL, 2023

  • Large Language Models Are Reasoning Teachers,  [Paper] [Code] ACL, 2023

  • SCOTT: Self-Consistent Chain-of-Thought Distillation,  [Paper] [Code] ACL, 2023

  • Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step,  [Paper] ACL, 2023

  • Distilling Reasoning Capabilities into Smaller Language Models,  [Paper] [Code] ACL, 2023

  • In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models,  [Paper] arXiv, 2022

  • Explanations from Large Language Models Make Small Reasoners Better,  [Paper] arXiv, 2022

  • DISCO: Distilling Counterfactuals with Large Language Models,  [Paper] [Code] arXiv, 2022

Efficient Pre-Training

Mixed Precision Acceleration

  • GACT: Activation Compressed Training for Generic Network Architectures,  [Paper] [Code] ICML, 2022

  • Mesa: A Memory-saving Training Framework for Transformers,  [Paper] [Code] arXiv, 2021

  • Bfloat16 Processing for Neural Networks,  [Paper] ARITH, 2019

  • A Study of BFLOAT16 for Deep Learning Training,  [Paper] arXiv, 2019

  • Mixed Precision Training,  [Paper] ICLR, 2018
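
These papers underpin today's standard practice of running most of the forward and backward pass in FP16 or BF16 while keeping master weights and sensitive accumulations in FP32. A minimal PyTorch AMP loop in that spirit is shown below (it assumes a CUDA device; with BF16, as studied in the BFLOAT16 papers above, the loss scaler is usually unnecessary).

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                 # loss scaling from "Mixed Precision Training"

for _ in range(3):
    x = torch.randn(8, 4096, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()                # compute-heavy ops run in FP16
    optimizer.zero_grad()
    scaler.scale(loss).backward()                    # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                           # unscales, skips the step on inf/nan gradients
    scaler.update()
```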

Scaling Models

  • Learning to Grow Pretrained Models for Efficient Transformer Training,  [Paper] [Code] ICLR, 2023

  • 2x Faster Language Model Pre-training via Masked Structural Growth,  [Paper] arXiv, 2023

  • Reusing Pretrained Models by Multi-linear Operators for Efficient Training,  [Paper] NeurIPS, 2023

  • FLM-101B: An Open LLM and How to Train It with $100K Budget,  [Paper] [Code] arXiv, 2023

  • Knowledge Inheritance for Pre-trained Language Models,  [Paper] [Code] NAACL, 2022

  • Staged Training for Transformer Language Models,  [Paper] [Code] ICML, 2022

Initialization Techniques

  • DeepNet: Scaling Transformers to 1,000 Layers,  [Paper] [Code] arXiv, 2022

  • ZerO Initialization: Initializing Neural Networks with only Zeros and Ones,  [Paper] [Code] TMLR, 2022

  • ReZero is All You Need: Fast Convergence at Large Depth,  [Paper] [Code] UAI, 2021

  • Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks,  [Paper] NeurIPS, 2020

  • Improving Transformer Optimization Through Better Initialization,  [Paper] [Code] ICML, 2020

  • Fixup Initialization: Residual Learning without Normalization,  [Paper] ICLR, 2019

  • On Weight Initialization in Deep Neural Networks,  [Paper] arXiv, 2017

Optimization Strategies

  • Symbolic Discovery of Optimization Algorithms,  [Paper] arXiv, 2023

  • Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training,  [Paper] [Code] arXiv, 2023

Efficient Fine-Tuning

PEFT: Adapter-based Tuning

  • OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models,  [Paper] [Code] ACL Demo, 2023

  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models,  [Paper] [Code] EMNLP, 2023

  • Compacter: Efficient Low-Rank Hypercomplex Adapter Layers,  [Paper] [Code] NeurIPS, 2021

  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning,  [Paper] [Code] NeurIPS, 2022

  • Meta-Adapters: Parameter Efficient Few-shot Fine-tuning through Meta-Learning,  [Paper] AutoML, 2022

  • AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning,  [Paper] [Code] EMNLP, 2022

  • SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters,  [Paper] [Code] EMNLP, 2022

PEFT: Low-Rank Adaptation

  • LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning,  [Paper] arXiv, 2023

  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition,  [Paper] [Code] arXiv, 2023

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models,  [Paper] [Code] arXiv, 2023

  • Multi-Head Adapter Routing for Cross-Task Generalization,  [Paper] [Code] NeurIPS, 2023

  • Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning,  [Paper] ICLR, 2023

  • DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation,  [Paper] [Code] EACL, 2023

  • Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying,  [Paper] arXiv, 2023

  • LoRA: Low-Rank Adaptation of Large Language Models,  [Paper] [Code] ICLR, 2022
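
The common core of these methods is the LoRA update itself: freeze the pretrained weight and learn a low-rank correction ΔW = B·A, scaled by α/r. A minimal PyTorch wrapper (a sketch, not any particular repo's implementation) follows.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: y = x W^T + (alpha/r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                                  # 2 * r * 4096 instead of 4096 * 4096
```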

Prefix Tuning

  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention,  [Paper] [Code] arXiv, 2023

  • Prefix-Tuning: Optimizing Continuous Prompts for Generation,  [Paper] [Code] ACL, 2021

Prompt Tuning

  • Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt,  [Paper] arXiv, 2023

  • GPT Understands, Too,  [Paper] [Code] AI Open, 2023

  • Multi-Task Pre-Training of Modular Prompt for Few-Shot Learning,  [Paper] [Code] ACL, 2023

  • Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning,  [Paper] ICLR, 2023

  • PPT: Pre-trained Prompt Tuning for Few-shot Learning,  [Paper] [Code] ACL, 2022

  • Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers,  [Paper] [Code] EMNLP-Findings, 2022

  • P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks, [Paper] [Code] ACL-Short, 2022

  • The Power of Scale for Parameter-Efficient Prompt Tuning,  [Paper] EMNLP, 2021
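
Prompt tuning reduces adaptation to learning a handful of continuous "soft prompt" vectors prepended to the input embeddings while the backbone stays frozen. A minimal sketch of that mechanism follows (the dimensions and embedding interface are placeholders, not tied to a specific model). Prefix-Tuning, listed above, differs in that the learned vectors are injected as key/value prefixes at every attention layer rather than only at the input.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt embeddings prepended to the frozen model's input embeddings."""

    def __init__(self, n_tokens: int = 20, d_model: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq, d_model] produced by the frozen embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)           # [batch, n_tokens + seq, d_model]

soft = SoftPrompt(n_tokens=20, d_model=4096)
print(soft(torch.randn(2, 128, 4096)).shape)                      # torch.Size([2, 148, 4096])
```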

Memory-Efficient Fine-Tuning

  • Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model,  [Paper] [Code] NeurIPS, 2023

  • Memory-Efficient Selective Fine-Tuning,  [Paper] ICML Workshop, 2023

  • Full Parameter Fine-tuning for Large Language Models with Limited Resources,  [Paper] [Code] arXiv, 2023

  • Fine-Tuning Language Models with Just Forward Passes,  [Paper] [Code] NeurIPS, 2023

  • Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization,  [Paper] NeurIPS, 2023

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,  [Paper] [Code] arXiv, 2023

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models,  [Paper] [Code] arXiv, 2023

  • QLoRA: Efficient Finetuning of Quantized LLMs,  [Paper] [Code1] [Code2] NeurIPS, 2023

Efficient Inference

Speculative Decoding

  • PaSS: Parallel Speculative Sampling,  [Paper] NeurIPS Workshop, 2023

  • Accelerating Transformer Inference for Translation via Parallel Decoding,  [Paper] [Code] ACL, 2023

  • Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads,  [Blog] [Code] Blog, 2023

  • Fast Inference from Transformers via Speculative Decoding,  [Paper] ICML, 2023

  • Accelerating LLM Inference with Staged Speculative Decoding,  [Paper] ICML Workshop, 2023

  • Accelerating Large Language Model Decoding with Speculative Sampling,  [Paper] arXiv, 2023

  • Speculative Decoding with Big Little Decoder,  [Paper] [Code] NeurIPS, 2023

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification,  [Paper] [Code] arXiv, 2023

  • Inference with Reference: Lossless Acceleration of Large Language Models,  [Paper] [Code] arXiv, 2023
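
The shared recipe behind these papers: a cheap draft model proposes several tokens, and the large target model verifies them in a single forward pass, so a burst of accepted tokens costs roughly one target-model call. The sketch below uses greedy agreement for verification to stay short; the speculative-sampling papers above instead accept tokens with a rejection test that provably preserves the target distribution. `draft_model` and `target_model` are assumed to be callables mapping token ids to logits of shape [batch, seq, vocab].

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int = 4):
    """One greedy speculative-decoding step (batch size 1 assumed); returns the extended prefix."""
    draft = prefix
    for _ in range(k):                                            # k cheap autoregressive draft steps
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # One expensive pass: target predictions for every proposed position, plus one bonus position.
    target_pred = target_model(draft)[:, prefix.size(1) - 1:].argmax(-1)
    proposed = draft[:, prefix.size(1):]

    n_accept = 0
    while n_accept < k and proposed[0, n_accept].item() == target_pred[0, n_accept].item():
        n_accept += 1                                             # keep tokens the target agrees with

    bonus = target_pred[:, n_accept:n_accept + 1]                 # target's own token after the accepted run
    return torch.cat([prefix, proposed[:, :n_accept], bonus], dim=1)
```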

KV-Cache Optimization

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,  [Paper] arXiv, 2023

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference,  [Paper] arXiv, 2023

  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models,  [Paper] NeurIPS, 2023

  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time,  [Paper] NeurIPS, 2023

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers,  [Paper] arXiv, 2023
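
These works all manipulate the key/value (KV) cache that autoregressive decoding maintains: each step appends one K/V pair and attends over the whole cache, so memory grows linearly with generated length, which is exactly what H2O and Scissorhands prune. A single-head sketch of the cache mechanics:

```python
import torch

def attend_with_cache(q, k_new, v_new, cache):
    """One single-head decoding step that reuses cached keys/values of earlier tokens.

    q, k_new, v_new: [batch, 1, d] for the current token. The papers above additionally
    evict low-importance entries to bound the cache's memory; this sketch just grows it.
    """
    if cache["k"] is not None:
        k = torch.cat([cache["k"], k_new], dim=1)
        v = torch.cat([cache["v"], v_new], dim=1)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v                                 # grows by one entry per step

    scores = (q @ k.transpose(1, 2)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

cache = {"k": None, "v": None}
for _ in range(5):                                                # 5 decoding steps
    q = k = v = torch.randn(1, 1, 64)
    out = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)                                           # torch.Size([1, 5, 64])
```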

Efficient Architecture

Sharing-based Attention

  • GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,  [Paper] EMNLP, 2023

  • Fast Transformer Decoding: One Write-Head is All You Need,  [Paper] arXiv, 2019
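
Both papers shrink the KV cache by letting many query heads share a few key/value heads: multi-query attention is the n_kv_heads = 1 extreme, and grouped-query attention interpolates between that and full multi-head attention. A shape-level sketch:

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """Grouped-query attention: n_q_heads query heads share n_kv_heads K/V heads.

    q: [batch, n_q_heads, seq, d]; k, v: [batch, n_kv_heads, seq, d].
    n_kv_heads == 1 recovers multi-query attention; == n_q_heads recovers standard MHA.
    """
    group = q.size(1) // n_kv_heads
    k = k.repeat_interleave(group, dim=1)            # each K/V head serves a group of query heads
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)                        # 8 query heads
k = v = torch.randn(1, 2, 16, 64)                    # 2 shared K/V heads
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)   # torch.Size([1, 8, 16, 64])
```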

Feature Information Reduction

  • Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention,  [Paper] [Code] AAAI, 2021

  • Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing,  [Paper] [Code] NeurIPS, 2020

  • Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks,  [Paper] ICML, 2019

Kernelization or Low-Rank

  • Sumformer: Universal Approximation for Efficient Transformers,  [Paper] ICML Workshop, 2023

  • FLuRKA: Fast fused Low-Rank & Kernel Attention,  [Paper] arXiv, 2023

  • Scatterbrain: Unifying Sparse and Low-rank Attention,  [Paper] [Code] NeurIPS, 2021

  • Rethinking Attention with Performers,  [Paper] [Code] ICLR, 2021

  • Random Feature Attention,  [Paper] ICLR, 2021

  • Linformer: Self-Attention with Linear Complexity,  [Paper] [Code] arXiv, 2020

  • Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer,  [Paper] ICASSP, 2020

  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention,  [Paper] [Code] ICML, 2020
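
The kernelized line of work replaces softmax(QKᵀ)V with a feature map φ so that φ(Q)(φ(K)ᵀV) can be computed in O(N·d²) instead of O(N²·d). Below is the non-causal form with the φ(x) = elu(x) + 1 feature map from "Transformers are RNNs"; the causal variant maintains running sums instead of attending to the full sequence.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Non-causal linear attention with the feature map phi(x) = elu(x) + 1.

    q, k, v: [batch, seq, d]. Cost is O(seq * d^2) rather than O(seq^2 * d).
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                       # sum_n phi(k_n) v_n^T
    normalizer = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)).clamp(min=1e-6)
    return torch.einsum("bnd,bde->bne", q, kv) / normalizer.unsqueeze(-1)

q, k, v = (torch.randn(1, 1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)                            # torch.Size([1, 1024, 64])
```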

Fixed Pattern Strategies

  • Faster Causal Attention Over Large Sequences Through Sparse Flash Attention,  [Paper] ICML Workshop, 2023

  • Poolingformer: Long Document Modeling with Pooling Attention,  [Paper] ICML, 2021

  • Big Bird: Transformers for Longer Sequences,  [Paper] [Code] NeurIPS, 2020

  • Longformer: The Long-Document Transformer,  [Paper] [Code] arXiv, 2020

  • Blockwise Self-Attention for Long Document Understanding,  [Paper] [Code] EMNLP, 2020

  • Generating Long Sequences with Sparse Transformers,  [Paper] arXiv, 2019

Learnable Pattern Strategies

  • HyperAttention: Long-context Attention in Near-Linear Time,  [Paper] [Code] arXiv, 2023

  • ClusterFormer: Neural Clustering Attention for Efficient and Effective Transformer,  [Paper] ACL, 2022

  • Reformer: The Efficient Transformer,  [Paper] [Code] ICLR, 2020

  • Sparse Sinkhorn Attention,  [Paper] ICML, 2020

  • Fast Transformers with Clustered Attention,  [Paper] [Code] NeurIPS, 2020

  • Efficient Content-Based Sparse Attention with Routing Transformers,  [Paper] [Code] TACL, 2020

Mixture of Experts

MoE-based LLMs

  • Mistral 7B,  [Paper] [Code] arXiv, 2023

  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing,  [Paper] arXiv, 2023

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,  [Paper] [Code] JMLR, 2022

  • Efficient Large Scale Language Modeling with Mixtures of Experts,  [Paper] [Code] EMNLP, 2022

  • BASE Layers: Simplifying Training of Large, Sparse Models,  [Paper] [Code] ICML, 2021

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,  [Paper] ICLR, 2021
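
In all of these models the dense FFN is replaced by a sparsely activated mixture-of-experts layer: a router scores experts per token and only the top-k are executed. The sketch below shows token-level top-k routing, without the load-balancing losses and capacity limits that systems such as Switch Transformers and GShard add.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Token-level top-k routing over a set of expert FFNs (sketch of a sparse MoE layer)."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                         # x: [tokens, d_model]
        gate = self.router(x).softmax(dim=-1)
        weight, idx = gate.topk(self.k, dim=-1)                   # each token picks k experts
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)         # tokens routed to expert e
            if tok.numel():
                out[tok] += weight[tok, slot, None] * expert(x[tok])
        return out

print(TopKMoE()(torch.randn(10, 512)).shape)                      # torch.Size([10, 512])
```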

Algorithm-Level MoE Optimization

  • Lifelong Language Pretraining with Distribution-Specialized Experts,  [Paper] ICML, 2023

  • Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models,  [Paper] arXiv, 2023

  • Mixture-of-Experts with Expert Choice Routing,  [Paper] NeurIPS, 2022

  • StableMoE: Stable Routing Strategy for Mixture of Experts,  [Paper] [Code] ACL, 2022

  • On the Representation Collapse of Sparse Mixture of Experts,  [Paper] NeurIPS, 2022

Long Context LLMs: Extrapolation and Interpolation

  • Scaling Laws of RoPE-based Extrapolation,  [Paper] arXiv, 2023

  • A Length-Extrapolatable Transformer,  [Paper] [Code] ACL, 2023

  • Extending Context Window of Large Language Models via Positional Interpolation,  [Paper] arXiv, 2023

  • NTK Interpolation,  [Reddit post] Blog, 2023

  • YaRN: Efficient Context Window Extension of Large Language Models,  [Paper] [Code] arXiv, 2023

  • CLEX: Continuous Length Extrapolation for Large Language Models,  [Paper] [Code] arXiv, 2023

  • PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training,  [Paper] [Code] arXiv, 2023

  • Functional Interpolation for Relative Positions Improves Long Context Transformers,  [Paper] arXiv, 2023

  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,  [Paper] [Code] ICLR, 2022

  • Exploring Length Generalization in Large Language Models,  [Paper] NeurIPS, 2022

  • The EOS Decision and Length Extrapolation,  [Paper] [Code] EMNLP, 2020
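
Many of these methods modify rotary position embeddings (RoPE) so a model trained on short contexts behaves sensibly at longer ones. The simplest variant, linear Positional Interpolation, rescales positions by train_len / target_len before computing RoPE angles; NTK Interpolation and YaRN instead rescale the frequency base or interpolate per frequency band. A sketch of the linear version:

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotary embedding angles; scale < 1 implements linear Positional Interpolation."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float() * scale, inv_freq)       # [seq, dim/2]

train_len, target_len = 2048, 8192
angles = rope_angles(torch.arange(target_len), dim=128, scale=train_len / target_len)
print(angles.max())    # largest rotary angle stays within the range seen during training at train_len
```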

Long Context LLMs: Recurrent Structure

  • Retentive Network: A Successor to Transformer for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Recurrent Memory Transformer,  [Paper] [Code] NeurIPS, 2022

  • Block-Recurrent Transformers,  [Paper] [Code] NeurIPS, 2022

  • ∞-former: Infinite Memory Transformer,  [Paper] [Code] ACL, 2022

  • Memformer: A Memory-Augmented Transformer for Sequence Modeling,  [Paper] [Code] AACL-Findings, 2020

  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,  [Paper] [Code] ACL, 2019

Long Context LLMs: Segmentation and Sliding Window

  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, [Paper] arXiv, 2024

  • Extending Context Window of Large Language Models via Semantic Compression,  [Paper] arXiv, 2023

  • Efficient Streaming Language Models with Attention Sinks,  [Paper] [Code] arXiv, 2023

  • Parallel Context Windows for Large Language Models,  [Paper] [Code] ACL, 2023

  • LongNet: Scaling Transformers to 1,000,000,000 Tokens,  [Paper] [Code] arXiv, 2023

  • Efficient Long-Text Understanding with Short-Text Models,  [Paper] [Code] TACL, 2023
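
A recurring pattern here is to attend only to a recent window plus a few initial "attention sink" tokens, as in Efficient Streaming Language Models with Attention Sinks. The mask below captures that pattern; real implementations also rewrite the KV cache so evicted positions actually free memory.

```python
import torch

def sink_plus_window_mask(seq_len: int, window: int = 4, n_sink: int = 2) -> torch.Tensor:
    """Boolean attention mask keeping a few initial 'sink' tokens plus a recent window (True = attend)."""
    i = torch.arange(seq_len).unsqueeze(1)           # query positions
    j = torch.arange(seq_len).unsqueeze(0)           # key positions
    causal = j <= i
    recent = (i - j) < window
    sink = j < n_sink
    return causal & (recent | sink)

print(sink_plus_window_mask(8, window=3, n_sink=1).int())
```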

Long Context LLMs: Memory-Retrieval Augmentation

  • Landmark Attention: Random-Access Infinite Context Length for Transformers,  [Paper] [Code] arXiv, 2023

  • Augmenting Language Models with Long-Term Memory,  [Paper] NeurIPS, 2023

  • Unlimiformer: Long-Range Transformers with Unlimited Length Input,  [Paper] [Code] NeurIPS, 2023

  • Focused Transformer: Contrastive Training for Context Scaling,  [Paper] [Code] NeurIPS, 2023

  • Retrieval meets Long Context Large Language Models,  [Paper] arXiv, 2023

  • Memorizing Transformers,  [Paper] [Code] ICLR, 2022

Transformer Alternative Architecture

State Space Models

  • Sparse Modular Activation for Efficient Sequence Modeling,  [Paper] [Code] NeurIPS, 2023

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces,  [Paper] [Code] arXiv, 2023

  • Hungry Hungry Hippos: Towards Language Modeling with State Space Models,  [Paper] [Code] ICLR, 2023

  • Long Range Language Modeling via Gated State Spaces,  [Paper] ICLR, 2023

  • Block-State Transformers,  [Paper] NeurIPS, 2023

  • Efficiently Modeling Long Sequences with Structured State Spaces,  [Paper] [Code] ICLR, 2022

  • Diagonal State Spaces are as Effective as Structured State Spaces,  [Paper] [Code] NeurIPS, 2022

Other Sequential Models

  • PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation,  [Paper] arXiv, 2023

  • RWKV: Reinventing RNNs for the Transformer Era,  [Paper] EMNLP-Findings, 2023

  • Hyena Hierarchy: Towards Larger Convolutional Language Models,  [Paper] arXiv, 2023

  • MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers,  [Paper] arXiv, 2023

Data-Centric Methods

Data Selection for Efficient Pre-Training

  • Data Selection for Language Models via Importance Resampling,  [Paper] [Code] NeurIPS, 2023

  • NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework,  [Paper] [Code] ICML, 2022

  • Span Selection Pre-training for Question Answering,  [Paper] [Code] ACL, 2020

Data Selection for Efficient Fine-Tuning

  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning,  [Paper] [Code] arXiv, 2023

  • One Shot Learning as Instruction Data Prospector for Large Language Models,  [Paper] arXiv, 2023

  • MoDS: Model-oriented Data Selection for Instruction Tuning,  [Paper] [Code] arXiv, 2023

  • Instruction Mining: When Data Mining Meets Large Language Model Finetuning,  [Paper] arXiv, 2023

  • Data-Efficient Finetuning Using Cross-Task Nearest Neighbors,  [Paper] [Code] ACL, 2023

  • Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values,  [Paper] [Code] ACL SRW, 2023

  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning,  [Paper] arXiv, 2023

  • AlpaGasus: Training A Better Alpaca with Fewer Data,  [Paper] [Code] arXiv, 2023

  • LIMA: Less Is More for Alignment,  [Paper] arXiv, 2023

Prompt Engineering

Demonstration Selection

  • Unified Demonstration Retriever for In-Context Learning,  [Paper] [Code] ACL, 2023

  • Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning,  [Paper] [Code] NeurIPS, 2023

  • In-Context Learning with Iterative Demonstration Selection,  [Paper] arXiv, 2023

  • Dr.ICL: Demonstration-Retrieved In-context Learning,  [Paper] arXiv, 2023

  • Learning to Retrieve In-Context Examples for Large Language Models,  [Paper] arXiv, 2023

  • Finding Supporting Examples for In-Context Learning,  [Paper] arXiv, 2023

  • Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering,  [Paper] [Code] ACL, 2023

  • Selective Annotation Makes Language Models Better Few-Shot Learners,  [Paper] [Code] ICLR, 2023

  • What Makes Good In-Context Examples for GPT-3?  [Paper] DeeLIO, 2022

  • Learning To Retrieve Prompts for In-Context Learning,  [Paper] [Code] NAACL-HLT, 2022

  • Active Example Selection for In-Context Learning,  [Paper] [Code] EMNLP, 2022

  • Rethinking the Role of Demonstrations: What makes In-context Learning Work?  [Paper] [Code] EMNLP, 2022

Demonstration Ordering

  • Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,  [Paper] ACL, 2022

Instruction Generation

  • Large Language Models as Optimizers,  [Paper] arXiv, 2023

  • Instruction Induction: From Few Examples to Natural Language Task Descriptions,  [Paper] [Code] ACL, 2023

  • Large Language Models Are Human-Level Prompt Engineers,  [Paper] [Code] ICLR, 2023

  • TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design,  [Paper] arXiv, 2023

  • Self-Instruct: Aligning Language Model with Self Generated Instructions,  [Paper] [Code] ACL, 2023

Multi-Step Reasoning

  • Automatic Chain of Thought Prompting in Large Language Models,  [Paper] [Code] ICLR, 2023

  • Measuring and Narrowing the Compositionality Gap in Language Models,  [Paper] [Code] EMNLP, 2023

  • ReAct: Synergizing Reasoning and Acting in Language Models,  [Paper] [Code] ICLR, 2023

  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,  [Paper] ICLR, 2023

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models,  [Paper] [Code] arXiv, 2023

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models,  [Paper] [Code] NeurIPS, 2023

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models,  [Paper] ICLR, 2023

  • Contrastive Chain-of-Thought Prompting,  [Paper] [Code] arXiv, 2023

  • Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation,  [Paper] arXiv, 2023

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,  [Paper] NeurIPS, 2022
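
Several of these prompting schemes compose directly. As one concrete example, self-consistency samples multiple chain-of-thought completions and majority-votes their final answers; the `generate` callable below is a placeholder for any temperature-sampled LLM call that returns an answer string, not a specific API.

```python
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Sample several chain-of-thought completions and return the majority-vote answer."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [generate(prompt) for _ in range(n_samples)]        # each call samples a new reasoning path
    return Counter(answers).most_common(1)[0][0]
```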

Parallel Generation

  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding,  [Paper] [Code] arXiv, 2023

Prompt Compression

  • Learning to Compress Prompts with Gist Tokens,  [Paper] arXiv, 2023

  • Adapting Language Models to Compress Contexts,  [Paper] [Code] EMNLP, 2023

  • In-context Autoencoder for Context Compression in a Large Language Model,  [Paper] [Code] arXiv, 2023

  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression,  [Paper] [Code] arXiv, 2023

  • Discrete Prompt Compression with Reinforcement Learning,  [Paper] arXiv, 2023

  • Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models,  [Paper] arXiv, 2023

Prompt Generation

  • TempLM: Distilling Language Models into Template-Based Generators,  [Paper] [Code] arXiv, 2022

  • PromptGen: Automatically Generate Prompts using Generative Models,  [Paper] NAACL Findings, 2022

  • AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts,  [Paper] [Code] EMNLP, 2020

System-Level Efficiency Optimization and LLM Frameworks

System-Level Pre-Training Efficiency Optimization

  • CoLLiE: Collaborative Training of Large Language Models in an Efficient Way,  [Paper] [Code] EMNLP, 2023

  • An Efficient 2D Method for Training Super-Large Deep Learning Models,  [Paper] [Code] IPDPS, 2023

  • PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,  [Paper] VLDB, 2023

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training,  [Paper] [Code] NSDI, 2023

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates,  [Paper] [Code] SOSP, 2023

  • Varuna: Scalable, Low-cost Training of Massive Deep Learning Models,  [Paper] [Code] EuroSys, 2022

  • Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization,  [Paper] [Code] OSDI, 2022

  • Tesseract: Parallelize the Tensor Parallelism Efficiently,  [Paper] ICPP, 2022

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning,  [Paper] [Code] OSDI, 2022

  • Maximizing Parallelism in Distributed Training for Huge Neural Networks,  [Paper] arXiv, 2021

  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,  [Paper] [Code] SC, 2021

  • ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning,  [Paper] SC, 2021

  • ZeRO-Offload: Democratizing Billion-Scale Model Training,  [Paper] [Code] USENIX ATC, 2021

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,  [Paper] [Code] SC, 2020

System-Level Inference Efficiency Optimization

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory,  [Paper] arXiv, 2023

  • SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision,  [Paper] EMNLP, 2023

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,  [Paper] [Code] ICML, 2023

  • Flash-Decoding for Long-Context Inference,  [Blog] Blog, 2023

  • FlashDecoding++: Faster Large Language Model Inference on GPUs,  [Paper] arXiv, 2023

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time,  [Paper] ICML, 2023

  • Efficiently Scaling Transformer Inference,  [Paper] MLSys, 2023

  • S3: Increasing GPU Utilization during Generative Inference for Higher Throughput,  [Paper] arXiv, 2023

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale,  [Paper] SC, 2022

System-Level Serving Efficiency Optimization

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU,  [Paper] [Code] arXiv, 2023

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters,  [Paper] [Code] arXiv, 2023

  • Efficient Memory Management for Large Language Model Serving with PagedAttention,  [Paper] [Code] SOSP, 2023

  • Orca: A Distributed Serving System for Transformer-Based Generative Models,  [Paper] OSDI, 2022

  • Fast Distributed Inference Serving for Large Language Models,  [Paper] arXiv, 2023

  • Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models,  [Paper] arXiv, 2023

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances,  [Paper] arXiv, 2023

  • TurboTransformers: an efficient GPU serving system for transformer models,  [Paper] PPoPP, 2021

System-Level Efficient Architecture Optimization

System-Level Attention Optimization

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,  [Paper] [Code] arXiv, 2023

  • Efficient Memory Management for Large Language Model Serving with PagedAttention,  [Paper] [Code] SOSP, 2023

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,  [Paper] [Code] NeurIPS, 2022

  • Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server,  [Blog] Nvidia Blog, 2022

  • ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks,  [Paper] ISCA, 2021

  • A3: Accelerating Attention Mechanisms in Neural Networks with Approximation,  [Paper] HPCA, 2020

System-Level MoE Optimization

  • Tutel: Adaptive mixture-of-experts at scale,  [Paper] [Code] MLSys, 2023

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts,  [Paper] [Code] MLSys, 2023

  • SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization,  [Paper] USENIX ATC, 2023

  • MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism,  [Paper] [Code] IPDPS, 2023

  • EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models,  [Paper] arXiv, 2023

  • TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training,  [Paper] [Code] NeurIPS, 2022

  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale,  [Paper] [Code] ICML, 2022

  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models,  [Paper] [Code] PPoPP, 2022

  • FastMoE: A Fast Mixture-of-Expert Training System,  [Paper] [Code] arXiv, 2021