Overview
Large language models have demonstrated remarkable capabilities in tasks such as natural language understanding, generation, and complex reasoning. However, they demand enormous hardware resources, which has created a need for techniques that improve their efficiency. This survey organizes recent work on efficient large language models into several technique categories and reviews the latest trends.
Model Compression
Weight-Only Quantization (PTQ)
GPTQ: Accurate Quantization for Generative Pre-trained Transformers, [Paper] [Code] ICLR, 2023
QuIP: 2-Bit Quantization of Large Language Models With Guarantees, [Paper] [Code] arXiv, 2023
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, [Paper] [Code] arXiv, 2023
OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models, [Paper] [Code] arXiv, 2023
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, [Paper] [Code] arXiv, 2023
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs, [Paper] NeurIPS-ENLSP, 2023
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, [Paper] [Code] NeurIPS, 2022
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning, [Paper] [Code] NeurIPS, 2022
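As a concrete reference point for the papers above, the sketch below implements plain group-wise round-to-nearest (RTN) weight-only quantization, the simple baseline that GPTQ- and AWQ-style PTQ methods improve on; the 4-bit width, group size, and function names are illustrative assumptions rather than any specific paper's procedure.

```python
# Minimal group-wise round-to-nearest (RTN) weight-only quantization sketch.
import numpy as np

def quantize_weight_rtn(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Quantize a (out_features, in_features) weight matrix per group of input channels."""
    qmax = 2 ** bits - 1
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    groups = w.reshape(out_f, in_f // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = np.where(w_max > w_min, (w_max - w_min) / qmax, 1e-8)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale + zero), 0, qmax).astype(np.uint8)
    w_deq = (q.astype(np.float32) - zero) * scale          # what inference kernels reconstruct
    return q, scale, zero, w_deq.reshape(out_f, in_f)

w = np.random.randn(256, 1024).astype(np.float32)
q, scale, zero, w_hat = quantize_weight_rtn(w)
print("mean abs error:", np.abs(w - w_hat).mean())
```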
Weight-Activation Co-Quantization (PTQ)
Intriguing Properties of Quantization at Scale, [Paper] NeurIPS, 2023
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, [Paper] [Code] arXiv, 2023
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, [Paper] [Code] NeurIPS-ENLSP, 2023
OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization, [Paper] [Code] ISCA, 2023
RPTQ: Reorder-based Post-training Quantization for Large Language Models, [Paper] [Code] arXiv, 2023
Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling, [Paper] [Code] arXiv, 2023
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, [Paper] arXiv, 2023
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, [Paper] [Code] ICML, 2023
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers, [Paper] NeurIPS, 2022
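The weight-activation methods above largely revolve around handling activation outliers. The toy sketch below illustrates SmoothQuant-style difficulty migration, where a per-channel scale moves outliers from activations into weights before both are quantized to INT8; the alpha value, shapes, and per-tensor scales are illustrative assumptions.

```python
# SmoothQuant-style per-channel smoothing followed by naive INT8 quantization (toy sketch).
import numpy as np

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    # Per-input-channel maxima of activations and weights decide how much to migrate.
    return np.maximum(x_absmax ** alpha / (w_absmax ** (1 - alpha) + 1e-8), 1e-5)

def quantize_int8(t):
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64)); X[:, :4] *= 30.0            # a few outlier channels
W = rng.normal(size=(64, 16))

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]                            # X @ W == X_s @ W_s exactly

(qx, sx), (qw, sw) = quantize_int8(X_s), quantize_int8(W_s)
Y = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
print("INT8 matmul error:", np.abs(X @ W - Y).mean())
```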
Quantization-Aware Training (QAT)
BitNet: Scaling 1-bit Transformers for Large Language Models, [Paper] arXiv, 2023
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models, [Paper] [Code] arXiv, 2023
Compression of Generative Pre-trained Language Models via Quantization, [Paper] ACL, 2022
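Quantization-aware training typically keeps a full-precision weight copy and fake-quantizes it in the forward pass, letting gradients flow through a straight-through estimator. The PyTorch sketch below shows that pattern; the 4-bit symmetric scheme and layer sizes are illustrative assumptions.

```python
# Fake-quantized linear layer with a straight-through estimator (toy QAT sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()                 # forward uses q, backward sees identity

class QATLinear(nn.Linear):
    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight), self.bias)

layer = QATLinear(16, 16)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()                                  # gradients reach the full-precision weight
print(layer.weight.grad.abs().mean().item())
```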
Pruning: Structured Pruning
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, [Paper] arXiv, 2023
LLM-Pruner: On the Structural Pruning of Large Language Models, [Paper] [Code] NeurIPS, 2023
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, [Paper] [Code] NeurIPS-ENLSP, 2023
LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning, [Paper] arXiv, 2023
Pruning: Unstructured Pruning
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, [Paper] [Code] ICML, 2023
A Simple and Effective Pruning Approach for Large Language Models, [Paper] [Code] arXiv, 2023
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models, [Paper] arXiv, 2023
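A recurring idea in the one-shot pruning papers above is scoring weights by more than raw magnitude. The sketch below uses a weight-magnitude-times-activation-norm score, in the spirit of Wanda's criterion, to zero out a fixed fraction of weights; the 50% sparsity ratio and shapes are illustrative assumptions.

```python
# One-shot unstructured pruning with an |W| * activation-norm importance score (toy sketch).
import numpy as np

def prune_unstructured(w, act_norm, sparsity=0.5):
    # w: (out_features, in_features); act_norm: per-input-channel L2 norm of calibration activations.
    score = np.abs(w) * act_norm[None, :]
    k = int(w.size * sparsity)
    threshold = np.partition(score.ravel(), k)[k]
    mask = score >= threshold
    return w * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 512))
act_norm = np.linalg.norm(rng.normal(size=(64, 512)), axis=0)
W_pruned, mask = prune_unstructured(W, act_norm)
print("kept fraction:", mask.mean())              # roughly half of the weights remain
```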
Pruning: Low-Rank Approximation
TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition, [Paper] arXiv, 2023
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation, [Paper] [Code] ICML, 2023
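Low-rank approximation replaces a weight matrix with the product of two thin factors obtained from a truncated SVD; LoSparse-style methods add a sparse residual on top. Below is a minimal factorization sketch with an illustrative rank.

```python
# Truncated-SVD low-rank factorization of a weight matrix (toy sketch).
import numpy as np

def low_rank_factorize(w, rank):
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    A = u[:, :rank] * s[:rank]           # (out, rank)
    B = vt[:rank]                         # (rank, in)
    return A, B

W = np.random.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)
# Replacing W with A @ B cuts parameters from 1024*1024 to 2*1024*64.
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```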
White-Box KD
Towards the Law of Capacity Gap in Distilling Language Models, [Paper] [Code] arXiv, 2023
Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with no Performance Penalty, [Paper] arXiv, 2023
Knowledge Distillation of Large Language Models, [Paper] [Code] arXiv, 2023
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models, [Paper] arXiv, 2023
Propagating Knowledge Updates to LMs Through Distillation, [Paper] [Code] arXiv, 2023
Less is More: Task-aware Layer-wise Distillation for Language Model Compression, [Paper] ICML, 2023
Token-Scaled Logit Distillation for Ternary Weight Generative Language Models, [Paper] arXiv, 2023
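White-box distillation assumes access to the teacher's logits, and the most common objective is a temperature-scaled KL divergence between teacher and student token distributions (several papers above replace or augment this with reverse-KL or task-aware variants). The sketch below shows the standard forward-KL loss; the temperature and tensor shapes are illustrative assumptions.

```python
# Temperature-scaled KL distillation loss over token logits (toy sketch).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # logits: (batch, seq_len, vocab)
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

student_logits = torch.randn(2, 8, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                   # gradients flow only into the student
print(loss.item())
```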
Black-Box KD
Zephyr: Direct Distillation of LM Alignment, [Paper] arXiv, 2023
Lion: Adversarial Distillation of Closed-Source Large Language Model, [Paper] [Code] arXiv, 2023
Specializing Smaller Language Models towards Multi-Step Reasoning, [Paper] [Code] ICML, 2023
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, [Paper] ACL, 2023
Large Language Models Are Reasoning Teachers, [Paper] [Code] ACL, 2023
SCOTT: Self-Consistent Chain-of-Thought Distillation, [Paper] [Code] ACL, 2023
Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step, [Paper] ACL, 2023
Distilling Reasoning Capabilities into Smaller Language Models, [Paper] [Code] ACL, 2023
In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models, [Paper] arXiv, 2022
Explanations from Large Language Models Make Small Reasoners Better, [Paper] arXiv, 2022
DISCO: Distilling Counterfactuals with Large Language Models, [Paper] [Code] arXiv, 2022
Efficient Pre-Training
Mixed Precision Acceleration
GACT: Activation Compressed Training for Generic Network Architectures, [Paper] [Code] ICML, 2022
Mesa: A Memory-saving Training Framework for Transformers, [Paper] [Code] arXiv, 2021
Bfloat16 Processing for Neural Networks, [Paper] ARITH, 2019
A Study of BFLOAT16 for Deep Learning Training, [Paper] arXiv, 2019
Mixed Precision Training, [Paper] ICLR, 2018
Scaling Models
Learning to Grow Pretrained Models for Efficient Transformer Training, [Paper] [Code] ICLR, 2023
2x Faster Language Model Pre-training via Masked Structural Growth, [Paper] arXiv, 2023
Reusing Pretrained Models by Multi-linear Operators for Efficient Training, [Paper] NeurIPS, 2023
FLM-101B: An Open LLM and How to Train It with $100K Budget, [Paper] [Code] arXiv, 2023
Knowledge Inheritance for Pre-trained Language Models, [Paper] [Code] NAACL, 2022
Staged Training for Transformer Language Models, [Paper] [Code] ICML, 2022
Initialization Techniques
DeepNet: Scaling Transformers to 1,000 Layers, [Paper] [Code] arXiv, 2022
ZerO Initialization: Initializing Neural Networks with only Zeros and Ones, [Paper] [Code] TMLR, 2022
Rezero is All You Need: Fast Convergence at Large Depth, [Paper] [Code] UAI, 2021
Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks, [Paper] NeurIPS, 2020
Improving Transformer Optimization Through Better Initialization, [Paper] [Code] ICML, 2020
Fixup Initialization: Residual Learning without Normalization, [Paper] ICLR, 2019
On Weight Initialization in Deep Neural Networks, [Paper] arXiv, 2017
Optimization Strategies
Symbolic Discovery of Optimization Algorithms, [Paper] arXiv, 2023
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training, [Paper] [Code] arXiv, 2023
Efficient Fine-Tuning
PEFT: Adapter-based Tuning
OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models, [Paper] [Code] ACL Demo, 2023
LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models, [Paper] [Code] EMNLP, 2023
Compacter: Efficient Low-Rank Hypercomplex Adapter Layers, [Paper] [Code] NeurIPS, 2021
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning, [Paper] [Code] NeurIPS, 2022
Meta-Adapters: Parameter Efficient Few-shot Fine-tuning through Meta-Learning, [Paper] AutoML, 2022
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning, [Paper] [Code] EMNLP, 2022
SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters, [Paper] [Code] EMNLP, 2022
PEFT: Low-Rank Adaptation
LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning, [Paper] arXiv, 2023
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, [Paper] [Code] arXiv, 2023
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, [Paper] [Code] arXiv, 2023
Multi-Head Adapter Routing for Cross-Task Generalization, [Paper] [Code] NeurIPS, 2023
Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, [Paper] ICLR, 2023
DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation, [Paper] [Code] EACL, 2023
Tied-Lora: Enhancing Parameter Efficiency of LoRA with Weight Tying, [Paper] arXiv, 2023
LoRA: Low-Rank Adaptation of Large Language Models, [Paper] [Code] ICLR, 2022
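Most of the methods above build on the basic LoRA layer, where a frozen pretrained weight is combined with a trainable low-rank update scaled by alpha / r. A minimal sketch, with illustrative rank and dimensions:

```python
# LoRA-augmented linear layer: frozen base weight plus a trainable low-rank update (toy sketch).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero-init => no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)                         # only A and B are updated
```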
Prefix Tuning
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, [Paper] [Code] arXiv, 2023
Prefix-Tuning: Optimizing Continuous Prompts for Generation, [Paper] [Code] ACL, 2021
Prompt Tuning
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt, [Paper] arXiv, 2023
Multi-Task Pre-Training of Modular Prompt for Few-Shot Learning, [Paper] [Code] ACL, 2023
Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning, [Paper] ICLR, 2023
PPT: Pre-trained Prompt Tuning for Few-shot Learning, [Paper] [Code] ACL, 2022
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers, [Paper] [Code] EMNLP-Findings, 2022
P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks, [Paper] [Code] ACL-Short, 2022
The Power of Scale for Parameter-Efficient Prompt Tuning, [Paper] EMNLP, 2021
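Prompt tuning optimizes a small matrix of continuous "virtual token" embeddings that is prepended to the input while the backbone stays frozen. The sketch below shows only that prepending step; the number of virtual tokens and hidden size are illustrative assumptions.

```python
# Soft prompt tuning: learnable virtual-token embeddings prepended to the input (toy sketch).
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_virtual_tokens=20, hidden_size=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds):                       # (batch, seq, hidden)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)    # fed to the frozen LM

soft_prompt = SoftPrompt()
x = torch.randn(4, 32, 768)
print(soft_prompt(x).shape)                                 # torch.Size([4, 52, 768])
```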
Memory-Efficient Fine-Tuning
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model, [Paper] [Code] NeurIPS, 2023
Memory-Efficient Selective Fine-Tuning, [Paper] ICML Workshop, 2023
Full Parameter Fine-tuning for Large Language Models with Limited Resources, [Paper] [Code] arXiv, 2023
Fine-Tuning Language Models with Just Forward Passes, [Paper] [Code] NeurIPS, 2023
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization, [Paper] NeurIPS, 2023
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, [Paper] [Code] arXiv, 2023
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, [Paper] [Code] arXiv, 2023
QLoRA: Efficient Finetuning of Quantized LLMs, [Paper] [Code1] [Code2] NeurIPS, 2023
Efficient Inference
Speculative Decoding
PaSS: Parallel Speculative Sampling, [Paper] NeurIPS Workshop, 2023
Accelerating Transformer Inference for Translation via Parallel Decoding, [Paper] [Code] ACL, 2023
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, [Blog] [Code] Blog, 2023
Fast Inference from Transformers via Speculative Decoding, [Paper] ICML, 2023
Accelerating LLM Inference with Staged Speculative Decoding, [Paper] ICML Workshop, 2023
Accelerating Large Language Model Decoding with Speculative Sampling, [Paper] arXiv, 2023
Speculative Decoding with Big Little Decoder, [Paper] [Code] NeurIPS, 2023
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification, [Paper] [Code] arXiv, 2023
Inference with Reference: Lossless Acceleration of Large Language Models, [Paper] [Code] arXiv, 2023
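All of these methods share one loop: a cheap drafter proposes a block of tokens, and the target model verifies them in a single pass, keeping the longest accepted prefix plus one of its own tokens. The sketch below shows that loop with greedy verification and hash-based toy models; it is an illustrative stand-in, not any paper's exact sampling-based acceptance rule.

```python
# Speculative decoding with greedy verification over toy hash-based "models" (sketch).
VOCAB = 50

def target_next(prefix):                     # "large" model: one greedy token per prefix
    return hash(tuple(prefix)) % VOCAB

def draft_next(prefix):                      # "small" model: agrees with the target most of the time
    tok = target_next(prefix)
    return tok if len(prefix) % 3 else (tok + 1) % VOCAB

def speculative_step(seq, k=4):
    draft = []
    for _ in range(k):                       # k cheap draft calls
        draft.append(draft_next(seq + draft))
    # In a real system this is one batched forward pass of the target model.
    target = [target_next(seq + draft[:i]) for i in range(k + 1)]
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    accepted.append(target[len(accepted)])   # bonus token from the target model
    return seq + accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(len(seq), seq)                          # more than one token accepted per target pass
```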
KV-Cache Optimization
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, [Paper] arXiv, 2023
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, [Paper] arXiv, 2023
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, [Paper] NeurIPS, 2023
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, [Paper] NeurIPS, 2023
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, [Paper] arXiv, 2023
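Several of the papers above bound KV-cache growth by keeping only a budget of "important" plus recent tokens. The sketch below shows such an eviction step using an accumulated-attention score, in the spirit of heavy-hitter approaches; the budget split and scores are illustrative assumptions.

```python
# KV-cache eviction keeping recent tokens plus the highest-attention "heavy hitters" (toy sketch).
import numpy as np

def evict_kv(keys, values, attn_scores_sum, budget=8, recent=4):
    """keys/values: (seq, head_dim); attn_scores_sum: accumulated attention per cached position."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_scores_sum
    recent_idx = np.arange(seq_len - recent, seq_len)
    older = np.arange(seq_len - recent)
    heavy_idx = older[np.argsort(attn_scores_sum[older])[-(budget - recent):]]
    keep = np.sort(np.concatenate([heavy_idx, recent_idx]))
    return keys[keep], values[keep], attn_scores_sum[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
scores = rng.random(32)
K2, V2, s2 = evict_kv(K, V, scores)
print(K2.shape)                               # (8, 64): the cache stays bounded as the sequence grows
```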
Efficient Architecture
Sharing-based Attention
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, [Paper] EMNLP, 2023
Fast Transformer Decoding: One Write-Head is All You Need, [Paper] arXiv, 2019
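Multi-query and grouped-query attention shrink the KV cache by letting several query heads share one key/value head. A minimal sketch, with illustrative head counts:

```python
# Grouped-query attention: each KV head serves a group of query heads (toy sketch).
import torch

def grouped_query_attention(q, k, v, num_kv_heads):
    # q: (batch, num_q_heads, seq, d); k, v: (batch, num_kv_heads, seq, d)
    group = q.shape[1] // num_kv_heads
    k = k.repeat_interleave(group, dim=1)     # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

q = torch.randn(2, 8, 16, 64)                 # 8 query heads
k = torch.randn(2, 2, 16, 64)                 # only 2 KV heads => 4x smaller KV cache
v = torch.randn(2, 2, 16, 64)
print(grouped_query_attention(q, k, v, num_kv_heads=2).shape)
```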
Feature Information Reduction
Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention, [Paper] [Code] AAAI, 2021
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing, [Paper] [Code] NeurIPS, 2020
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks, [Paper] ICML, 2019
Kernelization or Low-Rank
Sumformer: Universal Approximation for Efficient Transformers, [Paper] ICML Workshop, 2023
FLuRKA: Fast fused Low-Rank & Kernel Attention, [Paper] arXiv, 2023
Scatterbrain: Unifying Sparse and Low-rank Attention, [Paper] [Code] NeurIPS, 2021
Rethinking Attention with Performers, [Paper] [Code] ICLR, 2021
Random Feature Attention, [Paper] ICLR, 2021
Linformer: Self-Attention with Linear Complexity, [Paper] [Code] arXiv, 2020
Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer, [Paper] ICASSP, 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, [Paper] [Code] ICML, 2020
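Kernelized attention replaces softmax(QK^T)V with phi(Q)(phi(K)^T V), which is linear rather than quadratic in sequence length. The sketch below uses the elu+1 feature map from the linear-attention line of work in a non-causal formulation; shapes are illustrative assumptions.

```python
# Kernelized (linear) attention with an elu+1 feature map (toy, non-causal sketch).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, heads, seq, d); v: (batch, heads, seq, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhsd,bhsv->bhdv", k, v)             # cost is O(seq * d * d_v)
    z = 1.0 / (torch.einsum("bhsd,bhd->bhs", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhsd,bhdv,bhs->bhsv", q, kv, z)

q, k, v = (torch.randn(2, 4, 128, 32) for _ in range(3))
print(linear_attention(q, k, v).shape)                      # (2, 4, 128, 32)
```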
Fixed Pattern Strategies
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, [Paper] ICML Workshop, 2023
Poolingformer: Long Document Modeling with Pooling Attention, [Paper] ICML, 2021
Big Bird: Transformers for Longer Sequences, [Paper] [Code] NeurIPS, 2020
Longformer: The Long-Document Transformer, [Paper] [Code] arXiv, 2020
Blockwise Self-Attention for Long Document Understanding, [Paper] [Code] EMNLP, 2020
Generating Long Sequences with Sparse Transformers, [Paper] arXiv, 2019
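Fixed-pattern methods restrict each query to a predefined set of keys, such as a local window. The sketch below builds a causal sliding-window mask (Longformer-style local attention without the global tokens), with an illustrative window size.

```python
# Causal sliding-window attention mask (toy sketch).
import torch

def sliding_window_mask(seq_len, window=4):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)        # each query sees only the last `window` tokens

mask = sliding_window_mask(8)
print(mask.int())
# Each query attends to at most `window` keys, so attention cost is O(seq_len * window).
```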
Learnable Pattern Strategies
HyperAttention: Long-context Attention in Near-Linear Time, [Paper] [Code] arXiv, 2023
ClusterFormer: Neural Clustering Attention for Efficient and Effective Transformer, [Paper] ACL, 2022
Reformer: The Efficient Transformer, [Paper] [Code] ICLR, 2020
Sparse Sinkhorn Attention, [Paper] ICML, 2020
Fast Transformers with Clustered Attention, [Paper] [Code] NeurIPS, 2020
Efficient Content-Based Sparse Attention with Routing Transformers, [Paper] [Code] TACL, 2020
Mixture of Experts
MoE-based LLMs
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing, [Paper] arXiv, 2023
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, [Paper] [Code] JMLR, 2022
Efficient Large Scale Language Modeling with Mixtures of Experts, [Paper] [Code] EMNLP, 2022
BASE Layers: Simplifying Training of Large, Sparse Models, [Paper] [Code] ICML, 2021
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, [Paper] ICLR, 2021
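MoE layers replace one dense FFN with many expert FFNs and a router that activates only the top-k of them per token. The sketch below shows a top-2 routed layer without the load-balancing losses these systems rely on; expert count, k, and sizes are illustrative assumptions.

```python
# Top-2 gated mixture-of-experts FFN layer (toy sketch, no load balancing).
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)
        weights, idx = torch.topk(torch.softmax(logits, dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                sel = idx[:, slot] == e        # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

print(MoELayer()(torch.randn(10, 64)).shape)   # only k experts run for each token
```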
Algorithm-Level MoE Optimization
Lifelong Language Pretraining with Distribution-Specialized Experts, [Paper] ICML, 2023
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, [Paper] arXiv, 2023
Mixture-of-Experts with Expert Choice Routing, [Paper] NeurIPS, 2022
StableMoE: Stable Routing Strategy for Mixture of Experts, [Paper] [Code] ACL, 2022
On the Representation Collapse of Sparse Mixture of Experts, [Paper] NeurIPS, 2022
Long Context LLMs: Extrapolation and Interpolation
Scaling Laws of RoPE-based Extrapolation, [Paper] arXiv, 2023
A Length-Extrapolatable Transformer, [Paper] [Code] ACL, 2023
Extending Context Window of Large Language Models via Positional Interpolation, [Paper] arXiv, 2023
NTK Interpolation, [Reddit post] Blog, 2023
YaRN: Efficient Context Window Extension of Large Language Models, [Paper] [Code] arXiv, 2023
CLEX: Continuous Length Extrapolation for Large Language Models, [Paper] [Code] arXiv, 2023
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training, [Paper] [Code] arXiv, 2023
Functional Interpolation for Relative Positions Improves Long Context Transformers, [Paper] arXiv, 2023
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, [Paper] [Code] ICLR, 2022
Exploring Length Generalization in Large Language Models, [Paper] NeurIPS, 2022
The EOS Decision and Length Extrapolation, [Paper] [Code] EMNLP, 2020
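Position interpolation extends a RoPE model's context by rescaling positions beyond the training length back into the trained range before the rotary embedding is applied (NTK-aware and YaRN variants adjust the frequencies instead). A minimal sketch, with illustrative dimensions and context lengths:

```python
# RoPE position interpolation: rescale positions into the trained range (toy sketch).
import torch

def rope_angles(positions, dim=64, base=10000.0, train_len=2048, target_len=8192):
    scale = train_len / target_len                        # position interpolation factor
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    return torch.outer(positions.float() * scale, inv_freq)   # (seq, dim/2)

def apply_rope(x, angles):                                 # x: (seq, dim)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(8192, 64)
print(apply_rope(x, rope_angles(torch.arange(8192))).shape)
```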
Long Context LLMs: Recurrent Structure
Retentive Network: A Successor to Transformer for Large Language Models, [Paper] [Code] arXiv, 2023
∞-former: Infinite Memory Transformer, [Paper] [Code] ACL, 2022
Memformer: A Memory-Augmented Transformer for Sequence Modeling, [Paper] [Code] AACL-Findings, 2020
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, [Paper] [Code] ACL, 2019
Long Context LLMs: Segmentation and Sliding Window
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, [Paper] arXiv, 2024
Extending Context Window of Large Language Models via Semantic Compression, [Paper] arXiv, 2023
Efficient Streaming Language Models with Attention Sinks, [Paper] [Code] arXiv, 2023
Parallel Context Windows for Large Language Models, [Paper] [Code] ACL, 2023
LongNet: Scaling Transformers to 1,000,000,000 Tokens, [Paper] [Code] arXiv, 2023
Efficient Long-Text Understanding with Short-Text Models, [Paper] [Code] TACL, 2023
Memory-Retrieval Augmentation
Landmark Attention: Random-Access Infinite Context Length for Transformers, [Paper] [Code] arXiv, 2023
Augmenting Language Models with Long-Term Memory, [Paper] NeurIPS, 2023
Unlimiformer: Long-Range Transformers with Unlimited Length Input, [Paper] [Code] NeurIPS, 2023
Focused Transformer: Contrastive Training for Context Scaling, [Paper] [Code] NeurIPS, 2023
Retrieval meets Long Context Large Language Models, [Paper] arXiv, 2023
Transformer Alternative Architecture
State Space Models
Sparse Modular Activation for Efficient Sequence Modeling, [Paper] [Code] NeurIPS, 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces, [Paper] [Code] arXiv, 2023
Hungry Hungry Hippos: Towards Language Modeling with State Space Models, [Paper] [Code] ICLR, 2023
Long Range Language Modeling via Gated State Spaces, [Paper] ICLR, 2023
Block-State Transformers, [Paper] NeurIPS, 2023
Efficiently Modeling Long Sequences with Structured State Spaces, [Paper] [Code] ICLR, 2022
Diagonal State Spaces are as Effective as Structured State Spaces, [Paper] [Code] NeurIPS, 2022
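State-space models replace attention with a linear recurrence over a fixed-size hidden state, so decoding needs O(1) memory per step instead of a growing KV cache. The sketch below runs a toy diagonal SSM recurrently; the parameterization is an illustrative assumption (S4 uses a structured initialization and careful discretization, and Mamba makes the parameters input-dependent).

```python
# Toy diagonal state-space layer run as a per-step recurrence (sketch).
import numpy as np

def diagonal_ssm(x, state_size=16, seed=0):
    # x: (seq_len, channels)
    rng = np.random.default_rng(seed)
    seq_len, channels = x.shape
    A = np.exp(-rng.random((channels, state_size)))        # stable decay per state dimension
    B = rng.normal(size=(channels, state_size)) * 0.1
    C = rng.normal(size=(channels, state_size)) * 0.1
    h = np.zeros((channels, state_size))
    y = np.empty_like(x)
    for t in range(seq_len):
        h = A * h + B * x[t][:, None]                      # recurrent state update, O(1) memory
        y[t] = (C * h).sum(axis=-1)
    return y

print(diagonal_ssm(np.random.randn(128, 8)).shape)
```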
Other Sequential Models
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation, [Paper] arXiv, 2023
RWKV: Reinventing RNNs for the Transformer Era, [Paper] EMNLP-Findings, 2023
Hyena Hierarchy: Towards Larger Convolutional Language Models, [Paper] arXiv, 2023
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, [Paper] arXiv, 2023
Data-Centric Methods
Data Selection for Efficient Pre-Training
Data Selection for Language Models via Importance Resampling, [Paper] [Code] NeurIPS, 2023
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework, [Paper] [Code] ICML, 2022
Span Selection Pre-training for Question Answering, [Paper] [Code] ACL, 2020
Data Selection for Efficient Fine-Tuning
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, [Paper] [Code] arXiv, 2023
One Shot Learning as Instruction Data Prospector for Large Language Models, [Paper] arXiv, 2023
MoDS: Model-oriented Data Selection for Instruction Tuning, [Paper] [Code] arXiv, 2023
Instruction Mining: When Data Mining Meets Large Language Model Finetuning, [Paper] arXiv, 2023
Data-Efficient Finetuning Using Cross-Task Nearest Neighbors, [Paper] [Code] ACL, 2023
Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values, [Paper] [Code] ACL SRW, 2023
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning, [Paper] arXiv, 2023
AlpaGasus: Training A Better Alpaca with Fewer Data, [Paper] [Code] arXiv, 2023
LIMA: Less Is More for Alignment, [Paper] arXiv, 2023
Prompt Engineering
Demonstration Selection
Unified Demonstration Retriever for In-Context Learning, [Paper] [Code] ACL, 2023
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning, [Paper] [Code] NeurIPS, 2023
In-Context Learning with Iterative Demonstration Selection, [Paper] arXiv, 2023
Dr.ICL: Demonstration-Retrieved In-context Learning, [Paper] arXiv, 2023
Learning to Retrieve In-Context Examples for Large Language Models, [Paper] arXiv, 2023
Finding Supporting Examples for In-Context Learning, [Paper] arXiv, 2023
Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering, [Paper] [Code] ACL, 2023
Selective Annotation Makes Language Models Better Few-Shot Learners, [Paper] [Code] ICLR, 2023
What Makes Good In-Context Examples for GPT-3? [Paper] DeeLIO, 2022
Learning To Retrieve Prompts for In-Context Learning, [Paper] [Code] NAACL-HLT, 2022
Active Example Selection for In-Context Learning, [Paper] [Code] EMNLP, 2022
Rethinking the Role of Demonstrations: What makes In-context Learning Work? [Paper] [Code] EMNLP, 2022
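A common baseline across these papers is retrieval-based selection: embed the candidate demonstrations, embed the test query, and prepend the k nearest neighbors to the prompt. The sketch below uses a toy bag-of-words embedding as an illustrative stand-in for a learned retriever.

```python
# Similarity-based demonstration selection for in-context learning (toy sketch).
import numpy as np

def embed(texts, vocab):
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                vecs[i, vocab[w]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-8)

pool = ["translate the sentence", "sum the two numbers", "translate this phrase"]
query = "please translate the following phrase"
vocab = {w: i for i, w in enumerate(sorted({w for t in pool + [query] for w in t.lower().split()}))}

sims = embed(pool, vocab) @ embed([query], vocab)[0]
top_k = np.argsort(sims)[::-1][:2]
print([pool[i] for i in top_k])               # demonstrations to prepend to the prompt
```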
Demonstration Ordering
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, [Paper] ACL, 2022
Instruction Generation
Large Language Models as Optimizers, [Paper] arXiv, 2023
Instruction Induction: From Few Examples to Natural Language Task Descriptions, [Paper] [Code] ACL, 2023
Large Language Models Are Human-Level Prompt Engineers, [Paper] [Code] ICLR, 2023
TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design, [Paper] arXiv, 2023
Self-Instruct: Aligning Language Model with Self Generated Instructions, [Paper] [Code] ACL, 2023
Multi-Step Reasoning
Automatic Chain of Thought Prompting in Large Language Models, [Paper] [Code] ICLR, 2023
Measuring and Narrowing the Compositionality Gap in Language Models, [Paper] [Code] EMNLP, 2023
ReAct: Synergizing Reasoning and Acting in Language Models, [Paper] [Code] ICLR, 2023
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, [Paper] ICLR, 2023
Graph of Thoughts: Solving Elaborate Problems with Large Language Models, [Paper] [Code] arXiv, 2023
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, [Paper] [Code] NeurIPS, 2023
Self-Consistency Improves Chain of Thought Reasoning in Language Models, [Paper] ICLR, 2023
Contrastive Chain-of-Thought Prompting, [Paper] [Code] arXiv, 2023
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation, [Paper] arXiv, 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, [Paper] NeurIPS, 2022
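Self-consistency, one of the simplest techniques in this group, samples several chain-of-thought completions and majority-votes over their final answers. The sketch below shows only that aggregation step; `sample_completion` is a hypothetical stand-in for a temperature-sampled LLM call.

```python
# Self-consistency: majority vote over sampled chain-of-thought answers (toy sketch).
from collections import Counter
import random

def sample_completion(prompt, seed):
    # Hypothetical stand-in: a real system would sample an LLM with temperature > 0.
    random.seed(seed)
    return f"... reasoning ... The answer is {random.choice([42, 42, 42, 41])}"

def self_consistency(prompt, n_samples=9):
    answers = []
    for s in range(n_samples):
        completion = sample_completion(prompt, seed=s)
        answers.append(completion.rsplit("The answer is", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("Q: ... Let's think step by step."))
```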
Parallel Generation
Prompt Compression
Learning to Compress Prompts with Gist Tokens, [Paper] arXiv, 2023
Adapting Language Models to Compress Contexts, [Paper] [Code] EMNLP, 2023
In-context Autoencoder for Context Compression in a Large Language Model, [Paper] [Code] arXiv, 2023
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression, [Paper] [Code] arXiv, 2023
Discrete Prompt Compression with Reinforcement Learning, [Paper] arXiv, 2023
Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models, [Paper] arXiv, 2023
Prompt Generation
TempLM: Distilling Language Models into Template-Based Generators, [Paper] [Code] arXiv, 2022
PromptGen: Automatically Generate Prompts using Generative Models, [Paper] NAACL Findings, 2022
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts, [Paper] [Code] EMNLP, 2020
System-Level Efficiency Optimization and LLM Frameworks
System-Level Pre-Training Efficiency Optimization
CoLLiE: Collaborative Training of Large Language Models in an Efficient Way, [Paper] [Code] EMNLP, 2023
An Efficient 2D Method for Training Super-Large Deep Learning Models, [Paper] [Code] IPDPS, 2023
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, [Paper] VLDB, 2023
Bamboo: Making Preemptible Instances Resilient for Affordable Training, [Paper] [Code] NSDI, 2023
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates, [Paper] [Code] SOSP, 2023
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, [Paper] [Code] EuroSys, 2022
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization, [Paper] [Code] OSDI, 2022
Tesseract: Parallelize the Tensor Parallelism Efficiently, [Paper] ICPP, 2022
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning, [Paper] [Code] OSDI, 2022
Maximizing Parallelism in Distributed Training for Huge Neural Networks, [Paper] arXiv, 2021
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, [Paper] [Code] SC, 2021
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning, [Paper] SC, 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training, [Paper] [Code] USENIX ATC, 2021
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, [Paper] [Code] SC, 2020
System-Level Inference Efficiency Optimization
LLM in a flash: Efficient Large Language Model Inference with Limited Memory, [Paper] arXiv, 2023
SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision, [Paper] EMNLP, 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU, [Paper] [Code] ICML, 2023
Flash-Decoding for Long-Context Inference, [Blog] Blog, 2023
FlashDecoding++: Faster Large Language Model Inference on GPUs, [Paper] arXiv, 2023
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, [Paper] ICML, 2023
Efficiently Scaling Transformer Inference, [Paper] MLSys, 2023
S3: Increasing GPU Utilization during Generative Inference for Higher Throughput, [Paper] arXiv, 2023
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale, [Paper] SC, 2022
System-Level Serving Efficiency Optimization
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, [Paper] [Code] arXiv, 2023
S-LoRA: Serving Thousands of Concurrent LoRA Adapters, [Paper] [Code] arXiv, 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention, [Paper] [Code] SOSP, 2023
Orca: A Distributed Serving System for Transformer-Based Generative Models, [Paper] OSDI, 2022
Fast Distributed Inference Serving for Large Language Models, [Paper] arXiv, 2023
Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models, [Paper] arXiv, 2023
SpotServe: Serving Generative Large Language Models on Preemptible Instances, [Paper] arXiv, 2023
TurboTransformers: An Efficient GPU Serving System for Transformer Models, [Paper] PPoPP, 2021
System-Level Efficient Architecture Optimization
System-Level Attention Optimization
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, [Paper] [Code] arXiv, 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention, [Paper] [Code] SOSP, 2023
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, [Paper] [Code] NeurIPS, 2022
Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server, [Blog] Nvidia Blog, 2022
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks, [Paper] ISCA, 2021
A3: Accelerating Attention Mechanisms in Neural Networks with Approximation, [Paper] HPCA, 2020
System-Level MoE Optimization
Tutel: Adaptive mixture-of-experts at scale, [Paper] [Code] MLSys, 2023
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, [Paper] [Code] MLSys, 2023
SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization, [Paper] USENIX ATC, 2023
MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism, [Paper] [Code] IPDPS, 2023
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models, [Paper] arXiv, 2023
TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training, [Paper] [Code] NeurIPS, 2022
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, [Paper] [Code] ICML, 2022
FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models, [Paper] [Code] PPoPP, 2022
FastMoE: A Fast Mixture-of-Expert Training System, [Paper] [Code] arXiv, 2021