Overview

Large language models (LLMs) have demonstrated remarkable capabilities in tasks such as natural language understanding, generation, and complex reasoning. However, they demand enormous hardware resources, which has driven the development of techniques that make them more efficient. This technology-trend overview organizes such techniques into several categories and surveys recent work on efficient large language models in each of them.

Model Compression

Weight-Only Quantization (PTQ)

  • GPTQ: Accurate Quantization for Generative Pre-trained Transformers,  [Paper] [Code] ICLR, 2023

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees,  [Paper] [Code] arXiv, 2023

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,  [Paper] [Code] arXiv, 2023

  • OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models,  [Paper] [Code] arXiv, 2023

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression,  [Paper] [Code] arXiv, 2023

  • FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs,  [Paper] NeurIPS-ENLSP, 2023

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,  [Paper] [Code] NeurIPS, 2022

  • Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning,  [Paper] [Code] NeurIPS, 2022
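
To make the category concrete, below is a minimal sketch of the starting point these post-training methods share: quantize only the weights, per output channel, with a round-to-nearest (RTN) baseline. It is not GPTQ or AWQ themselves (those add error compensation and activation-aware scaling on top), just an illustration of weight-only PTQ in PyTorch.

```python
import torch

def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 4):
    """Per-output-channel round-to-nearest (RTN) weight-only quantization.

    w: [out_features, in_features] full-precision weight matrix.
    Returns integer codes plus one scale per output channel for dequantization.
    """
    qmax = 2 ** (n_bits - 1) - 1                                 # 7 for symmetric 4-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_rtn(w)
w_hat = q.float() * scale                                        # dequantize before the matmul
print((w_hat - w).abs().mean())                                  # average per-weight quantization error
```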

Weight-Activation Co-Quantization (PTQ)

  • Intriguing Properties of Quantization at Scale,  [Paper] NeurIPS, 2023

  • ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation,  [Paper] [Code] arXiv, 2023

  • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats,  [Paper] [Code] NeurIPS-ENLSP, 2023

  • OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization,  [Paper] [Code] ISCA, 2023

  • RPTQ: Reorder-based Post-training Quantization for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling,  [Paper] [Code] arXiv, 2023

  • QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models,  [Paper] arXiv, 2023

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,  [Paper] [Code] ICML, 2023

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers,  [Paper] NeurIPS, 2022
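
Most of the entries above wrestle with activation outliers. Below is a simplified sketch of the equivalent-transformation trick popularized by SmoothQuant (and refined by Outlier Suppression+): migrate quantization difficulty from activations to weights with a per-channel scale, so the product stays numerically identical while both factors quantize more easily. The calibration statistics here are random placeholders.

```python
import torch

def smoothing_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """SmoothQuant-style per-input-channel smoothing scales.

    act_absmax: [in_features] max |activation| per channel from calibration data.
    weight:     [out_features, in_features] of the following linear layer.
    """
    w_absmax = weight.abs().amax(dim=0)                    # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

weight = torch.randn(4096, 4096)
act_absmax = torch.rand(4096) * 50 + 1                     # stand-in calibration statistics
s = smoothing_scales(act_absmax, weight)

x = torch.randn(8, 4096) * act_absmax                      # activations with outlier channels
y_ref = x @ weight.T
y_smoothed = (x / s) @ (weight * s).T                      # same output, smoother activations
print(((y_ref - y_smoothed).abs().max() / y_ref.abs().max()).item())  # tiny: mathematically identical
```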

Quantization-Aware Training (QAT)

  • BitNet: Scaling 1-bit Transformers for Large Language Models,  [Paper] arXiv, 2023

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Compression of Generative Pre-trained Language Models via Quantization,  [Paper] ACL, 2022
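
Quantization-aware training keeps a full-precision copy of the weights, fake-quantizes them in the forward pass, and lets gradients bypass the rounding via a straight-through estimator (STE). A minimal STE sketch follows; BitNet and LLM-QAT build their 1-bit and low-bit training on this mechanism plus many additional details.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round weights in the forward pass, pass gradients straight through (STE)."""

    @staticmethod
    def forward(ctx, w, n_bits=8):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                 # straight-through: ignore the rounding op

w = torch.randn(16, 16, requires_grad=True)
loss = FakeQuant.apply(w).sum()
loss.backward()
print(w.grad.abs().sum() > 0)                    # gradients flow despite the non-differentiable round
```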

Pruning: Structured Pruning

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery,  [Paper] arXiv, 2023

  • LLM-Pruner: On the Structural Pruning of Large Language Models,  [Paper] [Code] NeurIPS, 2023

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning,  [Paper] [Code] NeurIPS-ENLSP, 2023

  • LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning,  [Paper] arXiv, 2023
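
Structured pruning removes whole units (attention heads, FFN neurons, layers) so the remaining matrices stay dense and need no sparse kernels. A tiny sketch of dropping intermediate FFN neurons by a simple norm-based importance score follows; LLM-Pruner and Sheared LLaMA above use more sophisticated, gradient-informed importance estimates and recover accuracy with additional training.

```python
import torch

def prune_ffn_neurons(w_in: torch.Tensor, w_out: torch.Tensor, keep_ratio: float = 0.75):
    """Drop whole intermediate FFN neurons, shrinking both projection matrices.

    w_in:  [d_ff, d_model] up projection; w_out: [d_model, d_ff] down projection.
    Importance here is just the product of weight norms per neuron (an illustrative heuristic).
    """
    importance = w_in.norm(dim=1) * w_out.norm(dim=0)
    keep = importance.topk(int(w_in.size(0) * keep_ratio)).indices.sort().values
    return w_in[keep], w_out[:, keep]

w_in, w_out = torch.randn(11008, 4096), torch.randn(4096, 11008)
w_in_p, w_out_p = prune_ffn_neurons(w_in, w_out)
print(w_in_p.shape, w_out_p.shape)               # [8256, 4096] and [4096, 8256]
```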

Pruning: Unstructured Pruning

  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,  [Paper] [Code] ICML, 2023

  • A Simple and Effective Pruning Approach for Large Language Models,  [Paper] [Code] arXiv, 2023

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models,  [Paper] arXiv, 2023
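
For reference, a bare magnitude-pruning baseline for the unstructured setting is sketched below; SparseGPT and the other one-shot methods above improve on this with layer-wise reconstruction and activation statistics, so the sketch only shows the mechanics of producing a sparse weight mask.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5):
    """Unstructured magnitude pruning: zero out the smallest-|w| fraction of weights."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    return w * mask, mask

w = torch.randn(1024, 1024)
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(mask.float().mean())                       # ≈ 0.5 of the weights remain
```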

Pruning: Low-Rank Approximation

  • TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition,  [Paper] arXiv, 2023

  • LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation,  [Paper] [Code] ICML, 2023

White-Box KD

  • Towards the Law of Capacity Gap in Distilling Language Models,  [Paper] [Code] arXiv, 2023

  • Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with no Performance Penalty,  [Paper] arXiv, 2023

  • Knowledge Distillation of Large Language Models,  [Paper] [Code] arXiv, 2023

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models,  [Paper] arXiv, 2023

  • Propagating Knowledge Updates to LMs Through Distillation,  [Paper] [Code] arXiv, 2023

  • Less is More: Task-aware Layer-wise Distillation for Language Model Compression,  [Paper] ICML, 2023

  • Token-Scaled Logit Distillation for Ternary Weight Generative Language Models,  [Paper] arXiv, 2023
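
White-box KD assumes access to the teacher's logits. The snippet below shows the baseline temperature-scaled forward-KL distillation loss on those logits; several papers above (e.g., Knowledge Distillation of Large Language Models, GKD) argue that reverse-KL or on-policy variants work better for generative LLMs, so treat this only as the standard form.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 2.0):
    """Temperature-scaled forward-KL distillation on next-token logits.

    Both tensors have shape [batch * seq, vocab]; the T^2 factor keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

print(kd_loss(torch.randn(32, 32000), torch.randn(32, 32000)).item())
```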

Black-Box KD

  • Zephyr: Direct Distillation of LM Alignment,  [Paper] arXiv, 2023

  • Instruction Tuning with GPT-4,  [Paper] [Code] arXiv, 2023

  • Lion: Adversarial Distillation of Closed-Source Large Language Model,  [Paper] [Code] arXiv, 2023

  • Specializing Smaller Language Models towards Multi-Step Reasoning,  [Paper] [Code] ICML, 2023

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes,  [Paper] ACL, 2023

  • Large Language Models Are Reasoning Teachers,  [Paper] [Code] ACL, 2023

  • SCOTT: Self-Consistent Chain-of-Thought Distillation,  [Paper] [Code] ACL, 2023

  • Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step,  [Paper] ACL, 2023

  • Distilling Reasoning Capabilities into Smaller Language Models,  [Paper] [Code] ACL, 2023

  • In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models,  [Paper] arXiv, 2022

  • Explanations from Large Language Models Make Small Reasoners Better,  [Paper] arXiv, 2022

  • DISCO: Distilling Counterfactuals with Large Language Models,  [Paper] [Code] arXiv, 2022

Efficient Pre-Training

Mixed Precision Acceleration

  • GACT: Activation Compressed Training for Generic Network Architectures,  [Paper] [Code] ICML, 2022

  • Mesa: A Memory-saving Training Framework for Transformers,  [Paper] [Code] arXiv, 2021

  • Bfloat16 Processing for Neural Networks,  [Paper] ARITH, 2019

  • A Study of BFLOAT16 for Deep Learning Training,  [Paper] arXiv, 2019

  • Mixed Precision Training,  [Paper] ICLR, 2018
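
These papers underpin today's standard practice of running most of the forward and backward pass in FP16 or BF16 while keeping master weights and sensitive accumulations in FP32. A minimal PyTorch AMP loop in that spirit is shown below (it assumes a CUDA device; with BF16, as studied in the BFLOAT16 papers above, the loss scaler is usually unnecessary).

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                 # loss scaling from "Mixed Precision Training"

for _ in range(3):
    x = torch.randn(8, 4096, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()                # compute-heavy ops run in FP16
    optimizer.zero_grad()
    scaler.scale(loss).backward()                    # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                           # unscales, skips the step on inf/nan gradients
    scaler.update()
```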

Scaling Models

  • Learning to Grow Pretrained Models for Efficient Transformer Training,  [Paper] [Code] ICLR, 2023

  • 2x Faster Language Model Pre-training via Masked Structural Growth,  [Paper] arXiv, 2023

  • Reusing Pretrained Models by Multi-linear Operators for Efficient Training,  [Paper] NeurIPS, 2023

  • FLM-101B: An Open LLM and How to Train It with $100K Budget,  [Paper] [Code] arXiv, 2023

  • Knowledge Inheritance for Pre-trained Language Models,  [Paper] [Code] NAACL, 2022

  • Staged Training for Transformer Language Models,  [Paper] [Code] ICML, 2022

Initialization Techniques

  • DeepNet: Scaling Transformers to 1,000 Layers,  [Paper] [Code] arXiv, 2022

  • ZerO Initialization: Initializing Neural Networks with only Zeros and Ones,  [Paper] [Code] TMLR, 2022

  • ReZero is All You Need: Fast Convergence at Large Depth,  [Paper] [Code] UAI, 2021

  • Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks,  [Paper] NeurIPS, 2020

  • Improving Transformer Optimization Through Better Initialization,  [Paper] [Code] ICML, 2020

  • Fixup Initialization: Residual Learning without Normalization,  [Paper] ICLR, 2019

  • On Weight Initialization in Deep Neural Networks,  [Paper] arXiv, 2017

Optimization Strategies

  • Symbolic Discovery of Optimization Algorithms,  [Paper] arXiv, 2023

  • Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training,  [Paper] [Code] arXiv, 2023

Efficient Fine-Tuning

PEFT: Adapter-based Tuning

  • OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models,  [Paper] [Code] ACL Demo, 2023

  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models,  [Paper] [Code] EMNLP, 2023

  • Compacter: Efficient Low-Rank Hypercomplex Adapter Layers,  [Paper] [Code] NeurIPS, 2021

  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning,  [Paper] [Code] NeurIPS, 2022

  • Meta-Adapters: Parameter Efficient Few-shot Fine-tuning through Meta-Learning,  [Paper] AutoML, 2022

  • AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning,  [Paper] [Code] EMNLP, 2022

  • SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters,  [Paper] [Code] EMNLP, 2022

PEFT: Low-Rank Adaptation

  • LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning,  [Paper] arXiv, 2023

  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition,  [Paper] [Code] arXiv, 2023

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models,  [Paper] [Code] arXiv, 2023

  • Multi-Head Adapter Routing for Cross-Task Generalization,  [Paper] [Code] NeurIPS, 2023

  • Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning,  [Paper] ICLR, 2023

  • DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation,  [Paper] [Code] EACL, 2023

  • Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying,  [Paper] arXiv, 2023

  • LoRA: Low-Rank Adaptation of Large Language Models,  [Paper] [Code] ICLR, 2022
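
The common core of these methods is the LoRA update itself: freeze the pretrained weight and learn a low-rank correction ΔW = B·A, scaled by α/r. A minimal PyTorch wrapper (a sketch, not any particular repo's implementation) follows.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: y = x W^T + (alpha/r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                                  # 2 * r * 4096 instead of 4096 * 4096
```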

Prefix Tuning

  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention,  [Paper] [Code] arXiv, 2023

  • Prefix-Tuning: Optimizing Continuous Prompts for Generation,  [Paper] [Code] ACL, 2021

Prompt Tuning

  • Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt,  [Paper] arXiv, 2023

  • GPT Understands, Too,  [Paper] [Code] AI Open, 2023

  • Multi-Task Pre-Training of Modular Prompt for Few-Shot Learning,  [Paper] [Code] ACL, 2023

  • Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning,  [Paper] ICLR, 2023

  • PPT: Pre-trained Prompt Tuning for Few-shot Learning,  [Paper] [Code] ACL, 2022

  • Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers,  [Paper] [Code] EMNLP-Findings, 2022

  • P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks, [Paper] [Code] ACL-Short, 2022

  • The Power of Scale for Parameter-Efficient Prompt Tuning,  [Paper] EMNLP, 2021
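
Prompt tuning reduces adaptation to learning a handful of continuous "soft prompt" vectors prepended to the input embeddings while the backbone stays frozen. A minimal sketch of that mechanism follows (the dimensions and embedding interface are placeholders, not tied to a specific model). Prefix-Tuning, listed above, differs in that the learned vectors are injected as key/value prefixes at every attention layer rather than only at the input.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt embeddings prepended to the frozen model's input embeddings."""

    def __init__(self, n_tokens: int = 20, d_model: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq, d_model] produced by the frozen embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)           # [batch, n_tokens + seq, d_model]

soft = SoftPrompt(n_tokens=20, d_model=4096)
print(soft(torch.randn(2, 128, 4096)).shape)                      # torch.Size([2, 148, 4096])
```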

Memory-Efficient Fine-Tuning

  • Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model,  [Paper] [Code] NeurIPS, 2023

  • Memory-Efficient Selective Fine-Tuning,  [Paper] ICML Workshop, 2023

  • Full Parameter Fine-tuning for Large Language Models with Limited Resources,  [Paper] [Code] arXiv, 2023

  • Fine-Tuning Language Models with Just Forward Passes,  [Paper] [Code] NeurIPS, 2023

  • Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization,  [Paper] NeurIPS, 2023

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,  [Paper] [Code] arXiv, 2023

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models,  [Paper] [Code] arXiv, 2023

  • QLoRA: Efficient Finetuning of Quantized LLMs,  [Paper] [Code1] [Code2] NeurIPS, 2023

Efficient Inference

Speculative Decoding

  • PaSS: Parallel Speculative Sampling,  [Paper] NeurIPS Workshop, 2023

  • Accelerating Transformer Inference for Translation via Parallel Decoding,  [Paper] [Code] ACL, 2023

  • Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads,  [Blog] [Code] Blog, 2023

  • Fast Inference from Transformers via Speculative Decoding,  [Paper] ICML, 2023

  • Accelerating LLM Inference with Staged Speculative Decoding,  [Paper] ICML Workshop, 2023

  • Accelerating Large Language Model Decoding with Speculative Sampling,  [Paper] arXiv, 2023

  • Speculative Decoding with Big Little Decoder,  [Paper] [Code] NeurIPS, 2023

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification,  [Paper] [Code] arXiv, 2023

  • Inference with Reference: Lossless Acceleration of Large Language Models,  [Paper] [Code] arXiv, 2023
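
The shared recipe behind these papers: a cheap draft model proposes several tokens, and the large target model verifies them in a single forward pass, so a burst of accepted tokens costs roughly one target-model call. The sketch below uses greedy agreement for verification to stay short; the speculative-sampling papers above instead accept tokens with a rejection test that provably preserves the target distribution. `draft_model` and `target_model` are assumed to be callables mapping token ids to logits of shape [batch, seq, vocab].

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, k: int = 4):
    """One greedy speculative-decoding step (batch size 1 assumed); returns the extended prefix."""
    draft = prefix
    for _ in range(k):                                            # k cheap autoregressive draft steps
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # One expensive pass: target predictions for every proposed position, plus one bonus position.
    target_pred = target_model(draft)[:, prefix.size(1) - 1:].argmax(-1)
    proposed = draft[:, prefix.size(1):]

    n_accept = 0
    while n_accept < k and proposed[0, n_accept].item() == target_pred[0, n_accept].item():
        n_accept += 1                                             # keep tokens the target agrees with

    bonus = target_pred[:, n_accept:n_accept + 1]                 # target's own token after the accepted run
    return torch.cat([prefix, proposed[:, :n_accept], bonus], dim=1)
```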

KV-Cache Optimization

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,  [Paper] arXiv, 2023

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference,  [Paper] arXiv, 2023

  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models,  [Paper] NeurIPS, 2023

  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time,  [Paper] NeurIPS, 2023

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers,  [Paper] arXiv, 2023
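
These works all manipulate the key/value (KV) cache that autoregressive decoding maintains: each step appends one K/V pair and attends over the whole cache, so memory grows linearly with generated length, which is exactly what H2O and Scissorhands prune. A single-head sketch of the cache mechanics:

```python
import torch

def attend_with_cache(q, k_new, v_new, cache):
    """One single-head decoding step that reuses cached keys/values of earlier tokens.

    q, k_new, v_new: [batch, 1, d] for the current token. The papers above additionally
    evict low-importance entries to bound the cache's memory; this sketch just grows it.
    """
    if cache["k"] is not None:
        k = torch.cat([cache["k"], k_new], dim=1)
        v = torch.cat([cache["v"], v_new], dim=1)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v                                 # grows by one entry per step

    scores = (q @ k.transpose(1, 2)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

cache = {"k": None, "v": None}
for _ in range(5):                                                # 5 decoding steps
    q = k = v = torch.randn(1, 1, 64)
    out = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)                                           # torch.Size([1, 5, 64])
```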

Efficient Architecture

Sharing-based Attention

  • GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,  [Paper] EMNLP, 2023

  • Fast Transformer Decoding: One Write-Head is All You Need,  [Paper] arXiv, 2019
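
Both papers shrink the KV cache by letting many query heads share a few key/value heads: multi-query attention is the n_kv_heads = 1 extreme, and grouped-query attention interpolates between that and full multi-head attention. A shape-level sketch:

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """Grouped-query attention: n_q_heads query heads share n_kv_heads K/V heads.

    q: [batch, n_q_heads, seq, d]; k, v: [batch, n_kv_heads, seq, d].
    n_kv_heads == 1 recovers multi-query attention; == n_q_heads recovers standard MHA.
    """
    group = q.size(1) // n_kv_heads
    k = k.repeat_interleave(group, dim=1)            # each K/V head serves a group of query heads
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)                        # 8 query heads
k = v = torch.randn(1, 2, 16, 64)                    # 2 shared K/V heads
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)   # torch.Size([1, 8, 16, 64])
```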

Feature Information Reduction

  • Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention,  [Paper] [Code] AAAI, 2021

  • Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing,  [Paper] [Code] NeurIPS, 2020

  • Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks,  [Paper] ICML, 2019

Kernelization or Low-Rank

  • Sumformer: Universal Approximation for Efficient Transformers,  [Paper] ICML Workshop, 2023

  • FLuRKA: Fast fused Low-Rank & Kernel Attention,  [Paper] arXiv, 2023

  • Scatterbrain: Unifying Sparse and Low-rank Attention,  [Paper] [Code] NeurIPS, 2021

  • Rethinking Attention with Performers,  [Paper] [Code] ICLR, 2021

  • Random Feature Attention,  [Paper] ICLR, 2021

  • Linformer: Self-Attention with Linear Complexity,  [Paper] [Code] arXiv, 2020

  • Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer,  [Paper] ICASSP, 2020

  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention,  [Paper] [Code] ICML, 2020
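
The kernelized line of work replaces softmax(QKᵀ)V with a feature map φ so that φ(Q)(φ(K)ᵀV) can be computed in O(N·d²) instead of O(N²·d). Below is the non-causal form with the φ(x) = elu(x) + 1 feature map from "Transformers are RNNs"; the causal variant maintains running sums instead of attending to the full sequence.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Non-causal linear attention with the feature map phi(x) = elu(x) + 1.

    q, k, v: [batch, seq, d]. Cost is O(seq * d^2) rather than O(seq^2 * d).
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                       # sum_n phi(k_n) v_n^T
    normalizer = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)).clamp(min=1e-6)
    return torch.einsum("bnd,bde->bne", q, kv) / normalizer.unsqueeze(-1)

q, k, v = (torch.randn(1, 1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)                            # torch.Size([1, 1024, 64])
```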

Fixed Pattern Strategies

  • Faster Causal Attention Over Large Sequences Through Sparse Flash Attention,  [Paper] ICML Workshop, 2023

  • Poolingformer: Long Document Modeling with Pooling Attention,  [Paper] ICML, 2021

  • Big Bird: Transformers for Longer Sequences,  [Paper] [Code] NeurIPS, 2020

  • Longformer: The Long-Document Transformer,  [Paper] [Code] arXiv, 2020

  • Blockwise Self-Attention for Long Document Understanding,  [Paper] [Code] EMNLP, 2020

  • Generating Long Sequences with Sparse Transformers,  [Paper] arXiv, 2019

Learnable Pattern Strategies

  • HyperAttention: Long-context Attention in Near-Linear Time,  [Paper] [Code] arXiv, 2023

  • ClusterFormer: Neural Clustering Attention for Efficient and Effective Transformer,  [Paper] ACL, 2022

  • Reformer: The Efficient Transformer,  [Paper] [Code] ICLR, 2020

  • Sparse Sinkhorn Attention,  [Paper] ICML, 2020

  • Fast Transformers with Clustered Attention,  [Paper] [Code] NeurIPS, 2020

  • Efficient Content-Based Sparse Attention with Routing Transformers,  [Paper] [Code] TACL, 2020

Mixture of Experts

MoE-based LLMs

  • Mistral 7B,  [Paper] [Code] arXiv, 2023

  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing,  [Paper] arXiv, 2023

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,  [Paper] [Code] JMLR, 2022

  • Efficient Large Scale Language Modeling with Mixtures of Experts,  [Paper] [Code] EMNLP, 2022

  • BASE Layers: Simplifying Training of Large, Sparse Models,  [Paper] [Code] ICML, 2021

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,  [Paper] ICLR, 2021
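
In all of these models the dense FFN is replaced by a sparsely activated mixture-of-experts layer: a router scores experts per token and only the top-k are executed. The sketch below shows token-level top-k routing, without the load-balancing losses and capacity limits that systems such as Switch Transformers and GShard add.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Token-level top-k routing over a set of expert FFNs (sketch of a sparse MoE layer)."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                         # x: [tokens, d_model]
        gate = self.router(x).softmax(dim=-1)
        weight, idx = gate.topk(self.k, dim=-1)                   # each token picks k experts
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)         # tokens routed to expert e
            if tok.numel():
                out[tok] += weight[tok, slot, None] * expert(x[tok])
        return out

print(TopKMoE()(torch.randn(10, 512)).shape)                      # torch.Size([10, 512])
```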

Algorithm-Level MoE Optimization

  • Lifelong Language Pretraining with Distribution-Specialized Experts,  [Paper] ICML, 2023

  • Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models,  [Paper] arXiv, 2023

  • Mixture-of-Experts with Expert Choice Routing,  [Paper] NeurIPS, 2022

  • StableMoE: Stable Routing Strategy for Mixture of Experts,  [Paper] [Code] ACL, 2022

  • On the Representation Collapse of Sparse Mixture of Experts,  [Paper] NeurIPS, 2022

Long Context LLMs: Extrapolation and Interpolation

  • Scaling Laws of RoPE-based Extrapolation,  [Paper] arXiv, 2023

  • A Length-Extrapolatable Transformer,  [Paper] [Code] ACL, 2023

  • Extending Context Window of Large Language Models via Positional Interpolation,  [Paper] arXiv, 2023

  • NTK Interpolation,  [Reddit post] Blog, 2023

  • YaRN: Efficient Context Window Extension of Large Language Models,  [Paper] [Code] arXiv, 2023

  • CLEX: Continuous Length Extrapolation for Large Language Models,  [Paper] [Code] arXiv, 2023

  • PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training,  [Paper] [Code] arXiv, 2023

  • Functional Interpolation for Relative Positions Improves Long Context Transformers,  [Paper] arXiv, 2023

  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,  [Paper] [Code] ICLR, 2022

  • Exploring Length Generalization in Large Language Models,  [Paper] NeurIPS, 2022

  • The EOS Decision and Length Extrapolation,  [Paper] [Code] EMNLP, 2020
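
Many of these methods modify rotary position embeddings (RoPE) so a model trained on short contexts behaves sensibly at longer ones. The simplest variant, linear Positional Interpolation, rescales positions by train_len / target_len before computing RoPE angles; NTK Interpolation and YaRN instead rescale the frequency base or interpolate per frequency band. A sketch of the linear version:

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotary embedding angles; scale < 1 implements linear Positional Interpolation."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float() * scale, inv_freq)       # [seq, dim/2]

train_len, target_len = 2048, 8192
angles = rope_angles(torch.arange(target_len), dim=128, scale=train_len / target_len)
print(angles.max())    # largest rotary angle stays within the range seen during training at train_len
```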

Long Context LLMs: Recurrent Structure

  • Retentive Network: A Successor to Transformer for Large Language Models,  [Paper] [Code] arXiv, 2023

  • Recurrent Memory Transformer,  [Paper] [Code] NeurIPS, 2022

  • Block-Recurrent Transformers,  [Paper] [Code] NeurIPS, 2022

  • ∞-former: Infinite Memory Transformer,  [Paper] [Code] ACL, 2022

  • Memformer: A Memory-Augmented Transformer for Sequence Modeling,  [Paper] [Code] AACL-Findings, 2020

  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,  [Paper] [Code] ACL, 2019

Long Context LLMs: Segmentation and Sliding Window

  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, [Paper] arXiv, 2024

  • Extending Context Window of Large Language Models via Semantic Compression,  [Paper] arXiv, 2023

  • Efficient Streaming Language Models with Attention Sinks,  [Paper] [Code] arXiv, 2023

  • Parallel Context Windows for Large Language Models,  [Paper] [Code] ACL, 2023

  • LongNet: Scaling Transformers to 1,000,000,000 Tokens,  [Paper] [Code] arXiv, 2023

  • Efficient Long-Text Understanding with Short-Text Models,  [Paper] [Code] TACL, 2023
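
A recurring pattern here is to attend only to a recent window plus a few initial "attention sink" tokens, as in Efficient Streaming Language Models with Attention Sinks. The mask below captures that pattern; real implementations also rewrite the KV cache so evicted positions actually free memory.

```python
import torch

def sink_plus_window_mask(seq_len: int, window: int = 4, n_sink: int = 2) -> torch.Tensor:
    """Boolean attention mask keeping a few initial 'sink' tokens plus a recent window (True = attend)."""
    i = torch.arange(seq_len).unsqueeze(1)           # query positions
    j = torch.arange(seq_len).unsqueeze(0)           # key positions
    causal = j <= i
    recent = (i - j) < window
    sink = j < n_sink
    return causal & (recent | sink)

print(sink_plus_window_mask(8, window=3, n_sink=1).int())
```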

Long Context LLMs: Memory-Retrieval Augmentation

  • Landmark Attention: Random-Access Infinite Context Length for Transformers,  [Paper] [Code] arXiv, 2023

  • Augmenting Language Models with Long-Term Memory,  [Paper] NeurIPS, 2023

  • Unlimiformer: Long-Range Transformers with Unlimited Length Input,  [Paper] [Code] NeurIPS, 2023

  • Focused Transformer: Contrastive Training for Context Scaling,  [Paper] [Code] NeurIPS, 2023

  • Retrieval meets Long Context Large Language Models,  [Paper] arXiv, 2023

  • Memorizing Transformers,  [Paper] [Code] ICLR, 2022

Transformer Alternative Architecture

State Space Models

  • Sparse Modular Activation for Efficient Sequence Modeling,  [Paper] [Code] NeurIPS, 2023

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces,  [Paper] [Code] arXiv, 2023

  • Hungry Hungry Hippos: Towards Language Modeling with State Space Models,  [Paper] [Code] ICLR, 2023

  • Long Range Language Modeling via Gated State Spaces,  [Paper] ICLR, 2023

  • Block-State Transformers,  [Paper] NeurIPS, 2023

  • Efficiently Modeling Long Sequences with Structured State Spaces,  [Paper] [Code] ICLR, 2022

  • Diagonal State Spaces are as Effective as Structured State Spaces,  [Paper] [Code] NeurIPS, 2022

Other Sequential Models

  • PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation,  [Paper] arXiv, 2023

  • RWKV: Reinventing RNNs for the Transformer Era,  [Paper] EMNLP-Findings, 2023

  • Hyena Hierarchy: Towards Larger Convolutional Language Models,  [Paper] arXiv, 2023

  • MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers,  [Paper] arXiv, 2023

Data-Centric Methods

Data Selection for Efficient Pre-Training

  • Data Selection for Language Models via Importance Resampling,  [Paper] [Code] NeurIPS, 2023

  • NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework,  [Paper] [Code] ICML, 2022

  • Span Selection Pre-training for Question Answering,  [Paper] [Code] ACL, 2020

Data Selection for Efficient Fine-Tuning

  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning,  [Paper] [Code] arXiv, 2023

  • One Shot Learning as Instruction Data Prospector for Large Language Models,  [Paper] arXiv, 2023

  • MoDS: Model-oriented Data Selection for Instruction Tuning,  [Paper] [Code] arXiv, 2023

  • Instruction Mining: When Data Mining Meets Large Language Model Finetuning,  [Paper] arXiv, 2023

  • Data-Efficient Finetuning Using Cross-Task Nearest Neighbors,  [Paper] [Code] ACL, 2023

  • Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values,  [Paper] [Code] ACL SRW, 2023

  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning,  [Paper] arXiv, 2023

  • AlpaGasus: Training A Better Alpaca with Fewer Data,  [Paper] [Code] arXiv, 2023

  • LIMA: Less Is More for Alignment,  [Paper] arXiv, 2023

Prompt Engineering

Demonstration Selection

  • Unified Demonstration Retriever for In-Context Learning,  [Paper] [Code] ACL, 2023

  • Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning,  [Paper] [Code] NeurIPS, 2023

  • In-Context Learning with Iterative Demonstration Selection,  [Paper] arXiv, 2023

  • Dr.ICL: Demonstration-Retrieved In-context Learning,  [Paper] arXiv, 2023

  • Learning to Retrieve In-Context Examples for Large Language Models,  [Paper] arXiv, 2023

  • Finding Supporting Examples for In-Context Learning,  [Paper] arXiv, 2023

  • Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering,  [Paper] [Code] ACL, 2023

  • Selective Annotation Makes Language Models Better Few-Shot Learners,  [Paper] [Code] ICLR, 2023

  • What Makes Good In-Context Examples for GPT-3?  [Paper] DeeLIO, 2022

  • Learning To Retrieve Prompts for In-Context Learning,  [Paper] [Code] NAACL-HLT, 2022

  • Active Example Selection for In-Context Learning,  [Paper] [Code] EMNLP, 2022

  • Rethinking the Role of Demonstrations: What makes In-context Learning Work?  [Paper] [Code] EMNLP, 2022

Demonstration Ordering

  • Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,  [Paper] ACL, 2022

Instruction Generation

  • Large Language Models as Optimizers,  [Paper] arXiv, 2023

  • Instruction Induction: From Few Examples to Natural Language Task Descriptions,  [Paper] [Code] ACL, 2023

  • Large Language Models Are Human-Level Prompt Engineers,  [Paper] [Code] ICLR, 2023

  • TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design,  [Paper] arXiv, 2023

  • Self-Instruct: Aligning Language Model with Self Generated Instructions,  [Paper] [Code] ACL, 2023

Multi-Step Reasoning

  • Automatic Chain of Thought Prompting in Large Language Models,  [Paper] [Code] ICLR, 2023

  • Measuring and Narrowing the Compositionality Gap in Language Models,  [Paper] [Code] EMNLP, 2023

  • ReAct: Synergizing Reasoning and Acting in Language Models,  [Paper] [Code] ICLR, 2023

  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,  [Paper] ICLR, 2023

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models,  [Paper] [Code] arXiv, 2023

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models,  [Paper] [Code] NeurIPS, 2023

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models,  [Paper] ICLR, 2023

  • Contrastive Chain-of-Thought Prompting,  [Paper] [Code] arXiv, 2023

  • Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation,  [Paper] arXiv, 2023

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,  [Paper] NeurIPS, 2022
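
Several of these prompting schemes compose directly. As one concrete example, self-consistency samples multiple chain-of-thought completions and majority-votes their final answers; the `generate` callable below is a placeholder for any temperature-sampled LLM call that returns an answer string, not a specific API.

```python
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Sample several chain-of-thought completions and return the majority-vote answer."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [generate(prompt) for _ in range(n_samples)]        # each call samples a new reasoning path
    return Counter(answers).most_common(1)[0][0]
```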

Parallel Generation

  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding,  [Paper] [Code] arXiv, 2023

Prompt Compression

  • Learning to Compress Prompts with Gist Tokens,  [Paper] arXiv, 2023

  • Adapting Language Models to Compress Contexts,  [Paper] [Code] EMNLP, 2023

  • In-context Autoencoder for Context Compression in a Large Language Model,  [Paper] [Code] arXiv, 2023

  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression,  [Paper] [Code] arXiv, 2023

  • Discrete Prompt Compression with Reinforcement Learning,  [Paper] arXiv, 2023

  • Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models,  [Paper] arXiv, 2023

Prompt Generation

  • TempLM: Distilling Language Models into Template-Based Generators,  [Paper] [Code] arXiv, 2022

  • PromptGen: Automatically Generate Prompts using Generative Models,  [Paper] NAACL Findings, 2022

  • AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts,  [Paper] [Code] EMNLP, 2020

System-Level Efficiency Optimization and LLM Frameworks

System-Level Pre-Training Efficiency Optimization

  • CoLLiE: Collaborative Training of Large Language Models in an Efficient Way,  [Paper] [Code] EMNLP, 2023

  • An Efficient 2D Method for Training Super-Large Deep Learning Models,  [Paper] [Code] IPDPS, 2023

  • PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,  [Paper] VLDB, 2023

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training,  [Paper] [Code] NSDI, 2023

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates,  [Paper] [Code] SOSP, 2023

  • Varuna: Scalable, Low-cost Training of Massive Deep Learning Models,  [Paper] [Code] EuroSys, 2022

  • Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization,  [Paper] [Code] OSDI, 2022

  • Tesseract: Parallelize the Tensor Parallelism Efficiently,  [Paper] ICPP, 2022

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning,  [Paper] [Code] OSDI, 2022

  • Maximizing Parallelism in Distributed Training for Huge Neural Networks,  [Paper] arXiv, 2021

  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,  [Paper] [Code] SC, 2021

  • ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning,  [Paper] SC, 2021

  • ZeRO-Offload: Democratizing Billion-Scale Model Training,  [Paper] [Code] USENIX ATC, 2021

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,  [Paper] [Code] SC, 2020

System-Level Inference Efficiency Optimization

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory,  [Paper] arXiv, 2023

  • SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision,  [Paper] EMNLP, 2023

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,  [Paper] [Code] ICML, 2023

  • Flash-Decoding for Long-Context Inference,  [Blog] Blog, 2023

  • FlashDecoding++: Faster Large Language Model Inference on GPUs,  [Paper] arXiv, 2023

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time,  [Paper] ICML, 2023

  • Efficiently Scaling Transformer Inference,  [Paper] MLSys, 2023

  • S3: Increasing GPU Utilization during Generative Inference for Higher Throughput,  [Paper] arXiv, 2023

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale,  [Paper] SC, 2022

System-Level Serving Efficiency Optimization

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU,  [Paper] [Code] arXiv, 2023

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters,  [Paper] [Code] arXiv, 2023

  • Efficient Memory Management for Large Language Model Serving with PagedAttention,  [Paper] [Code] SOSP, 2023

  • Orca: A Distributed Serving System for Transformer-Based Generative Models,  [Paper] OSDI, 2022

  • Fast Distributed Inference Serving for Large Language Models,  [Paper] arXiv, 2023

  • Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models,  [Paper] arXiv, 2023

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances,  [Paper] arXiv, 2023

  • TurboTransformers: an efficient GPU serving system for transformer models,  [Paper] PPoPP, 2021

System-Level Efficient Architecture Optimization

System-Level Attention Optimization

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,  [Paper] [Code] arXiv, 2023

  • Efficient Memory Management for Large Language Model Serving with PagedAttention,  [Paper] [Code] SOSP, 2023

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,  [Paper] [Code] NeurIPS, 2022

  • Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server,  [Blog] Nvidia Blog, 2022

  • ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks,  [Paper] ISCA, 2021

  • A3: Accelerating Attention Mechanisms in Neural Networks with Approximation,  [Paper] HPCA, 2020

System-Level MoE Optimization

  • Tutel: Adaptive mixture-of-experts at scale,  [Paper] [Code] MLSys, 2023

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts,  [Paper] [Code] MLSys, 2023

  • SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization,  [Paper] USENIX ATC, 2023

  • MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism,  [Paper] [Code] IPDPS, 2023

  • EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models,  [Paper] arXiv, 2023

  • TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training,  [Paper] [Code] NeurIPS, 2022

  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale,  [Paper] [Code] ICML, 2022

  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models,  [Paper] [Code] PPoPP, 2022

  • FastMoE: A Fast Mixture-of-Expert Training System,  [Paper] [Code] arXiv, 2021