Daniel Du

LLaMA: Open and Efficient Foundation Language Models

Exploring the Architecture, Training Data, and Training Process of LLaMA

The LLaMA paper gives a detailed account of the model, covering its architecture, the data it was trained on, and the training process. Here’s a breakdown:

Model Architecture

The LLaMA model is built on the transformer architecture, like other large language models, and incorporates several modifications that have been shown to improve performance and training stability (a minimal code sketch of these components follows the list):

  • Pre-normalization: Instead of normalizing the output of each transformer sub-layer, LLaMA normalizes the input, a technique inspired by GPT-3 that improves training stability. The normalizing function used is RMSNorm.
  • SwiGLU Activation Function: LLaMA replaces the standard ReLU non-linearity with the SwiGLU activation function. This change was inspired by the PaLM model and has been shown to boost performance.
  • Rotary Embeddings: Unlike the original transformer architecture, LLaMA does not use absolute positional embeddings. Instead, it applies rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network, following GPTNeo.

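Below is a minimal PyTorch sketch of these three components. It is an illustration under simplifying assumptions (the module names, the example dimensions, and the "rotate-half" RoPE variant are mine), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-normalization applied to the *input* of each sub-layer."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features (no mean-centering).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation instead of ReLU."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) * (x W_up), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to a (batch, seq, heads, head_dim) tensor by rotating
    feature pairs through position-dependent angles ("rotate-half" variant)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)          # (batch, seq, dim)
    h = RMSNorm(512)(x)                   # pre-norm before a sub-layer
    h = SwiGLUFeedForward(512, 1376)(h)   # the paper uses a hidden size of 2/3 * 4d
    q = torch.randn(2, 16, 8, 64)         # (batch, seq, heads, head_dim)
    q = rotary_embedding(q)               # RoPE applied to queries/keys
    print(h.shape, q.shape)
```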
Table 2 in the paper outlines the specific model sizes, architectures, and optimization hyperparameters for the different LLaMA models.

Training Data

A key differentiating factor of LLaMA is its use of publicly available data. This contrasts with many other LLMs that rely on proprietary and inaccessible datasets. The training data for LLaMA is a mixture of text from several sources, ensuring a diverse range of domains and writing styles:

  • English CommonCrawl: This makes up the largest portion of the training data (67%) and was preprocessed using the CCNet pipeline to remove duplicates, non-English pages, and low-quality content.
  • C4: Another preprocessed CommonCrawl dataset (15%) known for its diversity.
  • GitHub: Public code repositories under specific licenses (4.5%) were filtered for quality and deduplicated.
  • Wikipedia: Dumps from mid-2022 covering 20 languages using Latin or Cyrillic scripts (4.5%) were cleaned of formatting and extraneous content.
  • Gutenberg and Books3: Public domain books from Project Gutenberg and the Books3 section of The Pile dataset (4.5%) were deduplicated based on content overlap.
  • ArXiv: Scientific papers in LaTeX format (2.5%) were processed to remove non-content sections and boilerplate.
  • Stack Exchange: Question and answer data from 28 of the largest Stack Exchange websites (2%) was cleaned and sorted by answer score.

The entire training dataset comprises roughly 1.4 trillion tokens after tokenization using byte-pair encoding (BPE). Most tokens are used only once during training, with the exception of Wikipedia and book data, which undergo approximately two epochs.
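To make the mixture concrete, here is a small, self-contained sketch of weighted sampling over the sources using the proportions listed above. The source names and the sampling routine are illustrative placeholders, not the authors' actual data pipeline.

```python
import random

# Sampling proportions from the LLaMA data mixture described above.
# The keys are hypothetical labels, not the real dataset loaders.
MIXTURE = {
    "common_crawl": 0.67,
    "c4": 0.15,
    "github": 0.045,
    "wikipedia": 0.045,
    "books": 0.045,
    "arxiv": 0.025,
    "stackexchange": 0.02,
}


def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the
    mixture weights (weighted sampling with replacement)."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in MIXTURE}
    for _ in range(100_000):
        counts[sample_source(rng)] += 1
    # Empirical frequencies should roughly match the mixture proportions.
    for name, c in counts.items():
        print(f"{name:>14}: {c / 100_000:.3f}")
```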

Training Process

LLaMA’s training approach is similar to that of other large language models and draws on the Chinchilla scaling laws. Some noteworthy aspects of the training process (sketched in code after this list) include:

  • Optimizer: The AdamW optimizer is used with specific hyperparameters, including a cosine learning rate schedule and weight decay.
  • Efficient Implementation: Several optimizations were implemented to enhance training speed:
    • An efficient implementation of causal multi-head attention reduces memory and runtime.
    • Activation checkpointing saves the outputs of expensive operations (such as linear layers), reducing the amount of activation recomputation during the backward pass.
    • Model and sequence parallelism reduce memory usage.
    • Computation and communication between GPUs are overlapped for efficiency.
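The sketch below shows, in PyTorch, roughly what these choices look like: AdamW with a cosine schedule, a fused causal attention kernel, and activation checkpointing. The hyperparameter values and module shapes are placeholders (the paper's per-model settings are in its Table 2), and the paper's checkpointing is more selective than the generic version shown here, since it keeps the expensive linear-layer outputs so that less work is recomputed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def make_optimizer(model: nn.Module, max_lr: float, total_steps: int):
    """AdamW with weight decay plus a cosine learning-rate schedule.
    The values here are placeholders, not the paper's per-model settings."""
    opt = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched


class CheckpointedAttentionBlock(nn.Module):
    """Causal self-attention using a memory-efficient fused kernel,
    wrapped in activation checkpointing to trade compute for memory."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def _attn(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for y in (q, k, v))
        # Fused scaled-dot-product attention; is_causal=True applies the causal
        # mask without materializing the full attention matrix.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside _attn are not stored; they are recomputed
        # during the backward pass, reducing peak memory.
        return x + checkpoint(self._attn, x, use_reentrant=False)


if __name__ == "__main__":
    block = CheckpointedAttentionBlock(dim=512, n_heads=8)
    opt, sched = make_optimizer(block, max_lr=3e-4, total_steps=10_000)
    x = torch.randn(2, 16, 512, requires_grad=True)
    loss = block(x).pow(2).mean()
    loss.backward()
    opt.step()
    sched.step()
```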

These optimizations yield a training speed of approximately 380 tokens per second per GPU when training on 2048 A100 80GB GPUs, which means processing the full 1.4-trillion-token dataset for the 65B-parameter model takes roughly 21 days.
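As a quick sanity check, those figures are consistent with the reported token count:

$$
\frac{1.4 \times 10^{12}\ \text{tokens}}{380\ \text{tokens/s/GPU} \times 2048\ \text{GPUs}} \approx 1.8 \times 10^{6}\ \text{s} \approx 21\ \text{days}.
$$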

Figure 1 in the paper shows training loss as a function of training tokens for the different LLaMA models. Notably, the loss of the 7B model was still decreasing after 1 trillion tokens. This supports LLaMA’s central design choice: for a given inference budget, training a smaller model on more data can be preferable to simply scaling up model size.
