ā±ļø 85 min

BERT, GPT & Fine-tuning Pre-trained Models

Leverage pre-trained language models for your tasks

Pre-trained Language Models

Modern NLP uses transfer learning: pre-train on massive text corpora, then fine-tune on specific tasks.

**Why Pre-training Works:**

1. **Language Understanding**: Models learn grammar, facts, and reasoning patterns from billions of words
2. **Transfer Learning**: The learned knowledge transfers to downstream tasks
3. **Data Efficiency**: Fine-tuning needs far less labeled data than training from scratch
4. **State-of-the-Art**: Pre-trained models dominate NLP benchmarks

**Two Main Paradigms:**

**BERT (Bidirectional Encoder Representations from Transformers)**
- Encoder-only architecture
- Bidirectional context (attends to both left and right)
- Pre-trained with masked language modeling
- Best for: classification, NER, question answering

**GPT (Generative Pre-trained Transformer)**
- Decoder-only architecture
- Autoregressive (left-to-right)
- Pre-trained with next-token prediction
- Best for: text generation, completion
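
The two pre-training objectives are easy to see side by side with Hugging Face pipelines: BERT fills in a masked token using context from both directions, while GPT-2 continues a prompt left to right. A minimal sketch (the `fill-mask` and `text-generation` pipelines and checkpoint names are standard Hugging Face usage; the example sentences are illustrative):

python
from transformers import pipeline

# BERT-style objective: predict a masked token from bidirectional context
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
for pred in fill_mask("The movie was absolutely [MASK].")[:3]:
    print(f"{pred['token_str']!r} - {pred['score']:.2%}")

# GPT-style objective: predict the next token, left to right
generator = pipeline('text-generation', model='gpt2')
print(generator("The movie was absolutely", max_new_tokens=10)[0]['generated_text'])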

Using BERT with Hugging Face

Fine-tune BERT for sentiment analysis:

python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Load pre-trained BERT
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Example text
texts = [
    "This movie was absolutely amazing! I loved every minute.",
    "Terrible film. Complete waste of time.",
    "It was okay, nothing special.",
    "One of the best movies I've ever seen!",
]
labels = [1, 0, 0, 1]  # 1 = positive, 0 = negative

# Tokenize
print(f"\nTokenizing {len(texts)} examples...")
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

print(f"Input IDs shape: {encoded['input_ids'].shape}")
print(f"Attention mask shape: {encoded['attention_mask'].shape}")

# Example: Tokenization breakdown
sample_text = texts[0]
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"\nTokenization example:")
print(f"Original: {sample_text}")
print(f"Tokens: {tokens[:10]}...")
print(f"Token IDs: {token_ids[:10]}...")

# Create dataset
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

dataset = SentimentDataset(encoded, labels)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

print(f"\nāœ“ Model ready for fine-tuning!")

# Inference example (the classification head is freshly initialized, so predictions are near chance)
model.eval()
with torch.no_grad():
    outputs = model(**encoded)
    predictions = torch.argmax(outputs.logits, dim=-1)
    probabilities = torch.softmax(outputs.logits, dim=-1)

print(f"\nPredictions (before fine-tuning):")
for i, text in enumerate(texts):
    pred_label = "Positive" if predictions[i] == 1 else "Negative"
    confidence = probabilities[i][predictions[i]].item()
    print(f"  {text[:50]}...")
    print(f"  → {pred_label} (confidence: {confidence:.2%})")
Output:
Loaded model: bert-base-uncased
Model parameters: 109,483,778

Tokenizing 4 examples...
Input IDs shape: torch.Size([4, 14])
Attention mask shape: torch.Size([4, 14])

Tokenization example:
Original: This movie was absolutely amazing! I loved every minute.
Tokens: ['this', 'movie', 'was', 'absolutely', 'amazing', '!', 'i', 'loved', 'every', 'minute']...
Token IDs: [2023, 3185, 2001, 7078, 6429, 999, 1045, 3866, 2296, 3371]...

✓ Model ready for fine-tuning!

Predictions (before fine-tuning):
  This movie was absolutely amazing! I loved every...
  → Positive (confidence: 53.42%)
  Terrible film. Complete waste of time....
  → Negative (confidence: 51.23%)
  It was okay, nothing special....
  → Negative (confidence: 52.78%)
  One of the best movies I've ever seen!...
  → Positive (confidence: 54.91%)
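
With the trainer configured, fine-tuning is a single call. A minimal sketch continuing from the code above (the four-sentence dataset is only a toy; real fine-tuning needs a proper train/validation split, and the save path here is illustrative):

python
# Run fine-tuning: updates all BERT weights plus the classification head
trainer.train()

# Persist the fine-tuned model and tokenizer for later inference
model.save_pretrained('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')

# Predict on new text with the fine-tuned model
inputs = tokenizer("A wonderful, heartfelt film.", return_tensors='pt')
with torch.no_grad():
    pred = torch.argmax(model(**inputs).logits, dim=-1).item()
print("Positive" if pred == 1 else "Negative")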

Using GPT for Text Generation

Generate text with GPT-2:

python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pre-trained GPT-2
model_name = 'gpt2'  # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Set to evaluation mode
model.eval()

# Text generation function
def generate_text(prompt, max_length=50, temperature=0.8, top_k=50, top_p=0.95):
    """
    Generate text from a prompt
    
    Args:
        prompt: Starting text
        max_length: Maximum total length in tokens, including the prompt
        temperature: Higher = more random (0.1-2.0)
        top_k: Consider top k tokens
        top_p: Nucleus sampling threshold
    """
    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    # Generate
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
        )
    
    # Decode
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Examples with different prompts
prompts = [
    "Artificial intelligence will",
    "The future of machine learning is",
    "Deep learning models are",
]

print(f"\nGenerating text with GPT-2:\n")

for prompt in prompts:
    print(f"Prompt: '{prompt}'")
    
    # Generate with different temperatures
    for temp in [0.5, 1.0]:
        generated = generate_text(prompt, max_length=40, temperature=temp)
        print(f"  [temp={temp}] {generated}")
    print()

# Example: Get probability distribution for next token
prompt = "Machine learning is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model(input_ids)
    predictions = outputs.logits[0, -1, :]  # Last token predictions
    probabilities = torch.softmax(predictions, dim=-1)

# Get top 10 most likely next tokens
top_probs, top_indices = torch.topk(probabilities, 10)

print(f"\nTop 10 next tokens after '{prompt}':")
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx])
    print(f"  '{token}' - {prob:.2%}")

# Fine-tuning example setup
print(f"\nāœ“ GPT-2 ready for generation!")
print(f"\nTo fine-tune GPT-2 on your data:")
print(f"1. Prepare text dataset")
print(f"2. Tokenize with tokenizer")
print(f"3. Use Trainer with TrainingArguments")
print(f"4. Train on your domain-specific text")
Output:
Loaded model: gpt2
Model parameters: 124,439,808

Generating text with GPT-2:

Prompt: 'Artificial intelligence will'
  [temp=0.5] Artificial intelligence will be able to make decisions based on what we know and what we don't know about the world around us
  [temp=1.0] Artificial intelligence will revolutionize how we interact with technology, enabling machines to understand human emotions and respond accordingly

Prompt: 'The future of machine learning is'
  [temp=0.5] The future of machine learning is bright, with new algorithms and techniques being developed every day
  [temp=1.0] The future of machine learning is unpredictable but exciting, with potential applications in healthcare, finance, and autonomous systems

Prompt: 'Deep learning models are'
  [temp=0.5] Deep learning models are becoming increasingly sophisticated, capable of solving complex problems that were previously impossible
  [temp=1.0] Deep learning models are powerful tools for pattern recognition, but they require massive amounts of data to train effectively

Top 10 next tokens after 'Machine learning is':
  ' a' - 8.32%
  ' the' - 6.75%
  ' an' - 5.91%
  ' one' - 4.23%
  ' becoming' - 3.87%
  ' used' - 3.45%
  ' changing' - 2.98%
  ' revolutionizing' - 2.67%
  ' increasingly' - 2.34%
  ' now' - 2.12%

✓ GPT-2 ready for generation!

To fine-tune GPT-2 on your data:
1. Prepare text dataset
2. Tokenize with tokenizer
3. Use Trainer with TrainingArguments
4. Train on your domain-specific text
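
Those four steps map directly onto the same Trainer API used for BERT above. A minimal sketch, assuming a plain-text file `corpus.txt` of domain-specific text (the file name, block size, and hyperparameters are illustrative, not from the lesson):

python
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
import torch
from torch.utils.data import Dataset

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Step 1: prepare the text dataset by chunking raw text into fixed-size blocks
class BlockDataset(Dataset):
    def __init__(self, path, block_size=128):
        text = open(path, encoding='utf-8').read()
        ids = tokenizer.encode(text)  # Step 2: tokenize with the tokenizer
        self.blocks = [torch.tensor(ids[i:i + block_size])
                       for i in range(0, len(ids) - block_size + 1, block_size)]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        return self.blocks[idx]

# Step 3: use Trainer with TrainingArguments; with mlm=False the collator
# copies the inputs to labels for next-token prediction
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./gpt2-finetuned',
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=BlockDataset('corpus.txt'),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Step 4: train on the domain-specific text
trainer.train()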

Fine-tuning Best Practices

**Choosing the Right Model:**

| Task | Recommended Model | Why |
|------|------------------|-----|
| Classification | BERT, RoBERTa | Bidirectional understanding |
| Named Entity Recognition | BERT, ELECTRA | Token-level predictions |
| Question Answering | BERT, ALBERT | Context understanding |
| Text Generation | GPT-2, GPT-3 | Autoregressive generation |
| Summarization | T5, BART | Seq2seq architecture |
| Translation | mT5, mBART | Multilingual support |

**Fine-tuning Tips** (applied in the code sketch below):

1. **Learning Rate**: Start small (1e-5 to 5e-5)
2. **Epochs**: 2-4 epochs are usually sufficient
3. **Batch Size**: 8-32, depending on GPU memory
4. **Warmup**: Use warmup steps for training stability
5. **Layer Freezing**: Freeze early layers for small datasets
6. **Gradient Clipping**: Clip gradients to prevent them from exploding

**Common Pitfalls:**

- ❌ Too high a learning rate → unstable training
- ❌ Too many epochs → overfitting
- ❌ Wrong model type for the task → poor performance
- ❌ Inconsistent preprocessing between training and inference → degraded accuracy

**Production Considerations:**

- **Model Size**: Use distilled versions (DistilBERT) for speed
- **Inference Speed**: Optimize with ONNX or TensorRT
- **Cost**: Weigh API services (OpenAI, Cohere) against self-hosting
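
A minimal sketch of how the tips above translate into Hugging Face code, reusing the BERT sentiment setup from earlier (the freeze depth and hyperparameter values are illustrative starting points, not tuned results):

python
from transformers import BertForSequenceClassification, TrainingArguments

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tip 5: freeze the embeddings and the first 8 of 12 encoder layers,
# leaving the top 4 layers and the classification head trainable
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,              # tip 1: small learning rate
    num_train_epochs=3,              # tip 2: 2-4 epochs
    per_device_train_batch_size=16,  # tip 3: fit to GPU memory
    warmup_steps=100,                # tip 4: warmup for stability
    max_grad_norm=1.0,               # tip 6: gradient clipping
    weight_decay=0.01,
)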
