Leverage pre-trained language models for your tasks
Modern NLP uses transfer learning: pre-train on massive text, then fine-tune on specific tasks.

**Why Pre-training Works:**

1. **Language Understanding**: Models learn grammar, facts, and reasoning from billions of words
2. **Transfer Learning**: Knowledge transfers to downstream tasks
3. **Data Efficiency**: Fine-tuning needs far less labeled data
4. **State-of-the-Art**: Pre-trained models dominate NLP benchmarks

**Two Main Paradigms:**

**BERT (Bidirectional Encoder Representations from Transformers)**
- Encoder-only architecture
- Bidirectional context (sees left and right)
- Masked language modeling pre-training
- Best for: classification, NER, QA

**GPT (Generative Pre-trained Transformer)**
- Decoder-only architecture
- Autoregressive (left-to-right)
- Next-token prediction pre-training
- Best for: text generation, completion
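Both paradigms can be tried in a few lines with the Hugging Face `pipeline` helper. A minimal sketch (the example sentence is arbitrary):

```python
from transformers import pipeline

# BERT-style objective: predict a masked token using context on both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The movie was absolutely [MASK].")[:3]:
    print(f"{pred['token_str']!r}: {pred['score']:.2%}")

# GPT-style objective: continue the text one token at a time, left to right
generator = pipeline("text-generation", model="gpt2")
print(generator("The movie was absolutely", max_length=15)[0]["generated_text"])
```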
Fine-tune BERT for sentiment analysis:
```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset
# Load pre-trained BERT
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Example text
texts = [
    "This movie was absolutely amazing! I loved every minute.",
    "Terrible film. Complete waste of time.",
    "It was okay, nothing special.",
    "One of the best movies I've ever seen!",
]
labels = [1, 0, 0, 1] # 1 = positive, 0 = negative
# Tokenize
print(f"\nTokenizing {len(texts)} examples...")
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
print(f"Input IDs shape: {encoded['input_ids'].shape}")
print(f"Attention mask shape: {encoded['attention_mask'].shape}")
# Example: Tokenization breakdown
sample_text = texts[0]
tokens = tokenizer.tokenize(sample_text)  # Note: omits the [CLS]/[SEP] special tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"\nTokenization example:")
print(f"Original: {sample_text}")
print(f"Tokens: {tokens[:10]}...")
print(f"Token IDs: {token_ids[:10]}...")
# Create dataset
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
dataset = SentimentDataset(encoded, labels)
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
)
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
print(f"\n✓ Model ready for fine-tuning!")
# Inference example
model.eval()
with torch.no_grad():
    outputs = model(**encoded)
    # The classification head is freshly initialized, so these predictions
    # hover near chance until the model is fine-tuned
    predictions = torch.argmax(outputs.logits, dim=-1)
    probabilities = torch.softmax(outputs.logits, dim=-1)

print(f"\nPredictions (before fine-tuning):")
for i, text in enumerate(texts):
    pred_label = "Positive" if predictions[i] == 1 else "Negative"
    confidence = probabilities[i][predictions[i]].item()
    print(f"  {text[:50]}...")
    print(f"  → {pred_label} (confidence: {confidence:.2%})")
```

Output:

```text
Loaded model: bert-base-uncased
Model parameters: 109,483,778

Tokenizing 4 examples...
Input IDs shape: torch.Size([4, 14])
Attention mask shape: torch.Size([4, 14])

Tokenization example:
Original: This movie was absolutely amazing! I loved every minute.
Tokens: ['this', 'movie', 'was', 'absolutely', 'amazing', '!', 'i', 'loved', 'every', 'minute']...
Token IDs: [2023, 3185, 2001, 7078, 6429, 999, 1045, 3866, 2296, 3371]...

✓ Model ready for fine-tuning!

Predictions (before fine-tuning):
  This movie was absolutely amazing! I loved every...
  → Positive (confidence: 53.42%)
  Terrible film. Complete waste of time....
  → Negative (confidence: 51.23%)
  It was okay, nothing special....
  → Negative (confidence: 52.78%)
  One of the best movies I've ever seen!...
  → Positive (confidence: 54.91%)
```
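The script stops just before training runs. A minimal sketch of the remaining steps, assuming the `trainer` from above and a placeholder output directory:

```python
# Run fine-tuning (on real data you would use thousands of labeled examples,
# plus an eval_dataset to monitor overfitting)
trainer.train()

# Save the fine-tuned weights and tokenizer together
model.save_pretrained('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')

# Reload later for inference
model = BertForSequenceClassification.from_pretrained('./sentiment-bert')
tokenizer = BertTokenizer.from_pretrained('./sentiment-bert')
```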
Generate text with GPT-2:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load pre-trained GPT-2
model_name = 'gpt2' # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Set to evaluation mode
model.eval()
# Text generation function
def generate_text(prompt, max_length=50, temperature=0.8, top_k=50, top_p=0.95):
    """
    Generate text from a prompt.

    Args:
        prompt: Starting text
        max_length: Maximum total length (prompt + generated tokens)
        temperature: Higher = more random (typically 0.1-2.0)
        top_k: Sample only from the k most likely tokens
        top_p: Nucleus sampling threshold (cumulative probability cutoff)
    """
    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Generate (do_sample=True enables temperature/top_k/top_p)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
        )

    # Decode
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
# Examples with different prompts
prompts = [
    "Artificial intelligence will",
    "The future of machine learning is",
    "Deep learning models are",
]
print(f"\nGenerating text with GPT-2:\n")
for prompt in prompts:
    print(f"Prompt: '{prompt}'")
    # Generate with different temperatures
    for temp in [0.5, 1.0]:
        generated = generate_text(prompt, max_length=40, temperature=temp)
        print(f"  [temp={temp}] {generated}")
    print()
# Example: Get probability distribution for next token
prompt = "Machine learning is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids)
    predictions = outputs.logits[0, -1, :]  # Logits for the token after the prompt
    probabilities = torch.softmax(predictions, dim=-1)

# Get top 10 most likely next tokens
top_probs, top_indices = torch.topk(probabilities, 10)
print(f"\nTop 10 next tokens after '{prompt}':")
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx])
    print(f"  '{token}' - {prob:.2%}")
# Pointers for fine-tuning
print(f"\n✓ GPT-2 ready for generation!")
print(f"\nTo fine-tune GPT-2 on your data:")
print(f"1. Prepare text dataset")
print(f"2. Tokenize with tokenizer")
print(f"3. Use Trainer with TrainingArguments")
print(f"4. Train on your domain-specific text")
```

Output:

```text
Loaded model: gpt2
Model parameters: 124,439,808

Generating text with GPT-2:

Prompt: 'Artificial intelligence will'
  [temp=0.5] Artificial intelligence will be able to make decisions based on what we know and what we don't know about the world around us
  [temp=1.0] Artificial intelligence will revolutionize how we interact with technology, enabling machines to understand human emotions and respond accordingly

Prompt: 'The future of machine learning is'
  [temp=0.5] The future of machine learning is bright, with new algorithms and techniques being developed every day
  [temp=1.0] The future of machine learning is unpredictable but exciting, with potential applications in healthcare, finance, and autonomous systems

Prompt: 'Deep learning models are'
  [temp=0.5] Deep learning models are becoming increasingly sophisticated, capable of solving complex problems that were previously impossible
  [temp=1.0] Deep learning models are powerful tools for pattern recognition, but they require massive amounts of data to train effectively

Top 10 next tokens after 'Machine learning is':
  ' a' - 8.32%
  ' the' - 6.75%
  ' an' - 5.91%
  ' one' - 4.23%
  ' becoming' - 3.87%
  ' used' - 3.45%
  ' changing' - 2.98%
  ' revolutionizing' - 2.67%
  ' increasingly' - 2.34%
  ' now' - 2.12%

✓ GPT-2 ready for generation!

To fine-tune GPT-2 on your data:
1. Prepare text dataset
2. Tokenize with tokenizer
3. Use Trainer with TrainingArguments
4. Train on your domain-specific text
```
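A minimal sketch of those four steps, assuming a small in-memory corpus (`domain_texts` and the output directory are placeholders):

```python
import torch
from torch.utils.data import Dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

# 1. Prepare text dataset (placeholder corpus)
domain_texts = ["First domain-specific document ...", "Second document ..."]

# 2. Tokenize; GPT-2 has no pad token, so reuse EOS for padding
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

class LMDataset(Dataset):
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
    def __len__(self):
        return len(self.enc['input_ids'])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.enc.items()}

# 3. Trainer + TrainingArguments; mlm=False makes the collator copy
#    input_ids into labels for next-token (causal LM) training
model = GPT2LMHeadModel.from_pretrained('gpt2')
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./gpt2-finetuned', num_train_epochs=3),
    train_dataset=LMDataset(domain_texts),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# 4. Train on your domain-specific text
trainer.train()
```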
**Choosing the Right Model:**

| Task | Recommended Model | Why |
|------|------------------|-----|
| Classification | BERT, RoBERTa | Bidirectional understanding |
| Named Entity Recognition | BERT, ELECTRA | Token-level predictions |
| Question Answering | BERT, ALBERT | Context understanding |
| Text Generation | GPT-2, GPT-3 | Autoregressive generation |
| Summarization | T5, BART | Seq2seq architecture |
| Translation | mT5, mBART | Multilingual support |

**Fine-tuning Tips:**

1. **Learning Rate**: Start small (1e-5 to 5e-5)
2. **Epochs**: 2-4 epochs are usually sufficient
3. **Batch Size**: 8-32 depending on GPU memory
4. **Warmup**: Use warmup steps for stability
5. **Layer Freezing**: Freeze early layers for small datasets (see the sketch below)
6. **Gradient Clipping**: Prevent exploding gradients

**Common Pitfalls:**

- ❌ Too high a learning rate → unstable training
- ❌ Too many epochs → overfitting
- ❌ Wrong model type for the task → poor performance
- ❌ Mismatched tokenization or preprocessing between training and inference → degraded accuracy

**Production Considerations:**

- **Model Size**: Use distilled versions (DistilBERT) for speed
- **Inference Speed**: Optimize with ONNX, TensorRT
- **Cost**: Weigh API services (OpenAI, Cohere) against self-hosting
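For tip 5, a minimal sketch of layer freezing on the BERT classifier from earlier; freezing the first 8 of 12 encoder layers is an arbitrary cutoff worth tuning:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Freeze the token/position embeddings
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Freeze the first 8 of 12 encoder layers; only the top layers and the
# classification head continue to receive gradient updates
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

Tip 6 is largely handled for you: `TrainingArguments` clips gradient norms to `max_grad_norm=1.0` by default.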