📘 Simple Explanation
Dense Feature
Takes a non-zero value most of the time. Its gradient gets updated frequently.
Sparse Feature
Is ZERO most of the time. Its gradient gets updated rarely.
🏫 Real Life Example — School Attendance:
"Present" bolna = Dense (Roz aate ho) | "Absent" bolna = Sparse (Kabhi kabhi aate ho)
📊 Contour Plot — THE MOST IMPORTANT RULE FOR THE EXAM!
(Figure sketch: contour ellipses around the MIN at the origin, with the w₁ and w₂ axes running from about -4 to 4; the ellipses are stretched along the w₁ axis → ELONGATED along w₁ = w₁ is DENSE (more updates).)
The axis along which the ellipse is ELONGATED → that axis's feature is DENSE
The axis along which the ellipse is SHORT → that axis's feature is SPARSE
Memory trick: LAMBA (long) = LABORIOUS = the dense one does all the work 😄
How to Spot It in the Exam:
- Figure A — ellipse elongated along the w₁ axis → x₁ is DENSE, x₂ is SPARSE ✅
- Figure B — ellipse elongated along the w₂ axis → x₁ is SPARSE, x₂ is DENSE ✅
📘 Simple Explanation
"Jo parameter zyada update hua, uski learning rate kam karo"
"Jo parameter kam update hua, uski learning rate zyada rakho"
🍕 Pizza Shop Analogy:
Cheese = Dense (har pizza mein) → learning rate ghatti jaati hai
Truffle = Sparse (kabhi kabhi) → learning rate zyada maintain hoti hai
📐 Formula
AdaGrad Update Rule
Step 1: vt = vt-1 + (∇wt)² ← Accumulate the squared gradient
Step 2: wt+1 = wt - (η / √(vt + ε)) × ∇wt ← Apply the update
Key Points:
• vt = Sum of all past squared gradients (it only ever grows)
• η = Initial learning rate (typically 0.1)
• ε = Very small number (1e-8) → prevents division by zero
• Effective LR = η / √vt → it only DECREASES, never increases!
v only GROWS → η only SHRINKS — AdaGrad's curse!
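If it helps to see the rule in code, here is a minimal sketch of one AdaGrad step for a single weight (the function name and variables are my own, not from the notes):

```python
import math

def adagrad_step(w, grad, v, eta=0.1, eps=1e-8):
    """One AdaGrad update for a single weight."""
    v = v + grad ** 2                           # Step 1: accumulate the squared gradient
    w = w - (eta / math.sqrt(v + eps)) * grad   # Step 2: effective LR = eta / sqrt(v + eps)
    return w, v
```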
🔢 Solving the Numerical — Template
Given: ∇wt = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56], v₋₁ = 0, ε = 0, η₋₁ = 0.1 | Find η₆
FORMULA: vt = vt-1 + (∇wt)² | ηt = 0.1/√vt
t=0: v₀ = 0 + (1)² = 1.0000 → η₀ = 0.1/√1.0000 = 0.1000
t=1: v₁ = 1 + (0.9)² = 1.8100 → η₁ = 0.1/√1.8100 ≈ 0.0743
t=2: v₂ = 1.81+(0.6)² = 2.1700 → η₂ = 0.1/√2.1700 ≈ 0.0679
t=3: v₃ = 2.17+(0.01)²= 2.1701 → η₃ = 0.1/√2.1701 ≈ 0.0679
t=4: v₄ = 2.17+(0.1)² = 2.1801 → η₄ = 0.1/√2.1801 ≈ 0.0677
t=5: v₅ = 2.18+(0.2)² = 2.2201 → η₅ = 0.1/√2.2201 ≈ 0.0671
t=6: v₆ = 2.22+(0.5)² = 2.4701 → η₆ = 0.1/√2.4701 ≈ 0.063 ✓
FINAL ANSWER: η₆ ≈ 0.063
SHORTCUT: vn = Σ(∇wt)² from t=0 to n (just add up all the squared gradients!)
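A quick script to check the table above; it only accumulates the squared gradients and prints the effective learning rate at each t (a sketch, reproducing the worked numbers):

```python
import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]
v, eta0 = 0.0, 0.1
for t, g in enumerate(grads[:7]):        # we only need up to t = 6
    v += g ** 2                          # vt = vt-1 + (grad)^2, with eps = 0
    print(f"t={t}: v={v:.4f}, eta={eta0 / math.sqrt(v):.4f}")
# last line printed: t=6: v=2.4701, eta=0.0636  ->  eta_6 ≈ 0.063
```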
📘 Simple Explanation
AdaGrad's Problem: vt keeps growing forever → η effectively becomes ZERO → learning stalls!
RMSProp's Solution: "Forget a little of the old history" — use an Exponential Moving Average!
🎬 Real Life Example:
AdaGrad = add up the marks from your whole life (from 10th grade until now) → the average drags everything down
RMSProp = consider only the last 3 exams → more relevant!
📐 Formula
vt = β × vt-1 + (1-β) × (∇wt)²
wt+1 = wt - (η/√(vt + ε)) × ∇wt
Key Difference from AdaGrad:
| AdaGrad | RMSProp |
| vt = vt-1 + (∇wt)² | vt = βvt-1 + (1-β)(∇wt)² |
| vt only increases ↑ | vt can BOTH increase and decrease ↕ |
| Learning rate only decreases | Learning rate can also INCREASE! |
| β = 1 (implicit) | β = 0.9 (typical) |
Memory trick: RMS = "Ruk Mat Stop" (don't stop!) → the learning rate never grinds down to zero! 😄
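The same sketch for RMSProp; only Step 1 changes (β = 0.9 and η = 0.1 below are the typical values used in these notes, not fixed constants):

```python
import math

def rmsprop_step(w, grad, v, eta=0.1, beta=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients."""
    v = beta * v + (1 - beta) * grad ** 2       # forget part of the old history
    w = w - (eta / math.sqrt(v + eps)) * grad   # effective LR can rise OR fall
    return w, v
```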
📘 Simple Explanation
Adam = Momentum + RMSProp + Bias Correction
🚗 Car Analogy:
Momentum (mt): remembering the steering wheel's previous direction
RMSProp (vt): a speed limiter — slow on a steep road, fast on a flat one
Bias Correction: on a cold start, the car doesn't suddenly race off at full speed
📐 Full Formula — 5 Steps
Adam Update Rule (Must Be Written in the Exam!)
Step 1: mt = β₁×mt-1 + (1-β₁)×∇wt ← Momentum
Step 2: m̂t = mt / (1 - β₁ᵗ) ← Bias correction for m
Step 3: vt = β₂×vt-1 + (1-β₂)×(∇wt)² ← RMSProp part
Step 4: v̂t = vt / (1 - β₂ᵗ) ← Bias correction for v
Step 5: wt+1 = wt - (η/√(v̂t+ε)) × m̂t ← Final update
Default Values — ALWAYS REMEMBER:
β₁ = 0.9 (momentum) | β₂ = 0.999 (RMSProp) | ε = 1e-8 | η = 0.001
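The five steps as a sketch (t is the 1-based step count, so the bias-correction factors 1 - βᵗ are well defined; the names are my own):

```python
import math

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single weight; t starts at 1."""
    m = b1 * m + (1 - b1) * grad                    # Step 1: momentum
    m_hat = m / (1 - b1 ** t)                       # Step 2: bias correction for m
    v = b2 * v + (1 - b2) * grad ** 2               # Step 3: RMSProp part
    v_hat = v / (1 - b2 ** t)                       # Step 4: bias correction for v
    w = w - (eta / math.sqrt(v_hat + eps)) * m_hat  # Step 5: final update
    return w, m, v
```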
📘 Simple Explanation
When we start, m₀ = 0 and v₀ = 0. Because of this, the first few steps end up with a very large effective learning rate!
🌡️ AC Analogy:
Without bias correction: the AC runs at full blast the moment it starts → the room gets over-cooled!
With bias correction: the temperature adjusts gradually → comfortable!
📐 Mathematical Proof
WITHOUT bias correction:
v₀ = 0.999×0 + 0.001×(0.1)² = 0.00001
η_eff = 1/√0.00001 = 316.22 ← HUGE! 😱
WITH bias correction:
v̂₀ = 0.00001 / (1-0.999) = 0.00001/0.001 = 0.01
η_eff = 1/√0.01 = 10 ← Controlled! ✅
PROOF (3 lines):
E[mt] = E[∇w] × (1 - β₁ᵗ)
E[mt/(1-β₁ᵗ)] = E[∇w]
∴ m̂t = mt/(1-β₁ᵗ) is UNBIASED! ✓
Bias Correction = training wheels on a bicycle → you need support at the beginning!
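The AC example above, reproduced numerically (a sketch; the gradient 0.1, β₂ = 0.999 and η = 1 are taken from the lines above):

```python
import math

grad, beta2, eta = 0.1, 0.999, 1.0
v = (1 - beta2) * grad ** 2        # first step with v = 0  ->  0.00001
print(eta / math.sqrt(v))          # ~316: without bias correction the step is huge
v_hat = v / (1 - beta2 ** 1)       # divide by (1 - beta2^t) with t = 1
print(eta / math.sqrt(v_hat))      # 10: with bias correction the step is controlled
```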
📘 Simple Explanation
RMSProp's problem: setting the initial η₀ is hard. AdaDelta says: "I'll figure out η myself!"
AdaDelta Update Rule
Step 1: vt = β×vt-1 + (1-β)×(∇wt)²
Step 2: Δwt = -(√(ut-1+ε) / √(vt+ε)) × ∇wt
Step 3: wt+1 = wt + Δwt
Step 4: ut = β×ut-1 + (1-β)×(Δwt)²
Key Points:
• Numerator = history of past weight changes (ut)
• Denominator = history of past gradients (vt)
• The ratio automatically gives a meaningful scale
• No η₀ required!
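A sketch of the four steps; notice there is no η anywhere, the ratio √u/√v sets the scale on its own (names are my own, ε is just a small constant):

```python
import math

def adadelta_step(w, grad, v, u, beta=0.9, eps=1e-6):
    """One AdaDelta update; v tracks gradients, u tracks weight changes."""
    v = beta * v + (1 - beta) * grad ** 2                    # Step 1: gradient history
    dw = -(math.sqrt(u + eps) / math.sqrt(v + eps)) * grad   # Step 2: no eta needed
    w = w + dw                                               # Step 3: apply the update
    u = beta * u + (1 - beta) * dw ** 2                      # Step 4: update history
    return w, v, u
```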
📘 Simple Explanation
Keeps the weights from growing too large!
Loss with L2 = Original Loss + λ × Σ(w²)
| Statement | True/False | Why |
| Training error increases | TRUE ✅ | A penalty is added, so a perfect fit is no longer possible |
| Test error decreases | TRUE ✅ | Prevents overfitting, better generalization |
| Model complexity decreases | TRUE ✅ | Small weights = simpler model |
| Weights become exactly zero | FALSE ❌ | L1 does that, not L2! |
| Training error decreases | FALSE ❌ | L2 typically increases training error |
L2 = Shrinks weights (CLOSE to zero) | L1 = Drives weights (EXACTLY to zero)
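How the penalty enters the loss and the gradient, as a sketch (the base loss/gradients and λ = 0.01 here are stand-in illustration values, not from the notes):

```python
def l2_regularize(weights, base_loss, base_grads, lam=0.01):
    """L2-regularized loss and gradient for a list of weights.

    total loss   = base_loss + lam * sum(w_i^2)
    total grad_i = base_grad_i + 2 * lam * w_i   -> pushes each weight TOWARDS zero,
                                                    but never exactly to zero
    """
    loss = base_loss + lam * sum(w ** 2 for w in weights)
    grads = [g + 2 * lam * w for g, w in zip(base_grads, weights)]
    return loss, grads
```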
📘 Results (Default β values, 10 iterations)
| Statement | Result |
| MGD moves past the minimum because of added momentum | TRUE ✅ |
| Adam moves past the minimum because of added momentum | FALSE ❌ |
| Adam doesn't move past minimum because of adaptive learning rate | TRUE ✅ |
| MGD is closer to the minimum than Adam, after 10 iterations | TRUE ✅ |
| Adam is closer to the minimum than MGD, after 10 iterations | FALSE ❌ |
MGD = race car — goes fast, brakes late | Adam = smart car — adjusts its speed automatically
📘 Types of Schedulers
1. Step Decay
After every N epochs, multiply the learning rate by a factor.
η = η × 0.1 (every 30 epochs)
2. Exponential Decay
Decays exponentially.
η = η₀ × e^(-kt)
3. Cyclical LR (CLR)
The learning rate keeps cycling up and down within a range. Helps escape saddle points!
4. Cosine Annealing
ηt = ηmin + (ηmax-ηmin)/2 × (1 + cos(π×t/T))
After T iterations, start again from ηmax → "Warm Restart"
🎡 Analogy: CLR = a swing — goes up, comes down, then goes up again!
A saddle point is like a plateau — a flat region where plain GD gets stuck; CLR can escape it!
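Step decay and cosine annealing with warm restarts, written out as a sketch (ηmax = 0.1, ηmin = 0.001 and T = 50 are example values I picked, not from the notes):

```python
import math

def step_decay(eta0, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def cosine_annealing(t, eta_max=0.1, eta_min=0.001, T=50):
    """eta_t = eta_min + (eta_max - eta_min)/2 * (1 + cos(pi*t/T)); t % T gives the warm restart."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * (t % T) / T))
```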
📘 Simple Explanation
Adam (L2 norm)
vt = β₂vt-1 + (1-β₂)(∇wt)²
AdaMax (Max norm)
vt = max(β₂vt-1, |∇wt|)
Key Advantage:
• The max norm is not biased towards zero → bias correction NOT needed for vt!
• Better for sparse features!
• Bias correction is needed only for mt (same as in Adam)
AdaMax = Maximum respect for biggest gradient!
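A sketch of one AdaMax step: vt uses the max norm, so there is no square root in the update and no bias correction for vt (the step size 0.002 is a commonly quoted default, treat it as an assumption):

```python
def adamax_step(w, grad, m, v, t, eta=0.002, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaMax update; t starts at 1."""
    m = b1 * m + (1 - b1) * grad       # momentum, same as Adam
    m_hat = m / (1 - b1 ** t)          # bias correction only for m
    v = max(b2 * v, abs(grad))         # max norm: not biased towards zero
    w = w - (eta / (v + eps)) * m_hat  # no sqrt, no bias correction for v
    return w, m, v
```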
📊 Master Comparison Table — TAKE A PHOTO OF THIS!
| Algorithm | vt Formula | η₀ Needed? | Bias Correction? | LR Trend | Best For |
| AdaGrad | vt-1 + (∇wt)² | YES | NO | Only ↓ | Sparse data |
| RMSProp | β·vt-1 + (1-β)(∇wt)² | YES | NO | ↑ or ↓ | Non-stationary |
| AdaDelta | Same as RMSProp | NO ✨ | NO | ↑ or ↓ | General use |
| Adam | mt + vt (both) | YES | YES (both) | ↑ or ↓ | Default choice 🏆 |
| AdaMax | max(β₂vt-1, |∇wt|) | YES | Only mt | ↑ or ↓ | Sparse features |
Q1 · 3M
Contour Plot Question — Mark True/False
In the figure, the ellipse is elongated along the w₁ axis. Which statements are TRUE?
(a) x₁ is sparse, x₂ is dense (b) x₁ is dense, x₂ is sparse (c) w₁ gets more updates than w₂
Answer: (b) and (c) ✅
Explanation:
- The ellipse is elongated along the w₁ axis → w₁ receives more updates
- More updates = feature x₁ is DENSE
- Fewer updates = feature x₂ is SPARSE
- A dense feature's gradient is non-zero more often
- Hence w₁ gets more updates than w₂
Keywords: dense, sparse, gradient updates, non-zero, ellipse elongated
Q2 · 2M
Gradient Descent Trajectory — True or False?
GD is run starting from w₁ = -6, w₂ = 0. The claim is that the trajectory is a straight horizontal line. Is the claim correct?
Answer: FALSE ❌
Explanation:
- In the figure, x₁ is dense and x₂ is sparse
- The dense parameter (w₁) will receive more updates
- The sparse parameter (w₂) will receive fewer updates
- Therefore the trajectory will NOT be straight
- w₁ will move more, w₂ less
- The path will be curved/uneven, not straight
- Because of the dense feature, w₁ will move quickly towards the minimum
Keywords: dense parameter, sparse parameter, unequal updates, curved trajectory
Q3 · 2M
AdaGrad Effective Learning Rate Graph — Which one is correct?
Answer: the monotonically DECREASING graph (it only goes down) ✅
Explanation:
- In AdaGrad: vt = vt-1 + (∇wt)²
- vt keeps INCREASING (a squared term is always added, never subtracted)
- Effective LR = η/√vt
- vt increases → 1/√vt decreases
- Hence the effective learning rate ONLY DECREASES
- It never increases, even if the gradient becomes zero
Graph: Smoothly decreasing curve — starting high, going to near zero
(Select the option with the decaying curve)
Q4 · 5M
Calculate AdaGrad's η₆ — Numerical
∇wt = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56] | v₋₁ = 0, ε = 0, η₋₁ = 0.1 | Find η₆
FORMULA: vt = vt-1 + (∇wt)² | ηt = η₋₁/√vt
t=0: v₀ = 0 + 1² = 1.0000 → η₀ = 0.1/√1.0000 = 0.1000
t=1: v₁ = 1 + 0.81 = 1.8100 → η₁ = 0.1/√1.8100 ≈ 0.0743
t=2: v₂ = 1.81+0.36 = 2.1700 → η₂ = 0.1/√2.1700 ≈ 0.0679
t=3: v₃ = 2.17+0.0001=2.1701 → η₃ = 0.1/√2.1701 ≈ 0.0679
t=4: v₄ = 2.17+0.01 = 2.1801 → η₄ = 0.1/√2.1801 ≈ 0.0677
t=5: v₅ = 2.18+0.04 = 2.2201 → η₅ = 0.1/√2.2201 ≈ 0.0671
t=6: v₆ = 2.22+0.25 = 2.4701 → η₆ = 0.1/√2.4701 ≈ 0.063
FINAL ANSWER: η₆ ≈ 0.063 ✓
EXAM TIP: Don't forget to write the formula! There are marks for the formula too.
Q5 · 3M
RMSProp vs AdaGrad Difference — With Formulas
AdaGrad Update Rule:
vt = vt-1 + (∇wt)²
RMSProp Update Rule:
vt = β×vt-1 + (1-β)×(∇wt)²
Main Differences:
1. DENOMINATOR GROWTH:
AdaGrad: vt always grows (monotonically increasing)
RMSProp: vt can either increase or decrease
2. EFFECTIVE LEARNING RATE:
AdaGrad: only decreases (monotonically decreasing)
RMSProp: can increase, decrease, or stay roughly constant
3. LONG TRAINING:
AdaGrad: after long training, LR → 0 (training stalls!)
RMSProp: the LR stays controlled throughout
4. β PARAMETER:
AdaGrad: no β (or β = 1 implicitly)
RMSProp: β ∈ [0,1), typically β = 0.9
Conclusion:
RMSProp is an improved version of AdaGrad that
solves the "ever-growing denominator" problem.
Q6 · 4M
RMSProp Numerical — Calculate η
β = 0.9, v₋₁ = 0, ε = 0, η₋₁ = 0.1, same gradient sequence ∇wt as in Q4 | Find the first few η values
FORMULA: vt = β×vt-1 + (1-β)×(∇wt)² → β=0.9, (1-β)=0.1
t=0: v₀ = 0.9×0 + 0.1×(1)² = 0.1000 → η₀ = 0.1/√0.1 = 0.316
t=1: v₁ = 0.9×0.1+0.1×(0.9)²= 0.171 → η₁ = 0.1/√0.171 = 0.242
t=2: v₂ = 0.9×0.171+0.1×0.36= 0.1899 → η₂ = 0.1/√0.1899= 0.229
KEY OBSERVATION:
In RMSProp, vt does not grow as much as in AdaGrad
Hence: ηRMSProp > ηAdaGrad (after the same number of iterations)
→ RMSProp does not choke off learning early!
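A quick check of this observation on the same gradient sequence; both denominators are tracked side by side (a sketch):

```python
import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5]
v_ada, v_rms, beta = 0.0, 0.0, 0.9
for t, g in enumerate(grads):
    v_ada += g ** 2                              # AdaGrad: only grows
    v_rms = beta * v_rms + (1 - beta) * g ** 2   # RMSProp: leaky average
    print(t, round(0.1 / math.sqrt(v_ada), 4), round(0.1 / math.sqrt(v_rms), 4))
# at every t the RMSProp rate is larger, e.g. t=2: 0.0679 (AdaGrad) vs 0.2295 (RMSProp)
```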
Q7 · 3M
Adam vs MGD — f(w) = w², 10 Iterations
β₁ = 0.9, β₂ = 0.999, m₋₁ = 0, v₋₁ = 0. Which statements are TRUE?
Answer: (a) TRUE, (c) TRUE, (d) TRUE
(a) TRUE - MGD moves past minimum:
Because of momentum, even when it reaches the minimum
it still carries velocity → it shoots past the minimum
Momentum coefficient 0.9 → high momentum → overshooting
(b) FALSE - Adam does NOT move past minimum:
Adam's adaptive learning rate adjusts automatically
Near the minimum the gradient is small → so the step size is small too
That is why Adam stays controlled near the minimum
(c) TRUE - Correct explanation for Adam's behavior
(d) TRUE - MGD is closer to minimum after 10 iterations:
Adam moves slowly and oscillates only slightly near the minimum
MGD, driven by momentum, covers distance much faster (even though it overshoots)
After 10 iterations, MGD actually ends up quite close to the minimum
Note: you can verify this by writing code (the question itself hints at it); see the sketch below
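A minimal verification script along the lines of that hint. The starting point w = 1, the MGD learning rate 0.1 and Adam's default η = 0.001 are my assumptions, since the question does not pin them down; with these settings MGD overshoots past 0 but still finishes far closer to the minimum than Adam:

```python
import math

def run_mgd(w=1.0, eta=0.1, beta=0.9, steps=10):
    """Momentum GD on f(w) = w^2 (one common convention: u = beta*u + eta*grad)."""
    u = 0.0
    for _ in range(steps):
        g = 2 * w              # gradient of w^2
        u = beta * u + eta * g
        w = w - u              # overshoots: w goes negative after a few steps
    return w

def run_adam(w=1.0, eta=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=10):
    """Adam on f(w) = w^2 with the default step size."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2 * w
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        w = w - eta * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return w

print(run_mgd(), run_adam())   # roughly 0.004 vs 0.99: MGD ends much closer to w* = 0
```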
Q8 · 2M
L(w,b) = 0.5w² + 5b² + 1 — Find L(w*, b*)
To find minimum, partial derivatives set to zero:
∂L/∂w = 0.5 × 2w = w = 0 → w* = 0
∂L/∂b = 5 × 2b = 10b = 0 → b* = 0
L(w*, b*) = 0.5×(0)² + 5×(0)² + 1
= 0 + 0 + 1
= 1
Answer: L(w*, b*) = 1 ✓
Note: the minimum is at (0, 0) — this will be at the centre of
the contour plot. The constant term 1 is why the answer is 1.
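A two-line numeric check of this answer (a sketch; it just compares L(0, 0) with the loss on a small grid around the origin):

```python
def L(w, b):
    return 0.5 * w ** 2 + 5 * b ** 2 + 1

grid = [x / 10 for x in range(-20, 21)]
print(L(0, 0), min(L(w, b) for w in grid for b in grid))   # both print 1.0: (0, 0) is the minimum
```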
Q9 · 3M
Effects of L2 Regularization — True/False
Answer: (a) TRUE, (b) TRUE, (c) TRUE, (d) FALSE, (e) FALSE
(a) TRUE - Training error INCREASES:
L2 adds penalty term λΣw² to loss function
Model cannot fit training data perfectly
So training error increases a little (by design!)
(b) TRUE - Test error DECREASES:
L2 prevents overfitting by penalizing large weights
Smaller weights → better generalization on unseen data
Hence test/validation error decreases
(c) TRUE - Model complexity REDUCES:
L2 encourages smaller weights throughout
Less important weights shrink close to zero
Simpler model = less complexity = better generalization
(d) FALSE - Weights NOT exactly zero:
L2 = pulls weights CLOSE to zero (shrinkage)
L1 = drives weights EXACTLY to zero (sparsity)
This is the KEY difference!
(e) FALSE - Training error does NOT decrease with L2
Formula: L2 Loss = Original Loss + λ × Σwᵢ²
Q10 · 4M
Bias Correction in Adam — Mathematical Justification
REASON FOR BIAS CORRECTION IN ADAM:
1. PROBLEM (Initialization Bias):
We initialize m₀ = 0 and v₀ = 0
m₁ = β₁×0 + (1-β₁)×∇w₁ = 0.1×∇w₁ ← far too small!
2. EFFECT WITHOUT BIAS CORRECTION:
v₀ = 0.001×(0.1)² = 0.00001
η_eff = 1/√0.00001 = 316 ← HUGE → Unstable!
3. MATHEMATICAL PROOF:
mt = (1-β₁) × Σ(β₁^(t-τ) × ∇wτ)
Taking expectation (assuming stationary distribution):
E[mt] = E[∇w] × (1 - β₁ᵗ) [Sum of GP with ratio β₁]
We want E[m̂t] = E[∇w]
∴ m̂t = mt / (1 - β₁ᵗ) ← This is Bias Correction!
4. EFFECT WITH BIAS CORRECTION:
v̂₀ = 0.00001/0.001 = 0.01
η_eff = 1/√0.01 = 10 ← Controlled! ✓
CONCLUSION:
Bias correction keeps the initial steps stable and makes
the exponential moving average an unbiased estimate.
"A lack of initialization bias correction would lead
to initial steps that are much larger" — Adam paper
Q11 · 3M
AdaDelta — Why Is No Initial η₀ Needed?
AdaDelta Algorithm:
1. vt = β×vt-1 + (1-β)×(∇wt)² ← past gradients
2. Δwt = -(√(ut-1+ε)/√(vt+ε)) × ∇wt ← weight update
3. wt+1 = wt + Δwt
4. ut = β×ut-1 + (1-β)×(Δwt)² ← past weight changes
WHY NO INITIAL LEARNING RATE NEEDED:
In RMSProp: Δwt = -(η₀/√vt) × ∇wt
→ η₀ is a constant that must be set manually
In AdaDelta: Δwt = -(√ut-1/√vt) × ∇wt
→ Numerator = RMS of past weight changes
→ Denominator = RMS of past gradients
→ The ratio automatically gives a meaningful scale!
→ No manual η₀ required!
Key Insight:
- ut tracks history of weight UPDATES (Δwt)
- vt tracks history of GRADIENTS (∇wt)
- √ut/√vt acts as adaptive learning rate automatically
- This ratio is in correct units/scale automatically
Benefit: Not sensitive to initial conditions, works across
different problems without manual tuning of η₀
Q12 · 2M
CLR / Cosine Annealing — Why Is It Useful?
Cyclical Learning Rate (CLR) is useful because:
1. SADDLE POINT ESCAPE:
- Loss surfaces contain saddle points (flat areas)
- A monotonically decreasing LR gets STUCK at a saddle point
- In CLR the LR also increases → ESCAPE from the saddle point becomes possible!
2. BETTER EXPLORATION + FINE-TUNING:
- At a high LR: wide exploration (finding new regions)
- At a low LR: fine-tuning near the minimum
- You get both benefits at once!
3. FORMULA (Triangular CLR):
ηt = ηmin + (ηmax-ηmin) × max(0, 1 - |t/μ - 2⌊1 + t/(2μ)⌋ + 1|)
μ = step size (half the length of one cycle)
4. COSINE ANNEALING (Warm Restart):
ηt = ηmin + (ηmax-ηmin)/2 × (1 + cos(π×t/T))
After T iterations, restart abruptly from ηmax → "Warm Restart"
CONCLUSION: CLR is especially helpful when the loss surface
has saddle points or plateaus. Monotonic decay
fails there.
📖 Important Definitions — Copy These in the Exam!
1. Sparse Feature:
"A feature is called sparse if its value is zero for most training examples, resulting in infrequent gradient updates for the corresponding weight."
2. Adaptive Learning Rate:
"A learning rate that automatically adjusts for each parameter based on the history of its gradients, giving larger updates to infrequent parameters and smaller updates to frequent ones."
3. Adam Optimizer:
"Adam (Adaptive Moment Estimation) combines momentum and RMSProp, maintaining exponentially decaying averages of past gradients (mt) and squared gradients (vt), with bias correction to ensure unbiased estimates."
4. Bias Correction:
"The process of dividing the moving average estimates by (1-βᵗ) to correct the initialization bias towards zero, ensuring the expected value equals the true expected gradient."
5. L2 Regularization:
"A regularization technique that adds λ×Σw² penalty to the loss function, preventing overfitting by discouraging large weights, thereby improving generalization on unseen data."
📐 ALL FORMULAS — IN ONE PLACE
╔══════════════════════════════════════════════════════════╗
║ ALL OPTIMIZER FORMULAS ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ ADAGRAD: ║
║ vt = vt-1 + (∇wt)² ║
║ wt+1 = wt - (η/√(vt+ε)) × ∇wt ║
║ ║
║ RMSPROP: ║
║ vt = β×vt-1 + (1-β)×(∇wt)² ║
║ wt+1 = wt - (η/√(vt+ε)) × ∇wt ║
║ ║
║ ADADELTA: ║
║ vt = β×vt-1 + (1-β)×(∇wt)² ║
║ Δwt = -(√(ut-1+ε)/√(vt+ε)) × ∇wt ║
║ wt+1 = wt + Δwt ║
║ ut = β×ut-1 + (1-β)×(Δwt)² ║
║ ║
║ ADAM: ║
║ mt = β₁×mt-1 + (1-β₁)×∇wt ║
║ m̂t = mt/(1-β₁ᵗ) ← BIAS CORRECTION (m) ║
║ vt = β₂×vt-1 + (1-β₂)×(∇wt)² ║
║ v̂t = vt/(1-β₂ᵗ) ← BIAS CORRECTION (v) ║
║ wt+1 = wt - (η/√(v̂t+ε)) × m̂t ║
║ ║
║ ADAMAX: ║
║ mt = β₁×mt-1 + (1-β₁)×∇wt ║
║ m̂t = mt/(1-β₁ᵗ) ← BIAS CORRECTION (only m) ║
║ vt = max(β₂×vt-1, |∇wt|) ← MAX NORM (no BC!) ║
║ wt+1 = wt - (η/(vt+ε)) × m̂t ║
║ ║
╠══════════════════════════════════════════════════════════╣
║ DEFAULT VALUES (ALWAYS REMEMBER!):                          ║
║ β₁ = 0.9 │ β₂ = 0.999 │ ε = 1e-8 │ η = 0.001 ║
╚══════════════════════════════════════════════════════════╝
📊 Contour Plot — 30 Second Rule
╔══════════════════════════════════════════╗
║ ELLIPSE ELONGATED along the w₁ axis      ║
║          ↓                               ║
║ w₁ receives MORE updates                 ║
║          ↓                               ║
║ Feature x₁ = DENSE                       ║
║ Feature x₂ = SPARSE                      ║
╠══════════════════════════════════════════╣
║ ELLIPSE ELONGATED along the w₂ axis      ║
║          ↓                               ║
║ w₂ receives MORE updates                 ║
║          ↓                               ║
║ Feature x₁ = SPARSE                      ║
║ Feature x₂ = DENSE                       ║
╚══════════════════════════════════════════╝
⚡ True/False Quick Reference
| Statement | Answer |
| AdaGrad effective LR monotonically decreases | TRUE ✅ |
| RMSProp LR can increase | TRUE ✅ |
| AdaDelta needs initial learning rate | FALSE ❌ |
| Adam uses bias correction for both mt and vt | TRUE ✅ |
| AdaMax needs bias correction for vt | FALSE ❌ |
| L2 regularization increases training error | TRUE ✅ |
| L2 regularization decreases test error | TRUE ✅ |
| L2 drives weights exactly to zero | FALSE ❌ |
| L1 drives weights exactly to zero | TRUE ✅ |
| Sparse feature gets more gradient updates | FALSE ❌ |
| Dense feature corresponds to elongated ellipse axis | TRUE ✅ |
| MGD overshoots minimum due to momentum | TRUE ✅ |
| Adam overshoots minimum | FALSE ❌ |
| CLR helps escape saddle points | TRUE ✅ |
| vt in AdaGrad never decreases | TRUE ✅ |
🔑 5 Golden Rules — They Will Always Come in Handy!
RULE 1: "ELONGATED ELLIPSE = DENSE FEATURE" — in a contour plot, the feature on the axis along which the ellipse is elongated is DENSE
RULE 2: "ADAGRAD ONLY SHRINKS" — in AdaGrad the effective LR only decreases, it never increases
RULE 3: "RMSPROP = ADAGRAD + FORGETTING" — RMSProp uses an exponential moving average and does not remember the entire past
RULE 4: "ADAM = MOMENTUM + RMSPROP + BIAS FIX" — Adam = mt (momentum) + vt (RMSProp) + bias correction for both
RULE 5: "L2 SHRINKS, L1 KILLS" — L2 pulls weights CLOSE to zero | L1 sets weights EXACTLY to zero
🧠 Memory Tricks — The A-R-D-A-M Method
| Letter | Algorithm | Trick |
| A | AdaGrad | "Ambles along, then grinds to a halt" — the LR decays to zero |
| R | RMSProp | "Refuses to stop! Stays flexible" — the LR can fluctuate |
| D | AdaDelta | "Decides everything by itself" — no η₀ needed |
| A | Adam | "All-rounder! The combination of everything!" — default choice 🏆 |
| M | AdaMax | "Maximum respect for biggest gradient!" — max norm |
🚨 Common Mistakes — Don't Make These!
❌ "The learning rate can increase in AdaGrad"
✅ "In AdaGrad the LR only decreases (vt always grows)"
❌ "L2 regularization drives weights exactly to zero"
✅ "L1 drives them exactly to zero; L2 only close to zero (shrinkage)"
❌ "In Adam, bias correction is only for mt"
✅ "Bias correction is for BOTH mt and vt (both get divided)"
❌ "AdaMax also needs bias correction for vt"
✅ "The max norm is not biased towards zero, so BC is needed only for mt"
❌ Ignoring β = 0.9 in a numerical (i.e. using the AdaGrad formula by mistake)
✅ Check first: is it AdaGrad (no β) or RMSProp (β present)?
❌ "Elongated ellipse = sparse feature"
✅ "Elongated ellipse = DENSE feature (it gets more updates)"
🏆 Predicted Exam Pattern
📊 Q1-Q2: Contour plot — identify sparse/dense
3 Marks
📉 Q3: Identify the effective LR graph
2 Marks
🔢 Q4: Calculate AdaGrad's η₆ (numerical)
5 Marks
🔢 Q5-Q6: RMSProp numerical
4 Marks
⚡ Q7: Adam vs MGD comparison
3 Marks
📐 Q8: Find the minimum L(w*, b*)
2 Marks
✅ Q9: L2 regularization effects True/False
3 Marks
📖 Q10: Bias correction explanation
4 Marks
Total: ~26 Marks Coverage in Preparation! 🎯
✍️ Exam Answer Writing Tips
📝 Formula First
In every numerical, write the formula first. There are marks for the formula too!
📊 Step by Step
Write out t=0, t=1, t=2... clearly. Show the intermediate values.
🎯 Conclusion
"Therefore η₆ = 0.063" ya "Hence AdaGrad better for sparse" type conclusion hamesha likho.
🔑 Keywords
Use the keywords: dense, sparse, adaptive, monotonically, bias correction, exponential moving average.
✏️ Diagram
Sketch the contour plot wherever possible. Show the trajectory with arrows.
⏱️ Easy Ones First
Do the easy/short questions first. Move ahead with confidence!
✅ Last 30 Minutes Revision Checklist
- Do you remember the contour plot rule? (ELONGATED = DENSE)
- The AdaGrad formula? (vt = vt-1 + grad²)
- The RMSProp formula? (vt = β×vt-1 + (1-β)×grad²)
- Adam's 5 steps? (mt, m̂t, vt, v̂t, update)
- The default values? (β₁=0.9, β₂=0.999, ε=1e-8)
- Why bias correction? (Initial bias towards zero; prevents overly large initial steps)
- L2 vs L1 difference? (Close to 0 vs Exactly 0)
- AdaDelta — why no η₀? (Numerator = history of weight changes)
- Why is CLR useful? (Escaping saddle points)
- Can you solve the numericals? (η = η₋₁/√vt)