📘 Simple Explanation
Dense Feature
Takes a non-zero value most of the time. Its gradient gets updated frequently.
Sparse Feature
Is ZERO most of the time. Its gradient gets updated rarely.
🏫 Real Life Example — School Attendance:
"Present" bolna = Dense (Roz aate ho) | "Absent" bolna = Sparse (Kabhi kabhi aate ho)
📊 Contour Plot — THE MOST IMPORTANT RULE FOR THE EXAM!
(Figure sketch: contour ellipses around the MIN at the origin, with the w₁ and w₂ axes running from about -4 to 4; the ellipses are stretched along the w₁ axis → ELONGATED along w₁ = w₁ is DENSE (more updates).)
The axis along which the ellipse is ELONGATED → that axis's feature is DENSE
The axis along which the ellipse is SHORT → that axis's feature is SPARSE
Memory trick: LAMBA (long) = LABORIOUS = the dense one does all the work 😄
How to Spot It in the Exam:
- Figure A — ellipse elongated along the w₁ axis → x₁ is DENSE, x₂ is SPARSE ✅
- Figure B — ellipse elongated along the w₂ axis → x₁ is SPARSE, x₂ is DENSE ✅
📘 Simple Explanation
"Jo parameter zyada update hua, uski learning rate kam karo"
"Jo parameter kam update hua, uski learning rate zyada rakho"
🍕 Pizza Shop Analogy:
Cheese = Dense (har pizza mein) → learning rate ghatti jaati hai
Truffle = Sparse (kabhi kabhi) → learning rate zyada maintain hoti hai
📐 Formula
AdaGrad Update Rule
Step 1: vt = vt-1 + (∇wt)² ← Accumulate the squared gradient
Step 2: wt+1 = wt - (η / √(vt + ε)) × ∇wt ← Apply the update
Key Points:
• vt = Sum of all past squared gradients (it only ever grows)
• η = Initial learning rate (typically 0.1)
• ε = Very small number (1e-8) → prevents division by zero
• Effective LR = η / √vt → it only DECREASES, never increases!
v only GROWS → η only SHRINKS — AdaGrad's curse!
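If it helps to see the rule in code, here is a minimal sketch of one AdaGrad step for a single weight (the function name and variables are my own, not from the notes):

```python
import math

def adagrad_step(w, grad, v, eta=0.1, eps=1e-8):
    """One AdaGrad update for a single weight."""
    v = v + grad ** 2                           # Step 1: accumulate the squared gradient
    w = w - (eta / math.sqrt(v + eps)) * grad   # Step 2: effective LR = eta / sqrt(v + eps)
    return w, v
```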
🔢 Solving the Numerical — Template
Given: ∇wt = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56], v₋₁ = 0, ε = 0, η₋₁ = 0.1 | Find η₆
FORMULA: vt = vt-1 + (∇wt)² | ηt = 0.1/√vt
t=0: v₀ = 0 + (1)² = 1.0000 → η₀ = 0.1/√1.0000 = 0.1000
t=1: v₁ = 1 + (0.9)² = 1.8100 → η₁ = 0.1/√1.8100 ≈ 0.0743
t=2: v₂ = 1.81+(0.6)² = 2.1700 → η₂ = 0.1/√2.1700 ≈ 0.0679
t=3: v₃ = 2.17+(0.01)²= 2.1701 → η₃ = 0.1/√2.1701 ≈ 0.0679
t=4: v₄ = 2.17+(0.1)² = 2.1801 → η₄ = 0.1/√2.1801 ≈ 0.0677
t=5: v₅ = 2.18+(0.2)² = 2.2201 → η₅ = 0.1/√2.2201 ≈ 0.0671
t=6: v₆ = 2.22+(0.5)² = 2.4701 → η₆ = 0.1/√2.4701 ≈ 0.063 ✓
FINAL ANSWER: η₆ ≈ 0.063
SHORTCUT: vn = Σ(∇wt)² from t=0 to n (just add up all the squared gradients!)
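A quick script to check the table above; it only accumulates the squared gradients and prints the effective learning rate at each t (a sketch, reproducing the worked numbers):

```python
import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]
v, eta0 = 0.0, 0.1
for t, g in enumerate(grads[:7]):        # we only need up to t = 6
    v += g ** 2                          # vt = vt-1 + (grad)^2, with eps = 0
    print(f"t={t}: v={v:.4f}, eta={eta0 / math.sqrt(v):.4f}")
# last line printed: t=6: v=2.4701, eta=0.0636  ->  eta_6 ≈ 0.063
```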
📘 Simple Explanation
AdaGrad's Problem: vt keeps growing forever → η effectively becomes ZERO → learning stalls!
RMSProp's Solution: "Forget a little of the old history" — use an Exponential Moving Average!
🎬 Real Life Example:
AdaGrad = add up the marks from your whole life (from 10th grade until now) → the average drags everything down
RMSProp = consider only the last 3 exams → more relevant!
📐 Formula
vt = β × vt-1 + (1-β) × (∇wt)²
wt+1 = wt - (η/√(vt + ε)) × ∇wt
Key Difference from AdaGrad:
| AdaGrad | RMSProp |
| vt = vt-1 + (∇wt)² | vt = βvt-1 + (1-β)(∇wt)² |
| vt only increases ↑ | vt can BOTH increase and decrease ↕ |
| Learning rate only decreases | Learning rate can also INCREASE! |
| β = 1 (implicit) | β = 0.9 (typical) |
Memory trick: RMS = "Ruk Mat Stop" (don't stop!) → the learning rate never grinds down to zero! 😄
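The same sketch for RMSProp; only Step 1 changes (β = 0.9 and η = 0.1 below are the typical values used in these notes, not fixed constants):

```python
import math

def rmsprop_step(w, grad, v, eta=0.1, beta=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients."""
    v = beta * v + (1 - beta) * grad ** 2       # forget part of the old history
    w = w - (eta / math.sqrt(v + eps)) * grad   # effective LR can rise OR fall
    return w, v
```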
📘 Simple Explanation
Adam = Momentum + RMSProp + Bias Correction
🚗 Car Analogy:
Momentum (mt): remembering the steering wheel's previous direction
RMSProp (vt): a speed limiter — slow on a steep road, fast on a flat one
Bias Correction: on a cold start, the car doesn't suddenly race off at full speed
📐 Full Formula — 5 Steps
Adam Update Rule (Must Be Written in the Exam!)
Step 1: mt = β₁×mt-1 + (1-β₁)×∇wt ← Momentum
Step 2: m̂t = mt / (1 - β₁ᵗ) ← Bias correction for m
Step 3: vt = β₂×vt-1 + (1-β₂)×(∇wt)² ← RMSProp part
Step 4: v̂t = vt / (1 - β₂ᵗ) ← Bias correction for v
Step 5: wt+1 = wt - (η/√(v̂t+ε)) × m̂t ← Final update
Default Values — ALWAYS REMEMBER:
β₁ = 0.9 (momentum) | β₂ = 0.999 (RMSProp) | ε = 1e-8 | η = 0.001
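The five steps as a sketch (t is the 1-based step count, so the bias-correction factors 1 - βᵗ are well defined; the names are my own):

```python
import math

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single weight; t starts at 1."""
    m = b1 * m + (1 - b1) * grad                    # Step 1: momentum
    m_hat = m / (1 - b1 ** t)                       # Step 2: bias correction for m
    v = b2 * v + (1 - b2) * grad ** 2               # Step 3: RMSProp part
    v_hat = v / (1 - b2 ** t)                       # Step 4: bias correction for v
    w = w - (eta / math.sqrt(v_hat + eps)) * m_hat  # Step 5: final update
    return w, m, v
```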
📘 Simple Explanation
When we start, m₀ = 0 and v₀ = 0. Because of this, the first few steps end up with a very large effective learning rate!
🌡️ AC Analogy:
Without bias correction: the AC runs at full blast the moment it starts → the room gets over-cooled!
With bias correction: the temperature adjusts gradually → comfortable!
📐 Mathematical Proof
WITHOUT bias correction:
v₀ = 0.999×0 + 0.001×(0.1)² = 0.00001
η_eff = 1/√0.00001 = 316.22 ← HUGE! 😱
WITH bias correction:
v̂₀ = 0.00001 / (1-0.999) = 0.00001/0.001 = 0.01
η_eff = 1/√0.01 = 10 ← Controlled! ✅
PROOF (3 lines):
E[mt] = E[∇w] × (1 - β₁ᵗ)
E[mt/(1-β₁ᵗ)] = E[∇w]
∴ m̂t = mt/(1-β₁ᵗ) is UNBIASED! ✓
Bias Correction = training wheels on a bicycle → you need support at the beginning!
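The AC example above, reproduced numerically (a sketch; the gradient 0.1, β₂ = 0.999 and η = 1 are taken from the lines above):

```python
import math

grad, beta2, eta = 0.1, 0.999, 1.0
v = (1 - beta2) * grad ** 2        # first step with v = 0  ->  0.00001
print(eta / math.sqrt(v))          # ~316: without bias correction the step is huge
v_hat = v / (1 - beta2 ** 1)       # divide by (1 - beta2^t) with t = 1
print(eta / math.sqrt(v_hat))      # 10: with bias correction the step is controlled
```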
📘 Simple Explanation
RMSProp's problem: setting the initial η₀ is hard. AdaDelta says: "I'll figure out η myself!"
AdaDelta Update Rule
Step 1: vt = β×vt-1 + (1-β)×(∇wt)²
Step 2: Δwt = -(√(ut-1+ε) / √(vt+ε)) × ∇wt
Step 3: wt+1 = wt + Δwt
Step 4: ut = β×ut-1 + (1-β)×(Δwt)²
Key Points:
• Numerator = history of past weight changes (ut)
• Denominator = history of past gradients (vt)
• The ratio automatically gives a meaningful scale
• No η₀ required!
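A sketch of the four steps; notice there is no η anywhere, the ratio √u/√v sets the scale on its own (names are my own, ε is just a small constant):

```python
import math

def adadelta_step(w, grad, v, u, beta=0.9, eps=1e-6):
    """One AdaDelta update; v tracks gradients, u tracks weight changes."""
    v = beta * v + (1 - beta) * grad ** 2                    # Step 1: gradient history
    dw = -(math.sqrt(u + eps) / math.sqrt(v + eps)) * grad   # Step 2: no eta needed
    w = w + dw                                               # Step 3: apply the update
    u = beta * u + (1 - beta) * dw ** 2                      # Step 4: update history
    return w, v, u
```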
📘 Simple Explanation
Keeps the weights from growing too large!
Loss with L2 = Original Loss + λ × Σ(w²)
| Statement | True/False | Why |
| Training error increases | TRUE ✅ | A penalty is added, so a perfect fit is no longer possible |
| Test error decreases | TRUE ✅ | Prevents overfitting, better generalization |
| Model complexity decreases | TRUE ✅ | Small weights = simpler model |
| Weights become exactly zero | FALSE ❌ | L1 does that, not L2! |
| Training error decreases | FALSE ❌ | L2 typically increases training error |
L2 = Shrinks weights (CLOSE to zero) | L1 = Drives weights (EXACTLY to zero)
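How the penalty enters the loss and the gradient, as a sketch (the base loss/gradients and λ = 0.01 here are stand-in illustration values, not from the notes):

```python
def l2_regularize(weights, base_loss, base_grads, lam=0.01):
    """L2-regularized loss and gradient for a list of weights.

    total loss   = base_loss + lam * sum(w_i^2)
    total grad_i = base_grad_i + 2 * lam * w_i   -> pushes each weight TOWARDS zero,
                                                    but never exactly to zero
    """
    loss = base_loss + lam * sum(w ** 2 for w in weights)
    grads = [g + 2 * lam * w for g, w in zip(base_grads, weights)]
    return loss, grads
```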
📘 Results (Default β values, 10 iterations)
| Statement | Result |
| MGD moves past the minimum because of added momentum | TRUE ✅ |
| Adam moves past the minimum because of added momentum | FALSE ❌ |
| Adam doesn't move past minimum because of adaptive learning rate | TRUE ✅ |
| MGD is closer to the minimum than Adam, after 10 iterations | TRUE ✅ |
| Adam is closer to the minimum than MGD, after 10 iterations | FALSE ❌ |
MGD = race car — goes fast, brakes late | Adam = smart car — adjusts its speed automatically
📘 Types of Schedulers
1. Step Decay
After every N epochs, multiply the learning rate by a factor.
η = η × 0.1 (every 30 epochs)
2. Exponential Decay
Decays exponentially.
η = η₀ × e^(-kt)
3. Cyclical LR (CLR)
The learning rate keeps cycling up and down within a range. Helps escape saddle points!
4. Cosine Annealing
ηt = ηmin + (ηmax-ηmin)/2 × (1 + cos(π×t/T))
After T iterations, start again from ηmax → "Warm Restart"
🎡 Analogy: CLR = a swing — goes up, comes down, then goes up again!
A saddle point is like a plateau — a flat region where plain GD gets stuck; CLR can escape it!
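Step decay and cosine annealing with warm restarts, written out as a sketch (ηmax = 0.1, ηmin = 0.001 and T = 50 are example values I picked, not from the notes):

```python
import math

def step_decay(eta0, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def cosine_annealing(t, eta_max=0.1, eta_min=0.001, T=50):
    """eta_t = eta_min + (eta_max - eta_min)/2 * (1 + cos(pi*t/T)); t % T gives the warm restart."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * (t % T) / T))
```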
📘 Simple Explanation
Adam (L2 norm)
vt = β₂vt-1 + (1-β₂)(∇wt)²
AdaMax (Max norm)
vt = max(β₂vt-1, |∇wt|)
Key Advantage:
• The max norm is not biased towards zero → bias correction NOT needed for vt!
• Better for sparse features!
• Bias correction is needed only for mt (same as in Adam)
AdaMax = Maximum respect for biggest gradient!
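A sketch of one AdaMax step: vt uses the max norm, so there is no square root in the update and no bias correction for vt (the step size 0.002 is a commonly quoted default, treat it as an assumption):

```python
def adamax_step(w, grad, m, v, t, eta=0.002, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaMax update; t starts at 1."""
    m = b1 * m + (1 - b1) * grad       # momentum, same as Adam
    m_hat = m / (1 - b1 ** t)          # bias correction only for m
    v = max(b2 * v, abs(grad))         # max norm: not biased towards zero
    w = w - (eta / (v + eps)) * m_hat  # no sqrt, no bias correction for v
    return w, m, v
```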
📊 Master Comparison Table — TAKE A PHOTO OF THIS!
| Algorithm | vt Formula | η₀ Needed? | Bias Correction? | LR Trend | Best For |
| AdaGrad | vt-1 + (∇wt)² | YES | NO | Only ↓ | Sparse data |
| RMSProp | β·vt-1 + (1-β)(∇wt)² | YES | NO | ↑ or ↓ | Non-stationary |
| AdaDelta | Same as RMSProp | NO ✨ | NO | ↑ or ↓ | General use |
| Adam | mt + vt (both) | YES | YES (both) | ↑ or ↓ | Default choice 🏆 |
| AdaMax | max(β₂vt-1, |∇wt|) | YES | Only mt | ↑ or ↓ | Sparse features |
Q1 · 3M
Contour Plot Question — Mark True/False
In the figure, the ellipse is elongated along the w₁ axis. Which statements are TRUE?
(a) x₁ is sparse, x₂ is dense (b) x₁ is dense, x₂ is sparse (c) w₁ gets more updates than w₂
Answer: (b) and (c) ✅
Explanation:
- The ellipse is elongated along the w₁ axis → w₁ receives more updates
- More updates = feature x₁ is DENSE
- Fewer updates = feature x₂ is SPARSE
- A dense feature's gradient is non-zero more often
- Hence w₁ gets more updates than w₂
Keywords: dense, sparse, gradient updates, non-zero, ellipse elongated
Q2 · 2M
Gradient Descent Trajectory — True or False?
GD is run starting from w₁ = -6, w₂ = 0. The claim is that the trajectory is a straight horizontal line. Is the claim correct?
Answer: FALSE ❌
Explanation:
- In the figure, x₁ is dense and x₂ is sparse
- The dense parameter (w₁) will receive more updates
- The sparse parameter (w₂) will receive fewer updates
- Therefore the trajectory will NOT be straight
- w₁ will move more, w₂ less
- The path will be curved/uneven, not straight
- Because of the dense feature, w₁ will move quickly towards the minimum
Keywords: dense parameter, sparse parameter, unequal updates, curved trajectory
Q3 · 2M
AdaGrad Effective Learning Rate Graph — Which one is correct?
Answer: the monotonically DECREASING graph (it only goes down) ✅
Explanation:
- In AdaGrad: vt = vt-1 + (∇wt)²
- vt keeps INCREASING (a squared term is always added, never subtracted)
- Effective LR = η/√vt
- vt increases → 1/√vt decreases
- Hence the effective learning rate ONLY DECREASES
- It never increases, even if the gradient becomes zero
Graph: Smoothly decreasing curve — starting high, going to near zero
(Select the option with the decaying curve)
Q4 · 5M
Calculate AdaGrad's η₆ — Numerical
∇wt = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56] | v₋₁ = 0, ε = 0, η₋₁ = 0.1 | Find η₆
FORMULA: vt = vt-1 + (∇wt)² | ηt = η₋₁/√vt
t=0: v₀ = 0 + 1² = 1.0000 → η₀ = 0.1/√1.0000 = 0.1000
t=1: v₁ = 1 + 0.81 = 1.8100 → η₁ = 0.1/√1.8100 ≈ 0.0743
t=2: v₂ = 1.81+0.36 = 2.1700 → η₂ = 0.1/√2.1700 ≈ 0.0679
t=3: v₃ = 2.17+0.0001=2.1701 → η₃ = 0.1/√2.1701 ≈ 0.0679
t=4: v₄ = 2.17+0.01 = 2.1801 → η₄ = 0.1/√2.1801 ≈ 0.0677
t=5: v₅ = 2.18+0.04 = 2.2201 → η₅ = 0.1/√2.2201 ≈ 0.0671
t=6: v₆ = 2.22+0.25 = 2.4701 → η₆ = 0.1/√2.4701 ≈ 0.063
FINAL ANSWER: η₆ ≈ 0.063 ✓
EXAM TIP: Don't forget to write the formula! There are marks for the formula too.
Q5 · 3M
RMSProp vs AdaGrad Difference — With Formulas
AdaGrad Update Rule:
vt = vt-1 + (∇wt)²
RMSProp Update Rule:
vt = β×vt-1 + (1-β)×(∇wt)²
Main Differences:
1. DENOMINATOR GROWTH:
AdaGrad: vt always grows (monotonically increasing)
RMSProp: vt can either increase or decrease
2. EFFECTIVE LEARNING RATE:
AdaGrad: only decreases (monotonically decreasing)
RMSProp: can increase, decrease, or stay roughly constant
3. LONG TRAINING:
AdaGrad: after long training, LR → 0 (training stalls!)
RMSProp: the LR stays controlled throughout
4. β PARAMETER:
AdaGrad: no β (or β = 1 implicitly)
RMSProp: β ∈ [0,1), typically β = 0.9
Conclusion:
RMSProp is an improved version of AdaGrad that
solves the "ever-growing denominator" problem.
Q6 · 4M
RMSProp Numerical — Calculate η
β = 0.9, v₋₁ = 0, ε = 0, η₋₁ = 0.1, same gradient sequence ∇wt as in Q4 | Find the first few η values
FORMULA: vt = β×vt-1 + (1-β)×(∇wt)² → β=0.9, (1-β)=0.1
t=0: v₀ = 0.9×0 + 0.1×(1)² = 0.1000 → η₀ = 0.1/√0.1 = 0.316
t=1: v₁ = 0.9×0.1+0.1×(0.9)²= 0.171 → η₁ = 0.1/√0.171 = 0.242
t=2: v₂ = 0.9×0.171+0.1×0.36= 0.1899 → η₂ = 0.1/√0.1899= 0.229
KEY OBSERVATION:
In RMSProp, vt does not grow as much as in AdaGrad
Hence: ηRMSProp > ηAdaGrad (after the same number of iterations)
→ RMSProp does not choke off learning early!
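A quick check of this observation on the same gradient sequence; both denominators are tracked side by side (a sketch):

```python
import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5]
v_ada, v_rms, beta = 0.0, 0.0, 0.9
for t, g in enumerate(grads):
    v_ada += g ** 2                              # AdaGrad: only grows
    v_rms = beta * v_rms + (1 - beta) * g ** 2   # RMSProp: leaky average
    print(t, round(0.1 / math.sqrt(v_ada), 4), round(0.1 / math.sqrt(v_rms), 4))
# at every t the RMSProp rate is larger, e.g. t=2: 0.0679 (AdaGrad) vs 0.2295 (RMSProp)
```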
Q7 · 3M
Adam vs MGD — f(w) = w², 10 Iterations
β₁ = 0.9, β₂ = 0.999, m₋₁ = 0, v₋₁ = 0. Which statements are TRUE?
Answer: (a) TRUE, (c) TRUE, (d) TRUE
(a) TRUE - MGD moves past minimum:
Because of momentum, even when it reaches the minimum
it still carries velocity → it shoots past the minimum
Momentum coefficient 0.9 → high momentum → overshooting
(b) FALSE - Adam does NOT move past minimum:
Adam's adaptive learning rate adjusts automatically
Near the minimum the gradient is small → so the step size is small too
That is why Adam stays controlled near the minimum
(c) TRUE - Correct explanation for Adam's behavior
(d) TRUE - MGD is closer to minimum after 10 iterations:
Adam moves slowly and oscillates only slightly near the minimum
MGD, driven by momentum, covers distance much faster (even though it overshoots)
After 10 iterations, MGD actually ends up quite close to the minimum
Note: you can verify this by writing code (the question itself hints at it); see the sketch below
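A minimal verification script along the lines of that hint. The starting point w = 1, the MGD learning rate 0.1 and Adam's default η = 0.001 are my assumptions, since the question does not pin them down; with these settings MGD overshoots past 0 but still finishes far closer to the minimum than Adam:

```python
import math

def run_mgd(w=1.0, eta=0.1, beta=0.9, steps=10):
    """Momentum GD on f(w) = w^2 (one common convention: u = beta*u + eta*grad)."""
    u = 0.0
    for _ in range(steps):
        g = 2 * w              # gradient of w^2
        u = beta * u + eta * g
        w = w - u              # overshoots: w goes negative after a few steps
    return w

def run_adam(w=1.0, eta=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=10):
    """Adam on f(w) = w^2 with the default step size."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2 * w
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        w = w - eta * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return w

print(run_mgd(), run_adam())   # roughly 0.004 vs 0.99: MGD ends much closer to w* = 0
```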
Q8 · 2M
L(w,b) = 0.5w² + 5b² + 1 — Find L(w*, b*)
To find minimum, partial derivatives set to zero:
∂L/∂w = 0.5 × 2w = w = 0 → w* = 0
∂L/∂b = 5 × 2b = 10b = 0 → b* = 0
L(w*, b*) = 0.5×(0)² + 5×(0)² + 1
= 0 + 0 + 1
= 1
Answer: L(w*, b*) = 1 ✓
Note: the minimum is at (0, 0) — this will be at the centre of
the contour plot. The constant term 1 is why the answer is 1.
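A two-line numeric check of this answer (a sketch; it just compares L(0, 0) with the loss on a small grid around the origin):

```python
def L(w, b):
    return 0.5 * w ** 2 + 5 * b ** 2 + 1

grid = [x / 10 for x in range(-20, 21)]
print(L(0, 0), min(L(w, b) for w in grid for b in grid))   # both print 1.0: (0, 0) is the minimum
```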
Q9 · 3M
Effects of L2 Regularization — True/False
Answer: (a) TRUE, (b) TRUE, (c) TRUE, (d) FALSE, (e) FALSE
(a) TRUE - Training error INCREASES:
L2 adds penalty term λΣw² to loss function
Model cannot fit training data perfectly
So training error increases a little (by design!)
(b) TRUE - Test error DECREASES:
L2 prevents overfitting by penalizing large weights
Smaller weights → better generalization on unseen data
Hence test/validation error decreases
(c) TRUE - Model complexity REDUCES:
L2 encourages smaller weights throughout
Less important weights shrink close to zero
Simpler model = less complexity = better generalization
(d) FALSE - Weights NOT exactly zero:
L2 = pulls weights CLOSE to zero (shrinkage)
L1 = drives weights EXACTLY to zero (sparsity)
This is the KEY difference!
(e) FALSE - Training error does NOT decrease with L2
Formula: L2 Loss = Original Loss + λ × Σwᵢ²
Q10 · 4M
Bias Correction in Adam — Mathematical Justification
REASON FOR BIAS CORRECTION IN ADAM:
1. PROBLEM (Initialization Bias):
We initialize m₀ = 0 and v₀ = 0
m₁ = β₁×0 + (1-β₁)×∇w₁ = 0.1×∇w₁ ← far too small!
2. EFFECT WITHOUT BIAS CORRECTION:
v₀ = 0.001×(0.1)² = 0.00001
η_eff = 1/√0.00001 = 316 ← HUGE → Unstable!
3. MATHEMATICAL PROOF:
mt = (1-β₁) × Σ(β₁^(t-τ) × ∇wτ)
Taking expectation (assuming stationary distribution):
E[mt] = E[∇w] × (1 - β₁ᵗ) [Sum of GP with ratio β₁]
We want E[m̂t] = E[∇w]
∴ m̂t = mt / (1 - β₁ᵗ) ← This is Bias Correction!
4. EFFECT WITH BIAS CORRECTION:
v̂₀ = 0.00001/0.001 = 0.01
η_eff = 1/√0.01 = 10 ← Controlled! ✓
CONCLUSION:
Bias correction keeps the initial steps stable and makes
the exponential moving average an unbiased estimate.
"A lack of initialization bias correction would lead
to initial steps that are much larger" — Adam paper
Q11 · 3M
AdaDelta — Why Is No Initial η₀ Needed?
AdaDelta Algorithm:
1. vt = β×vt-1 + (1-β)×(∇wt)² ← past gradients
2. Δwt = -(√(ut-1+ε)/√(vt+ε)) × ∇wt ← weight update
3. wt+1 = wt + Δwt
4. ut = β×ut-1 + (1-β)×(Δwt)² ← past weight changes
WHY NO INITIAL LEARNING RATE NEEDED:
In RMSProp: Δwt = -(η₀/√vt) × ∇wt
→ η₀ is a constant that must be set manually
In AdaDelta: Δwt = -(√ut-1/√vt) × ∇wt
→ Numerator = RMS of past weight changes
→ Denominator = RMS of past gradients
→ The ratio automatically gives a meaningful scale!
→ No manual η₀ required!
Key Insight:
- ut tracks history of weight UPDATES (Δwt)
- vt tracks history of GRADIENTS (∇wt)
- √ut/√vt acts as adaptive learning rate automatically
- This ratio is in correct units/scale automatically
Benefit: Not sensitive to initial conditions, works across
different problems without manual tuning of η₀
Q12 · 2M
CLR / Cosine Annealing — Why Is It Useful?
Cyclical Learning Rate (CLR) is useful because:
1. SADDLE POINT ESCAPE:
- Loss surfaces contain saddle points (flat areas)
- A monotonically decreasing LR gets STUCK at a saddle point
- In CLR the LR also increases → ESCAPE from the saddle point becomes possible!
2. BETTER EXPLORATION + FINE-TUNING:
- At a high LR: wide exploration (finding new regions)
- At a low LR: fine-tuning near the minimum
- You get both benefits at once!
3. FORMULA (Triangular CLR):
ηt = ηmin + (ηmax-ηmin) × max(0, 1 - |t/μ - 2⌊1 + t/(2μ)⌋ + 1|)
μ = step size (half the length of one cycle)
4. COSINE ANNEALING (Warm Restart):
ηt = ηmin + (ηmax-ηmin)/2 × (1 + cos(π×t/T))
After T iterations, restart abruptly from ηmax → "Warm Restart"
CONCLUSION: CLR is especially helpful when the loss surface
has saddle points or plateaus. Monotonic decay
fails there.
📖 Important Definitions — Copy These in the Exam!
1. Sparse Feature:
"A feature is called sparse if its value is zero for most training examples, resulting in infrequent gradient updates for the corresponding weight."
2. Adaptive Learning Rate:
"A learning rate that automatically adjusts for each parameter based on the history of its gradients, giving larger updates to infrequent parameters and smaller updates to frequent ones."
3. Adam Optimizer:
"Adam (Adaptive Moment Estimation) combines momentum and RMSProp, maintaining exponentially decaying averages of past gradients (mt) and squared gradients (vt), with bias correction to ensure unbiased estimates."
4. Bias Correction:
"The process of dividing the moving average estimates by (1-βᵗ) to correct the initialization bias towards zero, ensuring the expected value equals the true expected gradient."
5. L2 Regularization:
"A regularization technique that adds λ×Σw² penalty to the loss function, preventing overfitting by discouraging large weights, thereby improving generalization on unseen data."
📐 ALL FORMULAS — IN ONE PLACE
╔══════════════════════════════════════════════════════════╗
║ ALL OPTIMIZER FORMULAS ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ ADAGRAD: ║
║ vt = vt-1 + (∇wt)² ║
║ wt+1 = wt - (η/√(vt+ε)) × ∇wt ║
║ ║
║ RMSPROP: ║
║ vt = β×vt-1 + (1-β)×(∇wt)² ║
║ wt+1 = wt - (η/√(vt+ε)) × ∇wt ║
║ ║
║ ADADELTA: ║
║ vt = β×vt-1 + (1-β)×(∇wt)² ║
║ Δwt = -(√(ut-1+ε)/√(vt+ε)) × ∇wt ║
║ wt+1 = wt + Δwt ║
║ ut = β×ut-1 + (1-β)×(Δwt)² ║
║ ║
║ ADAM: ║
║ mt = β₁×mt-1 + (1-β₁)×∇wt ║
║ m̂t = mt/(1-β₁ᵗ) ← BIAS CORRECTION (m) ║
║ vt = β₂×vt-1 + (1-β₂)×(∇wt)² ║
║ v̂t = vt/(1-β₂ᵗ) ← BIAS CORRECTION (v) ║
║ wt+1 = wt - (η/√(v̂t+ε)) × m̂t ║
║ ║
║ ADAMAX: ║
║ mt = β₁×mt-1 + (1-β₁)×∇wt ║
║ m̂t = mt/(1-β₁ᵗ) ← BIAS CORRECTION (only m) ║
║ vt = max(β₂×vt-1, |∇wt|) ← MAX NORM (no BC!) ║
║ wt+1 = wt - (η/(vt+ε)) × m̂t ║
║ ║
╠══════════════════════════════════════════════════════════╣
║ DEFAULT VALUES (ALWAYS REMEMBER!):                          ║
║ β₁ = 0.9 │ β₂ = 0.999 │ ε = 1e-8 │ η = 0.001 ║
╚══════════════════════════════════════════════════════════╝
📊 Contour Plot — 30 Second Rule
╔══════════════════════════════════════════╗
║ ELLIPSE ELONGATED along the w₁ axis      ║
║          ↓                               ║
║ w₁ receives MORE updates                 ║
║          ↓                               ║
║ Feature x₁ = DENSE                       ║
║ Feature x₂ = SPARSE                      ║
╠══════════════════════════════════════════╣
║ ELLIPSE ELONGATED along the w₂ axis      ║
║          ↓                               ║
║ w₂ receives MORE updates                 ║
║          ↓                               ║
║ Feature x₁ = SPARSE                      ║
║ Feature x₂ = DENSE                       ║
╚══════════════════════════════════════════╝
⚡ True/False Quick Reference
| Statement | Answer |
| AdaGrad effective LR monotonically decreases | TRUE ✅ |
| RMSProp LR can increase | TRUE ✅ |
| AdaDelta needs initial learning rate | FALSE ❌ |
| Adam uses bias correction for both mt and vt | TRUE ✅ |
| AdaMax needs bias correction for vt | FALSE ❌ |
| L2 regularization increases training error | TRUE ✅ |
| L2 regularization decreases test error | TRUE ✅ |
| L2 drives weights exactly to zero | FALSE ❌ |
| L1 drives weights exactly to zero | TRUE ✅ |
| Sparse feature gets more gradient updates | FALSE ❌ |
| Dense feature corresponds to elongated ellipse axis | TRUE ✅ |
| MGD overshoots minimum due to momentum | TRUE ✅ |
| Adam overshoots minimum | FALSE ❌ |
| CLR helps escape saddle points | TRUE ✅ |
| vt in AdaGrad never decreases | TRUE ✅ |
🔑 5 Golden Rules — They Will Always Come in Handy!
RULE 1: "ELONGATED ELLIPSE = DENSE FEATURE" — in a contour plot, the feature on the axis along which the ellipse is elongated is DENSE
RULE 2: "ADAGRAD ONLY SHRINKS" — in AdaGrad the effective LR only decreases, it never increases
RULE 3: "RMSPROP = ADAGRAD + FORGETTING" — RMSProp uses an exponential moving average and does not remember the entire past
RULE 4: "ADAM = MOMENTUM + RMSPROP + BIAS FIX" — Adam = mt (momentum) + vt (RMSProp) + bias correction for both
RULE 5: "L2 SHRINKS, L1 KILLS" — L2 pulls weights CLOSE to zero | L1 sets weights EXACTLY to zero
🧠 Memory Tricks — The A-R-D-A-M Method
| Letter | Algorithm | Trick |
| A | AdaGrad | "Ambles along, then grinds to a halt" — the LR decays to zero |
| R | RMSProp | "Refuses to stop! Stays flexible" — the LR can fluctuate |
| D | AdaDelta | "Decides everything by itself" — no η₀ needed |
| A | Adam | "All-rounder! The combination of everything!" — default choice 🏆 |
| M | AdaMax | "Maximum respect for biggest gradient!" — max norm |
🚨 Common Mistakes — Don't Make These!
❌ "The learning rate can increase in AdaGrad"
✅ "In AdaGrad the LR only decreases (vt always grows)"
❌ "L2 regularization drives weights exactly to zero"
✅ "L1 drives them exactly to zero; L2 only close to zero (shrinkage)"
❌ "In Adam, bias correction is only for mt"
✅ "Bias correction is for BOTH mt and vt (both get divided)"
❌ "AdaMax also needs bias correction for vt"
✅ "The max norm is not biased towards zero, so BC is needed only for mt"
❌ Ignoring β = 0.9 in a numerical (i.e. using the AdaGrad formula by mistake)
✅ Check first: is it AdaGrad (no β) or RMSProp (β present)?
❌ "Elongated ellipse = sparse feature"
✅ "Elongated ellipse = DENSE feature (it gets more updates)"
🏆 Predicted Exam Pattern
📊 Q1-Q2: Contour plot — identify sparse/dense
3 Marks
📉 Q3: Identify the effective LR graph
2 Marks
🔢 Q4: Calculate AdaGrad's η₆ (numerical)
5 Marks
🔢 Q5-Q6: RMSProp numerical
4 Marks
⚡ Q7: Adam vs MGD comparison
3 Marks
📐 Q8: Find the minimum L(w*, b*)
2 Marks
✅ Q9: L2 regularization effects True/False
3 Marks
📖 Q10: Bias correction explanation
4 Marks
Total: ~26 Marks Coverage in Preparation! 🎯
✍️ Exam Answer Writing Tips
📝 Formula First
In every numerical, write the formula first. There are marks for the formula too!
📊 Step by Step
Write out t=0, t=1, t=2... clearly. Show the intermediate values.
🎯 Conclusion
"Therefore η₆ = 0.063" ya "Hence AdaGrad better for sparse" type conclusion hamesha likho.
🔑 Keywords
Use the keywords: dense, sparse, adaptive, monotonically, bias correction, exponential moving average.
✏️ Diagram
Sketch the contour plot wherever possible. Show the trajectory with arrows.
⏱️ Easy Ones First
Do the easy/short questions first. Move ahead with confidence!
✅ Last 30 Minutes Revision Checklist
- Do you remember the contour plot rule? (ELONGATED = DENSE)
- The AdaGrad formula? (vt = vt-1 + grad²)
- The RMSProp formula? (vt = β×vt-1 + (1-β)×grad²)
- Adam's 5 steps? (mt, m̂t, vt, v̂t, update)
- The default values? (β₁=0.9, β₂=0.999, ε=1e-8)
- Why bias correction? (Initial bias towards zero; prevents overly large initial steps)
- L2 vs L1 difference? (Close to 0 vs Exactly 0)
- AdaDelta — why no η₀? (Numerator = history of weight changes)
- Why is CLR useful? (Escaping saddle points)
- Can you solve the numericals? (η = η₋₁/√vt)