Deep Learning: Complete Cheatsheet
Biological Neuron → MP Neuron → Perceptron → MLP → Sigmoid → FFNN → Backprop → Gradient Descent
For a full 100/100 (Hinglish edition)
1 Biological Neuron & MP Neuron
Biological Neuron: a tree-like structure
| Part | Function |
| Dendrite | Receives signals (the ears) |
| Soma | Processes them (the brain) |
| Axon | Sends the signal onward (the mouth) |
| Synapse | Connection point (the telephone wire) |
Tip: the brain has about 86 billion neurons, and they all work in parallel!
McCulloch-Pitts (MP) Neuron (1943)
Step 1: g(x) adds the inputs
g(x) = x₁ + x₂ + x₃ + ... + xₙ
Step 2: f compares with the threshold θ
Output = 1 if g(x) ≥ θ (fires!)
Output = 0 if g(x) < θ (stays silent!)
- Excitatory input = a normal vote
- Inhibitory input = VETO power: if it is 1, the output is automatically 0
AND / OR Gate Examples
AND Gate: set θ = 2
| x₁ | x₂ | Sum | Output |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 1 | 2 | 1 ✓ |
OR Gate: set θ = 1
Fire whenever at least one input = 1
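A minimal Python sketch of the MP neuron (the function name and the inhibitory-input handling are illustrative assumptions, following the aggregate-then-threshold rule above):

```python
def mp_neuron(inputs, theta, inhibitory=None):
    """McCulloch-Pitts neuron: sum the binary inputs and compare with threshold theta.
    Any active inhibitory input vetoes the output to 0."""
    if inhibitory and any(inputs[i] == 1 for i in inhibitory):
        return 0                      # veto: an active inhibitory input forces output 0
    return 1 if sum(inputs) >= theta else 0

# AND gate: fires only when both inputs are 1 (theta = 2)
print([mp_neuron([a, b], theta=2) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
# OR gate: fires when at least one input is 1 (theta = 1)
print([mp_neuron([a, b], theta=1) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
```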
2 Linear Separability & XOR Problem
What is Linear Separability?
If one straight line can separate the 1s from the 0s, the problem is linearly separable.
AND: separable
(0,1)=0   (1,1)=1
(0,0)=0   (1,0)=0
A diagonal line can separate them!
XOR: NOT separable
(0,1)=1   (1,1)=0
(0,0)=0   (1,0)=1
The 1s sit on opposite diagonal corners!
Key rule: a single MP neuron / perceptron can only solve linearly separable problems!
XOR Problem: the reason for the AI Winter
| x₁ | x₂ | XOR Output |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |
XOR = "one or the other, but not both"
Solution: use multiple layers. That is exactly what an MLP is!
- Minsky & Papert proved in the 1960s that a single perceptron can't solve XOR
- This led to the "AI Winter": funding dried up!
3 Perceptron: the Upgraded Neuron (1958)
Perceptron Formula: weights are introduced!
Main formula
y = 1 if Σ (wᵢ × xᵢ) ≥ θ, else 0 (sum over i = 1 to n)
Bias form: fold θ into w₀ (w₀ = -θ, x₀ = 1)
y = 1 if w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ ≥ 0
Vector form
y = 1 if wᵀx ≥ 0 → side 1 (output = 1)
y = 0 if wᵀx < 0 → side 2 (output = 0)
Dividing line: wᵀx = 0 (the weight vector is ⊥ to this line!)
Tip: the bias (w₀) is the judge's pre-existing opinion: it pushes the output one way before the input is even seen!
MP Neuron vs Perceptron
| Feature | MP Neuron | Perceptron |
| Inputs | 0 or 1 | Any real number |
| Weights | All equal to 1 | Learnable, can differ |
| Threshold | Fixed, set by hand | Learned as a bias |
| Flexibility | Low | Higher ✓ |
4 Perceptron Learning Algorithm: it teaches itself!
Algorithm Steps
1. Initialize the weights randomly
2. Pick a random example x
3. If x belongs to the positive class (y = 1) but we predicted 0:
   w = w + x (pull w towards x)
4. If x belongs to the negative class (y = 0) but we predicted 1:
   w = w - x (push w away from x)
5. Repeat until every example is classified correctly
Update Rules Summary
Predicted 0, should be 1: w = w + x
Predicted 1, should be 0: w = w - x
Geometry: why does w = w + x work?
The angle between w and x determines the output:
- angle < 90° → dot product > 0 → output = 1
- angle > 90° → dot product < 0 → output = 0
Adding x to w decreases the angle between w and x, so w turns towards x and the output becomes 1!
Convergence Theorem: if the data is linearly separable, the algorithm is GUARANTEED to converge in a finite number of steps!
Warning: if the data is not linearly separable, the algorithm will run forever!
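A small NumPy sketch of the rule above (the function name, the bias-column convention, and the epoch cap are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100, seed=0):
    """Perceptron learning rule: w += x on a false negative, w -= x on a false positive.
    X is assumed to already contain a constant 1 column for the bias."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])             # step 1: random initialization
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(X)):       # step 2: visit examples in random order
            pred = 1 if w @ X[i] >= 0 else 0
            if pred == 0 and y[i] == 1:
                w = w + X[i]; mistakes += 1     # pull w towards x
            elif pred == 1 and y[i] == 0:
                w = w - X[i]; mistakes += 1     # push w away from x
        if mistakes == 0:                       # step 5: stop once everything is correct
            break
    return w

# AND gate data, with a leading 1 column acting as the bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([1 if w @ xi >= 0 else 0 for xi in X])    # expected: [0, 0, 0, 1]
```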
5 Multi-Layer Perceptron (MLP): the solution to XOR!
MLP Structure
Input layer → raw data (x₁, x₂, ...), no processing
Hidden layers → intermediate processing, the "thinking" layers
Output layer → the final answer
For XOR: 4 hidden neurons
h₁ fires when x₁ = -1, x₂ = -1 → case (0,0)
h₂ fires when x₁ = +1, x₂ = -1 → case (1,0) ✓
h₃ fires when x₁ = -1, x₂ = +1 → case (0,1) ✓
h₄ fires when x₁ = +1, x₂ = +1 → case (1,1)
Output weights: w₁ = 0, w₂ = +1, w₃ = +1, w₄ = 0 → XOR SOLVED!
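A small sketch of this construction (the hidden-unit thresholds used here are an assumption chosen so that each hidden perceptron fires for exactly one input pattern; the output weights are the 0, +1, +1, 0 given above):

```python
def perceptron(x, w, theta):
    """Threshold unit: fire (1) when the weighted sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# One hidden perceptron per input pattern: weight +1 where the pattern has a 1, -1 where it has a 0,
# threshold = number of +1 weights, so each hidden unit fires for exactly one of (0,0),(1,0),(0,1),(1,1).
hidden = [((-1, -1), 0), ((+1, -1), 1), ((-1, +1), 1), ((+1, +1), 2)]   # h1, h2, h3, h4
out_w, out_theta = (0, 1, 1, 0), 1                                      # only h2 and h3 reach the output

def xor_mlp(x1, x2):
    h = [perceptron((x1, x2), w, t) for w, t in hidden]
    return perceptron(h, out_w, out_theta)

print([xor_mlp(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])     # [0, 1, 1, 0]
```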
Representation Power: the BIG theorem!
Theorem: any Boolean function of n inputs can be represented by a network with:
Hidden layer = 2ⁿ perceptrons
Output layer = 1 perceptron
| Inputs (n) | Hidden neurons (2ⁿ) |
| 2 | 4 |
| 5 | 32 |
| 10 | 1,024 |
| 20 | 1,048,576 (!) |
The catch: as n grows, the number of neurons grows exponentially. This is exactly the motivation for Deep Learning!
6 Sigmoid Neuron: smooth learning!
The problem with the step function
- Problem 1: a tiny change → a huge jump (49 marks = FAIL, 50 marks = PASS: unfair!)
- Problem 2: zero gradient = no learning possible!
- Step function = an old light switch (ON/OFF)
- Sigmoid = a modern dimmer (changes gradually)
Sigmoid Formula
Full form
σ(z) = 1 / (1 + e⁻ᶻ)
where z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
Properties
Output range: (0, 1)
When z = 0 → σ = 0.5 (uncertain!)
When z → +∞ → σ → 1 (confident YES)
When z → -∞ → σ → 0 (confident NO)
Sigmoid as a Probability!
| Output | Meaning |
| 0.95 | 95% chance → it's spam! |
| 0.50 | 50-50 → can't tell |
| 0.03 | 3% chance → it's safe ✓ |
The sigmoid is smooth and differentiable → the gradient exists everywhere → learning can happen!
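A minimal sketch of the sigmoid and its behaviour at the values listed above (illustrative code, not from the source):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: smooth, differentiable, output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                        # 0.5 -> maximally uncertain
print(sigmoid(np.array([-5.0, 5.0])))      # ~[0.007, 0.993] -> confident NO / confident YES
print(sigmoid(2.0) * (1 - sigmoid(2.0)))   # derivative sigma*(1-sigma) at z=2, ~0.105 (nonzero!)
```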
7 The 4 Components of ML: the supervised learning setup
The 4 components (explained with a house-price example)
D: DATA → training examples {xᵢ, yᵢ}; x = features (size, rooms), y = price
M: MODEL → a mathematical guess: ŷ = f̂(x; θ)
A: LEARNING ALGORITHM → how to find the best parameters → Gradient Descent!
L: LOSS FUNCTION → how wrong the prediction is → MSE!
Model Types
Linear: ŷ = wᵀx
Sigmoid: ŷ = 1/(1 + e^(-wᵀx))
Quadratic: ŷ = xᵀWx
Loss Function: MSE (Mean Squared Error)
MSE formula
L = (1/N) × Σ (ŷᵢ - yᵢ)² (sum over i = 1 to N)
Why square? 3 reasons:
- -3 and +3 no longer cancel out (fixes negative errors)
- Big errors are penalized more (error 2→4, error 4→16!)
- Mathematically easy to differentiate
Example
Actual: [45, 72, 28]
Predicted: [42, 78, 27]
Squared errors: [9, 36, 1]
MSE = (9+36+1)/3 = 15.33 → lower is better!
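The same example in code (a tiny sketch; the function name is illustrative):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: average of the squared differences."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.mean((y_pred - y_true) ** 2)

print(mse([42, 78, 27], [45, 72, 28]))   # (9 + 36 + 1) / 3 = 15.33...
```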
8 Gradient Descent: from the mountain to the valley!
Intuition + math derivation
You are blindfolded on a mountain and need to reach the valley below → feel the slope → take a step downhill → repeat!
Taylor series (first-order approximation)
L(θ + ηu) ≈ L(θ) + η × uᵀ × ∇θL(θ)
To minimize, we need the direction u:
uᵀ∇L = ||u|| ||∇L|| cos β
This is most negative when cos β = -1, i.e. β = 180°
⇒ u = -∇θL(θ) (the OPPOSITE direction to the gradient!)
Parameter Update Rules
w_(t+1) = w_t - η × ∇w_t
b_(t+1) = b_t - η × ∇b_t
η (eta) = learning rate (the step size)
The minus sign matters: we move in the direction opposite to the gradient (downhill)!
Gradients for a Sigmoid Neuron (MSE loss)
Gradient w.r.t. the weight
∇w = ∂L/∂w = (f(x) - y) × f(x) × (1 - f(x)) × x
Gradient w.r.t. the bias
∇b = ∂L/∂b = (f(x) - y) × f(x) × (1 - f(x))
| Term | Meaning |
| (f(x) - y) | How wrong was the prediction? |
| f(x)(1-f(x)) | Sigmoid derivative (slope) |
| x | Which input is involved? |
Learning rate η: the Goldilocks zone
η too large → overshoot, diverge (you fall off the mountain)
η too small → very slow (you crawl along)
η just right → fast convergence ✓
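A minimal sketch of batch gradient descent for a single sigmoid neuron using exactly the gradient expressions above (the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_sigmoid_neuron(X, Y, eta=0.5, epochs=1000):
    """Gradient descent on one sigmoid neuron with squared-error loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            fx = sigmoid(w * x + b)
            dw += (fx - y) * fx * (1 - fx) * x   # (f(x) - y) * f(x) * (1 - f(x)) * x
            db += (fx - y) * fx * (1 - fx)       # same, without the trailing x
        w -= eta * dw                            # step opposite to the gradient
        b -= eta * db
    return w, b

X, Y = [0.5, 2.5], [0.2, 0.9]                    # toy 1-D data
w, b = gd_sigmoid_neuron(X, Y)
print(sigmoid(w * 0.5 + b), sigmoid(w * 2.5 + b))  # predictions move towards 0.2 and 0.9
```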
9 Universal Approximation Theorem
The big theorem: any function!
Theorem: a multilayer network of sigmoid neurons with a single hidden layer can approximate any continuous function to any desired precision!
- Face recognition
- Weather prediction
- Medical diagnosis
- Language translation
LEGO analogy: individual LEGO bricks are rectangular, yet with enough bricks you can build any shape. In the same way, an individual sigmoid is just S-shaped, but combine enough of them and you can approximate any function!
Individual sigmoids are the building blocks; combined, they can approximate any target function!
This is the foundation of Deep Learning!
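A tiny illustration of the "combine sigmoids" idea: subtracting two steep sigmoids gives a "tower" that is 1 on an interval and 0 elsewhere, and sums of such towers can approximate any continuous 1-D function (this tower construction is a standard illustration, not taken from the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.5, 2.5])
tower = sigmoid(100 * (x - 1)) - sigmoid(100 * (x - 2))   # steep rise at 1, steep fall at 2
print(np.round(tower, 2))                                  # [0. 1. 0.] -> 1 inside (1, 2), 0 outside
```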
10 Feed Forward Neural Network (FFNN): an assembly line!
FFNN Structure: every layer performs 2 operations
Operation 1: pre-activation aᵢ (linear transformation)
aᵢ(x) = bᵢ + Wᵢ × hᵢ₋₁(x)
Operation 2: activation hᵢ (non-linear)
hᵢ(x) = g(aᵢ(x))
Complete forward pass (3-layer network)
ŷ = O(W₃ · g(W₂ · g(W₁x + b₁) + b₂) + b₃)
Parameters θ (all weights + biases)
θ = {W₁, W₂, ..., W_L, b₁, b₂, ..., b_L}
Tip: the input layer is h₀ = x (no processing!)
Why is non-linearity essential?
Without activations, everything collapses to linear!
h₁ = W₁x + b₁
h₂ = W₂(W₁x + b₁) + b₂
   = (W₂W₁)x + (W₂b₁ + b₂)
→ still JUST ONE LINEAR LAYER!
With activations: non-linear magic! ✓
h₁ = g(W₁x + b₁) → non-linear!
h₂ = g(W₂h₁ + b₂) → a non-linear combination!
Sigmoid: g(z) = 1/(1 + e⁻ᶻ) → output (0,1)
tanh: g(z) = tanh(z) → output (-1,1)
ReLU: g(z) = max(0,z) → output [0,∞)
Weight matrix dimensions
Wᵢ shape: m × n (m = neurons in layer i, n = neurons in the previous layer)
bᵢ shape: m × 1 (one bias per neuron)
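A short sketch of the forward pass (layer sizes, initialization, and the linear output layer are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs, g=sigmoid):
    """At each layer: pre-activation a = W h_prev + b, then activation h = g(a).
    The final layer is kept linear here (a regression-style output)."""
    h = x                                    # h0 = x: the input layer does no processing
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = W @ h + b                        # linear transformation
        h = g(a)                             # non-linearity
    return Ws[-1] @ h + bs[-1]               # linear output layer

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]                         # 3 inputs, two hidden layers of 4, 1 output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]   # each W_i is m x n
bs = [np.zeros(m) for m in sizes[1:]]        # one bias per neuron
print(forward(rng.normal(size=3), Ws, bs))
```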
11 Output Functions & Loss Functions: the right tool for the right job!
Regression: predict real values
Output activation: linear
f(x) = W_O × a_L + b_O
(No squishing: any real number is allowed!)
Loss: MSE
L(θ) = (1/N) × Σᵢ Σⱼ (ŷᵢⱼ - yᵢⱼ)²
- House price: ₹45,73,291
- Temperature: 27.3°C
- Stock price: $142.67
Classification: predict categories
Output activation: Softmax
ŷⱼ = e^(a_L,j) / Σᵢ e^(a_L,i)
Properties:
- 0 < ŷⱼ < 1 ✓
- Σ ŷⱼ = 1 ✓
(a valid probability distribution!)
Example: [Dog=3.0, Cat=1.0, Bird=0.2]
e^3.0 = 20.09, e^1.0 = 2.72, e^0.2 = 1.22
Sum = 24.03
Dog = 83.6%, Cat = 11.3%, Bird = 5.1% ✓
Cross Entropy Loss
Full formula
L(θ) = -(1/N) × Σᵢ Σⱼ [yᵢⱼ log(ŷᵢⱼ) + (1 - yᵢⱼ) log(1 - ŷᵢⱼ)]
With one-hot labels it simplifies to
L = -log(ŷₗ) (l = the true class)
| ŷ for the true class | Loss -log(ŷ) |
| 0.99 ✓ | 0.01 (tiny) |
| 0.50 | 0.69 (medium) |
| 0.01 ✗ | 4.61 (HUGE!) |
→ Wrong + confident = catastrophic loss!
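The softmax and cross-entropy numbers above, reproduced in a short sketch (the max-subtraction inside softmax is a standard numerical-stability trick, not part of the formula in the text):

```python
import numpy as np

def softmax(a):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(y_hat, true_class):
    """With a one-hot label, cross entropy reduces to -log(probability of the true class)."""
    return -np.log(y_hat[true_class])

scores = np.array([3.0, 1.0, 0.2])           # [Dog, Cat, Bird] pre-activations
probs = softmax(scores)
print(np.round(probs, 3))                     # ~[0.836, 0.113, 0.051]
print(round(cross_entropy(probs, 1), 2))      # if Cat is the true class: -log(0.113) ~ 2.18
```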
Output Selection Decision Tree: what should you use?
Output type: a number (regression)
→ Linear activation + MSE loss
Output type: 2 categories (binary)
→ Sigmoid + Binary Cross Entropy
Output type: more than 2 categories (multi-class)
→ Softmax + Cross Entropy
12 Backpropagation: the blame flows backwards!
Chain Rule: the foundation of backprop
Analogy: you were late → the alarm didn't ring → the phone was on silent → a friend's late-night text: a chain of causes!
Basic chain rule
dy/dx = (dy/dz) × (dz/dx)
Multiple paths (sum over all of them)
∂p/∂z = Σₖ (∂p/∂qₖ) × (∂qₖ/∂z)
Forward vs backward pass
Forward: Input → L1 → L2 → L3 → Loss
Backward: Loss → L3 → L2 → L1 (gradients!)
Tip: backprop divides up the "blame": how much did each weight contribute to the loss?
Backprop Gradients: step by step
Part 1: output layer (softmax + cross entropy)
∂L/∂a_L,i = ŷᵢ - yᵢ
(Predicted probability minus true probability!)
Concrete example: True = Cat, Predicted = [Dog=0.7, Cat=0.2, Bird=0.1]
Gradients (ŷ - y):
Dog: 0.7 - 0 = +0.7 → too high, push DOWN
Cat: 0.2 - 1 = -0.8 → too low, push UP
Bird: 0.1 - 0 = +0.1 → slightly high, push down
Part 2: hidden layer gradient
∂L/∂h_i,j = Σₖ (∂L/∂a_(i+1),k) × W_(i+1),k,j
∂L/∂a_i,j = (∂L/∂h_i,j) × g'(a_i,j)
Part 3: weight & bias gradient
∂L/∂Wᵢ = (∂L/∂aᵢ) × hᵢ₋₁ᵀ
∂L/∂bᵢ = ∂L/∂aᵢ
Activation Derivatives
Sigmoid: g'(z) = σ(z)(1 - σ(z))
Example: σ = 0.8 → g' = 0.8 × 0.2 = 0.16
tanh: g'(z) = 1 - tanh²(z)
Example: tanh = 0.9 → g' = 1 - 0.81 = 0.19
ReLU: g'(z) = 1 if z > 0, else 0
Warning (the sigmoid's problem): when σ → 0 or 1, g' → 0 → vanishing gradient!
Full Backprop Algorithm
1. Forward pass: compute ŷ
2. Compute the loss
3. Output layer gradient: ŷ - y
4. Work backwards through the hidden layers (chain rule)
5. Compute the weight/bias gradients
6. Update θ: θ_new = θ_old - η × ∇θ
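A compact sketch of one full backprop step for a 1-hidden-layer network (sigmoid hidden layer, softmax output, cross-entropy loss), following the formulas above; the layer sizes, toy data, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def backprop_step(x, y, W1, b1, W2, b2, eta=0.1):
    """Forward pass, backward pass, and gradient-descent update for one example."""
    # forward pass
    a1 = W1 @ x + b1; h1 = sigmoid(a1)
    a2 = W2 @ h1 + b2; y_hat = softmax(a2)
    # backward pass
    da2 = y_hat - y                       # output layer: softmax + cross entropy -> (y_hat - y)
    dW2 = np.outer(da2, h1); db2 = da2    # dL/dW = dL/da * h_prev^T, dL/db = dL/da
    dh1 = W2.T @ da2                      # blame flowing back through W2
    da1 = dh1 * h1 * (1 - h1)             # multiply by the sigmoid derivative g'(a1)
    dW1 = np.outer(da1, x); db1 = da1
    # gradient descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # one toy input with 4 features
y = np.array([0.0, 1.0, 0.0])             # one-hot label: true class is index 1 (of 3)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
for _ in range(200):
    W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
print(np.round(softmax(W2 @ sigmoid(W1 @ x + b1) + b2), 2))   # probability of class 1 should be near 1
```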
Big Picture: the basketball analogy
| Basketball | ML Equivalent |
| Throwing technique | Parameters (w, b) |
| Does the ball go in? | Prediction ŷ |
| Basket location | True label y |
| Distance missed by | Loss/Error |
| Coach's feedback | Gradient |
| Size of the adjustment | Learning rate η |
| Practice sessions | Training iterations |
13 Gradient Descent Variants: faster & smarter!
Momentum-Based GD
Regular GD (gets stuck in flat regions!)
w_(t+1) = w_t - η × ∇w_t
Momentum GD (a ball rolling down the hill!)
u_t = β × u_(t-1) + ∇w_t (update direction)
w_(t+1) = w_t - η × u_t
β (beta) = momentum coefficient
β = 0.9 is a common choice
Tip: momentum builds up speed, so progress continues even through flat regions!
- β = 0: regular GD
- β = 0.9: reuse 90% of the previous direction
- β too high: it can overshoot!
Nesterov Accelerated Gradient (NAG)
The "look ahead" strategy
NAG step 1: compute the lookahead point
w_lookahead = w_t - β × u_(t-1)
NAG step 2: compute the gradient at the lookahead point
u_t = β × u_(t-1) + ∇w_lookahead
NAG step 3: update
w_(t+1) = w_t - η × u_t
Momentum → take the gradient at the current point, then jump
NAG → jump first, then take the gradient there: more accurate!
NAG = less oscillation, faster convergence!
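A small sketch of both update rules on a toy 1-D quadratic loss (the loss, learning rate, and iteration count are illustrative assumptions):

```python
def momentum_step(w, u_prev, grad_fn, eta=0.1, beta=0.9):
    """Momentum GD: accumulate an exponentially weighted update direction, then step."""
    u = beta * u_prev + grad_fn(w)        # u_t = beta * u_(t-1) + grad(w_t)
    return w - eta * u, u                 # w_(t+1) = w_t - eta * u_t

def nag_step(w, u_prev, grad_fn, eta=0.1, beta=0.9):
    """NAG: jump to the lookahead point first, take the gradient there, then step."""
    w_look = w - beta * u_prev            # step 1: lookahead point
    u = beta * u_prev + grad_fn(w_look)   # step 2: gradient at the lookahead point
    return w - eta * u, u                 # step 3: update

grad = lambda w: w - 3.0                  # gradient of the toy loss 0.5*(w - 3)^2, minimum at w = 3
for step in (momentum_step, nag_step):
    w, u = 0.0, 0.0
    for _ in range(200):
        w, u = step(w, u, grad)
    print(step.__name__, round(w, 3))     # both should end up very close to 3.0
```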
Batch vs SGD vs Mini-Batch
| Type | Data used per step | Speed |
| Batch GD | Full dataset | Slow per step |
| SGD | 1 sample | Fast per step |
| Mini-Batch | B samples | Best! |
Mini-batch update rule
w_(t+1) = w_t - (η/B) × Σ ∇w_t (sum over the batch)
B = 32 or 64 is the standard choice!
Mini-batch GD = the best balance of speed and accuracy → the default choice in deep learning!
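A minimal mini-batch GD loop (the toy linear-regression gradient, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, eta=0.1, B=32, epochs=50, seed=0):
    """Shuffle each epoch, then update on batches of B examples,
    averaging the per-example gradients (the eta/B factor in the rule above)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))               # reshuffle once per epoch
        for start in range(0, len(X), B):
            idx = order[start:start + B]
            g = sum(grad_fn(w, X[i], y[i]) for i in idx)
            w -= (eta / len(idx)) * g                 # average gradient over the batch
    return w

# per-example gradient of 0.5*(w.x - y)^2 for a toy linear-regression problem
grad = lambda w, x, y: (w @ x - y) * x
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(np.round(minibatch_gd(X, y, grad), 2))          # should approach [ 1. -2.  0.5]
```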
14 Learning Rate Scheduling: start big, end small!
Scheduling Methods
Method 1: Step decay → halve at fixed intervals
Every 5 epochs: η = η / 2
Epochs 1-5: η = 0.1
Epochs 6-10: η = 0.05 (halved)
Epochs 11-15: η = 0.025 (halved again)
Method 2: Exponential decay
η_t = η₀ × e^(-kt)
k = decay rate, t = current step/epoch
Example: η₀ = 0.1, k = 0.1
t=0: η = 0.100
t=10: η = 0.037
t=50: η = 0.001
Method 3: 1/t decay (inverse time)
η_t = η₀ / (1 + k×t)
Stays higher for longer → more exploration!
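The three schedules as small helper functions (names and default values are illustrative):

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=5):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def exp_decay(eta0, t, k=0.1):
    """Exponential decay: eta_t = eta0 * e^(-k*t)."""
    return eta0 * math.exp(-k * t)

def inv_time_decay(eta0, t, k=0.1):
    """1/t decay: eta_t = eta0 / (1 + k*t)."""
    return eta0 / (1 + k * t)

print([round(step_decay(0.1, e), 3) for e in (0, 5, 10)])   # [0.1, 0.05, 0.025]
for t in (0, 10, 50):
    print(t, round(exp_decay(0.1, t), 3), round(inv_time_decay(0.1, t), 3))
# exp decay: 0.1 -> 0.037 -> 0.001 ; 1/t decay stays higher: 0.1 -> 0.05 -> 0.017
```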
Best Combination for Real Projects
In production, use this combination:
Mini-batch (B = 32 or 64)
+ Momentum or NAG
+ Learning rate scheduling
= fast, stable, good convergence!
| Method | Best for |
| Batch GD | Small datasets that fit fully in memory |
| SGD | Online learning |
| Mini-Batch (default) | Everything! |
| Step decay | Common practice |
| Exponential decay | Image classification |
| Line search | Research settings |
MASTER SUMMARY: the whole journey!
Biological Neuron
Dendrite → Soma → Axon → Synapse
MP Neuron (1943)
Binary inputs, fixed threshold, no learning
Perceptron (1958)
Real inputs, learnable weights, only linearly separable problems
MLP
Stack layers → solves XOR → any Boolean function
Sigmoid Neuron
Smooth, probabilistic, differentiable, can learn!
FFNN + Backprop
Forward pass → Loss → Backward pass → Update
Remember for the exam: Data → Model → Loss → Gradient → Update → Repeat! That is Deep Learning!