Deep Learning: Complete Cheatsheet
Biological Neuron → MP Neuron → Perceptron → MLP → Sigmoid → FFNN → Backprop → Gradient Descent
For a full 100/100 (Hinglish edition)
1 Biological Neuron & MP Neuron
Biological Neuron: a tree-like structure
| Part | Function |
| Dendrite | Receives signals (the ears) |
| Soma | Processes them (the brain) |
| Axon | Sends the signal onward (the mouth) |
| Synapse | Connection point (the telephone wire) |
Tip: the brain has about 86 billion neurons, and they all work in parallel!
McCulloch-Pitts (MP) Neuron (1943)
Step 1: g(x) adds the inputs
g(x) = x₁ + x₂ + x₃ + ... + xₙ
Step 2: f compares with the threshold θ
Output = 1 if g(x) ≥ θ (fires!)
Output = 0 if g(x) < θ (stays silent!)
- Excitatory input = a normal vote
- Inhibitory input = VETO power: if it is 1, the output is automatically 0
AND / OR Gate Examples
AND Gate: set θ = 2
| x₁ | x₂ | Sum | Output |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 1 | 2 | 1 ✓ |
OR Gate: set θ = 1
Fire whenever at least one input = 1
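A minimal Python sketch of the MP neuron (the function name and the inhibitory-input handling are illustrative assumptions, following the aggregate-then-threshold rule above):

```python
def mp_neuron(inputs, theta, inhibitory=None):
    """McCulloch-Pitts neuron: sum the binary inputs and compare with threshold theta.
    Any active inhibitory input vetoes the output to 0."""
    if inhibitory and any(inputs[i] == 1 for i in inhibitory):
        return 0                      # veto: an active inhibitory input forces output 0
    return 1 if sum(inputs) >= theta else 0

# AND gate: fires only when both inputs are 1 (theta = 2)
print([mp_neuron([a, b], theta=2) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
# OR gate: fires when at least one input is 1 (theta = 1)
print([mp_neuron([a, b], theta=1) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
```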
2 Linear Separability & XOR Problem
What is Linear Separability?
If one straight line can separate the 1s from the 0s, the problem is linearly separable.
AND: separable
(0,1)=0   (1,1)=1
(0,0)=0   (1,0)=0
A diagonal line can separate them!
XOR: NOT separable
(0,1)=1   (1,1)=0
(0,0)=0   (1,0)=1
The 1s sit on opposite diagonal corners!
Key rule: a single MP neuron / perceptron can only solve linearly separable problems!
XOR Problem: the reason for the AI Winter
| x₁ | x₂ | XOR Output |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |
XOR = "one or the other, but not both"
Solution: use multiple layers. That is exactly what an MLP is!
- Minsky & Papert proved in the 1960s that a single perceptron can't solve XOR
- This led to the "AI Winter": funding dried up!
3 Perceptron: the Upgraded Neuron (1958)
Perceptron Formula: weights are introduced!
Main formula
y = 1 if Σ (wᵢ × xᵢ) ≥ θ, else 0 (sum over i = 1 to n)
Bias form: fold θ into w₀ (w₀ = -θ, x₀ = 1)
y = 1 if w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ ≥ 0
Vector form
y = 1 if wᵀx ≥ 0 → side 1 (output = 1)
y = 0 if wᵀx < 0 → side 2 (output = 0)
Dividing line: wᵀx = 0 (the weight vector is ⊥ to this line!)
Tip: the bias (w₀) is the judge's pre-existing opinion: it pushes the output one way before the input is even seen!
MP Neuron vs Perceptron
| Feature | MP Neuron | Perceptron |
| Inputs | 0 or 1 | Any real number |
| Weights | All equal to 1 | Learnable, can differ |
| Threshold | Fixed, set by hand | Learned as a bias |
| Flexibility | Low | Higher ✓ |
4 Perceptron Learning Algorithm: it teaches itself!
Algorithm Steps
1. Initialize the weights randomly
2. Pick a random example x
3. If x belongs to the positive class (y = 1) but we predicted 0:
   w = w + x (pull w towards x)
4. If x belongs to the negative class (y = 0) but we predicted 1:
   w = w - x (push w away from x)
5. Repeat until every example is classified correctly
Update Rules Summary
Predicted 0, should be 1: w = w + x
Predicted 1, should be 0: w = w - x
Geometry: why does w = w + x work?
The angle between w and x determines the output:
- angle < 90° → dot product > 0 → output = 1
- angle > 90° → dot product < 0 → output = 0
Adding x to w decreases the angle between w and x, so w turns towards x and the output becomes 1!
Convergence Theorem: if the data is linearly separable, the algorithm is GUARANTEED to converge in a finite number of steps!
Warning: if the data is not linearly separable, the algorithm will run forever!
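A small NumPy sketch of the rule above (the function name, the bias-column convention, and the epoch cap are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100, seed=0):
    """Perceptron learning rule: w += x on a false negative, w -= x on a false positive.
    X is assumed to already contain a constant 1 column for the bias."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])             # step 1: random initialization
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(X)):       # step 2: visit examples in random order
            pred = 1 if w @ X[i] >= 0 else 0
            if pred == 0 and y[i] == 1:
                w = w + X[i]; mistakes += 1     # pull w towards x
            elif pred == 1 and y[i] == 0:
                w = w - X[i]; mistakes += 1     # push w away from x
        if mistakes == 0:                       # step 5: stop once everything is correct
            break
    return w

# AND gate data, with a leading 1 column acting as the bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([1 if w @ xi >= 0 else 0 for xi in X])    # expected: [0, 0, 0, 1]
```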
5 Multi-Layer Perceptron (MLP): the solution to XOR!
MLP Structure
Input layer → raw data (x₁, x₂, ...), no processing
Hidden layers → intermediate processing, the "thinking" layers
Output layer → the final answer
For XOR: 4 hidden neurons
h₁ fires when x₁ = -1, x₂ = -1 → case (0,0)
h₂ fires when x₁ = +1, x₂ = -1 → case (1,0) ✓
h₃ fires when x₁ = -1, x₂ = +1 → case (0,1) ✓
h₄ fires when x₁ = +1, x₂ = +1 → case (1,1)
Output weights: w₁ = 0, w₂ = +1, w₃ = +1, w₄ = 0 → XOR SOLVED!
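A small sketch of this construction (the hidden-unit thresholds used here are an assumption chosen so that each hidden perceptron fires for exactly one input pattern; the output weights are the 0, +1, +1, 0 given above):

```python
def perceptron(x, w, theta):
    """Threshold unit: fire (1) when the weighted sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# One hidden perceptron per input pattern: weight +1 where the pattern has a 1, -1 where it has a 0,
# threshold = number of +1 weights, so each hidden unit fires for exactly one of (0,0),(1,0),(0,1),(1,1).
hidden = [((-1, -1), 0), ((+1, -1), 1), ((-1, +1), 1), ((+1, +1), 2)]   # h1, h2, h3, h4
out_w, out_theta = (0, 1, 1, 0), 1                                      # only h2 and h3 reach the output

def xor_mlp(x1, x2):
    h = [perceptron((x1, x2), w, t) for w, t in hidden]
    return perceptron(h, out_w, out_theta)

print([xor_mlp(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])     # [0, 1, 1, 0]
```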
Representation Power: the BIG theorem!
Theorem: any Boolean function of n inputs can be represented by a network with:
Hidden layer = 2ⁿ perceptrons
Output layer = 1 perceptron
| Inputs (n) | Hidden neurons (2ⁿ) |
| 2 | 4 |
| 5 | 32 |
| 10 | 1,024 |
| 20 | 1,048,576 (!) |
The catch: as n grows, the number of neurons grows exponentially. This is exactly the motivation for Deep Learning!
6 Sigmoid Neuron: smooth learning!
The problem with the step function
- Problem 1: a tiny change → a huge jump (49 marks = FAIL, 50 marks = PASS: unfair!)
- Problem 2: zero gradient = no learning possible!
- Step function = an old light switch (ON/OFF)
- Sigmoid = a modern dimmer (changes gradually)
Sigmoid Formula
Full form
σ(z) = 1 / (1 + e⁻ᶻ)
where z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
Properties
Output range: (0, 1)
When z = 0 → σ = 0.5 (uncertain!)
When z → +∞ → σ → 1 (confident YES)
When z → -∞ → σ → 0 (confident NO)
Sigmoid as a Probability!
| Output | Meaning |
| 0.95 | 95% chance → it's spam! |
| 0.50 | 50-50 → can't tell |
| 0.03 | 3% chance → it's safe ✓ |
The sigmoid is smooth and differentiable → the gradient exists everywhere → learning can happen!
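A minimal sketch of the sigmoid and its behaviour at the values listed above (illustrative code, not from the source):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: smooth, differentiable, output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                        # 0.5 -> maximally uncertain
print(sigmoid(np.array([-5.0, 5.0])))      # ~[0.007, 0.993] -> confident NO / confident YES
print(sigmoid(2.0) * (1 - sigmoid(2.0)))   # derivative sigma*(1-sigma) at z=2, ~0.105 (nonzero!)
```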
7 The 4 Components of ML: the supervised learning setup
The 4 components (explained with a house-price example)
D: DATA → training examples {xᵢ, yᵢ}; x = features (size, rooms), y = price
M: MODEL → a mathematical guess: ŷ = f̂(x; θ)
A: LEARNING ALGORITHM → how to find the best parameters → Gradient Descent!
L: LOSS FUNCTION → how wrong the prediction is → MSE!
Model Types
Linear: ŷ = wᵀx
Sigmoid: ŷ = 1/(1 + e^(-wᵀx))
Quadratic: ŷ = xᵀWx
Loss Function: MSE (Mean Squared Error)
MSE formula
L = (1/N) × Σ (ŷᵢ - yᵢ)² (sum over i = 1 to N)
Why square? 3 reasons:
- -3 and +3 no longer cancel out (fixes negative errors)
- Big errors are penalized more (error 2→4, error 4→16!)
- Mathematically easy to differentiate
Example
Actual: [45, 72, 28]
Predicted: [42, 78, 27]
Squared errors: [9, 36, 1]
MSE = (9+36+1)/3 = 15.33 → lower is better!
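The same example in code (a tiny sketch; the function name is illustrative):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: average of the squared differences."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.mean((y_pred - y_true) ** 2)

print(mse([42, 78, 27], [45, 72, 28]))   # (9 + 36 + 1) / 3 = 15.33...
```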
8 Gradient Descent: from the mountain to the valley!
Intuition + math derivation
You are blindfolded on a mountain and need to reach the valley below → feel the slope → take a step downhill → repeat!
Taylor series (first-order approximation)
L(θ + ηu) ≈ L(θ) + η × uᵀ × ∇θL(θ)
To minimize, we need the direction u:
uᵀ∇L = ||u|| ||∇L|| cos β
This is most negative when cos β = -1, i.e. β = 180°
⇒ u = -∇θL(θ) (the OPPOSITE direction to the gradient!)
Parameter Update Rules
w_(t+1) = w_t - η × ∇w_t
b_(t+1) = b_t - η × ∇b_t
η (eta) = learning rate (the step size)
The minus sign matters: we move in the direction opposite to the gradient (downhill)!
Gradients for a Sigmoid Neuron (MSE loss)
Gradient w.r.t. the weight
∇w = ∂L/∂w = (f(x) - y) × f(x) × (1 - f(x)) × x
Gradient w.r.t. the bias
∇b = ∂L/∂b = (f(x) - y) × f(x) × (1 - f(x))
| Term | Meaning |
| (f(x) - y) | How wrong was the prediction? |
| f(x)(1-f(x)) | Sigmoid derivative (slope) |
| x | Which input is involved? |
Learning rate η: the Goldilocks zone
η too large → overshoot, diverge (you fall off the mountain)
η too small → very slow (you crawl along)
η just right → fast convergence ✓
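A minimal sketch of batch gradient descent for a single sigmoid neuron using exactly the gradient expressions above (the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_sigmoid_neuron(X, Y, eta=0.5, epochs=1000):
    """Gradient descent on one sigmoid neuron with squared-error loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            fx = sigmoid(w * x + b)
            dw += (fx - y) * fx * (1 - fx) * x   # (f(x) - y) * f(x) * (1 - f(x)) * x
            db += (fx - y) * fx * (1 - fx)       # same, without the trailing x
        w -= eta * dw                            # step opposite to the gradient
        b -= eta * db
    return w, b

X, Y = [0.5, 2.5], [0.2, 0.9]                    # toy 1-D data
w, b = gd_sigmoid_neuron(X, Y)
print(sigmoid(w * 0.5 + b), sigmoid(w * 2.5 + b))  # predictions move towards 0.2 and 0.9
```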
9 Universal Approximation Theorem
The big theorem: any function!
Theorem: a multilayer network of sigmoid neurons with a single hidden layer can approximate any continuous function to any desired precision!
- Face recognition
- Weather prediction
- Medical diagnosis
- Language translation
LEGO analogy: individual LEGO bricks are rectangular, yet with enough bricks you can build any shape. In the same way, an individual sigmoid is just S-shaped, but combine enough of them and you can approximate any function!
Individual sigmoids are the building blocks; combined, they can approximate any target function!
This is the foundation of Deep Learning!
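A tiny illustration of the "combine sigmoids" idea: subtracting two steep sigmoids gives a "tower" that is 1 on an interval and 0 elsewhere, and sums of such towers can approximate any continuous 1-D function (this tower construction is a standard illustration, not taken from the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.5, 2.5])
tower = sigmoid(100 * (x - 1)) - sigmoid(100 * (x - 2))   # steep rise at 1, steep fall at 2
print(np.round(tower, 2))                                  # [0. 1. 0.] -> 1 inside (1, 2), 0 outside
```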
10 Feed Forward Neural Network (FFNN): an assembly line!
FFNN Structure: every layer performs 2 operations
Operation 1: pre-activation aᵢ (linear transformation)
aᵢ(x) = bᵢ + Wᵢ × hᵢ₋₁(x)
Operation 2: activation hᵢ (non-linear)
hᵢ(x) = g(aᵢ(x))
Complete forward pass (3-layer network)
ŷ = O(W₃ · g(W₂ · g(W₁x + b₁) + b₂) + b₃)
Parameters θ (all weights + biases)
θ = {W₁, W₂, ..., W_L, b₁, b₂, ..., b_L}
Tip: the input layer is h₀ = x (no processing!)
Why is non-linearity essential?
Without activations, everything collapses to linear!
h₁ = W₁x + b₁
h₂ = W₂(W₁x + b₁) + b₂
   = (W₂W₁)x + (W₂b₁ + b₂)
→ still JUST ONE LINEAR LAYER!
With activations: non-linear magic! ✓
h₁ = g(W₁x + b₁) → non-linear!
h₂ = g(W₂h₁ + b₂) → a non-linear combination!
Sigmoid: g(z) = 1/(1 + e⁻ᶻ) → output (0,1)
tanh: g(z) = tanh(z) → output (-1,1)
ReLU: g(z) = max(0,z) → output [0,∞)
Weight matrix dimensions
Wᵢ shape: m × n (m = neurons in layer i, n = neurons in the previous layer)
bᵢ shape: m × 1 (one bias per neuron)
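A short sketch of the forward pass (layer sizes, initialization, and the linear output layer are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs, g=sigmoid):
    """At each layer: pre-activation a = W h_prev + b, then activation h = g(a).
    The final layer is kept linear here (a regression-style output)."""
    h = x                                    # h0 = x: the input layer does no processing
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = W @ h + b                        # linear transformation
        h = g(a)                             # non-linearity
    return Ws[-1] @ h + bs[-1]               # linear output layer

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]                         # 3 inputs, two hidden layers of 4, 1 output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]   # each W_i is m x n
bs = [np.zeros(m) for m in sizes[1:]]        # one bias per neuron
print(forward(rng.normal(size=3), Ws, bs))
```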
11 Output Functions & Loss Functions: the right tool for the right job!
Regression: predict real values
Output activation: linear
f(x) = W_O × a_L + b_O
(No squishing: any real number is allowed!)
Loss: MSE
L(θ) = (1/N) × Σᵢ Σⱼ (ŷᵢⱼ - yᵢⱼ)²
- House price: ₹45,73,291
- Temperature: 27.3°C
- Stock price: $142.67
Classification: predict categories
Output activation: Softmax
ŷⱼ = e^(a_L,j) / Σᵢ e^(a_L,i)
Properties:
- 0 < ŷⱼ < 1 ✓
- Σ ŷⱼ = 1 ✓
(a valid probability distribution!)
Example: [Dog=3.0, Cat=1.0, Bird=0.2]
e^3.0 = 20.09, e^1.0 = 2.72, e^0.2 = 1.22
Sum = 24.03
Dog = 83.6%, Cat = 11.3%, Bird = 5.1% ✓
Cross Entropy Loss
Full formula
L(θ) = -(1/N) × Σᵢ Σⱼ [yᵢⱼ log(ŷᵢⱼ) + (1 - yᵢⱼ) log(1 - ŷᵢⱼ)]
With one-hot labels it simplifies to
L = -log(ŷₗ) (l = the true class)
| ŷ for the true class | Loss -log(ŷ) |
| 0.99 ✓ | 0.01 (tiny) |
| 0.50 | 0.69 (medium) |
| 0.01 ✗ | 4.61 (HUGE!) |
→ Wrong + confident = catastrophic loss!
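The softmax and cross-entropy numbers above, reproduced in a short sketch (the max-subtraction inside softmax is a standard numerical-stability trick, not part of the formula in the text):

```python
import numpy as np

def softmax(a):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(y_hat, true_class):
    """With a one-hot label, cross entropy reduces to -log(probability of the true class)."""
    return -np.log(y_hat[true_class])

scores = np.array([3.0, 1.0, 0.2])           # [Dog, Cat, Bird] pre-activations
probs = softmax(scores)
print(np.round(probs, 3))                     # ~[0.836, 0.113, 0.051]
print(round(cross_entropy(probs, 1), 2))      # if Cat is the true class: -log(0.113) ~ 2.18
```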
Output Selection Decision Tree: what should you use?
Output type: a number (regression)
→ Linear activation + MSE loss
Output type: 2 categories (binary)
→ Sigmoid + Binary Cross Entropy
Output type: more than 2 categories (multi-class)
→ Softmax + Cross Entropy
12 Backpropagation: the blame flows backwards!
Chain Rule: the foundation of backprop
Analogy: you were late → the alarm didn't ring → the phone was on silent → a friend's late-night text: a chain of causes!
Basic chain rule
dy/dx = (dy/dz) × (dz/dx)
Multiple paths (sum over all of them)
∂p/∂z = Σₖ (∂p/∂qₖ) × (∂qₖ/∂z)
Forward vs backward pass
Forward: Input → L1 → L2 → L3 → Loss
Backward: Loss → L3 → L2 → L1 (gradients!)
Tip: backprop divides up the "blame": how much did each weight contribute to the loss?
Backprop Gradients: step by step
Part 1: output layer (softmax + cross entropy)
∂L/∂a_L,i = ŷᵢ - yᵢ
(Predicted probability minus true probability!)
Concrete example: True = Cat, Predicted = [Dog=0.7, Cat=0.2, Bird=0.1]
Gradients (ŷ - y):
Dog: 0.7 - 0 = +0.7 → too high, push DOWN
Cat: 0.2 - 1 = -0.8 → too low, push UP
Bird: 0.1 - 0 = +0.1 → slightly high, push down
Part 2: hidden layer gradient
∂L/∂h_i,j = Σₖ (∂L/∂a_(i+1),k) × W_(i+1),k,j
∂L/∂a_i,j = (∂L/∂h_i,j) × g'(a_i,j)
Part 3: weight & bias gradient
∂L/∂Wᵢ = (∂L/∂aᵢ) × hᵢ₋₁ᵀ
∂L/∂bᵢ = ∂L/∂aᵢ
Activation Derivatives
Sigmoid: g'(z) = σ(z)(1 - σ(z))
Example: σ = 0.8 → g' = 0.8 × 0.2 = 0.16
tanh: g'(z) = 1 - tanh²(z)
Example: tanh = 0.9 → g' = 1 - 0.81 = 0.19
ReLU: g'(z) = 1 if z > 0, else 0
Warning (the sigmoid's problem): when σ → 0 or 1, g' → 0 → vanishing gradient!
Full Backprop Algorithm
1. Forward pass: compute ŷ
2. Compute the loss
3. Output layer gradient: ŷ - y
4. Work backwards through the hidden layers (chain rule)
5. Compute the weight/bias gradients
6. Update θ: θ_new = θ_old - η × ∇θ
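A compact sketch of one full backprop step for a 1-hidden-layer network (sigmoid hidden layer, softmax output, cross-entropy loss), following the formulas above; the layer sizes, toy data, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def backprop_step(x, y, W1, b1, W2, b2, eta=0.1):
    """Forward pass, backward pass, and gradient-descent update for one example."""
    # forward pass
    a1 = W1 @ x + b1; h1 = sigmoid(a1)
    a2 = W2 @ h1 + b2; y_hat = softmax(a2)
    # backward pass
    da2 = y_hat - y                       # output layer: softmax + cross entropy -> (y_hat - y)
    dW2 = np.outer(da2, h1); db2 = da2    # dL/dW = dL/da * h_prev^T, dL/db = dL/da
    dh1 = W2.T @ da2                      # blame flowing back through W2
    da1 = dh1 * h1 * (1 - h1)             # multiply by the sigmoid derivative g'(a1)
    dW1 = np.outer(da1, x); db1 = da1
    # gradient descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # one toy input with 4 features
y = np.array([0.0, 1.0, 0.0])             # one-hot label: true class is index 1 (of 3)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
for _ in range(200):
    W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
print(np.round(softmax(W2 @ sigmoid(W1 @ x + b1) + b2), 2))   # probability of class 1 should be near 1
```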
Big Picture: the basketball analogy
| Basketball | ML Equivalent |
| Throwing technique | Parameters (w, b) |
| Does the ball go in? | Prediction ŷ |
| Basket location | True label y |
| Distance missed by | Loss/Error |
| Coach's feedback | Gradient |
| Size of the adjustment | Learning rate η |
| Practice sessions | Training iterations |
13 Gradient Descent Variants: faster & smarter!
Momentum-Based GD
Regular GD (gets stuck in flat regions!)
w_(t+1) = w_t - η × ∇w_t
Momentum GD (a ball rolling down the hill!)
u_t = β × u_(t-1) + ∇w_t (update direction)
w_(t+1) = w_t - η × u_t
β (beta) = momentum coefficient
β = 0.9 is a common choice
Tip: momentum builds up speed, so progress continues even through flat regions!
- β = 0: regular GD
- β = 0.9: reuse 90% of the previous direction
- β too high: it can overshoot!
Nesterov Accelerated Gradient (NAG)
The "look ahead" strategy
NAG step 1: compute the lookahead point
w_lookahead = w_t - β × u_(t-1)
NAG step 2: compute the gradient at the lookahead point
u_t = β × u_(t-1) + ∇w_lookahead
NAG step 3: update
w_(t+1) = w_t - η × u_t
Momentum → take the gradient at the current point, then jump
NAG → jump first, then take the gradient there: more accurate!
NAG = less oscillation, faster convergence!
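A small sketch of both update rules on a toy 1-D quadratic loss (the loss, learning rate, and iteration count are illustrative assumptions):

```python
def momentum_step(w, u_prev, grad_fn, eta=0.1, beta=0.9):
    """Momentum GD: accumulate an exponentially weighted update direction, then step."""
    u = beta * u_prev + grad_fn(w)        # u_t = beta * u_(t-1) + grad(w_t)
    return w - eta * u, u                 # w_(t+1) = w_t - eta * u_t

def nag_step(w, u_prev, grad_fn, eta=0.1, beta=0.9):
    """NAG: jump to the lookahead point first, take the gradient there, then step."""
    w_look = w - beta * u_prev            # step 1: lookahead point
    u = beta * u_prev + grad_fn(w_look)   # step 2: gradient at the lookahead point
    return w - eta * u, u                 # step 3: update

grad = lambda w: w - 3.0                  # gradient of the toy loss 0.5*(w - 3)^2, minimum at w = 3
for step in (momentum_step, nag_step):
    w, u = 0.0, 0.0
    for _ in range(200):
        w, u = step(w, u, grad)
    print(step.__name__, round(w, 3))     # both should end up very close to 3.0
```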
Batch vs SGD vs Mini-Batch
| Type | Data used per step | Speed |
| Batch GD | Full dataset | Slow per step |
| SGD | 1 sample | Fast per step |
| Mini-Batch | B samples | Best! |
Mini-batch update rule
w_(t+1) = w_t - (η/B) × Σ ∇w_t (sum over the batch)
B = 32 or 64 is the standard choice!
Mini-batch GD = the best balance of speed and accuracy → the default choice in deep learning!
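A minimal mini-batch GD loop (the toy linear-regression gradient, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, eta=0.1, B=32, epochs=50, seed=0):
    """Shuffle each epoch, then update on batches of B examples,
    averaging the per-example gradients (the eta/B factor in the rule above)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))               # reshuffle once per epoch
        for start in range(0, len(X), B):
            idx = order[start:start + B]
            g = sum(grad_fn(w, X[i], y[i]) for i in idx)
            w -= (eta / len(idx)) * g                 # average gradient over the batch
    return w

# per-example gradient of 0.5*(w.x - y)^2 for a toy linear-regression problem
grad = lambda w, x, y: (w @ x - y) * x
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(np.round(minibatch_gd(X, y, grad), 2))          # should approach [ 1. -2.  0.5]
```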
14 Learning Rate Scheduling: start big, end small!
Scheduling Methods
Method 1: Step decay → halve at fixed intervals
Every 5 epochs: η = η / 2
Epochs 1-5: η = 0.1
Epochs 6-10: η = 0.05 (halved)
Epochs 11-15: η = 0.025 (halved again)
Method 2: Exponential decay
η_t = η₀ × e^(-kt)
k = decay rate, t = current step/epoch
Example: η₀ = 0.1, k = 0.1
t=0: η = 0.100
t=10: η = 0.037
t=50: η = 0.001
Method 3: 1/t decay (inverse time)
η_t = η₀ / (1 + k×t)
Stays higher for longer → more exploration!
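The three schedules as small helper functions (names and default values are illustrative):

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=5):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

def exp_decay(eta0, t, k=0.1):
    """Exponential decay: eta_t = eta0 * e^(-k*t)."""
    return eta0 * math.exp(-k * t)

def inv_time_decay(eta0, t, k=0.1):
    """1/t decay: eta_t = eta0 / (1 + k*t)."""
    return eta0 / (1 + k * t)

print([round(step_decay(0.1, e), 3) for e in (0, 5, 10)])   # [0.1, 0.05, 0.025]
for t in (0, 10, 50):
    print(t, round(exp_decay(0.1, t), 3), round(inv_time_decay(0.1, t), 3))
# exp decay: 0.1 -> 0.037 -> 0.001 ; 1/t decay stays higher: 0.1 -> 0.05 -> 0.017
```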
Best Combination for Real Projects
In production, use this combination:
Mini-batch (B = 32 or 64)
+ Momentum or NAG
+ Learning rate scheduling
= fast, stable, good convergence!
| Method | Best for |
| Batch GD | Small datasets that fit fully in memory |
| SGD | Online learning |
| Mini-Batch (default) | Everything! |
| Step decay | Common practice |
| Exponential decay | Image classification |
| Line search | Research settings |
MASTER SUMMARY: the whole journey!
Biological Neuron
Dendrite → Soma → Axon → Synapse
MP Neuron (1943)
Binary inputs, fixed threshold, no learning
Perceptron (1958)
Real inputs, learnable weights, only linearly separable problems
MLP
Stack layers → solves XOR → any Boolean function
Sigmoid Neuron
Smooth, probabilistic, differentiable, can learn!
FFNN + Backprop
Forward pass → Loss → Backward pass → Update
Remember for the exam: Data → Model → Loss → Gradient → Update → Repeat! That is Deep Learning!