Questions and Answers

Part a) Analysis of Weights

Question

A sample of eleven weights (in kg) is given: 60, 72, 65, 68, 70, 100, 62, 75, 78, 80, 83.

  1. Calculate Q1, Q2, Q3, and IQR.
  2. Determine potential outliers using the 1.5 × IQR rule.

Goal: To find the quartiles (spread) and check for any unusual values (outliers).

i) Calculate Q1, Q2, Q3, and IQR

Step 1: Sort the data.

Sorted: 60, 62, 65, 68, 70, 72, 75, 78, 80, 83, 100

(There are 11 numbers total)

Step 2: Find the Median (Q2).
The exact middle number is the 6th number.

Q2 (Median) = 72

Step 3: Find the First Quartile (Q1).
Q1 is the middle of the lower half (numbers left of 72): 60, 62, 65, 68, 70.

Q1 = 65

Step 4: Find the Third Quartile (Q3).
Q3 is the middle of the upper half (numbers right of 72): 75, 78, 80, 83, 100.

Q3 = 80

Step 5: Calculate Interquartile Range (IQR).
This is the spread of the middle 50%.

IQR = Q3 - Q1 = 80 - 65

IQR = 15

ii) Check for Outliers

The Rule: A value is an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

  • Fence Distance: 1.5 × 15 = 22.5
  • Lower Limit: Q1 - 22.5 = 65 - 22.5 = 42.5
  • Upper Limit: Q3 + 22.5 = 80 + 22.5 = 102.5

Conclusion: Looking at our data (60 to 100):
- Is anything below 42.5? No.
- Is anything above 102.5? No. (100 is close, but safe!)

There are no outliers.
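The quartile and fence logic above can be checked with a short Python sketch (illustrative only; `quartiles` and `outliers` are my own helper names). It uses the same median-exclusive convention as the hand calculation; note that library routines such as `numpy.percentile` may interpolate quartiles differently.

```python
from statistics import median

def quartiles(data):
    """Median-exclusive quartiles, matching the hand method above."""
    s = sorted(data)
    n = len(s)
    q2 = median(s)
    lower = s[: n // 2]        # values left of the median position
    upper = s[(n + 1) // 2 :]  # values right of the median position
    return median(lower), q2, median(upper)

def outliers(data):
    """Values outside the 1.5 x IQR fences."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

weights = [60, 72, 65, 68, 70, 100, 62, 75, 78, 80, 83]
print(quartiles(weights))  # (65, 72, 80)
print(outliers(weights))   # []
```

The same two functions reproduce the practice answers below, e.g. `outliers([10, 12, 15, 18, 20, 22, 25, 30, 80])` flags 80.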

Part A: Basic Computational Speed (10 Questions)

Find $Q1, Q2, Q3, IQR$, and check for Outliers.


  1. Data: 10, 12, 15, 18, 20, 22, 25, 30, 80
    Ans: $Q1=13.5, Q2=20, Q3=27.5, IQR=14$. Outlier: 80.
  2. Data: 5, 5, 6, 7, 8, 8, 9, 10, 12, 15
    Ans: $Q1=6, Q2=8, Q3=10, IQR=4$. Outlier: None.
  3. Data: 100, 110, 120, 130, 140, 150, 160, 250
    Ans: $Q1=115, Q2=135, Q3=155, IQR=40$. Outlier: 250.
  4. Data: 40, 42, 45, 47, 49, 50, 52, 55, 58, 60, 110
    Ans: $Q1=45, Q2=50, Q3=58, IQR=13$. Outlier: 110.
  5. Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    Ans: $Q1=3, Q2=5.5, Q3=8, IQR=5$. Outlier: None.
  6. Data: 22, 24, 26, 28, 30, 32, 34, 36, 100, 110
    Ans: $Q1=26, Q2=31, Q3=36, IQR=10$. Outliers: 100, 110.
  7. Data: 50, 55, 60, 65, 70, 75, 80
    Ans: $Q1=55, Q2=65, Q3=75, IQR=20$. Outlier: None.
  8. Data: 0, 45, 46, 47, 48, 49, 50, 95
    Ans: $Q1=45.5, Q2=47.5, Q3=49.5, IQR=4$. Outliers: 0, 95.
  9. Data: 15, 15, 15, 15, 15, 15, 15, 45
    Ans: $Q1=15, Q2=15, Q3=15, IQR=0$. Outlier: 45.
  10. Data: 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32
    Ans: $Q1=16, Q2=22, Q3=28, IQR=12$. Outlier: None.

Part B: Conceptual ISM Application (10 Questions)

Think about these in the context of network latency or login attempts.


  1. Latency spikes: 20ms, 22ms, 21ms, 23ms, 20ms, 200ms. Is 200ms a statistical outlier?
    Ans: $Q1=20, Q3=23, IQR=3$. Upper bound = $27.5$. Yes, it is an outlier.
  2. Login failures: 2, 3, 1, 4, 2, 30, 2, 1. Is 30 a brute force indicator (outlier)?
    Ans: $Q1=1, Q3=3, IQR=2$. Upper bound = $6$. Yes, 30 is an outlier.
  3. Data Packet sizes: 500, 510, 505, 520, 10, 515. Is 10 a fragmented/malicious packet?
    Ans: $Q1=500, Q3=515, IQR=15$. Lower bound = $477.5$. Yes, 10 is an outlier.
  4. System Uptime (days): 300, 305, 310, 20, 315, 320.
    Ans: $Q1=300, Q3=315, IQR=15$. Lower bound = $277.5$. 20 is an outlier.
  5. User Access Times (hr): 9, 10, 11, 12, 13, 14, 23. Is the 11 PM access an outlier?
    Ans: $Q1=10, Q3=14, IQR=4$. Upper bound = $20$. Yes, 23 is an outlier.
  6. Sensor Temperatures: 25, 26, 25, 27, 26, 45, 25.
    Ans: $Q1=25, Q3=27, IQR=2$. Upper bound = $30$. Yes, 45 is an outlier.
  7. File Sizes (MB): 5, 7, 6, 8, 5, 50.
    Ans: $Q1=5, Q3=8, IQR=3$. Upper bound = $12.5$. Yes, 50 is an outlier.
  8. Encryption Time (sec): 1.2, 1.3, 1.2, 1.4, 3.5.
    Ans: $Q1=1.2, Q3=1.4, IQR=0.2$ (quartiles taken with median-inclusive halves). Upper bound = $1.7$. Yes, 3.5 is an outlier.
  9. Download Speeds: 10, 12, 11, 13, 1, 12.
    Ans: $Q1=10, Q3=12, IQR=2$. Lower bound = $7$. Yes, 1 is an outlier.
  10. Weekly Alerts: 100, 110, 105, 120, 115, 500.
    Ans: $Q1=105, Q3=120, IQR=15$. Upper bound = $142.5$. Yes, 500 is an outlier.

Section A, B, & C: Theoretical & Case Study Practice

Section A: Theoretical Foundation

Question 1: Sensitivity
If a dataset of system response times has a mean of 50ms and a median of 45ms, and a single system crash causes one entry to be 10,000ms, which value ($Q2$ or the mean) will change more significantly?

Answer: The mean will change more significantly because it factors in every data point. The median ($Q2$) only shifts to the next adjacent value, remaining largely unchanged. This prevents a "false baseline" during system crashes.

Question 2: The "Whisker" Logic
If the whiskers are very short compared to the box, what does that tell a security analyst about the predictability of the system?

Answer: Short whiskers indicate very little variance at the extremes. The system is highly predictable; even a small deviation beyond these whiskers is likely a legitimate anomaly worth investigating.

Section B: Computational Analysis

Question 3: Raw Data Analysis (Packet sizes: 12, 15, 17, 20, 22, 25, 30, 35, 40, 150)

  • a) $Q1, Q2, Q3$: $Q2=23.5$, $Q1=17$, $Q3=35$.
  • b) $IQR$: $35 - 17 = 18$.
  • c) Outlier Check: Upper Fence $= 35 + (1.5 \times 18) = 62$. Since $150 > 62$, 150 is an outlier.

Question 4: Grouped Data Analysis (N=100)

Time (Min)               Frequency (f)   Cumulative Freq (cf)
0 - 10                   15              15
10 - 20 (Median Class)   45              60
20 - 30                  30              90
30 - 40                  10              100

$Q2$ Calculation: $10 + \left( \frac{50 - 15}{45} \right) \times 10 = 17.78$ mins.
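The grouped-data median interpolation above can be expressed as a small Python sketch (the function name `grouped_median` is my own; the formula is $L + \frac{N/2 - cf}{f} \times h$ as used in the calculation):

```python
def grouped_median(lower, width, n_total, cf_before, f_median):
    """Interpolated median for grouped data: L + ((N/2 - cf) / f) * h."""
    return lower + ((n_total / 2 - cf_before) / f_median) * width

# Median class is 10-20: L = 10, h = 10, N = 100, cf before class = 15, f = 45
q2 = grouped_median(lower=10, width=10, n_total=100, cf_before=15, f_median=45)
print(round(q2, 2))  # 17.78
```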


Section C: Application & Case Study

Question 5: Comparing Encryption Algorithms

  • Predictability: Protocol X ($IQR=10$) is more predictable than Protocol Y ($IQR=35$).
  • Is 90% an outlier for Y? Upper Fence $= 45 + (1.5 \times 35) = 97.5$. Since $90 < 97.5$, it is not an outlier.
  • Approval: Protocol X is approved for deployment because its tight CPU usage prevents resource starvation.

Part b) Astronaut Training Times

Question

The following data shows the time (in minutes) it took for a group of astronauts to complete various training simulations: 18, 25, 19, 32, 27, 22, 29, 30, 24, 26.

Calculate the mean and standard deviation of these times.


Goal: Calculate the Mean (average) and Standard Deviation (spread).

Data: 18, 25, 19, 32, 27, 22, 29, 30, 24, 26

(n = 10 numbers)

1. Calculate the Mean

Sum of all numbers = 252

Mean = Sum ÷ Count = 252 ÷ 10

Mean = 25.2 minutes

2. Calculate Standard Deviation

We look at how far each number is from the mean (25.2) and square it.

Data Point (x)   Distance (x - 25.2)   Squared Distance
18               -7.2                  51.84
25               -0.2                  0.04
19               -6.2                  38.44
32               +6.8                  46.24
27               +1.8                  3.24
22               -3.2                  10.24
29               +3.8                  14.44
30               +4.8                  23.04
24               -1.2                  1.44
26               +0.8                  0.64
SUM                                    189.6

Formula Step 1 (Variance): Sum ÷ (n - 1)

189.6 ÷ 9 = 21.067

Formula Step 2 (Standard Deviation): Square root of Variance

√21.067 ≈ 4.59

Standard Deviation = 4.59 minutes
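The whole calculation can be verified in two lines with Python's standard library (`statistics.stdev` uses the sample denominator n - 1, matching the formula above):

```python
from statistics import mean, stdev  # stdev divides by n - 1 (sample SD)

times = [18, 25, 19, 32, 27, 22, 29, 30, 24, 26]
print(mean(times))             # 25.2
print(round(stdev(times), 2))  # 4.59
```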

Practice: Mean & Sample Standard Deviation (10 Questions)

Formulas Used:

Mean: $\bar{x} = \frac{\sum x}{n}$

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$

Find the Mean ($\bar{x}$) and Sample Standard Deviation ($s$) for the following sets:


  1. Data: 5, 10, 15, 20, 25
    Ans: $\bar{x} = 15, s \approx 7.91$
  2. Data: 100, 102, 104, 106
    Ans: $\bar{x} = 103, s \approx 2.58$
  3. Data: 2, 4, 6, 8, 10, 12
    Ans: $\bar{x} = 7, s \approx 3.74$
  4. Data: 50, 50, 50, 50 (Zero variance test)
    Ans: $\bar{x} = 50, s = 0$
  5. Data: 1, 9, 1, 9
    Ans: $\bar{x} = 5, s \approx 4.62$
  6. Data: 12, 15, 18, 21, 24
    Ans: $\bar{x} = 18, s \approx 4.74$
  7. Data: 45, 47, 52, 48, 50
    Ans: $\bar{x} = 48.4, s \approx 2.70$
  8. Data: 10, 20, 30, 40, 50, 60, 70
    Ans: $\bar{x} = 40, s \approx 21.60$
  9. Data: 3, 3, 4, 5, 5
    Ans: $\bar{x} = 4, s = 1$
  10. Data: 0.5, 0.7, 0.9, 1.1
    Ans: $\bar{x} = 0.8, s \approx 0.26$

📚 Recommended Reference Material

For a complete beginner, I highly recommend OpenStax Introductory Statistics. It is free, high-quality, and widely used.

Probability Distributions

Part a) Normal Distribution

Question

Life of an instrument: Mean (μ) = 12 months, Standard Deviation (σ) = 2 months.

  1. Find probability it lasts less than 7 months.
  2. Find probability it lasts between 7 and 12 months.

Goal: Use the Z-score formula to standardize values and find probabilities from the Normal Table.

[Image of standard normal distribution curve]

i) Probability (Life < 7 months)

Step 1: Calculate the Z-score.

The Z-score tells us how many standard deviations "7" is away from the mean "12".

Z = (X - μ) / σ
Z = (7 - 12) / 2
Z = -5 / 2 = -2.5

Step 2: Look up Z = -2.5 in the Standard Normal Table.

The area to the left of Z = -2.5 corresponds to the probability.

Answer: P(X < 7) = 0.0062 (or 0.62%)

ii) Probability (7 < Life < 12 months)

Logic: We know the total area to the left of the mean (12) is 0.5 (50%). We just calculated the small tail area to the left of 7.

To find the area between 7 and 12, we subtract the small tail from the half.

  • Probability (X < 12): 0.5000 (The Mean)
  • Probability (X < 7): 0.0062 (From part i)
Calculation: 0.5000 - 0.0062 = 0.4938
Answer: 49.38% probability.
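Both probabilities can be checked without a printed table by computing the normal CDF from the error function (a minimal sketch; `normal_cdf` is my own helper name, built on the identity $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

p_less_7 = normal_cdf(7, 12, 2)               # Z = -2.5
p_between = normal_cdf(12, 12, 2) - p_less_7  # area between 7 and 12
print(round(p_less_7, 4))   # 0.0062
print(round(p_between, 4))  # 0.4938
```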

Topic: The Normal Distribution (Gaussian / Bell Curve)

The Normal Distribution is the most important probability distribution in statistics. It describes how the values of a variable are distributed and is defined by its characteristic symmetrical bell shape.

1. Theoretical Foundation

It is a continuous probability distribution symmetrical about the mean, defined by two parameters:

  • 📍 Mean ($\mu$): The center of the distribution.
  • 📏 Standard Deviation ($\sigma$): The spread of the data.
Key Characteristics:
  • Symmetry: Left and right halves are mirror images. Mean = Median = Mode.
  • Asymptotic: The tails approach the horizontal axis but never touch it.

The Empirical Rule (68-95-99.7 Rule)

This rule defines how much data falls within specific standard deviations from the mean:

🎯 68% of data falls within $1\sigma$ of the mean.
🎯 95% of data falls within $2\sigma$ of the mean.
🎯 99.7% of data falls within $3\sigma$ of the mean.

2. Real-World Application: Human Height

Imagine measuring the height of every adult man in a large city:

  • The Mean: If average height is 5'9" (175 cm), most men cluster around this peak.
  • The Spread: Fewer men are 6'2" or 5'4".
  • The Extremes: Finding someone over 7ft or under 4ft is statistically rare.

Other Examples: SAT Scores, IQ tests, Manufacturing weights (e.g., 16oz cereal boxes), and random measurement errors.


Practice Set: Normal Distribution in Security Management

Assume a Normal Distribution for all calculations. Use the Z-score formula: $Z = \frac{X - \mu}{\sigma}$

1. System Latency: $\mu = 50ms, \sigma = 5ms$. Find the probability that a packet takes more than 60ms.
Ans: $Z = 2$. $P(Z > 2) = 1 - 0.9772 = \mathbf{0.0228}$
2. Password Length: $\mu = 10, \sigma = 2$. Find the probability a user chooses a password shorter than 8 characters.
Ans: $Z = -1$. $P(Z < -1) = \mathbf{0.1587}$
3. Download Speed: $\mu = 100 Mbps, \sigma = 10$. Probability speed is between 90 and 110 Mbps.
Ans: $P(-1 < Z < 1) \approx \mathbf{0.6826}$ (Empirical Rule)
4. CPU Usage: $\mu = 60\%, \sigma = 8\%$. Probability usage exceeds 76%.
Ans: $Z = 2$. $P(Z > 2) = \mathbf{0.0228}$
5. Backup Time: $\mu = 120 min, \sigma = 15$. Probability backup takes less than 90 min.
Ans: $Z = -2$. $P(Z < -2) = \mathbf{0.0228}$
6. Staff Training Scores: $\mu = 75, \sigma = 10$. Probability a staff member scores above 95.
Ans: $Z = 2$. $P(Z > 2) = \mathbf{0.0228}$
7. Encryption Latency: $\mu = 200 \mu s, \sigma = 20 \mu s$. Probability latency is between 180 and 220 $\mu s$.
Ans: $\mathbf{0.6826}$ (Within 1 standard deviation)
8. Server Lifespan: $\mu = 5 years, \sigma = 1 year$. Probability it fails before 3 years.
Ans: $Z = -2$. $P(Z < -2) = \mathbf{0.0228}$
9. Message Size: $\mu = 2KB, \sigma = 0.5KB$. Probability size is exactly 2KB.
Ans: 0 (Probability of an exact point in continuous distributions is always 0)
10. Network Jitter: $\mu = 5ms, \sigma = 1ms$. Probability jitter is between 4 and 6 ms.
Ans: $\mathbf{0.6826}$

Part b) Binomial vs. Poisson

Question

Biased coin with p = 0.15 flipped n = 40 times.

Find the probability of getting exactly 5 heads using i) Binomial and ii) Poisson.


i) Binomial Distribution

Formula: P(X=k) = nCk × p^k × (1-p)^(n-k)

  • n = 40, k = 5, p = 0.15
  • (1-p) = 0.85

Calculation:

P(X=5) = 40C5 × (0.15)^5 × (0.85)^35

P(X=5) = 658,008 × 0.0000759 × 0.0034

Binomial Probability: 0.1692 (16.92%)

ii) Poisson Approximation

Condition: The Poisson distribution approximates the Binomial when n is large and p is small. We use the mean (λ) as the parameter.

Step 1: Calculate Lambda (λ)
λ = n × p
λ = 40 × 0.15 = 6

Formula: P(X=k) = (e^(-λ) × λ^k) / k!

P(X=5) = (e^(-6) × 6^5) / 120

P(X=5) = (0.002479 × 7776) / 120

Poisson Estimate: 0.1606 (16.06%)

Note: The values are close (16.9% vs 16.1%), showing that Poisson is a reasonable approximation here.
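The comparison can be reproduced with Python's standard library (a sketch, not part of the original solution; `math.comb` supplies 40C5 directly):

```python
from math import comb, exp, factorial

n, p, k = 40, 0.15, 5

# Exact binomial probability
binom = comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson approximation with lambda = n * p = 6
lam = n * p
poisson = exp(-lam) * lam**k / factorial(k)

print(round(binom, 4))    # 0.1692
print(round(poisson, 4))  # 0.1606
```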

Section A: Binomial & Poisson (Discrete Distributions)

Practice calculations for discrete events like packet loss, malware arrivals, and system failures.

Binomial: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$ | Poisson: $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}$
1. DDoS Probability: A firewall has a $1\%$ failure rate per high-load hour. What is the probability of exactly $2$ failures in $10$ high-load hours?
Ans: $n=10, p=0.01, k=2$. Binomial $\approx \mathbf{0.004}$
2. Spam Filtering: An email filter misses $5\%$ of spam. In a batch of $20$ spam emails, find the probability it misses none.
Ans: $n=20, p=0.05, k=0$. $(0.95)^{20} \approx \mathbf{0.358}$
3. Malware Arrival: A server receives malware attempts at a rate of $\lambda = 3$ per hour. Find the probability of receiving exactly $5$ in an hour.
Ans: Poisson with $\lambda=3, k=5$. $P \approx \mathbf{0.1008}$
4. Packet Loss: If packet loss is $0.2\%$, find the probability of $0$ losses in $500$ packets using Poisson.
Ans: $\lambda = 500 \times 0.002 = 1$. $P(0) = e^{-1} \approx \mathbf{0.3679}$
5. User Errors: On average, a user makes $0.5$ config errors a month. Find the probability of at least $1$ error this month.
Ans: $1 - P(0) = 1 - e^{-0.5} \approx \mathbf{0.393}$

Decision Guide: Which Distribution to Use?

To identify the correct distribution, look for specific keywords and the nature of the data provided.

1. Normal Distribution

Look for: Continuous data, a provided Mean ($\mu$), and a Standard Deviation ($\sigma$).

Example: "The life of an instrument has a mean of 12 months and a SD of 2 months."
Why? You have two specific parameters for continuous measurements (time).

2. Binomial Distribution

Look for: A fixed number of independent trials ($n$), a constant probability ($p$), and exactly two outcomes (Success/Failure).

Example: "Flip a coin 40 times" or "10 discrete hours with a 1% failure rate."
Why? These deal with "Yes/No" outcomes over a set number of attempts.

3. Poisson Distribution

Look for: An average rate of occurrence ($\lambda$) over an interval (time/space) or a very large $n$ with a tiny $p$.

[Image showing comparison between Binomial and Poisson distribution curves]
Example: "0.5 config errors per month" or "0.2% packet loss in 500 packets."
Why? Used for counting random events that happen at a known average rate.
Summary Cheat Sheet
Distribution Key Given Info
Normal $\mu$ and $\sigma$ (Continuous)
Binomial $n$ and $p$ (Discrete counts)
Poisson $\lambda$ or (Large $n$ + Small $p$)

📚 Essential "Sheet Codes" & Formulas (Exam Prep)

1. Normal Distribution (Continuous Data)

Use for time, weight, or length where $\mu$ and $\sigma$ are known.

Z-score Formula: $$Z = \frac{x - \mu}{\sigma}$$

Empirical Rule: 68% ($\pm 1\sigma$), 95% ($\pm 2\sigma$), 99.7% ($\pm 3\sigma$)

⌨️ Excel Code: =NORM.DIST(x, mean, standard_dev, TRUE)

2. Binomial Distribution (Discrete Counts)

Use for fixed trials ($n$) with "Success/Failure" outcomes ($p$).

Probability Formula: $$P(X = k) = \binom{n}{k} \cdot p^k \cdot q^{(n-k)}$$

Note: $q = 1 - p$ and $\binom{n}{k} = \frac{n!}{k!(n-k)!}$

💡 Calculator Tip (Casio): Use the nCr button (SHIFT + ÷).

⌨️ Excel Code: =BINOM.DIST(k, n, p, FALSE)

3. Poisson Distribution (Average Rates)

Use for average rates ($\lambda$) over an interval.

Probability Formula: $$P(X = k) = \frac{e^{-\lambda} \cdot \lambda^k}{k!}$$

Constants: $e \approx 2.718$ | $\lambda$ is often $n \times p$

⚠️ "At Least 1" Logic: $P(X \ge 1) = 1 - P(0)$

⌨️ Excel Code: =POISSON.DIST(k, mean, FALSE)

Quick Decision Matrix

Keywords Use Distribution Essential Math
"Mean", "Standard Deviation", "Continuous" Normal $Z = \frac{x-\mu}{\sigma}$
"Success/Failure", "Fixed ($n$)", "Yes/No" Binomial nCr Calculations
"On average", "Per hour", "Arrival rate" Poisson $\lambda$ (Lambda)

📚 Recommended Reference

For probability distributions, OpenStax Statistics is excellent.

  • Chapter 6: The Normal Distribution (Z-scores)
  • Chapter 4: Discrete Random Variables (Binomial/Poisson)
  • View Online Textbook

Naive Bayes Classification

Problem Statement

Question 3

A company wants to classify incoming emails as "Spam" or "Not Spam" (Ham) using a Naïve Bayes classifier. Given the training dataset below, predict the label for the new email:

"Congratulations! You have won money quickly"

Email ID   Text (Bag of Words)                          Label
1          Congratulations, you have won a free prize   Spam
2          Monthly meeting scheduled for Monday         Not Spam
3          Earn money quickly with this simple trick    Spam
4          Project report attached. Review by Friday    Not Spam

Goal: Calculate the posterior probability for both classes (Spam and Not Spam) and assign the class with the higher probability.

Step 1: Pre-processing & Priors

First, we calculate the Prior Probabilities based on the document counts.

  • Total Documents: 4
  • Spam Documents: 2 (Email 1, Email 3)
  • Not Spam Documents: 2 (Email 2, Email 4)
P(Spam) = 2/4 = 0.5
P(Not Spam) = 2/4 = 0.5

Step 2: Vocabulary & Word Counts

We tokenize the training data to create a Vocabulary (V) and count words per class.

Spam Bag of Words: Congratulations, you, have, won, a, free, prize, Earn, money, quickly, with, this, simple, trick
Not Spam Bag of Words: Monthly, meeting, scheduled, for, Monday, Project, report, attached, Review, by, Friday

  • Total Words in Spam (N_Spam): 14
  • Total Words in Not Spam (N_NotSpam): 11
  • Vocabulary Size (|V|): 25 (14 unique Spam words + 11 unique Not Spam words)

Step 3: Likelihoods with Laplace Smoothing

Test Sentence Tokens: "Congratulations", "You", "have", "won", "money", "quickly".

Note: We perform case-insensitive matching (e.g., "You" = "you").

Formula: $P(w|Class) = \frac{count(w, Class) + 1}{N_{Class} + |V|}$

Class: Spam

Denominator: $14 + 25 = 39$

  • "Congratulations": (1+1)/39 = 2/39
  • "You": (1+1)/39 = 2/39
  • "have": (1+1)/39 = 2/39
  • "won": (1+1)/39 = 2/39
  • "money": (1+1)/39 = 2/39
  • "quickly": (1+1)/39 = 2/39

Class: Not Spam

Denominator: $11 + 25 = 36$

  • "Congratulations": (0+1)/36 = 1/36
  • "You": (0+1)/36 = 1/36
  • "have": (0+1)/36 = 1/36
  • "won": (0+1)/36 = 1/36
  • "money": (0+1)/36 = 1/36
  • "quickly": (0+1)/36 = 1/36

Step 4: Final Probability Scores

Score(Spam): $0.5 \times (\frac{2}{39})^6$

Calculation: $0.5 \times (0.0513)^6 \approx \mathbf{9.1 \times 10^{-9}}$

Score(Not Spam): $0.5 \times (\frac{1}{36})^6$

Calculation: $0.5 \times (0.0278)^6 \approx \mathbf{2.3 \times 10^{-10}}$

Conclusion

Since the score for Spam is significantly higher than Not Spam, the email is classified as:

SPAM
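The full smoothed calculation can be reproduced with a short Python sketch (illustrative only; the `score` helper is my own name, and tokenization is simply lowercased whitespace splitting, matching the case-insensitive matching noted above):

```python
spam_words = ("congratulations you have won a free prize "
              "earn money quickly with this simple trick").split()
ham_words = ("monthly meeting scheduled for monday "
             "project report attached review by friday").split()
vocab = set(spam_words) | set(ham_words)  # |V| = 25

def score(tokens, class_words, prior, V):
    """Prior times Laplace-smoothed likelihoods: (count + 1) / (N_class + |V|)."""
    s = prior
    for t in tokens:
        s *= (class_words.count(t) + 1) / (len(class_words) + V)
    return s

test = "congratulations you have won money quickly".split()
spam = score(test, spam_words, 0.5, len(vocab))
ham = score(test, ham_words, 0.5, len(vocab))
print(f"{spam:.1e} {ham:.1e}")          # 9.1e-09 2.3e-10
print("SPAM" if spam > ham else "HAM")  # SPAM
```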

Topic: Understanding Naïve Bayes (Theory & Logic)

1. Theoretical Foundation

Naïve Bayes is a supervised machine learning algorithm used for classification. It is favored for its speed and effectiveness, especially in text-based data.

The Core Formula (Bayes' Theorem)

$$P(C|x) = \frac{P(x|C) \cdot P(C)}{P(x)}$$

  • $P(C|x)$ (Posterior): Prob. of class given features.
  • $P(x|C)$ (Likelihood): Prob. of features given class.
  • $P(C)$ (Prior): Baseline prob. of the class.
  • $P(x)$ (Evidence): Total prob. of the features.

Why is it "Naïve"?

It assumes conditional independence. It treats every feature (e.g., words in an email) as if it has no relationship with any other feature.

The Apple Example: If an object is red, round, and 3 inches wide, Naïve Bayes treats these three as independent factors, even though they usually occur together in an apple.

2. Real-World Case: Spam Filtering

In cybersecurity, this is the "gold standard" for identifying malicious intent in communications.

Folder High Frequency Keywords
Spam Winner, Free, Cash, Urgent, Money
Ham (Legit) Meeting, Project, Attached, Lunch

Other Common Uses:

  • 🎭 Sentiment Analysis: Movie/Product reviews.
  • 🩺 Medical Diagnosis: Symptoms to disease.
  • ☁️ Weather Prediction: Sunny vs. Rainy.
  • 🚨 Security: Intrusion detection.

Section 4: Naïve Bayes & Probability Logic

Essential for Spam Filtering and Intrusion Detection Systems (IDS).

16. Spam Prediction: Based on the dataset provided, predict if "Congratulations! You have won money quickly" is Spam or Ham.
Ans: Spam. Keywords appear predominantly in the Spam labels in the training set.
17. Prior Probability: Find $P(Spam)$ and $P(Not Spam)$.
Ans: $P(Spam) = 0.5$; $P(Not Spam) = 0.5$.
18. Conditional Probability: Find $P("money" | Spam)$.
Ans: 1 out of 2 Spam emails contains "money". $P = \mathbf{0.5}$.
19. Bayesian Updating (The False Positive Paradox): If a scan is 99% accurate but only 1 in 10,000 files is infected, what is the probability a file is actually infected if the scan is positive?
Ans: $\approx \mathbf{0.98\%}$. This highlights why high accuracy isn't enough when the "base rate" of infection is very low.
20. Independence Assumption: What is the core "naïve" assumption in this classifier?
Ans: It assumes all features (words) are independent of each other given the class label.

Advanced: Naïve Bayes Mechanics & Smoothing

1. The Mathematical Foundation

Naïve Bayes uses Bayes' Theorem to calculate the "Posterior Probability." Because it assumes features are independent, the math simplifies into a product of individual probabilities.

The Chain Rule for Multiple Features:

$P(C|x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i|C)$

2. Laplace Smoothing (Solving the Zero Frequency Problem)

If a word in the test set never appeared in your training data, the probability $P(x|C)$ becomes $0$. Since we multiply these values, one zero ruins the entire calculation.

Laplace Smoothing Formula:

$$P(x_i|C) = \frac{\text{count}(x_i, C) + \alpha}{\text{count}(C) + \alpha \cdot |V|}$$

  • $\alpha$: Usually $1$ (Add-one smoothing).
  • $|V|$: Total unique features (Vocabulary size).

3. Types of Naïve Bayes Models

Choosing the right model depends on the type of data you are analyzing:

[Image comparing Gaussian Multinomial and Bernoulli Naive Bayes models]
Model Type Best Used For...
Gaussian Continuous data (measurements like height, weight, or temperature) following a Normal Distribution.
Multinomial Text classification based on word counts (how many times "Cash" appears).
Bernoulli Binary features. It only cares if a word is Present or Absent (0 or 1).

Key Advantages Summary:

  • Fast: Independent calculations are computationally cheap.
  • Robust: Handles missing data and small datasets well.
  • Scalable: Performs exceptionally well with high-dimensional data like text.

Bayes' Theorem: Medical Diagnosis

Problem Statement

Question 4

A new test is developed to detect a rare genetic disorder. The disorder affects 0.1% of the population. The test has:

  • Sensitivity (true positive rate): 98%
  • Specificity (true negative rate): 97%

Apply Bayes' theorem to identify the probability that a person has the disorder if their test result is positive.


Goal: Calculate the Posterior Probability, which is P(Disease | Positive Test).

Step 1: Define Events & Probabilities

Let's convert the percentages into probabilities (decimals) and define our terms.

Term Symbol Value
Prevalence (Prior) P(Disease) 0.1% = 0.001
No Disease P(No Disease) 1 - 0.001 = 0.999
Sensitivity P(Pos | Disease) 98% = 0.98
Specificity P(Neg | No Disease) 97% = 0.97
Crucial Step: Calculate the False Positive Rate.
P(Pos | No Disease) = 1 - Specificity = 1 - 0.97 = 0.03

Step 2: Apply Bayes' Theorem

The formula for the probability of having the disease given a positive test is:

P(Disease | Pos) = [P(Pos | Disease) × P(Disease)] / P(Pos)

Expand the denominator P(Pos):
P(Pos) = [True Positives] + [False Positives]
P(Pos) = [P(Pos|Disease) × P(Disease)] + [P(Pos|No Disease) × P(No Disease)]

Step 3: Calculation

1. Numerator (True Positives)

0.98 × 0.001

= 0.00098

2. False Positives

0.03 × 0.999

= 0.02997

3. Final Division

Total Probability of Positive Test = 0.00098 + 0.02997 = 0.03095

P(Disease | Pos) = 0.00098 ÷ 0.03095

≈ 0.03166

Final Answer

The probability that a person actually has the disorder given a positive test result is approximately:

3.17%

(Note: This low probability, despite high sensitivity/specificity, is due to the "Base Rate Fallacy" — the disease is extremely rare.)
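The three steps above condense into a few lines of Python (a sketch of the same arithmetic, nothing more):

```python
prevalence = 0.001    # P(Disease)
sensitivity = 0.98    # P(Pos | Disease)
specificity = 0.97    # P(Neg | No Disease)

# Total probability of a positive test: true positives + false positives
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_pos  # Bayes' theorem
print(round(posterior, 4))  # 0.0317
```

Re-running with a higher prevalence (say 0.1) shows the posterior jump dramatically, which is the base rate effect in action.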

Discrete Random Variables

Q5) Probability Mass Function

Given: A machine learning model predicts the number of errors (X) in a document. X takes values {0, 1, 2, 3} with the following PMF:

  • P(X=0) = k
  • P(X=1) = 2k
  • P(X=2) = 3k
  • P(X=3) = 4k
  • P(X=other) = 0

a) Find the value of k

Rule: The sum of all probabilities in a valid probability mass function must equal 1.

$\sum P(X=x) = 1$

$P(0) + P(1) + P(2) + P(3) = 1$

$k + 2k + 3k + 4k = 1$

$10k = 1$

Answer: $k = \frac{1}{10} = 0.1$

b) Compute Expectation E(X)

Formula: $E(X) = \sum [x \cdot P(X=x)]$

x P(x) x · P(x)
0 0.1 0
1 0.2 0.2
2 0.3 0.6
3 0.4 1.2

Total Sum = $0 + 0.2 + 0.6 + 1.2$

Answer: $E(X) = 2.0$

c) Meaning of E(X) in context

The expected value represents the long-term average.

Interpretation: On average, we expect to find 2 errors per document if we examine a large number of documents using this model.

d) Compute Variance(2X + 3)

Step 1: Calculate $E(X^2)$

$E(X^2) = \sum [x^2 \cdot P(x)]$

  • $0^2(0.1) = 0$
  • $1^2(0.2) = 0.2$
  • $2^2(0.3) = 4(0.3) = 1.2$
  • $3^2(0.4) = 9(0.4) = 3.6$

$E(X^2) = 0 + 0.2 + 1.2 + 3.6 = \mathbf{5.0}$

Step 2: Calculate Variance of X, $Var(X)$

Formula: $Var(X) = E(X^2) - [E(X)]^2$

$Var(X) = 5.0 - (2.0)^2$

$Var(X) = 5.0 - 4.0 = \mathbf{1.0}$

Step 3: Apply Variance Property

Property: $Var(aX + b) = a^2 Var(X)$

Here, $a = 2$ and $b = 3$.

$Var(2X + 3) = 2^2 \times Var(X)$

$= 4 \times 1.0$

Final Answer: The Variance is 4.0.
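All four parts can be checked numerically with a small Python sketch (the PMF is hard-coded with k = 0.1 found in part a):

```python
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}  # P(X=x) = (x+1) * k with k = 0.1
assert abs(sum(pmf.values()) - 1) < 1e-12  # valid PMF sums to 1

E = sum(x * p for x, p in pmf.items())      # E(X)
E2 = sum(x**2 * p for x, p in pmf.items())  # E(X^2)
var = E2 - E**2                             # Var(X) = E(X^2) - [E(X)]^2
var_2x_plus_3 = 2**2 * var                  # Var(aX + b) = a^2 Var(X)

print(round(E, 1), round(var, 1), round(var_2x_plus_3, 1))  # 2.0 1.0 4.0
```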

Joint Probability & Expectation

Q6) Joint Probability Distribution

Experiment: A fair coin is tossed 3 times.

  • X: Random variable for the number of tails.
  • Y: Random variable for winnings based on the position of the 1st tail.

Rules for Y:

  • 1st Tail on Toss 1: Win $3 (Y=3)
  • 1st Tail on Toss 2: Win $2 (Y=2)
  • 1st Tail on Toss 3: Win $1 (Y=1)
  • No Tail: Lose $2 (Y=-2)

Step 1: Analyze Sample Space

There are $2^3 = 8$ possible outcomes. Let's list them and determine X and Y for each.

Outcome   X (Count Tails)   1st Tail Pos   Y (Winnings)
H H H     0                 None           -2
H H T     1                 3rd            1
H T H     1                 2nd            2
H T T     2                 2nd            2
T H H     1                 1st            3
T H T     2                 1st            3
T T H     2                 1st            3
T T T     3                 1st            3

i) Joint Probability Function P(X, Y)

We group the outcomes by their (X, Y) pairs. Since each individual outcome has a probability of 1/8:

X \ Y   -2          1           2           3                Total P(X)
0       1/8 (HHH)   0           0           0                1/8
1       0           1/8 (HHT)   1/8 (HTH)   1/8 (THH)        3/8
2       0           0           1/8 (HTT)   2/8 (THT, TTH)   3/8
3       0           0           0           1/8 (TTT)        1/8

Interpretation: The table shows the probability of obtaining a specific number of tails (X) and a specific winning amount (Y) simultaneously. For example, P(X=2, Y=3) = 2/8 means there is a 25% chance of getting exactly 2 tails where the first tail occurs on the first toss.

ii) Marginal Probability Function for X

This is obtained by summing the probabilities across the rows (as shown in the "Total P(X)" column above).

  • P(X=0) = 1/8 = 0.125
  • P(X=1) = 3/8 = 0.375
  • P(X=2) = 3/8 = 0.375
  • P(X=3) = 1/8 = 0.125

iii) Expectation E(X)

Formula: $E(X) = \sum [x \cdot P(X=x)]$

$E(X) = (0 \times \frac{1}{8}) + (1 \times \frac{3}{8}) + (2 \times \frac{3}{8}) + (3 \times \frac{1}{8})$

$E(X) = 0 + \frac{3}{8} + \frac{6}{8} + \frac{3}{8}$

$E(X) = \frac{12}{8} = 1.5$

Answer: $E(X) = 1.5$
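The joint table, marginal, and expectation can all be generated by enumerating the 8 equally likely outcomes (a sketch using exact fractions to match the table; variable names are my own):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

joint = Counter()
for toss in product("HT", repeat=3):
    x = toss.count("T")              # X: number of tails
    if "T" in toss:
        y = 3 - toss.index("T")      # 1st tail on toss 1/2/3 -> win 3/2/1
    else:
        y = -2                       # no tail -> lose $2
    joint[(x, y)] += Fraction(1, 8)

print(joint[(2, 3)])  # 1/4  (outcomes THT and TTH)

marginal_x = Counter()
for (x, y), p in joint.items():
    marginal_x[x] += p

E_X = sum(x * p for x, p in marginal_x.items())
print(E_X)  # 3/2
```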

Probability & Conditional Events

Question 2

A study on ethics in the workplace by the Ethics Resource Center and Kronos, Inc., revealed that 35% of employees admit to keeping quiet when they see co-worker misconduct. Suppose 75% of employees who admit to keeping quiet when they see co-worker misconduct call in sick when they are well. In addition, suppose that 40% of the employees who call in sick when they are well admit to keeping quiet when they see co-worker misconduct. If an employee is randomly selected, determine the following probabilities:

  1. The employee calls in sick when well and admits to keeping quiet when seeing co-worker misconduct.
  2. The employee admits to keeping quiet when seeing co-worker misconduct or calls in sick when well.
  3. Given that the employee calls in sick when well, he or she does not keep quiet when seeing co-worker misconduct.
  4. The employee neither keeps quiet when seeing co-worker misconduct nor calls in sick when well.
  5. The employee admits to keeping quiet when seeing co-worker misconduct and does not call in sick when well.

Let's define the events based on the text:
  • Event Q: Employee keeps quiet about misconduct. ($P(Q) = 0.35$)
  • Event S: Employee calls in sick when well.
  • Conditional 1: "75% of employees who keep quiet... call in sick" $\rightarrow P(S|Q) = 0.75$
  • Conditional 2: "40% of employees who call in sick... keep quiet" $\rightarrow P(Q|S) = 0.40$

a) P(S and Q)

This asks for the intersection: The employee calls in sick AND keeps quiet.

Formula: $P(S \cap Q) = P(S|Q) \times P(Q)$

Calculation: $0.75 \times 0.35 = \mathbf{0.2625}$

Answer: 0.2625 (or 26.25%)
Intermediate Step: Find P(S)
We need the total probability of calling in sick $P(S)$ for the next steps.
Since $P(Q|S) = \frac{P(S \cap Q)}{P(S)}$, we can rearrange to find $P(S)$:
$P(S) = \frac{P(S \cap Q)}{P(Q|S)} = \frac{0.2625}{0.40} = \mathbf{0.65625}$

b) P(Q or S)

This asks for the union: Keeps quiet OR calls in sick.

Formula: $P(Q \cup S) = P(Q) + P(S) - P(Q \cap S)$

Calculation: $0.35 + 0.65625 - 0.2625 = \mathbf{0.74375}$

Answer: 0.74375 (or 74.38%)

c) P(not Q | S)

Given they called in sick ($S$), what is the probability they do NOT keep quiet ($Q'$)?

Formula: $P(Q'|S) = 1 - P(Q|S)$

Calculation: $1 - 0.40 = \mathbf{0.60}$

Answer: 0.60 (or 60%)

d) P(not Q and not S)

Neither keeps quiet nor calls in sick. This is the complement of the union ($Q \cup S$).

Formula: $P(Q' \cap S') = 1 - P(Q \cup S)$

Calculation: $1 - 0.74375 = \mathbf{0.25625}$

Answer: 0.25625 (or 25.63%)

e) P(Q and not S)

Admits to keeping quiet ($Q$) but does NOT call in sick ($S'$). This is "Q only".

Formula: $P(Q \cap S') = P(Q) - P(Q \cap S)$

Calculation: $0.35 - 0.2625 = \mathbf{0.0875}$

Answer: 0.0875 (or 8.75%)
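Parts a) through e) follow from the three given probabilities, as this short Python sketch of the same steps shows (variable names are my own):

```python
P_Q = 0.35            # P(keeps quiet)
P_S_given_Q = 0.75    # P(calls in sick | keeps quiet)
P_Q_given_S = 0.40    # P(keeps quiet | calls in sick)

P_S_and_Q = P_S_given_Q * P_Q        # a) intersection
P_S = P_S_and_Q / P_Q_given_S        # rearranged conditional: total P(S)
P_Q_or_S = P_Q + P_S - P_S_and_Q     # b) union
P_notQ_given_S = 1 - P_Q_given_S     # c) complement of a conditional
P_neither = 1 - P_Q_or_S             # d) complement of the union
P_Q_only = P_Q - P_S_and_Q           # e) Q without S

print(round(P_S_and_Q, 4))  # 0.2625
print(round(P_Q_or_S, 5))   # 0.74375
print(round(P_neither, 5))  # 0.25625
print(round(P_Q_only, 4))   # 0.0875
```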

📚 Recommended Reference

For conditional probability and Bayes' theorem, OpenStax Statistics is a great resource.

Probability & Statistics Analysis

Part a) Poisson Distribution

Question 3 a)

Based on past experience, it is assumed that the number of flaws per foot in rolls of grade 2 paper follows a Poisson distribution with a mean of 0.2 flaw per foot. What is the probability that in a:

  1. 1-foot roll, there will be at least 2 flaws?
  2. 50-foot roll, there will be greater than or equal to 5 flaws and less than or equal to 8 flaws

i) 1-foot roll, at least 2 flaws

Parameter: Since the roll is 1 foot long, the mean ($\lambda$) remains 0.2.

Goal: Find $P(X \ge 2)$. Using the complement rule: $1 - P(X < 2) = 1 - [P(0) + P(1)]$

Formula: $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}$

  • $P(0) = \frac{e^{-0.2} (0.2)^0}{0!} = 0.81873$
  • $P(1) = \frac{e^{-0.2} (0.2)^1}{1!} = 0.16375$

Calculation: $1 - (0.81873 + 0.16375) = 1 - 0.98248$

Probability: 0.0175 (or 1.75%)
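The same tail probability can be computed without intermediate rounding (a minimal Python sketch):

```python
import math

lam = 0.2  # mean flaws per foot

def poisson_pmf(k, lam):
    """Poisson PMF: e^{-lam} * lam^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

# P(X >= 2) = 1 - P(0) - P(1), by the complement rule
p_at_least_2 = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_at_least_2, 4))  # → 0.0175
```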

ii) 50-foot roll, 5 to 8 flaws

Parameter: For 50 feet, the new mean is $\lambda = 0.2 \times 50 = 10$.

Goal: Find $P(5 \le X \le 8)$. This is the sum of probabilities for X = 5, 6, 7, and 8.

Using cumulative probability logic: $P(X \le 8) - P(X \le 4)$

Calculation via Poisson CDF with $\lambda=10$:
Probability: 0.3036 (or 30.36%)
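The cumulative difference $P(X \le 8) - P(X \le 4)$ is just the sum of the PMF terms for $k = 5, \dots, 8$, which can be checked directly (Python sketch):

```python
import math

lam = 10  # mean for a 50-foot roll: 0.2 flaws/foot * 50 ft

def poisson_pmf(k, lam):
    """Poisson PMF: e^{-lam} * lam^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

# P(5 <= X <= 8) = sum of PMF terms for k = 5..8
p_5_to_8 = sum(poisson_pmf(k, lam) for k in range(5, 9))
print(round(p_5_to_8, 4))  # → 0.3036
```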

Part b) Law of Total Probability

Question 3 b)

An individual has 3 different email accounts. Most of her messages, in fact 70%, come into account #1, whereas 20% come into account #2 and the remaining 10% into account #3. Of the messages into account #1, only 1% are spam, whereas the corresponding percentages for accounts #2 and #3 are 2% and 5%, respectively. What is the probability that a randomly selected message is spam?


Calculation

We use the Law of Total Probability to find the overall probability of Spam ($S$).

Let $A_i$ be the event the message is from Account $i$.

  • Account 1: $P(S|A_1)P(A_1) = 0.01 \times 0.70 = 0.007$
  • Account 2: $P(S|A_2)P(A_2) = 0.02 \times 0.20 = 0.004$
  • Account 3: $P(S|A_3)P(A_3) = 0.05 \times 0.10 = 0.005$
Summing these up:

$0.007 + 0.004 + 0.005 = 0.016$

Probability: 0.016 (or 1.6%)
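The weighted sum above is the whole computation; as a one-glance Python check:

```python
# Law of Total Probability: P(S) = sum_i P(S|A_i) * P(A_i)
priors = {1: 0.70, 2: 0.20, 3: 0.10}      # P(A_i): share of messages per account
spam_given = {1: 0.01, 2: 0.02, 3: 0.05}  # P(S|A_i): spam rate per account

p_spam = sum(spam_given[i] * priors[i] for i in priors)
print(round(p_spam, 3))  # → 0.016
```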

📚 Recommended Reference Material

For Poisson distributions and Total Probability, OpenStax Introductory Statistics is an excellent free resource.

  • Chapter 4.6: Poisson Distribution - Explains the formula for events occurring over time/space.
  • Chapter 3.2: Independent and Mutually Exclusive Events (Total Probability rules).
  • View Online Textbook (Poisson)

Joint & Hypergeometric Distributions

4. a) Joint Probability Mass Function

Given: Joint PMF $P(X=x, Y=y) = k(x^2 + y^2)$.

Note: The specific range for x and y was missing from your prompt. I will solve this assuming the standard small range $x \in \{1, 2\}$ and $y \in \{1, 2\}$ to demonstrate the method.


i) Find the value of constant k

Rule: The sum of all probabilities in a joint distribution must equal 1.

$\sum \sum P(x, y) = 1$

Using range $x \in \{1, 2\}, y \in \{1, 2\}$:

Sum = P(1,1) + P(1,2) + P(2,1) + P(2,2)
1 = k(1²+1²) + k(1²+2²) + k(2²+1²) + k(2²+2²)
1 = k(2) + k(5) + k(5) + k(8)
1 = 20k

Answer: $k = \frac{1}{20}$ (or 0.05)

ii) Find Marginal Probability Functions

For X (Sum over Y):

  • $P_X(1) = P(1,1) + P(1,2) = 2k + 5k = 7k = \mathbf{7/20}$
  • $P_X(2) = P(2,1) + P(2,2) = 5k + 8k = 13k = \mathbf{13/20}$

For Y (Sum over X):

  • $P_Y(1) = P(1,1) + P(2,1) = 2k + 5k = 7k = \mathbf{7/20}$
  • $P_Y(2) = P(1,2) + P(2,2) = 5k + 8k = 13k = \mathbf{13/20}$
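Under the same assumed range $x, y \in \{1, 2\}$, the normalizing constant and both marginals can be verified with exact fractions:

```python
from fractions import Fraction

# Joint PMF p(x, y) = k * (x^2 + y^2), assumed range x, y in {1, 2}
weights = {(x, y): x**2 + y**2 for x in (1, 2) for y in (1, 2)}
k = Fraction(1, sum(weights.values()))  # probabilities must sum to 1
pmf = {xy: k * w for xy, w in weights.items()}

# Marginals: sum the joint PMF over the other variable
marg_x = {x: pmf[(x, 1)] + pmf[(x, 2)] for x in (1, 2)}
marg_y = {y: pmf[(1, y)] + pmf[(2, y)] for y in (1, 2)}
print(k, marg_x, marg_y)
```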

4. b) Hypergeometric Distribution

Given: Lot of 10 items, 3 defective. Sample of 4 drawn without replacement.

Find: The probability distribution of X (number of defectives).


iii) Probability Calculation

Parameters: Total ($N=10$), Defective ($K=3$), Good ($7$), Sample ($n=4$).

Formula: $P(X=x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$

Denominator (Total Ways): $\binom{10}{4} = \frac{10 \times 9 \times 8 \times 7}{4 \times 3 \times 2 \times 1} = 210$

| X (Defectives) | Calculation | Probability |
|---|---|---|
| 0 | $\frac{\binom{3}{0}\binom{7}{4}}{210} = \frac{1 \times 35}{210}$ | 0.1667 (1/6) |
| 1 | $\frac{\binom{3}{1}\binom{7}{3}}{210} = \frac{3 \times 35}{210}$ | 0.5000 (1/2) |
| 2 | $\frac{\binom{3}{2}\binom{7}{2}}{210} = \frac{3 \times 21}{210}$ | 0.3000 (3/10) |
| 3 | $\frac{\binom{3}{3}\binom{7}{1}}{210} = \frac{1 \times 7}{210}$ | 0.0333 (1/30) |
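The whole distribution can be generated with `math.comb` (a short Python sketch of the hypergeometric formula above):

```python
from math import comb

N, K, n = 10, 3, 4   # lot size, defectives in lot, sample size
denom = comb(N, n)   # 210 equally likely samples

# P(X = x) = C(K, x) * C(N-K, n-x) / C(N, n) for x = 0..3
dist = {x: comb(K, x) * comb(N - K, n - x) / denom for x in range(K + 1)}
for x, p in dist.items():
    print(x, round(p, 4))
```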

📚 Recommended Reference Material

For Joint distributions and Sampling Without Replacement, refer to OpenStax Introductory Statistics.

Weather Prediction Analysis

Q5) If the weather is Snowy...

Question: If the weather is Snowy, will the player play or not?


Step 1: The Full Dataset

Here is the full dataset provided in the image, with the left and right columns combined into a single list to show the total history.

| # | Weather | Play? |
|---|---------|-------|
| 1 | Sunny | Yes |
| 2 | Rainy | No |
| 3 | Cloudy | Yes |
| 4 | Sunny | No |
| 5 | Sunny | Yes |
| 6 | Snowy | No |
| 7 | Rainy | No |
| 8 | Cloudy | Yes |
| 9 | Sunny | Yes |
| 10 | Snowy | No |
| 11 | Cloudy | Yes |
| 12 | Rainy | No |
| 13 | Snowy | No |
| 14 | Snowy | Yes |

Step 2: Filter for "Snowy" Condition

The question asks specifically about the condition "If the weather is Snowy". Therefore, we filter the dataset to look only at the highlighted rows where Weather = Snowy. The Sunny, Rainy, and Cloudy days are ignored for this specific calculation because they are not the current weather condition.

Filtered instances:

| # | Weather | Play? |
|---|---------|-------|
| 6 | Snowy | No |
| 10 | Snowy | No |
| 13 | Snowy | No |
| 14 | Snowy | Yes |

Step 3: Calculate Probabilities

From the filtered table above:

  • Total "Snowy" days = 4
  • Days where Play = "No" = 3
  • Days where Play = "Yes" = 1

$P(\text{Play=No} | \text{Snowy}) = \frac{3}{4} = 0.75$ (75%)

$P(\text{Play=Yes} | \text{Snowy}) = \frac{1}{4} = 0.25$ (25%)
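The filter-and-count steps above are exactly what a Naïve Bayes style frequency table does; a minimal Python sketch:

```python
# Full dataset as (weather, play) pairs, transcribed from the table above
data = [
    ("Sunny", "Yes"), ("Rainy", "No"), ("Cloudy", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Snowy", "No"), ("Rainy", "No"), ("Cloudy", "Yes"),
    ("Sunny", "Yes"), ("Snowy", "No"), ("Cloudy", "Yes"), ("Rainy", "No"),
    ("Snowy", "No"), ("Snowy", "Yes"),
]

# Filter to the condition Weather = Snowy, then count outcomes
snowy = [play for weather, play in data if weather == "Snowy"]
p_no = snowy.count("No") / len(snowy)
p_yes = snowy.count("Yes") / len(snowy)
print(p_no, p_yes)  # → 0.75 0.25
```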

Conclusion

Since the probability of No (0.75) is significantly higher than Yes (0.25), the prediction is:

The player will NOT play.

📚 Recommended Reference Material

This problem demonstrates Conditional Probability, often used in Naïve Bayes classifiers.

  • OpenStax Introductory Statistics:
    • Chapter 3.1: Terminology (Conditionals).
    • Chapter 3.2: Independent and Mutually Exclusive Events.
  • View Probability Chapter

Bayes' Theorem & Normal Distribution

Part a) Incidence of a Rare Disease

Given:

  • Prevalence $P(D) = 1/1000 = 0.001$
  • Sensitivity (Positive if Disease) $P(Pos|D) = 0.99$
  • False Positive Rate (Positive if Healthy) $P(Pos|D') = 0.02$

Step 1: Calculate Probability of Positive Test

Using the Law of Total Probability:

$P(Pos) = P(Pos|D)P(D) + P(Pos|D')P(D')$

Note: $P(D') = 1 - 0.001 = 0.999$

$P(Pos) = (0.99 \times 0.001) + (0.02 \times 0.999)$
$P(Pos) = 0.00099 + 0.01998$
$P(Pos) = 0.02097$

Step 2: Apply Bayes' Theorem

We need to find $P(D|Pos)$.

$$P(D|Pos) = \frac{P(Pos|D) \times P(D)}{P(Pos)}$$

$$P(D|Pos) = \frac{0.00099}{0.02097}$$

Probability: 0.0472 (or 4.72%)
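Both steps (total probability, then Bayes' theorem) fit in a few lines of Python, using the given rates:

```python
p_d = 0.001  # prevalence P(D)
sens = 0.99  # sensitivity P(Pos|D)
fpr = 0.02   # false-positive rate P(Pos|D')

p_pos = sens * p_d + fpr * (1 - p_d)  # Law of Total Probability
p_d_given_pos = sens * p_d / p_pos    # Bayes' theorem
print(round(p_pos, 5), round(p_d_given_pos, 4))  # → 0.02097 0.0472
```

Note how low the posterior is despite the 99% sensitivity: the disease is so rare that false positives from the healthy majority dominate.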

Part b) Electrical Resistors

Given: Normal Distribution with Mean ($\mu$) = 40 ohms, Standard Deviation ($\sigma$) = 2 ohms.


i) Percentage Exceeding 43 Ohms

Find $P(X > 43)$.

Z-Score Formula: $Z = \frac{X - \mu}{\sigma}$

$Z = \frac{43 - 40}{2} = \frac{3}{2} = 1.5$

From Z-table, area to the left of Z=1.5 is approx 0.9332.

Area to the right = $1 - 0.9332 = 0.0668$

Percentage: 6.68%

ii) Percentage Exceeding 43 Ohms (Nearest Ohm)

If measured to the nearest ohm, a value "exceeds 43" if the rounded integer is 44 or higher.

The boundary for rounding to 44 starts at 43.5.

So, we calculate $P(X \ge 43.5)$.

Z-Score: $Z = \frac{43.5 - 40}{2} = \frac{3.5}{2} = 1.75$

From Z-table, area to the left of Z=1.75 is approx 0.9599.

Area to the right = $1 - 0.9599 = 0.0401$

Percentage: 4.01%
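Both tail areas can be computed without a Z-table by using the error function for the standard normal CDF (a Python sketch):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 40, 2  # resistance mean and standard deviation (ohms)

# i) P(X > 43)
print(round(1 - phi((43 - mu) / sigma), 4))    # → 0.0668

# ii) P(X >= 43.5): a nearest-ohm reading of 44 or more starts at 43.5
print(round(1 - phi((43.5 - mu) / sigma), 4))  # → 0.0401
```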

📚 Recommended Reference Material

OpenStax Introductory Statistics: