Questions and Answers
Part a) Analysis of Weights
A sample of eleven weights (in kg) is given: 60, 72, 65, 68, 70, 100, 62, 75, 78, 80, 83.
- Calculate Q1, Q2, Q3, and IQR.
- Determine potential outliers using the 1.5 × IQR rule.
Goal: To find the quartiles (spread) and check for any unusual values (outliers).
i) Calculate Q1, Q2, Q3, and IQR
Step 1: Sort the data.
60, 62, 65, 68, 70, 72, 75, 78, 80, 83, 100
(There are 11 numbers total)
Step 2: Find the Median (Q2).
The exact middle number is the 6th number.
Q2 (Median) = 72
Step 3: Find the First Quartile (Q1).
Q1 is the middle of the lower half (numbers left of 72): 60, 62, 65, 68, 70.
Q1 = 65
Step 4: Find the Third Quartile (Q3).
Q3 is the middle of the upper half (numbers right of 72): 75, 78, 80, 83, 100.
Q3 = 80
Step 5: Calculate Interquartile Range (IQR).
This is the spread of the middle 50%.
IQR = Q3 - Q1 = 80 - 65
IQR = 15
ii) Check for Outliers
The Rule: A value is an outlier if it lies more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
- Fence Distance: 1.5 × 15 = 22.5
- Lower Limit: Q1 - 22.5 = 65 - 22.5 = 42.5
- Upper Limit: Q3 + 22.5 = 80 + 22.5 = 102.5
Conclusion: Looking at our data (60 to 100):
- Is anything below 42.5? No.
- Is anything above 102.5? No. (100 is close, but safe!)
There are no outliers.
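The steps above can be sketched in Python. This is a minimal illustration, assuming the median-of-halves convention used in this solution (for an odd count, the median is left out of both halves); the helper names `quartiles` and `iqr_outliers` are made up for this example.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of values the median itself is excluded from
    both halves, matching the worked solution above.
    """
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return median(lower), median(s), median(upper)

def iqr_outliers(data, k=1.5):
    """Return (lower_fence, upper_fence, outliers) for the k x IQR rule."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [x for x in data if x < low or x > high]

weights = [60, 72, 65, 68, 70, 100, 62, 75, 78, 80, 83]
q1, q2, q3 = quartiles(weights)
low, high, outliers = iqr_outliers(weights)
print(q1, q2, q3, q3 - q1)   # 65 72 80 15
print(low, high, outliers)   # 42.5 102.5 []
```

Other tools (spreadsheets, statistics libraries) often use interpolation-based quartile definitions, so their Q1 and Q3 can differ slightly from the hand method shown here.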
Part A: Basic Computational Speed (10 Questions)
Find $Q1, Q2, Q3, IQR$, and check for Outliers.
- Data: 10, 12, 15, 18, 20, 22, 25, 30, 80
Ans: $Q1=13.5, Q2=20, Q3=27.5, IQR=14$. Outlier: 80.
- Data: 5, 5, 6, 7, 8, 8, 9, 10, 12, 15
Ans: $Q1=6, Q2=8, Q3=10, IQR=4$. Outlier: None.
- Data: 100, 110, 120, 130, 140, 150, 160, 250
Ans: $Q1=115, Q2=135, Q3=155, IQR=40$. Outlier: 250.
- Data: 40, 42, 45, 47, 49, 50, 52, 55, 58, 60, 110
Ans: $Q1=45, Q2=50, Q3=58, IQR=13$. Outlier: 110.
- Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Ans: $Q1=3, Q2=5.5, Q3=8, IQR=5$. Outlier: None.
- Data: 22, 24, 26, 28, 30, 32, 34, 36, 100, 110
Ans: $Q1=26, Q2=31, Q3=36, IQR=10$. Outliers: 100, 110.
- Data: 50, 55, 60, 65, 70, 75, 80
Ans: $Q1=55, Q2=65, Q3=75, IQR=20$. Outlier: None.
- Data: 0, 45, 46, 47, 48, 49, 50, 95
Ans: $Q1=45.5, Q2=47.5, Q3=49.5, IQR=4$. Outliers: 0, 95.
- Data: 15, 15, 15, 15, 15, 15, 15, 45
Ans: $Q1=15, Q2=15, Q3=15, IQR=0$. Outlier: 45.
- Data: 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32
Ans: $Q1=16, Q2=22, Q3=28, IQR=12$. Outlier: None.
Part B: Conceptual ISM Application (10 Questions)
Think about these in the context of network latency or login attempts.
- Latency spikes: 20ms, 22ms, 21ms, 23ms, 20ms, 200ms. Is 200ms a statistical outlier?
Ans: $Q1=20, Q3=23, IQR=3$. Upper bound = $27.5$. Yes, it is an outlier.
- Login failures: 2, 3, 1, 4, 2, 30, 2, 1. Is 30 a brute force indicator (outlier)?
Ans: Sorted: 1, 1, 2, 2, 2, 3, 4, 30. $Q1=1.5, Q3=3.5, IQR=2$. Upper bound = $6.5$. Yes, 30 is an outlier.
- Data Packet sizes: 500, 510, 505, 520, 10, 515. Is 10 a fragmented/malicious packet?
Ans: $Q1=500, Q3=515, IQR=15$. Lower bound = $477.5$. Yes, 10 is an outlier.
- System Uptime (days): 300, 305, 310, 20, 315, 320.
Ans: $Q1=300, Q3=315, IQR=15$. Lower bound = $277.5$. 20 is an outlier.
- User Access Times (hr): 9, 10, 11, 12, 13, 14, 23. Is the 11 PM access an outlier?
Ans: $Q1=10, Q3=14, IQR=4$. Upper bound = $20$. Yes, 23 is an outlier.
- Sensor Temperatures: 25, 26, 25, 27, 26, 45, 25.
Ans: $Q1=25, Q3=27, IQR=2$. Upper bound = $30$. Yes, 45 is an outlier.
- File Sizes (MB): 5, 7, 6, 8, 5, 50.
Ans: $Q1=5, Q3=8, IQR=3$. Upper bound = $12.5$. Yes, 50 is an outlier.
- Encryption Time (sec): 1.2, 1.3, 1.2, 1.4, 3.5.
Ans: $Q1=1.2, Q3=1.4, IQR=0.2$ (here the median 1.3 is included in both halves). Upper bound = $1.7$. Yes, 3.5 is an outlier. Note: if the median is excluded from the halves, as in Part a, then $Q3=2.45$, $IQR=1.25$, the upper bound is $4.325$, and 3.5 is not flagged; with only five values the verdict depends on the quartile convention.
- Download Speeds: 10, 12, 11, 13, 1, 12.
Ans: $Q1=10, Q3=12, IQR=2$. Lower bound = $7$. Yes, 1 is an outlier.
- Weekly Alerts: 100, 110, 105, 120, 115, 500.
Ans: $Q1=105, Q3=120, IQR=15$. Upper bound = $142.5$. Yes, 500 is an outlier.
Section A, B, & C: Theoretical & Case Study Practice
Section A: Theoretical Foundation
Question 1: Sensitivity
If a dataset of system response times has a mean of 50ms and a median of 45ms, and a single system crash causes one entry to be 10,000ms, which value ($Q2$ or the mean) will change more significantly?
Answer: The mean will change more significantly because it factors in every data point, so a single 10,000ms entry pulls it sharply upward. The median ($Q2$) shifts at most to an adjacent value and stays largely unchanged, which is why it gives a more reliable baseline when crashes produce extreme readings.
Question 2: The "Whisker" Logic
If the whiskers are very short compared to the box, what does that tell a security analyst about the predictability of the system?
Answer: Short whiskers indicate very little variance at the extremes. The system is highly predictable; even a small deviation beyond these whiskers is likely a legitimate anomaly worth investigating.
Section B: Computational Analysis
Question 3: Raw Data Analysis (Packet sizes: 12, 15, 17, 20, 22, 25, 30, 35, 40, 150)
- a) $Q1, Q2, Q3$: $Q2=23.5$, $Q1=17$, $Q3=35$.
- b) $IQR$: $35 - 17 = 18$.
- c) Outlier Check: Upper Fence $= 35 + (1.5 \times 18) = 62$. Since $150 > 62$, 150 is an outlier.
Question 4: Grouped Data Analysis (N=100)
| Time (Min) | Frequency (f) | Cumulative Freq (cf) |
|---|---|---|
| 0 - 10 | 15 | 15 |
| 10 - 20 (Median Class) | 45 | 60 |
| 20 - 30 | 30 | 90 |
| 30 - 40 | 10 | 100 |
$Q2$ Calculation: $10 + \left( \frac{50 - 15}{45} \right) \times 10 = 17.78$ mins.
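The interpolation used above is the standard grouped-data median formula, $L + \frac{N/2 - cf}{f} \times h$. As a rough sketch (the function and parameter names are hypothetical, chosen for this example), it can be checked like this:

```python
def grouped_median(lower_bound, n_total, cum_freq_before, class_freq, width):
    """Interpolated median for grouped data: L + ((N/2 - cf) / f) * h."""
    return lower_bound + (n_total / 2 - cum_freq_before) / class_freq * width

# Median class is 10 - 20: L = 10, N = 100, cf before the class = 15, f = 45, h = 10
q2_grouped = grouped_median(lower_bound=10, n_total=100,
                            cum_freq_before=15, class_freq=45, width=10)
print(round(q2_grouped, 2))  # 17.78
```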
Section C: Application & Case Study
Question 5: Comparing Encryption Algorithms
- Predictability: Protocol X ($IQR=10$) is more predictable than Protocol Y ($IQR=35$).
- Is 90% an outlier for Y? Upper Fence $= 45 + (1.5 \times 35) = 97.5$. Since $90 < 97.5$, it is not an outlier.
- Approval: Protocol X is approved for deployment because its tight CPU usage prevents resource starvation.
Part b) Astronaut Training Times
The following data shows the time (in minutes) it took for a group of astronauts to complete various training simulations: 18, 25, 19, 32, 27, 22, 29, 30, 24, 26.
Calculate the mean and standard deviation of these times.
Goal: Calculate the Mean (average) and Standard Deviation (spread).
(n = 10 numbers)
1. Calculate the Mean
Sum of all numbers = 252
Mean = Sum ÷ Count = 252 ÷ 10
Mean = 25.2 minutes
2. Calculate Standard Deviation
We look at how far each number is from the mean (25.2) and square it.
| Data Point (x) | Distance (x - 25.2) | Squared Distance |
|---|---|---|
| 18 | -7.2 | 51.84 |
| 25 | -0.2 | 0.04 |
| 19 | -6.2 | 38.44 |
| 32 | 6.8 | 46.24 |
| 27 | 1.8 | 3.24 |
| 22 | -3.2 | 10.24 |
| 29 | 3.8 | 14.44 |
| 30 | 4.8 | 23.04 |
| 24 | -1.2 | 1.44 |
| 26 | 0.8 | 0.64 |
| SUM | | 189.6 |
Formula Step 1 (Variance): Sum ÷ (n - 1)
189.6 ÷ 9 = 21.067
Formula Step 2 (Standard Deviation): Square root of Variance
√21.067 ≈ 4.59
Standard Deviation = 4.59 minutes
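The same calculation can be reproduced with Python's standard `statistics` module. This is only a cross-check of the hand calculation; `stdev` uses the sample (n - 1) formula, the same one applied above.

```python
from statistics import mean, stdev

times = [18, 25, 19, 32, 27, 22, 29, 30, 24, 26]

x_bar = mean(times)                                   # sum / count
squared_dev = sum((x - x_bar) ** 2 for x in times)    # sum of squared distances
s_manual = (squared_dev / (len(times) - 1)) ** 0.5    # divide by n - 1, then root
s = stdev(times)                                      # library sample std deviation

print(x_bar, round(squared_dev, 1), round(s, 2))      # 25.2 189.6 4.59
```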
Practice: Mean & Sample Standard Deviation (10 Questions)
Formulas Used:
Mean: $\bar{x} = \frac{\sum x}{n}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$
Find the Mean ($\bar{x}$) and Sample Standard Deviation ($s$) for the following sets:
- Data: 5, 10, 15, 20, 25
Ans: $\bar{x} = 15, s \approx 7.91$
- Data: 100, 102, 104, 106
Ans: $\bar{x} = 103, s \approx 2.58$
- Data: 2, 4, 6, 8, 10, 12
Ans: $\bar{x} = 7, s \approx 3.74$
- Data: 50, 50, 50, 50 (Zero variance test)
Ans: $\bar{x} = 50, s = 0$
- Data: 1, 9, 1, 9
Ans: $\bar{x} = 5, s \approx 4.62$
- Data: 12, 15, 18, 21, 24
Ans: $\bar{x} = 18, s \approx 4.74$
- Data: 45, 47, 52, 48, 50
Ans: $\bar{x} = 48.4, s \approx 2.70$
- Data: 10, 20, 30, 40, 50, 60, 70
Ans: $\bar{x} = 40, s \approx 21.60$
- Data: 3, 3, 4, 5, 5
Ans: $\bar{x} = 4, s = 1$
- Data: 0.5, 0.7, 0.9, 1.1
Ans: $\bar{x} = 0.8, s \approx 0.26$
📚 Recommended Reference Material
For a complete beginner, I highly recommend OpenStax Introductory Statistics. It is free, high-quality, and widely used.
- Download PDF for Free: OpenStax Introductory Statistics
- Read: Chapter 2.3 (Mean/Median) and Chapter 2.7 (Standard Deviation/Outliers)
Probability Distributions
Part a) Normal Distribution
Life of an instrument: Mean (μ) = 12 months, Standard Deviation (σ) = 2 months.
- Find probability it lasts less than 7 months.
- Find probability it lasts between 7 and 12 months.
Goal: Use the Z-score formula to standardize values and find probabilities from the Normal Table.
[Image: standard normal distribution curve]
i) Probability (Life < 7 months)
Step 1: Calculate the Z-score.
The Z-score tells us how many standard deviations "7" is away from the mean "12".
Z = (7 - 12) / 2
Z = -5 / 2 = -2.5
Step 2: Look up Z = -2.5 in the Standard Normal Table.
The area to the left of Z = -2.5 is the probability we want: P(X < 7) = P(Z < -2.5) = 0.0062.
Answer: 0.62% probability.
ii) Probability (7 < Life < 12 months)
Logic: We know the total area to the left of the mean (12) is 0.5 (50%). We just calculated the small tail area to the left of 7.
To find the area between 7 and 12, we subtract the small tail from the half.
- Probability (X < 12): 0.5000 (The Mean)
- Probability (X < 7): 0.0062 (From part i)
P(7 < X < 12) = 0.5000 - 0.0062 = 0.4938
Answer: 49.38% probability.
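In place of a printed Z-table, the left-tail area can be computed from the error function in Python's `math` module. The sketch below is one way to check both parts; `normal_cdf` is a helper name introduced for this example.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ Normal(mu, sigma), computed via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

p_below_7 = normal_cdf(7, mu=12, sigma=2)                # part i
p_7_to_12 = normal_cdf(12, mu=12, sigma=2) - p_below_7   # part ii

print(round(p_below_7, 4), round(p_7_to_12, 4))  # 0.0062 0.4938
```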
Topic: The Normal Distribution (Gaussian / Bell Curve)
The Normal Distribution is one of the most important probability distributions in statistics. It describes how the values of a variable are distributed and is recognizable by its symmetrical bell shape.
1. Theoretical Foundation
It is a continuous probability distribution symmetrical about the mean, defined by two parameters:
- 📍 Mean ($\mu$): The center of the distribution.
- 📏 Standard Deviation ($\sigma$): The spread of the data.
- Symmetry: Left and right halves are mirror images. Mean = Median = Mode.
- Asymptotic: The tails approach the horizontal axis but never touch it.
The Empirical Rule (68-95-99.7 Rule)
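This rule describes how much of the data falls within a given number of standard deviations of the mean:
- About 68% within $\pm 1\sigma$
- About 95% within $\pm 2\sigma$
- About 99.7% within $\pm 3\sigma$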
2. Real-World Application: Human Height
Imagine measuring the height of every adult man in a large city:
- The Mean: If average height is 5'9" (175 cm), most men cluster around this peak.
- The Spread: Fewer men are 6'2" or 5'4".
- The Extremes: Finding someone over 7ft or under 4ft is statistically rare.
Other Examples: SAT Scores, IQ tests, Manufacturing weights (e.g., 16oz cereal boxes), and random measurement errors.
Practice Set: Normal Distribution in Security Management
Assume a Normal Distribution for all calculations. Use the Z-score formula: $Z = \frac{X - \mu}{\sigma}$
Ans: $Z = 2$. $P(Z > 2) = 1 - 0.9772 = \mathbf{0.0228}$
Ans: $Z = -1$. $P(Z < -1) = \mathbf{0.1587}$
Ans: $P(-1 < Z < 1) \approx \mathbf{0.6826}$ (Empirical Rule)
Ans: $Z = 2$. $P(Z > 2) = \mathbf{0.0228}$
Ans: $Z = -2$. $P(Z < -2) = \mathbf{0.0228}$
Ans: $Z = 2$. $P(Z > 2) = \mathbf{0.0228}$
Ans: $\mathbf{0.6826}$ (Within 1 standard deviation)
Ans: $Z = -2$. $P(Z < -2) = \mathbf{0.0228}$
Ans: 0 (Probability of an exact point in continuous distributions is always 0)
Ans: $\mathbf{0.6826}$
Part b) Binomial vs. Poisson
Biased coin with p = 0.15 flipped n = 40 times.
Find the probability of getting exactly 5 heads using i) Binomial and ii) Poisson.
i) Binomial Distribution
Formula: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
- n = 40, k = 5, p = 0.15
- (1-p) = 0.85
Calculation:
$P(X=5) = \binom{40}{5} \times (0.15)^5 \times (0.85)^{35}$
$P(X=5) = 658{,}008 \times 0.0000759 \times 0.003386$
$P(X=5) \approx 0.169$ (about 16.9%)
ii) Poisson Approximation
Condition: Poisson estimates Binomial when 'n' is large and 'p' is small. We use the mean (λ) as the parameter.
λ = n × p
λ = 40 × 0.15 = 6
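Formula: $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}$
$P(X=5) = \frac{e^{-6} \times 6^5}{5!} = \frac{e^{-6} \times 6^5}{120}$
$P(X=5) = \frac{0.002479 \times 7776}{120} \approx 0.1606$ (about 16.1%)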
Note: The values are close (16.9% vs 16.1%), showing that Poisson is a reasonable approximation here.
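Both probabilities can be verified with a few lines of Python using only the standard `math` module. The function names `binomial_pmf` and `poisson_pmf` are illustrative helpers, not part of any library.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

n, p, k = 40, 0.15, 5
p_binom = binomial_pmf(k, n, p)
p_pois = poisson_pmf(k, n * p)   # lambda = n * p = 6

print(round(p_binom, 4), round(p_pois, 4))  # 0.1692 0.1606
```

The gap of under one percentage point is what one would expect here: p = 0.15 is only moderately small, so the Poisson figure is an approximation rather than an exact match.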
Section A: Binomial & Poisson (Discrete Distributions)
Practice calculations for discrete events like packet loss, malware arrivals, and system failures.
Decision Guide: Which Distribution to Use?
To identify the correct distribution, look for specific keywords and the nature of the data provided.
1. Normal Distribution
Look for: Continuous data, a provided Mean ($\mu$), and a Standard Deviation ($\sigma$).
Why? You have two specific parameters for continuous measurements (time).
2. Binomial Distribution
Look for: A fixed number of independent trials ($n$), a constant probability ($p$), and exactly two outcomes (Success/Failure).
Why? These deal with "Yes/No" outcomes over a set number of attempts.
3. Poisson Distribution
Look for: An average rate of occurrence ($\lambda$) over an interval (time/space) or a very large $n$ with a tiny $p$.
[Image: comparison between Binomial and Poisson distribution curves]
Why? Used for counting random events that happen at a known average rate.
| Distribution | Key Given Info |
|---|---|
| Normal | $\mu$ and $\sigma$ (Continuous) |
| Binomial | $n$ and $p$ (Discrete counts) |
| Poisson | $\lambda$ or (Large $n$ + Small $p$) |
📚 Essential "Sheet Codes" & Formulas (Exam Prep)
1. Normal Distribution (Continuous Data)
Use for time, weight, or length where $\mu$ and $\sigma$ are known.
Empirical Rule: 68% ($\pm 1\sigma$), 95% ($\pm 2\sigma$), 99.7% ($\pm 3\sigma$)
⌨️ Excel Code: =NORM.DIST(x, mean, standard_dev, TRUE)
2. Binomial Distribution (Discrete Counts)
Use for fixed trials ($n$) with "Success/Failure" outcomes ($p$).
Note: $q = 1 - p$ and $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
💡 Calculator Tip (Casio): Use the nCr button (SHIFT + ÷).
⌨️ Excel Code: =BINOM.DIST(k, n, p, FALSE)
3. Poisson Distribution (Average Rates)
Use for average rates ($\lambda$) over an interval.
Constants: $e \approx 2.718$ | $\lambda$ is often $n \times p$
⚠️ "At Least 1" Logic: $P(X \ge 1) = 1 - P(0)$
⌨️ Excel Code: =POISSON.DIST(k, mean, FALSE)
Quick Decision Matrix
| Keywords | Use Distribution | Essential Math |
|---|---|---|
| "Mean", "Standard Deviation", "Continuous" | Normal | $Z = \frac{x-\mu}{\sigma}$ |
| "Success/Failure", "Fixed ($n$)", "Yes/No" | Binomial | nCr Calculations |
| "On average", "Per hour", "Arrival rate" | Poisson | $\lambda$ (Lambda) |
📚 Recommended Reference
For probability distributions, OpenStax Statistics is excellent.
- Chapter 6: The Normal Distribution (Z-scores)
- Chapter 4: Discrete Random Variables (Binomial/Poisson)
Naive Bayes Classification
Problem Statement
A company wants to classify incoming emails as "Spam" or "Not Spam" (Ham) using a Naïve Bayes classifier. Given the training dataset below, predict the label for the new email:
"Congratulations! You have won money quickly"
| Email ID | Text (Bag of Words) | Label |
|---|---|---|
| 1 | Congratulations, you have won a free prize | Spam |
| 2 | Monthly meeting scheduled for Monday | Not Spam |
| 3 | Earn money quickly with this simple trick | Spam |
| 4 | Project report attached. Review by Friday | Not Spam |
Goal: Calculate the posterior probability for both classes (Spam and Not Spam) and assign the class with the higher probability.
Step 1: Pre-processing & Priors
First, we calculate the Prior Probabilities based on the document counts.
- Total Documents: 4
- Spam Documents: 2 (Email 1, Email 3)
- Not Spam Documents: 2 (Email 2, Email 4)
P(Spam) = 2/4 = 0.5
P(Not Spam) = 2/4 = 0.5
Step 2: Vocabulary & Word Counts
We tokenize the training data to create a Vocabulary (V) and count words per class.
| Class | Bag of Words |
|---|---|
| Spam | Congratulations, you, have, won, a, free, prize, Earn, money, quickly, with, this, simple, trick |
| Not Spam | Monthly, meeting, scheduled, for, Monday, Project, report, attached, Review, by, Friday |
- Total Words in Spam (NSpam): 14
- Total Words in Not Spam (NNotSpam): 11
- Vocabulary Size (|V|): 25 (14 unique spam words + 11 unique non-spam words)
Step 3: Likelihoods with Laplace Smoothing
Test Sentence Tokens: "Congratulations", "You", "have", "won", "money", "quickly".
Note: We perform case-insensitive matching (e.g., "You" = "you").
Formula: $P(w|Class) = \frac{count(w, Class) + 1}{N_{Class} + |V|}$
Class: Spam
Denominator: $14 + 25 = 39$
- "Congratulations": (1+1)/39 = 2/39
- "You": (1+1)/39 = 2/39
- "have": (1+1)/39 = 2/39
- "won": (1+1)/39 = 2/39
- "money": (1+1)/39 = 2/39
- "quickly": (1+1)/39 = 2/39
Class: Not Spam
Denominator: $11 + 25 = 36$
- "Congratulations": (0+1)/36 = 1/36
- "You": (0+1)/36 = 1/36
- "have": (0+1)/36 = 1/36
- "won": (0+1)/36 = 1/36
- "money": (0+1)/36 = 1/36
- "quickly": (0+1)/36 = 1/36
Step 4: Final Probability Scores
Score(Spam): $0.5 \times (\frac{2}{39})^6$
Calculation: $0.5 \times (0.0513)^6 \approx \mathbf{9.1 \times 10^{-9}}$
Score(Not Spam): $0.5 \times (\frac{1}{36})^6$
Calculation: $0.5 \times (0.0278)^6 \approx \mathbf{2.3 \times 10^{-10}}$
Conclusion
Since the score for Spam is significantly higher than Not Spam, the email is classified as:
SPAM
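The four steps can be sketched as a small multinomial Naïve Bayes in plain Python. This is a minimal illustration, not a production classifier: it assumes the texts are already lowercased with punctuation removed, and the helper names `train` and `score` are invented for this example.

```python
from collections import Counter

# Training emails, lowercased with punctuation stripped by hand (an assumption)
training = [
    ("congratulations you have won a free prize", "Spam"),
    ("monthly meeting scheduled for monday", "Not Spam"),
    ("earn money quickly with this simple trick", "Spam"),
    ("project report attached review by friday", "Not Spam"),
]

def train(docs):
    """Count words per class, documents per class, and build the vocabulary."""
    word_counts, doc_counts = {}, Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def score(tokens, label, word_counts, doc_counts, vocab, alpha=1):
    """Prior times the product of Laplace-smoothed word likelihoods."""
    result = doc_counts[label] / sum(doc_counts.values())
    total = sum(word_counts[label].values())
    for w in tokens:
        result *= (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
    return result

word_counts, doc_counts, vocab = train(training)
tokens = "congratulations you have won money quickly".split()
scores = {label: score(tokens, label, word_counts, doc_counts, vocab)
          for label in doc_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With longer messages the product of many small probabilities can underflow, so practical implementations usually sum log-probabilities instead; the direct product is kept here to mirror the hand calculation.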
Topic: Understanding Naïve Bayes (Theory & Logic)
1. Theoretical Foundation
Naïve Bayes is a supervised machine learning algorithm used for classification. It is favored for its speed and effectiveness, especially in text-based data.
The Core Formula (Bayes' Theorem)
$$P(C|x) = \frac{P(x|C) \cdot P(C)}{P(x)}$$
- $P(C|x)$ (Posterior): Prob. of class given features.
- $P(x|C)$ (Likelihood): Prob. of features given class.
- $P(C)$ (Prior): Baseline prob. of the class.
- $P(x)$ (Evidence): Total prob. of the features.
Why is it "Naïve"?
It assumes conditional independence. It treats every feature (e.g., words in an email) as if it has no relationship with any other feature.
2. Real-World Case: Spam Filtering
In cybersecurity, Naïve Bayes is a classic baseline technique for flagging spam and other unwanted messages.
| Folder | High Frequency Keywords |
|---|---|
| Spam | Winner, Free, Cash, Urgent, Money |
| Ham (Legit) | Meeting, Project, Attached, Lunch |
Other Common Uses:
- 🎭 Sentiment Analysis: Movie/Product reviews.
- 🩺 Medical Diagnosis: Symptoms to disease.
- ☁️ Weather Prediction: Sunny vs. Rainy.
- 🚨 Security: Intrusion detection.
Section 4: Naïve Bayes & Probability Logic
Essential for Spam Filtering and Intrusion Detection Systems (IDS).
Advanced: Naïve Bayes Mechanics & Smoothing
1. The Mathematical Foundation
Naïve Bayes uses Bayes' Theorem to calculate the "Posterior Probability." Because it assumes features are independent, the math simplifies into a product of individual probabilities.
The Chain Rule for Multiple Features:
$P(C|x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i|C)$
2. Laplace Smoothing (Solving the Zero Frequency Problem)
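If a word in the test set never appeared in the training data for a given class, the unsmoothed estimate of $P(x|C)$ for that class is $0$. Since we multiply these values, one zero wipes out the entire score. Laplace smoothing avoids this by adding a small constant $\alpha$ to every count: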
$$P(x_i|C) = \frac{\text{count}(x_i, C) + \alpha}{\text{count}(C) + \alpha \cdot |V|}$$
- $\alpha$: Usually $1$ (Add-one smoothing).
- $|V|$: Total unique features (Vocabulary size).
3. Types of Naïve Bayes Models
Choosing the right model depends on the type of data you are analyzing:
[Image: comparison of Gaussian, Multinomial and Bernoulli Naive Bayes models]
| Model Type | Best Used For... |
|---|---|
| Gaussian | Continuous data (measurements like height, weight, or temperature) following a Normal Distribution. |
| Multinomial | Text classification based on word counts (how many times "Cash" appears). |
| Bernoulli | Binary features. It only cares if a word is Present or Absent (0 or 1). |
Key Advantages Summary:
- Fast: Independent calculations are computationally cheap.
- Robust: Handles missing data and small datasets well.
- Scalable: Performs exceptionally well with high-dimensional data like text.
Bayes' Theorem: Medical Diagnosis
Problem Statement
A new test is developed to detect a rare genetic disorder. The disorder affects 0.1% of the population. The test has:
- Sensitivity (true positive rate): 98%
- Specificity (true negative rate): 97%
Apply Bayes' theorem to identify the probability that a person has the disorder if their test result is positive.
Goal: Calculate the Posterior Probability, which is P(Disease | Positive Test).
Step 1: Define Events & Probabilities
Let's convert the percentages into probabilities (decimals) and define our terms.
| Term | Symbol | Value |
|---|---|---|
| Prevalence (Prior) | $P(\text{Disease})$ | 0.1% = 0.001 |
| No Disease | $P(\text{No Disease})$ | 1 - 0.001 = 0.999 |
| Sensitivity | $P(\text{Pos} \mid \text{Disease})$ | 98% = 0.98 |
| Specificity | $P(\text{Neg} \mid \text{No Disease})$ | 97% = 0.97 |
P(Pos | No Disease) = 1 - Specificity = 1 - 0.97 = 0.03
Step 2: Apply Bayes' Theorem
The formula for the probability of having the disease given a positive test is:
P(Disease|Pos) = [P(Pos|Disease) × P(Disease)] ÷ P(Pos)
Expand the denominator P(Pos):
P(Pos) = [True Positives] + [False Positives]
P(Pos) = [P(Pos|Disease) × P(Disease)] + [P(Pos|No Disease) × P(No Disease)]
Step 3: Calculation
1. Numerator (True Positives)
0.98 × 0.001
= 0.00098
2. False Positives
0.03 × 0.999
= 0.02997
3. Final Division
Total Probability of Positive Test = 0.00098 + 0.02997 = 0.03095
P(Disease | Pos) = 0.00098 ÷ 0.03095
≈ 0.03166
Final Answer
The probability that a person actually has the disorder given a positive test result is approximately:
3.17%
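(Note: The probability is this low despite the high sensitivity and specificity because the disorder is so rare: false positives from the 99.9% of people without it far outnumber the true positives. Overlooking this effect is known as the "Base Rate Fallacy".)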
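The calculation generalizes to any prevalence, sensitivity and specificity. A small sketch (the function name `posterior` is just a label for this example):

```python
def posterior(prior, sensitivity, specificity):
    """P(Disease | Positive) from prevalence, sensitivity and specificity."""
    true_pos = sensitivity * prior                  # P(Pos | Disease) * P(Disease)
    false_pos = (1 - specificity) * (1 - prior)     # P(Pos | No Disease) * P(No Disease)
    return true_pos / (true_pos + false_pos)

p_disease_given_pos = posterior(prior=0.001, sensitivity=0.98, specificity=0.97)
print(round(p_disease_given_pos, 4))  # 0.0317
```

Trying a larger prior, for example `posterior(0.10, 0.98, 0.97)`, shows how strongly the result depends on how common the condition is.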
Discrete Random Variables
Q5) Probability Mass Function
Given: A machine learning model predicts the number of errors (X) in a document. X takes values {0, 1, 2, 3} with the following PMF:
- P(X=0) = k
- P(X=1) = 2k
- P(X=2) = 3k
- P(X=3) = 4k
- P(X=other) = 0
a) Find the value of k
Rule: The sum of all probabilities in a valid probability mass function must equal 1.
$\sum P(X=x) = 1$
$P(0) + P(1) + P(2) + P(3) = 1$
$k + 2k + 3k + 4k = 1$
$10k = 1$
$k = \frac{1}{10} = \mathbf{0.1}$
b) Compute Expectation E(X)
Formula: $E(X) = \sum [x \cdot P(X=x)]$
| x | P(x) | x · P(x) |
|---|---|---|
| 0 | 0.1 | 0 |
| 1 | 0.2 | 0.2 |
| 2 | 0.3 | 0.6 |
| 3 | 0.4 | 1.2 |
Total Sum = $0 + 0.2 + 0.6 + 1.2$
$E(X) = \mathbf{2.0}$
c) Meaning of E(X) in context
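The expected value represents the long-run average: across many documents, the model predicts an average of 2 errors per document, even though any single document has 0, 1, 2, or 3.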
d) Compute Variance(2X + 3)
Step 1: Calculate $E(X^2)$
$E(X^2) = \sum [x^2 \cdot P(x)]$
- $0^2(0.1) = 0$
- $1^2(0.2) = 0.2$
- $2^2(0.3) = 4(0.3) = 1.2$
- $3^2(0.4) = 9(0.4) = 3.6$
$E(X^2) = 0 + 0.2 + 1.2 + 3.6 = \mathbf{5.0}$
Step 2: Calculate Variance of X, $Var(X)$
Formula: $Var(X) = E(X^2) - [E(X)]^2$
$Var(X) = 5.0 - (2.0)^2$
$Var(X) = 5.0 - 4.0 = \mathbf{1.0}$
Step 3: Apply Variance Property
Property: $Var(aX + b) = a^2 Var(X)$
Here, $a = 2$ and $b = 3$.
$Var(2X + 3) = 2^2 \times Var(X)$
$= 4 \times 1.0 = \mathbf{4.0}$
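Parts a), b) and d) can be checked with exact fractions in Python. This is a sketch of the same arithmetic; the dictionary `coeffs` simply encodes the multiples of k given in the problem.

```python
from fractions import Fraction

coeffs = {0: 1, 1: 2, 2: 3, 3: 4}            # P(X = x) = coeff * k
k = Fraction(1, sum(coeffs.values()))        # probabilities must sum to 1
pmf = {x: c * k for x, c in coeffs.items()}

e_x = sum(x * p for x, p in pmf.items())       # E(X)
e_x2 = sum(x**2 * p for x, p in pmf.items())   # E(X^2)
var_x = e_x2 - e_x**2                          # Var(X) = E(X^2) - [E(X)]^2
var_2x_plus_3 = 2**2 * var_x                   # Var(aX + b) = a^2 Var(X)

print(k, e_x, e_x2, var_x, var_2x_plus_3)      # 1/10 2 5 1 4
```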
Joint Probability & Expectation
Q6) Joint Probability Distribution
Experiment: A fair coin is tossed 3 times.
- X: Random variable for the number of tails.
- Y: Random variable for winnings based on the position of the 1st tail.
Rules for Y:
- 1st Tail on Toss 1: Win $3 (Y=3)
- 1st Tail on Toss 2: Win $2 (Y=2)
- 1st Tail on Toss 3: Win $1 (Y=1)
- No Tail: Lose $2 (Y=-2)
Step 1: Analyze Sample Space
There are $2^3 = 8$ possible outcomes. Let's list them and determine X and Y for each.
| Outcome | X (Count Tails) | 1st Tail Pos | Y (Winnings) |
|---|---|---|---|
| H H H | 0 | None | -2 |
| H H T | 1 | 3rd | 1 |
| H T H | 1 | 2nd | 2 |
| H T T | 2 | 2nd | 2 |
| T H H | 1 | 1st | 3 |
| T H T | 2 | 1st | 3 |
| T T H | 2 | 1st | 3 |
| T T T | 3 | 1st | 3 |
i) Joint Probability Function P(X, Y)
We group the outcomes by their (X, Y) pairs. Since each individual outcome has a probability of 1/8:
| X \ Y | -2 | 1 | 2 | 3 | Total P(X) |
|---|---|---|---|---|---|
| 0 | 1/8 (HHH) | 0 | 0 | 0 | 1/8 |
| 1 | 0 | 1/8 (HHT) | 1/8 (HTH) | 1/8 (THH) | 3/8 |
| 2 | 0 | 0 | 1/8 (HTT) | 2/8 (THT, TTH) | 3/8 |
| 3 | 0 | 0 | 0 | 1/8 (TTT) | 1/8 |
Interpretation: The table shows the probability of obtaining a specific number of tails (X) and a specific winning amount (Y) simultaneously. For example, P(X=2, Y=3) = 2/8 means there is a 25% chance of getting exactly 2 tails where the first tail occurs on the first toss.
ii) Marginal Probability Function for X
This is obtained by summing the probabilities across the rows (as shown in the "Total P(X)" column above).
- P(X=0) = 1/8 = 0.125
- P(X=1) = 3/8 = 0.375
- P(X=2) = 3/8 = 0.375
- P(X=3) = 1/8 = 0.125
iii) Expectation E(X)
Formula: $E(X) = \sum [x \cdot P(X=x)]$
$E(X) = (0 \times \frac{1}{8}) + (1 \times \frac{3}{8}) + (2 \times \frac{3}{8}) + (3 \times \frac{1}{8})$
$E(X) = 0 + \frac{3}{8} + \frac{6}{8} + \frac{3}{8}$
$E(X) = \frac{12}{8} = 1.5$
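The whole table can be generated by enumerating the eight outcomes. The sketch below assumes the payoff rules listed above; `PAYOFF`, `joint` and `marginal_x` are names chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

PAYOFF = {1: 3, 2: 2, 3: 1}   # position of the first tail -> winnings

joint = Counter()
for outcome in product("HT", repeat=3):        # 8 equally likely outcomes
    x = outcome.count("T")                     # number of tails
    y = PAYOFF[outcome.index("T") + 1] if "T" in outcome else -2
    joint[(x, y)] += Fraction(1, 8)

marginal_x = Counter()
for (x, _), prob in joint.items():             # sum each row of the joint table
    marginal_x[x] += prob

e_x = sum(x * prob for x, prob in marginal_x.items())
print(joint[(2, 3)], dict(marginal_x), e_x)
```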
Probability & Conditional Events
Question 2
A study on ethics in the workplace by the Ethics Resource Center and Kronos, Inc., revealed that 35% of employees admit to keeping quiet when they see co-worker misconduct. Suppose 75% of employees who admit to keeping quiet when they see co-worker misconduct call in sick when they are well. In addition, suppose that 40% of the employees who call in sick when they are well admit to keeping quiet when they see co-worker misconduct. If an employee is randomly selected, determine the following probabilities:
- a) The employee calls in sick when well and admits to keeping quiet when seeing co-worker misconduct.
- b) The employee admits to keeping quiet when seeing co-worker misconduct or calls in sick when well.
- c) Given that the employee calls in sick when well, he or she does not keep quiet when seeing co-worker misconduct.
- d) The employee neither keeps quiet when seeing co-worker misconduct nor calls in sick when well.
- e) The employee admits to keeping quiet when seeing co-worker misconduct and does not call in sick when well.
- Event Q: Employee keeps quiet about misconduct. ($P(Q) = 0.35$)
- Event S: Employee calls in sick when well.
- Conditional 1: "75% of employees who keep quiet... call in sick" $\rightarrow P(S|Q) = 0.75$
- Conditional 2: "40% of employees who call in sick... keep quiet" $\rightarrow P(Q|S) = 0.40$
a) P(S and Q)
This asks for the intersection: The employee calls in sick AND keeps quiet.
Formula: $P(S \cap Q) = P(S|Q) \times P(Q)$
Calculation: $0.75 \times 0.35 = \mathbf{0.2625}$
We need the total probability of calling in sick $P(S)$ for the next steps.
Since $P(Q|S) = \frac{P(S \cap Q)}{P(S)}$, we can rearrange to find $P(S)$:
$P(S) = \frac{P(S \cap Q)}{P(Q|S)} = \frac{0.2625}{0.40} = \mathbf{0.65625}$
b) P(Q or S)
This asks for the union: Keeps quiet OR calls in sick.
Formula: $P(Q \cup S) = P(Q) + P(S) - P(Q \cap S)$
Calculation: $0.35 + 0.65625 - 0.2625 = \mathbf{0.74375}$
c) P(not Q | S)
Given they called in sick ($S$), what is the probability they do NOT keep quiet ($Q'$)?
Formula: $P(Q'|S) = 1 - P(Q|S)$
Calculation: $1 - 0.40 = \mathbf{0.60}$
d) P(not Q and not S)
Neither keeps quiet nor calls in sick. This is the complement of the union ($Q \cup S$).
Formula: $P(Q' \cap S') = 1 - P(Q \cup S)$
Calculation: $1 - 0.74375 = \mathbf{0.25625}$
e) P(Q and not S)
Admits to keeping quiet ($Q$) but does NOT call in sick ($S'$). This is "Q only".
Formula: $P(Q \cap S') = P(Q) - P(Q \cap S)$
Calculation: $0.35 - 0.2625 = \mathbf{0.0875}$
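All five answers follow from the three given numbers, which the following Python sketch makes explicit. Variable names are illustrative only.

```python
p_q = 0.35            # P(Q): keeps quiet about misconduct
p_s_given_q = 0.75    # P(S | Q)
p_q_given_s = 0.40    # P(Q | S)

p_s_and_q = p_s_given_q * p_q         # a) multiplication rule
p_s = p_s_and_q / p_q_given_s         # rearranged definition of P(Q | S)
p_q_or_s = p_q + p_s - p_s_and_q      # b) addition rule
p_not_q_given_s = 1 - p_q_given_s     # c) complement, conditional on S
p_neither = 1 - p_q_or_s              # d) complement of the union
p_q_not_s = p_q - p_s_and_q           # e) "Q only"

answers = [p_s_and_q, p_q_or_s, p_not_q_given_s, p_neither, p_q_not_s]
for label, value in zip("abcde", answers):
    print(label, round(value, 5))
```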
📚 Recommended Reference
For conditional probability and Bayes' theorem, OpenStax Statistics is a great resource.
- Chapter 3: Probability Topics (Conditional Probability, Contingency Tables)
Probability & Statistics Analysis
Part a) Poisson Distribution
Based on past experience, it is assumed that the number of flaws per foot in rolls of grade 2 paper follows a Poisson distribution with a mean of 0.2 flaw per foot. What is the probability that in a:
- 1-foot roll, there will be at least 2 flaws?
- 50-foot roll, there will be greater than or equal to 5 flaws and less than or equal to 8 flaws
i) 1-foot roll, at least 2 flaws
Parameter: Since the roll is 1 foot long, the mean ($\lambda$) remains 0.2.
Goal: Find $P(X \ge 2)$. Using the complement rule: $1 - P(X < 2) = 1 - [P(0) + P(1)]$
Formula: $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}$
- $P(0) = \frac{e^{-0.2} (0.2)^0}{0!} = 0.8187$
- $P(1) = \frac{e^{-0.2} (0.2)^1}{1!} = 0.1637$
$P(X \ge 2) = 1 - e^{-0.2}(1 + 0.2) = 1 - 0.9825 = 0.0175$
Probability: 0.0175 (or 1.75%)
ii) 50-foot roll, 5 to 8 flaws
Parameter: For 50 feet, the new mean is $\lambda = 0.2 \times 50 = 10$.
Goal: Find $P(5 \le X \le 8)$. This is the sum of probabilities for X = 5, 6, 7, and 8.
Using cumulative probability logic: $P(X \le 8) - P(X \le 4)$
Probability: 0.3036 (or 30.36%)
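Both parts can be checked numerically with the standard `math` module. In this sketch, `poisson_pmf` and `poisson_cdf` are helper names made up for the example. Note that with unrounded values part i comes out as 0.0175; subtracting the four-decimal values of P(0) and P(1) gives 0.0176 instead.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k), summing the PMF from 0 to k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

p_at_least_2 = 1 - poisson_cdf(1, 0.2)                  # 1-foot roll, lambda = 0.2
p_5_to_8 = poisson_cdf(8, 10) - poisson_cdf(4, 10)      # 50-foot roll, lambda = 10

print(round(p_at_least_2, 4), round(p_5_to_8, 4))  # 0.0175 0.3036
```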
Part b) Law of Total Probability
An individual has 3 different email accounts. Most of her messages, in fact 70%, come into account #1, whereas 20% come into account #2 and the remaining 10% into account #3. Of the messages into account #1, only 1% are spam, whereas the corresponding percentages for accounts #2 and #3 are 2% and 5%, respectively. What is the probability that a randomly selected message is spam?
Calculation
We use the Law of Total Probability to find the overall probability of Spam ($S$).
Let $A_i$ be the event the message is from Account $i$.
- Account 1: $P(S|A_1)P(A_1) = 0.01 \times 0.70 = 0.007$
- Account 2: $P(S|A_2)P(A_2) = 0.02 \times 0.20 = 0.004$
- Account 3: $P(S|A_3)P(A_3) = 0.05 \times 0.10 = 0.005$
$P(S) = 0.007 + 0.004 + 0.005 = 0.016$
Probability: 0.016 (or 1.6%)
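The Law of Total Probability is a weighted sum, which maps directly onto a short Python expression. The dictionary layout below is just one convenient way to hold the numbers from the problem.

```python
# account number -> (share of all messages, spam rate within that account)
accounts = {
    1: (0.70, 0.01),
    2: (0.20, 0.02),
    3: (0.10, 0.05),
}

# P(S) = sum over accounts of P(S | A_i) * P(A_i)
p_spam = sum(share * spam_rate for share, spam_rate in accounts.values())
print(round(p_spam, 3))  # 0.016
```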
📚 Recommended Reference Material
For Poisson distributions and Total Probability, OpenStax Introductory Statistics is an excellent free resource.
- Chapter 4.6: Poisson Distribution - Explains the formula for events occurring over time/space.
- Chapter 3.2: Independent and Mutually Exclusive Events (Total Probability rules).
Joint & Hypergeometric Distributions
4. a) Joint Probability Mass Function
Given: Joint PMF $P(X=x, Y=y) = k(x^2 + y^2)$.
Note: The problem statement does not give the range of x and y. The solution below assumes the small range $x \in \{1, 2\}$ and $y \in \{1, 2\}$ to demonstrate the method.
i) Find the value of constant k
Rule: The sum of all probabilities in a joint distribution must equal 1.
$\sum \sum P(x, y) = 1$
Using range $x \in \{1, 2\}, y \in \{1, 2\}$:
1 = k(1²+1²) + k(1²+2²) + k(2²+1²) + k(2²+2²)
1 = k(2) + k(5) + k(5) + k(8)
1 = 20k
Answer: $k = \frac{1}{20}$ (or 0.05)
ii) Find Marginal Probability Functions
For X (Sum over Y):
- $P_X(1) = P(1,1) + P(1,2) = 2k + 5k = 7k = \mathbf{7/20}$
- $P_X(2) = P(2,1) + P(2,2) = 5k + 8k = 13k = \mathbf{13/20}$
For Y (Sum over X):
- $P_Y(1) = P(1,1) + P(2,1) = 2k + 5k = 7k = \mathbf{7/20}$
- $P_Y(2) = P(1,2) + P(2,2) = 5k + 8k = 13k = \mathbf{13/20}$
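The normalization and the marginals can be reproduced with exact fractions. This sketch uses the same assumed support, x and y each in {1, 2}; changing `xs` and `ys` recomputes everything for a different range.

```python
from fractions import Fraction
from itertools import product

xs, ys = (1, 2), (1, 2)    # assumed support, as in the solution above

# k is the reciprocal of the sum of (x^2 + y^2) over every (x, y) pair
k = Fraction(1, sum(x**2 + y**2 for x, y in product(xs, ys)))
joint = {(x, y): k * (x**2 + y**2) for x, y in product(xs, ys)}

marginal_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # sum over y
marginal_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # sum over x

print(k, marginal_x, marginal_y)
```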
4. b) Hypergeometric Distribution
Given: Lot of 10 items, 3 defective. Sample of 4 drawn without replacement.
Find: The probability distribution of X (number of defectives).
iii) Probability Calculation
Parameters: Total ($N=10$), Defective ($K=3$), Good ($7$), Sample ($n=4$).
Formula: $P(X=x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$
Denominator (Total Ways): $\binom{10}{4} = \frac{10 \times 9 \times 8 \times 7}{4 \times 3 \times 2 \times 1} = 210$
| X (Defectives) | Calculation | Probability |
|---|---|---|
| 0 | $\frac{\binom{3}{0}\binom{7}{4}}{210} = \frac{1 \times 35}{210}$ | 0.1667 (1/6) |
| 1 | $\frac{\binom{3}{1}\binom{7}{3}}{210} = \frac{3 \times 35}{210}$ | 0.5000 (1/2) |
| 2 | $\frac{\binom{3}{2}\binom{7}{2}}{210} = \frac{3 \times 21}{210}$ | 0.3000 (3/10) |
| 3 | $\frac{\binom{3}{3}\binom{7}{1}}{210} = \frac{1 \times 7}{210}$ | 0.0333 (1/30) |
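
The table can be regenerated with `math.comb`. This is a minimal sketch of the hypergeometric formula above; `hypergeom_pmf` is a name introduced for the example and its parameters mirror the symbols N, K and n.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(X = x): x defectives in a sample of n drawn without replacement
    from a lot of N items containing K defectives."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

dist = {x: hypergeom_pmf(x, N=10, K=3, n=4) for x in range(4)}
for x, prob in dist.items():
    print(x, prob, round(float(prob), 4))
```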
📚 Recommended Reference Material
For Joint distributions and Sampling Without Replacement, refer to OpenStax Introductory Statistics.
- Chapter 4.4: Geometric and Hypergeometric Distributions.
Weather Prediction Analysis
Q5) If the weather is Snowy...
Question: If the weather is Snowy, then the player will play or not?
Step 1: The Full Dataset
The dataset below combines the left and right columns of the original table into a single list so the full history is visible.
| # | Weather | Player Play? |
|---|---|---|
| 1 | Sunny | Yes |
| 2 | Rainy | No |
| 3 | Cloudy | Yes |
| 4 | Sunny | No |
| 5 | Sunny | Yes |
| 6 | Snowy | No |
| 7 | Rainy | No |
| 8 | Cloudy | Yes |
| 9 | Sunny | Yes |
| 10 | Snowy | No |
| 11 | Cloudy | Yes |
| 12 | Rainy | No |
| 13 | Snowy | No |
| 14 | Snowy | Yes |
Step 2: Filter for "Snowy" Condition
The question asks specifically about the condition "If the weather is Snowy", so we keep only the rows where Weather = Snowy. The Sunny, Rainy, and Cloudy days are ignored for this calculation because they do not match the condition.
| Filtered Instances | Weather | Player Play? |
|---|---|---|
| 6 | Snowy | No |
| 10 | Snowy | No |
| 13 | Snowy | No |
| 14 | Snowy | Yes |
Step 3: Calculate Probabilities
From the filtered table above:
- Total "Snowy" days = 4
- Days where Play = "No" = 3
- Days where Play = "Yes" = 1
$P(\text{Play=No} | \text{Snowy}) = \frac{3}{4} = 0.75$ (75%)
$P(\text{Play=Yes} | \text{Snowy}) = \frac{1}{4} = 0.25$ (25%)
Conclusion
Since the probability of No (0.75) is significantly higher than Yes (0.25), the prediction is:
The player will NOT play.
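The filter-and-count logic of Steps 2 and 3 can be sketched in Python as follows. The list `history` is the 14-row dataset from Step 1 typed in by hand.

```python
from collections import Counter

history = [
    ("Sunny", "Yes"), ("Rainy", "No"), ("Cloudy", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Snowy", "No"), ("Rainy", "No"), ("Cloudy", "Yes"),
    ("Sunny", "Yes"), ("Snowy", "No"), ("Cloudy", "Yes"), ("Rainy", "No"),
    ("Snowy", "No"), ("Snowy", "Yes"),
]

# Keep only Snowy days and count the Play outcomes among them
snowy = Counter(play for weather, play in history if weather == "Snowy")
total = sum(snowy.values())
probs = {play: count / total for play, count in snowy.items()}
prediction = max(probs, key=probs.get)

print(probs, prediction)
```

With only four Snowy days in the history, the 75% estimate is a rough one; more data would make the prediction more trustworthy.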
📚 Recommended Reference Material
This problem demonstrates Conditional Probability, often used in Naïve Bayes classifiers.
- OpenStax Introductory Statistics:
- Chapter 3.1: Terminology (Conditionals).
- Chapter 3.2: Independent and Mutually Exclusive Events.
Bayes' Theorem & Normal Distribution
Part a) Incidence of a Rare Disease
Given:
- Prevalence $P(D) = 1/1000 = 0.001$
- Sensitivity (Positive if Disease) $P(Pos|D) = 0.99$
- False Positive Rate (Positive if Healthy) $P(Pos|D') = 0.02$
Step 1: Calculate Probability of Positive Test
Using the Law of Total Probability:
$P(Pos) = P(Pos|D)P(D) + P(Pos|D')P(D')$
Note: $P(D') = 1 - 0.001 = 0.999$
$P(Pos) = 0.00099 + 0.01998$
$P(Pos) = 0.02097$
Step 2: Apply Bayes' Theorem
We need to find $P(D|Pos)$.
$$P(D|Pos) = \frac{P(Pos|D) \times P(D)}{P(Pos)}$$
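$$P(D|Pos) = \frac{0.00099}{0.02097} \approx 0.0472$$
Answer: A person who tests positive has about a 4.72% chance of actually having the disease.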
Part b) Electrical Resistors
Given: Normal Distribution with Mean ($\mu$) = 40 ohms, Standard Deviation ($\sigma$) = 2 ohms.
i) Percentage Exceeding 43 Ohms
Find $P(X > 43)$.
Z-Score Formula: $Z = \frac{X - \mu}{\sigma}$
$Z = \frac{43 - 40}{2} = \frac{3}{2} = 1.5$
From Z-table, area to the left of Z=1.5 is approx 0.9332.
Area to the right = $1 - 0.9332 = 0.0668$
Answer: About 6.68% of resistors exceed 43 ohms.
ii) Percentage Exceeding 43 Ohms (Nearest Ohm)
If measured to the nearest ohm, a value "exceeds 43" if the rounded integer is 44 or higher.
The boundary for rounding to 44 starts at 43.5.
So, we calculate $P(X \ge 43.5)$.
Z-Score: $Z = \frac{43.5 - 40}{2} = \frac{3.5}{2} = 1.75$
From Z-table, area to the left of Z=1.75 is approx 0.9599.
Area to the right = $1 - 0.9599 = 0.0401$
Answer: About 4.01% of resistors exceed 43 ohms when measured to the nearest ohm.
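Both upper-tail areas can be checked with `statistics.NormalDist` from the Python standard library (available since Python 3.8). This is a quick cross-check of the Z-table lookups.

```python
from statistics import NormalDist

resistance = NormalDist(mu=40, sigma=2)

p_exceeds_43 = 1 - resistance.cdf(43)      # part i: exact measurement
p_rounded_44 = 1 - resistance.cdf(43.5)    # part ii: rounds to 44 or higher

print(round(p_exceeds_43, 4), round(p_rounded_44, 4))  # 0.0668 0.0401
```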
📚 Recommended Reference Material
OpenStax Introductory Statistics:
- Chapter 3.2: Conditional Probability and Bayes' Theorem.
- Chapter 6.2: Using the Normal Distribution (Z-scores).