Practical Statistics for Data Science
This guide explains core statistical concepts with a focus on real-world data science applications. Each topic includes definitions, formulas (rendered with MathJax), and practical examples.
1. Types of Analysis
In data science, we use different types of statistical analysis depending on the question we’re trying to answer:
- Descriptive Analysis: Summarizes data (e.g., mean, median, charts).
- Inferential Analysis: Draws conclusions about a population using a sample (e.g., hypothesis testing, confidence intervals).
- Predictive Analysis: Uses models to forecast future outcomes (e.g., regression, machine learning).
- Exploratory Data Analysis (EDA): Investigates data to find patterns, anomalies, or relationships before formal modeling.
Practical Example:
A data scientist at an e-commerce company uses descriptive analysis to report average monthly sales, then uses inferential analysis to test if a new website layout increases conversion rates.
2. Population and Sampling
- Population: The complete set of items or individuals you’re interested in studying. Its size is denoted \(N\).
- Sample: A subset of the population used to make inferences about the whole. Its size is denoted \(n\).
We use samples because collecting data from an entire population is often impractical or too expensive.
Example:
If you want to know the average height of all adults in a country (population), you might measure 1,000 randomly selected adults (sample).
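A minimal sketch of this idea with NumPy, using a synthetic population of heights (the population size and parameters are invented for illustration); `rng.choice` with `replace=False` draws a simple random sample without replacement:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic "population": heights (cm) of 1,000,000 adults.
population = rng.normal(loc=170, scale=10, size=1_000_000)

# Simple random sample of n = 1,000 adults, drawn without replacement.
sample = rng.choice(population, size=1_000, replace=False)

print(f"Population mean: {population.mean():.2f} cm")
print(f"Sample mean:     {sample.mean():.2f} cm")
```

The sample mean typically lands very close to the population mean, which is exactly why sampling works.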
3. Sampling Methods
How you select your sample affects the reliability of your conclusions.
Common Sampling Techniques:
- Simple Random Sampling: Every member has an equal chance of being selected.
- Stratified Sampling: Population divided into subgroups (strata), then random samples taken from each.
- Cluster Sampling: Population divided into clusters; entire clusters are randomly selected.
- Systematic Sampling: Select every \(k\)-th individual from a list.
- Convenience Sampling: Use readily available data (biased, not recommended for inference).
Practical Tip:
In customer surveys, stratified sampling ensures representation across age groups, regions, or user tiers.
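As a sketch of that tip, pandas can draw a stratified sample by grouping on the stratum column and sampling within each group; the `tier` column, its proportions, and the 10% sampling fraction below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical customer table with a 'tier' stratum.
customers = pd.DataFrame({
    "customer_id": range(1000),
    "tier": rng.choice(["free", "basic", "premium"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Stratified sample: take 10% from each tier so every tier is represented.
stratified = customers.groupby("tier").sample(frac=0.10, random_state=0)

print(stratified["tier"].value_counts())
```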
4. Sample Size
Choosing the right sample size balances cost, time, and accuracy.
A larger sample reduces sampling error but increases cost. The required size depends on:
- Desired confidence level (e.g., 95%)
- Margin of error (\(E\))
- Population variability (\(\sigma\))
For estimating a population mean:
\[
n = \left( \frac{Z \cdot \sigma}{E} \right)^2
\]
Where \(Z\) is the z-score (e.g., 1.96 for 95% confidence).
If \(\sigma\) is unknown, use a pilot study or conservative estimate.
Example:
Suppose you want to estimate average app usage time to within ±5 minutes with 95% confidence. If \(\sigma \approx 30\) min:
\[
n = \left( \frac{1.96 \cdot 30}{5} \right)^2 \approx 138.3 \Rightarrow \text{round up to } n = 139
\]
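The same calculation as a small helper function (a sketch; `math.ceil` rounds up because a fractional sample size must be rounded to the next whole observation):

```python
import math

def required_sample_size(z: float, sigma: float, margin: float) -> int:
    """n = (z * sigma / margin)^2, rounded up to the next integer."""
    return math.ceil((z * sigma / margin) ** 2)

# 95% confidence (z = 1.96), sigma ≈ 30 minutes, margin of error ±5 minutes.
print(required_sample_size(z=1.96, sigma=30, margin=5))  # 139
```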
5. Variables
Variables represent measurable attributes. Understanding their type guides analysis.
Types of Variables:
- Numeric (Quantitative):
  - Continuous: Can take any value (e.g., height, temperature).
  - Discrete: Countable integers (e.g., number of users, defects).
- Categorical (Qualitative):
  - Nominal: No order (e.g., gender, color).
  - Ordinal: Ordered categories (e.g., satisfaction: low, medium, high).
Practical Impact:
You’d use linear regression for continuous outcomes but logistic regression for binary (categorical) outcomes.
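A brief sketch of how these types surface in pandas (the column names and values are made up); `pd.Categorical` with `ordered=True` is one way to encode an ordinal variable:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],        # numeric, continuous
    "num_logins": [3, 12, 7],                  # numeric, discrete
    "color": ["red", "blue", "red"],           # categorical, nominal
    "satisfaction": ["low", "high", "medium"], # categorical, ordinal
})

# Give the ordinal variable an explicit order so comparisons make sense.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["satisfaction"].min())  # ordered categories support min/max -> 'low'
```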
6. Branches of Statistics
- Descriptive Statistics: Summarizes and visualizes data (mean, variance, histograms).
- Inferential Statistics: Makes predictions or inferences about populations from samples (confidence intervals, p-values).
- Bayesian Statistics: Updates probability estimates as more evidence becomes available (uses prior + likelihood → posterior).
Data science heavily relies on both descriptive and inferential methods, with growing use of Bayesian approaches in machine learning.
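To make the prior + likelihood → posterior idea concrete, here is a minimal Beta-Binomial sketch: a uniform Beta(1, 1) prior on a conversion rate, updated with hypothetical click data (the counts are invented for illustration):

```python
from scipy import stats

# Prior: Beta(1, 1), a uniform belief about an unknown conversion rate.
alpha_prior, beta_prior = 1, 1

# Hypothetical evidence: 20 conversions out of 100 trials.
successes, trials = 20, 100

# Beta-Binomial conjugacy: posterior = Beta(alpha + successes, beta + failures).
alpha_post = alpha_prior + successes
beta_post = beta_prior + (trials - successes)

posterior = stats.beta(alpha_post, beta_post)
print(f"Posterior mean: {posterior.mean():.3f}")           # ≈ 0.206
print(f"95% credible interval: {posterior.interval(0.95)}")
```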
7. Moments
Moments describe the shape of a distribution.
- 1st Moment: Mean (\(\mu\)) — central tendency.
\[
\mu = \frac{1}{n} \sum_{i=1}^n x_i
\]
- 2nd Moment: Variance (\(\sigma^2\)) — spread.
\[
\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2
\]
(For a sample, dividing by \(n - 1\) instead of \(n\) gives the unbiased estimate \(s^2\).)
- 3rd Moment: Skewness — asymmetry.
Positive skew: tail on right; negative skew: tail on left.
- 4th Moment: Kurtosis — "tailedness".
High kurtosis = heavy tails (more outliers).
Use in Practice:
Skewness tells you if you need to transform data (e.g., log-transform) before applying linear models.
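A quick sketch computing all four moments on right-skewed synthetic data with NumPy and SciPy; note that `scipy.stats.kurtosis` returns excess kurtosis (normal distribution = 0) by default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)  # right-skewed sample data

print(f"Mean:     {np.mean(x):.3f}")
print(f"Variance: {np.var(x):.3f}")
print(f"Skewness: {stats.skew(x):.3f}")      # > 0: tail on the right
print(f"Kurtosis: {stats.kurtosis(x):.3f}")  # excess kurtosis; normal = 0

# A log-transform often reduces right skew before linear modeling.
print(f"Skewness after log: {stats.skew(np.log1p(x)):.3f}")
```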
8. 5-Number Summary
A quick way to describe the distribution of numeric data:
- Minimum
- First Quartile (\(Q_1\)) — 25th percentile
- Median (\(Q_2\)) — 50th percentile
- Third Quartile (\(Q_3\)) — 75th percentile
- Maximum
Used to build box plots, which visualize spread and detect outliers.
Outliers are often defined as values below \(Q_1 - 1.5 \cdot IQR\) or above \(Q_3 + 1.5 \cdot IQR\), where \(IQR = Q_3 - Q_1\).
Example:
Dataset: [1, 3, 5, 7, 9, 11, 13]
→ Min=1, \(Q_1=3\), Median=7, \(Q_3=11\), Max=13
→ IQR = 8 → Outlier thresholds: -9 and 23 → No outliers.
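The same example with NumPy, as a sketch. One caveat: `np.percentile` interpolates linearly by default, giving \(Q_1 = 4\) and \(Q_3 = 10\) here, while the median-of-halves convention used above gives 3 and 11; either way, no point falls outside the outlier thresholds:

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(f"Min={data.min()}, Q1={q1}, Median={median}, Q3={q3}, Max={data.max()}")
print(f"IQR={iqr}, outlier thresholds: [{lower}, {upper}]")
print("Outliers:", data[(data < lower) | (data > upper)])
```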
9. Distributions
A distribution shows how values are spread across possible outcomes.
Common Distributions in Data Science:
- Normal (Gaussian): Bell-shaped, symmetric. Defined by \(\mu\) and \(\sigma\).
\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\]
- Binomial: Number of successes in \(n\) trials (e.g., click/no-click).
Parameters: \(n\), \(p\) (probability of success).
- Poisson: Counts of rare events in fixed intervals (e.g., website visits per hour).
Parameter: \(\lambda\) (mean rate).
- Uniform: All outcomes equally likely (e.g., random number generator).
Many statistical methods assume normality, so checking distribution shape is crucial.
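A sketch of drawing from each distribution with NumPy, plus one common normality check, the Shapiro-Wilk test from SciPy (the sample sizes and parameters here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

normal = rng.normal(loc=0, scale=1, size=500)    # Gaussian(mu, sigma)
binomial = rng.binomial(n=10, p=0.3, size=500)   # successes in n trials
poisson = rng.poisson(lam=4, size=500)           # counts of rare events
uniform = rng.uniform(low=0, high=1, size=500)   # all outcomes equally likely

# Shapiro-Wilk: a small p-value is evidence against normality.
for name, x in [("normal", normal), ("uniform", uniform)]:
    stat, p = stats.shapiro(x)
    print(f"{name}: W={stat:.3f}, p={p:.4f}")
```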
10. Distribution Comparison
Comparing distributions helps detect differences between groups or changes over time.
Methods:
- Visual: Overlay histograms, KDE plots, or box plots.
- Statistical Tests:
  - t-test: Compare means of two groups (assumes normality).
  - Mann-Whitney U test: Non-parametric alternative to t-test.
  - Kolmogorov-Smirnov test: Compares entire distributions.
  - Chi-square test: For categorical distributions.
Practical Use:
A/B testing: Compare conversion rate distributions between control and treatment groups using a chi-square test (for proportions) or t-test (for means).
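A sketch of these tests with scipy.stats on two synthetic groups; the shift between control and treatment is invented so the tests have something to detect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=2.0, size=200)
treatment = rng.normal(loc=10.6, scale=2.0, size=200)  # small invented shift

print("t-test:        ", stats.ttest_ind(control, treatment))
print("Mann-Whitney U:", stats.mannwhitneyu(control, treatment))
print("KS test:       ", stats.ks_2samp(control, treatment))

# Chi-square on a 2x2 table of hypothetical conversion counts.
table = np.array([[120, 880],    # control: converted / not converted
                  [150, 850]])   # treatment: converted / not converted
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square: chi2={chi2:.2f}, p={p:.4f}")
```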
11. Correlation
Measures the strength and direction of a linear relationship between two numeric variables.
Pearson Correlation Coefficient (\(r\)):
\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
\]
Range: \(-1 \leq r \leq 1\)
- \(r = 1\): Perfect positive linear relationship
- \(r = 0\): No linear relationship
- \(r = -1\): Perfect negative linear relationship
Important: Correlation ≠ Causation!
Other Types:
- Spearman’s \(\rho\): Rank-based; works for monotonic (not necessarily linear) relationships.
- Kendall’s \(\tau\): Another rank correlation, robust for small samples.
Example:
In housing data, square footage and price often have \(r \approx 0.7\) — strong positive correlation.
But adding more square footage doesn’t *guarantee* higher price (location matters too!).
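A brief sketch computing all three coefficients with scipy.stats; the linear relationship between `sqft` and `price` below is fabricated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sqft = rng.uniform(50, 300, size=200)
price = 1_000 * sqft + rng.normal(0, 60_000, size=200)  # noisy linear link

pearson_r, _ = stats.pearsonr(sqft, price)
spearman_rho, _ = stats.spearmanr(sqft, price)
kendall_tau, _ = stats.kendalltau(sqft, price)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
```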