Unit 3: Statistics
Measures of Central Tendency
Measures of central tendency are statistical values that describe the center point of a dataset. The three most common measures are the mean, median, and mode. These measures help summarize a set of data by providing a representative value.
1. Mean
The mean, or arithmetic average, is calculated by summing all the values in a dataset and dividing by the number of values. Mathematically, it is represented as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where $x_i$ represents each value in the dataset and $n$ is the total number of values.
Example:
Given the dataset :
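As a minimal sketch of the calculation, the following Python snippet sums the values and divides by their count. The dataset and the function name `arithmetic_mean` are illustrative assumptions, not values taken from the text.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]
print(arithmetic_mean(data))  # 6.0
```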
2. Median
The median is the middle value of a dataset when arranged in ascending order. If the number of observations is odd, the median is the middle number. If even, it is the average of the two middle numbers.
Example:
For the dataset :
- Arrange in order:
- The median is (the middle value).
For the dataset :
- Arrange in order:
- The median is .
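The odd and even cases can be sketched in a few lines of Python. The two datasets and the function name `median_value` are hypothetical and serve only to illustrate the rule.

```python
def median_value(values):
    """Return the middle value of the sorted data (average of two middles if even)."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical datasets chosen only for illustration.
print(median_value([7, 3, 9, 1, 5]))  # sorted: 1, 3, 5, 7, 9 -> 5
print(median_value([8, 2, 6, 4]))     # sorted: 2, 4, 6, 8 -> 5.0
```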
3. Mode
The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all.
Example:
For the dataset , the mode is . For , both and are modes.
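A short Python sketch can cover the unimodal, multimodal, and no-mode cases together. The datasets are hypothetical, and treating a dataset in which every value occurs exactly once as having no mode is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: no value repeats, so there is no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical datasets chosen only for illustration.
print(modes([1, 2, 2, 3, 4]))  # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
print(modes([1, 2, 3]))        # []
```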
Measures of Dispersion
Dispersion measures how much the values in a dataset spread out. Key measures include range, variance, and standard deviation.
1. Range
The range is the difference between the highest and lowest values in a dataset.
Example:
For the dataset :
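In code the range is a one-line calculation. The dataset below is a hypothetical one used only for illustration.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    if not values:
        raise ValueError("the range of an empty dataset is undefined")
    return max(values) - min(values)


# Hypothetical dataset chosen only for illustration.
print(value_range([3, 7, 1, 9, 4]))  # 9 - 1 = 8
```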
2. Variance
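Variance measures the average squared deviation of each value from the mean. It is calculated as:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $\bar{x}$ is the mean and $n$ is the number of values.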
Example:
For the dataset :
- Mean
- Variance:
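The two steps (mean first, then the average squared deviation) can be sketched as follows. This assumes the population form of the variance, with divisor $n$; the dataset is hypothetical.

```python
def population_variance(values):
    """Average squared deviation from the mean (divisor n, population form)."""
    if not values:
        raise ValueError("the variance of an empty dataset is undefined")
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Hypothetical dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]
# Mean is 6; squared deviations are 16, 4, 0, 4, 16, which sum to 40.
print(population_variance(data))  # 40 / 5 = 8.0
```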
3. Standard Deviation
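The standard deviation is the square root of the variance, $\sigma = \sqrt{\sigma^2}$. It indicates the typical distance of data points from the mean (strictly, the root-mean-square deviation) and is expressed in the same units as the data.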
Example:
Continuing from the previous example:
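As an illustration, Python's standard `statistics` module provides both the population form (`pstdev`, divisor $n$) and the sample form (`stdev`, divisor $n - 1$). The dataset is hypothetical; which form applies depends on whether the data are a whole population or a sample.

```python
import statistics

# Hypothetical dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]

# Population standard deviation: square root of the variance with divisor n.
population_sd = statistics.pstdev(data)  # sqrt(8), about 2.828

# Sample standard deviation: divisor n - 1 instead of n.
sample_sd = statistics.stdev(data)       # sqrt(10), about 3.162

print(population_sd, sample_sd)
```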
Coefficient of Variation
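The coefficient of variation (CV) is a standardized measure of dispersion, expressed as a percentage. It is calculated by dividing the standard deviation by the mean:
$$CV = \frac{\sigma}{\bar{x}} \times 100\%$$
where $\sigma$ is the standard deviation and $\bar{x}$ is the mean.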
Example:
For the dataset :
- Mean
- Standard deviation
Calculating CV:
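A minimal sketch of the calculation, assuming the population standard deviation and a hypothetical dataset:

```python
import statistics


def coefficient_of_variation(values):
    """Return the standard deviation as a percentage of the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: population standard deviation (divisor n).
    return statistics.pstdev(values) / mean * 100


# Hypothetical dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]
print(round(coefficient_of_variation(data), 2))  # about 47.14 (percent)
```

Because the CV is unitless, it can be used to compare the spread of datasets measured on different scales.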
Moments
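Moments are quantitative measures related to the shape of a dataset's distribution. The $r$-th moment about the mean is given by:
$$\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$$
where $\bar{x}$ is the mean and $n$ is the number of values.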
1. First Moment
The first moment about the mean is always zero, because the positive and negative deviations from the mean cancel exactly.
2. Second Moment
The second moment gives us the variance, providing insight into the dataset's spread.
3. Third Moment
The third moment helps to measure the skewness of the distribution, indicating asymmetry.
4. Fourth Moment
The fourth moment relates to the kurtosis of the distribution, providing information about the "tailedness" of the distribution.
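The first four moments about the mean can be computed with one small helper. This is a sketch using the divisor $n$; the function name `central_moment` and the dataset are illustrative assumptions.

```python
def central_moment(values, r):
    """Return the r-th moment about the mean, using divisor n."""
    if not values:
        raise ValueError("moments of an empty dataset are undefined")
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


# Hypothetical symmetric dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]
print(central_moment(data, 1))  # 0.0, deviations cancel
print(central_moment(data, 2))  # 8.0, the variance
print(central_moment(data, 3))  # 0.0, the data are symmetric
print(central_moment(data, 4))  # 108.8
```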
Skewness and Kurtosis
1. Skewness
Skewness quantifies the asymmetry of a distribution. A distribution can be positively skewed (long tail on the right), negatively skewed (long tail on the left), or symmetric.
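The formula for skewness is given by:
$$\gamma_1 = \frac{\mu_3}{\sigma^3}$$
where $\mu_3$ is the third moment about the mean and $\sigma$ is the standard deviation.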
Example:
For a dataset that is positively skewed, the skewness value will be greater than 0.
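The sign behaviour can be checked numerically. The sketch below assumes the moment-based definition of skewness (third central moment divided by the cube of the population standard deviation); both datasets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment over sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical datasets chosen only for illustration.
right_skewed = [1, 2, 2, 3, 10]  # one large value stretches the right tail
symmetric = [2, 4, 6, 8, 10]

print(skewness(right_skewed))  # positive (about 1.36)
print(skewness(symmetric))     # 0.0
```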
2. Kurtosis
Kurtosis measures the "tailedness" of the distribution. High kurtosis indicates heavy tails, while low kurtosis indicates light tails.
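The formula for kurtosis is:
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where $\mu_2$ and $\mu_4$ are the second and fourth moments about the mean.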
Example:
For a normal distribution, the kurtosis is 3. Distributions with kurtosis greater than 3 are considered leptokurtic (heavy-tailed), while those with less than 3 are platykurtic (light-tailed).
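A sketch of the calculation and the classification against the reference value 3 follows. It assumes the moment-based definition (fourth central moment divided by the squared second central moment); the dataset and function names are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: fourth central moment over the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


def classify_kurtosis(value, tolerance=1e-9):
    """Label a kurtosis value relative to the normal distribution's value of 3."""
    if value > 3 + tolerance:
        return "leptokurtic"
    if value < 3 - tolerance:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical dataset chosen only for illustration.
data = [2, 4, 6, 8, 10]
k = kurtosis(data)           # 108.8 / 64, about 1.7
print(classify_kurtosis(k))  # platykurtic
```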
Curve Fitting
Curve fitting involves constructing a curve that best approximates the data points in a dataset. Common methods include fitting a straight line, parabola, and other related curves.
1. Fitting of Straight Line
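The equation of a straight line is given by:
$$y = mx + b$$
where $m$ is the slope and $b$ is the y-intercept. The least squares method is commonly used to minimize the sum of the squared differences between the observed and predicted values.
Example:
Given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the best fit line can be determined using:
$$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - m\sum x_i}{n}$$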
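The closed-form least squares solution for a straight line can be sketched directly from the sums of $x$, $y$, $xy$, and $x^2$. The points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical points chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # about 0.6 and 2.2
```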
2. Fitting of Parabola
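The general equation of a parabola is:
$$y = ax^2 + bx + c$$
Fitting a parabola is similar to fitting a straight line: the coefficients are chosen to minimize the sum of squared differences between the observed values and the values predicted by the quadratic.
Example:
To fit a parabola to given points, one would solve for $a$, $b$, and $c$ using the system of three linear equations (the normal equations) obtained by setting the partial derivatives of the squared error with respect to each coefficient to zero.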
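One way to carry this out is to build the three normal equations from power sums of the data and solve the resulting 3x3 system. The sketch below uses Cramer's rule for the solve, which is a choice made here for brevity; the points are hypothetical and lie exactly on $y = x^2 - 2x + 3$.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_parabola(xs, ys):
    """Return (a, b, c) of the least squares parabola y = a*x^2 + b*x + c."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    # Normal equations, one row per partial derivative (a, b, c).
    matrix = [[sx4, sx3, sx2],
              [sx3, sx2, sx],
              [sx2, sx, n]]
    rhs = [sx2y, sxy, sy]
    det = det3(matrix)
    if det == 0:
        raise ValueError("need at least three distinct x values")
    coefficients = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coefficients.append(det3(replaced) / det)
    return tuple(coefficients)


# Hypothetical points chosen only for illustration.
xs = [-1, 0, 1, 2, 3]
ys = [6, 3, 2, 3, 6]
print(fit_parabola(xs, ys))  # (1.0, -2.0, 3.0)
```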
3. Related Curves
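Other curves, such as exponential ($y = ae^{bx}$) and logarithmic ($y = a + b\ln x$) forms, can be fitted with the same least squares approach, typically after a transformation that makes the model linear in its parameters. The choice of curve depends on the nature of the data and its distribution.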
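One common way to fit an exponential curve $y = ae^{bx}$ is to take logarithms, which gives the straight line $\ln y = \ln a + bx$, and then fit that line by least squares. The sketch below assumes all $y$ values are positive; note that it minimizes squared error in $\ln y$ rather than in $y$ itself. The data are hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Return (a, b) for y = a * exp(b * x), fitted on the log-transformed data."""
    if any(y <= 0 for y in ys):
        raise ValueError("log-linear fitting requires positive y values")
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the rate b is undefined")
    sxl = sum((x - mean_x) * (v - mean_log) for x, v in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Hypothetical data generated from y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)  # about 2.0 and 0.5
```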
Correlation and Regression
Correlation measures the strength and direction of a relationship between two variables. It is quantified using the correlation coefficient $r$, which ranges from $-1$ to $1$.
1. Correlation Coefficient
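The Pearson correlation coefficient is defined as:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the means of the two variables.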
Example:
For paired datasets $x$ and $y$, calculate $r$ to determine the strength of their relationship.
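A direct sketch of the Pearson coefficient, computed from deviations about the two means. The paired data and the function name `pearson_r` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired datasets."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))  # 0.7746, a fairly strong positive relationship
```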
2. Simple Linear Regression
Simple linear regression is used to model the relationship between two variables by fitting a linear equation. The model predicts the dependent variable $y$ based on the independent variable $x$:
$$y = mx + b$$
where $m$ is the slope and $b$ is the y-intercept.
Example:
Using the least squares method, determine $m$ and $b$ to minimize the error in predictions.
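The least squares estimates can also be written in deviation form: the slope is the sum of cross-deviations divided by the sum of squared $x$ deviations, and the intercept makes the line pass through the point of means. The sketch below fits the model and then uses it for prediction; the data and function names are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Return (slope, intercept) estimated by least squares in deviation form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the independent variable must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted value of the dependent variable for a given x."""
    return slope * x + intercept


# Hypothetical data chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_simple_regression(xs, ys)
residuals = [y - predict(x, slope, intercept) for x, y in zip(xs, ys)]
print(slope, intercept)                 # about 0.6 and 2.2
print(predict(6, slope, intercept))     # about 5.8
print(sum(r ** 2 for r in residuals))   # the minimized sum of squared errors
```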
3. Multiple Linear Regression
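Multiple linear regression extends simple linear regression to include multiple independent variables:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
where $\beta_0$ is the intercept, and $\beta_1, \ldots, \beta_k$ are the coefficients for each independent variable.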
Example:
Using datasets with multiple features, apply the least squares method to estimate the coefficients.
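With several predictors, the least squares coefficients satisfy the normal equations $X^{T}X\beta = X^{T}y$, where $X$ is the data matrix with a leading column of ones for the intercept. The sketch below builds and solves that system in plain Python using Gaussian elimination, which is an implementation choice made here; the data are hypothetical and generated from $y = 1 + 2x_1 + 3x_2$.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):  # back substitution
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - known) / aug[row][row]
    return solution


def fit_multiple_regression(rows, ys):
    """Return [intercept, coef_1, ..., coef_k] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    width = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(width)]
           for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(width)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: each row is (x1, x2), with y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
coefficients = fit_multiple_regression(rows, ys)
print(coefficients)  # approximately [1.0, 2.0, 3.0]
```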
Probability Distribution
Probability distributions describe how probabilities are distributed over the values of a random variable. Common types include the normal distribution, binomial distribution, and Poisson distribution.
1. Normal Distribution
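The normal distribution is characterized by its bell-shaped curve, defined by the mean $\mu$ and standard deviation $\sigma$.
Example:
For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the probability density function is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$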
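The density can be evaluated directly from its formula. The sketch below uses the standard normal case (mean 0, standard deviation 1) as an assumed illustration; the function name `normal_pdf` is hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Assumed illustration: the standard normal distribution.
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # 0.3989, the peak of the bell curve
print(normal_pdf(1.0, 0.0, 1.0) == normal_pdf(-1.0, 0.0, 1.0))  # True, symmetric
```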
2. Binomial Distribution
The binomial distribution models the number of successes in $n$ independent Bernoulli trials, each with probability of success $p$. The probability mass function is:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
where $k$ is the number of successes.
Example:
Given values of $n$ and $p$, calculate the probability of getting exactly $k$ successes by substituting into the formula.
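A minimal sketch of the binomial probability calculation. The parameter values (10 trials, success probability 0.5, 3 successes) are assumptions chosen only for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if not 0 <= k <= n:
        return 0.0
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Assumed illustration: 10 trials, success probability 0.5, exactly 3 successes.
print(binomial_pmf(3, 10, 0.5))  # 120 / 1024 = 0.1171875
```

Summing `binomial_pmf(k, n, p)` over every `k` from 0 to `n` gives 1, which is a quick sanity check on any implementation.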
3. Poisson Distribution
The Poisson distribution models the number of events occurring in a fixed interval of time or space, with a known average rate $\lambda$. The probability mass function is given by:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
where $k$ is the number of events.
Example:
For a given rate $\lambda$, find the probability of observing exactly $k$ events by substituting into the formula.
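A minimal sketch of the Poisson probability calculation. The rate of 3 events per interval and the count of 2 events are assumptions chosen only for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average rate is lam."""
    if lam <= 0:
        raise ValueError("the rate must be positive")
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Assumed illustration: average of 3 events per interval, exactly 2 observed.
print(round(poisson_pmf(2, 3.0), 4))  # 0.224
```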