Study Material
Semester-04
M3
Unit-03

Unit 3: Statistics

Measures of Central Tendency

Measures of central tendency are statistical values that describe the center point of a dataset. The three most common measures are the mean, median, and mode. These measures help summarize a set of data by providing a representative value.

1. Mean

The mean, or arithmetic average, is calculated by summing all the values in a dataset and dividing by the number of values. Mathematically, it is represented as:

\text{Mean}(\mu) = \frac{\sum_{i=1}^{n} x_i}{n}

where x_i represents each value in the dataset and n is the total number of values.

Example:

Given the dataset {2, 4, 6, 8, 10}:

\text{Mean} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6
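The calculation above can be reproduced in a short Python sketch (the `mean` helper here is illustrative; the standard library's `statistics.mean` does the same job):

```python
def mean(data):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(data) / len(data)

print(mean([2, 4, 6, 8, 10]))  # 6.0
```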

2. Median

The median is the middle value of a dataset when arranged in ascending order. If the number of observations is odd, the median is the middle number. If even, it is the average of the two middle numbers.

Example:

For the dataset {3, 1, 4, 2, 5}:

  1. Arrange in order: {1, 2, 3, 4, 5}
  2. The median is 3 (the middle value).

For the dataset {1, 2, 3, 4}:

  1. The values are already in order: {1, 2, 3, 4}
  2. The median is \frac{2 + 3}{2} = 2.5.
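Both the odd and even cases can be handled by one small function (an illustrative sketch; the standard library's `statistics.median` behaves the same way):

```python
def median(data):
    """Middle value of the sorted data; average of the two
    middle values when the count is even."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([3, 1, 4, 2, 5]))  # 3
print(median([1, 2, 3, 4]))     # 2.5
```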

3. Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all.

Example:

For the dataset {1, 2, 2, 3, 4}, the mode is 2. For {1, 1, 2, 2, 3}, both 1 and 2 are modes.
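A sketch of how all modes can be found, covering the multimodal case (illustrative; `statistics.multimode` in the standard library is similar):

```python
from collections import Counter

def modes(data):
    """All values that occur with the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 4]))  # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```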

Measures of Dispersion

Dispersion measures how much the values in a dataset spread out. Key measures include range, variance, and standard deviation.

1. Range

The range is the difference between the highest and lowest values in a dataset.

\text{Range} = \text{Max} - \text{Min}

Example:

For the dataset {3, 7, 5, 9, 1}:

\text{Range} = 9 - 1 = 8
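As a minimal sketch (`value_range` is a name chosen for this example, since `range` is a Python built-in):

```python
def value_range(data):
    """Difference between the largest and smallest values."""
    return max(data) - min(data)

print(value_range([3, 7, 5, 9, 1]))  # 8
```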

2. Variance

Variance measures the average squared deviation of each number from the mean. It is calculated as:

\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}

Example:

For the dataset {2, 4, 6}:

  1. Mean: μ = 4
  2. Variance:
\sigma^2 = \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3} = \frac{4 + 0 + 4}{3} = \frac{8}{3} \approx 2.67

3. Standard Deviation

The standard deviation is the square root of the variance and indicates the average distance of each data point from the mean.

\sigma = \sqrt{\sigma^2}

Example:

Continuing from the previous example:

\sigma = \sqrt{\frac{8}{3}} \approx 1.63
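Both measures can be computed together, following the population formulas above (an illustrative sketch; the function names are chosen for this example):

```python
import math

def variance(data):
    """Population variance: mean squared deviation from the mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def std_dev(data):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(data))

print(round(variance([2, 4, 6]), 4))  # 2.6667
print(round(std_dev([2, 4, 6]), 4))   # 1.633
```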

Coefficient of Variation

The coefficient of variation (CV) is a standardized measure of dispersion, expressed as a percentage. It is calculated by dividing the standard deviation by the mean:

\text{CV} = \frac{\sigma}{\mu} \times 100

Example:

For the dataset {2, 4, 6}:

  1. Mean: μ = 4
  2. Standard deviation: σ ≈ 1.63

Calculating CV:

\text{CV} = \frac{1.63}{4} \times 100 \approx 40.75\%
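A sketch of the same computation in Python (it prints ≈ 40.82 rather than 40.75, because σ is not rounded to 1.63 before dividing):

```python
import math

def coefficient_of_variation(data):
    """Standard deviation expressed as a percentage of the mean."""
    mu = sum(data) / len(data)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
    return sigma / mu * 100

print(round(coefficient_of_variation([2, 4, 6]), 2))  # 40.82
```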

Moments

Moments are quantitative measures related to the shape of a dataset's distribution. The r-th moment about the mean (the r-th central moment) is given by:

\mu_r = \frac{\sum_{i=1}^{n} (x_i - \mu)^r}{n}

1. First Moment

The first central moment is always zero, because the positive and negative deviations from the mean cancel exactly.

2. Second Moment

The second moment gives us the variance, providing insight into the dataset's spread.

3. Third Moment

The third moment helps to measure the skewness of the distribution, indicating asymmetry.

4. Fourth Moment

The fourth moment relates to the kurtosis of the distribution, providing information about the "tailedness" of the distribution.
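The central moments above can be computed directly from the definition (an illustrative sketch; `central_moment` is a name chosen here):

```python
def central_moment(data, r):
    """r-th moment about the mean of the dataset."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** r for x in data) / len(data)

data = [2, 4, 6]
print(central_moment(data, 1))            # 0.0 (first moment is always zero)
print(round(central_moment(data, 2), 4))  # 2.6667 (the variance)
```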

Skewness and Kurtosis

1. Skewness

Skewness quantifies the asymmetry of a distribution. A distribution can be positively skewed (long tail on the right), negatively skewed (long tail on the left), or symmetric.

The formula for the (adjusted sample) skewness S is given by:

S = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3

Example:

For a dataset that is positively skewed, the skewness value will be greater than 0.

2. Kurtosis

Kurtosis measures the "tailedness" of the distribution. High kurtosis indicates heavy tails, while low kurtosis indicates light tails.

The formula for the sample excess kurtosis K is:

K = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}

Example:

For a normal distribution, the kurtosis is 3 (equivalently, the excess kurtosis is 0). Distributions with kurtosis greater than 3 are considered leptokurtic (heavy-tailed), while those with kurtosis less than 3 are platykurtic (light-tailed).
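Both formulas translate directly into code (a sketch; σ is taken here as the sample standard deviation, the usual convention for these adjusted formulas, and the datasets are invented for illustration):

```python
import math

def sample_skewness(data):
    """Adjusted sample skewness, following the formula above
    (sigma taken as the sample standard deviation)."""
    n = len(data)
    mu = sum(data) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mu) / s) ** 3 for x in data)

def excess_kurtosis(data):
    """Sample excess kurtosis, following the formula above;
    approximately 0 for normally distributed data."""
    n = len(data)
    mu = sum(data) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in data) / (n - 1))
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    tail = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((x - mu) / s) ** 4 for x in data) - tail

symmetric = [1, 2, 3, 4, 5]      # invented example data
right_skewed = [1, 1, 2, 2, 10]  # long tail on the right

print(abs(sample_skewness(symmetric)) < 1e-9)  # True (no asymmetry)
print(sample_skewness(right_skewed) > 0)       # True (positive skew)
print(round(excess_kurtosis(symmetric), 4))    # -1.2 (platykurtic)
```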

Curve Fitting

Curve fitting involves constructing a curve that best approximates the data points in a dataset. Common methods include fitting a straight line, parabola, and other related curves.

1. Fitting of Straight Line

The equation of a straight line is given by:

y = mx + c

where m is the slope and c is the y-intercept. The least squares method is commonly used to minimize the sum of the squared differences between the observed and predicted values.

Example:

Given points (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), the slope of the best-fit line is:

m = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}

and the intercept then follows as c = \frac{\sum y - m \sum x}{n}.
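The closed-form least-squares solution can be sketched as follows (the intercept is recovered from c = (Σy - mΣx)/n; `fit_line` is an illustrative name):

```python
def fit_line(points):
    """Least-squares slope m and intercept c for a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    c = (sy - m * sx) / n
    return m, c

# Points lying exactly on y = 2x + 1 (invented for illustration):
print(fit_line([(0, 1), (1, 3), (2, 5), (3, 7)]))  # (2.0, 1.0)
```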

2. Fitting of Parabola

The general equation of a parabola is:

y = ax^2 + bx + c

Fitting a parabola is similar to fitting a straight line but involves minimizing the sum of squared differences in a quadratic context.

Example:

To fit a parabola through given points, one solves for a, b, and c using the system of equations derived from the points.
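For the special case of exactly three points, that system is 3×3 and can be solved in closed form with Cramer's rule (an illustrative sketch; a general least-squares parabola fit over many points works analogously via the normal equations):

```python
def parabola_through(p1, p2, p3):
    """Coefficients (a, b, c) of y = a x^2 + b x + c passing exactly
    through three points, via Cramer's rule on the 3x3 system."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3

    def det(m):
        # Determinant of a 3x3 matrix by cofactor expansion.
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det([[x1 * x1, x1, 1], [x2 * x2, x2, 1], [x3 * x3, x3, 1]])
    a = det([[y1, x1, 1], [y2, x2, 1], [y3, x3, 1]]) / d
    b = det([[x1 * x1, y1, 1], [x2 * x2, y2, 1], [x3 * x3, y3, 1]]) / d
    c = det([[x1 * x1, x1, y1], [x2 * x2, x2, y2], [x3 * x3, x3, y3]]) / d
    return a, b, c

# Points on y = x^2 - 2x + 3 (invented for illustration):
print(parabola_through((0, 3), (1, 2), (2, 3)))  # (1.0, -2.0, 3.0)
```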

3. Related Curves

Other related curves such as exponential and logarithmic can also be fitted to data using similar methods. The choice of curve depends on the nature of the data and its distribution.

Correlation and Regression

Correlation measures the strength and direction of a relationship between two variables. It is quantified using the correlation coefficient r, which ranges from -1 to 1.

1. Correlation Coefficient

The Pearson correlation coefficient is defined as:

r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}

Example:

For datasets X = {x_1, x_2, x_3} and Y = {y_1, y_2, y_3}, calculate r to determine the strength of their relationship.
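The raw-sum formula above translates directly into code (a sketch, with invented perfectly correlated datasets to show the extreme values of r):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw-sum formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return ((n * sxy - sx * sy)
            / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2)))

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative)
```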

2. Simple Linear Regression

Simple linear regression is used to model the relationship between two variables by fitting a linear equation. The model predicts the dependent variable y based on the independent variable x:

y = mx + c

where m is the slope and c is the y-intercept.

Example:

Using the least squares method, determine m and c to minimize the error in predictions.

3. Multiple Linear Regression

Multiple linear regression extends simple linear regression to include multiple independent variables:

y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k

where b_0 is the intercept, and b_1, b_2, \ldots, b_k are the coefficients for each independent variable.

Example:

Using datasets with multiple features, apply the least squares method to estimate the coefficients.

Probability Distribution

Probability distributions describe how probabilities are distributed over the values of a random variable. Common types include the normal distribution, binomial distribution, and Poisson distribution.

1. Normal Distribution

The normal distribution is characterized by its bell-shaped curve, defined by the mean μ and standard deviation σ.

Example:

For a normal distribution with μ = 0 and σ = 1, the probability density function is given by:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
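The density can be evaluated directly from this formula (a sketch; at x = 0 the standard normal density peaks at 1/\sqrt{2\pi} ≈ 0.3989):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution at x."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(normal_pdf(0.0), 4))  # 0.3989
```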

2. Binomial Distribution

The binomial distribution models the number of successes in n independent Bernoulli trials, each with probability of success p. The probability mass function is:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

where k is the number of successes.

Example:

If n = 10 and p = 0.5, calculate the probability of getting k = 5 successes.
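This example works out to \binom{10}{5} / 2^{10} = 252/1024 ≈ 0.246 (a sketch using `math.comb` for the binomial coefficient):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(5, 10, 0.5))  # 0.24609375
```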

3. Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space, with a known average rate λ. The probability mass function is given by:

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

Example:

For λ = 3, find the probability of observing k = 2 events.
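This example evaluates to 3^2 e^{-3} / 2! ≈ 0.224 (a minimal sketch):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

print(round(poisson_pmf(2, 3), 4))  # 0.224
```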