Statistics is defined as follows:

“The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.”

SAY WHAT??

In other words: statistics is a discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

The two major areas of statistics are descriptive statistics and inferential statistics. Statisticians gather data about the individuals or elements of a sample, then analyze those data to produce descriptive statistics. They can then use the observed characteristics of the sample data, which are properly called “statistics,” to make inferences (or educated guesses) about the unmeasured characteristics of the broader population. This is usually done by using those properties to test hypotheses and draw conclusions.

**Descriptive statistics**

Descriptive statistics mostly focus on the central tendency, variability, and distribution of sample data.

**Central tendency**

Central tendency refers to estimates of what a typical element of a sample or population looks like, and includes descriptive statistics such as the mean, median, and mode.

**Mean, Median and Mode**

- Mean: balance point
- Median: middle value when ordered
- Mode: most frequent
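As a quick sketch, all three measures can be computed with Python’s standard `statistics` module (the data set here is made up):

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]

m = mean(data)      # balance point: 30 / 6 = 5
med = median(data)  # middle of the ordered data: average of 3 and 5 = 4
mod = mode(data)    # most frequent value: 3 appears twice
```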

**Variability**

Variability refers to a set of statistics that show how much difference there is among the elements of a sample or population along the characteristics measured, and includes metrics such as range, variance, and standard deviation.

**Variance**

Variance = The average of the squared distances from the mean. Variance is interesting in a mathematical sense, but the standard deviation is often a much better measure of how spread out the distribution is, since it is expressed in the same units as the data itself.

Two important measurements closely related to the variance are the covariance and the correlation.

**Covariance**

Covariance is a measure of how much two random variables vary together. It is similar to variance: where variance tells you how much a single variable varies, covariance tells you how two variables vary together. If two variables are independent, their covariance is 0. However, a covariance of 0 does not imply that the variables are independent.
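A minimal sketch of this in plain Python (the series are made up; this is the population covariance, dividing by n):

```python
def covariance(xs, ys):
    # Population covariance: the average product of each pair's
    # deviations from their respective means.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

x = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]    # moves with x: positive covariance
down = [10, 8, 6, 4, 2]  # moves against x: negative covariance

cov_up = covariance(x, up)
cov_down = covariance(x, down)
```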

**Correlation**

Correlation is a standardized version of covariance. A measure called Pearson’s correlation is often used to express the strength of the relationship between two variables. Pearson’s coefficient is the covariance between the two random variables divided by the standard deviation of the first random variable times the standard deviation of the second random variable. A correlation is always between -1 and +1. A positive correlation coefficient (between 0 and 1) indicates that an increase in the first variable corresponds to an increase in the second variable, implying a direct relationship between the variables. A negative correlation (between -1 and 0) indicates an inverse relationship: as one variable increases, the other decreases.
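That ratio can be sketched directly in Python (toy data; population formulas, dividing by n):

```python
import math

def pearson(xs, ys):
    # Pearson's r: covariance divided by the product of the two
    # standard deviations, which rescales it into the range [-1, +1].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

r_direct = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # perfect direct relationship
r_inverse = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # perfect inverse relationship
```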

**Standard deviation**

Standard deviation = The square root of the variance
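Both definitions fit in a few lines of plain Python (the data set is made up; this is the population version, dividing by n):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # 40 / 8 = 5

# Variance: the average of the squared distances from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance, which puts
# the spread back into the same units as the data.
std_dev = math.sqrt(variance)
```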

**Distribution**

The distribution refers to the overall “shape” of the data, which can be depicted on a chart such as a histogram or dot plot, and includes properties such as the probability distribution function, skewness, and kurtosis. Descriptive statistics can also describe differences between observed characteristics of the elements of a data set. Descriptive statistics help us understand the collective properties of the elements of a data sample and form the basis for testing hypotheses and making predictions using inferential statistics.

**Skewness**

Skewness = A measure of the asymmetry of a distribution: how much heavier one tail is than the other. For example, if your distribution has a long tail of high values, it is ‘skewed’ towards the high values (positive skew).

**Kurtosis**

Kurtosis = A measure of how ‘fat’ the tails in the distribution are.

Skewness and kurtosis are higher-order moments of a distribution, and the higher the moment, the harder it is to estimate from samples: larger samples are required in order to obtain good estimates.
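Both moments can be sketched in plain Python (population formulas; the data sets are made up). Kurtosis here is “excess” kurtosis, which subtracts 3 so that a normal distribution scores 0:

```python
import math

def skewness(data):
    # Third standardized moment: average cubed deviation from the
    # mean, divided by the standard deviation cubed.
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def kurtosis(data):
    # Fourth standardized moment, minus 3 ("excess" kurtosis).
    n = len(data)
    m = sum(data) / n
    var = sum((x - m) ** 2 for x in data) / n
    return sum((x - m) ** 4 for x in data) / (n * var ** 2) - 3

sym = [1, 2, 3, 4, 5]              # symmetric: skewness 0
right = [1, 2, 2, 3, 3, 3, 10]     # long tail of high values: positive skew

skew_sym = skewness(sym)
skew_right = skewness(right)
kurt_sym = kurtosis(sym)           # flat-ish data: negative excess kurtosis
```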

**Inferential statistics**

Inferential statistics are tools for drawing conclusions about the characteristics of a population based on the characteristics of a sample, and for deciding how certain we can be of the reliability of those conclusions. Based on the sample size and distribution, we can calculate the probability that the statistics (which measure the central tendency, variability, distribution, and relationships between characteristics within a data sample) provide an accurate picture of the corresponding parameters of the whole population from which the sample was drawn.

So, inferential statistics are used to make generalizations about large groups, such as estimating average demand for a product by surveying a sample of consumers’ buying habits, or to attempt to predict future events.

Regression analysis is a common method for statistical inference. A regression analysis attempts to determine the strength and character of the relationship (or correlation) between a dependent variable and one or more independent variables. The output of a regression model can be analyzed for statistical significance (the p-value, more about that later), which refers to the claim that a result generated by testing or experimentation is not likely to have occurred randomly or by chance, but is instead likely to be attributable to a specific cause supported by the data. Statistical significance matters for any business that relies on data to make important decisions.

Note: A regression can also be used to predict future behaviour through predictive analysis; I will talk about that later.
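As a minimal sketch of simple linear regression, here is ordinary least squares in plain Python (`hours` and `sales` are made-up variables; a real analysis would typically use a library such as SciPy or statsmodels, which also reports the p-value of the slope):

```python
def fit_line(xs, ys):
    # Ordinary least squares for one independent variable:
    # slope = covariance(x, y) / variance(x); the intercept makes
    # the fitted line pass through the point of means.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

hours = [1, 2, 3, 4, 5]    # hypothetical independent variable
sales = [3, 5, 7, 9, 11]   # hypothetical dependent variable (y = 2x + 1)

slope, intercept = fit_line(hours, sales)
```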

**Other inferential statistics**

**T-Test**

A t-test is a relatively basic test that can be used to test for a difference between two means. Statistical significance in a t-test shows whether the difference between the two means is likely to have occurred by chance, or is instead likely to be related to a specific cause supported by the data.
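A sketch using SciPy (assuming `scipy` is installed; the two groups are made-up measurements):

```python
from scipy.stats import ttest_ind

# Two hypothetical groups with clearly different means.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [6.1, 5.9, 6.0, 6.2, 5.8]

# ttest_ind runs an independent two-sample t-test; a low p-value
# suggests the difference in means is unlikely to be due to chance.
result = ttest_ind(group_a, group_b)
```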

**ANOVA**

An ANOVA (Analysis of Variance) is related to a t-test in that it also compares means. The difference is that an ANOVA can compare multiple means at the same time: it is a statistical test based on comparing the variance within each group to the variance between groups. Statistical significance here means the same as described above.
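A sketch of a one-way ANOVA with SciPy (assuming `scipy` is installed; the three groups are made up, with the third clearly different from the other two):

```python
from scipy.stats import f_oneway

group_1 = [1, 2, 1, 2, 1]
group_2 = [2, 1, 2, 1, 2]
group_3 = [8, 9, 8, 9, 8]  # clearly different mean

# f_oneway compares the variance between groups to the variance
# within groups, and returns an F statistic and a p-value.
result = f_oneway(group_1, group_2, group_3)
```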

**Hypothesis testing**

In order to do one of the tests above, two hypotheses are usually stated where there is a H0 (Null hypothesis) and Ha (Alternative Hypothesis). We can make four different decisions with hypothesis testing:

- Reject H0 and H0 is not true (no error)
- Reject H0 and H0 is true (Type 1 error)
- Do not reject H0 and H0 is true (no error)
- Do not reject H0 and H0 is not true (Type 2 error)

Type 1 error is also called Alpha error. Type 2 error is also called Beta error.

So now you know this, what happens next?

**The P-Value**

It’s all about the significance of the p-value. A p-value is the probability of finding results equal to or more extreme than those observed, assuming the null hypothesis (H0) is true. In other words, a low p-value (< 0.05) means that we have compelling evidence to reject the null hypothesis.

If the p-value is lower than 5% (p < 0.05), we often reject H0 and accept Ha. We say that p < 0.05 is statistically significant, because there is less than a 5% chance of seeing results this extreme if the null hypothesis were true.

**What else?**

In order to get to this point, you’ll spend most of your time doing exploratory data analysis: checking for missing values, transforming variables, and examining distributions, skewness, correlations, means, and other interesting findings. You might have to search for outliers, which can affect your mean and therefore heavily affect your results. What you do with them depends on many factors: how many outliers are there, and in which variables? Are those variables important or not? Sometimes it is all about the outliers, and they tell you the story you’re looking for. Inferential statistics is a complex (problem-solving) process.

**Why do you need to know this?**

As shown above, statistics is full of precise definitions about what things mean, how to use them, and what to do with them. Sometimes it takes a while to get your head around it. For me it was like a light bulb: I was suddenly able to flick a switch after being left in the dark for a long time. Once that happens, it becomes more natural to use, and reading about it is not as scary anymore. For me it really has been a game changer in understanding statistics. It helped me develop data analytics skills, but also data science skills, since a lot of these terms are used regularly.

And as always, feel free to ask me any questions about it.

Thanks!

Laura