Patterns and Trends
Data vs Evidence
- The terms data and evidence are often used interchangeably, but in scientific inquiries the terms refer to two different concepts.
- Data is essentially pure information.
- The results of a scientific inquiry (e.g. a table or spreadsheet) with no interpretation are an example of data.
- There is no context or interpretation attached.
- Evidence is data with context.
- While data can exist independently, when conclusions and analyses are made from it, it is evidence.
- The data becomes evidence for a statement.
- Data is only evidence when there is an opinion, viewpoint, or argument that it reinforces or refutes.
- Data has no meaning alone; it must be in the form of evidence to be of any use.
Statistics in Scientific Research
Mean, Median, and Standard Deviation
- Mean refers to the average of a dataset
- Mean is calculated using $\frac{\text{Sum of values}}{\text{Number of values}},$ and is represented by either $\mu$ or $\bar{x}$
- Median refers to the middle value of the dataset: with the $n$ values sorted, it is the value at position $\frac{n+1}{2}$ (for even $n$, the mean of the two middle values)
- Median is represented by $Med(X),$ where $X$ is the relevant variable
- Standard deviation is the amount of variation in a dataset
- Standard Deviation can be calculated using $\sqrt{\frac{\sum{(x_{i}-\mu)^{2}}}{n}}$ where $n$ is the number of values in the dataset, and $x_i$ represents each value of the variable
- Standard Deviation is represented by $\sigma$, and is the square root of variance
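As a quick sanity check of the formulas above, here's a minimal Python sketch (the dataset is made up purely for illustration):

```python
# Mean, median, and population standard deviation of a small made-up dataset.
import numpy as np

data = np.array([4.0, 7.0, 2.0, 9.0, 5.0, 5.0])

mean = np.mean(data)      # sum of values / number of values
median = np.median(data)  # middle value (mean of the two middle values here, since n is even)
sigma = np.std(data)      # population SD: sqrt(sum((x_i - mu)^2) / n), matching the formula above

print(f"mean = {mean}, median = {median}, sigma = {sigma:.3f}")
```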
Statistics Tests
F-Test
When is it used?
- When you have 2 numerical datasets, and want to compare their variances (how much they deviate from their respective means)
What does it tell us?
- The further the result of the F-test is from 1, the stronger the evidence for unequal population variances.
- Therefore, a higher F-statistic can be interpreted as stronger evidence that the two population variances differ.
How is an F-statistic calculated?
Define the null hypothesis $(H_0)$ as “The two variables have equal variance,” and the alternate hypothesis $(H_1)$ as “the two variables have unequal variances.”
- Don’t actually write that if you’re asked. Instead, write “variable 1 is dependent on variable 2” for the null, and “variable 1 is not dependent on variable 2” for the alternate.
Calculate the statistic using $F=\frac{\sigma_1^2}{\sigma_2^2}$, where $\sigma_1^2$ is the larger of the two variances (so that $F\geq1$).
How can a conclusion be drawn from the F-test?
- Use an f-statistic table to determine the critical F-value of the dataset:
- The significance/alpha level will be written above the table, usually in the form of $\alpha=0.05$ (where 0.05, or 5%, is the significance level). If the question doesn’t give you the significance level, assume 0.05.
- The numerator’s degrees of freedom (number of values of variable 1, minus 1) is along the top
- The denominator’s degrees of freedom (number of values of variable 2, minus 1) is along the left side
- You’ll be given the table for any in-class test.
- If the calculated F-statistic is lower than the critical F-value, accept the null hypothesis. If the calculated F-statistic is greater than the critical value, reject the null hypothesis.
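The whole F-test procedure can be sketched in a few lines of Python. This is a minimal illustration with made-up samples, using scipy's F-distribution in place of a printed table:

```python
# F-test for equal variances on two made-up samples.
import numpy as np
from scipy import stats

group1 = np.array([12.1, 14.3, 13.8, 15.0, 12.9, 14.4])
group2 = np.array([11.8, 12.0, 12.3, 11.9, 12.1, 12.2])

# Sample variances, with the larger on top so that F >= 1.
var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
f_stat = max(var1, var2) / min(var1, var2)

# Degrees of freedom: (number of values - 1) for each group.
# (Equal sample sizes here, so numerator/denominator order doesn't matter.)
dfn, dfd = len(group1) - 1, len(group2) - 1

# Critical F-value at alpha = 0.05, read from the F-distribution.
alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)

print(f"F = {f_stat:.3f}, critical = {f_crit:.3f}")
print("reject H0" if f_stat > f_crit else "accept H0")
```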
T-Test
When is it used?
- To compare 2 normally-distributed variables with unknown variances
- Can be used alongside the F-Test
What does it tell us?
- T-test determines whether there is a significant difference between the means of 2 groups of data.
- Results from a T-test can be used as evidence that 2 groups are (or aren’t) significantly different.
How is the T-Statistic calculated?
- Identify the mean $(\bar{x}),$ standard deviation $(\sigma),$ and number of values $(n)$ for each group.
- Establish a null hypothesis stating that mean 1 $(\bar{x}_1)$ and mean 2 $(\bar{x}_2)$ are equal.
- This will usually be phrased as $H_0:$ There is no difference between variables 1 and 2.
- Following this with an alternate hypothesis is also a good idea.
- Use the formula for T-test with 2 variables:
$$T=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{{\sigma_1}^2}{n_1}+\frac{{\sigma_2}^2}{n_2}}}$$
If you only have 1 variable, use the 1-var t-test, where $\mu_0$ is the mean stated in the null hypothesis:
$$T=\frac{\bar{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}}$$
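The 2-variable formula above is the same statistic scipy uses for its unequal-variance (Welch) t-test, so it can be cross-checked. A minimal sketch with made-up data:

```python
# Two-sample T-statistic: manual calculation vs scipy's Welch t-test.
import numpy as np
from scipy import stats

group1 = np.array([5.1, 4.9, 5.6, 5.3, 5.0, 5.4])
group2 = np.array([4.5, 4.7, 4.4, 4.8, 4.6, 4.3])

x1, x2 = np.mean(group1), np.mean(group2)
s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
n1, n2 = len(group1), len(group2)

# T = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2)
t_manual = (x1 - x2) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# scipy computes the same statistic (plus a p-value).
t_scipy, p_value = stats.ttest_ind(group1, group2, equal_var=False)

print(f"manual T = {t_manual:.3f}, scipy T = {t_scipy:.3f}, p = {p_value:.4f}")
```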
How is the T-statistic interpreted?
- Usually, a p-value is used to interpret the t-statistic. However, critical T-values can also be found using yet another table.
- This time, a few extra steps are needed:
- Determine your significance/alpha level (assume $\alpha=0.05$ unless told otherwise).
- Determine if your test is 1-tailed or 2-tailed:
- Rephrase the question as an equation (for example, from “25% of packets are too heavy” to “Too heavy > 25%”)
- If the equation has “greater than” or “less than”, you need a 1-tailed t-test
- If the equation has “equals”, you need a 2-tailed t-test
- Now we move to the Empirical rule: because your data is normally distributed, you need to determine how many standard deviations from the mean your data can fall.
- For example, an alpha value of 0.05 (or 5%) means data needs to fall between $\bar{x}-2\sigma$ and $\bar{x}+2\sigma$ (because 95% of values are $-2\leq z\leq2$).
- If your test is 2-tailed, halve your alpha level (the significance level is split across both tails of the symmetrical distribution, so each tail gets half).
- Use a t-score table to determine the critical t-value.
How can conclusions be drawn?
- If the calculated t-value is greater than the critical value, reject the null hypothesis. Otherwise, accept the null hypothesis.
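Critical t-values can also be read straight from the t-distribution rather than a table. A small sketch (the degrees of freedom and calculated T below are made-up example values):

```python
# Critical t-values for 1-tailed and 2-tailed tests at alpha = 0.05.
from scipy import stats

alpha = 0.05
df = 10  # degrees of freedom (example value)

# 1-tailed: all of alpha sits in one tail.
t_crit_one = stats.t.ppf(1 - alpha, df)      # ~1.812

# 2-tailed: halve alpha, since it is split across both tails.
t_crit_two = stats.t.ppf(1 - alpha / 2, df)  # ~2.228

t_calculated = 2.5  # example calculated T-statistic
print(f"1-tailed critical t = {t_crit_one:.3f}, 2-tailed = {t_crit_two:.3f}")
print("reject H0" if abs(t_calculated) > t_crit_two else "accept H0")
```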
I still don’t get it :/
Crash Course Statistics has a good video on T-tests that explains it far better than I have here.
Chi-Squared Test $(\chi^2)$
When is it used?
- To determine whether a categorical variable fits an expected distribution.
- Can only be used on discrete counts (frequencies), not continuous measurements.
What does it tell us?
- Chi-Squared determines whether the difference between the observed and theoretical distributions is significant enough to be meaningful.
- Variation between observed and expected might be chance/randomness, but a Chi-Squared test will determine the likelihood of this.
How is Chi-Squared calculated?
$$\chi^{2}=\frac{(O_1-E_1)^{2}}{E_1}+\frac{(O_2-E_2)^{2}}{E_2}+…$$
Where $O$ is the observed value, and $E$ is the expected value.
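A minimal sketch of the calculation, with made-up observed and expected frequencies (scipy's `chisquare` computes the same statistic):

```python
# Chi-squared statistic: manual calculation vs scipy.
import numpy as np
from scipy import stats

observed = np.array([18, 22, 30, 30])
expected = np.array([25, 25, 25, 25])  # must sum to the same total as observed

# chi^2 = sum((O - E)^2 / E)
chi2_manual = np.sum((observed - expected) ** 2 / expected)

chi2_scipy, p_value = stats.chisquare(observed, expected)

print(f"manual chi^2 = {chi2_manual:.3f}, scipy chi^2 = {chi2_scipy:.3f}")
```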
How can $\chi^2$ be interpreted?
- Identify degrees of freedom: (number of rows minus 1) times (number of columns minus 1). For a single row of $k$ categories, this is just $k-1$.
- Identify alpha/significance level (usually $\alpha=0.05$)
- Use a chi-squared table to determine the critical value.
- If the calculated chi-squared value is greater than the critical value, reject the null hypothesis. Otherwise, accept it.
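Continuing the sketch above, the critical value can be read from the chi-squared distribution instead of a table (the degrees of freedom below assume the 4 made-up categories from before):

```python
# Critical chi-squared value and conclusion for the example above.
from scipy import stats

df = 4 - 1  # 4 categories in a single row, minus 1
alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df)  # ~7.815

chi2_calculated = 4.32  # from the sketch above
print(f"critical value = {chi2_crit:.3f}")
print("reject H0" if chi2_calculated > chi2_crit else "accept H0")
```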
Crash course?
Crash Course Statistics also has a video on the Chi-Squared test.
Analysis of Variance (ANoVA)
When is it used?
- Analysis of Variance is used to analyse variance (🙄)
- It compares the amount of variance within each group to the variance between groups.
What does it tell us?
- If variance within each group is high but variance between groups is low, the grouping explains little: the variation is likely caused by some external influence.
- If variance within groups is low but between groups is high, it’s likely that the property being measured is dependent on the group the sample was taken from.
How is ANoVA calculated?
- Calculate the mean of each group $(\bar{x}_g,\text{ where }g\text{ is the group number})$, as well as the mean of all the groups combined $(\bar{x})$
- Calculate the SSR (sum of squares regression) using the formula $SSR=n\left[\left(\bar{x}_1-\bar{x}\right)^2+\left(\bar{x}_2-\bar{x}\right)^2+…\right]$ where $n$ is the sample size of each group (assuming equal group sizes), and $\bar{x}_1,\bar{x}_2,…$ are the means of their respective groups
- Calculate the SSE (sum of squares error) using the formula $SSE=\sum{(x_{ij}-\bar{x}_i)^2}$ where $x_{ij}$ is value $j$ in group $i$, and $\bar{x}_i$ is the mean of group $i$
- Calculate the SST (sum of squares total) as $SSR+SSE$
- Finally, calculate the F-statistic as $F=\frac{SSR/(k-1)}{SSE/(N-k)}$, where $k$ is the number of groups and $N$ is the total number of values across all groups
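A minimal sketch of those sums of squares on three made-up, equal-sized groups:

```python
# One-way ANOVA sums of squares, computed by hand on made-up data.
import numpy as np

groups = [
    np.array([4.0, 5.0, 6.0, 5.5]),
    np.array([6.5, 7.0, 7.5, 6.0]),
    np.array([5.0, 5.5, 4.5, 5.0]),
]

n = len(groups[0])  # sample size of each group (equal sizes assumed)
k = len(groups)     # number of groups
grand_mean = np.mean(np.concatenate(groups))

# SSR: spread of the group means around the grand mean.
ssr = n * sum((np.mean(g) - grand_mean) ** 2 for g in groups)

# SSE: spread of the values within each group around that group's mean.
sse = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)

sst = ssr + sse
f_stat = (ssr / (k - 1)) / (sse / (n * k - k))

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}, F = {f_stat:.3f}")
```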
How is ANoVA interpreted?
- Conveniently, we can use the same method as we did for the F-test (even the same distribution tables)
- The significance/alpha level will be written above the table (assume $\alpha=0.05$ if the question doesn’t give one)
- The numerator’s degrees of freedom (number of groups, minus 1) is along the top
- The denominator’s degrees of freedom (number of values across all groups, minus the number of groups) is along the left side
- You’ll be given the table for any in-class test.
- If the calculated F-statistic is lower than the critical value, accept the null hypothesis. If it’s greater than the critical value, reject the null hypothesis.
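As a cross-check, scipy's one-way ANOVA reproduces the same F-statistic from the raw groups (same made-up data as the sketch above):

```python
# One-way ANOVA via scipy, plus the critical value from the F tables.
from scipy import stats

g1 = [4.0, 5.0, 6.0, 5.5]
g2 = [6.5, 7.0, 7.5, 6.0]
g3 = [5.0, 5.5, 4.5, 5.0]

f_stat, p_value = stats.f_oneway(g1, g2, g3)

alpha = 0.05
dfn = 3 - 1   # numerator: number of groups - 1
dfd = 12 - 3  # denominator: total values - number of groups
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)

print(f"F = {f_stat:.3f}, critical = {f_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if f_stat > f_crit else "accept H0")
```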