Math243 Statistics - Notes on Describing Data

Spring 2022

Chapter 2 Describing Data

Categories
Frequency, count
Proportion, relative frequency
Bar/Column Chart
Pie Chart
Notation for a Proportion
Two-way tables for two categorical variables
Multiple proportions in two-way tables
- total proportions
- row proportions
- column proportions
Difference in proportions
Comparative column charts
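As a rough illustration of the three kinds of proportions in a two-way table, here is a small Python sketch. The counts, the group names, and the "Yes"/"No" responses are hypothetical and are not taken from any course data set; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction

# Hypothetical counts for illustration only (not from the course data).
# Rows are groups; columns are responses.
table = {
    "Group A": {"Yes": 30, "No": 20},
    "Group B": {"Yes": 10, "No": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())
row_totals = {group: sum(row.values()) for group, row in table.items()}
col_totals = {}
for row in table.values():
    for response, count in row.items():
        col_totals[response] = col_totals.get(response, 0) + count

# Total proportions: each cell divided by the grand total.
total_props = {(g, r): Fraction(c, grand_total)
               for g, row in table.items() for r, c in row.items()}

# Row proportions: each cell divided by its row total.
row_props = {(g, r): Fraction(c, row_totals[g])
             for g, row in table.items() for r, c in row.items()}

# Column proportions: each cell divided by its column total.
col_props = {(g, r): Fraction(c, col_totals[r])
             for g, row in table.items() for r, c in row.items()}

# Difference in proportions: share of "Yes" in Group A minus share in Group B.
diff = row_props[("Group A", "Yes")] - row_props[("Group B", "Yes")]

print("P(Yes and A) =", total_props[("Group A", "Yes")])
print("P(Yes | A)   =", row_props[("Group A", "Yes")])
print("P(A | Yes)   =", col_props[("Group A", "Yes")])
print("Difference   =", diff)
```

The same cell count (30) gives three different proportions depending on which total it is divided by, which is why the choice of total should always be stated.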

Key Concepts

Use a dotplot or histogram to determine the shape of a distribution
Understanding and calculating mean and median
Identify the approximate locations of the mean and the median on a dotplot or histogram
Understanding the impact of skewness and outliers on the mean and median

Data set examples

For each quantitative variable:

Frequency Table
- 6 to 20 classes
- Decide on the number of classes; find the starting value and the class width
- Classes should be the same width
- There is no "best" histogram
Relative and Cumulative frequency
Temperature (°F)	Count	Relative frequency	Cumulative relative frequency
96.4 ≤ x < 96.82	1	1/25	0.04
96.82 ≤ x < 97.24	6	6/25	0.28
97.24 ≤ x < 97.66	3	3/25	0.40
97.66 ≤ x < 98.08	6	6/25	0.64
98.08 ≤ x < 98.5	6	6/25	0.88
98.5 ≤ x < 98.92	3	3/25	1.00
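The relative and cumulative columns of the table can be rebuilt from the counts alone. The sketch below assumes the starting value 96.4 and the class width 0.42 read off the class limits above; the raw temperatures are not listed in these notes, so only the counts are used.

```python
from decimal import Decimal
from fractions import Fraction
from itertools import accumulate

# Values read off the frequency table above.
start = Decimal("96.4")    # lower limit of the first class
width = Decimal("0.42")    # common class width
counts = [1, 6, 3, 6, 6, 3]

total = sum(counts)

# Relative frequency = count / total.
relative = [Fraction(c, total) for c in counts]

# Cumulative relative frequency = running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for i, (count, rel, cum) in enumerate(zip(counts, relative, cumulative)):
    lower = start + i * width
    upper = lower + width
    print(f"{lower} <= x < {upper}\t{count}\t{rel}\t{float(cum):.2f}")
```

The last cumulative value is always 1, which is a quick check that no class was missed.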

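The direction of the skew is the direction of the longer tail: a long right tail means skewed right, a long left tail means skewed left.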

`barx` = Mean = Arithmetic Mean = Average = Sum / Count
spreadsheet function =AVERAGE(data)
`m` = Median = middle when ranked
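locator = `(n+1)/2`, the position of the median in the ranked data; if this is not a whole number, average the two values on either side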
spreadsheet function =MEDIAN(data)
Mode = most frequently occurring value; useful for qualitative data, among others
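A short Python sketch of the three measures of center, using the inauguration ages of 20th century US presidents that appear later in these notes. The variable names are illustrative; the median is found with the `(n+1)/2` locator rather than a library call so the steps are visible.

```python
import math
import statistics

# Age at inauguration of 20th century US presidents (from these notes).
ages = [42, 43, 46, 47, 51, 51, 51, 52, 54, 54,
        55, 55, 56, 56, 60, 61, 62, 64, 69, 70]

n = len(ages)

# Mean = Sum / Count.
mean = sum(ages) / n

# Median = middle value when ranked, found with the locator (n+1)/2.
ranked = sorted(ages)
locator = (n + 1) / 2
below = ranked[math.floor(locator) - 1]   # value just below the locator
above = ranked[math.ceil(locator) - 1]    # value just above the locator
median = (below + above) / 2              # the two coincide when n is odd

# Mode = most frequently occurring value.
mode = statistics.mode(ages)

print("mean   =", mean)
print("median =", median)
print("mode   =", mode)
```

Here n = 20, so the locator is 10.5 and the median is the average of the 10th and 11th ranked values.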
The Law of Large Numbers and the Mean
Larger samples and/or repeated sampling tend to bring the sample mean `barx` closer to the population mean `mu`.
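The Law of Large Numbers can be seen in a small simulation. This sketch assumes a fair six-sided die as the population (so `mu` = 3.5); the die, the seed, and the sample sizes are choices made for illustration and do not come from the course materials.

```python
import random

random.seed(243)   # fixed seed so the run is repeatable (arbitrary choice)

MU = 3.5           # population mean of a fair six-sided die


def sample_mean(n):
    """Roll a fair die n times and return the sample mean."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


for n in (10, 100, 10_000, 100_000):
    xbar = sample_mean(n)
    print(f"n = {n:>7}  sample mean = {xbar:.4f}  distance from mu = {abs(xbar - MU):.4f}")
```

The distance from `mu` will not shrink on every single step, but it tends to get smaller as n grows.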
Interpreting the mean
- Context
- The average human body temperature is 97.6°F.
  OK.
- The average U.S. household has 2.53 people. (ref)
  Wait. How can you have half of a person?
  Right. OK.
  The average U.S. household has between 2 and 3 people.
  Thanks

Use technology to compute summary statistics for a quantitative variable
Recognize the uses and meaning of the standard deviation
Compute and interpret a z-score
Interpret a five number summary or percentiles
Use the range, the interquartile range, and the standard deviation as measures of spread
Describe the advantages and disadvantages of the different measures of center and spread

`s` = Standard Deviation of a Sample = `sqrt((sum (x-barx)^2)/(n-1))`
spreadsheet function =STDEV(data) or =STDEV.S(data)
`sigma` = Standard Deviation of a Population
Range = max - min
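The sample standard deviation formula can be checked step by step in Python. This sketch again uses the presidential inauguration ages listed later in these notes and compares the hand computation with `statistics.stdev`, which also divides by `n-1`.

```python
import math
import statistics

# Age at inauguration of 20th century US presidents (from these notes).
ages = [42, 43, 46, 47, 51, 51, 51, 52, 54, 54,
        55, 55, 56, 56, 60, 61, 62, 64, 69, 70]

n = len(ages)
xbar = sum(ages) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - xbar) ** 2 for x in ages)

# s = sqrt( sum (x - xbar)^2 / (n - 1) )
s = math.sqrt(squared_deviations / (n - 1))

# Range = max - min.
data_range = max(ages) - min(ages)

print("s (by formula)     =", round(s, 3))
print("s (statistics lib) =", round(statistics.stdev(ages), 3))
print("range              =", data_range)
```

For a population standard deviation `sigma`, the divisor would be `n` instead of `n-1`; `statistics.pstdev` does that.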
Examples: Mortality Rate by Country, in Sheets, Docs, StatKey, other
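Counting standard deviations and z-scores: `z = (x-barx)/s`, the number of standard deviations a value lies above or below the mean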
Can standard deviation help us find outliers? Unusual values?
We often treat the most extreme 5% of the data as unusual; for roughly bell-shaped data, these values usually lie more than 2 standard deviations from the mean.
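A z-score counts how many standard deviations a value lies from the mean. The sketch below is one way to compute z-scores and flag values more than 2 standard deviations from the mean, using the presidential inauguration ages listed later in these notes. The function name `z_score` and the cutoff of 2 are illustrative choices.

```python
import statistics

# Age at inauguration of 20th century US presidents (from these notes).
ages = [42, 43, 46, 47, 51, 51, 51, 52, 54, 54,
        55, 55, 56, 56, 60, 61, 62, 64, 69, 70]

xbar = statistics.mean(ages)
s = statistics.stdev(ages)   # sample standard deviation (divides by n-1)


def z_score(x):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - xbar) / s


# Flag values more than 2 standard deviations from the mean.
unusual = [x for x in ages if abs(z_score(x)) > 2]

for x in (min(ages), max(ages)):
    print(f"age {x}: z = {z_score(x):.2f}")
print("unusual values:", unusual)
```

In this data set even the youngest and oldest presidents are within 2 standard deviations of the mean, so nothing is flagged.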

Goals

Identify outliers in a dataset based on the IQR method
Use a boxplot to describe data for a single quantitative variable
Use a side-by-side graph to visualize a relationship between quantitative and categorical variables
Examine a relationship between quantitative and categorical variables using comparative summary statistics

Percentiles with Quantitative Data
- Percentile
- Percentile location, aka data index/position
- Percentile value
Examples
- P₁₀ is the 10th percentile, that is, the value in the data set where 10% of the data is at or below that value.
- Suppose we have a sample of size n=20. Then the position of P₁₀ is 0.10*(20+1)=2.1, which rounds to the 2nd position in the ranked data.
- Consider the Presidents ages:
  Example: Age at Inauguration of 20th Century US Presidents
  42, 43, 46, 47, 51, 51, 51, 52, 54, 54, 55, 55, 56, 56, 60, 61, 62, 64, 69, 70
  So, P₁₀=43
  According to MS Excel or Google Sheets, P₁₀ = PERCENTILE.INC(data,0.1) = 45.7
  Close, but different from my method. There are other ways to identify percentile values that will be close to but different from each of these.
  Oh, well.
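The two answers above come from two different rules for locating a percentile. The sketch below implements both: the locator method from these notes, and a linear interpolation rule which, as far as I understand, is what the spreadsheet's inclusive percentile function does. The function names are made up for this example, and how to treat a locator ending in exactly .5 is an assumption, since the notes do not cover that case.

```python
import math

# Age at inauguration of 20th century US presidents (from these notes).
ages = [42, 43, 46, 47, 51, 51, 51, 52, 54, 54,
        55, 55, 56, 56, 60, 61, 62, 64, 69, 70]


def percentile_locator(data, p):
    """Notes' method: position p*(n+1), rounded to the nearest whole position.

    Assumption: Python's round() is used, which sends exact halves to the
    nearest even integer.
    """
    ranked = sorted(data)
    position = round(p * (len(ranked) + 1))
    position = min(max(position, 1), len(ranked))   # stay inside the data
    return ranked[position - 1]                     # positions start at 1


def percentile_interpolated(data, p):
    """Linear interpolation between ranked values at 0-based index p*(n-1)."""
    ranked = sorted(data)
    index = p * (len(ranked) - 1)
    lower = math.floor(index)
    upper = min(lower + 1, len(ranked) - 1)
    fraction = index - lower
    return ranked[lower] + fraction * (ranked[upper] - ranked[lower])


print("locator method:      ", percentile_locator(ages, 0.10))
print("interpolation method:", round(percentile_interpolated(ages, 0.10), 2))
```

With the interpolation rule the position falls nine tenths of the way from the 2nd value (43) to the 3rd value (46), which gives the 45.7 reported by the spreadsheet.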
Quartile
- percentiles
- locators
- 5 different quartiles
- Quartile vs Quarter

Interpretation

5 number summary and the Box Plot
Extended Interquartile Range
aka Outlier Fences, lower and upper
IQR = `Q_3-Q_1`
Lower Fence = `Q_1-1.5*IQR`
Upper Fence = `Q_3+1.5*IQR`
These values fence in the "usual" data; any value outside the fences is flagged as an outlier, aka "unusual" data.
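The five number summary and the fences can be computed together. This sketch uses the presidential inauguration ages from earlier in these notes and `statistics.quantiles` with its default method, which interpolates at positions based on `n+1`. Other tools use slightly different quartile rules, so their Q1 and Q3 may differ a little from the values here.

```python
import statistics

# Age at inauguration of 20th century US presidents (from these notes).
ages = [42, 43, 46, 47, 51, 51, 51, 52, 54, 54,
        55, 55, 56, 56, 60, 61, 62, 64, 69, 70]

# Quartiles: cut points that split the ranked data into four quarters.
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Five number summary: min, Q1, median, Q3, max.
five_number = (min(ages), q1, q2, q3, max(ages))

# Interquartile range and outlier fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

print("five number summary:", five_number)
print("IQR:", iqr)
print("fences:", lower_fence, "to", upper_fence)
print("outliers:", outliers)
```

All of the ages fall inside the fences, so a boxplot of this data would show no outlier points.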