Part 1. DATA PRESENTATION: DESCRIPTIVE DATA ANALYSIS

22S:101 Biostatistics: J. Huang

Numerical Data
Data Presentation I: Tables
Data Presentation II: Graphs

1. Types of numerical data

1. Nominal data
2. Ordinal data
3. Ranked data
4. Discrete data
5. Continuous data

1.1 Nominal data: the values of the data fall into unordered categories.

Example: Outcomes indicating whether an individual had Kaposi's sarcoma for the first 2560 AIDS patients reported to the Centers for Disease Control in Atlanta, Georgia (page 8).

1.2 Ordinal data: if the order of the categories is important, the data are called ordinal data. Notice that the magnitude of the codes is not important; only their order matters.

Example 1: Eastern Cooperative Oncology Group's classification of patient performance status (page 9).

Example 2: Disease status
0: normal
1: mild
2: moderate
3: severe

1.3 Ranked data: the data are ranked from highest to lowest according to magnitude.

Example: the leading causes of death in the United States, [...]

Rank  Cause of Death                                 Total Deaths
1     Disease of Heart                               765,156
2     Malignant neoplasms                            485,048
3     Cerebrovascular disease                        150,517
4     Accidents and adverse effects                  97,100
5     Chronic obstructive pulmonary disease          82,853
6     Pneumonia and influenza                        77,662
7     Diabetes mellitus                              40,368
8     Suicide                                        30,407
9     Chronic liver disease and cirrhosis            26,[...]
10    Nephritis, nephrotic syndrome, and nephrosis   22,392

Example: the leading causes of death in the United States, 1992 (page 10).

Rank  Cause of Death                                 Total Deaths
1     Disease of Heart                               717,706
2     Malignant neoplasms                            520,578
3     Cerebrovascular disease                        143,769
4     Chronic obstructive pulmonary disease          91,938
5     Accidents and adverse effects                  86,777
6     Pneumonia and influenza                        75,719
7     Diabetes mellitus                              50,067
8     Human immunodeficiency virus infection         33,566
9     Suicide                                        30,[...]
10    Homicide and legal intervention                25,488
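As a minimal sketch of how ranked data arise, the Python snippet below orders the 1992 causes of death by their totals and assigns ranks. Only the eight rows whose totals are fully legible in the table are used, and `rank_by_magnitude` is a hypothetical helper name, not something from the course materials.

```python
# Total deaths by cause, United States, 1992. Only rows whose totals are
# fully legible in the table are included, deliberately listed out of order.
deaths_1992 = {
    "Diabetes mellitus": 50_067,
    "Disease of Heart": 717_706,
    "Pneumonia and influenza": 75_719,
    "Malignant neoplasms": 520_578,
    "Human immunodeficiency virus infection": 33_566,
    "Cerebrovascular disease": 143_769,
    "Accidents and adverse effects": 86_777,
    "Chronic obstructive pulmonary disease": 91_938,
}


def rank_by_magnitude(totals):
    """Return (rank, label, total) tuples with rank 1 for the largest total."""
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(rank, label, total)
            for rank, (label, total) in enumerate(ordered, start=1)]


ranked = rank_by_magnitude(deaths_1992)
for rank, cause, total in ranked:
    print(f"{rank:>2}  {cause:<40} {total:>9,}")
```

Once only the ranks are kept, the sizes of the gaps are lost: ranks 1 and 2 differ by far more deaths than ranks 5 and 6 do, yet both pairs are one rank apart.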

1.4 Discrete data: both ordering and magnitude are important, and the values are restricted to isolated numbers, typically integers or counts.

Examples:
The number of students in each statistics class at UI.
The number of winter days with temperature below 0 in the last 100 years.
The number of exons in each human gene.
The number of years of survival of AIDS patients.

1.5 Continuous data: such data can in principle take any value within a certain range.

Examples:
The height and weight of an individual in Iowa City
Temperature
Time
The expression value of a gene (or EST) in a cDNA microarray experiment
Blood pressure
Cholesterol level

2. Data Presentation I: Tables

Frequency tables: distributions
Frequency and relative frequency

There are many types of tables, but the frequency table is probably the most basic and important one.

Example (Exercise 2.16): A frequency distribution for the serum zinc levels of 462 males between the ages of 15 and 17 is displayed below. The data are in the file serzinc. The 462 serum zinc measurements, recorded in micrograms per deciliter, are saved under the variable name zinc.

United States: Males Ages 15-17

Serum Zinc Level (microgram/dl)   Number of Males

Table of relative frequencies

United States: Males Ages 15-17

Serum Zinc Level (microgram/dl)   Number of Males   Relative Frequency (%)
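The step from a frequency table to a table of relative frequencies is a division of each count by the total. The sketch below illustrates it in Python; the interval labels and counts are hypothetical placeholders chosen for easy arithmetic, not the serum zinc data, and `relative_frequencies` is an illustrative helper name.

```python
def relative_frequencies(counts):
    """Convert absolute frequencies to relative frequencies in percent."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    return {interval: 100.0 * n / total for interval, n in counts.items()}


# HYPOTHETICAL counts for illustration only; these are NOT the serum zinc data.
hypothetical_counts = {"50-59": 10, "60-69": 40, "70-79": 100, "80-89": 50}

rel_freq = relative_frequencies(hypothetical_counts)
for interval, pct in rel_freq.items():
    print(f"{interval:<8} {hypothetical_counts[interval]:>5} {pct:>6.1f}%")
```

Relative frequencies make tables built from samples of different sizes directly comparable, which absolute counts do not.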

Table 2.4 (page 12) Cigarette consumption per person 18 years of age or older, United States, [...]

Year   Number of Cigarettes

3. Data Presentation II: Graphs

Histogram
Boxplot
Scatterplot
Cumulative frequency plot
Line graph: Time series plot

Data Summary

Numerical Summary Measures (Chapter 3)
Measures of central tendency: mean, median, mode
Measures of dispersion: standard deviation, interquartile range, range

Mean or sample mean: the arithmetic average of the observations. It only makes sense for discrete or continuous data.

Table 3.1 Forced expiratory volumes in 1 second for 13 adolescents suffering from asthma

Subject   FEV1 (liters)   Gender

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_{13}}{13} = 2.95 \]

Caution: Outliers tend to have large effects on the sample mean.

Table 3.1 Forced expiratory volumes in 1 second for 13 adolescents suffering from asthma

Subject   FEV1 (liters)

\[ \bar{x} = \frac{[\ldots]}{13} = 5.73 \]
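The effect of an outlier on the sample mean can be sketched in Python. Two assumptions are needed because the table values did not survive: eleven FEV1 values are taken from the ordered list on the median slide, and the two missing ones are assumed to be 2.82 and 2.85, which lie in the gap between 2.75 and 3.00 and reproduce the stated mean of 2.95. The outlier is likewise an assumption: recording 4.02 as 40.2 reproduces the second mean of 5.73.

```python
from statistics import mean

# Eleven values come from the ordered list on the median slide.
# ASSUMPTION: the two values lost in transcription are 2.82 and 2.85.
fev1 = [2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85,
        3.00, 3.38, 3.50, 4.02, 4.05]

x_bar = mean(fev1)

# ASSUMPTION: the second table differs by one mis-recorded value (4.02 -> 40.2).
fev1_with_outlier = [40.2 if x == 4.02 else x for x in fev1]
x_bar_outlier = mean(fev1_with_outlier)

print(f"mean without outlier: {x_bar:.2f}")
print(f"mean with outlier:    {x_bar_outlier:.2f}")
```

Under these assumptions a single misplaced decimal point nearly doubles the mean, even though twelve of the thirteen observations are unchanged.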

Median or sample median: the middle value when the observations are ordered from smallest to largest.

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, [...], 3.00, 3.38, 3.50, 4.02, 4.05

The median is resistant to outliers.
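A short Python sketch of the median and its resistance to outliers follows. The data rest on an assumption: the two values missing from the ordered list are taken to be 2.82 and 2.85, so the printed median depends on that assumption. The outlier (4.02 recorded as 40.2) is also hypothetical, and `sample_median` is an illustrative helper name.

```python
def sample_median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# ASSUMPTION: 2.82 and 2.85 stand in for the two values lost in transcription.
fev1 = [2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85,
        3.00, 3.38, 3.50, 4.02, 4.05]

# HYPOTHETICAL outlier: the value 4.02 mis-recorded as 40.2.
fev1_with_outlier = [40.2 if x == 4.02 else x for x in fev1]

print(sample_median(fev1), sample_median(fev1_with_outlier))
```

The outlier only changes which value is largest, so the seventh ordered value, and with it the median, stays the same.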

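Measures of Dispersion

Range: the difference between the largest observation and the smallest observation.

Interquartile range: 75th percentile - 25th percentile. For the FEV1 data in Table 3.1, the interquartile range = 3.38 - 2.60 = 0.78 liters.

Determining percentiles: 13(0.25) = 3.25 and 13(0.75) = 9.75. In general, choose the smallest integer that is greater than or equal to np, where n is the sample size and p is the desired percentile expressed as a proportion. This integer gives the location of the desired percentile among the ordered observations. Here the 25th percentile is the 4th ordered value, 2.60, and the 75th percentile is the 10th ordered value, 3.38.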
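The location rule can be sketched in Python, reading it as "take the observation at position k, where k is the smallest integer with k >= np". The helper name `percentile_by_location` is hypothetical, and the percentile is passed as an integer percent so the ceiling can be computed in exact integer arithmetic. The two values assumed for the gap in the ordered list (2.82 and 2.85) do not affect the result, since only the 4th and 10th ordered values are used.

```python
def percentile_by_location(values, percent):
    """Return the k-th ordered value, k = smallest integer >= n * percent / 100.

    `percent` is an integer from 1 to 100.
    """
    if not values:
        raise ValueError("at least one observation is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Integer ceiling of n * percent / 100, avoiding floating-point rounding.
    k = -(-n * percent // 100)
    return ordered[k - 1]


# ASSUMPTION: 2.82 and 2.85 stand in for the two values lost in transcription.
fev1 = [2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85,
        3.00, 3.38, 3.50, 4.02, 4.05]

q1 = percentile_by_location(fev1, 25)
q3 = percentile_by_location(fev1, 75)
iqr = q3 - q1
data_range = max(fev1) - min(fev1)

print(f"25th percentile = {q1:.2f}, 75th percentile = {q3:.2f}")
print(f"IQR = {iqr:.2f} liters, range = {data_range:.2f} liters")
```

Statistical packages offer several other percentile conventions, many of which interpolate between neighboring observations, so their quartiles may differ slightly from this rule.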

Variance: approximately the average of the squared deviations from the mean (the sum is divided by n - 1 rather than n).

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Worksheet columns: X, X-mean(X), (X-mean(X))^2, with one row per observation ([1,] to [13,]) and a Total row.

For the FEV1 measurements in Table 3.1, the variance is s^2 = 0.39 liter^2.

Standard deviation: the square root of the variance.

\[ s = \sqrt{s^2} = \sqrt{0.39 \text{ liter}^2} = 0.62 \text{ liters} \]
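The variance worksheet (deviation, squared deviation, total, divide by n - 1) can be sketched in Python as follows. The data again assume that the two values lost from the ordered list are 2.82 and 2.85; `sample_variance` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# ASSUMPTION: 2.82 and 2.85 stand in for the two values lost in transcription.
fev1 = [2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85,
        3.00, 3.38, 3.50, 4.02, 4.05]

s2 = sample_variance(fev1)
s = sqrt(s2)
print(f"s^2 = {s2:.2f} liter^2, s = {s:.2f} liters")
```

The standard deviation is in the original units (liters), which is why it is usually easier to interpret than the variance.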

Grouped Data

Table 3.3 Duration of transfusion therapy for ten patients with sickle cell disease

Subject   Duration (years)

\[ \bar{x} = 8.6 \text{ years} \]

Table 3.3 (Grouped) Duration of transfusion therapy for ten patients with sickle cell disease

Number of Subjects   Duration (years)

Table 3.4 Absolute frequencies of serum cholesterol levels for 1067 U.S. males, aged 25 to 34 years, [...]

Cholesterol level (mg/100ml)   Number of Men   Midpoint

\[ \bar{x} = [\ldots] \text{ mg/100ml} \]
\[ s^2 = 1{,}[\ldots] \text{ (mg/100ml)}^2 \]
\[ s = \sqrt{s^2} = [\ldots] \text{ mg/100ml} \]

Calculation of Grouped Mean

Cholesterol level (mg/100ml)   # of Men   Midpoint   (# of Men) times Midpoint
Total
Mean

Calculation of Grouped Variance

Cholesterol level (mg/100ml)   # of Men   Midpoint   (# of Men) times (Midpoint-Mean)^2
Total
Var
sd

k: the number of intervals in the table.
m_i: the midpoint of the ith interval.
f_i: the frequency associated with the ith interval.

\[ \bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} \]

\[ s^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\left[\sum_{i=1}^{k} f_i\right] - 1} \]
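The grouped formulas translate directly into code. The sketch below is a minimal Python version; the midpoints and frequencies are hypothetical numbers chosen for easy checking, not the cholesterol data, and `grouped_mean_and_variance` is an illustrative helper name.

```python
def grouped_mean_and_variance(midpoints, frequencies):
    """Grouped mean and variance from interval midpoints m_i and frequencies f_i."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Grouped mean: sum of m_i * f_i divided by the total frequency.
    x_bar = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    # Grouped variance: sum of (m_i - mean)^2 * f_i divided by (total frequency - 1).
    s2 = sum((m - x_bar) ** 2 * f for m, f in zip(midpoints, frequencies)) / (n - 1)
    return x_bar, s2


# HYPOTHETICAL table for illustration only; these are NOT the cholesterol data.
midpoints = [100, 140, 180]
frequencies = [2, 5, 3]

x_bar, s2 = grouped_mean_and_variance(midpoints, frequencies)
print(f"grouped mean = {x_bar:.1f}, grouped variance = {s2:.1f}, sd = {s2 ** 0.5:.1f}")
```

The grouped formulas treat every observation in an interval as if it sat at the midpoint, so the results only approximate the mean and variance of the raw data.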