Exploratory data analysis

INTRODUCTION:

Statistics is the science of collecting, summarising, presenting and interpreting data, and using them to estimate the magnitude of associations and test hypotheses.

B Kirkwood and J Sterne (Essential Medical Statistics)

Process for exploratory data analysis

Define types of data (variables)

A sample consists of observations or measurements. Any aspect of an individual that is measured or recorded is called a variable. Examples include age, gender, diagnosis, serum amylase and CD4 count.

It is often useful to define the types of variables, as different statistical methods are applicable to each.

There are two broad categories of variables:

CATEGORICAL VARIABLES:

  1. Binary: Allocation of observations to one of only two possible categories. For example, exposed and non-exposed categories.
  2. Nominal: Allocation of observations into more than two categories that have no natural order. For example: classification of disease; marital status.
  3. Ordinal: Allocation of observations into more than two categories that can be ordered. For example: classification according to mild, moderate and severe.

Frequency distributions:

Data can be presented in various forms depending on the type of data collected.

A frequency distribution is a table showing how often each value (or set of values) of the variable in question occurs in a data set.

A frequency table is used to summarise categorical or numerical data.

Frequencies can also be presented as relative frequencies, that is, as percentages of the total number of observations in the sample.

1. Frequency table (categorical data)

To summarise categorical data, count the number of observations in each category. These counts are called frequencies. In the following examples tabulations were produced in STATA using the dataset “famdata.dta”.

Example: A one-way frequency table:

Gender    Frequency    Percent
Female          118      48.96
Male            123      51.04
Total           241     100.00

STATA command: tab gender
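The percentages in the one-way table can be checked by hand from the counts. The following is a minimal Python sketch, not the Stata procedure itself: the counts are copied from the table above, and the function name one_way_table is a hypothetical helper introduced only for illustration.

```python
# Counts copied from the one-way table above (gender, 241 observations).
counts = {"Female": 118, "Male": 123}

def one_way_table(counts):
    """Return (category, frequency, percent) rows followed by a total row."""
    total = sum(counts.values())
    rows = [(cat, n, round(100 * n / total, 2)) for cat, n in counts.items()]
    rows.append(("Total", total, 100.0))
    return rows

rows = one_way_table(counts)
for cat, n, pct in rows:
    # Left-align the category, right-align the numbers.
    print(f"{cat:<8}{n:>6}{pct:>9.2f}")
```

Each percentage is the category count divided by the sample total and multiplied by 100, which is the relative frequency described earlier.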

Example: A two-way frequency table, also referred to as a 2x2 cross-tabulation or contingency table:

A frequency table with two categorical variables is called a contingency table because the figures found in the rows are contingent upon (dependent upon) those found in the columns.

Smoke    Female         Male           Total
No       56 (47.46%)    36 (29.27%)       92
Yes      62 (52.54%)    87 (70.73%)      149
Total    118 (100%)     123 (100%)       241

STATA command: tab smoke gender, col
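The column percentages in the two-way table are each cell count divided by its column total. A short Python sketch of that arithmetic follows; it assumes the cell counts printed above, and column_percentages is a hypothetical helper name, not part of Stata or of any library.

```python
# Cell counts copied from the two-way table above:
# rows are smoking status, columns are gender.
table = {
    "No": {"Female": 56, "Male": 36},
    "Yes": {"Female": 62, "Male": 87},
}

def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: round(100 * row[c] / col_totals[c], 2) for c in columns}
        for r, row in table.items()
    }

col_pct = column_percentages(table)
for r, row in table.items():
    cells = "  ".join(f"{row[c]:>3} ({col_pct[r][c]:5.2f}%)" for c in row)
    # The last column is the row total.
    print(f"{r:<5}{cells}{sum(row.values()):>6}")
```

Because the percentages are taken within each column, they sum to 100% down each gender column rather than across each row.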

2. Frequency distributions (numerical data)

This is a table showing the number of observations at different values or within certain ranges.

For a discrete variable the frequencies may be tabulated either for each value of the variable or for groups of values. With continuous variables, groups have to be formed.

The cumulative percentage for a value is the percentage of observations that are less than or equal to that value.

Example: Frequency distribution of household size (discrete variable).

Household size    Frequency    Percent    Cumulative percent
1                         6       2.5%                  2.5%
2                        37      15.4%                 17.9%
3                       101      41.9%                 59.8% *
4                        61      25.3%                 85.1%
5                        25      10.4%                 95.5%
6                        11       4.6%                  100%

* Approximately 60% of the sample have fewer than 4 household members

STATA command: tab household
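The cumulative percentage column can be rebuilt from the frequencies by keeping a running total. The sketch below is an illustration in Python rather than the Stata output; the frequencies are copied from the household size table above, and cumulative_percent is a hypothetical helper name.

```python
from itertools import accumulate

# Frequencies copied from the household size table above.
household = {1: 6, 2: 37, 3: 101, 4: 61, 5: 25, 6: 11}

def cumulative_percent(freq):
    """Map each value to the percentage of observations <= that value."""
    values = sorted(freq)
    total = sum(freq.values())
    running = accumulate(freq[v] for v in values)  # running count
    return {v: round(100 * c / total, 1) for v, c in zip(values, running)}

cum = cumulative_percent(household)
for value, pct in cum.items():
    print(f"{value:>2}{household[value]:>6}{pct:>8.1f}")
```

Computed directly from the counts, household sizes 2 and 5 come out at 17.8 and 95.4. The table shows 17.9 and 95.5, which is what adding the already rounded percentages gives. The difference is rounding only.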

Example: Frequency distribution of age in years (continuous variable).

Age group    Frequency    Percent    Cumulative percent
15-19               44      18.26                 18.26
20-29               60      24.90                 43.15 *
30-39               74      30.71                 73.86
40-49               51      21.16                 95.02
50-59               12       4.98                100.00
Total              241     100.00

* 43.2% of the farm workers are below 30 years of age

STATA command: tab agegroup
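Before a continuous variable such as age can be tabulated, each observation has to be assigned to a group. The following Python sketch shows one way this could be done. It assumes the age bands used in the table above and that age is counted in completed years; the list of ages is hypothetical and is not taken from famdata.dta, and age_group is a helper name introduced for illustration.

```python
from collections import Counter

# Age bands from the table above, as inclusive (lower, upper) bounds in years.
BANDS = [(15, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

def age_group(age):
    """Return the band label for an age, or None if it is outside all bands."""
    years = int(age)  # assumption: age is counted in completed years
    for low, high in BANDS:
        if low <= years <= high:
            return f"{low}-{high}"
    return None

# Hypothetical ages for illustration only; not taken from the dataset.
ages = [17, 22, 25, 31, 34, 38, 41, 45, 52, 19]

group_counts = Counter(age_group(a) for a in ages)
for low, high in BANDS:
    label = f"{low}-{high}"
    print(label, group_counts[label])
```

Once every observation has a group label, the grouped variable is tabulated in the same way as a categorical variable.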