CHOOSING STUDY SUBJECTS

Prof. Rodney Ehrlich
Senior Lecturer, Infectious Diseases Epidemiology Unit
School of Public Health and Family Medicine
UCT Faculty of Health Sciences

LEARNING OUTCOMES

At the end of this module, you will be able to:
  1. define the population to which you seek to generalise your results;
  2. identify your sampling frame (or recruitment pool);
  3. choose a sampling strategy (or randomisation strategy);
  4. understand the distinction between random sampling error and systematic error (bias) and how to minimise these.

Note: You may view the PowerPoint presentation used at the lecture.

Population:

The term "population" can be considered as a statistical term, being any coherent group to which you wish to generalise your results. Typically, members of this group will share some characteristic(s) of common interest, such as geography (for example, Cape Town), or some health-relevant characteristic (such as “Tik” users). It may be defined by health services, recognising a fundamental distinction between “patient” and “population” based studies.

What are your patients representative of?

You have to compare your patients with the population at large, that is, with respect to population-level characteristics. Further, your patients have to be compared with patients with the same condition or diagnosis who are not hospitalised (or who do not attend that health service).

Selecting the population you want to study:

In order to select the required population, both inclusion and exclusion criteria have to be set:

Demographic criteria: notably, age, gender, race, socioeconomic status;
Clinical criteria: for example, the risk factor or outcome of interest, clinical stability, absence of aggravating factors, contraindications, to mention but a few;
Administrative criteria: availability as opposed to difficulty;
Ethical criteria: for example, the ability to consent, access to records.

There is always a trade-off between efficiency and generalisability.

It is well to bear in mind some finer points of terminology regarding prevalence and incidence in hospital patients. We have to use population measures, not health service measures. Thus,

“20% of patients attending Dermatology OPD have eczema”. This is a proportion, and not a prevalence.
“10% of admissions to ICU were for acute MI”. This is a proportion, not an incidence.

Why do we need to sample?

In the first place, one simply cannot afford to study the whole population. In any case, there is no need to study the whole population, as sampling is an efficient (meaning lower-cost) way to get essentially the same information.

Which sampling strategy should one adopt?

The sampling strategy that one adopts depends on various factors. These are:

Different types of sampling:

Two main types of sampling are in use: convenience (or availability) sampling and probability sampling. Convenience sampling tends to be haphazard: there is no logic underpinning it other than the availability of subjects. These may be volunteers and/or consecutive subjects (individuals in records in a filing system or patients arriving for treatment).

Probability sampling is important when one wants truly representative estimates for a population, as is mainly the case in descriptive studies or surveys. It is the basis of statistical procedures. Four such types of sampling are generally recognised:

Simple random sampling:

Stratified random sampling:

Systematic random sampling:

Cluster random sampling:

According to Last, cluster sampling is a sampling method in which each unit selected is a group of persons rather than an individual. (Cluster sampling is important in public health research).
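The four strategies can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed procedure: the sampling frame, the "area" field, the sample sizes and all function names are hypothetical and chosen for the example.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n units so that every unit has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start, then take every k-th unit from the ordered frame."""
    k = len(frame) // n           # sampling interval
    start = rng.randrange(k)      # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Divide the frame into strata, then draw a simple random sample in each."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and include everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(1)  # fixed seed so the example is reproducible

# Hypothetical sampling frame: 1000 people in five areas of 200 people each.
frame = [{"id": i, "area": "ABCDE"[i // 200]} for i in range(1000)]

# Group the frame by area so that areas can serve as clusters.
clusters = {}
for person in frame:
    clusters.setdefault(person["area"], []).append(person)

srs = simple_random_sample(frame, 50, rng)
sys_s = systematic_sample(frame, 50, rng)
strat = stratified_sample(frame, lambda p: p["area"], 10, rng)
clus = cluster_sample(clusters, 2, rng)
```

Note that the first three functions select individuals, whereas the unit of selection in `cluster_sample` is the group, in line with Last's definition.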

Sampling error:

Random sampling error is an inevitable consequence of sampling. This is where statistical techniques come in: they are used to estimate the size of the sampling error and the sample size required to minimise its effect. (Error does not mean “mistake” here.) One distinguishes between two types of sampling error:

Sampling bias ("systematic error") is not inevitable and has to be minimised. Examples are volunteer bias, non-response bias, clinic bias, etc. - any situation where the sample is different in some important way from the rest of the population so that it cannot be representative no matter how big the sample size.
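The difference between the two kinds of error can be illustrated with a small simulation. The sketch below rests on made-up assumptions: a hypothetical population of 10 000 with a true prevalence of 20%, and volunteers with the condition being twice as likely to come forward as those without it. It is meant only to show the pattern, not to model any real study.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the example is reproducible

# Hypothetical population of 10 000 people with a true prevalence of 20%.
population = [1] * 2_000 + [0] * 8_000

# Assumption for illustration only: people with the condition are twice
# as likely to volunteer as people without it.
volunteer_weights = [2 if has_condition else 1 for has_condition in population]

def random_estimate(n):
    """Prevalence estimate from a simple random sample of size n."""
    return sum(rng.sample(population, n)) / n

def volunteer_estimate(n):
    """Prevalence estimate from n volunteers (drawn with replacement for simplicity)."""
    return sum(rng.choices(population, weights=volunteer_weights, k=n)) / n

results = {}
for n in (100, 2500):
    # Repeat each study 200 times to see how much the estimates vary.
    random_runs = [random_estimate(n) for _ in range(200)]
    volunteer_runs = [volunteer_estimate(n) for _ in range(200)]
    results[n] = {
        "random_mean": statistics.mean(random_runs),
        "random_sd": statistics.stdev(random_runs),
        "volunteer_mean": statistics.mean(volunteer_runs),
        "volunteer_sd": statistics.stdev(volunteer_runs),
    }

for n, r in results.items():
    print(n, {key: round(value, 3) for key, value in r.items()})
```

Under these assumptions the random-sample estimates scatter around the true 20%, and the scatter narrows as the sample grows. The volunteer estimates centre on about one third (2 × 0.2 / (2 × 0.2 + 0.8)) at both sample sizes: a larger sample makes the biased estimate more precise, but no less wrong.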

Random sampling error:

The random sampling error has the following attributes:

Systematic sampling error / bias:

In contrast to random sampling errors, systematic sampling errors are characterised as follows:

Systematic errors are of two types: selection bias, which arises before data collection, and measurement bias, which arises during data collection.

Generalisation:

This means that you are applying your conclusions to populations other than the one you have studied.

Some common confusions:

1. "Random sampling" vs. "randomisation".

Random sampling is the selection of a sample from a population in such a way that at each point of selection, every member of the population has a known chance of being chosen.

Randomisation is not a form of sampling. It is the assignment of subjects in a trial to one of the arms (groups) of the trial in such a way that at the point of assignment each subject has a known chance of being assigned to one arm or the other.
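The distinction can be made concrete with a short sketch. The recruitment pool, the sample size and the arm names below are hypothetical, and simple (unrestricted) randomisation with equal allocation is assumed purely for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the example is reproducible

# Hypothetical recruitment pool of 200 eligible patients.
pool = [f"patient_{i:03d}" for i in range(200)]

# Random sampling decides WHO enters the study:
# every member of the pool has the same known chance (40 in 200) of selection.
participants = rng.sample(pool, 40)

# Randomisation decides WHICH ARM each enrolled subject is assigned to:
# here each subject has a 1-in-2 chance of either arm at the point of assignment.
arms = {subject: rng.choice(["treatment", "control"]) for subject in participants}

treatment_group = [s for s, arm in arms.items() if arm == "treatment"]
control_group = [s for s, arm in arms.items() if arm == "control"]
```

The two steps are independent of each other: randomisation could equally be applied to a convenience sample of volunteers, in which case the arms remain comparable with each other even though the subjects are not a random sample of any population.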

2. "Cluster sampling" vs. "stratified sampling"

Stratified random sampling is a way of pre-dividing the population into groups (strata) and then doing simple random sampling within each group. The objective is to ensure that the sample is representative across the strata (e.g. on socioeconomic status). For this purpose, each stratum should be internally homogeneous but different from other strata.

Cluster random sampling is done where there are convenient pre-existing units such as schools, clinics, villages, blocks of houses, and so on, which can be sampled randomly rather than sampling individuals. One can then choose everyone in that cluster or do a simple random sample within the cluster. It is done for efficiency and convenience.

The objective is to try to choose clusters that are internally heterogeneous but (across clusters) similar to each other. This is to avoid having too many subjects that are alike in the sample. These “clusters” of individuals reduce the independence of subjects from each other that is one of the desirable features of simple random sampling. This loss of statistical independence in moving from simple random sampling of individuals to cluster sampling, for a given sample size, is called the “cluster effect”. It has to be adjusted for in choosing your sample size and in your statistical analysis.
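These notes do not give a formula for the adjustment. One widely used approximation, usually called the design effect, is 1 + (m - 1) × ICC, where m is the number of subjects per cluster (assumed equal across clusters) and ICC is the intracluster correlation coefficient. The sketch below applies it to made-up numbers; the function names and input values are hypothetical.

```python
import math

def design_effect(cluster_size, icc):
    """Approximate variance inflation from cluster sampling: 1 + (m - 1) * ICC.

    Assumes equal cluster sizes; icc is the intracluster correlation coefficient.
    """
    return 1 + (cluster_size - 1) * icc

def inflated_sample_size(n_srs, cluster_size, icc):
    """Sample size needed under cluster sampling, given the size n_srs
    that simple random sampling would have required."""
    # Rounding first guards against floating-point noise before taking the ceiling.
    return math.ceil(round(n_srs * design_effect(cluster_size, icc), 9))

# Hypothetical inputs: 20 subjects per cluster, ICC of 0.05,
# and 400 subjects required under simple random sampling.
deff = design_effect(cluster_size=20, icc=0.05)
n_needed = inflated_sample_size(n_srs=400, cluster_size=20, icc=0.05)
print(round(deff, 2), n_needed)
```

With these inputs the design effect is 1.95, so roughly twice as many subjects (780 rather than 400) would be needed for the same precision. When subjects within a cluster are no more alike than subjects in general (ICC of 0), the design effect is 1 and no inflation is needed.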

(One can also stratify clusters - you are then introducing stratified cluster sampling with its particular objective.)

References:

  1. Hulley: pp. 25-35 (Choosing the study subjects: specification, sampling, and recruitment).