Epidemiology involves measuring the occurrence of disease and quantifying associations between diseases and exposures.
Measures of Disease Occurrence
Disease occurrence can be measured by frequencies (counts) but is better described by rates, which are composed of three elements: the number of people affected (numerator), the number of people in the source or base population (i.e., the population at risk) from which the affected persons come, and the time period covered. The denominator of the rate is the total person-time experienced by the source population. Rates allow more informative comparisons between populations of different sizes than counts alone. Risk, the probability of an individual developing disease within a specified time period, is a proportion, ranging from 0 to 1, and is not a rate per se. Attack rate, the proportion of people in a population who are affected within a specified time period, is technically a measure of risk, not a rate.
Disease-specific morbidity includes incidence, which refers to the number of persons who are newly diagnosed with the disease of interest. Prevalence refers to the number of existing cases. Mortality refers to the number of persons who die.
Incidence is defined as the number of newly diagnosed cases within a specified time period, whereas the incidence rate is this number divided by the total person-time experienced by the source population (table 1). For cancer, rates are usually expressed as annual rates per 100,000 people. Rates for other more common diseases may be expressed per a smaller number of people. For example, birth defect rates are usually expressed per 1,000 live births. Cumulative incidence, the proportion of people who become cases within a specified time period, is a measure of average risk for a population.
Table 1. Measures of disease occurrence: Hypothetical population observed for a five-year period

Newly diagnosed cases                  10
Previously diagnosed living cases      12
Deaths, all causes*                    5
Deaths, disease of interest            3
Persons in population                  100
Years observed                         5
Incidence                              10 persons
Annual incidence rate                  10 / (100 × 5) = 0.02 per person-year (2,000 per 100,000 per year)
Point prevalence (at end of year 5)    (10 + 12 - 3) = 19 persons
Period prevalence (five-year period)   (10 + 12) = 22 persons
Annual death rate                      5 / (100 × 5) = 0.01 per person-year (1,000 per 100,000 per year)
Annual mortality rate                  3 / (100 × 5) = 0.006 per person-year (600 per 100,000 per year)

*To simplify the calculations, this example assumes that all deaths occurred at the end of the five-year period so that all 100 persons in the population were alive for the full five years.
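The arithmetic behind table 1 can be sketched in a few lines of Python. This is only an illustration: the variable names and the per_100k helper are hypothetical, and the person-time denominator relies on the table's simplifying assumption that all 100 persons were observed for the full five years.

```python
# Counts taken from table 1 (hypothetical population).
new_cases = 10            # newly diagnosed during the five years
previous_cases = 12       # previously diagnosed, living cases
deaths_all_causes = 5
deaths_from_disease = 3
population = 100
years_observed = 5

# Simplifying assumption from the table footnote: every person
# contributes the full five years of person-time.
person_years = population * years_observed


def per_100k(rate):
    """Express a rate per person-year as a rate per 100,000 person-years."""
    return rate * 100_000


annual_incidence_rate = new_cases / person_years
annual_death_rate = deaths_all_causes / person_years        # all causes
annual_mortality_rate = deaths_from_disease / person_years  # disease of interest

# Prevalence is a count of existing cases, following the table's arithmetic.
point_prevalence = new_cases + previous_cases - deaths_from_disease
period_prevalence = new_cases + previous_cases

print(f"Annual incidence rate: {per_100k(annual_incidence_rate):.0f} per 100,000")
print(f"Annual death rate:     {per_100k(annual_death_rate):.0f} per 100,000")
print(f"Annual mortality rate: {per_100k(annual_mortality_rate):.0f} per 100,000")
print(f"Point prevalence: {point_prevalence}, period prevalence: {period_prevalence}")
```

In real data the person-time denominator would be summed person by person, since people die, leave or enter the population at different times.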
Prevalence includes point prevalence, the number of cases of disease at a point in time, and period prevalence, the total number of cases of a disease known to have existed at some time during a specified period.
Mortality, which concerns deaths rather than newly diagnosed cases of disease, reflects factors that cause disease as well as factors related to the quality of medical care, such as screening, access to medical care, and availability of effective treatments. Consequently, hypothesis-generating efforts and aetiological research may be more informative and easier to interpret when based on incidence rather than on mortality data. However, mortality data are often more readily available on large populations than incidence data.
The term death rate is generally accepted to mean the rate for deaths from all causes combined, whereas mortality rate is the rate of death from one specific cause. For a given disease, the case-fatality rate (technically a proportion, not a rate) is the number of persons dying from the disease during a specified time period divided by the number of persons with the disease. The complement of the case-fatality rate is the survival rate. The five-year survival rate is a common benchmark for chronic diseases such as cancer.
The occurrence of a disease may vary across subgroups of the population or over time. A disease measure for an entire population, without consideration of any subgroups, is called a crude rate. For example, an incidence rate for all age groups combined is a crude rate. The rates for the individual age groups are the age-specific rates. To compare two or more populations with different age distributions, age-adjusted (or age-standardized) rates should be calculated for each population by multiplying each age-specific rate by the per cent of the standard population (e.g., one of the populations under study, or the 1970 US population) in that age group, then summing over all age groups to produce an overall age-adjusted rate. Rates can be adjusted for factors other than age, such as race, gender or smoking status, if the category-specific rates are known.
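The age-adjustment calculation described above can be sketched as follows. All numbers are hypothetical and chosen only for illustration: two populations share identical age-specific rates but differ in age structure, so their crude rates differ while their age-adjusted rates coincide. The function names are assumptions, not an established API.

```python
def crude_rate(age_specific_rates, population_counts):
    """Overall rate: age-specific rates weighted by the population's own age structure."""
    total = sum(population_counts)
    return sum(r * n for r, n in zip(age_specific_rates, population_counts)) / total


def age_adjusted_rate(age_specific_rates, standard_proportions):
    """Directly standardized rate: age-specific rates weighted by the standard population."""
    return sum(r * p for r, p in zip(age_specific_rates, standard_proportions))


# Hypothetical rates per 100,000 person-years for three age groups (young, middle, old).
rates = [10, 50, 200]          # identical in both populations
pop_a = [5000, 3000, 2000]     # younger population
pop_b = [2000, 3000, 5000]     # older population
standard = [0.40, 0.35, 0.25]  # hypothetical standard population (proportions sum to 1)

crude_a = crude_rate(rates, pop_a)
crude_b = crude_rate(rates, pop_b)
adjusted_a = age_adjusted_rate(rates, standard)
adjusted_b = age_adjusted_rate(rates, standard)

print(f"Crude rates:    A = {crude_a:.1f}, B = {crude_b:.1f} per 100,000")
print(f"Adjusted rates: A = {adjusted_a:.1f}, B = {adjusted_b:.1f} per 100,000")
```

The crude comparison would suggest that population B has nearly twice the disease rate of population A, although the difference here is entirely due to age structure.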
Surveillance and evaluation of descriptive data can provide clues to disease aetiology, identify high-risk subgroups that may be suitable for intervention or screening programmes, and provide data on the effectiveness of such programmes. Sources of information that have been used for surveillance activities include death certificates, medical records, cancer registries, other disease registries (e.g., birth defects registries, end-stage renal disease registries), occupational exposure registries, health or disability insurance records and workmen’s compensation records.
Measures of Association
Epidemiology attempts to identify and quantify factors that influence disease. In the simplest approach, the occurrence of disease among persons exposed to a suspect factor is compared to the occurrence among persons unexposed. The magnitude of an association between exposure and disease can be expressed in either absolute or relative terms. (See also "Case Study: Measures").
Absolute effects are measured by rate differences and risk differences (table 2). A rate difference is one rate minus a second rate. For example, if the incidence rate of leukaemia among workers exposed to benzene is 72 per 100,000 person-years and the rate among non-exposed workers is 12 per 100,000 person-years, then the rate difference is 60 per 100,000 person-years. A risk difference is a difference in risks or cumulative incidence and can range from -1 to 1.
Table 2. Measures of association for a cohort study

             Cases    Person-years at risk    Rate per 100,000
Exposed      100      20,000                  500
Unexposed    200      80,000                  250
Total        300      100,000                 300

Rate difference (RD) = 500/100,000 - 250/100,000 = 250/100,000 per year (146.06/100,000 to 353.94/100,000)*
Rate ratio (or relative risk) (RR) = (100/20,000) / (200/80,000) = 2.0
Attributable risk in the exposed (AR_{e}) = 100/20,000 - 200/80,000 = 250/100,000 per year
Attributable risk per cent in the exposed (AR_{e}%) = (RR - 1)/RR × 100 = (2.0 - 1)/2.0 × 100 = 50%
Population attributable risk (PAR) = 300/100,000 - 200/80,000 = 50/100,000 per year
Population attributable risk per cent (PAR%) = PAR / (rate in total population) × 100 = (50/100,000) / (300/100,000) × 100 = 16.7%
* In parentheses 95% confidence intervals computed using the formulas in the boxes.
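The measures in table 2 can be recomputed from the case counts and person-years. The sketch below is illustrative only. The variable names are hypothetical, and the standard error used for the rate difference is an assumed large-sample formula (cases divided by squared person-time, summed over the two groups); the boxes referred to in the footnote are not reproduced here, although this formula does give the interval shown in the table.

```python
import math

# Counts from table 2.
cases_exposed, py_exposed = 100, 20_000
cases_unexposed, py_unexposed = 200, 80_000

rate_exposed = cases_exposed / py_exposed
rate_unexposed = cases_unexposed / py_unexposed
rate_total = (cases_exposed + cases_unexposed) / (py_exposed + py_unexposed)

# Absolute and relative measures of association.
rate_difference = rate_exposed - rate_unexposed        # also the AR in the exposed
rate_ratio = rate_exposed / rate_unexposed
ar_exposed_pct = (rate_ratio - 1) / rate_ratio * 100
par = rate_total - rate_unexposed
par_pct = par / rate_total * 100

# Assumed large-sample standard error for a difference of two rates.
se_rd = math.sqrt(cases_exposed / py_exposed**2 + cases_unexposed / py_unexposed**2)
z = 1.96  # two-sided 95% confidence
rd_ci = (rate_difference - z * se_rd, rate_difference + z * se_rd)

print(f"RD    = {rate_difference * 1e5:.0f} per 100,000 per year "
      f"({rd_ci[0] * 1e5:.2f} to {rd_ci[1] * 1e5:.2f})")
print(f"RR    = {rate_ratio:.1f}")
print(f"ARe%  = {ar_exposed_pct:.0f}%")
print(f"PAR   = {par * 1e5:.0f} per 100,000 per year")
print(f"PAR%  = {par_pct:.1f}%")
```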
Relative effects are based on ratios of rates or risk measures, instead of differences. A rate ratio is the ratio of a rate in one population to the rate in another. The rate ratio has also been called the risk ratio, relative risk, relative rate, and incidence (or mortality) rate ratio. The measure is dimensionless and ranges from 0 to infinity. When the rate in two groups is similar (i.e., there is no effect from the exposure), the rate ratio is equal to unity (1). An exposure that increased risk would yield a rate ratio greater than unity, while a protective factor would yield a ratio between 0 and 1. The excess relative risk is the relative risk minus 1. For example, a relative risk of 1.4 may also be expressed as an excess relative risk of 40%.
In case-control studies (also called case-referent studies), persons with disease are identified (cases) and persons without disease are identified (controls or referents). Past exposures of the two groups are compared. The odds of being an exposed case is compared to the odds of being an exposed control. Complete counts of the source populations of exposed and unexposed persons are not available, so disease rates cannot be calculated. Instead, the exposed cases can be compared to the exposed controls by calculation of relative odds, or the odds ratio (table 3).
Table 3. Measures of association for case-control studies: Exposure to wood dust and adenocarcinoma of the nasal cavity and paranasal sinuses

             Cases    Controls
Exposed      18       55
Unexposed    5        140
Total        23       195

Relative odds (odds ratio) (OR) = (18 × 140) / (5 × 55) = 9.2
Attributable risk per cent in the exposed (AR_{e}%) = (OR - 1)/OR × 100 = 89.1%
Population attributable risk per cent (PAR%) = p_{e}(OR - 1) / [p_{e}(OR - 1) + 1] × 100 = 69.7%
where p_{e} = proportion of exposed controls = 55/195 = 0.28
* In parentheses 95% confidence intervals computed using the formulas in the box overleaf.
Source: Adapted from Hayes et al. 1986.
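The case-control measures can be computed directly from the four cell counts in table 3. This is a minimal sketch with hypothetical variable names. The population attributable risk per cent uses the proportion of exposed controls as a stand-in for the prevalence of exposure in the source population, which is an assumption consistent with the note in the table rather than a formula quoted from it.

```python
# Cell counts from table 3 (wood dust and nasal adenocarcinoma).
exposed_cases, exposed_controls = 18, 55
unexposed_cases, unexposed_controls = 5, 140

# Odds ratio as the cross-product of the 2x2 table.
odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Attributable risk per cent in the exposed, with the OR standing in for the rate ratio.
ar_exposed_pct = (odds_ratio - 1) / odds_ratio * 100

# Proportion of exposed controls, used as an estimate of exposure prevalence.
p_exposed = exposed_controls / (exposed_controls + unexposed_controls)

# Population attributable risk per cent.
excess = p_exposed * (odds_ratio - 1)
par_pct = excess / (excess + 1) * 100

print(f"OR    = {odds_ratio:.1f}")
print(f"ARe%  = {ar_exposed_pct:.1f}%")
print(f"p_e   = {p_exposed:.2f}")
print(f"PAR%  = {par_pct:.1f}%")
```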
Relative measures of effect are used more frequently than absolute measures to report the strength of an association. Absolute measures, however, may provide a better indication of the public health impact of an association. A small relative increase in a common disease, such as heart disease, may affect more persons (large risk difference) and have more of an impact on public health than a large relative increase (but small absolute difference) in a rare disease, such as angiosarcoma of the liver.
Significance Testing
Testing for statistical significance is often performed on measures of effect to evaluate the likelihood that the effect observed differs from the null hypothesis (i.e., no effect). While many studies, particularly in other areas of biomedical research, may express significance by p-values, epidemiological studies typically present confidence intervals (CI) (also called confidence limits). A 95% confidence interval, for example, is a range of values for the effect measure that includes the estimate obtained from the study data and that is constructed so that, over repeated studies, 95% of such intervals would include the true value. Values outside the interval are deemed unlikely to be the true measure of effect. If the CI for a rate ratio includes unity, then there is no statistically significant difference between the groups being compared.
Confidence intervals are more informative than p-values alone. The size of a p-value is determined by two factors: the magnitude of the measure of association (e.g., rate ratio, risk difference) and the size of the populations under study. For example, a small difference in disease rates observed in a large population may yield a highly significant p-value. The reason for a small p-value cannot be identified from the p-value alone. Confidence intervals, however, allow us to disentangle the two factors. First, the magnitude of the effect is discernible from the value of the effect measure and the range of values encompassed by the interval. Larger risk ratios, for example, indicate a stronger effect. Second, the size of the population affects the width of the confidence interval. Small populations with statistically unstable estimates generate wider confidence intervals than larger populations.
The level of confidence chosen to express the variability of the results (the “statistical significance”) is arbitrary, but has traditionally been 95%, which corresponds to a p-value threshold of 0.05. Other levels of confidence, such as 90%, are occasionally used.
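The effect of study size on the width of a confidence interval can be illustrated with two hypothetical cohorts that have the same rate ratio but a tenfold difference in size. The interval below uses a common large-sample approximation on the log scale, which is an assumption made for illustration; the function name rate_ratio_ci is hypothetical.

```python
import math


def rate_ratio_ci(cases_exposed, py_exposed, cases_unexposed, py_unexposed, z=1.96):
    """Rate ratio with an approximate confidence interval computed on the log scale."""
    rr = (cases_exposed / py_exposed) / (cases_unexposed / py_unexposed)
    # Assumed large-sample standard error of ln(RR) for person-time data.
    se_log_rr = math.sqrt(1 / cases_exposed + 1 / cases_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Same underlying rates, hypothetical small and large studies.
small = rate_ratio_ci(10, 2_000, 20, 8_000)
large = rate_ratio_ci(100, 20_000, 200, 80_000)

for label, (rr, lower, upper) in (("small", small), ("large", large)):
    print(f"{label} study: RR = {rr:.1f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both studies estimate the same rate ratio, but only the larger one yields an interval that excludes unity; the interval from the smaller study is much wider and includes it.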
Exposures can be dichotomous (e.g., exposed and unexposed), or may involve many levels of exposure. Effect measures (i.e., response) can vary by level of exposure. Evaluating exposure-response relationships is an important part of interpreting epidemiological data. The analogue to exposure-response in animal studies is “dose-response”. If the response increases with exposure level, an association is more likely to be causal than if no trend is observed. Statistical tests to evaluate exposure-response relationships include the Mantel extension test and the chi-square trend test.
Standardization
To take into account factors other than the primary exposure of interest and the disease, measures of association may be standardized through stratification or regression techniques. Stratification means dividing the populations into homogeneous groups with respect to the factor (e.g., gender groups, age groups, smoking groups). Risk ratios or odds ratios are calculated for each stratum and overall weighted averages of the risk ratios or odds ratios are calculated. These overall values reflect the association between the primary exposure and disease, adjusted for the stratification factor, i.e., the association with the effects of the stratification factor removed.
A standardized rate ratio (SRR) is the ratio of two standardized rates. In other words, an SRR is a weighted average of stratum-specific rate ratios where the weights for each stratum are the person-time distribution of the non-exposed, or referent, group. SRRs for two or more groups may be compared if the same weights are used. Confidence intervals can be constructed for SRRs as for rate ratios.
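A standardized rate ratio can be sketched as the ratio of two rates standardized to the same weights. The numbers below are hypothetical, with two strata and the person-time of the non-exposed group used as the weights; the function name is an assumption.

```python
def standardized_rate(stratum_rates, weights):
    """Weighted average of stratum-specific rates."""
    return sum(r * w for r, w in zip(stratum_rates, weights)) / sum(weights)


# Hypothetical stratum-specific rates (cases per person-year) in two age strata.
exposed_rates = [0.002, 0.010]
unexposed_rates = [0.001, 0.004]

# Weights: person-time distribution of the non-exposed (referent) group.
referent_person_years = [8_000, 2_000]

srr = (standardized_rate(exposed_rates, referent_person_years)
       / standardized_rate(unexposed_rates, referent_person_years))

# Stratum-specific rate ratios, for comparison with the summary value.
stratum_rate_ratios = [e / u for e, u in zip(exposed_rates, unexposed_rates)]

print(f"Stratum-specific rate ratios: {[round(r, 2) for r in stratum_rate_ratios]}")
print(f"SRR = {srr:.2f}")
```

The SRR falls between the stratum-specific rate ratios, as expected for a weighted average.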
The standardized mortality ratio (SMR) is a weighted average of age-specific rate ratios where the weights (e.g., person-time at risk) come from the group under study and the rates come from the referent population, the opposite of the situation in an SRR. The usual referent population is the general population, whose mortality rates may be readily available and based on large numbers and thus are more stable than rates from a non-exposed cohort or subgroup of the occupational population under study. Using the weights from the cohort instead of the referent population is called indirect standardization. The SMR is the ratio of the observed number of deaths in the cohort to the expected number, based on the rates from the referent population (the ratio is typically multiplied by 100 for presentation). If no association exists, the SMR equals 100. It should be noted that because the rates come from the referent population and the weights come from the study group, two or more SMRs tend not to be comparable. This non-comparability is often forgotten in the interpretation of epidemiological data, and erroneous conclusions can be drawn.
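The SMR calculation can be sketched as follows, using hypothetical numbers: the expected deaths are obtained by applying the referent population's age-specific rates to the cohort's own person-time, and the observed deaths are then divided by that expectation.

```python
# Hypothetical age-specific death rates in the referent (general) population,
# in deaths per person-year, for three age groups.
referent_rates = [0.001, 0.005, 0.020]

# Hypothetical person-years accumulated by the cohort in the same age groups.
cohort_person_years = [10_000, 6_000, 2_000]

# Hypothetical number of deaths actually observed in the cohort.
observed_deaths = 100

# Indirect standardization: referent rates applied to the cohort's person-time.
expected_deaths = sum(r * py for r, py in zip(referent_rates, cohort_person_years))

smr = 100 * observed_deaths / expected_deaths

print(f"Expected deaths = {expected_deaths:.0f}, observed = {observed_deaths}")
print(f"SMR = {smr:.0f}")
```

Because the weights are the cohort's own person-time, a second cohort with a different age structure would be standardized to different weights, which is why two SMRs are not directly comparable.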
Healthy Worker Effect
It is very common for occupational cohorts to have lower total mortality than the general population, even if the workers are at increased risk for selected causes of death from workplace exposures. This phenomenon, called the healthy worker effect, reflects the fact that any group of employed persons is likely to be healthier, on average, than the general population, which includes workers and persons unable to work due to illnesses and disabilities. The overall mortality rate in the general population tends to be higher than the rate in workers. The effect varies in strength by cause of death. For example, it appears to be less important for cancer in general than for chronic obstructive lung disease. One reason is that a predisposition towards cancer is unlikely to have influenced job or career selection at a younger age. The healthy worker effect in a given group of workers tends to diminish over time.
Proportional Mortality
Sometimes a complete tabulation of a cohort (i.e., person-time at risk) is not available and there is information only on the deaths or some subset of deaths experienced by the cohort (e.g., deaths among retirees and active employees, but not among workers who left employment before becoming eligible for a pension). Computation of person-years requires special methods to deal with person-time assessment, including life-table methods. Without total person-time information on all cohort members, regardless of disease status, SMRs and SRRs cannot be calculated. Instead, proportional mortality ratios (PMRs) can be used. A PMR is the ratio of the observed number of deaths due to a specific cause in comparison to the expected number, based on the proportion of total deaths due to the specific cause in the referent population, multiplied by the number of total deaths in the study group, multiplied by 100.
Because the proportion of deaths from all causes combined must equal 1 (PMR=100), some PMRs may appear to be in excess, but are actually artificially inflated due to real deficits in other causes of death. Similarly, some apparent deficits may merely reflect real excesses of other causes of death. For example, if aerial pesticide applicators have a large real excess of deaths due to accidents, the mathematical requirement that the PMR for all causes combined equal 100 may cause one or more other causes of death to appear deficient even if their mortality is excessive. To ameliorate this potential problem, researchers interested primarily in cancer can calculate proportionate cancer mortality ratios (PCMRs). PCMRs compare the observed number of cancer deaths to the number expected based on the proportion of total cancer deaths (rather than all deaths) for the cancer of interest in the referent population multiplied by the total number of cancer deaths in the study group, multiplied by 100. Thus, the PCMR will not be affected by an aberration (excess or deficit) in a non-cancer cause of death, such as accidents, heart disease or non-malignant lung disease.
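The distortion described above can be illustrated with hypothetical counts for a study group with a large excess of accidental deaths. The function name and all numbers are assumptions chosen for illustration only.

```python
def proportional_ratio(observed, referent_proportion, total_deaths):
    """Observed deaths over expected deaths (referent proportion x total), times 100."""
    expected = referent_proportion * total_deaths
    return 100 * observed / expected


# Hypothetical deaths in the study group.
study_total_deaths = 200
study_accident_deaths = 80
study_cancer_deaths = 40
study_lung_cancer_deaths = 12

# Hypothetical proportions in the referent population.
ref_prop_accidents = 0.10          # accidents as a share of all deaths
ref_prop_lung_cancer = 0.08        # lung cancer as a share of all deaths
ref_prop_lung_among_cancer = 0.32  # lung cancer as a share of cancer deaths

# PMRs are based on all deaths in the study group.
pmr_accidents = proportional_ratio(study_accident_deaths, ref_prop_accidents, study_total_deaths)
pmr_lung_cancer = proportional_ratio(study_lung_cancer_deaths, ref_prop_lung_cancer, study_total_deaths)

# The PCMR is based on cancer deaths only.
pcmr_lung_cancer = proportional_ratio(
    study_lung_cancer_deaths, ref_prop_lung_among_cancer, study_cancer_deaths
)

print(f"PMR, accidents:    {pmr_accidents:.0f}")
print(f"PMR, lung cancer:  {pmr_lung_cancer:.0f}")
print(f"PCMR, lung cancer: {pcmr_lung_cancer:.2f}")
```

In this hypothetical example the accident excess pulls the lung cancer PMR well below 100, whereas the PCMR, which ignores non-cancer deaths, is much closer to 100.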
PMR studies can better be analysed using mortality odds ratios (MORs), in essence analysing the data as if they were from a case-control study. The “controls” are the deaths from a subset of all deaths that are thought to be unrelated to the exposure under study. For example, if the main interest of the study were cancer, mortality odds ratios could be calculated comparing exposure among the cancer deaths to exposure among the cardiovascular deaths. This approach, like the PCMR, avoids the problems with the PMR which arise when a fluctuation in one cause of death affects the apparent risk of another simply because the overall PMR must equal 100. The choice of the control causes of death is critical, however. As mentioned above, they must not be related to the exposure, but the possible relationship between exposure and disease may not be known for many potential control diseases.
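A mortality odds ratio is computed like any other odds ratio, with deaths from the cause of interest treated as cases and deaths from the chosen control causes treated as controls. The counts and the function name below are hypothetical.

```python
def mortality_odds_ratio(exposed_cause, unexposed_cause, exposed_control, unexposed_control):
    """Odds of exposure among deaths from the cause of interest
    relative to the odds of exposure among deaths from control causes."""
    return (exposed_cause * unexposed_control) / (unexposed_cause * exposed_control)


# Hypothetical deaths, cross-classified by exposure and cause of death.
cancer_deaths_exposed = 30
cancer_deaths_unexposed = 60
cardiovascular_deaths_exposed = 40     # control cause, assumed unrelated to exposure
cardiovascular_deaths_unexposed = 160

mor = mortality_odds_ratio(
    cancer_deaths_exposed,
    cancer_deaths_unexposed,
    cardiovascular_deaths_exposed,
    cardiovascular_deaths_unexposed,
)

print(f"MOR = {mor:.1f}")
```

If the control cause were itself related to the exposure, the denominator odds would be distorted and the MOR would be biased, which is why the choice of control causes matters.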
Attributable Risk
There are measures available which express the amount of disease that would be attributable to an exposure if the observed association between the exposure and disease were causal. The attributable risk in the exposed (AR_{e}) is the disease rate in the exposed minus the rate in the unexposed. Because disease rates cannot be measured directly in case-control studies, the AR_{e} is calculable only for cohort studies. A related, more intuitive, measure, the attributable risk per cent in the exposed (AR_{e}%), can be obtained from either study design. The AR_{e}% is the proportion of cases arising in the exposed population that is attributable to the exposure (see table 2 and table 3 for formulae). The AR_{e}% is the rate ratio (or the odds ratio) minus 1, divided by the rate ratio (or odds ratio), multiplied by 100.
The population attributable risk (PAR) and the population attributable risk per cent (PAR%), or aetiological fraction, express the amount of disease in the total population, which is composed of exposed and unexposed persons, that is due to the exposure if the observed association is causal. The PAR can be obtained from cohort studies (table 2) and the PAR% can be calculated in both cohort and case-control studies (table 2 and table 3).
Representativeness
Several measures of risk have been described. Each assumes underlying methods for counting events and for establishing the representativeness of these events for a defined group. When results are compared across studies, an understanding of the methods used is essential for explaining any observed differences.