Abstract
Point-prevalence surveys (PPSs) are often used to estimate the prevalence of healthcare-associated infections (HAIs). Methods for estimating incidence of HAIs from prevalence have been developed, but application of these methods is often difficult because key quantities, like the average length of infection, cannot be derived directly from the data available in a PPS. We propose a new theory-based method to estimate incidence from prevalence data dealing with these limitations and compare it to other estimation methods in a simulation study. In contrast to previous methods, our method does not depend on any assumptions on the underlying distributions of length of infection and length of stay. As a basis for the simulation study we use data from the second study of nosocomial infections in Germany (Nosokomiale Infektionen in Deutschland, Erfassung und Prävention - NIDEP2) and the European surveillance of HAIs in intensive care units (HAI-Net ICU). The new method compares favourably with the other estimation methods and has the advantage of being consistent in its behaviour across the different setups. It is implemented in an R-package prevtoinc which will be freely available on CRAN (http://cran.r-project.org/).
INTRODUCTION
Epidemiological information on healthcare-associated infections (HAIs) is often acquired by means of point-prevalence surveys (PPSs). Large-scale PPSs are regularly performed by the European Centre for Disease Prevention and Control (ECDC) (1, 2), as well as the US Centers for Disease Control and Prevention (CDC) (3, 4). While the prevalence of HAIs is an important measure in itself, epidemiologists are usually more interested in the incidence of HAIs. For example, estimations of the burden of HAIs often rely on incidence rather than prevalence (5). Therefore, methods of estimating the incidence rates from the data of PPSs are needed. Under general conditions, the incidence and prevalence can be estimated from one another (6). The question of estimating incidence from prevalence in the context of HAIs was addressed in the 1980s by two articles (7, 8). The method developed by Rhame and Sudderth (7) is the most commonly applied method for estimating incidence from prevalence (1–3, 5, 9–14). This method, however, has several limitations:
The Rhame-Sudderth formula was developed using a definition of prevalence that included active and cured infections on the day of the PPS and that is different from the one usually applied in PPSs of HAIs. Another problem with the application of the formula is that it requires a method to estimate the average length of stay and the average length of infection based on data available on the day of the PPS. Without estimates of these quantities from other sources, the application of the estimation method is challenging, because usually only the data obtained on the day of the PPS are available.
In this article, we propose a novel approach dealing with these limitations of estimating incidence from prevalence of HAIs. The proposed approach uses state-of-the-art statistical techniques to estimate the average length of infection and average length of stay in the whole patient population from samples of lengths of infection and hospital stay up to the day of the PPS without relying on any assumption about the distributions of these quantities. We evaluated the new method by comparing it with existing procedures in the literature through simulation studies based on data from the second study of nosocomial infections in Germany (Nosokomiale Infektionen in Deutschland, Erfassung und Prävention - NIDEP2) (15) and from the European surveillance of HAIs in intensive care units (HAI-Net ICU) (16, 17), as well as theoretical distributions.
METHODS
Notation
In general, we used the variable X to indicate a randomly sampled duration from the whole population and L for a randomly sampled duration from the PPS (the duration for a randomly selected patient included in the PPS). L is expected to be on average larger than X, due to the phenomenon of length-biased sampling (8, 18). We used A for the observed duration up to a fixed time for a randomly selected patient at that time point. This was applied to the length of stay and the length of infection.
We used X, A and L when it was not important to distinguish between the length of stay and the length of infection from a theoretical perspective.
The different concepts for the durations are illustrated in Fig. 1. The notation used in this article is explained in Table 1.
Rhame and Sudderth formula
In line with previous authors (7, 8), we assumed that the patient population is in steady state, i.e. the distribution of characteristics of our sample of patients does not depend on the specific day of the survey.
The original formula of Rhame and Sudderth (7) for the incidence per admission Ipp (slightly simplified and adapted to our notation) is:
$$I_{pp} = P_{rhame} \, \frac{x_{los}}{x_{LN-INT}}$$
where xlos denotes the average length of stay of a patient, xLN-INT is the average length of stay for patients after they acquire their first HAI and Ipp is the estimate of the incidence per admission. In this original formulation, Prhame is calculated by counting all patients who had at least one HAI up to the time of the survey (and not just the patients that have an active HAI on the date of the survey) and dividing by the total number of patients.
As pointed out above, two points complicate the application of this formula in this form:
- Often, PPSs only count patients with active infections on the day of the PPS. In these cases, theoretical considerations require that the term xLN-INT is replaced by a term xloi, which gives the average length of a HAI (see supplement S1).
- Samples of Xlos and Xloi (or XLN-INT) are often not available; only the length of stay Alos and possibly the length of infection Aloi up to the day of the PPS are available.
New approach
To estimate the average length of stay and the average length of infection from the durations observed up to the day of the PPS, we proceeded in two steps:
- We estimated the distributions of length of stay and length of infection up to the day of the PPS (in our notation Alos and Aloi) from the available data.
- We calculated from these distributions the expected lengths (xlos and xloi) for the whole population.
For the first part, we used an estimator which ensures the monotonicity of the estimated distribution, because the distribution of A is always monotonically decreasing. This can be illustrated by the timeline of occupancy of a hypothetical bed: on average there will always be more patients for whom it is the first day of their stay than the second day, more patients for whom it is the second day of their stay than the third and so on. We used a Grenander estimator for discrete distributions described and studied by Jankowski and Wellner (19). This estimator is the maximum likelihood estimator for a discrete monotonically decreasing distribution and is therefore a canonical choice. It is well-studied from a theoretical point of view and has good properties such as consistency and a $\sqrt{n}$-rate of convergence (19).
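The monotonicity argument can be checked numerically. The following minimal sketch assumes the standard steady-state relation P(A = a) = P(X ≥ a) / E[X] between the whole-population duration X and the duration-up-to-survey A; the function name and the example distribution are hypothetical and not taken from the prevtoinc package.

```python
from typing import List


def duration_up_to_survey_pmf(x_pmf: List[float]) -> List[float]:
    """pmf of A on 1..len(x_pmf), where x_pmf[k] = P(X = k + 1).

    Assumption: steady state, so that P(A = a) = P(X >= a) / E[X].
    """
    mean_x = sum((k + 1) * p for k, p in enumerate(x_pmf))
    tail = 1.0                      # P(X >= 1)
    a_pmf = []
    for p in x_pmf:
        a_pmf.append(tail / mean_x)
        tail -= p                   # P(X >= a + 1)
    return a_pmf


# Hypothetical durations of 1 to 4 days; the pmf of X itself is increasing
x_pmf = [0.1, 0.2, 0.3, 0.4]
a_pmf = duration_up_to_survey_pmf(x_pmf)
is_decreasing = all(a_pmf[i] >= a_pmf[i + 1] for i in range(len(a_pmf) - 1))
```

Even though the example distribution of X puts most mass on long durations, the resulting distribution of A is decreasing, as in the bed-occupancy picture.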
Following Freeman and Hutchison (8), in the steady state the relation between prevalence P and incidence rate I can be written as:
$$I = \frac{P}{1 - P} \cdot \frac{1}{x_{loi}}$$
where xloi is the average length of a HAI in the whole population. To get this equation into the form of Rhame and Sudderth (7), we multiply by the expected length of stay during which a random patient is susceptible to an infection, (1 − P)xlos, and get
$$I_{pp} = P \, \frac{x_{los}}{x_{loi}}$$
To express xloi in terms of Aloi, we note the following formula, with Npat the total number of patients at the hospital on the survey day:
$$I \, (1 - P) \, N_{pat} = \Pr(A_{loi} = 1) \, P \, N_{pat}$$
where (1 − P)Npat is the average number of patients at risk, $\Pr(A_{loi} = 1)$ is the average proportion of patients with a HAI who are on the first day of infection and PNpat is the average number of patients with a HAI.
Both sides of the equation represent the average number of new infections per day: the left-hand side as the incidence rate I per patient-day-at-risk times the number of patients at risk (1 − P)Npat, and the right-hand side as the average number of HAI cases on the first day of infection. Therefore
$$I = \frac{P}{1 - P} \, \Pr(A_{loi} = 1)$$
where $\Pr(A_{loi} = 1)$ is the probability of sampling Aloi = 1. By comparing with the original incidence rate formula, this gives us the simple relation
$$x_{loi} = \frac{1}{\Pr(A_{loi} = 1)}$$
An alternative, more formal route to the formula is based on renewal theory (8) and specifically Eqn. 2.16 from Haviv (20).
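This leads to the estimator:
$$\hat{x} = \frac{1}{\widehat{\Pr}(A = 1)}$$
with $\widehat{\Pr}(A = 1)$ an estimator of $\Pr(A = 1)$. We call gren the estimator for x based on this procedure with the Grenander estimator (19) for $\widehat{\Pr}(A = 1)$.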
The general method is equally applicable for the estimation of xlos, xloi, xLN-INT and one can construct similar estimators. The derivation of the respective estimators is based on Eqn. 2.16 from Haviv (20).
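The estimation procedure can be sketched in a few lines. The sketch below is an illustration, not the prevtoinc implementation: it assumes that the Grenander estimator of a decreasing discrete distribution can be computed by applying the pool-adjacent-violators algorithm with equal weights to the empirical pmf, and all function names and sample values are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def grenander_pmf(samples: Sequence[int]) -> List[float]:
    """Decreasing pmf estimate on 1..max(samples) via pool-adjacent-violators."""
    n = len(samples)
    counts = Counter(samples)
    empirical = [counts.get(a, 0) / n for a in range(1, max(samples) + 1)]
    # Each block stores [sum of values, number of cells]. Neighbouring blocks
    # are merged while their means violate the non-increasing constraint.
    blocks: List[List[float]] = []
    for value in empirical:
        blocks.append([value, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    fitted: List[float] = []
    for total, size in blocks:
        fitted.extend([total / size] * int(size))
    return fitted


def gren_mean_duration(samples: Sequence[int]) -> float:
    """Estimate x as 1 / P_hat(A = 1) from durations observed up to the survey."""
    return 1.0 / grenander_pmf(samples)[0]


# Made-up durations in days; the raw frequencies are not monotone (more 2s than 1s)
a_samples = [1, 2, 2, 2, 3, 4, 1, 2, 5, 3]
fitted_pmf = grenander_pmf(a_samples)
x_hat = gren_mean_duration(a_samples)
```

In this example the raw frequencies of days 1 and 2 (0.2 and 0.4) are pooled to 0.3 each, so the estimate of the mean duration is 1/0.3, about 3.3 days, rather than the 1/0.2 = 5 days that the unconstrained frequency of A = 1 would give.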
Design of simulations
To assess the performance of our new estimator, we compared it in a simulation study to a selection of other estimators from the literature. Simulations were performed using R 3.5.1 (21) with the prevtoinc package which will be freely available on CRAN (http://cran.r-project.org/).
In a first step, we assessed the quality of the estimators for xloi, xLN-INT and xlos.
The setup was the following: a distribution for Xloi was chosen and the corresponding distributions for Aloi and Lloi were derived. A sample of n values from Aloi and Lloi was drawn and, based on this sample, all considered estimators of xloi were calculated. We repeated this procedure m times and calculated the root-mean square deviation (RMSD) for each estimator.
An analogous procedure was used to benchmark estimators for xlos and xLN-INT.
We performed repeated simulations to assess the performance of estimators for I based on simulated PPS data as follows: the number of patients n in the PPS, a distribution for Xloi and a prevalence value P = 0.05 were fixed. For each patient, the presence of a HAI was determined by a sample from a Bernoulli distribution with parameter P. In a next step, for patients with HAIs, a joint sample of Aloi and Lloi was drawn from the chosen distribution. To assess the performance of the estimators for Ipp, we additionally sampled Alos and Llos jointly for all patients. For a simulation distribution of XLN-INT, assessment of estimators was performed in an analogous way, replacing P by Prhame = 0.2. For further parameters of the simulations see supplement S3.
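A single simulated survey of this kind could look as follows. This is a minimal sketch under stated assumptions, not the code used for the study: it assumes that L is length-biased with P(L = l) proportional to l · P(X = l) and that, given L = l, the survey day falls uniformly on days 1..l of the episode. The function names and the small example distribution are hypothetical.

```python
import random
from typing import List, Tuple


def sample_a_and_l(x_pmf: List[float], n: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Draw n pairs (A, L), where x_pmf[k] = P(X = k + 1).

    Assumption: L is length-biased, P(L = l) proportional to l * P(X = l),
    and given L = l the survey day is uniform on 1..l, which yields A.
    """
    lengths = list(range(1, len(x_pmf) + 1))
    weights = [l * p for l, p in zip(lengths, x_pmf)]
    full = rng.choices(lengths, weights=weights, k=n)
    return [(rng.randint(1, l), l) for l in full]


def simulate_pps(n_patients: int, prevalence: float, x_loi_pmf: List[float],
                 rng: random.Random) -> Tuple[float, List[Tuple[int, int]]]:
    """One simulated survey: observed prevalence and (A_loi, L_loi) per HAI case."""
    n_hai = sum(rng.random() < prevalence for _ in range(n_patients))
    return n_hai / n_patients, sample_a_and_l(x_loi_pmf, n_hai, rng)


def rmsd(estimates: List[float], truth: float) -> float:
    """Root-mean-square deviation of repeated estimates from the true value."""
    return (sum((e - truth) ** 2 for e in estimates) / len(estimates)) ** 0.5


rng = random.Random(1)                       # fixed seed for reproducibility
x_loi_pmf = [0.1, 0.2, 0.3, 0.4]             # hypothetical X_loi on 1..4 days
p_obs, pairs = simulate_pps(2000, 0.05, x_loi_pmf, rng)
```

Repeating `simulate_pps` m times, applying an estimator to each simulated sample and passing the results to `rmsd` together with the true value reproduces the structure of the benchmark described above.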
Estimators for comparison
We used the following estimators to benchmark the performance of our new estimator.
pps.median - estimator based on the median duration up to PPS (1): $\hat{x}_{pps.median} = \mathrm{median}(A)$, where median(A) is the median of samples of the observed A,
pps.mean - alternative estimator used in (1) based on the mean instead of the median
L.full - estimator based on samples from the PPS with information on L based on the transformation formula
$$\hat{x}_{L.full} = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{L_i} \right)^{-1}$$
The transformation formula uses the theoretical relationship between X and L derived in Eqn. 7 in (8). When comparing the performance of estimators, one has to keep in mind that L.full uses the information on the whole durations L instead of only information on A.
The estimators can be used to estimate xloi, xLN-INT or xlos depending on which duration up to PPS A we use.
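The idea behind L.full can be illustrated with a short sketch. It assumes that the transformation amounts to a harmonic mean of the sampled full durations, which follows from the length-biased relation P(L = l) = l · P(X = l) / E[X] and hence E[1/L] = 1/E[X]. The function names and the example distribution are hypothetical.

```python
from typing import List, Sequence


def l_full(l_samples: Sequence[float]) -> float:
    """Harmonic mean of the full durations L sampled in the survey."""
    return len(l_samples) / sum(1.0 / l for l in l_samples)


def length_biased_pmf(x_pmf: List[float]) -> List[float]:
    """pmf of L from x_pmf[k] = P(X = k + 1): P(L = l) = l * P(X = l) / E[X]."""
    mean_x = sum((k + 1) * p for k, p in enumerate(x_pmf))
    return [(k + 1) * p / mean_x for k, p in enumerate(x_pmf)]


x_pmf = [0.1, 0.2, 0.3, 0.4]                 # hypothetical X on 1..4 days, E[X] = 3
l_pmf = length_biased_pmf(x_pmf)

# E[L] overestimates E[X], while 1 / E[1/L] recovers it
mean_l = sum((k + 1) * p for k, p in enumerate(l_pmf))
inv_mean_inv_l = 1.0 / sum(p / (k + 1) for k, p in enumerate(l_pmf))
```

At the population level the plain mean of L is 10/3 days in this example, whereas the harmonic-mean transformation returns the true mean of 3 days.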
Mixed estimator
We also experimented with the combination of different estimators by weighting. As will be seen in the results section, for small samples the estimator gren has high variance. While it is unbiased (inside the model), it could be advantageous to combine it with a biased estimator with lower variance for small sample sizes. As a specific case, we introduced the following estimator pps.mixed based on the estimators pps.mean and gren:
$$\hat{x}_{pps.mixed} = \alpha(n) \, \hat{x}_{gren} + (1 - \alpha(n)) \, \hat{x}_{pps.mean}$$
where n is the sample size and α(n), a value between 0 and 1, is the weight placed on gren.
The function α is chosen as a sigmoid function of the sample size n. This gives a smooth transition between pps.mean and the new estimator gren, with equal weighting α = 0.5 at n = 500. Again, this type of estimator can be used for the estimation of xloi, xLN-INT or xlos.
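A weighting of this kind could be sketched as below. Only the midpoint n = 500 comes from the text; the logistic form and the steepness parameter `scale` are assumptions made for illustration, as are the function names, and the weight on gren is assumed to grow with n.

```python
import math


def sigmoid_weight(n: int, midpoint: float = 500.0, scale: float = 100.0) -> float:
    """Weight on gren as a logistic function of the sample size n.

    midpoint = 500 follows the text (alpha = 0.5 at n = 500);
    scale is a hypothetical steepness parameter.
    """
    return 1.0 / (1.0 + math.exp(-(n - midpoint) / scale))


def pps_mixed(x_mean: float, x_gren: float, n: int) -> float:
    """Convex combination of the pps.mean and gren estimates."""
    alpha = sigmoid_weight(n)
    return alpha * x_gren + (1.0 - alpha) * x_mean


# Hypothetical estimates of 6 and 8 days from the two component estimators
small_sample = pps_mixed(6.0, 8.0, n=50)     # close to the pps.mean value
balanced = pps_mixed(6.0, 8.0, n=500)        # equal weighting
large_sample = pps_mixed(6.0, 8.0, n=2000)   # close to the gren value
```

With these illustrative settings the combined estimate moves from the pps.mean value for small surveys to the gren value for large ones.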
Constructing estimators for I and Ipp
We estimated the theoretical prevalence P by taking the observed prevalence on the day of the PPS as an estimate $\hat{P}$. We constructed the incidence rate estimator in the general form:
$$\hat{I} = \frac{\hat{P}}{1 - \hat{P}} \cdot \frac{1}{\hat{x}_{loi}}$$
and for the incidence proportion per admission:
$$\hat{I}_{pp} = \hat{P} \, \frac{\hat{x}_{los}}{\hat{x}_{loi}}$$
where one uses any of the above estimators for $\hat{x}_{loi}$ and $\hat{x}_{los}$. A similar estimator could be built by plugging in the corresponding estimators in the original Rhame-Sudderth formula using Prhame and xLN-INT.
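As a worked illustration of the two plug-in estimators, the following sketch computes the incidence rate per patient-day at risk and the incidence proportion per admission from an observed prevalence and estimated mean durations. The function names and the input values are hypothetical and chosen only to show the arithmetic.

```python
def incidence_rate(prevalence: float, x_loi: float) -> float:
    """Incidence rate per patient-day at risk: P / ((1 - P) * x_loi)."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must lie in [0, 1)")
    if x_loi <= 0:
        raise ValueError("x_loi must be positive")
    return prevalence / ((1.0 - prevalence) * x_loi)


def incidence_proportion(prevalence: float, x_loi: float, x_los: float) -> float:
    """Incidence proportion per admission: P * x_los / x_loi."""
    if x_loi <= 0 or x_los <= 0:
        raise ValueError("durations must be positive")
    return prevalence * x_los / x_loi


# Made-up inputs: 5% prevalence, mean infection length 8 days, mean stay 10 days
p_hat, x_loi_hat, x_los_hat = 0.05, 8.0, 10.0
i_hat = incidence_rate(p_hat, x_loi_hat)
i_pp_hat = incidence_proportion(p_hat, x_loi_hat, x_los_hat)
```

With these inputs the incidence rate is roughly 0.0066 per patient-day at risk and the incidence proportion is 0.0625 per admission; multiplying the rate by (1 − P)·xlos gives the proportion, in line with the derivation.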
Simulation distributions for Xloi, XLN-INT and Xlos
We used three different distributions for Xloi: a geometric distribution shifted to start on 1 with mean 8, a Poisson distribution shifted to start on 1 with mean 8, and an empirical distribution based on data from the NIDEP2 study (15). We selected the two theoretical distributions, Poisson and geometric, to assess the flexibility of the estimators. In the NIDEP2 study, incidence and prevalence of HAIs were measured on a daily basis in eight German hospitals during two eight-week periods (see supplement S2 for a further description of the data).
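The two theoretical distributions can be written down explicitly. The sketch below assumes that "shifted to start on 1 with mean 8" means a geometric distribution on 1, 2, ... with success probability 1/8 and a Poisson distribution with rate 7 shifted by one day; the truncation points and function names are arbitrary choices made for illustration.

```python
import math
from typing import List


def shifted_geometric_pmf(mean: float, max_len: int) -> List[float]:
    """P(X = k) = (1 - p)^(k - 1) * p for k = 1..max_len, with p = 1 / mean."""
    p = 1.0 / mean
    return [(1.0 - p) ** (k - 1) * p for k in range(1, max_len + 1)]


def shifted_poisson_pmf(mean: float, max_len: int) -> List[float]:
    """X = 1 + Poisson(mean - 1), evaluated for k = 1..max_len."""
    lam = mean - 1.0
    pmf = [math.exp(-lam)]                   # P(X = 1) = P(Poisson = 0)
    for j in range(1, max_len):
        pmf.append(pmf[-1] * lam / j)        # Poisson recursion
    return pmf


def pmf_mean(pmf: List[float]) -> float:
    """Mean of a pmf stored as pmf[k] = P(X = k + 1)."""
    return sum((k + 1) * p for k, p in enumerate(pmf))


geom = shifted_geometric_pmf(8.0, 400)       # truncation at 400 days is arbitrary
pois = shifted_poisson_pmf(8.0, 100)         # truncation at 100 days is arbitrary
```

Both distributions have mean 8 but very different shapes: the geometric distribution is itself decreasing, while the shifted Poisson distribution is concentrated around its mean.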
For simulation of XLN-INT, we used an empirical distribution of XLN-INT based on the HAI-Net ICU data from 2015, which recorded the date of onset of HAI and the date of discharge of patients with an ICU-acquired HAI in 1 365 intensive care units (ICUs) from 11 European Union Member States (see supplement S2 for a further description of the data) (16, 17). No information on the end of the HAI was available, which is why we used XLN-INT and the original version of the Rhame-Sudderth formula for this simulation example.
We show the resulting distributions of Xloi in Fig. 2 and the distribution for XLN-INT in Fig. 3.
Based on these distributions for Xloi we calculated the distributions of Aloi and Lloi (see Eqn. 2.14 and 2.16 (20) for the exact relation between these distributions).
For each simulation we then sampled n lengths of infection jointly from Aloi and Lloi. An analogous procedure was applied for sampling lengths of stay (Alos and Llos) and lengths of stay after infection (ALN-INT and LLN-INT). The distributions used for simulating the length of stay are shown in Fig. 3 and were based on data on lengths of stay from the NIDEP2 study (15) and HAI-Net ICU (16, 17).
RESULTS
To assess the quality of the estimators, we measured the RMSD for increasing numbers of HAIs using different distributions (Fig. 4–7).
Simulations for xloi and xLN-INT
In Fig. 4, we present the RMSDs of the estimates of xloi. We show the results for three examples of Aloi distributions. The simulations ranged from n = 50 to n = 1000. The estimators differed in the size of the RMSD, as well as in the convergence to zero along increasing sample sizes. In all three distributions, pps.median had the highest RMSD and generally did not converge to zero. The estimator pps.mean behaved similarly to pps.median in the case of the Poisson distribution. For the NIDEP2 distribution, it did not converge to zero, but stabilized on a lower RMSD compared to the Poisson distribution. In the case of the geometric distribution, pps.mean converged to zero with a low RMSD as could be expected for mathematical reasons (22), because xloi = aloi for this specific distribution. L.full converged towards zero for all distributions and had among the lowest RMSD for all three settings. The RMSD of the new estimator gren converged towards zero in all three settings for large enough sample sizes and the magnitude of the RMSD was similar for all three distributions. The RMSD of pps.mixed for lower sample sizes was similar to the one of pps.mean and for larger sample sizes more like the new gren estimator as expected due to its construction.
Similar plots for the bias (inside the model) and standard deviation can be found in the supplement S4 in Fig. S2 and S3. For boxplots of the estimators of xloi, xLN-INT and xlos see Fig. S4–S6 in supplement S5.
Results for the estimation of xLN-INT for the HAI-Net ICU data are shown in Fig. 5. The simulations again ranged from n = 50 to n = 1000.
As previously, pps.median did not converge to zero and had the highest RMSD among the estimators. The estimator pps.mean stabilized at a significantly lower RMSD than pps.median, but did not converge towards zero either. L.full was again the best performing estimator in terms of RMSD. gren and pps.mixed behaved similarly to the estimation of xloi. gren exhibited a lower RMSD than pps.mean as the sample size increased, and the RMSD behaviour of pps.mixed with increasing sample size was between that of pps.mean and gren.
Simulations for I
The results for the RMSDs of the estimators for I are shown in Fig. 6. In this figure the RMSD was divided by the theoretical incidence rate I to estimate the relative size of the error. The sample sizes ranged from n = 500 to n = 20000 patients in the simulated PPS. As expected, the RMSDs behave very similarly to the case of estimation of xloi and xLN-INT and the additional uncertainty in the estimation of P did not change the general patterns for the RMSDs shown in Fig. 4 and Fig. 5 when compared to Fig. 6.
Simulations for xlos
Results for the length of stay in days are shown in Fig. 7. Again we presented the RMSD of the estimators of xlos. We used the empirical distributions of the length of stay from the NIDEP2 and HAI-Net ICU datasets.
For the NIDEP2 distribution of lengths of stay, pps.median again had the highest RMSD and did not converge toward zero. pps.mean did not converge toward zero either but stabilized at a lower RMSD than pps.median. L.full again had the lowest RMSD and pps.mixed and gren also had comparably low RMSD for larger samples (n ≥ 5000).
For the HAI-Net ICU distribution of length of stay, the previous picture with respect to pps.mean and pps.median was reversed. pps.mean had the highest RMSD and did not converge towards zero, and pps.median had a lower RMSD but also did not converge towards zero. For gren and L.full the simulation results were similar to those obtained with the NIDEP2 distribution of length of stay. The estimator pps.mixed had a high RMSD compared to the other estimators for small sample sizes where the pps.mean component was dominant. For larger sample sizes, it behaved similarly to L.full and gren.
Simulations for Ipp
For the incidence proportion of HAIs counted per admission, the RMSDs of the estimators are shown in Fig. 8. In this figure the RMSD was divided by the theoretical incidence proportion Ipp to estimate the relative size of the error. Almost all the estimators behaved very similarly in terms of RMSD. pps.mean and pps.median were the only estimators with a significantly higher RMSD than the other estimators for the HAI-Net ICU distribution. In the case of the NIDEP2 distribution and pps.median, the errors in the estimation of xloi and xlos seemed to cancel out almost exactly, reducing the RMSD for Ipp to levels comparable to the other estimators.
DISCUSSION
We presented a method to estimate incidence from prevalence data available in a typical PPS setup. We used nonparametric estimators for length of stay and length of infection which exploit the monotonicity of Alos and Aloi. By means of a simulation study, we compared these estimators to other estimators that have been applied in previous studies.
The new gren estimator behaved consistently for different distributions, i.e. was more accurate with larger samples and in most cases was comparable to or better than the other estimators based on A. This was in contrast to pps.median, which generally did not converge to the true value. The estimator pps.mean did not perform overall as well as the gren estimator, but its variance for small sample sizes was lower. As expected, L.full performed better than or as well as all other estimators across all settings, but at the price of requiring knowledge of the full durations L which are typically not available in a PPS. We finally proposed the mixed estimator pps.mixed as a good compromise between the low variability of the pps.mean for smaller samples and the consistent behaviour of the new estimator gren for larger samples. Altogether, the incidence estimate based on a PPS can be improved by more than 40% of the theoretical value (in terms of RMSD), compared to other estimators from the literature for a large enough PPS (see Fig. 8).
The new method presented in this article is a modification and update of the Rhame-Sudderth formula and is applicable in the setup of modern PPSs. The Rhame-Sudderth formula was published in the 1980s and, to our knowledge, there have been only a few methodological contributions addressing the validity of the formula on a theoretical level since its publication. Mandel and Fluss (23) have proposed and studied incidence estimators which generalize the Rhame-Sudderth estimator, but they depend on the use of the original Rhame-Sudderth prevalence definition and information about the total length of stay Llos for all patients in the survey. There have been attempts to evaluate the Rhame-Sudderth formula (14, 24), but these shared the limitation of using P instead of Prhame, which the original Rhame-Sudderth formula intended. Few studies distinguish between P and Prhame and use the originally intended combination of prevalence and duration definitions. This often leads to the use of xLN-INT as a proxy for length of infection xloi (14, 24). It was suggested that xLN-INT was not a good proxy for average length of infection (13) and that some ad-hoc measure or external information could be used instead to estimate average length of infection (9, 11–13). Most of the articles remained critical of their own results. The ECDC-coordinated PPS of healthcare-associated infections and antimicrobial use in European acute care hospitals included data from over 200 000 patients across Europe and used the estimators pps.mean and pps.median (or more precisely a combination of these two) to estimate xLN-INT stratified by participating country (1). Information on xlos was often obtained from external data sources. In the analysis of the latest ECDC-coordinated PPSs in acute care hospitals and long-term care facilities (2016-2017) (14), our proposed method has already been used for sensitivity analysis to compare with the estimator described above. European-level estimates were similar for the different estimators, with few exceptions at individual country level.
The United States PPS coordinated by CDC (3) used stratification along factors thought to be predictive of the prevalence of HAIs. The estimators in (2) were based on medians of the durations-up-to-PPS similar to pps.median or external information, and in (4) the original Rhame-Sudderth formula was used with the definition of prevalence P instead of Prhame and with a length-biased version of xLN-INT (i.e. lLN-INT) and a length-biased version of xlos (i.e. llos).
A main strength of our method is that we do not make any assumptions on the distributions of Xloi and Xlos. Usually in a PPS we do not know the distribution of Xloi and Xlos. This means that one criterion for selecting an estimator of I or Ipp is that it should behave well irrespective of the form of the unknown distribution. Among the estimators using only duration-up-to-PPS information, this criterion was only fulfilled by the proposed estimator gren and the mixed estimator pps.mixed for larger samples (n ≥ 500). This is supported not only by simulations, but also by theoretical considerations. Using only simulation studies to assess an estimator can be a source of error if the distributions on which the estimators are assessed differ significantly from the underlying distributions encountered in PPSs.
Our method has limitations. One is the requirement of a sufficiently large sample size to get an acceptable estimate. We took the sample size of 500 HAIs as a rule-of-thumb lower limit. The method may be applied to smaller samples, but with a risk of lower precision. For a single medium-size hospital, repeated PPSs with aggregation of the results would need to be performed to reliably estimate the incidence of HAIs. Another limitation of our setup is that we counted multiple simultaneous or partially overlapping HAIs as one HAI. However, in reality these comprise only a small fraction of HAIs (1, 15) and can therefore be neglected. In addition, many of the limitations mentioned in the original article by Rhame and Sudderth also apply in this updated version, in particular the lack of explicit representation of outbreaks and the assumption that the risk that a patient acquires a HAI is independent of other patients’ status. The new estimators gren and pps.mixed applied to the length of stay are sensitive to weekday patterns in admissions and discharges (data not shown).
Typically, for larger PPSs, data collection takes place on different weekdays for different hospitals or even different wards in the same hospital (1), thus mitigating the influence of these patterns on the estimates. Another issue is that patients on their first day of admission are sometimes underrepresented due to the PPS protocol, e.g. when only patients admitted before a fixed time are included in the PPS. The new estimators are based on the monotonicity assumption for the distribution of Alos, which is violated in this situation. One solution can be to let A denote full days of hospital stay and ignore the patients admitted on the date of the survey for the estimates of average length of stay, but include them in the estimate of the prevalence. Similar problems appear to a lesser extent for the first day of HAI. Other factors that need to be taken into account include the consistency of the application of case definitions for HAIs, and the representativeness of the hospital sample.
In conclusion, the proposed gren estimator and the combined estimator pps.mixed provide better estimates of the length of infection across a range of simulation settings when compared to previously used estimators and, in contrast to these, are grounded in theory. The simulations also serve as a guide of the sample size to include in a PPS required to estimate incidence. The method is shared and easily applicable with the help of the R package prevtoinc.
Abbreviations
- CDC - Centers for Disease Control and Prevention
- ECDC - European Centre for Disease Prevention and Control
- HAI - healthcare-associated infection
- HAI-Net ICU - European surveillance of HAIs in intensive care units
- NIDEP2 - second study of nosocomial infections in Germany (Nosokomiale Infektionen in Deutschland, Erfassung und Prävention)
- PPS - point-prevalence survey
- RMSD - root-mean-square deviation