This page describes, in detail, how CrimeSolutions generates outcome-based program ratings. It will walk you through the four distinct phases of the scoring instrument that reviewers use to assess the studies in the evidence base and rate the outcomes:
- Phase 1 assesses each individual study in the evidence base.
- Phase 2 summarizes the overall strength of the evidence base.
- Phase 3 calculates an effect size for each outcome category within a single study.
- Phase 4 synthesizes the evidence for each outcome category across all studies in the evidence base.
See also What We Look for in Program Evaluation Research.
Phase 1: Individual Study Quality Scoring
The first phase assesses each study in the evidence base on several specific items. These items are designed to assess the conceptual framework of the program, the fidelity of the program implementation, and the methodological rigor of the study and its outcomes. This individual study rating provides the foundation for determining the overall quality of the program evidence. Each study in the evidence base is scored using the following criteria:
Step 1.1. Benchmark the conceptual framework of the program.
The conceptual framework of an intervention program is a structured representation that outlines the underlying theories, assumptions, and relationships guiding the program's design. It is a blueprint that connects the program’s objectives to the specific activities intended to achieve desired outcomes. Typically, it includes the following elements:
- Problem Statement: Defines the issue the program aims to address.
- Theoretical Foundations: Describes the theories or models that inform the program's approach, explaining why specific strategies or activities are expected to produce the intended outcomes.
- Key Components: Identifies the core elements of the intervention, such as services, resources, or activities, which are central to addressing the identified problem.
- Target Population: Defines the group or community the program is designed to benefit or serve.
- Outcomes: Describes the expected changes resulting from the program.
The following items are used to assess the conceptual framework:
- Program description. Is the program fully described?
- Description of the theory of change. Does the study describe why change should be expected?
- Soundness of the theory of change. Is the theory of change sound?
Overall, for the program to be considered conceptually sound, each of these components must receive a response that meets acceptable standards. This requires clear and sufficient evidence of a program model that effectively links the program's objectives to the specific activities designed to achieve its desired outcomes. A conceptually sound program demonstrates a logical and cohesive framework in which the planned intervention directly aligns with the intended goals and is likely to produce the intended outcomes, as indicated by affirmative answers to all three items.
Step 1.2. Benchmark the fidelity of the study used to evaluate the outcome.
Fidelity refers to the degree to which an intervention is implemented as intended by its original design. It is a measure of how closely the actual delivery of the program aligns with its prescribed components, methods and protocols. Maintaining program fidelity is critical for ensuring the program achieves its intended outcomes and that the results can be attributed to the program itself, rather than to variations in its implementation.
The following items are used to assess the fidelity/program implementation:
- Documentation. Did the authors provide any information on the implementation of the program?
- Fidelity measurement. How did the study authors document fidelity? Response options are none, anecdotal, qualitative and quantitative. The response must be qualitative or quantitative to meet the acceptable standard.
- Adherence. Was the program implemented with fidelity? Possible responses are adherence to the program appears poor, adherence appears satisfactory, and cannot tell.
For the program to be considered implemented with fidelity, each of these components must receive a response that meets acceptable standards. This indicates that there is strong evidence showing the program was delivered as designed, adhering closely to its prescribed components, methods and protocols. Such alignment ensures the outcomes observed during an evaluation can be confidently attributed to the program itself, rather than to inconsistencies or deviations in how it was implemented.
Step 1.3. Benchmark the methodology of the study used to evaluate the outcome.
Methodological rigor refers to the application of systematic, transparent, and well-validated methods in the design, execution, and analysis of an evaluation study. It ensures the reliability, validity, and credibility of the findings by minimizing biases, errors and inconsistencies throughout the research process.
The following items are used to assess methodology:
- Internal Validity. Do any of the specified threats to validity (including selection bias, contamination bias, regression toward the mean bias, history bias or maturation bias) seriously undermine the credibility of findings? The study must be free of any serious threats to validity to meet acceptable standards.
- Measurement. Is the outcome valid? Response options include no, face validity, construct validity and cannot tell. Either face or construct validity is necessary to meet the acceptable standards.
- Assignment. How are the subjects allocated to different groups in the study? (The study must assign subjects to groups using either random or nonrandom methods to meet acceptable standards.) Responses include:
- Nonrandom: Assignment based on criteria other than random chance.
- Nonrandom: Natural cut point (regression discontinuity).
- Nonrandom: Natural experiment.
- Random assignment.
- Cannot tell.
- Baseline outcome differences. Does the analysis account for the baseline differences in the outcome? The response must be yes to meet acceptable standards.
- Potential confounds. Does the analysis account for other potential confounding variables? The response must be yes to meet acceptable standards.
- Analytic approach. Did the analysis use an approach that is appropriate for the type of data being analyzed? The response must be yes to meet acceptable standards.
- Displacement/diffusion. Did the study assess displacement or diffusion effects as part of the analysis? (This item is only applicable to studies that require a displacement or diffusion analysis.) Possible responses are no; yes, anecdotally; yes, empirically but post hoc; and yes, empirically as an integral or central part of the analysis. The response must be yes to meet acceptable standards.
For the study to be considered methodologically sound, each of these components must receive a response that meets acceptable standards. This indicates sufficient evidence supporting the reliability, validity and credibility of the findings by systematically minimizing biases, errors and inconsistencies throughout the research process.
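As an illustration of how these item-level judgments combine, the short Python sketch below simply checks that every methodology item meets the acceptable standard described above. The field names are ours and are not part of the official scoring instrument.

```python
# Illustrative only: the methodology benchmark passes when every item listed
# above receives a response that meets the acceptable standard.
# Field names are ours, not the official CrimeSolutions instrument.
methodology_items = {
    "free_of_serious_internal_validity_threats": True,
    "outcome_has_face_or_construct_validity": True,
    "assignment_method_acceptable": True,
    "baseline_outcome_differences_addressed": True,
    "potential_confounds_addressed": True,
    "analytic_approach_appropriate": True,
    "displacement_diffusion_assessed_if_applicable": True,
}

methodologically_sound = all(methodology_items.values())
print(methodologically_sound)  # True only if every item meets the standard
```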
Step 1.4. Rate the quality of an individual study.
The following items are used to rate the quality of a study:
- Conceptual framework. What the program intended to do.
- Fidelity. What the program did versus what the program developers said they wanted the program to do.
- Methodology. Whether the observed effects can be attributed to the program rather than to extraneous factors.
For a study to be of optimal quality, reviewers need to know what the intervention intended to achieve (conceptual framework), confirm that the intervention was implemented as planned (fidelity), and be reasonably certain that the intervention caused the change (methodology). If the assessment lacks information about the conceptual framework or fidelity, details regarding the internal processes or components leading to the outcomes are missing. Consequently, such studies are considered of sufficient quality but more information on how the intervention operates is needed. Table 1 shows the characteristics that constitute each quality level, which corresponds to a value in the scoring instrument.
Quality Level | Description |
---|---|
Level 3 | Optimal |
Level 2 | High |
Level 1 | Good |
Level 0 | |
Phase 2: Summarizing the Quality of the Evidence Base
A program outcome may be evaluated by two or more studies that differ in quality. To assess the overall quality of the studies in the evidence base of a program, we aggregate the quality ratings of the individual studies, weighting each by its study design. The steps below illustrate this procedure using data from Table 2.
Quality Score | Quality Description | RCT | Quality Weight | Weighted Rating | Sum of Weighted Ratings | Sum of Weights | Weighted Average Quality Rating |
---|---|---|---|---|---|---|---|
3 | Optimal | Yes | 1.00 | 3.00 | 5.25 | 2.50 | 2.10 |
2 | High | No | 0.75 | 1.50 | - | - | - |
1 | Good | No | 0.75 | 0.75 | - | - | - |
Step 2.1. Calculate the Weighted Rating of Each Study
A weighted rating assigns different levels of importance to individual studies based on specific criteria. This approach allows the results from more reliable studies to have greater influence on the overall synthesis. In the example in Table 2, each study is weighted by its study design, with experimental designs receiving a weight of 1.00 and quasi-experimental designs receiving a weight of 0.75. The weighted rating is calculated for each study by multiplying the individual study quality score by the quality weight.
- Study 1 (Optimal Quality): 3 * 1.00 = 3.00
- Study 2 (High Quality): 2 * 0.75 = 1.50
- Study 3 (Good Quality): 1 * 0.75 = 0.75
Step 2.2. Calculate the Weighted Average Quality Rating
A weighted average quality rating is the overall quality score for a set of studies, where each study is given a weight based on a specific characteristic. This rating combines the individual quality scores of the studies in a way that gives more importance to the more rigorously designed studies. The weighted average quality rating is calculated by dividing the sum of weighted ratings by the sum of weights.
- Weighted Average Quality Rating = 5.25 / 2.50 = 2.10
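As a concrete illustration, Steps 2.1 and 2.2 can be reproduced with a few lines of Python using the example data from Table 2. This is a sketch only; the variable names are ours.

```python
# Example data from Table 2: (quality score, design-based weight) for each study.
# Experimental designs (RCTs) are weighted 1.00; quasi-experimental designs 0.75.
studies = [
    (3, 1.00),  # Study 1: optimal quality, RCT
    (2, 0.75),  # Study 2: high quality, quasi-experimental
    (1, 0.75),  # Study 3: good quality, quasi-experimental
]

# Step 2.1: weighted rating for each study
weighted_ratings = [score * weight for score, weight in studies]  # [3.0, 1.5, 0.75]

# Step 2.2: weighted average quality rating
weighted_average = sum(weighted_ratings) / sum(w for _, w in studies)
print(round(weighted_average, 2))  # 2.1
```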
Phase 3: Calculate the Within-Study Composite Effect Size
Phase 3 systematically combines and analyzes the relevant program effects of the studies assessed in phases 1 and 2. This is accomplished via meta-analytic techniques. The primary goals of the analysis are to integrate findings, increase statistical power and provide a more precise and comprehensive estimate of the program effect size across specific outcome categories.
Two problems arise when studies report multiple effects in the same outcome category:
- Computing an overall effect across studies assigns more weight to studies that contribute multiple effects in the same outcome category than to studies that contribute a single effect.
- Combining effects leads to an improper estimate of the precision of the summary effect, because it treats the separate effects as providing independent information, when in fact the effects from the same study are not independent of each other.
To address these two problems, we first compute the mean of the within-study effects and use this composite score as the unit of analysis. This allows each study to be represented by one effect rather than treating each effect as a separate unit of analysis.
Table 3 presents data for an example program that will be used throughout this section when giving examples.
Study ID | Outcome Domain | Effect Size | Variance | Standard Error | Number of Effects | Rho | Composite Effect Size | Variance of Composite Effect Size |
---|---|---|---|---|---|---|---|---|
1 | Arrest | 0.23 | 0.037 | 0.193 | 4 | 1 | 0.35 | 0.038 |
1 | Arrest | 0.22 | 0.023 | 0.151 | 4 | 1 | - | - |
1 | Arrest | 0.30 | 0.052 | 0.228 | 4 | 1 | - | - |
1 | Arrest | 0.64 | 0.044 | 0.211 | 4 | 1 | - | - |
2 | Arrest | 0.20 | 0.067 | 0.249 | 4 | 1 | 0.20 | 0.067 |
Step 3.1. Calculate the Composite Effect
When the outcome has only one effect, no adjustment is necessary: the effect and its variance are used exactly as calculated from the study. When a study has more than one effect for a given outcome category, we calculate a composite effect size and the variance of that composite effect size. The composite effect size is simply the average of all the effect sizes from the study for that outcome category. For example, using the data in Table 3, the composite effect for Study 1 is calculated as follows:
- Composite Effect Size = AVERAGE (0.23, 0.22, 0.30, 0.64) = .35
Step 3.2. Calculate the Variance of Composite Effect
The variance of the composite effect size depends on the number of effects and incorporates the variance of each effect along with the covariance between effects, which in turn depends on the correlation between them. Because this correlation is unknown, CrimeSolutions takes a conservative approach and assumes the effects are perfectly correlated (rho = 1.0). For m effects, the variance of the composite is (1/m)^2 * [the sum of the individual variances + the sum, over every pair of effects, of 2 * rho * SE_i * SE_j]. With rho = 1.0, the result is generally close to the simple average of the individual variances. Using the data in Table 3, the variance of the composite effect for Study 1 is calculated as follows:
- Composite Variance = ((1/4)^2) * ((0.037 + 0.023 + 0.052 + 0.044) + (2 * 1 * 0.193 * 0.151) + (2 * 1 * 0.193 * 0.228) + (2 * 1 * 0.193 * 0.211) + (2 * 1 * 0.151 * 0.228) + (2 * 1 * 0.151 * 0.211) + (2 * 1 * 0.228 * 0.211)) = .038
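The same calculation can be written as a short Python function. This is a minimal sketch of the composite calculation described above, assuming rho = 1.0 by default; the function name and structure are ours, not part of the CrimeSolutions instrument.

```python
from itertools import combinations
from math import sqrt

def composite_effect(effect_sizes, variances, rho=1.0):
    """Composite effect size and variance for multiple effects from one study.

    The composite is the simple mean of the effects. Its variance follows the
    formula above, conservatively assuming the effects are correlated at rho
    (CrimeSolutions uses rho = 1.0 when the true correlation is unknown).
    """
    m = len(effect_sizes)
    mean_effect = sum(effect_sizes) / m
    standard_errors = [sqrt(v) for v in variances]
    # Sum over all pairs of effects of 2 * rho * SE_i * SE_j
    cross_terms = sum(2 * rho * se_i * se_j
                      for se_i, se_j in combinations(standard_errors, 2))
    variance = (1 / m) ** 2 * (sum(variances) + cross_terms)
    return mean_effect, variance

# Study 1 arrest effects and variances from Table 3
es, var = composite_effect([0.23, 0.22, 0.30, 0.64], [0.037, 0.023, 0.052, 0.044])
print(round(es, 2), round(var, 3))  # 0.35 0.038
```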
Phase 4: Calculate an Overall Meta-Analytic Mean Effect Size
Phase 4 calculates the overall meta-analytic effect size across all included studies, where each study contributes only one effect per outcome category to the meta-analysis. (Note: Studies reporting only one effect simply contribute that effect, while studies that reported multiple effects for the same outcome category contribute the composite effect calculated in Step 3.1.) The steps used to calculate the meta-analytic effect size are illustrated below using data from Table 4.
Study ID | Outcome Domain | Effect Size | Variance | Inverse Variance Weight | Inverse Variance Effect | Sum of Inverse Variance Effect | Sum of Inverse Variance Weight | Meta Effect Size | Meta Variance |
---|---|---|---|---|---|---|---|---|---|
1 | Arrest | 0.35 | 0.038 | 26.32 | 9.21 | 12.196 | 41.241 | 0.296 | 0.024 |
2 | Arrest | 0.20 | 0.067 | 14.93 | 2.99 | - | - | - | - |
Step 4.1. Calculate the Inverse Variance Weight
An inverse variance weight is a statistical weighting method used to assign more importance to estimates with higher precision (lower variance). It is calculated as the reciprocal of the variance.
- Study 1 Effect: 1/.038 = 26.32
- Study 2 Effect: 1/.067 = 14.93
Step 4.2. Calculate the Inverse Variance Effect
An inverse variance effect is the weighted effect size where individual effect sizes are weighted by the inverse of their variances. It is calculated by multiplying the effect size by the inverse variance weight.
- Study 1 Effect: .35 * 26.32 = 9.21
- Study 2 Effect: .20 * 14.93 = 2.99
Step 4.3. Calculate the Overall Meta-Analytic Effect Size
A meta-analytic effect size is a single, summary statistic that combines the effect sizes from multiple studies on the same topic. It represents the overall magnitude of an effect across studies and is calculated by dividing the sum of the inverse variance effects by the sum of the inverse variance weights.
- Meta-Analytic Effect = 12.196 / 41.241 = .296
Step 4.4. Calculate the Variance of Overall Meta-Analytic Effect Size
The variance of a meta-analytic effect size measures the spread around the overall effect size across studies. It is calculated as the reciprocal of the sum of the inverse variance weights.
- Variance of Meta-Analytic Effect Size = 1 / 41.241 = .024
Step 4.5. Calculate the Standard Error of the Overall Meta-Analytic Effect Size
The standard error of a meta-analytic effect size quantifies the uncertainty or precision of the combined effect size estimate. It is a critical value used to calculate confidence intervals and perform hypothesis tests on the meta-analytic effect size. It is calculated as the square root of the variance.
- Standard Error = SQRT (.024) = .156
Step 4.6. Calculate the Confidence Interval
The confidence interval of the effect size provides a range of plausible values within which the true effect size is likely to fall, with a specified level of confidence (commonly 95%). It reflects both the estimated effect size and the uncertainty around that estimate. In a meta-analysis, the confidence interval is calculated by first determining the margin of error (i.e., multiplying the standard error of the combined effect size by the critical value [Z]) and then adding and subtracting this value from the combined effect size to form the bounds of the interval.
- Lower confidence interval = 0.296 - (1.96 * 0.156) = -.009
- Upper confidence interval = 0.296 + (1.96 * 0.156) = .601

(Both bounds are calculated with unrounded intermediate values, matching Table 4.)
Step 4.7. Determine the Significance of the Effect
The confidence interval provides a range of plausible values for the true effect size across the studies. If the confidence interval does not include zero, the meta-analytic effect size is statistically significant (e.g., a positive or negative effect is likely real). If the confidence interval includes zero, the meta-analytic effect is not statistically significant. Table 4 provides an example:
- The meta-analytic effect size is .296. The lower bound of the confidence interval is -.009, while the upper bound is .601. The confidence interval includes zero, indicating that the meta-analytic effect size is not statistically significant.
Program | Outcome Domain | Meta-Analytic Effect Size | Meta Variance | Standard Error | Lower Bound | Upper Bound | Significant |
---|---|---|---|---|---|---|---|
1 | Arrest | .296 | .024 | .156 | -.009 | .601 | No |
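Pulling Steps 4.1 through 4.7 together, the following Python sketch reproduces the worked example using the composite effects and variances from Table 4. It is illustrative only; the variable names are ours.

```python
from math import sqrt

# Composite effect sizes and variances from Table 4 (one effect per study).
effects = [0.35, 0.20]
variances = [0.038, 0.067]

# Step 4.1: inverse variance weights
weights = [1 / v for v in variances]                            # ~26.32, ~14.93

# Step 4.2: inverse-variance-weighted effects
weighted_effects = [es * w for es, w in zip(effects, weights)]  # ~9.21, ~2.99

# Step 4.3: overall meta-analytic effect size
meta_es = sum(weighted_effects) / sum(weights)                  # ~0.296

# Step 4.4: variance of the meta-analytic effect size
meta_var = 1 / sum(weights)                                     # ~0.024

# Step 4.5: standard error
meta_se = sqrt(meta_var)                                        # ~0.156

# Step 4.6: 95% confidence interval (Z = 1.96)
lower = meta_es - 1.96 * meta_se                                # ~-0.009
upper = meta_es + 1.96 * meta_se                                # ~0.601

# Step 4.7: the interval includes zero, so the effect is not significant.
significant = not (lower <= 0 <= upper)
print(round(meta_es, 3), round(lower, 3), round(upper, 3), significant)
```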
The Final Rating: Combining Quality and Magnitude
In phases 1-4, the quality and magnitude of the effect are kept as separate entities in characterizing the effectiveness of a program’s intervention on a particular outcome. Specifically, the quality component describes the strength of the evidence base, while the effect size component describes the magnitude and statistical significance of the effect.
The last part in the rating process uses these two elements to categorize program outcomes into one of the following five groups:
- Effective.
  - The overall mean effect is positive and significant.
  - There are no individual negative effects.
  - The overall quality rating is greater than 2.1.
- Promising.
  - The overall mean effect is positive and significant.
  - There are no individual negative effects.
  - The overall quality rating is less than 2.1.
- Negative Effect.
  - The overall mean effect is negative and significant.
  - The overall quality rating is greater than 2.1.
- Ineffective.
  - The overall mean effect is not significant.
  - The overall quality rating is greater than 2.1.
- Inconclusive Evidence. The outcome does not meet the criteria for effective, promising, negative or ineffective.
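As a rough sketch of how these criteria combine, the function below encodes the decision rules exactly as listed above. The function and argument names are ours, not the official CrimeSolutions instrument, and the handling of a quality rating exactly equal to 2.1 is not specified by the list.

```python
def rate_outcome(mean_effect, significant, any_negative_effects, quality_rating):
    """Illustrative mapping from evidence quality and effect evidence to a rating.

    Encodes the five categories listed above; this is our sketch, not the
    official CrimeSolutions instrument.
    """
    if quality_rating > 2.1:
        if significant and mean_effect > 0 and not any_negative_effects:
            return "Effective"
        if significant and mean_effect < 0:
            return "Negative Effect"
        if not significant:
            return "Ineffective"
    elif quality_rating < 2.1:
        if significant and mean_effect > 0 and not any_negative_effects:
            return "Promising"
    return "Inconclusive Evidence"

# Hypothetical outcome: positive, significant effect, no negative individual
# effects, overall quality rating of 2.5 -> "Effective" under the rules as listed.
print(rate_outcome(0.30, significant=True, any_negative_effects=False,
                   quality_rating=2.5))
```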