CrimeSolutions uses rigorous evaluation research to inform practitioners and policymakers about what works. Because we set a high bar for the scientific rigor of the evaluations we include in CrimeSolutions, many evaluations do not meet our criteria, and we are unable to rate the evaluated programs.
In addition to using strong evaluation methods, researchers should include clear descriptions of their studies so that CrimeSolutions reviewers can easily determine how to assess and characterize elements of the evaluation.
This page will take you step-by-step through the elements that CrimeSolutions reviewers consider when they screen an evaluation and complete the program scoring instrument.[1]
For a more detailed look at the program scoring instrument, including calculations, see Calculating Outcome-based Program Ratings.
Screening Principles
Intervention Focus and Outcomes
Reviewers require that the primary aim of the intervention fall within the scope of CrimeSolutions, which means the program must:
- Aim to prevent or reduce crime, delinquency, or related problem behaviors (such as aggression, gang involvement, or school attachment).
- Aim to prevent, intervene in, or respond to victimization.
- Aim to improve justice systems or processes.
- Target a population of individuals who are involved in the justice system or at risk of becoming involved in the justice system.
The evaluation must report on at least one of the following eligible outcomes:
- Crime, delinquency, or overt problem behaviors, including aggression, gang involvement, or substance use, which may be presented as individual behaviors, community-level behaviors, or crime rates.
- Victimization.
- Justice system practices or policies.
- Risk factors for crime and delinquency, including school failure, psychological problems, or mental illness.
In addition, the evaluation must include at least one behavioral outcome.
Study Strength and Design
If a program’s goals and evaluated outcomes fit within the scope of CrimeSolutions, reviewers use the program scoring instrument to assess the study’s strength and evaluate the direction and magnitude of the outcome findings.
The Program Scoring Instrument
The program scoring instrument is designed to assess programs at the outcome level, since programs may impact outcomes differently. The scoring instrument has five parts:
- Conceptual Framework
- Fidelity of Program Implementation
- Internal Validity
- Outcome Measures
- Effect Sizes
Conceptual Framework
The conceptual framework of an intervention program outlines the underlying theories, assumptions, and relationships guiding the program's design and connects the program’s objectives to the specific activities intended to achieve the desired outcomes.
In the conceptual framework section, reviewers consider the program’s description, theory of change, and duration.
Program Description. A full description serves as a guide for understanding the implementation of the program. The description should delineate five items:
- A list of key activities.
- The frequency or duration of key activities.
- The targeted population.
- The targeted behavior(s) (i.e., the intent of the program).
- The setting.
If all five items are present, the program is considered fully described.
Theory of Change. Reviewers consider whether the evaluation describes why and how the intervention should be expected to produce change. The theory of change may be explicit or implicit. An explicit program theory is formally documented and communicated as part of the program description; an implicit program theory generally is unstated and appeals to common sense. Evaluators are strongly encouraged to describe the theory of change, regardless of whether the program itself states it explicitly or leaves it implicit. Once reviewers determine whether the theory of change is described, and whether it is explicit or implicit, they evaluate whether it is sound. That is, they assess whether the theory is supported by substantial and robust empirical evidence from testing through observation or experiment, and whether the findings consistently support the theoretical framework.
Duration. Reviewers enter the intervention’s duration in hours, days, months, or years, recognizing that duration can vary significantly with a program’s nature, scope, objectives, and the needs of the target population. The duration selected does not affect the outcome ratings, but it is included as descriptive information in CrimeSolutions program profiles.
Fidelity of Program Implementation
In this section, reviewers consider program fidelity, which is the degree to which an implemented program adheres to its original design. Fidelity is important because a program may fail to show an effect simply because the intervention was not delivered as specified. Reviewers assess the degree to which the program was implemented as intended by its developers or designers based on three factors:
- Documentation. Did the authors provide any information on fidelity?
- Measurement. How did the authors document fidelity? Reviewers score this element by indicating that there was no measure or that the measure was anecdotal, qualitative, or quantitative.
- Adherence. Was the intervention implemented with fidelity? Reviewers score this element by indicating that adherence to the program appears satisfactory or poor, or that they were unable to assess adherence based on the information available.
Internal Validity
Reviewers determine whether one or more of five threats to internal validity are present in the study — selection bias, contamination bias, regression toward the mean bias, history bias, and maturation — or they indicate that not enough information was provided to make this determination. The five threats are defined as follows:
- Selection bias occurs when the process of selecting or assigning subjects to different groups within a study results in systematic differences between the groups. It can distort the relationship between the intervention and the outcomes, making it difficult to draw accurate conclusions about cause and effect within the study.
- Contamination occurs when subjects in one group within a study are inadvertently exposed to elements associated with another group, leading to mixing or blending of interventions or conditions. It can occur through interactions between participants, a commingling of study materials or protocols, or unintended spill-over effects from adjacent groups.
- Regression toward the mean is the tendency for extreme values of a variable to move closer to the mean (average) upon subsequent measurement.
- History bias occurs when an external event or historical factor influences the outcomes being measured. History bias is not a threat to internal validity if it affects both groups equally.
- Maturation occurs when the observed changes in the study subjects can be attributed to natural developmental processes rather than to the intervention or treatment being studied. Maturation is not a threat to internal validity if it affects both groups equally.
Outcome Measures
Reviewers assess the validity and reliability of each outcome measure described in the evaluation and selected for review. Outcome measures can take various forms depending on the nature of the study. Valid and reliable outcome measures are important to ensure the accuracy and credibility of research findings.
Outcome category. Research assistants assign each outcome to an outcome category selected from a list managed by CrimeSolutions. The categories are entered in the program profile to help CrimeSolutions visitors search for relevant programs.
Data source. Reviewers note the source of the outcome data by selecting from a list of options, including self-report; reports from parents, teachers, peers, or other observers; official/administrative records (such as police or court records); and specimen or medical tests (such as urinalysis). If the data source is unknown, reviewers indicate that not enough information was provided to determine the source.
Direction. Reviewers note the direction of the outcome by indicating whether a higher value of the outcome represents a positive or negative result.
Measurement validity. Reviewers determine whether the measure is valid, and, if so, whether it has face or construct validity or if there is not enough information to make that determination. The type of validity reviewers select does not affect the outcome scores.
Effect Size
This section of the instrument collects information needed to calculate the effect size of each assessed outcome.[2] There are various types of effect sizes, depending on the analytic technique used and the scale of the outcome. For continuous measures, the most commonly reported effect size metric is the standardized mean difference — the difference between the means on an outcome variable across two groups, expressed in standard deviation units. For dichotomous outcomes, the most commonly reported effect size metric is the odds ratio — the ratio of the odds that an outcome will occur in one group to the odds that it will occur in the other. Other standardized effect size measures include the correlation coefficient and the relative risk (risk ratio).
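As a rough illustration of these two metrics (a sketch using hypothetical numbers and function names, not the instrument’s internal calculation), the standardized mean difference can be computed from group means and standard deviations, and the odds ratio from event counts:

```python
# Illustrative sketch only; values and function names are hypothetical and this is
# not the CrimeSolutions scoring instrument's own calculation.
import math

def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Difference between group means expressed in pooled standard deviation units."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds of the outcome in the treatment group divided by the odds in the comparison group."""
    return (events_t / (n_t - events_t)) / (events_c / (n_c - events_c))

# Hypothetical summary data in which lower scores and fewer events favor the treatment group.
print(standardized_mean_difference(12.0, 15.0, 5.0, 5.5, 100, 100))  # ~ -0.57
print(odds_ratio(20, 100, 35, 100))                                   # ~ 0.46
```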
Post-intervention assessment period. Reviewers note the post-intervention assessment period or indicate that not enough information was provided to identify it. The assessment period generally is the amount of time between the beginning of the intervention and the assessment. Reviewers enter the specific assessment period in either hours, days, months, or years.
Assigned sample size. Reviewers note the assigned sample size of the intervention and comparison groups.
Analytic sample size. Reviewers determine the analytic sample size, noting the sample size of the intervention and comparison groups.
Assignment type. Reviewers assess whether the subjects are allocated randomly or nonrandomly to different groups in the study or indicate that they are unable to determine the assignment type. A random assignment type increases the strength of the program’s evaluation score.
Assignment level. Reviewers determine whether the assignment was at the individual or cluster level (e.g., facility, school, county, state).
Baseline outcome differences. The outcomes of the treatment and control groups should be equivalent at baseline to ensure that any observed differences after the intervention can be attributed to the intervention itself, rather than to preexisting differences between the groups. For this item, reviewers consider whether the evaluation design or analysis accounts for any observed group differences at baseline.
Potential confounding variables. Controlling for confounds in a program evaluation is crucial because confounding variables can distort the true relationship between the program and its outcomes. Reviewers determine whether the evaluation design or analysis accounts for potential confounding variables.
Analytic approach. Reviewers assess whether the evaluator used an appropriate analytic approach, given the type of data in the study. For example, continuous outcomes — values that can fall within a range — can be analyzed using techniques such as regression or analysis of variance (ANOVA) models, while dichotomous outcomes — those with two categories — require specialized methods such as the chi-square test or logistic regression.
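A minimal sketch of this distinction, using hypothetical data and standard scipy routines rather than anything drawn from the instrument itself, might look like this:

```python
# Hypothetical example: match the statistical test to the outcome's scale.
from scipy import stats

# Continuous outcome (e.g., a behavior score): two-sample t-test
# (a one-way ANOVA with two groups gives the equivalent result).
treatment_scores = [12, 9, 14, 11, 10, 13]
comparison_scores = [15, 16, 13, 17, 14, 15]
t_stat, p_value = stats.ttest_ind(treatment_scores, comparison_scores)

# Dichotomous outcome (e.g., rearrested yes/no): chi-square test on a 2x2 table;
# logistic regression would be used instead when covariates must be controlled.
table = [[20, 80],   # treatment: rearrested, not rearrested
         [35, 65]]   # comparison: rearrested, not rearrested
chi2, p, dof, expected = stats.chi2_contingency(table)
```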
Crime displacement. Reviewers determine whether it was relevant and necessary for the evaluator to assess crime displacement and diffusion effects; this question applies primarily to place-based studies. If it was relevant, reviewers then determine whether the evaluator assessed for displacement and diffusion effects and adjusted the main outcome effects for any that were observed.
Author-reported outcome data. Reviewers indicate whether the evaluator’s findings for an outcome were statistically significant, favoring the treatment group; nonsignificant, favoring the treatment group; nonsignificant, favoring the comparison group; or statistically significant, favoring the comparison group.
Outcome measure scaling. Reviewers assess whether the scale of the outcome measure is continuous or dichotomous.
Continuous effect size. If the effect size is continuous, reviewers select the most appropriate effect size calculation method from one of the following options:
- Unstandardized regression coefficient, standard error of the unstandardized regression coefficient, and unadjusted standard deviations.
- Standardized regression coefficient, standard error of the standardized regression coefficient, and unadjusted standard deviations.
- Adjusted means, unadjusted standard deviations, squared multiple correlation between the predictors and the outcome (ANCOVA).
- Adjusted means, unadjusted standard deviations, ANCOVA mean squared error.
- Adjusted means, unadjusted standard deviations, no information about the standard error of the adjusted effect.
- T-statistic of the treatment-comparison means difference from a covariate-adjusted analysis.
- Unadjusted means and standard deviations.
- Unadjusted means and standard errors of means.
- Independent groups t-test (t-statistic).
- Independent groups t-test (exact p-value).
- F-statistic from a one-way analysis of variance (ANOVA).
- Author-reported effect size and the variance OR the standard error.
- Not enough information to calculate an effect size.
The program scoring instrument calculates the effect size of each assessed outcome based on the option selected for continuous effect size and the data entered throughout the instrument.
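For illustration, two of the routes listed above can be sketched using standard meta-analytic conversion formulas; the function names are hypothetical, and this is not the instrument’s internal code:

```python
# Hypothetical sketch: converting reported test statistics to a standardized mean difference.
import math

def d_from_t(t_stat, n_t, n_c):
    """Effect size from an independent-groups t-statistic and the two group sizes."""
    return t_stat * math.sqrt(1 / n_t + 1 / n_c)

def d_from_f(f_stat, n_t, n_c, treatment_mean_higher=True):
    """Effect size from a one-way ANOVA F-statistic with two groups; the sign comes from
    which group had the higher mean, since F itself carries no direction."""
    d = math.sqrt(f_stat * (1 / n_t + 1 / n_c))
    return d if treatment_mean_higher else -d
```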
Dichotomous effect size. If the effect size is dichotomous, reviewers identify the most appropriate effect size calculation method by selecting one of the following options:
- Adjusted percentages.
- Logistic regression coefficient and its standard error from a multi-predictor model.
- Odds ratio for an intervention indicator from a multi-predictor logistic regression model and the standard error of the logistic regression coefficient.
- Covariate-adjusted means from a linear probability model.
- Event counts.
- Event percentages.
- Logistic regression coefficient and its standard error from a single predictor model.
- Odds ratio for an intervention indicator from a single-predictor logistic regression model and the standard error of the logistic regression coefficient.
- Author-reported.
The scoring instrument calculates the effect size of each assessed outcome based on the option selected for dichotomous effect size and the data entered throughout the instrument.
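As a hypothetical illustration of the event-counts option, an odds ratio and the standard error of its logarithm (the variance term needed alongside the effect size) can be computed from a 2x2 table using the standard formulas; this is a sketch, not the instrument’s own calculation:

```python
# Hypothetical sketch, assuming the standard 2x2-table formulas.
import math

def odds_ratio_from_counts(a, b, c, d):
    """a/b = treatment events/non-events, c/d = comparison events/non-events."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    return odds_ratio, se_log_or

# 20 of 100 treatment subjects and 35 of 100 comparison subjects experienced the event.
or_value, se = odds_ratio_from_counts(20, 80, 35, 65)  # OR ~ 0.46
```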
Cluster adjustment. A cluster adjustment is required if both of the following are true:
- The level of analysis does not match the level of assignment (e.g., schools are randomized into treatment and comparison conditions, but the student is the unit of analysis).
- The analysis does not account for clustering (e.g., through multilevel modeling or by adjusting means and SDs for clustering).
Reviewers indicate whether a cluster adjustment is required by selecting one of the following options:
- No, the level of analysis matches the level of assignment.
- No, the authors provided adjustments for clustering (e.g., multi-level modeling or adjusting means and SDs for clustering).
- Yes, the design is clustered, but the authors provided unadjusted summary statistics. If reviewers select this option, they then enter the intraclass correlation coefficient.
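Assuming the common design-effect correction (one reasonable way to adjust unadjusted statistics for clustering, not necessarily the exact formula the instrument applies), the intraclass correlation coefficient inflates the variance roughly as follows:

```python
# Hypothetical sketch of a design-effect correction for a clustered design.
def design_effect(avg_cluster_size, icc):
    """Variance inflation factor for a clustered design: 1 + (m - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n, avg_cluster_size, icc):
    """Sample size after deflating for clustering."""
    return n / design_effect(avg_cluster_size, icc)

# Example: 900 students in 9 schools (average cluster size 100) with ICC = 0.05
# behave roughly like an unclustered sample of about 151 students.
print(effective_sample_size(900, 100, 0.05))
```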
Total number of clusters. Reviewers enter the total number of clusters. For example, if five facilities were assigned to the intervention group and four facilities were assigned to the comparison group, the total number of clusters would be nine.