with Brian Paciotti, Ph.D., M.S.
Risk adjustment is one of the most important areas in health care analytics modeling. By making more accurate predictions about who is most likely to have additional admissions or emergency visits, and who is at highest risk for particular diagnoses, we can improve patient health and outcomes while reducing health care costs.
There are many tools we can use in predicting risks, but we must be careful with our outcome variables! As the famous Obermeyer et al. paper shows us, modeling patient characteristics against the outcome of higher costs will leave behind under-utilizers of care and add bias to an already biased system.
Domain knowledge is an important advantage in creating good risk stratification and adjustment models. It helps us choose the most relevant methods and variables, rather than leaving us at the mercy of variable selection procedures, which may choose variables with only spurious correlations or drop variables that have a real-world bearing on the outcome. However small their effect, such variables should be included in a model so that their influence on the outcome isn't improperly assigned to another, related variable.
Additionally, predictions related to the risk of health care utilization may be difficult to make. We may have data points that relate directly - or by proxy - to these outcomes; however, we also know that many, many factors influence both the exacerbation of health conditions and whether or not an individual seeks treatment. A typical classification rule labels anyone with a predicted probability above 50% as a likely utilizer, but we may choose to produce more robust predictions by focusing on the members with the highest predicted risk - say, the top 1% or top 10%.
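For example, one simple way to flag the highest-risk members is to rank the scored data by predicted probability and keep the top decile. The sketch below assumes a scored dataset and a predicted-probability variable; ER_SCORED, ER_RANKED, ER_TOP10, and PRED_PROB are placeholder names, not from the class exercise.
/* Hypothetical sketch: flag the top 10% of predicted ED-visit risk.          */
/* ER_SCORED / PRED_PROB are assumed names for a scored dataset and its       */
/* predicted probability column - substitute whatever your model produced.    */
proc rank data=ER_SCORED out=ER_RANKED groups=100 descending;
   var PRED_PROB;
   ranks RISK_PCTL;                  /* 0 = highest-risk percentile (DESCENDING) */
run;
data ER_TOP10;
   set ER_RANKED;
   TOP10_FLAG = (RISK_PCTL < 10);    /* 1 = member of the top 10% risk group */
run;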
The first foray into a predictive model is often a logistic regression model, where the predicted outcome is binary, and may attempt to answer questions such as, "Will the patient have another admission in the next thirty days or not?", or "Will the patient have an emergency room visit in the next thirty days or not?"
Before the prediction model is created, students must understand the data, prepare it using SAS data steps, and perform variable formatting and recoding.
For this exercise, we gave students access to the Medicare data formatted for the prediction of ED Visits based on chronic condition information by Matthew Gillingham in the class reference, "SAS Programming with Medicare Administrative Data".
Gillingham ED Data
Refine the Dataset for Prediction
I encouraged students to use their domain knowledge when choosing possible model covariates, and to consider which variables and conditions are most likely to have an impact on ED Visits. I also encouraged them to consider the relationships between covariates and outcomes. For example, being male has a known relationship to higher-risk behaviors, which may cause an ED Visit, and also to a greater likelihood of developing cardiovascular disease. So one or more of these covariates' beta coefficients may be inflated by an additional, unmeasured influence.
The STEPWISE selection method, when used, shows a correlation between the average number of Medicare coverage months and the outcome of an emergency visit. The reason wasn't intuitive to me, but I encouraged students to consider what might be behind this relationship; what signal could we be picking up? Care access, for example?
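For reference, here is a hedged sketch of how stepwise selection might be requested in PROC LOGISTIC. The dataset and variable names mirror the HPGENSELECT example later in this post, but this exact call was not part of the original exercise.
/* Illustrative sketch only: stepwise selection on the training data.         */
/* Dataset and variable names follow the HPGENSELECT example further below.   */
proc logistic data=ER_TRAIN;
   class SEX RACE PLAN_CVRG_MOS_NUM STATE AGE_CATS ESRD_IND / param=ref;
   model ER_OUTCOME(event='1') = SEX RACE PLAN_CVRG_MOS_NUM STATE AGE_CATS
         ESRD_IND DEATH_2010 SP_ALZHDMTA SP_CHF SP_CHRNKIDN SP_COPD
         SP_DEPRESSN SP_DIABETES SP_ISCHMCHT SP_STRKETIA
         / selection=stepwise slentry=0.05 slstay=0.05;
run;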
What about relationships between counties and states and these outcomes? If particular counties and/or states don't appear to have a correlation with increased likelihood of ED visits, does that mean that the beneficiaries are healthier in those counties? What if counties lack a significant number of beneficiaries? What could be the cause?
My prior research indicates there are many very unhealthy counties in our nation with low Medicare utilization, because the median ages in these areas are low. People in these counties may not live long enough to become Medicare eligible, even if they would access the benefit, which many in, for example, indigenous communities, won't.
Formatting Indicator and Outcome Variables
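The DATA step below is a minimal sketch of this preparation. It assumes the chronic condition flags arrive coded 1 = yes / 2 = no (as in the CMS claims extracts) and that a count of ED visits is available to dichotomize into the outcome; the dataset names and ER_VISIT_CNT are placeholders, not names from the class materials.
/* Minimal sketch of indicator/outcome formatting. Assumes chronic condition  */
/* flags are coded 1 = yes, 2 = no and that ER_VISIT_CNT holds the count of   */
/* ED visits in the follow-up window. Dataset names are placeholders.         */
data ER_FORMATTED;
   set GILLINGHAM_ED_RAW;
   /* Recode 1/2 condition flags to 1/0 indicators */
   array conds {*} SP_ALZHDMTA SP_CHF SP_CHRNKIDN SP_COPD SP_DEPRESSN
                   SP_DIABETES SP_ISCHMCHT SP_STRKETIA;
   do i = 1 to dim(conds);
      conds{i} = (conds{i} = 1);
   end;
   /* Binary outcome: any ED visit vs. none */
   ER_OUTCOME = (ER_VISIT_CNT > 0);
   drop i;
run;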
A Proposed Model
Students needed to divide the data into Training, Validation, and Testing sets.
We emphasized the importance of avoiding information leakage when creating these training, validation, and testing sets, of refining models using only the training and validation sets, and of producing the final model scores using only the testing data set.
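A simple way to create the split in SAS is a DATA step with a seeded random draw. The 60/20/20 proportions below are an assumption for illustration, not the split required in class, and ER_FORMATTED is a placeholder name for the prepared analysis dataset.
/* Sketch: seeded 60/20/20 split into training, validation, and testing sets. */
data ER_TRAIN ER_VALID ER_TEST;
   set ER_FORMATTED;
   if _n_ = 1 then call streaminit(20230423);  /* fixed seed for reproducibility */
   _u = rand('uniform');
   if _u < 0.60 then output ER_TRAIN;
   else if _u < 0.80 then output ER_VALID;
   else output ER_TEST;
   drop _u;
run;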
LASSO was introduced as a form of regularization / bias reduction, and a possible variable selection method. Students were encouraged to try to apply LASSO to proposed models using HPGENSELECT. Here is an example of this approach, using the following variables:
Sex
Race (Self-identified)
# of Coverage Months
State
Age Category
End-Stage Renal Disease Indicator (Yes or No)
Death Indicator (Yes or No)
Alzheimer's Disease Indicator (Yes or No)
Congestive Heart Failure Indicator (Yes or No)
Chronic Kidney Disease Indicator (Yes or No)
Chronic Obstructive Pulmonary Disease Indicator (Yes or No)
Depression Indicator (Yes or No)
Diabetes Indicator (Yes or No)
Ischemic Heart Disease Indicator (Yes or No)
and Stroke / TIA Indicator (Yes or No)
/* LASSO variable selection on the training set using PROC HPGENSELECT */
proc hpgenselect data=ER_TRAIN;
   class SEX RACE PLAN_CVRG_MOS_NUM STATE AGE_CATS ESRD_IND;
   model ER_OUTCOME(event='1') = SEX RACE PLAN_CVRG_MOS_NUM STATE AGE_CATS
         ESRD_IND DEATH_2010 SP_ALZHDMTA SP_CHF SP_CHRNKIDN SP_COPD
         SP_DEPRESSN SP_DIABETES SP_ISCHMCHT SP_STRKETIA / dist=binary;
   selection method=lasso(choose=SBC) details=all;  /* LASSO path; choose model by SBC */
   performance details;                             /* report performance details      */
   code file="/home/u45073192/hca203/model_output.sas"; /* write DATA step scoring code */
run;
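Because the CODE statement writes DATA step scoring code to the file named above, that code can later be %INCLUDEd to score held-out data. A sketch is below; ER_VALID and ER_VALID_SCORED are assumed dataset names, and the predicted-probability variable created by the generated code is typically named P_ER_OUTCOME.
/* Apply the scoring code written by the CODE statement to the validation set */
data ER_VALID_SCORED;
   set ER_VALID;                                        /* assumed validation dataset */
   %include "/home/u45073192/hca203/model_output.sas";  /* generated scoring code     */
run;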
LASSO selection removed Race, State, Death Indicator, # of Coverage Months, and End-Stage Renal Disease Indicator, while retaining the other variables. The retained variables were used in the final logit model, the ORs for which are shown to the left.
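As a sketch of what that final model might look like, PROC LOGISTIC can be fit on the LASSO-retained variables to obtain the odds ratios and Wald confidence limits directly; the exact call used in class may have differed.
/* Sketch: final logit model on the variables retained by LASSO selection */
proc logistic data=ER_TRAIN;
   class SEX AGE_CATS / param=ref;
   model ER_OUTCOME(event='1') = SEX AGE_CATS
         SP_ALZHDMTA SP_CHF SP_CHRNKIDN SP_COPD SP_DEPRESSN SP_DIABETES
         SP_ISCHMCHT SP_STRKETIA / clodds=wald;
run;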
Students were provided with many ways to interpret the model and judge its quality. Interpretation was discussed using a combination of beta coefficient estimates, p-values, and ORs with confidence intervals.
Students were acquainted with the fact that when alpha = 0.05, we accept roughly a 5% chance that an apparent relationship between a covariate and the outcome is a 'false positive' - that is, a relationship this strong could appear about 5% of the time even if none exists. A significant p-value alone also isn't enough to conclude that a variable has a meaningful impact on the probability of the outcome of interest.
Take the estimated beta coefficient for SEX (the change in the log-odds of an ED visit as sex goes from male to female). This value was very small: 0.0286. Coupled with a borderline p-value of 0.0471 (barely under 0.05), we could conclude that this variable has little practical impact on the probability of an ED visit.
On the other hand, SP_ISCHMCHT (the change in the log-odds of an ED visit as having ischemic heart disease goes from 0 - not having it - to 1 - having it) had a beta estimate of 0.2921 and a p-value of <0.0001. Exponentiating the coefficient gives the odds ratio, exp(0.2921) ≈ 1.34, so having Ischemic Heart Disease is estimated to increase the odds of an ED visit by about 34%.
Wald confidence intervals and/or p-values associated with odds ratios represent a hypothesis test of whether a covariate has any significant impact on the probability of an ED visit. For example, although the OR estimate for SP_STRKETIA (the change in the odds of an ED visit when a person has had a stroke or TIA) indicated a 12% increase in ED visit likelihood, the 95% confidence interval crossed 1 (its lower bound was 0.971), so we could not conclude that SP_STRKETIA has a significant impact on the probability of an ED visit.
But is the logit model really any good at predicting ED visits? The answer comes from a combination of the c-statistic (equivalently, the AUC) and the Confusion Matrix, along with the scores that can be derived from it.
The c-statistic or AUC for the ED Visit prediction model is 0.7264, as shown on the right.
This indicates that, if we pick one beneficiary who visited the ED and one who didn't, the model ranks the ED visitor as higher risk about 73% of the time.
At first, this seems pretty good! However, looking more closely at the model's confusion matrix provides a more accurate picture of the model's true ability to differentiate those who will probably go to the ED from those who probably won't.
We can use our domain-knowledge to anticipate that predicting ED visits is likely to be difficult. Many things can contribute to the likelihood of an ED visit. While the presence of chronic diseases and increasing age would certainly play a part, there would also be plenty of additional - sometimes, random - events that would contribute to increasing - or decreasing - ED visit probability.
The Confusion Matrix, on the left, shows the TRUE outcome (person goes to the ED (1), or doesn't (0)) versus the outcome predicted by the model (_INTO_). In the upper left quadrant, where both the truth and the prediction are 0, there are 26,418 people who did not go to the ED and were predicted NOT to go; they’re called TRUE NEGATIVES.
Right next to that cell, on the upper right, the truth is still 0 (No ED Visit) but the model guessed 1 (Positive for an ED Visit) – as you might suppose, that’s called a FALSE POSITIVE, because the model guessed positive, but that was incorrect. There were 731 of those.
Now, on the lower left we have 6,353 cases where the truth was 1 (Positive for an ED Visit) but the model guessed 0 (No ED Visit); so those are the FALSE NEGATIVES.
Lastly, in the lower right are the TRUE POSITIVES, where the model guessed positive, and the beneficiary indeed had an ED Visit. There are 728 of those.
In the picture above, the green highlights where the model was RIGHT, and the yellow shows where it was WRONG. We showed students how to use these values to calculate almost everything they would ever want to know about how good the model is (or isn’t) at predicting ED visits.
The following metrics can be calculated for a Logistic Regression / Binary Classification model:
1. Accuracy: The accuracy is the % of samples CORRECTLY CLASSIFIED by the model out of all the samples in the set. The formula is (TP + TN) / N, or (TRUE POSITIVES + TRUE NEGATIVES) / Total # in the Sample.
2. Precision (for the Positive Class): Precision is the # of samples that actually belong to the positive class out of all the samples PREDICTED to belong to it. The formula is TP / (TP + FP), or TRUE POSITIVES / (TRUE POSITIVES + FALSE POSITIVES).
3. Sensitivity OR Recall (for the Positive Class): Recall, or Sensitivity, is the # of samples PREDICTED CORRECTLY as belonging to the POSITIVE class out of all the samples that actually belong to the positive class. The formula is TP / (TP + FN), or TRUE POSITIVES / (TRUE POSITIVES + FALSE NEGATIVES). This tells us how well the model predicts the POSITIVE class, or outcome.
4. F1 Score: The harmonic mean of the precision and recall scores obtained for the positive class: 2 * (Precision * Recall) / (Precision + Recall); generally considered a good overall evaluation metric for most classification models.
5. Specificity: The # of samples PREDICTED CORRECTLY as belonging to the NEGATIVE class out of all the samples that actually belong to the negative class: TN / (TN + FP) – tells us how well the model predicts the negative class / outcome.
The students were able to calculate all these scores for the ED visit model, as below -
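As a check on those hand calculations, the confusion matrix counts reported above can be plugged into a short DATA step:
/* Compute the classification metrics from the confusion matrix counts above */
data metrics;
   TN = 26418; FP = 731; FN = 6353; TP = 728;
   N  = TN + FP + FN + TP;
   Accuracy    = (TP + TN) / N;
   Precision   = TP / (TP + FP);
   Recall      = TP / (TP + FN);        /* Sensitivity */
   F1          = 2 * (Precision * Recall) / (Precision + Recall);
   Specificity = TN / (TN + FP);
run;
proc print data=metrics noobs;
   format Accuracy Precision Recall F1 Specificity 6.3;
run;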
Although the model's c-statistic is better than the traditional cut-off of 0.7, when we reflect on the Confusion Matrix and the metrics we calculated, we can see that the model is very good at identifying NEGATIVE cases (people who did NOT go to the ED), but poor at identifying POSITIVE cases (those who did go): only about half of its positive predictions are correct, and it catches only about one in ten of the people who actually visited the ED.
Students concluded that while they could use this model to predict who WOULDN’T be at the ED, they probably should not use it to predict who WOULD be!
In general, we want our models to have both high Sensitivity and high Specificity, because this signifies that the model accurately identifies both positive and negative outcomes, and that the variables we have help us do this.
However, depending on what we're using the model to predict, we might care a lot more about one metric than another. For example, in this case, it would be much more useful to be able to predict who would go to the ED than who would not. So, the score we'd like to maximize would likely be Precision or Recall. On the other hand, consider a case where we need to differentiate those who are not infected from those who are. Then, we wish to focus on accuracy in predicting negative cases, and we would look closely at Specificity.
Teaching Risk Adjustment via Logistic Regression classification / prediction models allows us to explore all the important steps of model building, coding, interpretation, and evaluation.