US Health Insurance Marketplace

Apr 29, 2020

This is a classification project, my third project at Istanbul Data Science Academy.

Health care in the United States can be very expensive. The main purpose of health insurance is to reduce these costs to more reasonable, affordable amounts. I wanted to find out which type of insurance, dental or medical, is preferred depending on tobacco usage.

I took the data from here and used three datasets: Benefits and Cost Sharing, Rate, and Plan Attributes.

Exploratory Data Analysis (EDA)

Age: Covers ages 0 to 64 plus family options. I grouped the ages into reasonable ranges.
Tobacco: Contains two kinds of values: Tobacco User/Non-Tobacco User and No Preference.
Benefit Name: The name assigned to the benefit.
Coverage: The coverage status of the benefit. Blank values are equivalent to ‘Not Covered’.
In-Network Payments: Indicates whether the cost associated with this benefit is excluded from the in-network maximum out-of-pocket payment total.
Out of Network Payments: Indicates whether the cost associated with this benefit is excluded from the out-of-network maximum out-of-pocket payment total.
EHB: Indicates whether the benefit is considered an essential health benefit.
Dental Only Plan: Indicates whether the plan is dental-only.
Individual Rate: Dollar value for the insurance premium cost applicable to a non-tobacco user for the insurance plan.
Individual Tobacco Rate: Dollar value for the insurance premium cost applicable to a tobacco user for the insurance plan.

According to the correlation heatmap, there is a strong negative correlation between my target column Metal Level and Dental Only Plan. If I include this column in the model, the model overfits: the training accuracy reaches about 99%, but the test accuracy is only 50–60%. In addition, Individual Rate and Individual Tobacco Rate do not affect the model's performance much.
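To illustrate the kind of check a correlation heatmap supports, here is a minimal sketch that computes the Pearson correlation between each feature and the target and flags the strongly correlated ones. The toy values, the column names `dental_only_plan` and `individual_rate`, and the 0.9 threshold are assumptions made for the example, not the project's data or settings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongly_correlated(features, target, threshold=0.9):
    """Return the names of features whose |r| with the target exceeds threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]


# Hypothetical toy data: the first feature is the exact mirror image of the target.
toy_target = [1, 1, 0, 1, 0, 0]
toy_features = {
    "dental_only_plan": [0, 0, 1, 0, 1, 1],
    "individual_rate": [30, 25, 28, 31, 27, 26],
}

flagged = strongly_correlated(toy_features, toy_target)
print(flagged)
```

A feature flagged this way is a candidate for being dropped before training, which is the decision described above for Dental Only Plan.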

Imbalanced Data

If we look at the image of the data grouped by the target column Metal Level, we can see that the classes are not well balanced: 2% of the values are dental and 98% are medical.

Machine learning techniques such as Decision Tree and Logistic Regression are biased towards the majority class. To handle the imbalanced class distribution, I used the NearMiss algorithm, which undersamples the majority class down to the size of the minority class. As a result, both classes have an equal number of entries.
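The idea behind NearMiss can be sketched in a few lines of plain Python. This is a simplified version of the NearMiss-1 rule (keep the majority points with the smallest mean distance to their k nearest minority points). It is not the library implementation, and the post does not say which NearMiss variant or parameters were used, so the function name `near_miss_1`, the value of `k`, and the toy points are all assumptions.

```python
import math


def near_miss_1(majority, minority, k=3):
    """Undersample `majority` to len(minority) points.

    Keeps the majority points whose mean distance to their k nearest
    minority points is smallest (the NearMiss-1 selection rule).
    """
    k = min(k, len(minority))

    def mean_dist_to_nearest(point):
        dists = sorted(math.dist(point, m) for m in minority)
        return sum(dists[:k]) / k

    ranked = sorted(majority, key=mean_dist_to_nearest)
    return ranked[: len(minority)]


# Hypothetical 2-D toy data: 5 majority points, 2 minority points.
toy_minority = [(0.0, 0.0), (1.0, 0.0)]
toy_majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (-4.0, 0.0)]

kept = near_miss_1(toy_majority, toy_minority, k=2)
print(kept)
```

After the call, `kept` has as many points as the minority class, so a model trained on `kept` plus `toy_minority` sees both classes equally often.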

Model Selection

First, I tried 10 algorithms, including ensemble models: KNN, Logistic Regression, SVC, Naive Bayes, Decision Tree, Random Forest, Ada Boost, etc. I compared not only the train and test accuracy scores but also the precision, recall, and F1 scores. Then, I plotted the ROC curves with their AUC scores.
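For readers unfamiliar with the AUC score, it can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it directly from that definition. The labels and scores are made-up toy values, not results from this project.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for six examples.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why the score is convenient for comparing many models on one plot.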

According to the graph, the first four models did not perform well, while the remaining models did. Next, I examined whether I could further optimize the successful models using grid search and randomized search.

After tuning the parameters, I put the optimized models and the other models together in one table. We can see that the three best models are Decision Tree, Random Forest, and Extra Trees Classifier.

Finally, I used a confusion matrix to check how well these three models performed and decided that the best model is Random Forest.
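Grid search simply tries every combination of the candidate parameter values and keeps the best-scoring one. Here is a minimal sketch of that loop. The `toy_score` function is a made-up stand-in for "train the model with these settings and return its validation score", and the parameter names and values in `toy_grid` are assumptions, not the grid used in this project.

```python
import itertools


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(max_depth, n_estimators):
    """Hypothetical score surface that peaks at max_depth=8, n_estimators=200."""
    return 1.0 - 0.01 * abs(max_depth - 8) - 0.001 * abs(n_estimators - 200)


toy_grid = {"max_depth": [4, 8, 16], "n_estimators": [100, 200, 400]}

best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params, best_score)
```

Randomized search follows the same pattern but evaluates only a random sample of the combinations, which is cheaper when the grid is large.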

True Positive: 2712

False Negative: 583

False Positive: 1288

True Negative: 2007
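The four counts above are enough to derive the usual summary metrics. The short sketch below applies the standard formulas to exactly these numbers; the variable names are only illustrative.

```python
# Confusion-matrix counts as listed above.
tp, fn, fp, tn = 2712, 583, 1288, 2007

total = tp + fn + fp + tn
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts the accuracy is about 0.72, the precision about 0.68, the recall about 0.82, and the F1 score about 0.74.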

You can see the code for the project here. You can also look at a quick report of this project on Tableau here.

Written by Yağmur Bali