Analysis on Health Data for Medical Insurance Cost

  • Tech Stack: Python, Causal Inference, Statistical Regression, Machine Learning, XGBoost
  • GitHub URL: Project Link

This study analyzed health insurance data to identify the factors that affect insurance premium prices. Causal inference was conducted using Welch's t-test, which identified age, chronic disease history, and surgery history as significant factors. Among the regularized regression models used for predictive analysis, Ridge regression achieved the highest R² and the lowest RMSE, and age showed the strongest relationship with premium price.
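As a rough illustration of the group comparison described above, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the premium values are hypothetical and are not taken from the project's data or code. In practice a p-value would usually come from a statistics library, for example `scipy.stats.ttest_ind` with `equal_var=False`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Hypothetical helper for illustration; it does not assume equal
    variances or equal sample sizes in the two groups.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (unbiased sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up premium prices for two groups (illustrative only).
with_chronic = [28000, 31000, 29000, 35000, 30000]
without_chronic = [21000, 23000, 25000, 22000, 24000]

t_stat, dof = welch_t(with_chronic, without_chronic)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A large absolute t value relative to the t distribution with `dof` degrees of freedom suggests the group means differ. On its own, the test shows an association between a factor and the premium, not a causal effect.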

Clustering revealed two potential groups with respect to diabetes status; however, the classification model did not perform well at identifying patients with diabetes. Assumptions made during the analysis, such as a linear relationship between the features and the target variable and unbiased sampling, could affect the results.

Additionally, although the "Height" feature was not significant on its own, a "BMI" feature derived from "Height" and "Weight" was significant.
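The derived feature can be sketched as follows, using the standard definition of BMI as weight in kilograms divided by height in metres squared. The sketch assumes "Height" is recorded in centimetres and "Weight" in kilograms, which the summary does not state. The helper name `add_bmi` and the sample rows are hypothetical.

```python
def add_bmi(records):
    """Return copies of the input rows with an added "BMI" field.

    Assumes Height is in centimetres and Weight is in kilograms
    (an assumption; the original units are not given).
    """
    enriched = []
    for row in records:
        height_m = row["Height"] / 100.0  # centimetres to metres
        bmi = row["Weight"] / height_m ** 2
        enriched.append({**row, "BMI": bmi})
    return enriched


# Made-up rows, for illustration only.
sample = [
    {"Height": 170, "Weight": 70},
    {"Height": 160, "Weight": 80},
]
with_bmi = add_bmi(sample)
for row in with_bmi:
    print(row["Height"], row["Weight"], f"{row['BMI']:.1f}")
```

Combining the two measurements this way can expose a signal that neither carries alone, which is consistent with "Height" being uninformative while "BMI" is significant.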