This paper revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work differs from past experiments in the size of the training data, the unbalanced distribution of data between the minority and majority classes, and the modified data cleaning procedures. The experiments also follow the principles of tidy data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression. Of particular interest were strategies to improve recall for the minority class, as the cost of misclassification is prohibitive. The main contribution of this work is that logistic regression with a properly set class weight gives the highest precision and recall for the minority class. In addition, this work provides precise algorithms and code for determining class membership and for executing the competing methods. This code can facilitate the reproduction and extension of our work by other researchers.
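The class weighting and the minority-class metrics mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not state which weights were used, so the sketch assumes the common "balanced" heuristic (each class weighted by n / (k * count), as in scikit-learn's `class_weight="balanced"` option). The label encoding, the 90/10 class split, and the predictions are hypothetical values chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).

    Assumption: the 'balanced' heuristic; the paper may use other weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = minority class, 0 = majority class.
y_true = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_true)

# Hypothetical predictions: 4 false alarms among the majority cases,
# and 6 of the 10 minority cases correctly identified.
y_pred = [0] * 86 + [1] * 4 + [1] * 6 + [0] * 4
precision, recall = precision_recall(y_true, y_pred, positive=1)

print(weights)
print(precision, recall)
```

Under this heuristic the minority class receives the larger weight, so errors on minority examples cost more during training. That added cost is what tends to raise minority-class recall, usually at some expense in precision.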
Keywords: Machine Learning, Big Data, Learning Algorithm, Logistic Regression, Classification, ROC
Bozorgi, M., Taghva, K., & Singh, A. (2022). Revisiting Survivability Prediction of Breast Cancer with Machine Learning Tools. Journal of Applied Statistics & Machine Learning. 1(2): pp. 89-100.