A Deep Learning Pipeline for Lung Cancer Classification on Imbalanced Data Set
Lung cancer is one of the most common malignant tumors worldwide and the leading cause of cancer death, accounting for 1 in 4 cancer deaths (Gatsonis et al., 2011). Various automatic lung cancer detection models based on neural networks have been developed (Donahue et al., 2015; Gao et al., 2019; Hua et al., 2015; Liao et al., 2019; Santeramo et al., 2018; Xu et al., 2019). In diagnostic classification tasks, data imbalance arises because cases with a positive diagnosis, which form the minority class, are harder to obtain. This thesis therefore proposes a deep learning pipeline that incorporates approaches for handling class imbalance in the National Lung Screening Trial (NLST) data set. The models proposed by Liao et al. (2019) and Gao et al. (2019) are used as the classification pipeline. The pipeline is applied to different subsets of the NLST data to investigate what degree of class imbalance degrades its classification results. Two approaches, oversampling the minority class and weighting the loss for each class, were applied to mitigate the influence of imbalance and make more of the data usable.
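The two imbalance-handling approaches named above can be illustrated with a minimal sketch. The abstract does not specify the weighting formula or the sampling scheme, so the inverse-frequency weights, the random duplication strategy, the function names, and the toy scan identifiers below are all assumptions made for illustration, not the thesis implementation.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes
    have as many samples as the largest class (assumed scheme)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates with replacement to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: 9 negative cases and 1 positive case.
labels = [0] * 9 + [1]
samples = [f"scan_{i}" for i in range(10)]

weights = inverse_frequency_weights(labels)
bal_samples, bal_labels = oversample_minority(samples, labels)
```

In this sketch the weights would be passed to a per-class weighted loss so that errors on the rare positive class cost more, while the oversampled lists would feed training with equal class counts. The two techniques are alternatives that can also be combined.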