Clinical data
We collected the PET-CT reports of patients who were diagnosed with lung cancer between January 1st, 2007, and March 31st, 2020 at Asan Medical Center, a tertiary referral hospital in Seoul, South Korea (Cohort A). The records of patients with lung cancer were coded according to the International Classification of Diseases, 10th revision. Each PET-CT report consisted of the following data: patient ID, exam code, exam date, clinical diagnosis, the reason for the imaging study, examination methods, description of image findings, and conclusion of image interpretation. The conclusion section of the report, written in English, contained the locations of the primary cancer site, the metastatic lymph nodes, and other metastatic lesions. Additional file 1: Figure S1 shows an example of the conclusion section of a PET-CT report from a patient with lung cancer that was used as input data in this study. To evaluate the performance of the trained model on an additional-test set, we used PET-CT reports of patients from a different cohort at Asan Medical Center who were treated between January 1st, 2004, and March 31st, 2020 (Cohort B). Although the additional-test set did not come from a different hospital, we intended to show that our model works on an independently annotated dataset with no overlap in patients. The purpose of our model was to convert any lung cancer PET-CT report into a structured form so that clinicians could access metastasis-labelled radiologic reports.
Report annotation
To determine the metastatic stage of lung cancer according to the TNM staging system [17], we assessed the primary cancer location, the nodal stage of lung cancer, and the metastatic sites as the outcome categories. The location of lung cancer was labelled at the level of the lobe; however, if the primary site could not be assigned to a single lobe because of the large tumour size, the location was labelled as left or right. For synchronous metastasis and ipsilateral/contralateral metastasis, the annotators followed the initial opinion of the radiologist who wrote the PET-CT report. Two clinicians independently annotated the primary cancer location and the metastatic lymph nodes and organs in 500 PET-CT reports, and their inter-rater agreement was measured with Cohen’s kappa coefficient (Additional file 1: Table S1). Another clinician independently annotated the primary cancer location and metastatic organs in 4190 PET-CT reports that were used as the additional-test set. The additional-test dataset was used neither in the pseudo-labelling process nor in any pre-processing.
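As an illustration of the agreement measure, the following minimal sketch computes Cohen's kappa for two annotators who labelled the same reports. It is not the code used in the study; the function name `cohens_kappa` and the example labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical primary-site annotations for four reports.
annotator_1 = ["RUL", "RUL", "LLL", "LLL"]
annotator_2 = ["RUL", "LLL", "LLL", "LLL"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. For multi-label outcomes such as metastatic lymph nodes, one option is to compute kappa separately for each label.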
Ethics approval
The ethics committee of Asan Medical Center approved this study, which was conducted in accordance with the Declaration of Helsinki. The ethics committee of Asan Medical Center (approval number 2020–0212) also waived the requirement for informed consent because of the retrospective observational nature of the study. The clinical data extracted using the ABLE system at Asan Medical Center were indexed by de-identified, encrypted patient ID numbers so that individual patients could not be identified [18, 19].
Pre-processing of typographical errors and keyword extraction
To train a deep-learning model that is robust against typographical errors, we developed a spelling correction tool trained on lung cancer-related journals (Additional file 1: Methods). All the radiologic reports were corrected using this spelling correction tool. As each sentence had an independent meaning in our PET-CT reports, each radiologic report was split into a group of sentences (Additional file 1: Figure S1). Keywords were extracted from each sentence using Named Entity Recognition (NER) [20], which eliminated words that had little impact on extracting the metastatic information (Additional file 1: Figure S2). The pre-processing stage thus provided a refined version of the input data, transformed into a set of sentences containing keywords, and these pre-processed inputs were used to train the deep-learning models. The detailed pre-processing methods are described in Additional file 1: Methods.
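The sentence-splitting and keyword-filtering steps can be sketched as follows. This is only an illustration under stated assumptions: a small hand-written vocabulary (`KEYWORDS`) stands in for the NER tagger, the splitting rule is a simple regular expression, and the spelling-correction step is omitted. None of these names come from the study.

```python
import re

# Hypothetical keyword vocabulary standing in for the NER tagger's output.
KEYWORDS = {
    "hilar", "mediastinal", "metastasis", "metastatic", "lymph", "node",
    "nodes", "upper", "middle", "lower", "lobe", "left", "right", "bone",
}


def split_sentences(report):
    """Split a report conclusion into sentences at '.', ';' or line breaks.

    A naive rule: abbreviations that end in a period would also be split.
    """
    parts = re.split(r"(?<=[.;])\s+|\n+", report.strip())
    return [part.strip() for part in parts if part.strip()]


def extract_keywords(sentence):
    """Keep only tokens found in the keyword vocabulary, in original order."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [token for token in tokens if token in KEYWORDS]


def preprocess(report):
    """Return one keyword list per sentence of the report."""
    return [extract_keywords(sentence) for sentence in split_sentences(report)]


example = ("Lung cancer in the right upper lobe. "
           "Metastatic lymph nodes in the right hilar area.")
keyword_sets = preprocess(example)
```

For the example above, the first sentence reduces to `right, upper, lobe` and the second to `metastatic, lymph, nodes, right, hilar`.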
Structure of the model
Using the NER tags, we extracted keywords that might represent the primary sites from each PET-CT report. Each keyword was represented as a 100-dimensional vector. In this study, we implemented a Convolutional-Recurrent Neural Network [21] consisting of a single convolutional layer and two LSTM layers (Fig. 1). The convolution and max-pooling operations extract key features from the FastText embeddings, while the LSTM layers capture sequential information across the word sequence. This combination can yield word representations that reflect context as well as semantics.
The classification of primary sites is a multi-class classification task, while lymph node staging is a multi-label classification task. The primary cancer sites are the right upper lobe, right middle lobe, right lower lobe, left upper lobe and left lower lobe. The lymph node stage is determined by the metastatic lymph nodes most distant from the primary cancer location (TNM staging). Therefore, the model should find all the metastatic lymph nodes, identify their anatomical sites and determine whether each is ipsilateral or contralateral with respect to the primary cancer site. Annotating these metastatic lymph nodes poses two problems. First, many labels must be annotated for each report, which can lead clinicians to omit some labels. Second, metastatic sites follow a long-tailed distribution, so that the laborious annotation process yields only a small number of examples of uncommon metastatic sites. To overcome these hurdles, target sites for lymph nodes and metastatic organs were selected if their prevalence was higher than 3%. We also used a semi-supervised learning technique called pseudo-labelling, which first trains the model on a small amount of labelled data and then uses that model to assign to each unlabelled sample the pseudo-label with the highest predicted probability. Although this is a relatively simple approach, it has shown high performance compared with other semi-supervised learning methods [16]. Using this approach, we assigned pseudo-labels to all unlabelled data; however, unlike in the original paper on pseudo-labelling [16], each pseudo-label was assigned on the basis of the appearance of specific words rather than predicted probabilities.
For instance, sentences in which ‘hilar’ and ‘metastasis’ appear are most often related to metastasis in the hilar area, so such a sentence receives the pseudo-label ‘metastatic hilar lymph nodes.’ Because the label values were closely tied to the extracted information, the keywords within each sentence were used to generate pseudo-labels for multi-label classification.
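The word-based pseudo-labelling rule can be sketched as a lookup of trigger words per label. Only the ‘hilar’ plus ‘metastasis’ rule is taken from the text; the second rule, the dictionary name `PSEUDO_LABEL_RULES` and the function names are hypothetical and serve only to show the mechanism.

```python
import re

# Each label fires when all of its trigger words appear in the same sentence.
PSEUDO_LABEL_RULES = {
    # Example given in the text.
    "metastatic hilar lymph nodes": {"hilar", "metastasis"},
    # Hypothetical additional rule, for illustration only.
    "metastatic mediastinal lymph nodes": {"mediastinal", "metastasis"},
}


def pseudo_labels(sentence):
    """Return the sorted labels whose trigger words all occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(
        label for label, required in PSEUDO_LABEL_RULES.items()
        if required <= tokens
    )


def label_report(sentences):
    """Union of the sentence-level pseudo-labels: one multi-label target per report."""
    labels = set()
    for sentence in sentences:
        labels.update(pseudo_labels(sentence))
    return sorted(labels)


report_labels = label_report([
    "Metastasis in mediastinal lymph nodes.",
    "Hilar lymph node metastasis.",
])
```

This sketch matches exact word forms and ignores negation, so a sentence such as "no evidence of metastasis in the hilar area" would still receive the label; a practical rule set would need to handle such cases.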
The nodal (N) staging classification model has 13 outputs, one for each category belonging to the N stage, and the metastatic site (M) staging classification model has seven output nodes, one for each category. As lymph node staging is determined by the most distant lymph nodes, and by whether each lymph node is ipsilateral or contralateral to the primary site, further processing was necessary to relate the location of the metastatic lymph nodes to the primary cancer site. Using the keywords from the pseudo-labelling process, we checked the n-grams surrounding each keyword to find the closest positional word (ipsilateral or contralateral). Next, we analysed the word segments containing the keywords and the location information together with the primary cancer site to determine the side of the lymph node (Additional file 1: Figure S3). In nodal and distant metastasis staging, the extraction model uses all words, not just keywords, which helps the model learn expressions that are not included among the keywords. For example, sentences containing words such as ‘T4’ and ‘T5’ (abbreviations for the 4th and 5th thoracic vertebrae) tend not to contain words related to the bone, and we observed that the model was nevertheless trained appropriately on them. Therefore, all words were used as input in our proposed model.
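The side-determination step can be sketched as a window search around each keyword. This sketch assumes that the positional words are ‘left’ and ‘right’ (with ‘lt’ and ‘rt’ as assumed abbreviations), that the window is three tokens on either side, and that the primary cancer side is already known. The names `closest_side` and `node_laterality` are hypothetical.

```python
import re

# Assumed positional vocabulary; the abbreviations are illustrative.
SIDE_WORDS = {"left": "left", "lt": "left", "right": "right", "rt": "right"}


def closest_side(tokens, keyword, window=3):
    """Side word nearest to `keyword` within `window` tokens, or None."""
    best = None  # (distance, side)
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            side = SIDE_WORDS.get(tokens[j])
            if side and (best is None or abs(j - i) < best[0]):
                best = (abs(j - i), side)
    return best[1] if best else None


def node_laterality(sentence, keyword, primary_side, window=3):
    """Classify a lymph node as ipsilateral or contralateral to the primary site."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    side = closest_side(tokens, keyword, window)
    if side is None:
        return "unknown"
    return "ipsilateral" if side == primary_side else "contralateral"


result = node_laterality(
    "Metastatic lymph nodes in the left hilar area.", "hilar", primary_side="right"
)
```

With a right-sided primary cancer, the left hilar node in this example is classified as contralateral. Sentences without a side word return `unknown` and would need a fallback rule.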
Statistical analysis
The prevalence of each outcome was described in numbers and percentages. The inter-rater agreement was calculated using Cohen’s kappa coefficient, and the overall accuracy of our proposed model with each pre-processing step was compared using A/B testing. The performance of our proposed model was evaluated with the following metrics: precision, sensitivity (recall), specificity, F1-score, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC), with micro-average and macro-average for each outcome in the two validation sets [22]. For each outcome, we evaluated the false-positive and false-negative results according to each label. Statistical analysis was performed using the statistics package in Python 3.7.4.
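To make the difference between micro- and macro-averaging concrete, the following minimal sketch computes precision, recall, specificity and F1-score for a multi-label outcome. It is an illustration, not the analysis code of the study; the function names and the toy labels are assumptions. AUROC and AUPRC are omitted because they require predicted probabilities rather than hard labels.

```python
METRICS = ("precision", "recall", "specificity", "f1")


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def scores(tp, fp, fn, tn):
    """Metrics derived from one set of confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def confusion_counts(y_true, y_pred, labels):
    """Per-label (tp, fp, fn, tn) over reports given as sets of labels."""
    counts = {}
    for label in labels:
        tp = fp = fn = tn = 0
        for true_set, pred_set in zip(y_true, y_pred):
            is_true, is_pred = label in true_set, label in pred_set
            if is_true and is_pred:
                tp += 1
            elif is_pred:
                fp += 1
            elif is_true:
                fn += 1
            else:
                tn += 1
        counts[label] = (tp, fp, fn, tn)
    return counts


def micro_macro(y_true, y_pred, labels):
    """Return per-label scores, micro-averaged scores and macro-averaged scores."""
    counts = confusion_counts(y_true, y_pred, labels)
    per_label = {label: scores(*c) for label, c in counts.items()}
    # Macro: unweighted mean of the per-label metrics.
    macro = {m: sum(s[m] for s in per_label.values()) / len(per_label) for m in METRICS}
    # Micro: pool the confusion counts over labels, then compute the metrics once.
    pooled = [sum(c[i] for c in counts.values()) for i in range(4)]
    return per_label, scores(*pooled), macro


# Toy example with two hypothetical labels and four reports.
labels = ["hilar", "bone"]
y_true = [{"hilar"}, {"hilar", "bone"}, set(), {"hilar"}]
y_pred = [{"hilar"}, {"hilar"}, {"bone"}, set()]
per_label, micro, macro = micro_macro(y_true, y_pred, labels)
```

Macro-averaging weights every label equally, so rare labels influence it as much as common ones, whereas micro-averaging is dominated by the frequent labels. With a long-tailed label distribution, the two averages can therefore differ noticeably.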