Predicting COVID-19 disease progression and patient outcomes based on temporal deep learning

Background The coronavirus disease 2019 (COVID-19) pandemic has caused health concerns worldwide since December 2019. From the beginning of infection, patients progress through different symptom stages, such as fever and dyspnea, and some eventually die. Identifying disease progression and predicting patient outcomes at an early stage helps target treatment and resource allocation. However, there is no clear definition of COVID-19 stages, and few studies have characterized COVID-19 progression, which motivates this study. Methods We proposed a temporal deep learning method based on a time-aware long short-term memory (T-LSTM) neural network and trained the model on an online open dataset that includes blood samples of 485 patients from Wuhan, China. Our method can capture the dynamic relations in irregularly sampled time series, which existing works ignore. Specifically, our method predicted the outcome of COVID-19 patients by considering both the biomarkers and the irregular time intervals. Then, we used the patient representations extracted from T-LSTM units to subtype the patient stages and describe the disease progression of COVID-19. Results Using our method, the outcome prediction accuracy was more than 90% at 12 days before the outcome and 98, 95 and 93% at 3, 6, and 9 days, respectively. Most importantly, we found 4 stages of COVID-19 progression with different patient statuses and mortality risks. We ranked 40 biomarkers related to the disease and gave their reference values for each stage. The top 5 are Lymph, LDH, hs-CRP, Indirect Bilirubin and Creatinine. In addition, we found 3 complications: myocardial injury, liver function injury and renal function injury. Predicting which of the 4 stages a patient is currently in can help doctors better assess and treat the patient.
Conclusions To combat the COVID-19 epidemic, this paper aims to help clinicians better assess and treat infected patients, provide relevant researchers with potential disease progression patterns, and enable more effective use of medical resources. Our method predicted patient outcomes with high accuracy and identified a four-stage disease progression. We hope that the obtained results and patterns will aid in fighting the disease.

worldwide, including more than 400,000 deaths (as of 15 June 2020) [2]. Even though the disease has been controlled in certain countries, the WHO director warns the pandemic is still 'Speeding Up' [3]. Because of its sudden onset, many hospitals are still facing medical resource shortages. For example, news in [4] reported a lack of medical resources in New Delhi. In [5], Arizona has experienced record-high hospital capacity as coronavirus cases climb. A reasonable allocation of resources according to patient condition is needed.
The solution to this problem involves determining the stages of disease progression by subtyping and predicting the outcome of COVID-19 patients. Then, targeted treatment and medical resource allocation can be carried out for patients in different stages. Recent studies [6][7][8][9][10][11] have used statistical methods to analyze COVID-19 progress by inpatient symptoms. However, different statistical results were obtained by considering different patient groups and different symptoms. At present, there is no clear division of the stages of COVID-19 progression.
Longitudinal disease analysis is the key to understanding disease progression, designing prognoses and developing early diagnostic tools. The time dynamics of disease can provide more information than static symptom observation [12]. Considering the complex patient states, the number of interventions and the real-time requirement, data-driven machine learning approaches that learn from electronic health records are needed to help clinicians [13].
Many existing works have used machine learning methods for COVID-19 prediction tasks. We have summarized them in Table 1. For example, in most methods of [27] and in [1, 14-19], the authors used non-deep learning methods, such as k-NN, LR, Cox, SVM and DT, to classify CT/X-ray images and predict the outcomes of COVID-19 patients. However, in terms of prediction accuracy, non-deep learning methods are not as good as deep learning methods. Deep learning methods can train parameters with complex nonlinearity to learn the data structures and have achieved state-of-the-art results in many medical prediction tasks [28][29][30]. Thus, many current works apply deep learning methods to COVID-19 prediction tasks [17, 19-26]. However, these methods either use a simple multi-layer perceptron for prediction or use convolutional structures for image classification. Both kinds of methods ignore the temporal development of the patient's status. In real-world patient records, apart from the basic information, the vital signs, test values and diagnoses are all time series; this is especially true of the blood samples of COVID-19 patients, the data we used in this paper.
The recurrent neural network (RNN) [31], a deep learning method, can efficiently model temporal sequences. It uses recursion in the direction of sequence evolution to learn the relations among past, present and future. However, the basic RNN suffers from the long-term dependency problem [32]. Moreover, RNNs only process uniformly distributed longitudinal data, while COVID-19 patient blood samples are distributed nonuniformly, with irregular time intervals between observations. Thus, a method that can model the irregular time series of COVID-19 patients is needed.
In this paper, we retrospectively analyzed the blood samples of 485 patients from the region of Wuhan, China. The medical records were collected with standard case report forms, including epidemiological, demographic, clinical, laboratory and mortality outcome information, and come from an online open dataset under an MIT license. We applied a temporal deep learning method, Time-aware Long Short-Term Memory (T-LSTM), to model the irregular time series of COVID-19 patients. T-LSTM can predict mortality with more than 98% accuracy 3 days in advance. Meanwhile, we discovered four stages of COVID-19 patients. For the different stages, we analyzed the patient's state and found the related biomarkers and complications.

Methods
In this section, we first introduce the COVID-19 dataset and the data preprocessing process. Then, we describe the methods for mortality prediction and disease progression in detail.

Table 1 The conclusion of machine learning methods used in COVID-19 prediction tasks
We use abbreviations for the methods; the full names are listed in Table 1.
Consider the time series of LDH, lymph and hs-CRP of a 70-year-old female patient during hospitalization. The time intervals between two observations are irregular and range from a few minutes to several days. The detailed statistical information of the demographic features and the 74 clinical laboratory test features is listed in Table 2.
For example, in the dataset, the average age of patients is 58.83, the survival rate is 53.6% and the ratio of male to female is about 1.5:1. We also list the range and mean value of each feature. In Fig. 1, we display the distributions of some features (age, gender, LDH, lymph and hs-CRP) of the survival class (0) and the death class (1). This COVID-19 blood test data is publicly available at https://github.com/HAIRLAB/Pre_Surv_COVID_19.

Dataset preprocessing
First, we attempted to find a suitable time measurement granularity. In the raw dataset, the lengths of sequences are unequal and different sampling times result in missing data, with an 85% missing rate on average. The missing rate is expressed in Eq. 1. N missing means the number of time points with missing data in one time series. N all means the number of time points in that time series. The presence of vacancies has a large impact on data quality, resulting in unstable predictions and other unpredictable effects [34]. We used 3 days as the basic sampling interval, reducing the average mr below 30%. The time series length of raw data, the average missing rate and the missing rate for each feature are shown in Fig. 1.
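As an illustration of the preprocessing described above, the following Python sketch bins one feature of one patient into fixed 3-day intervals and computes the missing rate as the fraction of time points without a value. The function names, the choice of averaging the readings that fall into the same bin, and the sample readings are assumptions made for illustration; they are not taken from the authors' code.

```python
from datetime import datetime

def bin_index(t, t0, days=3):
    """Index of the `days`-wide bin that timestamp t falls into, counted from t0."""
    return int((t - t0).total_seconds() // (days * 86400))

def resample(records, days=3):
    """Average the readings that fall into the same bin; None marks an empty bin.

    records: list of (timestamp, value or None) for one feature of one patient.
    Averaging within a bin is an assumption of this sketch.
    """
    t0 = min(t for t, _ in records)
    n_bins = max(bin_index(t, t0, days) for t, _ in records) + 1
    buckets = [[] for _ in range(n_bins)]
    for t, v in records:
        if v is not None:
            buckets[bin_index(t, t0, days)].append(v)
    return [sum(b) / len(b) if b else None for b in buckets]

def missing_rate(series):
    """Missing rate mr = N_missing / N_all for one time series."""
    return sum(v is None for v in series) / len(series)

# Hypothetical readings of a single feature (values are made up).
raw = [
    (datetime(2020, 2, 1, 1), None),
    (datetime(2020, 2, 1, 6), 300.0),
    (datetime(2020, 2, 2, 1), None),
    (datetime(2020, 2, 3, 1), None),
    (datetime(2020, 2, 5, 1), 280.0),
    (datetime(2020, 2, 8, 1), None),
]

raw_mr = missing_rate([v for _, v in raw])
binned = resample(raw, days=3)
binned_mr = missing_rate(binned)
print(raw_mr, binned, binned_mr)
```

In this toy series, four of the six raw time points are empty, whereas only one of the three 3-day bins is empty, which mirrors how a coarser sampling interval lowers the missing rate.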
$$mr = \frac{N_{missing}}{N_{all}} \tag{1}$$
Meanwhile, for feature selection, using all 74 laboratory test features is unrealistic. To address the high missing rate, repeated features and collection difficulties, we considered three key features: lactic dehydrogenase (LDH), lymphocytes (lymph) and high-sensitivity C-reactive protein (hs-CRP). These features contain specific research biomarkers of COVID-19 patients [33] and can be easily collected in any hospital. Because three features alone may not achieve high prediction accuracy, we also selected 40 features (listed in Table 7) with a missing rate below 30% for a comparative experiment.

T-LSTM
Recurrent neural networks (RNNs) [31] (the first structure in Fig. 2) are deep network architectures designed to model temporal sequences. They take sequence data as input, recursion occurs in the direction of sequence evolution, and all units are chained together. In the basic RNN (the second structure in Fig. 2), the current state $h_t$ is affected by the previous state $h_{t-1}$ and the current input $x_t$ and is described as
$$h_t = \sigma(W x_t + U h_{t-1} + b)$$
where $\sigma$ is an activation function, and $W$, $U$ and $b$ are learnable parameters. Long Short-Term Memory (LSTM) [32] (the third structure in Fig. 2) is a variant of RNN that is adept at solving long-term dependency problems. A standard LSTM unit consists of a forget gate $f_t$, an input gate $i_t$, memory cells $C_t$ and $\tilde{C}_t$, and an output gate $o_t$. However, RNNs only process uniformly distributed longitudinal data by assuming that the sequences have an equal distribution of time differences. COVID-19 patient blood samples are distributed nonuniformly. For example, the time gap between two sequential records could be hours or days. Time-aware Long Short-Term Memory (T-LSTM) [35] (the fourth structure in Fig. 2) incorporates the elapsed time information into LSTM. It applies a memory discount to capture the irregular temporal dynamics. T-LSTM can be formulated as:
$$
\begin{aligned}
C^{S}_{t-1} &= \tanh(W_d C_{t-1} + b_d) && \text{(short-term memory)}\\
\hat{C}^{S}_{t-1} &= C^{S}_{t-1} \odot g(\Delta_t) && \text{(discounted short-term memory)}\\
C^{T}_{t-1} &= C_{t-1} - C^{S}_{t-1} && \text{(long-term memory)}\\
C^{*}_{t-1} &= C^{T}_{t-1} + \hat{C}^{S}_{t-1} && \text{(adjusted previous memory)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
C_t &= f_t \odot C^{*}_{t-1} + i_t \odot \tilde{C}_t && \text{(current memory)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(current hidden state)}
\end{aligned}
\tag{2}
$$
In Eq. 2, T-LSTM adds some new designs to the basic LSTM. The component $C^{S}_{t-1}$ learns the short-term memory of the sequence through learnable network parameters. $C^{T}_{t-1}$ is the long-term memory, obtained from the former memory cell $C_{t-1}$ by removing $C^{S}_{t-1}$. $C^{S}_{t-1}$ is adjusted to the discounted short-term memory $\hat{C}^{S}_{t-1}$ by the elapsed time function $g(\Delta_t)$. The previous memory $C_{t-1}$ is thereby replaced by the adjusted previous memory $C^{*}_{t-1}$, the sum of the long-term memory and the discounted short-term memory. We use a log calculation for the elapsed time function, $g(\Delta_t) = 1 / \log(e + \Delta_t)$. $\Delta_t = T_t - T_{t-1}$ describes the time gap between two records at two sequential time steps $t$ and $t-1$, and $T_t$ is the actual time at time step $t$.
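To make the time-aware update concrete, the following NumPy sketch runs one T-LSTM step at a time over an irregularly sampled sequence. It follows the T-LSTM formulation of [35] as we understand it; the elapsed time function g(Δt) = 1 / log(e + Δt), the unit of Δt (days), the parameter names, the random initialization and the toy inputs are all assumptions for illustration, and the sketch omits training entirely.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elapsed_discount(delta_t):
    """g(dt) = 1 / log(e + dt): equals 1 for dt = 0 and decays as the gap grows."""
    return 1.0 / np.log(np.e + delta_t)

def init_params(n_in, n_hidden, seed=0):
    """Small random weights; names W_*, U_*, b_* are hypothetical."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fioc":
        p["W_" + gate] = rng.normal(0.0, 0.1, (n_hidden, n_in))
        p["U_" + gate] = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        p["b_" + gate] = np.zeros(n_hidden)
    p["W_d"] = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
    p["b_d"] = np.zeros(n_hidden)
    return p

def tlstm_step(x_t, h_prev, c_prev, delta_t, p):
    """One T-LSTM step with memory discounting by elapsed time."""
    c_short = np.tanh(p["W_d"] @ c_prev + p["b_d"])      # short-term memory
    c_short_hat = c_short * elapsed_discount(delta_t)    # discounted short-term memory
    c_long = c_prev - c_short                            # long-term memory
    c_adj = c_long + c_short_hat                         # adjusted previous memory
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c_t = f * c_adj + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def run_sequence(xs, times, p, n_hidden):
    """Return the hidden state at every time step of one patient's sequence."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    hidden_states = []
    prev_time = times[0]
    for x_t, t in zip(xs, times):
        h, c = tlstm_step(x_t, h, c, t - prev_time, p)
        hidden_states.append(h)
        prev_time = t
    return hidden_states

# Toy input: 4 records of 3 features at irregular times (in days).
params = init_params(n_in=3, n_hidden=64)
toy_times = [0.0, 0.2, 3.0, 9.5]
toy_xs = np.random.default_rng(1).normal(size=(4, 3))
states = run_sequence(toy_xs, toy_times, params, n_hidden=64)
print(len(states), states[-1].shape)
```

The hidden state at the last step would feed the outcome classifier, while the hidden states at every step are the kind of per-time-step representation that the subtyping task relies on.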

Analysis strategy
We first describe the two tasks in this study and then introduce the specific methods. The whole method process is shown in Fig. 3.

Task 1 (Outcome prediction) A set of labeled patient data is represented as $X = \{x_i^t \mid i = 0, \ldots, n;\ t = 0, \ldots, t_{onset}\}$, where the sequence $x_i^0, \ldots, x_i^{t_{onset}}$ represents the records of patient $i$ over the time steps; specifically, $x_i^t$ is multivariate data in which each dimension is a clinical record, so each $x_i^t$ is a numeric vector. $C \in \{0, 1\}$ is the outcome, where class 0 means survival and class 1 means death, as in Fig. 1. The outcome prediction task aims to predict patient outcomes by the prediction function $f: X \to C$.
Task 2 (Temporal patient subtyping / Disease progression mining) The goal is to find patient groups $G = \{g_i \mid i = 0, \ldots, m\}$ with similar feature representations $R = \{r_i^t \mid i = 0, \ldots, n;\ t = 0, \ldots, t_{onset}\}$, where $r_i^t$ is the representation of patient $i$ at time step $t$.
In the COVID-19 patient outcome prediction task, T-LSTM is used to handle patient record sequences and then make the prediction. The process is displayed in the proposed method of Fig. 2, in the lower gray area.
For a patient $i$, the input of T-LSTM at time step $t$ is a three-dimensional feature vector $x_i^t$ containing LDH, lymph and hs-CRP. The output is the state representation $s_i$ at the last time step. We treat this outcome prediction task as a binary classification task with two classes: death and survival.
The cross-entropy [36] measures the difference between two probability distributions. We expect our predicted distribution of patient outcomes to be close to the true distribution. Thus, we use the cross-entropy loss function in Eq. 4. In addition, with a sigmoid activation function, this loss avoids the slow learning that the traditional mean square error loss causes when the gradient becomes small.
$$H(p, q) = -\sum_{x} p(x) \log q(x) \tag{4}$$
Here, $p(x)$ is the prior probability (true label vector) and $q(x)$ is the prediction probability (predicted results vector). Correspondingly, $\hat{C}$ is the real class of the input data, and $C$ represents the predicted class.
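For the two-class case, the cross-entropy loss can be written out in a few lines. The sketch below is a plain-Python illustration, not the authors' training code; the function names, the clipping constant `eps` and the toy labels and probabilities are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between true labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, q in zip(y_true, y_prob):
        q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
    return total / len(y_true)

# Toy labels and logits (made up): the second set of logits is more confident and correct.
labels = [1, 0, 1, 0]
uncertain = [sigmoid(z) for z in (0.0, 0.0, 0.0, 0.0)]
confident = [sigmoid(z) for z in (3.0, -3.0, 3.0, -3.0)]

loss_uncertain = binary_cross_entropy(labels, uncertain)
loss_confident = binary_cross_entropy(labels, confident)
print(loss_uncertain, loss_confident)
```

With a sigmoid output, the derivative of this loss with respect to the logit is simply the predicted probability minus the label, so the update does not shrink when the sigmoid saturates; this is the property the paragraph above contrasts with the mean square error loss.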
In the COVID-19 disease progression task, temporal patient subtyping can uncover the dynamic characteristics of the disease through fine-grained subtyping, which leads to the potential stages of disease progression. We addressed the issue by building a time stage reference and providing a low-dimensional representation of each subject, encoding his or her position with respect to this reference.
The method structure is displayed in the upper gray area of the proposed method in Fig. 2. It has 4 steps: 1) Acquisition of the patient representation $r_t$. We used the hidden state $h_t$, extracted from every T-LSTM unit, as the patient's representation $r_t$ at time step $t$. 2) Dimension reduction of $r_t$. For better demonstration, we used the t-distributed Stochastic Neighbor Embedding (t-SNE) [37] method to reduce these high-dimensional vectors $r_t$ to two dimensions. 3) Obtaining the patient group set $G$. As prior information about the patient groups was not available, we acquired patient groups by applying an unsupervised clustering method, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [38], on $r_t$. 4) Analysis of $G$ and the stages of disease progression. The mortality rate $MR$ and the average time distance $TD$ were calculated as the analysis criteria. Equation 5 expresses the mortality rate, where $N_{death}$ is the number of patients with the death outcome and $N_{patient}$ is the total number of patients. Equation 6 expresses the average time distance, where $T_t$ is the current prediction time, $T_{t_{onset}}$ is the time of outcome and $|g_k|$ is the number of patients in group $g_k$.
$$MR = \frac{N_{death}}{N_{patient}} \tag{5}$$
$$TD = \frac{1}{|g_k|} \sum_{i \in g_k} \left( T^{i}_{t_{onset}} - T^{i}_{t} \right) \tag{6}$$
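The last of the four steps above can be sketched as follows: given a cluster label for each patient representation, compute the mortality rate and the average time distance per group. The label `-1` for noise points follows the convention of scikit-learn's DBSCAN; the encoding 1 = death, the function names and the toy values are assumptions for illustration only.

```python
def mortality_rate(outcomes):
    """MR = N_death / N_patient, with outcomes encoded as 1 = death, 0 = survival."""
    return sum(outcomes) / len(outcomes)

def average_time_distance(days_to_outcome):
    """TD = mean of (time of outcome - current prediction time) within a group."""
    return sum(days_to_outcome) / len(days_to_outcome)

def summarize_groups(labels, outcomes, days_to_outcome, noise_label=-1):
    """Return {group: (MR, TD)} while skipping points marked as noise."""
    grouped = {}
    for g, c, d in zip(labels, outcomes, days_to_outcome):
        if g == noise_label:
            continue
        grouped.setdefault(g, ([], []))
        grouped[g][0].append(c)
        grouped[g][1].append(d)
    return {
        g: (mortality_rate(cs), average_time_distance(ds))
        for g, (cs, ds) in grouped.items()
    }

# Toy example (made-up values): five representations, one flagged as noise.
toy_labels = [0, 0, 1, 1, -1]
toy_outcomes = [0, 1, 1, 1, 0]
toy_days = [12.0, 9.0, 3.0, 0.0, 5.0]
summary = summarize_groups(toy_labels, toy_outcomes, toy_days)
print(summary)
```

Ordering the groups by decreasing TD gives a time-ordered sequence of candidate stages, and MR shows how the risk changes along that sequence.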

Evaluation metrics
The prediction results were evaluated by the area under the Receiver Operating Characteristic curve (AUC-ROC). The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), defined in Eq. 7 and 8. TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.
$$TPR = \frac{TP}{TP + FN} \tag{7}$$
$$FPR = \frac{FP}{FP + TN} \tag{8}$$
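The following plain-Python sketch illustrates these quantities. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The function names, the threshold and the toy scores are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return tp, tn, fp, fn

def tpr_fpr(y_true, scores, threshold):
    """One point of the ROC curve: (TPR, FPR) at the given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)

def auc_roc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores (made up).
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_roc(toy_y, toy_scores)
toy_point = tpr_fpr(toy_y, toy_scores, threshold=0.5)
print(toy_auc, toy_point)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; a sort-based implementation would be preferable for larger data.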
The patient groups obtained by unsupervised clustering were evaluated by the Calinski-Harabasz Index (CH), which measures the covariance of data within a class and between classes. A larger CH value indicates a better clustering performance. In Eq. 9, $m$ is the number of data points and $k$ is the number of groups. $B_k$ and $W_k$ respectively represent the covariance matrices between groups and within groups.
$$CH(k) = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \cdot \frac{m - k}{k - 1} \tag{9}$$
After obtaining the stages of COVID-19 patients, we used the Kullback-Leibler divergence (KL divergence) to analyze patient characteristics through each laboratory test feature. KL divergence measures the degree of difference between two probability distributions. For each feature, we first establish the Gaussian distribution $N(\mu, \sigma^2)$ with expected value $\mu$ and variance $\sigma^2$ at each stage. Then, we calculate the average KL divergence of the distributions of adjacent stages. If the average KL divergence of a feature is large, the feature is more likely to be a biomarker that distinguishes different stages. The basic KL divergence of distributions $p(X)$ and $q(X)$ and the KL divergence of two univariate Gaussian distributions are given in Eq. 10 and 11.
$$KL(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \tag{10}$$
$$KL\left(N(\mu_1, \sigma_1^2) \,\|\, N(\mu_2, \sigma_2^2)\right) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2} \tag{11}$$
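The closed-form KL divergence between two univariate Gaussian distributions is straightforward to compute. The sketch below fits a Gaussian to the values of one feature in one stage and evaluates the divergence; the function names and the sample values are assumptions for illustration.

```python
import math

def fit_gaussian(values):
    """Return (mean, variance) of a sample, using the population variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for univariate Gaussians."""
    return (
        0.5 * math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )

# Identical distributions have zero divergence.
kl_same = gaussian_kl(0.0, 1.0, 0.0, 1.0)
# Shifting the mean by one standard deviation gives 0.5.
kl_shift = gaussian_kl(0.0, 1.0, 1.0, 1.0)
# KL divergence is not symmetric when the variances differ.
kl_forward = gaussian_kl(0.0, 1.0, 0.0, 4.0)
kl_backward = gaussian_kl(0.0, 4.0, 0.0, 1.0)
print(kl_same, kl_shift, kl_forward, kl_backward)
```

Because the divergence is asymmetric, the direction in which adjacent stages are compared has to be fixed once and applied to every feature.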
To measure and evaluate each feature, we use the average KL divergence (Average KL) between neighboring stages $g_i$ and $g_{i+1}$, where $m$ is the number of groups.
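A minimal sketch of this ranking is given below. It assumes that the average is taken over the m - 1 adjacent stage pairs and that the divergence is evaluated from stage i to stage i + 1; neither choice is confirmed by the text, and the feature names and per-stage statistics are hypothetical.

```python
import math

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for univariate Gaussians."""
    return (
        0.5 * math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )

def average_adjacent_kl(stage_stats):
    """Mean KL divergence over adjacent stages; stage_stats is [(mean, variance), ...]."""
    pairs = list(zip(stage_stats, stage_stats[1:]))
    total = sum(gaussian_kl(mp, vp, mq, vq) for (mp, vp), (mq, vq) in pairs)
    return total / len(pairs)

def rank_features(stats_by_feature):
    """Sort features by average adjacent-stage KL divergence, largest first."""
    scores = {name: average_adjacent_kl(s) for name, s in stats_by_feature.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-stage (mean, variance) for two features over four stages.
toy_stats = {
    "feature_a": [(20.0, 25.0), (15.0, 25.0), (10.0, 25.0), (5.0, 25.0)],
    "feature_b": [(140.0, 9.0), (140.0, 9.0), (139.0, 9.0), (139.0, 9.0)],
}
ranking = rank_features(toy_stats)
print(ranking)
```

In this toy input, `feature_a` shifts by one standard deviation between every pair of adjacent stages and therefore ranks above `feature_b`, which barely changes.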

Results
We used the records of 375 patients as a training set; the ratio of the training set to the validation set was 0.8:0.2. The records of 110 patients made up the test set. This experiment was conducted with 5-fold cross-validation. The code implementation is publicly available at https://github.com/scxhhh/COVID-19.

Baselines
Table 3 AUC-ROC of COVID-19 mortality prediction results by using baselines
a n days early: The models make a prediction n days before the final death/survival time. They use sequence data from day 0 to n days before the last time to predict
b Cox: Cox's proportional hazards regression model is a semi-parametric regression model. It can analyze the influence of many factors on outcomes. It is used in [19]
c k-NN: The k-Nearest Neighbors method makes predictions based on the information of the nearest k samples in the training set. In this mortality prediction task, the most accurate results appeared when k = 3
d SVM: Support Vector Machines classify by solving for the separating hyperplane that divides the training data correctly and has the largest geometric margin
e DT: A decision tree is a simple classifier consisting of sequences of hierarchically organized binary decisions. It is used in [33]
f BPNN: The Back Propagation Neural Network propagates the signal forward and the error backward. It is used in [20]
g PNN: The Probabilistic Neural Network is a forward propagation network that uses Bayesian decision-making and does not need back propagation to optimize parameters. It is used in [21]
We use the related works summarized in Table 1 as comparison methods. Related works are divided into non-deep learning methods and deep learning methods. We use Cox [19], k-NN [16], SVM [17], DT [1], BPNN [20], PNN [21], RNN, LSTM and T-LSTM for COVID-19 mortality prediction. T-LSTM is our method. Table 3 shows the results of COVID-19 mortality prediction using the baselines. The AUC-ROC is evaluated at 0, 3, 6, 9, 12, 15, and 18 days early. Here, the results are obtained when the patient's representations are 64 dimensional. The results indicate that our method T-LSTM performed better than all of the baselines, regardless of how early before the outcome the prediction was made.
More precisely, using T-LSTM, the outcome prediction accuracy is above 90% at 12 days early and is approximately 97% accurate when predicting 3 days before the disease outcome. More detailed results of train, validation and test sets using T-LSTM are listed in Table 4.
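The "n days early" setting can be illustrated with a short sketch that truncates a patient's sequence so that only records taken at least n days before the final time remain. The function name, the representation of time as days since the first record and the toy sequence are assumptions for illustration.

```python
def truncate_early(times, values, n_days):
    """Keep only the records observed at least n_days before the last time point.

    times: observation times in days, sorted in increasing order;
           the last entry is taken as the final death/survival time.
    values: one record per time point.
    """
    cutoff = times[-1] - n_days
    return [(t, v) for t, v in zip(times, values) if t <= cutoff]

# Toy sequence (made up): six records over twelve days.
toy_times = [0.0, 2.5, 3.0, 7.0, 10.0, 12.0]
toy_values = ["r0", "r1", "r2", "r3", "r4", "r5"]

three_days_early = truncate_early(toy_times, toy_values, n_days=3)
zero_days_early = truncate_early(toy_times, toy_values, n_days=0)
print(three_days_early)
print(len(zero_days_early))
```

A patient whose stay is shorter than n days has only the first record, or no record at all, left after truncation, so such cases need explicit handling when evaluating very early predictions.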

Outcome prediction results
The first four figures in Fig. 3 visualize the prediction results. The first two figures show the AUC-ROC of the prediction results of the baselines and T-LSTM at different levels of earliness. The third figure shows the changes in prediction accuracy and cross-entropy loss during model training. The fourth figure shows the relation between the patient representation dimension and the AUC-ROC of prediction using T-LSTM. Too few dimensions lead to incomplete feature learning, while too many dimensions lead to redundant calculations and easy over-fitting. Considering result accuracy, computational complexity and ease of representation use in the following task, we decided to use 64 dimensional vectors to represent patients.
Based on the prediction results, we found: 1) Deep learning approaches (T-LSTM, RNN, PNN and BPNN) have higher COVID-19 outcome prediction accuracy than non-deep learning approaches (Cox, k-NN, SVM and DT), as they perform highly nonlinear feature transformations through neural network structures. 2) RNN-based models (T-LSTM and RNN) perform better on time series data, as they contain state connections for reproducing time delays and output feedback connections for forming a loop. 3) The time-aware model (T-LSTM) has the best performance, as it can model time series with irregular time intervals, which is a prominent feature of the COVID-19 blood sample dataset.
Further, we also selected 40 features (listed in Table 7) as the input of T-LSTM for a comparative experiment. The results in Table 5 indicate that learning a large number of patient characteristics does not necessarily contribute to the COVID-19 patient mortality prediction task. The three biomarkers LDH, lymph and hs-CRP give better results: the AUC-ROC of using 3 features is 3% higher than that of using 40 features on average. This is because too many features introduce redundant and irrelevant dependencies.

Disease progression results
By implementing the four steps of disease progression mining, we obtained the 4 stages in both the death class (critical) and the survival class (general), shown in Fig. 4.
For better visualization, we reduced the dimension of the patient's representation vector; the fifth figure in Fig. 3 shows the dimension reduction effect. We chose 2 dimensions due to low representation loss and clear observation. In addition, the DBSCAN clustering effect evaluated by the CH index is shown in the sixth and seventh figures in Fig. 3. Different clustering effects can be obtained by changing the cluster radius parameter ε. The best CH index values for the death class and the survival class are 680.07 and 44.24, respectively.
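For reference, the CH index used to compare clusterings can be computed from the traces of the between-group and within-group dispersion, that is, from sums of squared distances to centroids. The plain-Python sketch below is an illustration under that standard definition; the function names and the toy 2-dimensional points are assumptions, and in practice the labels would come from running DBSCAN with different values of ε.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def calinski_harabasz(points, labels):
    """CH = [tr(B) / (k - 1)] / [tr(W) / (m - k)] for m points in k groups."""
    m = len(points)
    groups = sorted(set(labels))
    k = len(groups)
    overall = centroid(points)
    between = 0.0
    within = 0.0
    for g in groups:
        members = [p for p, l in zip(points, labels) if l == g]
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    return (between / (k - 1)) / (within / (m - k))

# Toy 2-D points (made up): the same two groups, once tight and once spread out.
tight = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
loose = [(0.0, 0.0), (0.0, 6.0), (10.0, 0.0), (10.0, 6.0)]
toy_labels = [0, 0, 1, 1]

ch_tight = calinski_harabasz(tight, toy_labels)
ch_loose = calinski_harabasz(loose, toy_labels)
print(ch_tight, ch_loose)
```

In the toy data, the tightly packed groups score higher than the spread-out ones, which is the sense in which a larger CH value indicates denser, better separated groups.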

Table 4 AUC-ROC of COVID-19 mortality prediction results by using T-LSTM on different sets at different timestamps
a n days early: The model makes a prediction n days before the final death/survival time. It uses sequence data from day 0 to n days before the last time to predict
b We use the records of 375 patients as the training set; the ratio of training set to validation set is 0.8:0.2. The records of 110 patients make up the test set. This experiment is conducted with 5-fold cross-validation
Table 5
In this case, both classes have four groups. Four stages of COVID-19 patients are shown in Fig. 4. For each stage, we calculate the mortality rate MR and the average time distance TD. For the death class, MR increases over stages and is 100% at stage 4. For the survival class, MR decreases over stages and is 0% in stage 4. TD in both classes decreases, meaning that the 4 stages are distributed over time. Meanwhile, as the CH index of the survival class is higher than that of the death class, the survival class stages are relatively loosely distributed.
In Fig. 4, the first clustering is obtained by using biomarkers directly and shows that reasonable stages could not be found. In the first clustering, no stage is clustered in the death class and the 2 stages in the survival class have similar mortality rates and no time difference, as the shade of blue indicates. However, using our method, different stages have obvious differences, such as the data point color deepening with the stages. Meanwhile, as shown in the two insets, the class boundary is clearer based on our method.
The division of stages contains the potential characteristics of COVID-19. Here, we present three findings. First, at the time of initial diagnosis, the COVID-19 infected patients' physical conditions are similar, regardless of final survival or death. In Fig. 4, the distance between stage 1 for the death class and the survival class is small, and the two even overlap. This indicates that outcome prediction is likely not accurate at the time of infection. Second, the physical condition of patients who eventually die changes less than that of those who eventually survive. We conclude this from CH index values, where the CH value of the survival class is larger than that for the death class. Third, mortality rate varies by stage. For example, if the patient is classified into the death class and is at stage 1, there is still hope of survival, as shown by the green triangle sample in Fig. 4. However, if the patient is in stage 3 or 4, he or she is very likely to die. Based on estimating the current stage of a patient, doctors will be given a reference, which can help them assess a patient's current situation. Based on that, doctors can carry out targeted treatment and reasonable resource allocation more easily. Thus, the ultimate goal of this study, helping improve the quality of medical care, can be achieved.
Meanwhile, we calculated the mean values of 40 laboratory test features in each stage; the feature values vary with stages (Table 6). Further, we calculated the average KL divergence between adjacent stages for each of the 40 clinical laboratory test features and ranked the average KL values. The higher the ranking, the better the biomarker distinguishes different stages. By ranking the 40 biomarkers according to the degree of correlation with COVID-19 (Table 7), we found the biomarkers that are most relevant to COVID-19. The top 10 are: Lymph, LDH, hs-CRP, Indirect Bilirubin, Creatinine, INR, Serum Sodium, eGFR, Serum Chlorine and Albumin. For each marker, we gave its reference value in each stage, shown in Table 6. Different markers have unique trends in different stages.
Combining the correlation analysis with the reference value analysis, we found that critical COVID-19 patients usually have low values of lymph, eGFR, albumin and serum sodium and high values of LDH, hs-CRP, indirect bilirubin, creatinine and INR. For example, in the critical stage 4, the average lymph (%) is just 4 and the average LDH (U/l) is up to 499. In addition, there are three major complications of COVID-19 patients: myocardial injury, liver function injury and renal function injury. We drew these conclusions, respectively, from the values of 1) LDH, 2) albumin and indirect bilirubin, and 3) serum sodium, serum chlorine and creatinine in different stages.

Discussion
In recent years, deep learning (DL) technology has been widely used because of its superior performance in various medical applications [28,29], such as medical image recognition [39] and medication recommendations [40]. In the past year, the spread of COVID-19 has had a peripheral effect on the global economy and health.
Therefore, we expect to combine DL methods to study and fight COVID-19. The states of COVID-19 patients in hospital are dynamic time sequence processes. In addition to the basic information of patients, the vital signs, diagnoses and other lab tests are all time series. Many existing works [14-27, 41, 42] have achieved good results for COVID-19 prediction tasks, but they paid little attention to analyzing and modeling the characteristics of COVID-19 patients' time series. Dynamic time series modeling can capture the relationship between historical observations and current observations and learn the potential development pattern of the sequence, which is conducive to more accurate prediction and representation. In addition, we found that the time series of COVID-19 patients are irregularly sampled: different time intervals exist between adjacent observations. Not every test is measured regularly during an admission. When a certain symptom worsens, the corresponding variables are examined more frequently; when the symptom disappears, the corresponding variables are no longer examined. These time intervals add a time sparsity factor when the intervals between observations are large [13]. Therefore, it is necessary not only to deal with time series, but also to deal with irregular time series according to the characteristics of COVID-19 patients. In this paper, we used the time-aware LSTM model to solve this problem.
Deep learning methods have outstanding performance in prediction tasks. If a doctor predicts survival or death only by observing the biomarkers and using a threshold, the accuracy is at or below 80% for early predictions. However, the clinical reference value of inaccurate results is very low [43,44]. The DL method has better performance, and the time-aware aspect enables higher accuracy, as shown in Table 3.
However, there are some concerns about the use of DL methods in the high-risk tasks of healthcare.
First, it may be risky to apply predictive methods directly to clinical practice [45]. DL methods may serve as assistive tools for doctors but should not be used to make decisions directly. Because it is challenging for doctors to make optimal decisions, a data-driven and high-accuracy prediction method could help. In this paper, we predict patient outcomes with higher accuracy than the baselines. The method can predict whether an infected patient will die or survive 12 days prior to the disease outcome with over 90% accuracy. The prediction accuracies at 3, 6 and 9 days prior are 98, 95 and 93%, respectively.
Second, DL methods are black-box models with poor interpretability [46,47], whereas clinical settings prefer interpretable models. For example, finding the appropriate prediction-related biomarkers is important. Currently, certain studies have identified suitable predictive biomarkers, such as the 3 biomarkers in [33], which are regarded to have a significant impact on patient mortality. For interpretability, our method identified four disease stages distributed over time. This finding cannot be obtained simply from the values of the biomarkers, as shown by the comparison of the two clustering results in Fig. 4. The discovered stages are closely related to mortality and time of illness and can help analyze the status of infected patients. This shows that the DL method can explore new patterns in multidimensional space that cannot be demonstrated by a simple variable value [48]. We also ranked 40 biomarkers according to the degree of correlation with COVID-19 progression, which provides interpretable results that help doctors better understand the model.
This study has three basic contributions. 1) We can predict patient outcomes with higher accuracy than all baselines. 2) We identified four stages of COVID-19 progression. The stages are closely related to mortality and time of illness and can help analyze the status of infected patients. 3) We give the ranking of 40 biomarkers according to the degree of correlation with COVID-19. Based on this, we found three major complications of COVID-19 patients: myocardial injury, liver function injury and renal function injury.
There is still room for improvement. First, because of the data limitations, our method may face a risk of bias, since data-driven methods are easily influenced by the source of the data. For example, the results may vary when using different datasets [45].
Second, our current interpretation is based on results, such as the degree of association between biomarkers and disease. We hope to give more explanations of the complex DL black-box model, such as describing the specific effect of each part of the model on the result. Meanwhile, we hope to encourage relevant researchers to further study these 4 stages and present more clinical explanations. In particular, we expect to be able to give specific treatments for different stages. Targeted treatment is significant for both patient rehabilitation and the reasonable allocation of medical resources.

Conclusions
The sudden outbreak and epidemic of COVID-19 has led to worldwide suffering and shortages of medical resources. In this paper, we propose T-LSTM to predict patient outcomes with high accuracy (98, 95 and 93% at 3, 6, and 9 days, respectively), which will enable reasonable allocation of medical resources. T-LSTM can effectively model the irregularly sampled time series in blood test samples of COVID-19 patients and predicts more accurately than existing baselines. Meanwhile, we identified four COVID-19 stages. We ranked 40 biomarkers according to their correlation with patient outcomes and gave the reference values of the top 10 biomarkers for each stage. The top 10 biomarkers are: Lymph, LDH, hs-CRP, Indirect Bilirubin, Creatinine, INR, Serum Sodium, eGFR, Serum Chlorine and Albumin. We also found 3 complications of COVID-19: myocardial injury, liver function injury and renal function injury. By analyzing patients' conditions at different stages, doctors can choose specific, targeted treatments. Future work will focus more on the study of the pathological characteristics of the different stages, with the aim of designing targeted treatments for the four stages. Meanwhile, more real clinical data are expected to become available for model validation, and the model will be used to mine the inherent hidden features of other diseases.