An evaluation of time series summary statistics as features for clinical prediction tasks

Background Clinical prediction tasks such as patient mortality, length of hospital stay, and disease diagnosis are highly important in critical care research. Existing studies of clinical prediction have mainly used a few simple summary statistics to summarize information from physiological time series. However, relying on so few statistics discards information. In addition, using only the maximum and minimum statistics to represent patient features fails to provide an adequate explanation. Few studies have evaluated which summary statistics best represent physiological time series.
Methods In this paper, we summarize 14 statistics describing the characteristics of physiological time series, including the central tendency, dispersion tendency, and distribution shape. Then, we evaluate the use of summary statistics of physiological time series as features for three clinical prediction tasks. To find the combinations of statistics that yield the best performances under different tasks, we use a cross-validation-based genetic algorithm to approximate the optimal statistical combination.
Results In experiments using the EHRs of 6,927 patients, we obtained prediction results based on both single statistics and commonly used combinations of statistics under three clinical prediction tasks. Based on the results of an embedded cross-validation genetic algorithm, we obtained 25 optimal sets of statistical combinations and then tested their prediction results. By comparing the prediction performances of single statistics and commonly used combinations of statistics with quantitative analyses of the optimal statistical combinations, we found that some statistics play central roles in patient representation and that different prediction tasks have certain commonalities.
Conclusion Through an in-depth analysis of the results, we found many practical reference points that can provide guidance for subsequent related research. Statistics that indicate dispersion tendency, such as min, max, and range, are more suitable for length of stay prediction tasks, and they also provide information for short-term mortality prediction. Mean and quantiles that reflect the central tendency of physiological time series are more suitable for mortality and disease prediction. Skewness and kurtosis perform poorly when used separately for prediction but can be used as supplementary statistics to improve the overall prediction effect.


Background
Clinical prediction tasks such as patient mortality and disease prediction are highly important for early disease prevention and timely intervention [1,2]. Patient mortality prediction in intensive care units (ICUs) is a key application for large-scale health data and plays an important role in selecting interventions, planning care, and allocating resources. Accurate assessment of mortality risk and early identification of high-risk populations with poor prognoses followed by timely intervention are key in improving patient outcomes. A preliminary disease diagnosis assists doctors in making decisions. With the goal of accurately predicting clinical outcomes, studies have proposed methods that include scoring systems and machine learning models [3,4]. The scoring systems for mortality prediction in wide clinical use include the Sepsis-related Organ Failure Assessment (SOFA) [3], the New Simplified Acute Physiology Score (SAPS II) [5], and the Multiple Organ Dysfunction Syndrome (MODS) [6]. However, most scoring systems based on simple logistic regression for patient mortality prediction have limited prediction performance. With the development of machine learning and deep learning models, studies have applied trained models to clinical prediction tasks and achieved better performance than earlier approaches [4,7].
Feature extraction and patient representation are the underlying premise for constructing prediction models; consequently, these factors are important and affect the prediction performance. An increasing number of monitoring devices and laboratory tests in modern ICUs collect multivariate time series data of varying lengths from patients. Variable-length multivariate time series means that more than one physical measurement will be collected from a patient after admission to the ICU and that the sampling frequency of each predictor differs within a given time window. Overall, patient data consisting of physiological measurements have typical characteristics, such as high resolution, varying lengths, noisy values, and system bias, making the extraction of the temporal features of time series challenging. Most of the existing models select specific summary values for each predictor over a given time period and concatenate them to form patient vectors. Statistics are a form of summary values, and studies have shown that summary statistics can reflect the characteristics of time series. Moreover, they have advantages such as simple extraction, high robustness and strong representativeness [8][9][10]. The features of time series can be divided into three aspects: central tendency, dispersion tendency and distribution shape. The distribution and trends of time series can be reflected by combining multiple summary statistics, thus approximating the original data distribution and reducing the impact of noise on the prediction results.
Existing studies based on machine learning models have mainly used simple summary statistics, such as maximum and minimum observations, as features of physiological time series. However, the absence of more comprehensive summary statistics means that information in the physiological time series is lost. In addition, using only the maximum and minimum statistics to represent patient features fails to provide adequate explanations. Despite the likelihood that more comprehensive features would have clinical implications, few existing studies have experimentally evaluated which summary statistics can best represent physiological time series. In this paper, we report an exhaustive set of results based on different combinations of summary statistics used as features of physiological time series for three clinical prediction tasks. The contributions of this study are twofold. First, we summarize and use 14 statistics as options for physiological time series representation, whereas previous studies used only a few statistics. Second, we experimentally evaluate the performance of different summary statistics as features of physiological time series for different prediction tasks and obtain many conclusions that have practical implications and can provide guidance for subsequent related research.
The remainder of this paper is arranged as follows. First, we outline the related works. Second, we describe our method and its details and then present the experiments and results. Next, we discuss the results of the previous section. Finally, conclusions and future prospects are provided in the last section.

Methods for representing physiological time series
The most common method for representing physiological time series is to summarize the changing features of the data contained in predictors using summary features and concatenate them as a representation of a patient. Such statistics are simple and easy to calculate and have wide applications. Some studies also adopt the first measurement of predictors as the characteristic value of time series. The statistics used in some of the existing studies are listed in Table 1. As Table 1 shows, the maximum, minimum and mean values are widely used. One reason for their wide use is that these statistics are easy to acquire. Another is that experts tend to believe that the maximum and minimum observations reflect the normality or abnormality of the patient index, while the mean value reflects the average level of the index over a period of time. A few studies have attempted to characterize time series features using statistics such as standard deviation, median and skewness. In addition to the above studies, many studies have attempted to fully understand the temporal trends hidden in multivariate time series data. Hug et al. considered a comprehensive set of physiologic measurements and manually defined a set of trend patterns [26]. McMillan et al. used temporal pattern mining to discover time series feature patterns [27]. Cohen et al. identified clinically relevant patient physiological states from physiologic measurements based on hierarchical clustering [28]. Yuan et al. applied nonnegative matrix factorization to group trends in a way that approximates patient pathophysiologic states [29]. Compared with these methods, patient representation based on summary statistics is a simple concept that is easy to calculate and can improve the interpretability of the results. However, the above studies based on summary statistics do not provide a clear reason why only these statistics were selected.
It can be surmised that these choices were subjective and lacked theoretical and experimental support. In addition, research to determine which summary statistics achieve the best performances for physiological time series is lacking. Therefore, the goals of this paper are to identify the statistics that summarize physiological time series most effectively, thus providing support for these studies, and to improve model prediction performance based on representations built from these summary statistics.

Feature selection methods
Datasets containing massive numbers of features can reduce classification accuracy, raise the computational cost and increase the risk of overfitting [30,31]. Variable-length multivariate time series can be characterized by multiple summary statistics; however, some statistics may contain useless or redundant information, and some features may be coupled. If representative features are not selected, computational resources will be consumed without yielding accurate classification results. Thus, it is beneficial to use feature selection mechanisms not only to identify the most representative features but also to reduce the number of features. To select a suitable combination of important summary statistics, feature selection is critical [32]. Previous works used three categories of feature selection: filter methods, wrapper methods and embedded methods. Genetic algorithms are classically used for feature selection and have wide applicability because they can overcome the high time complexity of exhaustive methods. Additionally, the genetic algorithm is a combinatorial optimization method for feature selection that can fully consider the relationships between features and find the most suitable feature combinations. Many previous works have selected features based on genetic algorithms and achieved satisfactory results. Leardi et al. first proposed that the genetic algorithm can be a valuable tool for solving feature selection problems [33]. Mohammadi et al. used a genetic algorithm to identify the most significant features of EEG signals and find their diagnostic value for depression [34]. Dino et al. combined a genetic algorithm with gene expression data to classify gene expression data in two steps [35]. Lei et al. proposed a new electrocardiograph pattern recognition method by combining a genetic algorithm with a support vector machine [36].

Method
Clinical prediction tasks include mortality, length of hospital stay, and disease prediction. The distribution characteristics of physiological time series are the manifestations of physiological states, including dispersion tendency, central tendency, and distribution shape, and these correspond to multiple statistics. By comparing the effects of different statistical combinations on different prediction tasks, the commonalities and differences of the optimal statistical combinations can be found, which can guide subsequent prediction tasks. The premise for finding the best combination of statistics is global search; however, global search is laborious and difficult in practice. This paper considers a feature selection method based on combinatorial optimization, that is, using the genetic algorithm to find the best combinations of statistics.

Identification of the distribution features of physiological time series
To characterize the time series distribution features of different predictors, it is critical to explore many different aspects of the data distribution. Based on statistical theory and existing research, this paper approximates the original data distribution by analysing the central tendency, the dispersion tendency and the distribution shape of each predictor. The central tendency reflects the representative value of the general level of the data or the central value, including statistics such as the mean, median, mode and quantile. The dispersion tendency of the distribution describes how far the data are from the central value, including statistics such as maximum, minimum, standard deviation, coefficient of variation, range and interquartile range. The shape of the distribution reflects whether the distribution is symmetrical, the degree of skewness and the flatness of the distribution, including statistics such as skewness and kurtosis. Figure 1 shows the temperature fluctuation of a patient within 24 hours of admission to the ICU. The minimum and maximum values reflect the range of temperature change of the patient and can reflect how far the data depart from the centre value. The mean value reflects the average temperature of the patient over 24 hours and can reflect the degree to which the data distribution aggregates around its centre value. Furthermore, the mode reflects the temperature value that appears most frequently within the 24 hours. The median reflects the middle value of the ordered measurements, and the quantile reflects values at a specific position. The range and interquartile range reflect the degree of difference across the whole data distribution. The variance and standard deviation reflect the degree of dispersion of the temperature distribution and the stability of the temperature data: a larger variance indicates that the patient's temperature fluctuates widely, which may indicate that the disease is more severe.
The coefficient of variation also reflects the degree of dispersion of the data. However, the central tendency and the dispersion tendency of the temperature distribution cannot reflect the order of temperature measurements; therefore, the shape of the distribution should be considered. The shape of the distribution can reflect the evolution of the disease. Skewness can reflect the symmetry of the data distribution. Generally, the symmetry of the data distribution can be understood as the stability of the temperature change. Both left and right skewness can reflect changes in temperature. Kurtosis reflects the sharpness of the peak of the data distribution and reveals the fluctuation trend and the patient's physiologic state. The summary statistics used in this study included the 13 statistics mentioned above, namely, minimum (min), maximum (max), mean, standard deviation (std), median, lower quartile (Q1), upper quartile (Q3), mode, range, interquartile range (IQR), coefficient of variation (CV), skewness (skew) and kurtosis (kurt). Based on previous works, the first measurement (first) is also added.
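As an illustration of the 14 statistics listed above, the following is a minimal Python sketch that computes them for one predictor's measurements. It is not the implementation used in this study: the function name `summary_statistics`, the quartile convention (the "inclusive" method of the standard library), the use of the sample standard deviation, and the moment-based skewness and excess kurtosis are all assumptions made for the example.

```python
import math
import statistics


def summary_statistics(values):
    """Compute the 14 summary statistics for one predictor's time series.

    `values` holds the measurements in time order (at least two points).
    The estimator conventions below are illustrative assumptions.
    """
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Central moments (population form) for the distribution-shape statistics.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "median": statistics.median(values),
        "Q1": q1,
        "Q3": q3,
        "mode": statistics.mode(values),  # first mode if several exist
        "range": max(values) - min(values),
        "IQR": q3 - q1,
        "CV": std / mean if mean != 0 else math.nan,
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurt": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
        "first": values[0],
    }
```

For a hypothetical temperature series such as `[36.5, 36.8, 37.2, 38.1, 37.2]`, the function returns a dictionary with one entry per statistic. Predictors with very few measurements may not support std, skew or kurt, which is why such cases need separate handling during preprocessing.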

Selection of best statistical combination based on the genetic algorithm
To explore the impact of different combinations of statistics on prediction performance and find the optimal combination, we formalize the problem. Let $V = \{V_1, V_2, \cdots, V_P\}$ represent a collection of $P$ multivariate time series. Series $V_i$ consists of a multidimensional time series of $m$ variables, and the time series of each variable $j$ has $n_j$ observations. For a variable-length time series, $n_j$ may differ for each variable $j$. $V_i$ and its $j$-th component $V_{ij}$ can be written as follows:

$$V_i = (V_{i1}, V_{i2}, \cdots, V_{im}), \qquad V_{ij} = (V_{ij1}, V_{ij2}, \cdots, V_{ijn_j}) \tag{1}$$

The univariate time series $V_{ij}$ of different variables have different dimensions (numbers of observations), but every time series can be transformed into $L$ summary statistics extracted from it. In this paper, according to the 14 statistics mentioned, we set $L = 14$.
Multiple clinical predictors with different sampling frequencies are collected from multiple patients in the ICU. Thus, $V$ is a set of variable-length multivariate time series. Specifically, in Formula (1), $P$ represents the number of patients, $m$ corresponds to the predictor dimensions, such as heart rate, blood pressure, temperature and other vital signs and laboratory predictors, and $t$ is the time measurement point, whose range differs with the sampling frequency of each predictor. Thus, $V_{ijt}$ denotes the $t$-th measurement of the $j$-th predictor in the $i$-th patient. Because of the different sampling frequencies of different predictors in different patients, the total lengths of the vectors obtained by concatenating the raw measurements differ. We can instead summarize the measurements of each variable by a fixed number of statistics and concatenate them to obtain vectors of the same length for all patients. The time series of patient $i$ after extracting the time series features using the $L$ summary statistics can be expressed as follows:

$$S_i = (s_{i11}, \cdots, s_{i1L}, \cdots, s_{im1}, \cdots, s_{imL}) \tag{2}$$

where $s_{ijk}$ denotes the value of the $k$-th summary statistic computed from $V_{ij}$. Note that different statistics have specific statistical meanings. Some problems, such as information overlap, may exist among the statistics. Not all the statistics may perform well for prediction; thus, using all the statistics directly to represent a patient will increase the modelling complexity and can lead to overfitting. Let the binary variable $x_k$ denote whether statistic $k$ is selected in the best combination, that is,

$$x_k = \begin{cases} 1, & \text{if statistic } k \text{ is selected} \\ 0, & \text{otherwise} \end{cases} \qquad k = 1, \cdots, L \tag{3}$$

Then, the selection vector $X$ of the best combination of statistics can be expressed as

$$X = (x_1, x_2, \cdots, x_L) \tag{4}$$

and thus, the representation of patient $i$ after statistical selection can finally be expressed as

$$S_i^X = (x_1 s_{i11}, \cdots, x_L s_{i1L}, \cdots, x_1 s_{im1}, \cdots, x_L s_{imL}) \tag{5}$$

so that statistics with $x_k = 0$ contribute nothing to the representation. To select the combination of statistics that best reflects the physiological time series, we regard the selection vector $X$ as an unknown parameter and construct an objective function to solve the optimization problem. The objective function can be written as follows:

$$\max_{X} \; E\big(\{y_i\}_{i=1}^{P}, \{f(S_i^X)\}_{i=1}^{P}\big) \tag{6}$$

where $E$ is an evaluation function used to measure the prediction performance; in this study, the area under the receiver operating characteristic curve (AUROC) is chosen. Here, $y_i$ is the true label of patient $i$ in the given prediction task, and $f$ is the prediction model, which is the random forest algorithm in this study. Because the objective function in Formula (6) has no explicit closed-form expression, the simplest and most direct way to find the optimal solution of $X$ is to adopt a global search strategy, that is, to evaluate the prediction effect of all statistical combinations and then select the optimal combination. However, the time complexity of this method is $O(2^n - 1)$, where $n$ is the number of candidate statistics; with 14 statistics, this amounts to $2^{14} - 1 = 16{,}383$ candidate combinations, each of which requires training and evaluating a model, which has practical limitations. The purpose of this paper is to evaluate which statistical combination is most effective for time series representation, and the final result of feature selection is a combination of statistics (such as [minimum, maximum and mean]). Such a combination can be represented by chromosome coding in a genetic algorithm. The genetic algorithm is a combinatorial optimization algorithm that approximates a global search; it can fully consider the relationships between features and find the most suitable feature combination.
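The role of the selection vector X can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the code used in the study: the names `STATISTICS`, `select_features` and `n_candidate_combinations`, the dictionary layout of the per-patient statistics, and the choice to drop unselected statistics from the vector (rather than keep them as zeros) are all hypothetical.

```python
# Order of the statistics that the binary selection vector X refers to.
STATISTICS = ["min", "max", "mean", "std", "median", "Q1", "Q3",
              "mode", "range", "IQR", "CV", "skew", "kurt", "first"]


def select_features(patient_stats, mask):
    """Concatenate the selected statistics of every predictor.

    patient_stats: dict mapping predictor name -> {statistic name: value}
    mask: list of 0/1 flags, one per entry of STATISTICS (the vector X)
    """
    chosen = [name for name, keep in zip(STATISTICS, mask) if keep]
    vector = []
    for predictor in sorted(patient_stats):  # fixed predictor order
        vector.extend(patient_stats[predictor][name] for name in chosen)
    return vector


def n_candidate_combinations(n_statistics):
    """Number of non-empty subsets an exhaustive search would evaluate."""
    return 2 ** n_statistics - 1
```

With the mask `[1, 1, 1] + [0] * 11`, every patient is represented by the min, max and mean of each predictor, so all patients receive vectors of the same length regardless of how many raw measurements they have. `n_candidate_combinations(14)` gives the size of the search space that motivates replacing exhaustive search with a genetic algorithm.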
The parameter settings in the genetic algorithm are as follows. (1) Coding and decoding: Because each element of the selection vector of summary statistics is a binary variable, we use binary coding, and no decoding process is needed. (2) Population: We set the population size to 20, and the initial population is generated randomly. (3) Fitness function: In this paper, we select the AUROC as the fitness function to select the feature subset with a better classification effect. The fitness function corresponds to E in Formula (6). (4) Genetic operators: We use the roulette wheel selection scheme as the selection strategy, single-point crossover with a probability of 0.6 as the crossover strategy and uniform mutation with a probability of 0.1 as the mutation strategy. (5) Termination condition: To determine the convergence of the algorithm adaptively during the iteration process, the termination condition for the genetic algorithm used in this paper combines a maximum number of generations with a stationary fitness value. When the fluctuation of the fitness value stays below the specified threshold over consecutive generations or the number of generations exceeds the specified maximum, the algorithm terminates.
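The settings above can be sketched as a compact genetic algorithm in Python. This is a minimal sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the mutation probability is applied independently to each gene, stationarity is interpreted as the best fitness not improving by at least `delta`, and the default values of `delta`, `patience` and `max_generations` follow the values reported in the experimental settings. The `fitness` argument is any function that maps a binary mask to a score; in the study this role is played by the AUROC of a random forest.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def genetic_search(fitness, n_genes=14, pop_size=20, p_cross=0.6, p_mut=0.1,
                   delta=1e-3, patience=50, max_generations=200, seed=0):
    """Search for a binary mask that maximizes `fitness` (sketch only)."""
    rng = random.Random(seed)
    # Binary coding: one gene per statistic, random initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best, best_fit, stalled = None, float("-inf"), 0
    for _ in range(max_generations):
        fitnesses = [fitness(ind) for ind in population]
        gen_fit = max(fitnesses)
        # Count generations in which the best fitness gained less than delta.
        stalled = stalled + 1 if gen_fit - best_fit < delta else 0
        if gen_fit > best_fit:
            best, best_fit = population[fitnesses.index(gen_fit)][:], gen_fit
        if stalled >= patience:
            break
        children = []
        while len(children) < pop_size:
            a = roulette_select(population, fitnesses, rng)[:]
            b = roulette_select(population, fitnesses, rng)[:]
            if rng.random() < p_cross:  # single-point crossover
                point = rng.randint(1, n_genes - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            for child in (a, b):  # flip each gene with probability p_mut
                children.append([1 - g if rng.random() < p_mut else g
                                 for g in child])
        population = children[:pop_size]
    return best, best_fit
```

A real fitness function would also need to guard against the all-zero mask, which selects no statistics and therefore gives the model no features.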
To avoid optimistically biased performance estimates from conducting feature selection on the full dataset, we follow previous work by Ozcift and Gulten, who embedded a genetic algorithm for feature selection into Bayesian network classifier training using a nested cross-validation approach [37]. The general flow of feature selection with the genetic algorithm is given in Table 2. The feature selection based on the genetic algorithm is embedded in a 5-fold cross-validation. For each fold of test data, a set of summary statistics is obtained by the genetic algorithm; thus, five groups of summary statistics are obtained under 5-fold cross-validation. Then, based on the summary statistics of each group, the random forest model is used for prediction, and the mean and standard error of each metric are taken as the experimental result.
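The embedding of feature selection inside cross-validation can be sketched as follows. This is a hedged outline, not the procedure in Table 2: the function names and the `select` and `evaluate` callbacks are hypothetical placeholders. `select` stands for a search (for example, a genetic algorithm) whose fitness only ever sees the training indices, and `evaluate` stands for training a model on the training fold with the selected statistics and scoring it on the held-out fold.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_selection(n_samples, select, evaluate, k=5, seed=0):
    """Run feature selection inside each fold and score on held-out data.

    select(train_idx) -> mask chosen using the training fold only
    evaluate(mask, train_idx, test_idx) -> score on the held-out fold
    Returns the k masks, the mean score and its standard error.
    """
    folds = k_fold_indices(n_samples, k, seed)
    masks, scores = [], []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        mask = select(train_idx)  # the held-out fold is never seen here
        masks.append(mask)
        scores.append(evaluate(mask, train_idx, test_idx))
    mean = sum(scores) / k
    variance = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return masks, mean, math.sqrt(variance / k)
```

Because the mask for each fold is chosen without access to that fold's test data, the reported score is not inflated by the selection step. Calling the function with different `seed` values corresponds to repeating the procedure with different random partitions.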

Experiments and results
We explored the performances of different statistical combinations for different clinical prediction tasks, including patient mortality, length of hospital stay and disease prediction, and obtained the optimal statistical combination based on a genetic algorithm. Then, we analysed the results to find the commonalities and differences of the optimal combinations under different tasks.

Dataset and preprocessing
We used the MIMIC-III dataset collected from a variety of ICUs between 2001 and 2012 [38]. MIMIC-III is a large, freely available critical care database developed by the Laboratory for Computational Physiology of the Massachusetts Institute of Technology (MIT). The database integrates deidentified, comprehensive, health-related data from 58,976 admissions to the ICU of the Beth Israel Deaconess Medical Center (BIDMC) in Boston, Massachusetts.
To reflect the universality of the results, we did not target patients with a certain disease but included all patients. After removing duplicates, we obtained a total of 42,145 admission records; patients younger than 15 years of age were excluded. To prevent possible information leakage and to ensure experimental settings similar to those of related works, we used only the first ICU admission for each patient [39]. In the MIMIC-III database, bedside monitoring data, laboratory test data, input events and output events all consist of time series with time tags. The data for the predictors selected in this paper came from three tables: chartevents, labevents and outputevents. Following the related research, we chose the predictors used in SAPS II, as shown in Table 3 [10,17,21]. For each predictor, we used raw data instead of calculated data. For example, we treated GCSVerbal, GCSMotor, and GCSEyes from the Glasgow Coma Scale (GCS) score as separate features. All the extracted predictors shown in the table came from the first 24 hours after the patient was admitted to the ICU.
Data preprocessing mainly included processing missing values, noisy values and duplicate values. Missing values were handled at three levels: patients, predictors and statistics. We eliminated patients missing more than 30% of their data and predictors missing more than 40%. Because the sampling frequency of each predictor is different and the calculation of statistics such as std, kurt and skew requires a minimum number of measurements, those statistics could not be calculated for some indicators with very low sampling frequency. We eliminated the statistics for which the missing data rate was greater than 20% under these indicators. Then, we used mean interpolation to fill in the remaining missing values. Abnormal values were processed for each predictor. Outliers were identified and handled using box plots combined with the clinical normal range of each predictor. For example, to protect information about surviving patients older than 90 years old, the age of these patients is recorded as 300 years old. Here, we replaced it with the median value. In addition, duplicate records were deleted, and inconsistent units were converted. For interval values, we used the median to represent the predictor value at that time point. Ultimately, 6,927 admission records remained after preprocessing. Figure 2 shows the patient cohort selection inclusion criteria and the data extraction process, and Table 4 shows the baseline characteristics and outcome measures of our dataset. The median age of the adult patients was 65 years, and 58.8% of patients were male. In-hospital mortality was approximately 19.5%, and the median length of stay in the ICU was 4.7 days. We did not process non-time series predictors such as age and sex.
For the time series predictors, we calculated 14 statistics, including min, max, mean, std, median, Q1, Q3, mode, range, IQR, CV, skew, kurt and the first measurement of each predictor from the first 24 hours after admission to the ICU.
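The patient-level and predictor-level missing-value filters and the mean interpolation described above can be sketched as follows. This is only an illustrative sketch: the function name `filter_and_impute`, the nested-dictionary layout, the use of `None` for a missing value and the order in which the two filters are applied are assumptions, and the statistic-level filter and outlier handling are not covered.

```python
def filter_and_impute(table, max_row_missing=0.30, max_col_missing=0.40):
    """Drop sparse patients and predictors, then fill gaps with column means.

    table: dict mapping patient id -> {predictor name: value or None}
    """
    columns = sorted({c for row in table.values() for c in row})

    def missing_rate(flags):
        flags = list(flags)
        return sum(flags) / len(flags) if flags else 0.0

    # 1) Keep patients whose share of missing predictors is at most 30%.
    rows = {pid: row for pid, row in table.items()
            if missing_rate(row.get(c) is None for c in columns)
            <= max_row_missing}
    # 2) Keep predictors missing in at most 40% of the remaining patients.
    columns = [c for c in columns
               if missing_rate(r.get(c) is None for r in rows.values())
               <= max_col_missing]
    # 3) Replace each remaining gap with the mean of the observed values.
    cleaned = {pid: {} for pid in rows}
    for c in columns:
        observed = [r[c] for r in rows.values() if r.get(c) is not None]
        if not observed:
            continue
        col_mean = sum(observed) / len(observed)
        for pid, r in rows.items():
            cleaned[pid][c] = r[c] if r.get(c) is not None else col_mean
    return cleaned
```

In a cross-validated experiment, the column means would normally be computed on the training fold only; the sketch computes them on all retained patients for brevity.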

Clinical prediction tasks
The clinical prediction tasks selected in the experiment included patient mortality, length of hospital stay, and disease prediction. Mortality is a primary patient outcome and includes short-term, in-hospital and long-term mortality. In the experiment, whether the patient died within 72 hours after entering the ICU was selected as the short-term mortality label, and the 30-day and 1-year mortality rates were used as the long-term mortality labels. The length of the hospital stay of an admission can be defined as the time interval between admission and discharge; we calculated the length of hospital stay for each admission in hours. When a patient is discharged, there will be multiple diagnoses, which are represented by International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes. We followed [10] and divided all the ICD-9 codes into 20 diagnostic groups; each diagnostic group contains similar diseases (e.g. respiratory system diagnoses). Thus, the task of disease prediction is transformed into the task of predicting the ICD-9 code groups.

Experimental design
For the three prediction outcomes, we approximated a global search to obtain the best combination of statistics using a genetic algorithm. To improve the generalizability of the statistical combinations obtained by the genetic algorithm, we embedded the genetic algorithm in a cross-validation procedure, as shown in Table 2. For each data fold, we obtained a set of optimal statistical combinations (i.e., fivefold cross-validation yields 5 sets of statistical combinations). To reduce the effect of randomly partitioning the data during cross-validation, we repeated the entire process five times, selecting a different random seed for dividing the data each time, which yields 5 × 5 = 25 sets in total. For the 25 sets of statistical combinations obtained under each prediction task, we first compared their prediction performance with that of the combinations of statistics commonly used in previous studies and then conducted an in-depth analysis of these combinations. We also constructed two indexes to quantify the importance of different statistics used for prediction (see the Discussion section). The most important statistics were found by comparing the commonalities and differences of the optimal combinations of statistics under different prediction tasks. As performance measures, we chose the AUROC and the area under the precision-recall curve (AUPRC) for the classification tasks and the mean squared error (MSE) for the regression task. AUROC and AUPRC evaluate the discrimination ability of the model, namely, the ability to assign higher risk scores to patients who experienced the outcome (for example, those who died in the hospital) than to those who did not. The higher the AUROC and the AUPRC are, the better the model is; for MSE, lower values are better. We calculated the mean and standard error of the AUROC, AUPRC and MSE scores based on cross-validation as the final result.

Fig. 2 Patient cohort selection inclusion criteria and data extraction process. Adult patients at their first hospital admission with a low missing data rate were selected as the patient cohort, and then the clinical data of these patients, such as cohort demography, vital signs, and laboratory examinations, were extracted.
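To make the performance measures concrete, the sketch below computes AUROC from its pairwise-ranking definition and MSE from its definition. It is an illustrative, unoptimized sketch with hypothetical function names, not the evaluation code used in the experiments; AUPRC is omitted for brevity.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    labels: 0/1 outcomes; scores: predicted risk. Ties count as one half.
    Both classes must be present.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The pairwise loop costs time proportional to the number of positive-negative pairs, which is acceptable for illustration but slower than rank-based implementations on large cohorts.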
All the experiments in this paper were programmed in the Python language, using Spyder 3.6 on a PC equipped with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz processor. The iterations of the genetic algorithm were terminated when the fluctuation in the fitness value remained less than $\delta = 10^{-3}$ for 50 consecutive iterations or when the total number of iterations exceeded 200. The crossover probability was set to 0.6, the mutation probability of the genetic algorithm was set to 0.1, and the size of the population was set to 20.

Results
We report the results under different prediction tasks separately. For each prediction task, we list the prediction results based on a single statistic, commonly used combinations of statistics, and the optimal combinations of statistics obtained by the genetic algorithm.

Results of mortality prediction
Patient mortality prediction tasks are divided into short-term, in-hospital, and long-term mortality prediction. Table 5 shows the AUROC and AUPRC of the 14 selected statistics applied separately for the four mortality prediction tasks. When using a single statistic for mortality prediction, mean, median and Q3 achieved the best results under different prediction tasks. In other words, the statistics that reflect the central tendency of the physiological time series achieved the best or near-best prediction results on the mortality prediction tasks, for both short-term and long-term prediction. In addition, for short-term mortality prediction, the max statistic, which reflects the dispersion tendency, also performs notably well. This is understandable: if short-term mortality is predicted using data from the first 24 hours after the patient enters the ICU, the values most closely related to the predictive label are the degrees of fluctuation of the patient predictors. If the predictors are relatively stable, the patient's state can also be considered relatively stable. In contrast, large fluctuations are considered to indicate an unstable patient condition; such patients have a higher mortality rate. For long-term prediction, the average levels of the predictors at a certain stage are closely related to the prediction results over extended periods. If the predictors remain at a consistently abnormal level, the mortality rate is higher over longer time spans.

combination of dispersion and central tendency is better.
This further demonstrates that for short-term prediction, statistics that reflect the dispersion tendency have a better representation effect and can reveal fluctuations in the patient's physiological state. For longer-term mortality prediction tasks (such as 30-day and 1-year), the addition of the std statistic enriches the physiological time series fluctuation information. Even when the min, max and mean values of the physiological time series are known, it is difficult for these statistics to reflect violent fluctuations in the patient's physiological state. Long-term prediction reduces the time dependence of the prediction; thus, more information needs to be added to achieve good results. Tables 7, 8, 9 and 10 present the optimal ten combinations of statistics obtained by the genetic algorithm and their performances for short-term, in-hospital and long-term mortality prediction. As shown, the prediction effect of the optimal combination of statistics obtained by the genetic algorithm is rarely weaker than the prediction effect of the commonly used combinations of statistics. As the prediction interval is extended, the prediction performance decreases, which indicates that predicting long-term mortality based only on data collected within 24 hours after the patient enters the ICU is not ideal. For short-term mortality prediction tasks, Q1 and Q3 appear more frequently. Statistics that show the dispersion tendency, such as min and max, also appear frequently. Skew and kurt, two statistics that describe the shape of the time series distribution and are often ignored, appear quite frequently, which reflects the role of these two statistics in supplementing the other available information. Under longer-term mortality prediction tasks, mean, Q1 and Q3, which are central tendency statistics, also achieve better results. Combining statistics such as min, max, and mean can better characterize the distribution of physiological time series.
In addition, the commonly used combinations of statistics such as [min, max] and [min, max, mean, std] also achieve good prediction results on both in-hospital and long-term mortality prediction tasks. In other words, the experiments in this paper demonstrate why existing studies chose these particular statistical combinations to represent physiological time series. Table 11 shows the performance of a single statistic for length of hospital stay prediction. A certain level of correlation exists between length of hospital stay and mortality prediction. Generally, patients with a higher mortality risk have more severe symptoms; consequently, their hospital stays are relatively long. Consistent with mortality prediction, range works best when prediction is based on a single statistic. At the same time, std, CV, and IQR, which reflect the dispersion tendency, also perform well. In addition to indicating the dispersion tendency, the better performing statistics are also crossover features, that is, they are built from other statistics, for example range = max − min. This indicates the importance of crossover features. Table 12 shows the performances of commonly used combinations of statistics for predicting length of hospital stay. [min, max, mean, std] corresponds to the smallest MSE and the best prediction performance. Table 13 shows the optimal ten combinations of statistics obtained by the genetic algorithm and their prediction performances for length of hospital stay prediction. The combinations of statistics obtained by the genetic algorithm perform better than the common combinations of statistics. Range appears in each group, illustrating the validity of this statistic for predicting the length of hospital stay of patients. A larger range indicates an unstable condition, and patients with unstable conditions will naturally be hospitalized longer.
In contrast, statistics such as the mean, which reflects the central tendency, appear less frequently. When predicting the length of hospital stay, the stability of the patient's condition is the most important factor; thus, statistics that indicate the dispersion tendencies of time series function better.
Table 7 The optimal ten combinations of statistics obtained by the genetic algorithm and their prediction performance for 72-hour mortality prediction
Table 9 The optimal ten combinations of statistics obtained by the genetic algorithm and their prediction performances for 30-day mortality prediction
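As an illustration of the statistics discussed in this section, the following minimal Python sketch computes them for a single physiological time series. It is not the implementation used in the experiments: the function name `summarize`, the quartile interpolation method, the population estimators for std, skew and kurt, and the sample values are all assumptions made for the example. It also makes the crossover features explicit, since range, IQR and CV are each derived from two simpler statistics.

```python
import statistics


def summarize(series):
    """Compute the summary statistics named in this section for one series.

    `series` is a time-ordered list of at least two numeric measurements.
    Population (biased) estimators are used for std, skew and kurt; this is
    an assumption, since the text does not state which estimator was used.
    """
    n = len(series)
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # Quartiles with linear interpolation between the observed values.
    q1, median, q3 = statistics.quantiles(series, n=4, method="inclusive")
    lo, hi = min(series), max(series)

    # Central moments needed for the shape statistics.
    m2 = sum((x - mean) ** 2 for x in series) / n
    m3 = sum((x - mean) ** 3 for x in series) / n
    m4 = sum((x - mean) ** 4 for x in series) / n

    return {
        "first": series[0],
        "min": lo,
        "max": hi,
        "mean": mean,
        "median": median,
        "Q1": q1,
        "Q3": q3,
        "std": std,
        # Crossover features: each is built from two simpler statistics.
        "range": hi - lo,
        "IQR": q3 - q1,
        "CV": std / mean if mean != 0 else float("nan"),
        # Shape statistics; a constant series is given 0 for both.
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurt": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
    }


heart_rate = [72, 75, 71, 90, 110, 95, 80, 78]  # made-up values
features = summarize(heart_rate)
```

A patient representation for a given combination, for example [min, max, mean, std], would then be obtained by concatenating the selected entries of `features` over all clinical predictors.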

Results of disease prediction
We treat disease prediction as a multilabel classification task and calculate the AUROC and AUPRC. Table 14 shows the performances of single statistics for disease prediction. On this task, a comparison of the results shows that the mean, median, Q1, Q3 and other statistics that reflect the central tendency perform best. In contrast, statistics that reflect the dispersion tendency do not perform well. Skew and kurt, which reflect the shape of the time series distribution, perform worst. This result shows that if only one statistic is used for patient disease prediction, the shape of the distribution is unimportant; the level of the values is more important.
The corresponding prediction performances of combinations of multiple statistics are shown in Table 15. Among the five commonly used combinations, it is surprising that the single mean statistic works best, even better than combinations of multiple statistics. From the optimal ten combinations obtained by the genetic algorithm shown in Table 16, we can see that the mean statistic appears in almost all the combinations, indicating its core role in disease prediction. Furthermore, min, max, and range are evenly distributed among the combinations. We speculate that these statistics provide good auxiliary information for disease prediction; however, using them alone does not result in good prediction.
In summary, through the analysis of the prediction performances of different prediction tasks based on single statistics, commonly used combinations of statistics, and approximately optimal combinations of statistics obtained by the genetic algorithm, we discovered many interesting and clinically significant phenomena. We have indirectly demonstrated the rationality of using various combinations of statistics that were applied in previous research. Additionally, we found the statistics that are extremely important in clinical prediction tasks, which can provide guidance for future research.

Discussion
In the experiments, we used a genetic algorithm to obtain combinations with approximately optimal prediction results for different prediction tasks. Taking 72-hour mortality prediction as an example, the 5-fold cross-validation genetic algorithm was repeated 5 times to obtain 25 groups of combinations. Each group corresponds to multiple statistics, and the prediction performance varies among the different combinations. Which statistics appear most frequently and which statistics achieve better prediction results are meaningful research questions. In the previous section, we performed a rough analysis. In this section, we quantitatively analyse the frequency of each statistic in the optimal combinations and the mean values of the evaluation metrics under different tasks. Since we chose random forest as the classifier in the experiments, it is also necessary to verify the performances of other classifiers based on the obtained statistics, so we discuss this issue as well. Tables 17, 18 and 19 show the results of each statistic regarding patient mortality, length of hospital stay and disease prediction, respectively. Frequency represents the number of occurrences of a statistic in the 25 combinations, and Mean_AUROC and Mean_AUPRC represent the average AUROC and AUPRC for all the combinations in which the statistic appears.
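The search over combinations of statistics described above can be sketched as a genetic algorithm over bit masks. The following is a minimal illustration under stated assumptions, not the implementation used in the experiments: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the hyperparameters, the list `STATS` and the function names are all hypothetical. In particular, `toy_fitness` is an arbitrary placeholder; in the experiments the fitness value is computed with a random forest under cross-validation.

```python
import random

# Hypothetical list of candidate statistics; the names follow this paper.
STATS = ["min", "max", "mean", "std", "Q1", "Q3", "range", "skew", "kurt"]


def genetic_search(n_stats, fitness, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Approximate the best subset of n_stats statistics (n_stats >= 2).

    An individual is a tuple of 0/1 flags, one per statistic. `fitness`
    maps such a tuple to a score where larger is better, for example a
    cross-validated AUROC.
    """
    rng = random.Random(seed)

    def random_individual():
        # Redraw until at least one statistic is selected.
        while True:
            bits = tuple(rng.randint(0, 1) for _ in range(n_stats))
            if any(bits):
                return bits

    def tournament(scored):
        # Return the better of two randomly drawn individuals.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        reverse=True)
        next_population = [scored[0][1]]  # elitism: keep the current best
        while len(next_population) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            cut = rng.randint(1, n_stats - 1)  # one-point crossover
            child = tuple(
                bit ^ 1 if rng.random() < mutation_rate else bit  # mutation
                for bit in p1[:cut] + p2[cut:]
            )
            if any(child):  # discard the empty subset
                next_population.append(child)
        population = next_population
    return max(population, key=fitness)


def toy_fitness(mask):
    """Placeholder score, NOT a real model: rewards an arbitrary target set."""
    target = {"min", "max", "Q1", "Q3"}
    chosen = {s for s, bit in zip(STATS, mask) if bit}
    return len(chosen & target) - 0.5 * len(chosen - target)


best = genetic_search(len(STATS), toy_fitness)
best_combination = [s for s, bit in zip(STATS, best) if bit]
```

Under the assumption that one search is run per cross-validation fold, repeating a 5-fold procedure 5 times with different seeds would yield the 25 combinations analysed in this section.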
In the mortality prediction task, the statistics with the highest frequency for 72-hour short-term mortality prediction are min, max, Q1 and Q3. The Mean_AUROC and Mean_AUPRC values corresponding to median and Q1 are high, while those corresponding to first are low. Statistics that embody the dispersion tendency, such as min and max, play a central role in short-term mortality prediction, while statistics such as first are less relevant to patients' physiological status information. For the in-hospital mortality prediction task, min and std occurred most frequently, and min and max achieved the highest Mean_AUROC and Mean_AUPRC, respectively. For the long-term mortality prediction task, min, std, and kurt performed best. Kurtosis and skew measures have rarely been used in previous studies to measure the shapes of physiological time series distributions. However, the experiments in this paper show that these two statistics provide supplementary information and should not be discarded. Apart from this omission, we can clearly see that the statistics widely used in previous studies have indeed played a better role in predicting mortality. When predicting the length of hospital stay, range appears most often, and its effect is the best. In the disease prediction task, the most frequent statistic is std, but the measures that perform the best are statistics that reflect the central tendency.
Table 16 The optimal ten combinations of statistics obtained by the genetic algorithm and their prediction performances for disease prediction
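The Frequency, Mean_AUROC and Mean_AUPRC values discussed above follow directly from their definitions: for each statistic, count the optimal combinations in which it appears and average the scores of those combinations. A minimal Python sketch is given below; the function name `per_statistic_summary` and the example scores are made up for illustration and are not results from the experiments.

```python
from collections import defaultdict


def per_statistic_summary(results):
    """Aggregate the scores of the optimal combinations per statistic.

    `results` is a list of (combination, auroc, auprc) tuples, one per
    optimal combination. The return value maps each statistic to
    (frequency, mean_auroc, mean_auprc).
    """
    buckets = defaultdict(list)
    for combination, auroc, auprc in results:
        # Count each statistic at most once per combination.
        for stat in set(combination):
            buckets[stat].append((auroc, auprc))

    summary = {}
    for stat, scores in buckets.items():
        frequency = len(scores)
        mean_auroc = sum(a for a, _ in scores) / frequency
        mean_auprc = sum(p for _, p in scores) / frequency
        summary[stat] = (frequency, mean_auroc, mean_auprc)
    return summary


# Made-up scores for three combinations; the paper uses 25 per task.
results = [
    (["min", "max", "Q1"], 0.80, 0.40),
    (["min", "std"], 0.70, 0.30),
    (["max", "Q1", "skew"], 0.90, 0.50),
]
summary = per_statistic_summary(results)
```

Note that a statistic can have a high frequency but a modest mean score, or the reverse, which is why the two quantities are reported separately.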
To verify whether the combinations of statistics obtained in this paper can also obtain good prediction results using other classifiers, we select logistic regression, SVM and decision tree. We compare the prediction performance of the optimal combination of statistics and the commonly used combinations of statistics under different prediction tasks using multiple classifiers. Tables 20, 21, 22 and 23 show the results of the 72-hour, in-hospital, 30-day, and 1-year mortality prediction, respectively. Tables 24 and 25 show the results of the length of hospital stay and the disease group prediction.
In the task of mortality prediction, regardless of short-term, in-hospital or long-term prediction, from a horizontal perspective (across classifiers), the decision tree has poor prediction performance. The performance of SVM is similar to that of random forest, but its time complexity is high. Logistic regression is usually able to achieve a higher AUPRC. The time complexity of the random forest is low, and it obtains the best prediction performance in most cases compared to the other classifiers. This is why we choose the random forest as the classifier when calculating the fitness value in the genetic algorithm. Vertically (across combinations of statistics), patient representation based on the best combination of statistics achieves the best prediction results in most cases compared to the commonly used combinations of statistics.
A single statistic such as mean or first is less effective than a combination of multiple statistics. In the cases where the optimal combination of statistics does not achieve the best result, the combination [min, max, mean, std] often does. On the one hand, this shows that the statistical combinations obtained with random forest and the analysis of effective statistics are also applicable to other classifiers. On the other hand, it also reflects the soundness of commonly used combinations of statistics such as [min, max, mean, std].
In the length of stay prediction task, the MSE of random forest is much smaller than the MSE of logistic regression. The MSE corresponding to the optimal combination is smaller than that of the commonly used combinations, and much smaller than the MSE corresponding to a single statistic. In the disease prediction task, the optimal combination of statistics performs best only when the random forest is used as the classifier. When logistic regression and decision tree are used as classifiers, the performance based on the single statistic 'mean' is the best. Although the optimal combination of statistics does not achieve the best prediction performance with these classifiers, the random forest results show that the performance of mean and that of the optimal combination of statistics are not much different. This is also consistent with the conclusion that the statistic 'mean' plays an important role in disease prediction. In general, the effective statistical combinations based on random forest in this paper can also achieve better prediction results with other classifiers. This shows that the discussion of effective statistics under different prediction tasks in this paper generalizes well.

Conclusion
In this paper, we summarized 14 statistics that describe the characteristics of physiological time series, covering three aspects: the central tendency, dispersion tendency, and distribution shape. Then, we evaluated the performances of these summary statistics of physiological time series as features for clinical prediction tasks, including patient mortality, length of hospital stay and disease prediction. We performed experiments on patient representations based on both single statistics and commonly used combinations of statistics. To find the combinations of statistics with the best prediction performances under different tasks (limited by the high time complexity of global search), we used a genetic algorithm integrated with cross-validation to obtain the combinations of statistics with approximately optimal performances. A quantitative analysis was performed on each statistic in the optimal combinations. Through in-depth analysis of the experimental results, we have reached the following conclusions:
(1) As the prediction time becomes longer, the prediction performance becomes increasingly worse. Using data acquired only within 24 hours after the patient entered the ICU was insufficient to make reasonable long-term mortality predictions.
(2) Statistics that reflect the central tendency, such as mean and median, play an important role in almost all mortality prediction tasks.
(3) For short-term mortality prediction, statistics that show the dispersion tendency, such as min and max, are also representative. Cross-features such as range may contain more information.
(4) For the length of hospital stay prediction task, the statistics that reflect the dispersion tendency perform better. The length of hospital stay is closely related to the stability of the patient's physiological state: unstable patients have a higher probability of staying longer.
(5) For the disease prediction task, statistics that reflect the central tendency, such as the mean, make larger contributions to the prediction result. The mean, which represents the average level of each predictor, is significantly correlated with judgements concerning whether the patient's condition is due to a specific disease.
(6) Commonly used combinations of statistics such as [min, max, mean] and [min, max, mean, std] achieve good prediction results in most cases; thus, these experiments help to verify the rationality of previous research.
(7) Skew and kurt, which reflect the shape of a distribution, perform poorly when used individually as features for prediction, but they appear frequently in the optimal combinations, indicating that they can play a role as supplemental information.
Although we evaluated the effect of statistics of physiological time series under different prediction tasks, some limitations still exist. This paper considers the central tendency, dispersion tendency and distribution shape when choosing statistical features but does not fully consider latent characteristics, such as periodicity. Moreover, due to limitations in the sampling frequencies of some of the clinical predictors, the analysis of kurt and skew, which describe the shape of a distribution, was insufficient. Furthermore, these experiments were applied only to patient mortality, length of hospital stay and disease prediction. Research on other clinical tasks still needs to be performed. In future work, we plan to address the deficiencies of this study and design a more suitable patient representation method and model to improve the results of clinical task prediction.