A study of health effects of long-distance ocean voyages on seamen using a data classification approach

Background Long-distance ocean voyages may have substantial impacts on seamen's health, possibly causing malnutrition and other illness. Measures can possibly be taken to prevent such problems from happening through preparing special diet and making special precautions prior or during the sailing if a detailed understanding can be gained about what specific health effects such voyages may have on the seamen. Methods We present a computational study on 200 seamen using 41 chemistry indicators measured on their blood samples collected before and after the sailing. Our computational study is done using a data classification approach with a support vector machine-based classifier in conjunction with feature selections using a recursive feature elimination procedure. Results Our analysis results suggest that among the 41 blood chemistry measures, nine are most likely to be affected during the sailing, which provide important clues about the specific effects of ocean voyage on seamen's health. Conclusions The identification of the nine blood chemistry measures provides important clues about the effects of long-distance voyage on seamen's health. These findings will prove to be useful to guide in improving the living and working environment, as well as food preparation on ships.


Background
Ocean-going seamen are living on a ship with a confined environment for a long period of time. Such a special environment, often not most human friendly, may cause various changes in the human body for people who work and live there for an extended period of time [1]. Seamen may experience subtle changes in physiological [2][3][4][5] and psychological functions [6,7] in their bodies. Many studies have been conducted on different aspects of maritime health issues. It has been previously reported that ocean voyages could affect the human immune system and result in various illness [8,9]. Specifically, it is found that the risk of ischemic heart disease (IHD) lethality on board is much higher than that on land, based on an analysis of data of 124 seamen who died suddenly of myocardial infarction [10]. The diet and the lack of physical exercises, while living on ship, are believed to be the top two contributing factors to IHD [11]. Besides, cardiovascular disease is another serious maritime health problem [9]. We expect that these illnesses and/or changes in physiological conditions will be reflected by changes in some blood chemistry measures of the seamen. In this paper, we focus on the identification of the most significantly changed chemistry measures in the seamen blood samples that we have collected before and after their ocean voyage.
Statistical methods have been often used to analyze the health effects of long-distance ocean voyages on seamen [12,13]; however, simple statistical methods are often found to be inadequate for dealing with complex relationships among physiological and psychological functions in seamen bodies under study. In this paper, we present a computational study of this type of problem using an approach different from the traditional statistical methods. We consider the problem of identifying the health effects of ocean voyages on seamen as a classification problem, i.e., to classify blood chemistry measures that are consistently affected or not by ocean voyage, and apply a supervised machine learning method, specifically support vector machines (SVM), to solve the classification problem. Support vector machines have been widely used for classification problems and found to be particularly effective for discovering informative feature patterns for small data sets [14].
The identification of discriminant features (e.g., chemistry measures of blood) among pre-defined classes of objects is of fundamental and practical interest. By identifying relationships linking specific features (e.g., certain blood chemistry measures) and feature values with certain classes of objects such as diseases, one can possibly derive new insights about the disease and its development.
For our problem, we have used a feature selection method in conjunction with the SVM-based classifier, called recursive feature elimination (RFE), to find blood chemistry measures that show consistent and substantial changes caused by ocean voyage, which takes into consideration of mutual information between features in the feature selection process. This procedure has proved to work better than other correlation-based methods [14] for solving similar problems. We anticipate that our findings, in terms of identified blood chemistry measures with the most substantial and consistent changes in seamen bodies due to ocean voyage, will provide useful guiding information for food preparation and intake for seamen during ocean voyages and for better designing of a healthier living and working environment on ship.

Methods
Given is a collection of M seamen blood chemistry data x 1 , x 2 ,... x k , ... x m , with each x k representing a vector  of 41 blood chemistry measures for each seaman, either before or after the ocean voyage. Our goal is to identify a subset of the 41 measures that show substantial and consistent changes caused by the ocean voyage. We formulate this problem as follows.
For each x k (a seaman) in {x 1 , x 2 ,... x k , ... x m }, we assign a label y k = +1 or -1 to indicate if this data is measured before or after an ocean voyage. Our classification goal is to find a discriminant function F(x) on the set of 41dimentional vectors to best separate the samples labeled with +1 from the ones labeled with -1. A by-product in solving this classification problem is the identification of a subset of the 41 blood measures that can best distinguish between the subsets of samples with different labels, the ultimate goal of this study. Specifically, we intend to find a linear discriminant function as follows: that can best separate the two subsets with different y labels of the given seamen blood samples, where ω is a weight vector and b is a bias value, to be determined through finding the optimal discriminant function. For this problem, we have chosen to use an SVM approach coupled with an RFE procedure, to find the optimal discriminant function F(x), as well as a subset of blood measures that show substantial and consistent changes between the two subsets.

Support vector machine
SVMs are a class of machine learning techniques [15] widely used for solving classification problems like our problem. An SVM-based classification algorithm constructs a separating hyperplane between two classes of data samples like the ones mentioned above with different labels in the input space. The separating hyperplane is determined by a) mapping the input space into a higher dimensional feature space through a kernel function, and b) constructing in this feature space two maximal margin hyperplanes [16] to separate the mapped data samples in the higher dimensional space.
The goal in training an SVM is to find a separating hyperplane along with two parallel supporting hyperplanes, one on each side of the separating hyperplane, which give the margins of the data samples to the Figure 1 An example of a dataset in two-dimensional space. A dataset with two types of samples, represented by pentagram and four pointed star, respectively, where H is a separating hyperplane, and H 1 , H 2 are two supporting hyperplanes. separating hyperplane as large as possible (see Figure 1). As shown in Figure 1, the margin is equal to 2/||ω||; therefore, finding the hyperplanes that separate the data samples with different labels that have the maximal margin is equivalent to solving the following constrained optimization problem: with the constraints, where C>0 is a penalty parameter and ξ >0 is a slack variable. This problem is often represented in its dual form as follows: with constraints, where α's are the Lagrange multipliers. This formulation, where data vectors are present only in dot products in the feature space, makes the execution of the training algorithm simpler [15]. Specifically, the classification problem can be rewritten as, is the kernel function that maps a data vector in the input space to a higher dimensional feature space, and SVs are all possible support vectors on the parallel hyperplanes as mentioned above. For further details of support vector machines, we refer the reader to [16].

Recursive feature selection
Recursive feature elimination (RFE) is a feature-selection procedure, used in conjunction with the training of an SVM-based classifier [14], and has been applied in several microarray gene-expression studies [14,17] for feature selection and in industrial process analyses to determine the most essential process variables [18]. During the training of an SVM-based classifier, RFE is used to eliminate features that are "insignificant" to the performance of the classifier (in our case, blood measures showing no consistent and substantial differences between the before-and the after-sailing blood samples), which typically consists of three steps [14]: train a classifier using an SVM; compute the ranking for all features, based on some pre-defined criteria, and remove the feature with the lowest ranking.
The following function DJ() calculates the effect on the objective function (6) by removing a specific feature, which is used in our feature selection, where H is the matrix with elements y h y k K(x h , x k ), H (-i) is H with the i th feature removed, and K is a kernel function that measures the similarity between x h and x k We used a Gaussian radial basis kernel function K (x h , In this procedure, features are removed one at a time, and then the SVM will be retrained to update the new ranking of features. It should be noted that the top ranked features are not necessarily the ones that are individually most relevant to the classification performance since the relevance of features are evaluated in the context of other features, i.e., mutual information among features in terms of their collective discerning power are considered.

Data set
Seamen's Blood Chemistry Data 200 seamen are involved in this study, who are healthy Han Chinese males with age from 19 to 38 (mean 25.2 ± 4.9). Before and after a 3-month voyage, 5 milliliters of venous blood was collected from each of the 200 seamen after fasting for 12 hours, and then has been centrifuged for 5 minutes. Serum was collected and analyzed using a HITACHI 7600 modular full automatic biochemical analyzer, for 41 chemistry measures (as listed in Table 1). (The seamen blood chemistry data is available upon request). By removing those with any missing information, a total of 170 pre-sailing and 170 post-sailing samples have been complied with completed blood chemistry measures. In our computational study, each seaman sample is represented as a feature vector consisting of 41 blood chemistry measures.

Generation of training and test Datasets
Our goal is to identify a subset of blood chemistry measures among the 41 measures that show consistent and significant changes across the seamen's blood samples collected before and after sailing. We first split the whole dataset randomly into two subsets, one for training and another for testing. A training set is used to select features (blood chemistry measures) and find the right weights of the features so an optimal separating hyperplane between the two labeled subsets can be derived, while the test set is used to evaluate the effectiveness of the trained SVM mostly for its generality, where the evaluation criterion on the trained SVM is the sign function in Eq. (6).
We first mix all the blood chemistry data, both preand post-sailing data, into one set while keeping the "pre-" and "post-sailing" label (-/+) for each vector, and then we separate this dataset into a training and a test set, by randomly putting sample data into the two subsets. One key to establish a good training set is that it should capture all the varieties existing in our seamen samples. In order to accomplish this, we have used multiple training sets and the associated test sets to assess each trained classifier, and used a combined classifier based on all the trained individual classifiers (based on specific training set) using a majority-rule vote at the end. In order to obtain good training sets through random partition of the original dataset, we considered seven different ratios between the numbers of samples in the training and the test set, ranging from 1:1, 1.5:1, 2:1, 2.5:1, 3:1, 3.5:1, to 4:1, respectively. In total, we generated 300 pairs of training sets and associated test sets, and trained a classifier for each of the 300 training sets. Note that samples with the same or similar pre-and post-observed values but different labels are considered as noise affecting the performance. We checked our dataset to ensure that no two vectors have conflicting labels. Patients in this study all signed the Informed Consent Form; and this study has been approved by the Medical Ethics Committee in Chinese PLA General Hospital, China.

Results and Discussion
By running the SVM-RFE procedure outlined in the above section using the 300 training sets, we obtained 300 ranked lists of features in terms of their level of contribution to our classification problem as defined in Eq. (7). Each list includes a subset of top features that make the classifier have the best classification performance, which is measured using the overall prediction accuracy P = (TP+TN)/M, with TP, TN being the numbers of true positive and true negative, respectively, and M the total number of seamen in the test set. For example, in one of the 300 results, we obtained the best overall accuracy 98.3% when 10 features are selected and used as the most useful subset of features. The 10 features are ranked as shown in Table 2. We do the same on each of the 300 training sets, and got 300 subsets of features, each of which makes its respective classifier have the best prediction performance.

Majority-rule for selecting important features
As discussed above, different ways of partitioning the original dataset into training and test sets may lead to (somewhat) different performance by the trained classifier. We have generated 300 pairs of training and test sets, and trained an SVM-based classifier for each training set. To decide the ultimate subset of features to use for getting the best classifier, we have used a majorityrule voting strategy. The premise of this strategy is that the intrinsically important features will always be chosen by the best trained classifier, which should be Table 1 41 Blood chemistry measures used in this study   Index  1  2  3  4  5  6  7  8  9  10  11   Feature  TP  ALB  ALT  AST  TB  GLU  UN  Cr  UA  ALP  GGT   Index  12  13  14  15  16  17  18  19  20  21   independent of the specific sampling. The majority-rule voting process is described as follows: for each of the 41 features, we count the number of times when this feature is among the remained features for the ith training set, i = 1, 2, ..., 300. For example, feature MAO (32) is present in all the 300 subsets, so its count is 300, while the count of feature PHOS (17) is 286. After we get the count for each of the 41 features, we re-rank all the selected features based on this count, as shown in "Count" column of Table 3. By using the majority-rule, we found that 12 such features are chosen by at least 50% (majority) of the 300 trained classifiers, with their names along with their frequencies of being used across different trained classifiers being listed in Table 3. Then we examined different combinations of the 12 selected features to calculate the average prediction accuracy of the 300 classifiers, and the classification results of the best combinations are given in Table 3. Among these, the highest accuracy is 95.74% when all the 300 classifiers are trained on 9 features {MAO, PHOS, CK-MB, Ca, LDH, FRUC, K, Na, ALB}. These 9 features constitute the smallest subset of features that make the classifiers have the best prediction performance, so we select them as our selected subset of features [19,20].

Comparison with t-test-based feature selection
As a comparison, we have also used a paired t-test, a popular statistical method, to evaluate each feature independently in the seamen data to identify the ones that show substantial changes before and after the sailing. For each of the 41 blood chemistry measures, we have a null hypothesis that the measure has no substantial change before and after the sailing at a specified significance level α, say, α = 0.05. We use p(x) to represent the probability, under the null hypothesis, of observing x at this significance level. If p(x) >α, the test fails to  reject the null hypothesis, meaning that there is no significant change before and after the sailing; if p(x) <α, the null hypothesis will be rejected, indicating that the corresponding feature has significant changes before and after the sailing. We have calculated the p values for each of the 41 features; Table 4 shows the identified significant features using α = 0.05. We have compared the ranked lists of features by SVM-RFE and by paired t-test, and found that among the top nine features in the two lists, seven of them are common to both lists. The difference is mostly due to the way that the features are selected in the two methods, where SVM-RFE uses more global information and the paired t-test bases solely on individual features in dependent of others in their feature selections. We believe that the substantial smaller set of features that our method selected, compared to the other methods, provides a more focused and informative subset of features for further studies.

Clinical Implication
Our analysis above identified a number of blood chemistry measures with consistent and substantial changes across all the seamen surveyed before and after their ocean voyages. The top 9 features selected by SVM-RFE are MAO, PHOS, CK-MB, Ca, LDH, FRUC, K, Na and ALB. Table 5 summarizes the changes in their mean values and standard deviation between before-and after voyage. Based on the findings, we noted the following: Creatine Kinase (CK) is typically present in the cytoplasm and mitochondria of organs such as heart, muscle and brain; and it is directly related to cellular energy conversion, muscle contraction and regeneration of ATP. It reversibly catalyzes the phosphoryl transfer reaction between creatine and ATP. We found that after the occean voyage, the mean value of CK reduced to 142.46 U/L from 171.68 U/L measured before sailing. This is probably due to the lack of physical exercise by the seamen, which ultimately led to the reduced demand for energy of the body muscle, as well as the reduction of CK. As one of its isoenzymes, the mean value of CKMB is reduced to 6.2 U/L from 9.73 U/L, the variation tendency is the same with CK. Lactate dehydrogenase (LDH) is a key enzyme in the glycolytic process. It exists in virtually all organs, particularly in liver, kidney, cardiac muscle, skeletal muscle, pancreas and lung. In anaerobic conditions, the regeneration of NAD+ is completed by the reaction in which LDH catalyzes pyruvic acid to become lactic acid; and LDH can also catalyze lactic acid to become pyruvic acid, with hydrogen being transferred to its coenzyme to become NADH. The mean value of LDH changed from 185.87 U/L to 158.15 U/L from pre-sailing to post-sailing, which could be due to the reduction of physical exercise intensity of seamen during the ocean voyage, having led to the reduction of energy supplied by the glycolytic process except for the normal aerobic metabolism of the body.
ALB (albumin) is made by the liver, and decreased serum albumin may indicate liver diseases as well as kidney disease, which allows albumin to escape into the urine.
Decreased ALB could also be explained by malnutrition or a low protein diet [21]. The altered ALB levels often suggest changed liver metabolism. Increased protein intake during ocean voyage may be needed in the seamen's diet.
Fructosamine (FRUC) is a substance formed from plasma protein during the life cycle of glucose. Since the half-life for plasma protein is 17 days, the measured FRUC level reflects the blood glucose level generated by the food taken in the 1-3 weeks prior to the voyage. Hence the observed change of FRUC from 169.74 μmol/ L to 181.41 μmol/L after sailing probably reflects the type of food taken during the ocean voyage, which suggests that seamen should take less sugar during their future ocean voyage.
The observed change in the levels of inorganic ions, Ca, PHOS, K and Na may be caused by electromagnetism radiation and stress during the sailing. Though they cannot be used to diagnose specific diseases, their decreased levels generally indicate the poor state of the seamen's health. In the detected electrolyte, the level of serum sodium (Na) decreased obviously, possibly due to that Na ran off during sweating without being supplemented properly. In addition, calcium (Ca) is lacking in their diet, special measures should be adopted in their food preparation.

Conclusions
The possible effects of long-distance ocean voyage on seamen's health are receiving increasingly more attention in recent years. Living on ship with confined environments, ocean-going seamen could suffer from various health problems due to abnormal electromagnetism radiation, great temperature changes, poor diet structure, which may cause subtle changes in physiological and psychological functions in their bodies. In this study, we have used an SVM-RFE approach to identify important blood chemistry measures with significant and consistent changes before and after a voyage. A number of features have been identified to have such changes, such as MAO, PHOS, CK-MB, Ca and FRUC. Their identification provides important clues about how ocean voyage may affect seamen's health. Our findings could provide useful guidance for making necessary changes in their living environments, food preparation and exercise routines.