We first designed an experimental paradigm to solve the RP recognition problem. Then, we collected the FVEP dataset. Next, we applied some of the feature engineering methods and the MCAC model to RP recognition.
Dataset
The FVEP dataset in this paper consists of four subdatasets: (1) a normal FVEP dataset of 5164 FVEP recordings collected from 1366 patients, each providing test results for both the left and right eyes twice; (2) an RP disease FVEP dataset of 1112 FVEP recordings collected from 278 patients, each providing test results for both the left and right eyes twice; (3) an abnormal FVEP dataset of apparently abnormal FVEP data and optic neuritis FVEP signals, totalling 800 items; and (4) an unlabelled FVEP dataset of 4000 FVEP recordings collected from 1000 patients. As shown in Fig. 2, the FVEP signals of normal people and RP patients differ. However, some RP FVEP signals are difficult to distinguish from normal signals, as in the example closest to the lower edge of Fig. 2. In addition, patients were 4-88 years old, with an average age of 42 years (see Fig. 3).
All four datasets were collected from the Ophthalmology Department of the First Affiliated Hospital of Army Medical University (Southwest Hospital) from July 1, 2012 to March 1, 2020. The equipment used for FVEP collection was the Espion E2, which was loaded with the Espion E2 system.
Feature engineering
With the development of deep neural networks, automatic feature extraction techniques are becoming increasingly mature. For example, recurrent neural networks and self-attentive networks can extract time-dependent features in sequences, and convolutional neural networks can automatically extract pattern or spatial features in images. In some studies of time series classification problems, manual features and deep features are used together to improve the classification accuracy [12, 13]. In this paper, the FVEP manual features extracted using domain knowledge are combined with the deep features extracted by the neural networks to obtain more useful information for prescreening RP diseases. The manual features alone have difficulty capturing complex disease patterns, while the deep features are more comprehensive than the manual features [14].
As shown in Fig. 4, feature engineering can be divided into three stages: exploratory data analysis, feature extraction and feature selection. First, exploratory analysis of the dataset is performed using visualization tools to find information that can be used. Then, time series features are extracted using domain knowledge with the help of the Time Series Feature Extraction Library (TSFEL). TSFEL includes over 60 different features extracted across the temporal, statistical and spectral domains. Finally, a feature selection algorithm is applied to the extracted features.
As shown in Fig. 4, time series feature extraction can be divided into three categories, including the temporal, statistical and spectral domains. The temporal domain method mainly extracts time-related features from time-series data, including autocorrelations, mean differences, and entropy. The statistical domain method is mainly used to extract features by statistical methods, including the maximum, minimum, median, histogram, etc. The spectral domain method is mainly used to convert the FVEP signal to the spectral domain for feature extraction, including the fast Fourier transform [15], FFT mean coefficient, wavelet transform [16], etc.
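To make the three domains concrete, the following minimal sketch computes one illustrative feature per domain with plain NumPy. It is not the TSFEL implementation, and the function name `example_features`, the synthetic signal and the exact feature definitions are assumptions made for illustration only.

```python
import numpy as np

def example_features(signal):
    """Compute a few illustrative features for a 1-D signal (not TSFEL's code)."""
    x = np.asarray(signal, dtype=float)
    # Temporal domain: mean of the absolute first differences.
    mean_abs_diff = np.mean(np.abs(np.diff(x)))
    # Statistical domain: maximum, minimum and median.
    stats = {"max": x.max(), "min": x.min(), "median": np.median(x)}
    # Spectral domain: mean magnitude of the real-input FFT coefficients.
    fft_mean_coeff = np.mean(np.abs(np.fft.rfft(x)))
    return {"mean_abs_diff": mean_abs_diff, **stats, "fft_mean_coeff": fft_mean_coeff}

# Hypothetical stand-in for an FVEP recording: a noisy sine wave of 320 samples.
rng = np.random.default_rng(0)
demo_signal = np.sin(np.linspace(0, 4 * np.pi, 320)) + 0.05 * rng.standard_normal(320)
features = example_features(demo_signal)
```

In practice a library such as TSFEL would produce many more features per domain; the sketch only shows what kind of quantity each domain contributes.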
In the feature selection stage, the variance filtering algorithm [17] and the Pearson correlation coefficient algorithm [18] are chosen. The variance filtering algorithm calculates the variance of each feature over the dataset and rejects the feature if its variance is below a threshold. By default, all zero-variance features are rejected, since a variance of 0 means that the feature takes the same value for every sample. The Pearson correlation coefficient algorithm measures the linear relationship between each feature and the labels, and rejects features whose coefficient is close to zero.
Figure 5 shows the correlation coefficient of the manual features. Some features were positively correlated with RP disease, and the others were negatively correlated with it. The FVEP of RP patients showed decreased amplitude and unchanged peak time in the P2 wave. The reason is that generalized retinal dysfunction in RP causes a much smaller input to the subsequent visual pathway, while the conduction time of the visual pathway is usually unaffected. Therefore, the P2 wave in RP patients displayed decreased amplitude and unchanged peak time compared to the normal control.
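The two selection steps can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `select_features`, both threshold values and the toy data are assumptions, and the labels are assumed to contain both classes.

```python
import numpy as np

def select_features(X, y, var_threshold=0.0, corr_threshold=0.1):
    """Return indices of kept features and the Pearson coefficient of each feature.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) labels.
    Both thresholds are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: variance filtering, reject features whose variance is not above the threshold.
    keep = X.var(axis=0) > var_threshold
    # Step 2: Pearson correlation between each surviving feature and the labels.
    corr = np.zeros(X.shape[1])
    Xc = X[:, keep] - X[:, keep].mean(axis=0)
    yc = y - y.mean()
    corr[keep] = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Reject features whose coefficient is close to zero.
    keep &= np.abs(corr) >= corr_threshold
    return np.flatnonzero(keep), corr

# Toy data: feature 0 tracks the label, feature 1 is constant, feature 2 is unrelated.
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_demo = np.column_stack([
    [0.1, 0.2, 0.0, 0.1, 0.9, 1.0, 0.8, 1.1],
    np.ones(8),
    [1, -1, 1, -1, 1, -1, 1, -1],
])
selected, coefficients = select_features(X_demo, y_demo)
```

Here only feature 0 survives: feature 1 fails the variance filter and feature 2 has a coefficient of zero with the labels.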
MCAC Model
Figure 6 presents our proposed multi-input neural network based on convolution and confidence branching (MCAC-Net). MCAC-Net has two inputs and two outputs: the inputs are the FVEP signal and 7 manually extracted features, and the outputs are the category and the confidence level. For the manual feature input, the features are extracted through a fully connected layer of 128 neurons. For the FVEP signal input, the waveform features are extracted by two branches, i.e., the global feature extraction branch and the local feature extraction branch. For global feature extraction, a global one-dimensional convolution whose kernel has the same size as the length of the FVEP signal is used. For local feature extraction, a one-dimensional convolution with a smaller kernel size combined with maximum pooling is used. Finally, the outputs of the three branches are concatenated and passed through a fully connected layer to extract features for classification and out-of-distribution detection. The network outputs a category together with a category confidence, and the category output is considered meaningful only when the category confidence is greater than a certain threshold. The construction details of the network blocks are described as follows:
Global convolution
In deep learning, RNNs were previously the main tool for capturing temporal patterns or features. However, RNNs by nature struggle with long time sequences and cannot be computed in parallel, which limits both computational speed and model performance. Convolutional structures, in contrast, support efficient parallel computation and capture features well [19]. In this paper, we adopt a \(\hbox{T}\times 1\) filter, called a global convolution, where T is the time length of the input FVEP signal and its value is 320. A global convolution extracts features from the entire sequence at once, and will capture the nontime-invariance (time-invariance) features in the time series. Each global convolution filter processes the entire input and returns a vector of size with a ReLU activation function. Integrating global convolutions will give the output. Each line of the output can be considered a representation of the entire time series.
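As a rough illustration of the idea, a filter that spans all T = 320 samples reduces to a single weighted sum over the signal. The sketch below assumes a single-channel input, no padding, random weights and 64 filters (the filter count is taken from the experimental setup); the function name `global_conv` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def global_conv(signal, filters, bias):
    """Apply filters that span the whole signal, followed by ReLU.

    signal: (T,), filters: (n_filters, T), bias: (n_filters,).
    With no padding, each filter sees the entire signal once and yields one value.
    """
    pre_activation = filters @ signal + bias
    return np.maximum(pre_activation, 0.0)

T, n_filters = 320, 64
rng = np.random.default_rng(0)
x = rng.standard_normal(T)                      # stand-in for one FVEP signal
W = rng.standard_normal((n_filters, T)) * 0.05  # random weights, for illustration only
b = np.zeros(n_filters)
global_features = global_conv(x, W, b)
```

Because every filter processes the whole sequence in one matrix product, all filters can be evaluated in parallel, which is the advantage over recurrent processing described above.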
Local convolution
Local patterns over shorter time spans are also relevant for the predictions. Therefore, MCAC-Net uses a local convolution in parallel with the global convolution to capture local features. The two convolutions have different focuses: the global convolution attends to the features of the time series as a whole, while the local convolution attends to the features of local segments of the sequence, and fusing the two kinds of features improves the classification performance. Unlike the global convolution, the local convolution filter is short. To extract the most representative features, MCAC-Net uses a one-dimensional maximum pooling layer. In this branch, MCAC-Net uses a filter length of 3, with three convolutional layers and three maximum pooling layers.
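The local branch can be sketched as three rounds of a short convolution followed by maximum pooling. In the NumPy sketch below, the pooling size of 2, the absence of padding, the ReLU activation, the random weights and all function names are assumptions; only the filter length of 3, the three conv/pool stages, the 64 filters per layer and the input length of 320 come from the text.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """1-D convolution without padding, followed by ReLU.

    x: (length, in_channels); kernels: (k, in_channels, out_channels); bias: (out_channels,).
    """
    k = kernels.shape[0]
    # windows has shape (length - k + 1, in_channels, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)
    out = np.einsum("tck,kco->to", windows, kernels) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, pool=2):
    """Non-overlapping max pooling along the time axis (a trailing remainder is dropped)."""
    length = (x.shape[0] // pool) * pool
    return x[:length].reshape(-1, pool, x.shape[1]).max(axis=1)

def local_branch(signal, weights):
    """Stack conv + pool stages on a single-channel signal."""
    h = np.asarray(signal, dtype=float)[:, None]
    for kernels, bias in weights:
        h = max_pool1d(conv1d_relu(h, kernels, bias))
    return h

rng = np.random.default_rng(0)
channels = [1, 64, 64, 64]
weights = [
    (rng.standard_normal((3, c_in, c_out)) * 0.1, np.zeros(c_out))
    for c_in, c_out in zip(channels[:-1], channels[1:])
]
local_features = local_branch(rng.standard_normal(320), weights)
```

Under these assumptions the time axis shrinks from 320 to 38 steps, so each remaining position summarizes a short segment of the original signal.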
Confidence branching
After global and local feature extraction, the resulting features are concatenated with the manual features to make predictions. Unlike general classification tasks, MCAC-Net outputs a category confidence c in addition to the category. When the category confidence is low, the category branch output is considered meaningless.
$$\begin{aligned} \text{p,c}={\mathcal{F}}\left( x,\theta \right) \quad {{\text{p}}_{i}},c\in \left[ 0,1 \right] ,\quad \sum \limits _{i=1}^{M}{\text{p}}_{i}=1 \end{aligned}$$
(1)
When we train MCAC-Net, the network is given hints: the softmax output \({{p}_{i}}\) is interpolated with the target label \({{y}_{i}}\), weighted by the confidence c.
$$\begin{aligned} {{\hat{p}}_{i}}=\text{c}*{{p}_{i}}+\left( 1-c \right) {{y}_{i}} \end{aligned}$$
(2)
Because the RP and normal samples in the dataset are imbalanced, focal loss is chosen as the loss function to reduce the impact of data imbalance on the classification performance. After the softmax output is replaced with the adjusted output, the new loss function of MCAC-Net is:
$$\begin{aligned} {{L}_{t}}= & {} -\underset{i=1}{\overset{M}{\mathop \sum }}\,( y*{{\left( 1-{{{\hat{p}}}_{i}} \right) }^{\gamma }}*\log \left( {{{\hat{p}}}_{i}}\right) \nonumber \\{} & {} +( 1-y )*{{{\hat{p}}}_{i}}*\log ( 1-{{{\hat{p}}}_{i}})) \end{aligned}$$
(3)
$$\begin{aligned} {{L}_{c}}= & {} -\log \left( c \right) \ \end{aligned}$$
(4)
$$\begin{aligned} \text{L}= & {} {{\text{L}}_{t}}+\lambda {{\text{L}}_{c}}\ \end{aligned}$$
(5)
First, in the focal loss function \({{\text{L}}_{t}}\), \({{\hat{p}}_{i}}\) is used instead of \({{p}_{i}}\). Then, a confidence loss \({{L}_{c}}\) is added to prevent the neural network from always choosing \(c = 0\) during training. Finally, the ratio of the two losses is controlled using \(\lambda\).
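The following NumPy sketch evaluates Eqs. (2) to (5) for a single sample, following Eq. (3) term by term as printed. The values of \(\gamma\) and \(\lambda\), the clipping constant `eps` and the function name `mcac_loss` are assumptions for illustration; the source does not state them.

```python
import numpy as np

def mcac_loss(p, c, y, gamma=2.0, lam=0.1, eps=1e-7):
    """Return (L, L_t, L_c) for one sample.

    p: softmax output, c: confidence in [0, 1], y: one-hot target.
    gamma, lam and eps are illustrative values, not taken from the paper.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    # Eq. (2): interpolate the prediction with the target, weighted by c.
    p_hat = c * p + (1.0 - c) * y
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    # Eq. (3): focal-style task loss on the adjusted output, as printed.
    l_t = -np.sum(
        y * (1.0 - p_hat) ** gamma * np.log(p_hat)
        + (1.0 - y) * p_hat * np.log(1.0 - p_hat)
    )
    # Eq. (4): confidence loss penalizes low confidence.
    l_c = -np.log(max(c, eps))
    # Eq. (5): weighted sum of the two losses.
    return l_t + lam * l_c, l_t, l_c

p_demo = [0.3, 0.7]   # the network favours the wrong class
y_demo = [1.0, 0.0]
full_conf = mcac_loss(p_demo, 1.0, y_demo)
half_conf = mcac_loss(p_demo, 0.5, y_demo)
zero_conf = mcac_loss(p_demo, 0.0, y_demo)
```

With c = 0 the task loss is almost zero regardless of the prediction, which is why the confidence loss is needed: it makes asking for hints costly, so the network lowers c only when it is unsure.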
Pretraining strategy
To utilize the unlabelled data in the dataset, this paper adopts a pretraining strategy for the local feature branch. As shown in Fig. 7, a convolutional autoencoder is built to automatically extract features from the FVEP signal. In the training phase, the local branch is first trained in an unsupervised manner on the training set combined with a large amount of unlabelled data. Then, the local branch is integrated into MCAC-Net and trained a second time with a low learning rate. The whole training process is shown in Algorithm 1.
Experimental setup
The experiment is divided into three parts. The first part compares classification performance: the performance of different neural network architectures without manual features, and the change in performance when manual features are added. The second part compares the out-of-distribution detection performance of the different models. The third part provides a visual analysis of the neural networks.
In the first part of the experiment, the MLP network has three fully connected layers with 128, 64 and 64 neurons in turn. For fully convolutional networks (FCN) [20], the fully connected layer after the convolutional layers is replaced with a global average pooling layer, which greatly reduces the number of parameters and avoids overfitting, and the output is finally passed through a softmax layer. Batch normalization and ReLU activation functions are also used to accelerate convergence and reduce overfitting. The three convolutional layers are one-dimensional convolutions with filter sizes of 5, 3 and 3 and 64 filters each. For ResNet [21], three residual blocks are stacked. Each residual block consists of three convolutional layers with filter sizes of 5, 3 and 3 and 64 filters each. For the CNN LSTM network [22], a dropout layer is added in the LSTM branch to reduce overfitting, and the number of neurons in the LSTM is set to 64. In the CNN branch, the three convolutional layers are followed by a global average pooling layer; the outputs of the two branches are then concatenated and passed through a softmax layer. For MCAC-Net, we first temporarily remove its manual feature branch; the resulting network is called CAC-Net. Unlike the Residual Network (ResNet), MCAC-Net uses maximum pooling for filtering to extract the more valuable features.
The three convolutional layers in the local feature extraction branch of MCAC-Net have filter sizes of 5, 3 and 3, in that order, and 64 filters each. The convolutional layer in the global feature extraction branch has a filter size of 320 and 64 filters.
In the second part, the out-of-distribution detection experiments, we compare three types of methods. First, traditional anomaly detection models, namely the local outlier factor (LOF) [23], one-class SVM [24] and minimum covariance determinant (MCD) [25], are applied to the features output by the last fully connected layer of MCAC-Net. These methods need a training set of the normal class; in this paper, the training set of the two classes from the classification experiment is used as the out-of-distribution detection training set. Second, the information entropy of the MCAC-Net output is calculated, and a sample whose entropy exceeds a certain threshold is treated as out-of-distribution. This method does not require additional training. Finally, the confidence-based algorithm needs a training set from the abnormal FVEP dataset in addition to the training sets of the two categories in the classification experiment.
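The entropy-based method can be illustrated with a few lines of NumPy. The threshold of 0.5 nats, the example probabilities and the function names are assumptions chosen for illustration; the source does not state the threshold it uses.

```python
import numpy as np

def softmax_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

def flag_out_of_distribution(probs, threshold):
    """Flag samples whose output entropy exceeds the threshold."""
    return softmax_entropy(probs) > threshold

# Two hypothetical network outputs: one confident, one maximally uncertain.
demo_probs = np.array([[0.99, 0.01],
                       [0.50, 0.50]])
ood_flags = flag_out_of_distribution(demo_probs, threshold=0.5)
```

For two classes the entropy ranges from 0 to ln 2 (about 0.693), so the uncertain second sample is flagged while the confident first one is not. No extra training is involved, since only the existing softmax output is used.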
Hyperparameters
The MCAC-Net2 model adopts the pretraining technique and sets the local branch learning rate to 0.001. The remaining parameters are kept consistent across all the models: the learning rate is set to 0.01, the batch size is set to 128 and the number of iterations is 50. All the neural network models have focal loss as their loss function, which is expressed as eq. For each model, we select the version with the lowest training loss as the best model on the training set and report its test set evaluation results.
Training and testing
The normal FVEP dataset and the RP FVEP dataset were grouped according to patient ID; data from 70% of the patients were randomly selected as the training set, and data from the remaining 30% of the patients were used as the test set. For the abnormal FVEP dataset, 30% of the data were randomly selected as the training set, and 70% of the data were used as the test set. The data of each patient include the FVEP signal, age and disease type.
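Splitting by patient ID rather than by recording keeps all recordings of one patient on the same side of the split. A minimal sketch of such a split is given below; the function name `split_by_patient`, the random seed and the toy patient IDs are assumptions, not the authors' code.

```python
import numpy as np

def split_by_patient(patient_ids, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices) so that no patient appears in both sets."""
    patient_ids = np.asarray(patient_ids)
    unique_ids = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_train = int(round(train_fraction * len(unique_ids)))
    # Every recording of a selected patient goes into the training set.
    train_mask = np.isin(patient_ids, shuffled[:n_train])
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)

# Toy example: 10 patients with 4 recordings each (both eyes, tested twice).
demo_ids = np.repeat(np.arange(10), 4)
train_idx, test_idx = split_by_patient(demo_ids)
```

Grouping in this way avoids leakage: recordings from the same patient are likely to be similar, so placing them in both sets would inflate the test results.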
Development environment
For the experiments in this paper, the computer is configured with an AMD 2600 CPU, a GTX 1070 Ti GPU and 16 GB of RAM. The data preprocessing, manual feature extraction and MCAC-Net models are run on the Windows 10 64-bit operating system, and the deep learning framework used is TensorFlow 2.0, run in an Anaconda environment.