### Data source

The CTU-UHB database, obtained from PhysioNet, is the only and the largest open-access database of its type. It consists of 552 recordings obtained in the university obstetrics ward in Brno, Czech Republic. All recordings are 90 min in length and begin 90 min prior to delivery. The database contains simultaneous recordings of UC and FHR, both sampled at 4 Hz. The CTG recording covers 60 min of the first stage of labor and 30 min of the second stage [24].

Figure 1 shows the general scheme of the methodology followed in this study, which includes: labeling of the CTG signals, preprocessing, time-frequency representation, data preparation for model training, fine-tuning a pretrained ResNet50 model, and evaluation of model performance through validation and testing. There are several data pre-classification (labeling) criteria for CTG signal analysis; among them, labeling based on experts' visual annotation, the APGAR score [8], and pH-based annotation [15,16,17,18] are the most common. Since visual annotation by clinical experts and the APGAR score are subjective labeling criteria, both were disregarded, and the objective criterion, namely the pH value of neonatal umbilical artery blood measured shortly after birth [25], was used. After careful examination, a pH less than or equal to 7.15 was assigned to the pathological class and a pH greater than 7.15 to the normal class. Based on this pre-classification criterion, the database contained 439 normal and 113 distressed recordings.
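The pH-based labeling rule above can be sketched as a one-line threshold test. The record identifiers and pH values below are illustrative only; the real values come from the CTU-UHB clinical annotations.

```python
# pH-based labeling described in the text: pH <= 7.15 -> pathological.
PH_THRESHOLD = 7.15

def label_record(ph):
    """Assign a class label from the neonatal umbilical-artery pH."""
    return "pathological" if ph <= PH_THRESHOLD else "normal"

# Hypothetical (record_id, pH) pairs for illustration:
records = [("rec_a", 7.30), ("rec_b", 7.15), ("rec_c", 7.02)]
labels = {rec_id: label_record(ph) for rec_id, ph in records}
```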

### Preprocessing

In biomedical signal processing and analysis, preprocessing is the first and most important step before further analysis of the signal. The nature of the features used in training, and hence the final classification performance of the model, depends on the signal quality obtained after preprocessing [14].

In clinical practice the CTG signal is acquired by external sensors, so the FHR signal is subject to artifacts and spikes arising from maternal movement, from sensor displacement, which may cause the signal to drop to zero [11], and from other delivery-related factors [18]. The noise affecting the FHR signal ordinarily reveals itself as spikes, outliers, or missing values (periods where the FHR drops to zero). Before further analysis of the CTG signal, this noise is eliminated to obtain a reasonably better-quality signal and more accurate results. The main goal of this step is therefore to reduce the aforementioned noise by applying conventional preprocessing techniques used in FHR signal analysis [13, 26]; the steps are shown in Fig. 2.

First, the raw CTG signal containing the FHR and UC signals is obtained, and the FHR signal is extracted and passed to the next step. Then the long gaps, i.e., missing values lasting more than 15 s, are removed from the time series [20]. In addition, missing values at the beginning and end of the recording are excluded so that the signal starts from a stable point. Extreme FHR values below 50 bpm or above 200 bpm are considered outliers (not physiological) [17]. The outliers and small gaps are identified and linearly interpolated [13] using an algorithm provided by Matlab 2021a. Linear interpolation uses a linear polynomial to generate a new data point between two known points by curve fitting. For two known data points with coordinates \((x_{0}, y_{0})\) and \((x_{1}, y_{1})\), linear interpolation generates the new value of **y** using Eq. 1.

$$y = \frac{y_{0}\left(x_{1} - x\right) + y_{1}\left(x - x_{0}\right)}{x_{1} - x_{0}}$$

(1)
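Eq. 1 translates directly into a short function. The sketch below (function name and sample values are illustrative; the paper uses Matlab's built-in interpolation) shows a dropped FHR sample being filled halfway between two reliable beats.

```python
def lerp(x, x0, y0, x1, y1):
    """Linear interpolation of Eq. 1 between (x0, y0) and (x1, y1)."""
    return (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)

# Filling a missing FHR sample midway between 120 bpm and 140 bpm:
filled = lerp(1.5, 1.0, 120.0, 2.0, 140.0)  # midpoint of the two beats
```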

A sample point of the FHR signal that differs by more than 25 beats from the previous adjacent beat is not physiological and is an unreliable beat [27]; such unstable points reveal themselves as spikes in the FHR signal, so they are removed using cubic spline interpolation [13]. Cubic spline interpolation is a powerful and widely used method that interpolates a function between a given set of points by means of piecewise smooth polynomials [28]. A cubic spline f(x) interpolating on the partition x_{0} < x_{1} < ⋯ < x_{n−1}, with f(x_{k}) = y_{k}, is a piecewise polynomial function consisting of n − 1 cubic polynomials f_{k} defined on the intervals [x_{k}, x_{k+1}]. An example of cubic spline interpolation passing through 6 data points is shown in Fig. 3.
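A minimal sketch of this spike-removal step, assuming NumPy and SciPy are available (the paper's implementation is in Matlab, so the function name and threshold handling here are illustrative): samples jumping more than 25 bpm from the previous sample are flagged and refilled from a cubic spline fitted through the reliable samples.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def remove_spikes(fhr, max_jump=25.0):
    """Replace samples that jump more than `max_jump` bpm from the previous
    sample with cubic-spline interpolated values, as described in the text."""
    fhr = np.asarray(fhr, dtype=float)
    idx = np.arange(len(fhr))
    jumps = np.abs(np.diff(fhr, prepend=fhr[0]))
    good = jumps <= max_jump                    # samples considered reliable
    spline = CubicSpline(idx[good], fhr[good])  # spline through reliable samples
    fhr[~good] = spline(idx[~good])             # refill the spiky samples
    return fhr
```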

Finally, segments of the first 20 min (4800 samples) [12, 13, 18] and the last 15 min (3600 samples) [17, 29] were selected for further analysis, considering the signal lengths in the first and second stages of labor. Considering x(*i*) as the FHR signal with a sampling frequency of 4 Hz and a unit of beats per minute (bpm), where *i* = 1, 2, …, *N* and *N* is the number of sample points, the logic shown in Fig. 4 is performed in the preprocessing stage using Matlab 2021a.
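The two segment lengths follow directly from the 4 Hz sampling rate (20 min × 60 s × 4 Hz = 4800 samples; 15 min × 60 s × 4 Hz = 3600 samples). A sketch of this segmentation (function name is illustrative):

```python
FS = 4  # Hz, CTG sampling frequency

def segment_stages(fhr):
    """Return the first 20 min and last 15 min of the preprocessed FHR signal,
    i.e., the first- and second-stage segments described in the text."""
    first_stage = fhr[: 20 * 60 * FS]    # first 4800 samples
    second_stage = fhr[-15 * 60 * FS :]  # last 3600 samples
    return first_stage, second_stage
```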

In the preprocessing stage, the FHR signal is first extracted (Fig. 5b) from the CTG signal containing the FHR and uterine contraction signals (Fig. 5a). Once extracted, the FHR signal passes through the further preprocessing steps and is finally segmented by stage of labor into early-stage and final-stage segments for experiments one and two, as shown in Fig. 5c and d respectively.

### Data augmentation

One of the most frequent problems in deep learning is class imbalance, and data augmentation is one way of dealing with it. Bearing in mind the amount of data available, under-sampling the majority (normal) class was ruled out and over-sampling the minority (distressed) class was chosen for augmentation. Several types of time-series augmentation are inspired by 2D augmentation, such as jittering (injecting noise), rotation, scaling, and window slicing [30]. Slicing augmentation, the time-series analogue of the cropping augmentation used for images, is implemented in this work. The main idea underlying slicing is that the data are augmented by removing or adding time steps at the pattern's ends. Accordingly, the minority class was oversampled by shifting the slicing window backward by five minutes, two times. In this way, additional 20-min window slices for the first labor stage and 15-min window slices for the second labor stage were generated.
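One possible reading of this backward-shift scheme is sketched below: the window is slid back from the end of the recording in 5-min steps, yielding the original slice plus two shifted slices per minority-class record. The function name and the exact anchoring of the window are assumptions for illustration.

```python
FS = 4  # Hz, CTG sampling frequency

def slice_augment(fhr, window_min, shift_min=5, n_shifts=2):
    """Oversample a minority-class record by sliding a fixed-length window
    backward in `shift_min`-minute steps, `n_shifts` times (plus the
    original slice), as described in the text."""
    win = window_min * 60 * FS
    shift = shift_min * 60 * FS
    slices = []
    for k in range(n_shifts + 1):     # original slice + two shifted slices
        end = len(fhr) - k * shift
        slices.append(fhr[end - win : end])
    return slices
```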

### Time frequency representation of FHR signal using Morse wavelet

Various wavelet families are available for the CWT; among them, analytic wavelets (AW) have been widely employed for time-frequency analysis and representation of physiological signals such as the electroencephalogram (EEG) [31], electrocardiogram (ECG) [32], and electromyogram (EMG) [33]. An analytic wavelet is a complex-valued wavelet whose Fourier transform is zero at negative frequencies [34], and the generalized Morse wavelet is the most recent and best-known member of this family. This wavelet family is an ideal choice for analyzing non-stationary signals whose amplitude and frequency vary over time: it localizes transient and short-duration features and discontinuities, and provides combined time-frequency localization of time-varying amplitude and frequency [35].

Within the family of analytic wavelets, the generalized Morse wavelet in particular is called exactly analytic because, unlike other analytic wavelets, it has no leakage at negative frequencies [36]. Negative-frequency leakage in wavelets causes interference and degrades the transform result [33, 35]. Furthermore, the flexibility of the generalized Morse parameters makes it a superfamily that encompasses all other analytic wavelet classes [36, 37].

The generalized Morse wavelet has two parameters, exhibiting an additional degree of freedom in comparison with other AWs. It is represented as \(\varphi_{P,\gamma } \left( t \right)\) and defined in the frequency domain [36, 38] by Eq. 2

$$\varphi_{P,\gamma } \left( \omega \right) = U\left( \omega \right)\alpha_{P,\gamma } \omega^{{\frac{{P^{2} }}{\gamma }}} e^{{ - \omega^{\gamma } }}$$

(2)

where \(P^{2}\) is the time-bandwidth product, \(\gamma\) is the symmetry parameter, \(e\) is Euler's number \(\approx 2.71828\), \(U\left( \omega \right)\) is the unit step function, and \(\alpha_{P,\gamma }\) is a normalizing constant. Rather than the time-bandwidth product, \(\beta\) is employed as a decay or compactness parameter in several Morse wavelet applications, where \(P^{2} = \gamma \beta\) [35]. The equation of the Morse wavelet using \(\beta\) and \(\gamma\) is written as Eq. (3).

$$\varphi_{\beta ,\gamma } \left( \omega \right) = U\left( \omega \right)\alpha_{\beta ,\gamma } \omega^{\beta } e^{{ - \omega^{\gamma } }}$$

(3)
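Eq. 3 can be evaluated numerically in a few lines. The sketch below assumes the common peak-value normalization \(\alpha_{\beta,\gamma} = 2(e\gamma/\beta)^{\beta/\gamma}\), which makes the peak magnitude equal to 2 at the center frequency \(\omega_{p} = (\beta/\gamma)^{1/\gamma}\); other normalizations (e.g., unit energy) are also used in practice.

```python
import math

def morse_psi(omega, beta, gamma):
    """Frequency-domain generalized Morse wavelet of Eq. 3.
    Uses the peak-value normalization alpha = 2*(e*gamma/beta)**(beta/gamma),
    under which the magnitude at the peak frequency equals 2."""
    if omega <= 0:  # U(omega): the wavelet vanishes at non-positive frequencies
        return 0.0
    alpha = 2.0 * (math.e * gamma / beta) ** (beta / gamma)
    return alpha * omega ** beta * math.exp(-omega ** gamma)

# For gamma = 3 and P^2 = 60 (so beta = P^2/gamma = 20), the peak frequency is
# omega_p = (beta/gamma)**(1/gamma), where morse_psi evaluates to 2.
```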

By adjusting the \(\gamma\) and \(\beta\) parameters, the generalized Morse wavelet can take on a broad range of mother wavelets that has not yet been fully explored [35]. For instance, setting \(\gamma\) = 1 and \(\gamma\) = 2 yields other analytic wavelet families, namely the Cauchy and derivative-of-Gaussian wavelets, respectively [36].

The wavelet duration in time is proportional to \(P\), the square root of the time-bandwidth product, and determines the number of oscillations that fit into the envelope [37], whereas the symmetry parameter \(\gamma\) determines the symmetry behavior of the wavelet in the time domain [33].

When \({\upgamma }\) is set to 3, the skewness of the Morse wavelet obtained via demodulation is 0, and the wavelet exhibits the minimum Heisenberg area while remaining exactly analytic [35]. A wavelet with a large Heisenberg area gives poor time-frequency concentration [37], so setting \(\gamma\) = 3 yields the most symmetric and most Gaussian wavelets (the 'Airy' wavelet family), with minimum Heisenberg area [33]. Hence, the symmetry parameter value \(\gamma\) = 3 is used for the time-frequency representation of the FHR in this work.

As discussed previously, the time-bandwidth product determines the number of oscillations in the envelope [38]. So, for a fixed \(\gamma\) of 3, varying the time-bandwidth product \(P^{2}\) varies the oscillatory behavior of the wavelet. Therefore, based on the type of analysis and the behavior of the signal, one can adjust the Morse parameter values and examine their effect on the mother wavelet and on the frequency response of the filter bank. Figure 6 shows the effect of different time-bandwidth product values for a fixed \(\gamma\) of 3.

As shown in Fig. 6, as the time-bandwidth product \(P^{2}\) increases, the wavelet becomes more oscillatory within the envelope, which makes it narrower in frequency and more spread out in time. The frequency-response plot clearly visualizes these effects: for \(P^{2}\) = 4 the frequency response is widest compared with \(P^{2}\) = 60 and 120, while at \(P^{2}\) = 120 it is very narrow.

The Morse wavelet parameter values determine the time-frequency tradeoff in representing the FHR signal. For lower \(P^{2}\) values the wavelet transform has good temporal resolution but poor spectral resolution, whereas for higher \(P^{2}\) it has good spectral resolution but poor temporal resolution. For time-frequency analysis, the default time-bandwidth product of 60 is recommended [32]. Considering this tradeoff, extreme values of \(P^{2}\) were ruled out for this study, as the aim is to localize visible and hidden characteristics of the FHR jointly in time and frequency. So, time-bandwidth product values of 55 and 60 with symmetry parameter \(\gamma\) = 3 were carefully selected for the time-frequency representation of the FHR signal in this work. The Matlab 'cwtfilterbank' function is utilized for the time-frequency conversion of the FHR signal. The filter-bank parameters are adjusted considering the FHR signals and the wavelet parameters: a sampling frequency of 4 Hz, signal lengths of 4800 and 3600 samples, symmetry parameter \(\gamma\) = 3, time-bandwidth product values of 60 and 55, and 12 voices per octave.
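The narrowing of the frequency response with increasing \(P^{2}\) can be checked numerically by evaluating the Eq. 3 wavelet on a frequency grid and measuring its half-power width. This is a rough illustrative sketch, not the Matlab filter bank itself; the grid resolution, frequency range, and normalization below are assumptions.

```python
import math

def morse_psi(omega, beta, gamma):
    # Frequency-domain Morse wavelet (Eq. 3), peak-value normalized.
    if omega <= 0:
        return 0.0
    alpha = 2.0 * (math.e * gamma / beta) ** (beta / gamma)
    return alpha * omega ** beta * math.exp(-omega ** gamma)

def half_power_bandwidth(p2, gamma=3.0, n=20000, wmax=6.0):
    """Width of the frequency band where |psi| exceeds 1/sqrt(2) of its peak,
    estimated on a uniform grid of n points over (0, wmax]."""
    beta = p2 / gamma
    grid = [wmax * k / n for k in range(1, n + 1)]
    vals = [morse_psi(w, beta, gamma) for w in grid]
    thresh = max(vals) / math.sqrt(2.0)
    above = [w for w, v in zip(grid, vals) if v >= thresh]
    return above[-1] - above[0]

# Bandwidth shrinks as P^2 grows, matching the narrowing seen in Fig. 6:
widths = [half_power_bandwidth(p2) for p2 in (4, 60, 120)]
```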

### Adoption of pre-trained ResNet 50 model for transfer learning

Transfer learning refers to passing the weight values of a trained neural network to a new neural network, so that building and training a network from scratch is avoided [39]. Transfer learning makes updating and retraining a network considerably faster and easier than training from scratch. It permits the use of popular models that have already been trained on huge datasets to train models with less labeled data; some of the well-known CNN models used for transfer learning are AlexNet, GoogLeNet, VGG, OverFeat, ResNet, and Xception [40]. Different aspects of pre-trained networks are important to consider when selecting one for a specific purpose. Network accuracy, speed, and size are the most critical, and choosing a network is usually a compromise between these factors. Taking into account major considerations such as limited infrastructure (memory and GPU) and model accuracy, ResNet50 was selected for this study.

ResNet is a type of neural network first introduced by K. He et al. in 2015 [41]. It is an architecture designed to be deeper than all previous architectures, and it enables the successful training of very deep neural networks without being hampered by vanishing gradients [41]. ResNet counters the problem of vanishing gradients by introducing identity shortcut (skip) connections, indicated by X and by the curved arrows in Fig. 7. The identity connection lets the residual block reuse the input features of the preceding layer, adding them to the output of the current layer before feeding the next layer, as illustrated in Fig. 7.
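The identity shortcut of Fig. 7 can be sketched in a framework-free way as output = F(x) + x followed by the activation. The function names and the plain-list representation below are purely illustrative, standing in for the convolutional path of a real residual block.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    """Identity shortcut of Fig. 7: add the block input x to the output of
    the transformed path F(x) before the final activation."""
    fx = transform(x)  # the convolutional path F(x), here an arbitrary callable
    return relu([a + b for a, b in zip(fx, x)])  # F(x) + x, then activation

# With an all-zero transform the block reduces to the (rectified) identity,
# which is what makes very deep stacks of such blocks easy to optimize.
```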

ResNet is one of the most powerful deep neural networks and achieved exceptional generalization in the ILSVRC 2015 classification competition. ResNet also took first place in the ILSVRC and COCO 2015 contests for ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation [42]. ResNet50's design is divided into four stages. As shown in Fig. 8, the network uses 7 × 7 and 3 × 3 kernel sizes for the initial convolution and max-pooling, respectively, on an input image of size 224 × 224 × 3. Following that, Stage 1 of the network begins; it consists of three residual blocks, each with three layers, and the numbers of filters used to execute the convolution operations in the three layers of each Stage 1 block are 64, 64, and 256 [42].

### Data split scheme

The dataset was split by applying the well-known 80%/20% rule for training and testing sets respectively. Of the 20%, 10% was used for validation and 10% for testing. The training set is used to train the network, while the validation set is used to monitor model performance and fine-tune hyper-parameters during training. Finally, the test set is used once, to evaluate the performance of the final model.
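The 80/10/10 split can be sketched as a shuffled partition of record indices. The function name and fixed seed are illustrative; applied to the 552 CTU-UHB recordings, the split sizes follow directly from the stated percentages.

```python
import random

def split_indices(n, seed=0):
    """80/10/10 train/validation/test split of record indices, as described
    in the text (percentages of the full set, remainder to the test set)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for a reproducible split
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train : n_train + n_val], idx[n_train + n_val :]

train, val, test = split_indices(552)  # the 552 CTU-UHB recordings
```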