The increasing complexity of modern mechanical systems, especially rotating machinery, demands effective condition monitoring techniques, particularly deep learning, to predict potential failures in a timely manner and enable preventative maintenance strategies. Health monitoring data analysis, a widely used approach, faces challenges due to data randomness and interpretation difficulties, highlighting the importance of robust data quality analysis for reliable monitoring. This paper presents a two-part approach to address these challenges. The first part focuses on comprehensive data preprocessing using only feature scaling and selection via the random forest (RF) algorithm, streamlining the process by minimizing human intervention while managing data complexity. The second part introduces a Recurrent Expansion Network (RexNet) composed of multiple layers built on recursive expansion theories from multi-model deep learning. Unlike traditional Rex architectures, this unified framework allows fine-tuning of RexNet hyperparameters, simplifying their application. By combining data quality analysis with RexNet, this methodology explores multi-model behaviors and deeper interactions between independent variables (e.g., health and condition indicators) and the dependent variable (e.g., Remaining Useful Life (RUL)), offering richer insights than conventional methods. Both RF and RexNet undergo hyperparameter optimization using Bayesian methods under variability reduction (i.e., standard deviation) of residuals, allowing the algorithms to reach optimal solutions and enabling fair comparisons with state-of-the-art approaches. Applied to high-speed bearings using a large wind turbine dataset, this approach achieves a coefficient of determination of 0.9504, enhancing RUL prediction. This allows for more precise maintenance scheduling from imperfect predictions, reducing downtime and operational costs while improving system reliability under varying conditions.

In today's industrial landscape, the complexity of mechanical systems, particularly rotating machinery, poses significant challenges for effective condition monitoring and maintenance [1,2,3]. Traditional methods for failure prediction and prevention, such as threshold-based alarms, vibration analysis, and manual inspections, often struggle to keep pace with the demands of modern operations. These approaches are typically reactive, detecting faults only after significant wear has occurred, leading to unplanned downtime. Additionally, they tend to rely on fixed thresholds, which fail to account for varying operating conditions, resulting in false alarms or missed detections. As machinery becomes increasingly complex, these limitations highlight the need for more advanced, data-driven techniques that can adapt to evolving system behavior and improve predictive accuracy [4,5]. This complexity often leads to a higher incidence of unexpected failures, which not only disrupt production but also incur substantial costs [6]. Consequently, the need for advanced monitoring techniques has become critical in ensuring operational reliability and efficiency. Traditional approaches to condition monitoring often rely on heuristic methods that can be limited in their ability to adapt to the evolving nature of machinery [7]. These methods frequently face challenges such as the randomness inherent in vibratory data and the difficulties in accurately interpreting this data.
Such limitations underscore the necessity for more robust data quality analysis, as the reliability of monitoring systems hinges on the quality of the data being analyzed. Inadequate preprocessing and signal interpretation can lead to erroneous conclusions, resulting in missed opportunities for early intervention. To address these issues, there is a growing motivation to leverage sophisticated learning systems, particularly deep learning techniques, which offer enhanced capabilities for analyzing complex datasets [8]. Unlike traditional methods, deep learning models can learn hierarchical representations of data, enabling them to capture complex patterns that may elude conventional algorithms. This shift towards data-driven methodologies reflects a broader trend in predictive maintenance, where the integration of machine learning not only improves prediction accuracy but also facilitates understanding of the processes affecting machine health. Based on this analysis, several state-of-the-art studies have been selected to identify research gaps and highlight the need for the contributions made in this work. For instance, in [9], the authors propose a method for assessing the health of wind turbine high-speed shaft bearings and predicting RUL. They extract features from vibration signals, fuse them using a self-organizing feature map network, and construct a health indicator curve. The exponential degradation model, combined with Bayesian updates and the expectation–maximization algorithm, is used for RUL prediction. Validated with real-world data, the method outperforms support vector regression in prediction accuracy. The approach reduces human intervention in feature fusion but still requires human input in feature selection and may be limited by its reliance on an exponential degradation model. The study in [10] introduces a data-driven method for predicting wind turbine main bearing failures using only supervisory control and data acquisition (SCADA) data, avoiding the need for extra sensors. It leverages an artificial neural network-based approach to analyze healthy data, making it applicable even in wind farms without recorded faults. Tested on 12 turbines, the method predicts failures months in advance, aiding in maintenance planning. The advantages include cost-effectiveness and adaptability across varying conditions, though relying solely on SCADA data may limit deeper insights compared to specialized sensors. Some human intervention is also required for data preprocessing and management. The paper in [11] presents a novel approach to RUL prediction for wind turbine gearbox bearings, addressing the challenge of limited data in small or newly built wind farms. The main contribution is the use of prior knowledge from an empirical model to augment limited raw data before applying a deep neural network, specifically pre-interaction long short-term memory (PI-LSTM), to capture sequential degradation features. This method prevents overfitting and allows for better handling of interrupted time-series data. Fine-tuning with real data minimizes bias from the empirical model. The approach effectively balances accuracy and complexity, using deep learning to enhance prediction accuracy while reducing human intervention in data handling and feature engineering. However, some manual steps are needed for data augmentation and fine-tuning.
The paper in [12] presents a method for predicting the RUL of wind turbine rotating components using a health indicator (HI) derived from vibration signals. It emphasizes feature extraction from vibration data and employs principal component analysis to select optimal features while filtering out anomalies. By combining similarity and degradation models, the method improves accuracy in maintenance planning, allowing for timely repairs and reducing operating costs. While it effectively uses deep learning techniques, some human intervention is still needed for data preprocessing and feature selection, adding a layer of complexity. Overall, the approach enhances predictive maintenance in wind energy production. In [13], a new approach for monitoring the condition of main bearings in wind turbines is introduced. This research utilizes a Dual Attention-Based Bi-LSTM (DA-Bi-LSTM) model that features two attention mechanisms, allowing it to concentrate on the most pertinent input parameters and their historical data. This method addresses the shortcomings of traditional models that assign equal weights to condition parameters and fail to account for different operational characteristics. However, the model's dependence on historical data may require human intervention for data preprocessing, and its analytical data processing might hinder scalability when compared to fully machine learning-based methods. Overall, the DA-Bi-LSTM model offers a promising avenue for real-time condition monitoring within the wind industry, although its success may rely on the quality and availability of the input data. In [14], a neural network model is designed for real-time detection of bearing faults in wind turbines and is implemented on a Raspberry Pi to enhance the reliability of renewable energy systems. The model processes raw sensor data from both healthy and faulty bearings by segmenting it into smaller parts, allowing for rapid predictions. Although the approach achieves high accuracy and low latency, it has some limitations. Its reliance on historical data may require human intervention for preprocessing, and its small-scale design may struggle to handle the complexity of data, particularly under harsh conditions. In [15], a fault diagnosis strategy is proposed to enhance noise immunity in high-speed bearing vibration signals of offshore wind turbines, which are often affected by environmental and structural noise. This method combines a two-dimensional convolutional neural network (2DCNN) with a random forest (RF) classifier. Vibration signals are transformed into two-dimensional grayscale images for analysis. Techniques such as exponential linear unit (ELU) activation, batch normalization (BN), and dropout improve feature extraction and noise resilience. Experimental results show that the 2DCNN-RF model achieves high diagnostic accuracy on the CWRU high-speed bearing dataset and maintains strong performance in noisy conditions without retraining, making it a promising solution for fault detection in challenging environments. Table 1 summarizes key findings and methodologies from recent studies relevant to this research. The studies indicate that significant human intervention in data preprocessing and feature selection introduces complexity and potential biases, while extensive data collection can be impractical for smaller wind farms. Reducing human involvement and streamlining data processing are crucial for improving the efficiency and scalability of predictive maintenance systems.
Utilizing machine learning and representation models can automate feature extraction and RUL prediction, making the systems more robust and easier to implement across various operational settings. Accordingly, this paper, as depicted by a simplified diagram in Figure 1, introduces a novel two-part approach designed to tackle the challenges associated with condition monitoring in mechanical systems. The first part emphasizes comprehensive data preprocessing, employing techniques such as min-max scaling and feature selection through RF to streamline the analysis process [16,17,18]. This minimizes human intervention while effectively managing the complexity of the data. The second part of this study presents the Recurrent Expansion Network (RexNet), a multi-layered architecture built on recurrent expansion theories from multi-model deep learning [19]. RexNet is fundamentally built upon pretrained networks, starting with an LSTM network in this work. It then incorporates additional layers, first a single Rex layer, then two layers, and so forth [20]. This framework not only simplifies the application of RexNet by allowing for hyperparameter fine-tuning but also enhances the exploration of multi-model behaviors and interactions among variables, such as health indicators and RUL [21]. By integrating robust RF data quality analysis with the advanced capabilities of RexNet, this methodology offers insights that surpass those provided by conventional approaches. The optimization of both RF and RexNet hyperparameters through Bayesian methods ensures the attainment of optimal solutions and facilitates fair comparisons with existing state-of-the-art techniques [22]. The effectiveness of this approach is demonstrated through its application to high-speed bearings within a large wind turbine dataset, showcasing its potential to significantly enhance maintenance strategies and operational efficiency [23]. The remainder of this paper is divided into the following sections. Section 2 presents an overview of the dataset, highlighting its complexities and detailing the results of feature selection using random forest. Section 3 introduces the developed methods, with a focus on RexNet and its core learning principles. Section 4 provides a comprehensive discussion of the findings, and Section 5 concludes with insights into future research directions.

This section is dedicated to showcasing the materials used in this work. It introduces the datasets utilized, including their collection methods, key features, and characteristics, as well as elements related to their complexity. Additionally, it presents the results of feature selection, supported by important visualizations and detailed descriptions. The dataset used in this study was collected over approximately 65.3096 days from a commercial 2 MW wind turbine equipped with a condition monitoring system [23,24]. The turbine's shaft, driven by a 20-tooth pinion, operated at an average speed of 30.9 Hz (see Figure 2a). Data were acquired at 10 min intervals, resulting in 144 acquisitions per day. Each acquisition was sampled at 97,656 Hz for 6 s. A later inspection of the bearing revealed a crack in the inner race, as depicted in Figure 2b. In addition to the timestamp, the collected features included the health indicator (HI) value, particularly related to vibration, cage energy, ball energy, inner race energy, outer race energy, shaft tick, and rotational speed in revolutions per minute (RPM).
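As a minimal illustration of the preprocessing step described above, the following MATLAB sketch assembles the per-acquisition features and applies min-max scaling; the file name and variable names are hypothetical placeholders, not the authors' code.

% Hypothetical sketch: load the per-acquisition features listed above and
% apply min-max scaling; file and variable names are illustrative only.
T = readtable('turbine_hss_features.csv');            % timestamp, HI, energies, shaft tick, RPM
X = T{:, {'HI','CageEnergy','BallEnergy','InnerRaceEnergy', ...
          'OuterRaceEnergy','ShaftTick','RPM'}};
Xmin = min(X, [], 1);                                  % column-wise minima
Xmax = max(X, [], 1);                                  % column-wise maxima
Xs = (X - Xmin) ./ (Xmax - Xmin);                      % scale each feature to [0, 1]
% Equivalent built-in call: Xs = normalize(X, 'range');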
In the context of predicting the RUL of rotating machinery, key features such as energy, the HI, and the shaft tick are extracted through mathematical processing of raw sensor data. The shaft tick is calculated by first isolating the relevant frequency components using an envelope process, followed by spectral analysis to identify peak energy values corresponding to specific rotating components. This process is detailed in Equation (1), where the power spectral density ($PSD$) is computed using Welch's method, with $spc$ representing the power spectrum and $rng$ the frequency range of the component (e.g., shaft, cage, ball, inner race, outer race). The HI, a crucial metric for assessing machinery condition, is computed as the weighted norm of multiple condition indicators (CIs). This involves normalizing the CIs and applying a weighting matrix derived from their inverse covariance, as shown in Equation (2). Here, $CI_{wn}$ represents the whitened, normalized array of CIs, and $\omega$ is the weighting matrix based on the Jacobian. Lastly, energy is determined by extracting the maximum PSD values within the defined frequency range, as detailed in Equation (3).

$Eng = \max(spc_{rng})$  (1)

$HI = CI_{wn}^{T}\,\omega\,CI_{wn}$  (2)

$Energy = \max(PSD(rng))$  (3)

Figure 3 provides an overview of these raw data variables and their variability over time. An increase in inner race energy, shaft tick, and HI values is observed around March 19th (shown in Figure 3a, b, and d, respectively). This indicates the onset of fault development and the beginning of damage propagation in the bearings, likely due to windstorms occurring around that time. With respect to RPM, the load on the gearing is proportional to RPM. From approximately 23 March to 1 April, the fault progression can be observed. During this period, the RPM was low, indicating minimal load on the system, and a corresponding drop in HI values was noted. This suggests that the fault became less severe as the load decreased. In this case, it would also be valuable to have a load metric based on RPM to enhance the analysis [26,27,28]. Such a metric could likely be derived from a wind turbine power table, allowing for a more detailed understanding of how load affects fault propagation. It is important to clarify that the dataset employed in this study captures realistic degradation from a healthy state to complete failure, specifically due to bearing inner race faults, which is a common failure mode in rotating machinery. This degradation process is not merely a simulation of wear but reflects a real-world scenario where machinery gradually deteriorates over time until it reaches a point of failure. This makes the dataset particularly suitable for both RUL prediction and condition monitoring.
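To make Equations (1)-(3) concrete, the following MATLAB sketch computes a component's band energy from a raw vibration segment with Welch's method and evaluates the HI as a weighted norm of the CIs. The signal x, sampling rate fs, band limits fLow and fHigh, whitened CI vector CIwn, and weighting matrix omega are assumed inputs for illustration, not values from the paper.

% Minimal sketch of Equations (1)-(3); x, fs, fLow, fHigh, CIwn, and omega
% are assumed inputs rather than values taken from the paper.
[spc, f] = pwelch(x, [], [], [], fs);   % Welch power spectral density
band     = f >= fLow & f <= fHigh;      % frequency range rng of the component
Eng      = max(spc(band));              % Equations (1) and (3): peak PSD in the band
% Equation (2): HI as the weighted norm of the whitened, normalized CI
% array CIwn (column vector); omega is the weighting matrix that the paper
% derives from the inverse covariance of the CIs.
HI = CIwn.' * omega * CIwn;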
While condition monitoring is typically viewed as the continuous assessment of machinery's operational status to detect early signs of wear or malfunction, our study goes beyond this by integrating RUL estimation. RUL prediction specifically aims to estimate the time remaining before a component or system fails. The dataset we use includes the entire lifecycle of degradation, starting from the early stages of wear to the final failure, allowing our model to track this progressive decline and provide predictions about when failure is likely to occur. This integration of RUL and condition monitoring is critical for practical applications, as it offers a comprehensive view of machine health. Condition monitoring alone may detect anomalies or operational deviations, but RUL estimation adds value by predicting the timeframe in which these deviations will lead to failure. In our approach, the model not only monitors the machinery's current condition at various stages of degradation but also forecasts how much longer the system can operate before it fails completely. This dual focus addresses both short-term monitoring and long-term maintenance planning. In this study, the RF algorithm is utilized, with hyperparameters tuned using Bayesian optimization and an objective function based on the root mean squared error (RMSE) to assess feature importance [29,30]. Key hyperparameters considered include the number of trees, minimum leaf size, maximum number of splits, and the number of predictors to sample. A permutation-based procedure is used to extract the best features [18,31]. The hyperparameter results for the RF regressor in Figure 4 indicate a well-balanced model configuration. The number of trees used in the forest is 473, which enhances the model's ability to capture complex patterns while improving overall performance. The minimum leaf size is set to 14, ensuring that each leaf contains at least this number of samples, which helps prevent overfitting and promotes better generalization to unseen data. The maximum number of splits is 200, allowing for substantial complexity in the decision trees, enabling them to capture detailed interactions within the data. Finally, the number of predictors to sample at each split is 7, which introduces randomness and diversity among the trees, further reducing correlation and improving the robustness of the model. Overall, these hyperparameter values suggest a thoughtful approach to balancing complexity and performance. In Figure 5, the analysis suggests that the RPM load on the gearing has the most significant influence on fault growth, closely followed by the shaft tick. However, it is important to note that RPM is non-linearly related to torque, and since these are doubly-fed induction machines, at lower wind speeds the output shaft operates at a lower RPM due to the cubic relationship of power to wind speed (i.e., $P \propto V^{3}$, where $P$ is the power the wind turbine can produce, $V$ is the wind speed, and the cube exponent indicates that a small increase in wind speed leads to a much larger increase in power output). Shaft tick, while not a direct fault indicator, is a useful metric of mechanical looseness due to wear and indicates that the configuration for envelope analysis is correctly set.
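As a sketch of the tuned random forest configuration reported in Figure 4, the MATLAB fragment below trains a TreeBagger regressor with the reported values (473 trees, minimum leaf size 14, at most 200 splits, 7 predictors per split) and reads off permutation-based importance. In the paper these values come from Bayesian optimization of the RMSE objective, which is omitted here; the predictor matrix Xs and response vector rul are assumed inputs.

% Random forest with the hyperparameters reported in Figure 4; in the
% paper these values are found by Bayesian optimization (omitted here).
rf = TreeBagger(473, Xs, rul, ...
    'Method',                 'regression', ...
    'MinLeafSize',            14, ...
    'MaxNumSplits',           200, ...
    'NumPredictorsToSample',  7, ...
    'OOBPredictorImportance', 'on');
imp = rf.OOBPermutedPredictorDeltaError;   % permutation-based feature importance
% Larger imp values mark the more influential features (here: RPM, shaft tick, HI).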
The HI, which is a whitened, fused representation of the four condition indicators (CIs) from various components (cage, ball, inner race, and outer race), ranks next in importance, as expected due to its holistic nature. The lower significance of the individual energy features of the bearings, such as the cage, ball, inner race, and outer race energies, can be attributed to their aggregation into the HI. The fault propagation primarily occurs during higher loads, typically associated with higher wind speeds, where high-cycle fatigue is most prevalent. By focusing on the key metrics of RPM load on the gearing, shaft tick, and HI indicators, this analysis captures the most influential factors contributing to fault growth, supporting more effective monitoring and maintenance strategies for high-speed bearings in wind turbines. In this work, our approach, termed RexNet, is inspired by the recurrent expansion algorithm, which facilitates the training of multiple learning models in a sequential manner [19]. In this framework, each subsequent model learns from the outputs of the previous models, creating a continuous chain of learning. This method leverages the dataset's inputs, the feature mappings from the hidden layers of deep learning models, and their estimated targets, collectively referred to as IMTs (Inputs, Mappings, Targets). In this context, referring to Equation (4), let $x$ denote the inputs, $\varphi_k(x_k)$ the mappings produced by the deep network at training round $k$, and $\tilde{y}_k$ the estimated targets for that round. The variable $k$ signifies the current training stage, while $n$ represents the number of networks trained at each round. The innovative aspect of Rex is that at stage $k+1$, the model not only learns from the data representations $\varphi_k(x_k)$ but also from the model behaviors, specifically the interactions between the initial inputs $x_0$ and the responses $\tilde{y}_k$ produced by the previous models. This approach enables Rex to acquire a deeper understanding of the learning process by incorporating knowledge derived from both the input representations and the performance of prior models. Unlike traditional deep learning models, which typically rely solely on the input data $x_0$, the Rex methodology facilitates a more nuanced learning experience. By building on the insights gained from earlier rounds, Rex captures complex relationships and dynamics within the data, ultimately leading to improved predictive performance and robustness in various applications.
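A minimal sketch of one expansion round as described above (formalized in Equation (4) just below): the inputs for round k+1 concatenate the original inputs, the previous network's hidden mappings, and its estimated targets. The trained network net_k, the layer name 'lstm', and the features-by-timesteps data layout are assumptions for illustration, not the authors' implementation.

% One recurrent expansion round (cf. Equation (4) below); net_k is a
% previously trained LSTM network, X0 and Xk are numFeatures-by-numTimesteps
% matrices, and the layer name 'lstm' is an assumed identifier.
ytilde_k = predict(net_k, Xk);                 % estimated targets of round k
phi_k    = activations(net_k, Xk, 'lstm');     % hidden-layer mappings of round k
Xk1      = [X0; phi_k; ytilde_k];              % IMTs stacked as the new input rows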
$x_{k+1} = [\,x_0,\ (\varphi_k(x_k))_{i=1}^{n},\ (\tilde{y}_k)_{i=1}^{n}\,]$  (4)

In RexNet, the process is streamlined compared to traditional deep learning methods. Instead of treating each previously trained deep network as a direct input to the next one, RexNet treats these networks as additional layers for the subsequent models. This approach simplifies the fine-tuning process by allowing new networks to refine the learned features directly, without having to reprocess the entire deep network structure from the earlier rounds. As depicted in Figure 6, each round of training involves only one deep network, which is then treated as a layer within the broader RexNet framework. This means that after each round, the newly trained network becomes an integrated layer, contributing to the overall model in a sequential manner. In essence, RexNet evolves layer by layer, with each deep network acting as a self-contained component or layer. The formula for the Rex layer can be represented as Equation (5), where $k$ denotes the number of layers in RexNet. Each layer $k$ is a trained deep network, and as the number of layers increases, the model's complexity and learning capacity also expand. This structure allows RexNet to progressively accumulate knowledge and refine its predictive power across multiple rounds, leading to improved performance with each iteration. By treating each network as a modular layer, RexNet achieves a balance between model complexity and training efficiency.

$x_{k+1} = [\,x_0,\ (\varphi_k(x_k))_{i=1}^{n},\ (\tilde{y}_k)_{i=1}^{n}\,]$  (5)

The deep networks in the RexNet layers comprise an input layer, a Rex layer, a long short-term memory (LSTM) layer, a dropout layer, a fully connected layer, and a regression layer [20]. The hyperparameters for the model include the number of hidden units, maximum epochs, mini-batch size, initial learning rate, gradient threshold, L2 regularization, dropout rate, recurrent dropout rate, state activation function (as a character array), gate activation function, and sequence length. These hyperparameters are fine-tuned to ensure optimal performance. The objective function ($obj$) for the Bayesian optimization is defined in Equation (6); it serves to reduce variability in predictions (the standard deviation $\delta_{\tilde{y}_k}$), thereby narrowing the confidence interval width, and it also targets prediction accuracy by adding a term related to the root mean squared error, where $N$ refers to the number of observations.

$obj = \delta_{\tilde{y}_k} + \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(\tilde{y}_{k,i} - y_{k,i}\right)^{2}}$  (6)

LSTM is specifically designed to capture long-term dependencies in sequential data, effectively mitigating the vanishing gradient problem.
It is composed of several crucial elements: the cell state $C_t$, which retains information across time steps; the input gate $i_t$, which is responsible for controlling the addition of new information to the cell state; the forget gate $f_t$, which determines which information to discard; and the output gate $o_t$, which controls the output based on the cell state. The updates to the cell and hidden states are governed by Equations (7)–(12), where $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent activation functions, and $W$ and $b$ denote the weights and biases, respectively. These equations enable LSTMs to maintain and adjust information over extended sequences, making them highly effective in tasks such as time series forecasting and natural language processing.

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (7)

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (8)

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (9)

$C_t = f_t C_{t-1} + i_t \tilde{C}_t$  (10)

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$  (11)

$h_t = o_t \tanh(C_t)$  (12)

In traditional analytical data processing, the initial step typically involves min-max normalization, which scales the data to a specific range, ensuring that all features contribute equally to the analysis. After this normalization, the process is streamlined through RF-based feature selection, which automatically identifies the most relevant features, reducing dimensionality and improving model performance with minimal human intervention. Unlike conventional methods that often rely heavily on expert knowledge in data and signal processing, RexNet's architecture handles the complexity autonomously. By allowing each model to build on the knowledge of previous models, RexNet captures intricate patterns and relationships within the data without requiring extensive manual input. This reduces the need for specialized expertise while effectively tackling data complexity, enabling the model to learn both data representations and interactions, thus improving performance on complex datasets. This section presents the results of RUL prediction and the performance of the models. Generally, MATLAB's integrated functions and class layers are compiled in binary form, optimizing their execution speed within the MATLAB environment. However, the RexNet layer, being a custom-designed function stored in a separate code repository, significantly increases the computational time.
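For reference, the round-wise layer composition described earlier (input, Rex, LSTM, dropout, fully connected, and regression layers) can be sketched in MATLAB as follows. Here rexLayer stands in for the custom Rex layer wrapping the pretrained previous-round network; it is a placeholder class name, not a built-in, and the hyperparameter values shown are arbitrary examples of those tuned by Bayesian optimization.

% Sketch of one RexNet round's layer stack; rexLayer is a placeholder for
% the custom Rex layer, and all hyperparameter values are illustrative.
numFeatures = 7;    % e.g., the selected condition monitoring features
numHidden   = 12;   % illustrative hidden-unit count
layers = [ ...
    sequenceInputLayer(numFeatures)
    rexLayer('rex')                      % assumed custom class wrapping the pretrained round-k network
    lstmLayer(numHidden, 'OutputMode', 'sequence')
    dropoutLayer(0.2)
    fullyConnectedLayer(1)
    regressionLayer];
opts = trainingOptions('adam', ...
    'MaxEpochs',         500, ...
    'MiniBatchSize',     64, ...
    'InitialLearnRate',  1e-2, ...
    'GradientThreshold', 1, ...
    'L2Regularization',  1e-4, ...
    'Verbose',           false);
% net = trainNetwork(XTrain, YTrain, layers, opts);   % training call; data assumed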
Despite using a system equipped with an Intel i7 processor, 16 GB of RAM, and 12 MB of cache memory, we encountered high execution times, which led us to limit the RexNet model to two layers. It is important to note that our experiments were conducted on standard, commercially available laptops to ensure accessibility for a wider range of researchers, rather than relying on high-performance computing systems. During our tests, the laptop forcibly shut down due to overheating, even though the system was not intentionally stopped. This highlights the hardware limitations we faced, and at this stage, running models with more than two layers is not feasible with our current setup. However, we plan to explore models with more than two layers in future experiments when access to high-performance computing (HPC) systems or more robust computational resources becomes available. This section presents the results in a step-by-step manner to provide a comprehensive understanding of the model's performance. We begin by analyzing the learning behavior through the examination of the loss function and related convergence metrics. Next, the focus shifts to the curve-fitting results of the studied methods, specifically LSTM and its variant with recurrent expansion (RexNet), featuring multiple layers. Following this, the residuals of RUL predictions are analyzed, with particular attention given to early- and late-stage predictions. A normality distribution analysis is performed, and the results are thoroughly discussed. Finally, regression metrics, including the RMSE, MAE, MSE, and $R^2$ as in Equations (13)–(16), are presented for both training and validation, alongside training time. The hyperparameters obtained through Bayesian optimization of the learning model are also showcased and discussed in a comparative manner. The section concludes with a discussion of the advantages and limitations of the studied approach, offering insights into its practical applications.

$RMSE = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^{2}}$  (13)

$MAE = \dfrac{1}{n}\sum_{i=1}^{n}\left|y_i - \tilde{y}_i\right|$  (14)

$MSE = \dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^{2}$  (15)

$R^2 = 1 - \dfrac{\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$  (16)

In Figure 7, we analyze the training behavior of the models, comparing their loss function curves and the Area Under the Loss Curve (AULC) for deeper insights. In Figure 7a, the RexNet with one LSTM layer and one additional layer of pretrained LSTM from IMTs demonstrates more stable convergence compared to the standard LSTM. This stability is reflected in RexNet's smooth and gradual reduction in loss, showing fewer oscillations or abrupt changes as the training progresses. In contrast, LSTM exhibits an exponential-like convergence behavior but struggles more during the early stages, indicating some difficulty in stabilizing. Despite this, the AULC value for LSTM is lower, as shown in Figure 7b, which suggests that LSTM converges faster overall, although less smoothly. For RexNet with two layers (one layer of pretrained LSTM from IMTs, one layer of previously pretrained RexNet IMTs, and one additional fine-tuning layer of LSTM), the training behavior further improves.
This model achieves both faster and more stable convergence, exhibiting a rapid reduction in loss combined with fewer fluctuations. Additionally, the AULC for this model is the lowest among all models, signifying that it not only converges quickly but also consistently minimizes the loss throughout the training process. This behavior implies that the two-layer RexNet model benefits from leveraging pretrained components and the added complexity of the extra LSTM tuning layer. The pretrained LSTM and RexNet IMT layers help capture useful patterns more effectively, while the additional tuning layer ensures that the model fine-tunes its predictions. The result is both efficient learning (faster convergence) and stable optimization (less fluctuation in the loss curve). This combination suggests that the two-layer RexNet model has both higher learning efficiency and greater adaptability to complex data, making it superior in both convergence speed and accuracy. In Figure 8, the plots show the RUL comparison and residual analysis for both the training and testing sets, comparing the LSTM model and two RexNet variations (one-layer and two-layer). In Figure 8a, the training RUL curve fits indicate that all models generally track the actual RUL, but with varying precision. The LSTM model has an $R^2$ value of 0.9057, showing reasonable accuracy but with noticeable deviations from the ideal RUL. RexNet with one layer performs better, with an $R^2$ of 0.9320, suggesting greater stability and accuracy. The two-layer RexNet outperforms both with an $R^2$ of 0.9706, indicating minimal error and the closest fit to the ideal RUL during training. In Figure 8b, the testing set results follow a similar trend. LSTM achieves an $R^2$ of 0.9050, maintaining consistency but underperforming compared to the RexNet models. RexNet with one layer achieves an $R^2$ of 0.9179, while the two-layer RexNet excels with an $R^2$ of 0.9504, demonstrating its superior accuracy and better generalization to unseen data. The residual analysis in panels (c) and (d) further highlights the models' performance. For the training set, LSTM shows larger residual variability, indicating less precise predictions of RUL. RexNet with one layer reduces this variability, while the two-layer RexNet minimizes the residuals, confirming its superior training fit. On the testing set, the LSTM model continues to display more residual variability, while both RexNet models achieve smaller residuals, with the two-layer RexNet exhibiting the least variability. This confirms its robust generalization and accuracy when predicting RUL on new data. In Figure 9, the residual distributions for both the training (a) and testing (b) sets illustrate how the RexNet models outperform LSTM in terms of prediction accuracy for RUL [32]. The distributions for the RexNet models are much tighter, with the two-layer RexNet having the narrowest distribution and the highest probability concentrated around zero residuals. This indicates that RexNet 2 consistently produces predictions that are closer to the true RUL values, enhancing the precision of the model. Moreover, the residuals in the RexNet models, particularly RexNet 2, are more concentrated in the early prediction phase. This behavior means the model places greater emphasis on accurate predictions during the initial stages, while late predictions are penalized more heavily. This is beneficial in RUL prediction, as it reduces the risk of significant errors during critical later stages of equipment degradation.
By minimizing late prediction errors, the model helps prevent unexpected system failures, ensuring more reliable maintenance scheduling and reducing the potential damage caused by inaccurate planning. This characteristic enhances the overall robustness of the maintenance strategy, making it more effective in real-world scenarios. Table 2 presents a comprehensive numerical evaluation of the performance metrics across three models: LSTM, RexNet with one layer, and RexNet with two layers. These metrics include the RMSE, MSE, MAE, $R^2$, and computational time in minutes. From the results, we observe that the RexNet model with two layers consistently outperforms the other models across all key error metrics. It achieves the lowest RMSE (3.2734 for training and 4.1353 for testing), MSE (10.7153 for training and 17.1008 for testing), and MAE (2.3497 for training and 3.1719 for testing). This demonstrates that RexNet with two layers produces more accurate RUL predictions, as reflected by its higher $R^2$ values (0.9706 for training and 0.9504 for testing), indicating a better fit between predicted and actual RUL values compared to LSTM and one-layer RexNet. Accordingly, RexNet demonstrated significant advantages over conventional models. RexNet's superior performance in predictive accuracy is particularly noteworthy in the context of RUL prediction, where small errors can lead to costly maintenance decisions. The RexNet model, especially with two layers, demonstrates stronger generalization capabilities compared to conventional LSTM models. This is largely due to RexNet's ability to capture more complex patterns and relationships in the data, which is critical for accurately predicting the non-linear degradation behaviors often seen in industrial systems, specifically rotating mechanical systems. The key advantages of RexNet over traditional LSTM models include the following.

Enhanced feature learning: RexNet's design allows for better extraction of hierarchical features, particularly in multi-layer configurations. This contributes to higher prediction accuracy, as evident from the lower RMSE, MSE, and MAE values.

Improved generalization: With higher $R^2$ values, RexNet shows a better fit between predicted and actual RUL values, indicating superior generalization across different datasets and operating conditions.

Scalability: RexNet, due to its modular structure, can be scaled to more layers, potentially leading to even better performance as more complex relationships are captured in larger datasets.

Flexibility: RexNet's architecture can be adapted and customized with different numbers of layers, making it flexible enough to be fine-tuned for various predictive maintenance applications.
Despite its strengths, RexNet faces notable limitations, and its advantages come with significant trade-offs, which must be considered when choosing the right model for practical applications. Some of the key limitations of RexNet compared to traditional methods, like LSTM, are as follows.

Computational cost: As the number of layers in RexNet increases, so does the computational time. For example, while LSTM requires only 29.5 min for training, RexNet with one layer takes 122.6 min, and RexNet with two layers requires 562 min. This substantial increase in training time may limit RexNet's feasibility in real-time or resource-constrained environments.

Resource intensity: The increased computational time is accompanied by a higher demand on hardware resources. In our experiments, the use of standard, commercially available laptops led to forced shutdowns due to overheating, which restricted us to only two layers. Running deeper models with more layers would likely require HPC infrastructure, which may not be readily available to all users.

Diminishing returns: While increasing the number of layers enhances prediction accuracy, the improvement may plateau beyond a certain point. As seen in our results, the jump from one layer to two layers brings a notable improvement, but the trade-off in computational resources might outweigh the benefits if further layers are added.

Implementation complexity: RexNet introduces a custom layer structure that adds complexity to its implementation. This can make the model more challenging to optimize and integrate into existing predictive maintenance frameworks compared to conventional models, like LSTM, which are more mature and widely supported.

The RexNet model, particularly in its two-layer configuration, offers significant advantages in terms of predictive accuracy for RUL estimation compared to LSTM.
However, this comes at the cost of increased computational time and hardware requirements. The choice between RexNet and traditional models, like LSTM, depends on the specific needs of the application. If high accuracy is paramount and sufficient computational resources are available, RexNet is the better choice. However, for real-time applications or environments with limited resources, the simplicity and lower computational demands of LSTM may be more appropriate. Future work will explore deeper RexNet models when more robust computational systems, such as HPC stations, become available. The hyperparameter tuning results summarized in Table 3 provide valuable insights into the effectiveness of the different models, LSTM, one-layer RexNet, and two-layer RexNet, in predicting RUL. Although the LSTM model exhibits certain advantages, such as 14 hidden units and a maximum of 513 training epochs that allow it to capture complex patterns, the RexNet models also present strong cases worth considering. One-layer RexNet, with 12 hidden units and an initial learning rate of 0.0198, may provide a more stable learning environment, which can be beneficial in avoiding overfitting and ensuring generalization across diverse data scenarios. Similarly, two-layer RexNet, with nine hidden units and an initial learning rate of 0.0994, offers another viable option for modeling RUL, particularly with its robust gradient threshold of 0.9269. These characteristics indicate that the RexNet models can effectively learn the underlying relationships in RUL data without the complexities associated with deeper standalone architectures. The higher L2 regularization values in both RexNet models suggest a commitment to controlling overfitting, which is critical in RUL predictions where generalization is key. Additionally, the choice of learning algorithms, 'rmsprop' for one-layer RexNet and 'adam' for two-layer RexNet, offers adaptive learning rates that can enhance the models' ability to converge on optimal solutions. In conclusion, while the LSTM model demonstrates capabilities in RUL forecasting, the RexNet models are recommended due to their stable performance and lower risk of overfitting. They offer a compelling alternative that may simplify the training process while still achieving competitive predictive accuracy, highlighting the importance of selecting a model that balances complexity with generalization capabilities tailored to the specific characteristics of the data. In summary, it is important to emphasize that the dataset used in this study is highly realistic and was recorded under challenging conditions during a windstorm, which presents significant obstacles for data collection and observation, as noted in our previous work [33]. This authenticity ensures that the data accurately reflect real-world scenarios, thereby enhancing the reliability of our predictions. Additionally, the RexNet model has been previously employed to tackle complex classification problems [34], where it was applied to three distinct imaging datasets. This not only demonstrates the model's versatility and effectiveness across various contexts but also reinforces its ability to manage complex data structures, ultimately strengthening the credibility of the results obtained in this research. In this study, we explored the application of RF and RexNet, an expanded version of LSTM networks with recurrent expansion capabilities, for predicting the RUL of high-speed wind turbine bearings.
Utilizing a realistic dataset from a 2 MW wind turbine lifecycle, we focused on the degradation of bearings due to cracks in the inner race. To enhance the feature selection process, we implemented RF under Bayesian optimization, which effectively identified critical features such as the RPM load, shaft tick, and health indicator. These selected features were subsequently fed into two variants of RexNet: RexNet with one layer and RexNet 2 with two layers, noting that computational limitations restricted the study to two layers. The results showcased the efficacy of the RexNet variants across various evaluation metrics, including learning behaviors (loss and AULC), RUL prediction visualizations (curve fitting), and several error metrics, including $R^2$. The findings demonstrated significant achievements in RUL prediction, highlighting the potential of RexNet architectures in advancing predictive maintenance strategies for high-speed wind turbine bearings.

Conceptualization, T.B. and E.B.; methodology, T.B. and E.B.; software, T.B.; validation, T.B., E.B., F.D. and W.H.L.; formal analysis, T.B., E.B., F.D. and W.H.L.; investigation, T.B., E.B., F.D. and W.H.L.; resources, T.B. and E.B.; data curation, T.B., E.B., F.D. and W.H.L.; writing—original draft preparation, T.B.; writing—review and editing, T.B., E.B., F.D. and W.H.L.; visualization, T.B., E.B., F.D. and W.H.L. All authors have read and agreed to the published version of the manuscript. This research received no external funding. The data have been provided by the co-author of this manuscript, Eric Bechhoefer; please contact him for any inquiries regarding their use. Additionally, all codes utilized in this study can be downloaded from https://doi.org/10.5281/zenodo.13933312 (accessed on 1 October 2024). Author Eric Bechhoefer was employed by the company Green Power Monitoring Systems International Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Keywords: bearings; deep learning; random forest; recurrent expansion; remaining useful life; wind turbine