In modern manufacturing, the prediction of the remaining useful life (RUL) of computer numerical control (CNC) milling cutters is crucial for improving production efficiency and product quality. This study proposes a hybrid CNN-LSTM-Attention-PSA model that combines convolutional neural networks (CNN), long short-term memory (LSTM) networks, and attention mechanisms to predict the RUL of CNC milling cutters. The model integrates cutting force, vibration, and current signals for multi-channel feature extraction during cutter wear. The model’s hyperparameters are optimized using a PID-based search algorithm (PSA), and comparative experiments were conducted with different predictive models. The experimental results demonstrate the proposed model’s superior performance compared to CNN, LSTM, and hybrid CNN-LSTM models, achieving an R² score of 99.42% and reducing MAE, RMSE, and MAPE by significant margins. The results validate that the proposed method has significant reference and practical value for RUL prediction research of CNC milling cutters.

CNC machine tools are critical in mechanical processing [1]. As a key component of CNC machining, milling cutters wear and can be damaged in service, so their lifespan directly affects production efficiency and quality [2]. With the development of automation and intelligent technology, accurate RUL prediction for milling cutters has become essential for enhancing production and machining quality [3]. This study focuses on the prediction of tool wear, particularly bottom edge wear width, which significantly affects the RUL of CNC milling cutters.

RUL prediction models leveraging signal processing and deep learning are emerging as key tools in manufacturing, offering precise technical solutions for RUL estimation [4]. These models use signal processing to improve prediction accuracy and stability by filtering and extracting features, supplying robust input data for deep learning. Peng et al. [5] and Huang et al.
[ 6] demonstrated the effectiveness of spindle current and vibration signals in monitoring tool condition and wear during milling. Single sensors are limited in capturing comprehensive wear data, posing challenges for accurate RUL prediction [ 7]. Multi-sensor studies like those by Kuntoğlu et al. [ 8] have shown promise, with high correlations found between temperature and acoustic emission signals in identifying tool wear. Feature extraction, including time-domain, frequency-domain, and time–frequency domain analyses, is essential for improving RUL prediction accuracy [ 9]. Zhang et al. [ 10, 11] and Guo et al. [ 12] showed that multi-channel signal processing and self-attention mechanisms can significantly enhance prediction models, offering improved accuracy and industrial applicability. Deep learning’s capacity for handling big data and complex nonlinear problems positions it as a key technology in RUL prediction for milling cutters. Studies like those by Ambadekar et al. [ 13] and Marani et al. [ 14] have harnessed CNN and LSTM models respectively, demonstrating deep learning’s potential in this field. However, individual models like CNN or LSTM face limitations in comprehensively addressing RUL prediction tasks. Hybrid models, by leveraging multiple algorithms, can surpass these limitations, enhancing both generalization and accuracy [ 15, 16]. Sayyad et al. [ 17] and Zhang et al. [ 18] have shown that combining deep learning techniques with time–frequency analysis and the Hurst exponent can significantly improve prediction accuracy and computational efficiency. Attention mechanisms, as introduced by Xue et al. [ 19], have also enhanced feature extraction quality and model generalization in tool wear prediction. Meanwhile, hyperparameter optimization strategies such as those by Chintakindi et al. [ 20] and Mahmood et al. [ 21] have further refined model performance. 
Despite these advancements, challenges remain in the generalization and accuracy of RUL prediction models, particularly in complex wear scenarios. There is a need for more sophisticated integration of multi-channel sensor data, feature extraction, and algorithmic strategies to fully exploit the potential of these models. This study proposes a tool RUL prediction model based on hybrid CNN-LSTM-Attention-PSA. This model not only effectively integrates and analyzes multi-channel feature data but also enhances prediction accuracy and robustness through optimized processing. The main contributions of this paper can be summarized as follows: (1) This paper proposes and validates the application of a CNN-LSTM-Attention deep learning model optimized by the PSA algorithm for CNC milling tool RUL prediction. This model leverages the spatial feature extraction capability of CNN, the temporal sequence data processing capability of LSTM, and the attention mechanism’s ability to identify and utilize key information. This method extracts important features from multi-channel sensor data to precisely assess tool wear. This innovative combination significantly improves the model’s predictive accuracy; (2) This study introduces a proportional-integral-derivative (PID) theory-based search algorithm (PSA) [ 22] for model hyperparameter optimization. The PSA algorithm mimics the principles of a PID controller by using differential, integral, and proportional operations to perform fine-tuned searches of model parameters. This aims to minimize prediction errors and quickly and steadily find the optimal model parameters. This method enhances the model’s adaptability to uncertainties and non-linear problems, further improving predictive accuracy and generalization ability; (3) Through a series of rigorous experiments, the effectiveness of the CNN-LSTM-Attention-PSA model in RUL prediction is validated. 
Comparisons with other models demonstrate significant improvements in prediction accuracy and computational efficiency, laying the foundation for future research in related fields.

This paper proposes an RUL prediction method based on a hybrid CNN-LSTM-Attention-PSA model. The powerful optimization capabilities of PSA are used to optimize the hyperparameters of the CNN-LSTM-Attention-based deep learning model to achieve optimal RUL prediction performance.
Accurate and reliable tool wear prediction models are of significant importance for actual production [ 23]. These models can achieve high-precision monitoring and prediction of tool wear status, enabling timely scheduling of CNC machine maintenance and effectively avoiding production efficiency reduction due to tool damage. In the construction of RUL prediction models, algorithms and data signals are regarded as two core elements [ 24]. A single algorithm often has inherent flaws; thus, the current trend is to use hybrid algorithms for regression tasks. Data signals have a decisive impact on the training and prediction of regression models [ 25]. Single-channel signals are insufficient to meet the demands of advanced signal processing tasks in terms of data completeness, redundancy, and anti-interference capability [ 26]. Therefore, today’s RUL prediction models often require the use of multiple different signal data. Given the aforementioned background, this study proposes a deep learning framework that integrates multi-channel signal feature extraction, specifically a convolutional neural network–long short-term memory–attention (CNN-LSTM-Attention) model, as illustrated in Figure 1. This model integrates three different types of sensors to monitor cutting force, vibration, and spindle current, acquiring a total of seven channels of raw signals. Through preprocessing steps such as data cleaning, feature extraction, and dimensionality reduction, the raw signals are transformed into high-quality inputs for the model. This process generates 15 key features that capture the essential information needed for accurate RUL prediction. The model triggers tool replacement when the tool wear reaches a threshold of 0.42 mm, ensuring optimal production efficiency. The proposed method in this study primarily includes the following key improvements: (1) Extraction of features from multi-channel raw signals that have been cleaned. 
By integrating, extracting, and reducing the dimensionality of raw signals that have been cleaned, the feature recognition capabilities of the machine learning model are enhanced, providing a richer data foundation for model training and prediction; (2) The convolutional neural network (CNN) layer performs localized feature extraction through convolution operations, capturing spatial correlations within time-series data. By using filter arrays, the CNN effectively extracts features from the input data, making it suitable for processing complex time-series signals; (3) To address the temporal nature of the data, the LSTM layer is introduced, which enhances the model’s ability to capture long-term dependencies. The LSTM’s gating mechanism allows it to retain important temporal patterns while mitigating the vanishing gradient problem, thus supporting stable model training; (4) The self-attention mechanism is used to dynamically assign importance to different parts of the time-series input, enabling the model to focus on the most critical features. This selective attention improves the model’s performance in handling intricate relationships within the data.

In summary, by integrating CNN for local feature extraction with LSTM for temporal sequence learning and incorporating attention mechanisms to prioritize critical features, the proposed model addresses the limitations of prior approaches that rely on either CNN or LSTM alone. This hybrid architecture allows for more accurate and robust RUL prediction in dynamic manufacturing environments.

PID-based Search Algorithm (PSA)

The PSA’s control mechanism allows it to balance exploration and exploitation during the search process, helping it avoid local optima and approach the global optimum. It is particularly suitable for complex and computationally expensive neural network models such as the CNN-LSTM-Attention model used for predicting CNC milling cutter wear. Its fine control over hyperparameters is expected to improve model performance while reducing the computational resources consumed during tuning. The optimization flowchart for the PSA algorithm is shown in Figure 2. The PSA algorithm applies the essence of PID controller theory, namely its dynamic adjustment mechanism, to the process of solving optimization problems. For a detailed procedure of the PSA algorithm to optimize the model parameters, please refer to Appendix A.

2.
CNN-LSTM-Attention Method for Hyperparameter Optimization Based on PSA

As shown in Figure 1, this study proposes a hierarchical model combining CNN-LSTM with an attention mechanism to achieve effective feature extraction and long-term dependency modeling for sequential data. Firstly, the model utilizes a one-dimensional convolutional layer with a fixed window size to extract local features of the input. The rectified linear unit (ReLU) activation function is applied after the convolutional layer to introduce non-linearity and enhance the feature representation capacity. Then, the max-pooling layer compresses the feature dimensions, highlighting the main features and suppressing noise. The LSTM layer, as the core of this structure, accepts the output of the pooling layer, and its gating mechanism allows the model to capture long-term dependencies in the sequential data. By setting the return_sequences parameter to true, the LSTM layer produces an output at each time step, providing a foundation for the subsequent attention layer. The self-attention mechanism calculates the mutual influences between elements within the sequence, allowing the sequence weights to be reallocated dynamically. This addresses dependencies within layers and global information integration. The mechanism calculates attention weights using the sigmoid function, optimizing the focus of the model. The dropout layer randomly disconnects certain neural connections, reducing the risk of model overfitting. The final stage of the model is the fully connected layer, responsible for outputting the final prediction results. In model compilation, the adaptive moment estimation (Adam) optimizer and mean squared error (MSE) loss function are used to adjust weights and optimize network performance.
The entire network architecture is progressively layered, comprehensively learning both sequential and dynamic features and providing an effective and sophisticated analysis tool for complex time-series prediction tasks. To further optimize hyperparameter selection, this study employs the PSA algorithm to avoid subjective bias in manual parameter selection and achieve globally optimal hyperparameter combinations. This optimization approach not only enhances the model’s fitting capability for milling wear data but also significantly improves prediction accuracy. The RUL prediction process of the CNN-LSTM-Attention method combined with PSA hyperparameter optimization consists of four main steps, as shown in the framework and flowchart in Figure 3. The specific steps are as follows. Step 1: Use sensors to obtain raw data during the milling process, including three channels of vibration signals, three channels of cutting force signals, and one channel of spindle current signal. The wear amount observed by the industrial imaging measurement instrument during each cut is used as the prediction target of the model. Step 2: Fuse and extract features from the multi-channel raw data, including feature preprocessing and feature engineering, to construct high-quality training and testing datasets. Step 3: Create and initialize the CNN-LSTM-Attention model and its hyperparameters, then train the model using the training dataset. Step 4: Under the guidance of the PSA optimizer, input the test dataset into the CNN-LSTM-Attention model. Based on the evaluation metrics output by the model, the optimizer assesses whether the optimization target has been achieved. If achieved, lock in and output the optimal hyperparameters for re-training the model and outputting the final target value; if not, continue iterative optimization until the best hyperparameter combination is obtained. The prediction principle of the proposed CNN-LSTM-Attention-PSA in this paper is as follows: 1. 
The design of the deep learning model primarily combines Conv1D, LSTM, and self-attention.

The Conv1D operation applies filters to the input sequence to capture local spatial patterns, mathematically represented as follows:

$$C_i(t) = \sum_{j=0}^{k-1} W_{ij} \cdot X(t+j) + b_i \tag{1}$$

where $C_i(t)$ represents the feature map generated by the $i$-th convolution kernel at time step $t$, $W_{ij}$ is the weight of the convolution kernel, $X(t+j)$ is the value of the input time series at time step $t+j$, $b_i$ is the bias term, and $k$ is the size of the convolution kernel.

The ReLU activation function $\varphi$ is applied to enhance the model’s expressive power and its ability to learn non-linear relationships, aiding the network in learning complex patterns. The expression for the ReLU activation function is as follows:

$$\varphi(x) = \max(0, x) \tag{2}$$

$$C_i'(t) = \varphi(C_i(t)) \tag{3}$$

To further compress the feature map and reduce noise, a max pooling layer is employed after the ReLU function. Max pooling selects the maximum value from a fixed-size window in the convolution output, reducing the dimensionality of the data while retaining the most salient features. The formula for max pooling is as follows:

$$M(C_i) = \max N_K(C_i) \tag{4}$$

where $M(C_i)$ represents the value at position $i$ in the pooled sequence, while $N_K(C_i)$ denotes the neighborhood of length $K$ centered at position $i$ in the original sequence. In practice, $K$ is set to 2 in this study, thus selecting the maximum activation value within each window of two adjacent time steps. By applying the max pooling layer, the original time-series data, after convolution and max pooling, is transformed into a new compressed feature map $M$. The pooled data structure has dimensions of $(L/2) \times F$, where the sequence length is halved to $L/2$, while the feature dimension $F$ remains unchanged.
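Equations (1) to (4) can be traced on a toy input with a few lines of NumPy. This is only a sketch: the function names, the single-channel input, the hand-picked kernels, and the use of "valid" padding (no padding at the sequence ends) are assumptions made for illustration and are not taken from the paper. With valid padding the pooled length is $\lfloor (L-k+1)/K \rfloor$ rather than exactly $L/2$.

```python
import numpy as np

def conv1d_valid(x, weights, bias):
    """Eq. (1): C_i(t) = sum_j W_ij * X(t + j) + b_i for a single-channel series.

    x: (L,) series, weights: (F, k) kernels, bias: (F,).
    Returns (L - k + 1, F); "valid" padding is an assumption of this sketch.
    """
    k = weights.shape[1]
    n = len(x) - k + 1
    # windows[t, j] == x[t + j]
    windows = np.stack([x[j:j + n] for j in range(k)], axis=1)
    return windows @ weights.T + bias

def relu(c):
    """Eqs. (2)-(3): element-wise max(0, x)."""
    return np.maximum(0.0, c)

def max_pool1d(c, K=2):
    """Eq. (4): maximum over non-overlapping windows of K time steps."""
    steps = c.shape[0] // K
    return c[:steps * K].reshape(steps, K, -1).max(axis=1)

# Toy example with two hand-picked kernels of size k = 2 (illustrative values).
x = np.array([1.0, -2.0, 3.0, 0.0, 2.0, -1.0])
weights = np.array([[1.0, 0.0],    # kernel 0 copies x(t)
                    [0.0, -1.0]])  # kernel 1 returns -x(t + 1)
bias = np.zeros(2)

pooled = max_pool1d(relu(conv1d_valid(x, weights, bias)), K=2)
print(pooled)  # [[1. 2.] [3. 0.]]
```

The pooled array has one row per retained time step and one column per kernel, which is the layout of the compressed feature map $M$.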
This data structure provides high-quality and information-rich input for the subsequent LSTM layer, laying the foundation for the application of deep learning models in tool wear prediction.

The new compressed feature map $M$ is transferred to the LSTM layer, the core component responsible for capturing long-term dependencies in sequential data. LSTM’s gating mechanisms, including the forget gate, input gate, and output gate, enable it to retain important temporal information over long sequences. The LSTM equations are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, M_t] + b_f) \tag{5}$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, M_t] + b_i) \tag{6}$$

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, M_t] + b_c) \tag{7}$$

$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t \tag{8}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, M_t] + b_o) \tag{9}$$

$$h_t = o_t \ast \tanh(C_t) \tag{10}$$

where $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively; $\tanh$ denotes the hyperbolic tangent activation function; $h_{t-1}$ is the hidden state from the previous time step; $M_t$ is the input at the current time step; the notation $[h_{t-1}, M_t]$ signifies the concatenation of $h_{t-1}$ and $M_t$; $\sigma$ represents the sigmoid activation function; and the symbol $\ast$ denotes element-wise multiplication of vectors or matrices.

By applying the aforementioned operations to each time point in the sequence, the LSTM layer outputs a series of hidden states denoted as $H = \{h_1, h_2, \cdots, h_{L/2}\}$, where each hidden state reflects the information at its corresponding time point and captures the long-term dependencies from the start of the sequence to the current point. The set of hidden states $H$ output by the LSTM layer maintains the length of the input sequence at $L/2$ but re-encodes the information at each time point. This transformation provides ideal input for the subsequent attention mechanism, which allows the model to focus on the most critical parts of the sequence based on these hidden states.
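The recurrence in Equations (5) to (10) can be written out directly in NumPy. The sketch below is illustrative only: the function name `lstm_forward`, the parameter dictionary layout, the zero initial states, and the small random weights are assumptions introduced here, not details reported in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(M, params, hidden_size):
    """Apply Eqs. (5)-(10) to a pooled feature map M of shape (T, F).

    Returns the stacked hidden states H with shape (T, hidden_size),
    i.e. one hidden state per time step.
    """
    h = np.zeros(hidden_size)  # h_0 (assumed zero)
    c = np.zeros(hidden_size)  # C_0 (assumed zero)
    H = []
    for m_t in M:
        z = np.concatenate([h, m_t])                          # [h_{t-1}, M_t]
        f = sigmoid(params["W_f"] @ z + params["b_f"])        # Eq. (5)
        i = sigmoid(params["W_i"] @ z + params["b_i"])        # Eq. (6)
        c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # Eq. (7)
        c = f * c + i * c_tilde                               # Eq. (8)
        o = sigmoid(params["W_o"] @ z + params["b_o"])        # Eq. (9)
        h = o * np.tanh(c)                                    # Eq. (10)
        H.append(h)
    return np.stack(H)

# Illustrative sizes and random weights (not the tuned values from the study).
rng = np.random.default_rng(0)
T, F, hidden = 4, 3, 5
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + F))
    params[f"b_{gate}"] = np.zeros(hidden)

H = lstm_forward(rng.normal(size=(T, F)), params, hidden)
print(H.shape)  # (4, 5)
```

Returning every hidden state, rather than only the last one, corresponds to the per-time-step output that the attention layer consumes.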
Next, the self-attention mechanism dynamically allocates importance to different elements within the sequence by calculating attention scores. These scores reflect the relevance of each time step in predicting the target output. The attention score is computed as given below:

$$\mathrm{score}(q, h_t) = q^{T} W_{\alpha} h_t \tag{11}$$

where $\mathrm{score}(q, h_t)$ is the attention score, and $W_{\alpha}$ is the weight matrix, a parameter learned during the model training process. Next, the softmax function is used to transform these scores into a probability distribution, namely the attention weights:

$$\alpha_t = \frac{\exp(\mathrm{score}(q, h_t))}{\sum_{j=1}^{L/2} \exp(\mathrm{score}(q, h_j))} \tag{12}$$

where $\alpha_t$ represents the relative importance of time step $t$ in the entire sequence, with $L/2$ being the sequence length. Once the attention weights for each time step are calculated, the model uses these weights to perform a weighted sum of all hidden states, resulting in the final context vector:

$$c = \sum_{t=1}^{L/2} \alpha_t h_t \tag{13}$$

where $L/2$ represents the sequence length, and the context vector $c$ is the model’s summary of the entire input sequence, which will be used for subsequent prediction tasks.

After obtaining the weighted context vector $c$, the model proceeds to the final step of decision making. Firstly, to prevent overfitting and enhance the model’s generalization ability, a dropout layer is introduced:

$$c' = \mathrm{Dropout}(c, \mathrm{rate} = \mathrm{dropout\_rate}) \tag{14}$$

where $c'$ is the context vector processed by dropout, and $\mathrm{dropout\_rate}$ is the dropout probability. Subsequently, the model maps the processed context vector to the predicted values through a dense layer:

$$\hat{y} = W \cdot c' + b \tag{15}$$

where $\hat{y}$ is the model’s predicted milling tool wear, and $W$ and $b$ are the weights and biases of the dense layer, respectively.
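The attention head in Equations (11) to (15) can be sketched in NumPy as follows. The function name `attention_head`, the shapes of `q` and `W_alpha`, the random example values, and the inverted-dropout scaling used for Equation (14) are assumptions made for this illustration; the paper does not specify them.

```python
import numpy as np

def attention_head(H, q, W_alpha, W_out, b_out, dropout_rate=0.0, rng=None):
    """Eqs. (11)-(15): scores, softmax weights, context vector, dropout, dense output.

    H: (T, d) hidden states, q: (d_q,) query, W_alpha: (d_q, d), W_out: (d,).
    """
    scores = (H @ W_alpha.T) @ q                 # Eq. (11) for every time step
    exp_scores = np.exp(scores - scores.max())   # shifted for numerical stability
    alpha = exp_scores / exp_scores.sum()        # Eq. (12)
    c = alpha @ H                                # Eq. (13)
    if dropout_rate > 0.0:                       # Eq. (14), training time only
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(c.shape) >= dropout_rate
        c = c * keep / (1.0 - dropout_rate)      # inverted-dropout scaling (assumed)
    y_hat = W_out @ c + b_out                    # Eq. (15)
    return y_hat, alpha

# Illustrative random inputs; in the model, H would come from the LSTM layer.
rng = np.random.default_rng(1)
T, d = 6, 4
H = rng.normal(size=(T, d))
q = rng.normal(size=d)
W_alpha = rng.normal(size=(d, d))
W_out = rng.normal(size=d)

y_hat, alpha = attention_head(H, q, W_alpha, W_out, b_out=0.1)
print(alpha.sum())  # attention weights sum to 1 (up to rounding)
```

A quick sanity check on the design: if `W_alpha` is all zeros, every score is equal, the weights become uniform, and the context vector reduces to the plain average of the hidden states.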
PSA Hyperparameter Optimization Principle

To enhance model performance in predicting CNC milling tool wear, this paper employs the PSA algorithm, which simulates the dynamic adjustment strategy of a PID controller, to optimize model hyperparameters, aiming to minimize prediction error and improve prediction accuracy.

Initialization: First, define the position of each individual in the hyperparameter space and the PID control parameters $K_p$, $K_i$, and $K_d$; these three parameters need to be adjusted according to the actual situation to ensure the stability and speed of the algorithm. The initialization process can be formally represented as follows:

$$X_0 = lb + (ub - lb) \times \mathrm{rand}(N, D) \tag{16}$$

where $lb$ and $ub$ represent the lower and upper bounds of each hyperparameter, respectively; $\mathrm{rand}(N, D)$ generates an $N \times D$ matrix of random numbers in $[0, 1]$. Here, $N$ is the population size, and $D$ is the hyperparameter dimension.

Evaluate fitness: Fitness evaluation is conducted by calculating the performance metrics of each individual (i.e., a specific hyperparameter configuration) through experiments or simulation models. Assuming the goal is to minimize the model’s prediction error, the fitness function $f(x)$ can be calculated using the mean squared error (MSE):

$$f(x) = \mathrm{MSE}(\hat{y}_i, y_i) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \tag{17}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.

Apply the PID control strategy for individual updates.
In each iteration, individuals update their positions in the hyperparameter space based on the PID control mechanism:

$$\Delta X = K_p \times E(t) + K_i \times \sum E(t) + K_d \times \frac{E(t) - E(t-1)}{\Delta t} \tag{18}$$

$$X_{t+1} = X_t + \Delta X \tag{19}$$

where $\Delta X$ is the adjustment amount calculated based on the error $E(t)$, $E(t)$ is the error in the current iteration, $\sum E(t)$ is the accumulation of historical errors, and $\frac{E(t) - E(t-1)}{\Delta t}$ attempts to predict the error’s future trajectory so as to adjust the strategy proactively and prevent additional error escalation.

Boundary checking and adjustment: After updating, check the position of each individual to ensure it does not exceed the preset boundaries:

$$X_{t+1} = \min(\max(X_{t+1}, lb), ub) \tag{20}$$

Iterate until the set maximum number of iterations $T$ is reached or until a specific convergence criterion is met. Record the individual with the optimal fitness as the final hyperparameter set.

In milling tool RUL prediction research, data quality directly determines the performance of the CNN-LSTM-Attention-PSA model. Compared with single-channel signals, multi-channel signals and their fusion provide richer information, which improves the quality and dimensionality of the extracted features. This in turn increases the depth and breadth of model learning and the accuracy of feature recognition [27]. Therefore, multi-channel signals not only enhance the performance of the CNN-LSTM-Attention-PSA model but also improve its robustness and generalization ability in complex environments. The construction and evaluation of the proposed CNN-LSTM-Attention-PSA model in this paper are based on the precise measurements of milling cutter edge wear depth obtained using the JVB250 industrial video measuring instrument, which is manufactured by Guiyang Xintian OETECH Co., Ltd., located in Guiyang, China.
This instrument is equipped with a 4.1-megapixel color CCD camera and a programmable three-ring cold light source with a laser pointer, offering a magnification range of 28 to 180 times. The measurement accuracy of the JVB250 is 3 μm.

During the metal cutting process, the wear of the milling cutter causes changes in various physical signals. The commonly used signal detection sensors are shown in Table 1, which include cutting force signals, vibration signals, current signals, power signals, acoustic emission signals, and temperature signals [28]. Considering functionality, cost, and ease of installation, this paper selects cutting force signals, vibration signals, and current signals as the main sources of raw data for model training and testing. In this study, down milling was employed on 45 steel with a spindle speed of 1500 rpm, a feed rate of 1000 mm/min, and a cutting depth of 0.5 mm. A four-flute carbide end mill was used for the machining tasks. After each pass, the wear on the milling cutter was measured using an industrial video measuring instrument.

1. Cutting Force Signal

By monitoring the cutting force during milling operations, we can capture the interaction between the tool and the workpiece in real time [29]. Such signal monitoring can accurately map the degree of tool wear because changes in cutting force are closely associated with the geometric alterations that occur during the tool wear process. The advantages of this signal lie in its directness and high correlation, making it a crucial data source for evaluating tool wear. However, its drawback is that measuring cutting force usually requires specialized equipment, which can increase experimental costs. Additionally, under variable and complex machining conditions, the measured signal can be easily interfered with by various factors, affecting data accuracy.
This paper uses the KCDW three-axis dynamometer to simultaneously measure the raw cutting force signals of the milling cutter in the X, Y, and Z axes during machining, with a sampling frequency set at 1000 Hz. The dominant cutting force frequency is 100 Hz, much lower than the sampling frequency. The KCDW three-axis dynamometer used in this study has a measurement range of 10 to 2000 N, and its output sensitivity ranges from 1.0 to 1.5 mV/V. The actual force measured in the experiment ranged from 0 to 130 N. The dynamometer’s sensitivity allows it to detect changes in force as small as 0.01 N, which is sufficient to capture the subtle force variations indicative of early tool wear, enhancing the detection and monitoring accuracy.

2. Vibration Signal

Monitoring the vibration signals generated during processing provides another perspective for observing the tool wear state. Vibration signals are capable of detecting subtle changes that cutting force signals often fail to detect, particularly in instances of minor wear and tool tip fracture [30]. Their advantage lies in their high sensitivity to subtle changes in tool status, and the equipment is comparatively straightforward to set up. However, the drawback is that vibration signals may be influenced by various factors such as the condition of the machine tool and material properties, making signal analysis more complex. This paper uses the WT9011DCL-BT50 vibration sensor to simultaneously measure the raw vibration signals of the milling cutter in the X, Y, and Z axes during machining, with a sampling frequency set at 200 Hz. This acquisition frequency was selected to capture the key vibration frequencies associated with the milling cutter wear process. It is deemed suitable for capturing the dominant vibration modes without being overwhelmed by excessive high-frequency noise, thereby ensuring efficient data collection and analysis.

3.
Spindle Current Signal

By monitoring the changes in the motor current signal driving the milling cutter, the tool’s condition can be indirectly monitored. Changes in the current signal are related to the tool load, making it an indirect indicator of tool wear [31]. Its advantages include the simplicity and low cost of the measurement method as well as the non-contact nature with the tool and workpiece, which ensures that the machining process is not affected. However, its drawback is the low sensitivity of current signals to tool wear, especially in the initial stages of slight wear, where the signal changes may not be significant. This paper uses the DAM3154 acquisition module sensor to simultaneously measure the three-phase current signals of the milling cutter during machining, with a sampling frequency set at 100 Hz. This acquisition frequency ensures that there is sufficient resolution to capture motor load changes associated with tool wear while avoiding excessive high-frequency noise. Given the relatively slow change in spindle load due to milling forces, 100 Hz is sufficient to track significant fluctuations without being overwhelmed by high-frequency components, which are less relevant to this type of signal.

The method of raw data acquisition is shown in Figure 4. The three-dimensional force sensor is installed on the workbench, the vibration sensor is installed on the spindle, and the current sensor is installed on the three-phase leads of the spindle motor within the milling machine’s control cabinet. Through the installation and acquisition with these three sensors, a total of seven channels of raw characteristic signals were collected. These three signals have different focuses and together form a comprehensive signal system to predict the RUL of the milling cutter.
Cutting force signals provide direct mechanical information, and vibration signals capture subtle changes, while current signals serve as an indirect and cost-effective monitoring method. Machine learning models integrate these different types of signals to improve prediction accuracy and reliability, achieving effective predictions of tool RUL. To address the varying sampling frequencies of force (1000 Hz), vibration (200 Hz), and current (100 Hz) signals, all signals were resampled to a uniform frequency of 1000 Hz. Linear interpolation was applied for both vibration and current signals to maintain data alignment and ensure that the signals were synchronized for subsequent analysis. This method minimizes the risk of aliasing and preserves the essential characteristics of the lower-frequency signals. Figure 5 and Figure 6, respectively, depict the time-domain and frequency spectrum of the original signal and the resampled signal in the x-direction of vibration. The absence of significant changes confirms the effectiveness of our resampling method. To address the issues related to sampling near the Nyquist frequency (100 Hz for the 200 Hz sampling rate), a low-pass filter was applied to the original signal before interpolation. This filter eliminates any frequency components above 100 Hz, ensuring that the resampled signal does not introduce interpolation artifacts or frequencies beyond the Nyquist limit. This approach guarantees that the dominant frequency component, which lies exactly at 100 Hz, is accurately represented without contamination from higher frequencies. By applying a low-pass filter prior to signal linear interpolation, we ensured that the resampled signal remained within the Nyquist limit, improving the reliability of our vibration data analysis. This modification eliminates high-frequency artifacts and strengthens the validity of the results presented. 
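The resampling step described above, in which the lower-rate channels are linearly interpolated onto the 1000 Hz grid of the force signal, can be sketched with NumPy’s `np.interp`. The helper name `resample_linear`, the synthetic 10 Hz test signal, and the one-second duration are assumptions for illustration; the low-pass filtering applied before interpolation is not shown here.

```python
import numpy as np

def resample_linear(signal, fs_in, fs_out):
    """Linearly interpolate a signal sampled at fs_in onto an fs_out time grid.

    The output covers the same time span as the input (first to last sample).
    """
    signal = np.asarray(signal, dtype=float)
    t_in = np.arange(len(signal)) / fs_in
    duration = (len(signal) - 1) / fs_in
    n_out = int(round(duration * fs_out)) + 1
    t_out = np.arange(n_out) / fs_out
    return t_out, np.interp(t_out, t_in, signal)

# Synthetic stand-in for one vibration channel: 1 s of a 10 Hz sine at 200 Hz.
fs_vib, fs_target = 200, 1000
t_vib = np.arange(200) / fs_vib
vib_200 = np.sin(2 * np.pi * 10 * t_vib)

t_1k, vib_1k = resample_linear(vib_200, fs_vib, fs_target)
print(len(vib_200), len(vib_1k))  # 200 996
```

The same helper would apply to the 100 Hz current channel with `fs_in=100`. Because linear interpolation passes through the original samples, every fifth value of `vib_1k` coincides with a sample of `vib_200`.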
During the monitoring process of machining, the 7-channel signals collected by sensors are often accompanied by a high proportion of noise, which may interfere with subsequent data analysis and model training [ 32]. To reduce unnecessary information interference, this study first performed data cleaning, denoising, and other data preprocessing on the raw signals before applying machine learning algorithms. Missing values were handled using linear interpolation, and a combined method of ensemble empirical mode decomposition (EEMD) and wavelet threshold denoising was employed for denoising. This ensures the provision of high-quality data for subsequent data analysis and model training. For a detailed process of missing value processing and denoising, please refer to Appendix B. Taking the x-axis vibration signal as an example, a comparison of the signal before and after denoising after resampling is shown in Figure 7. The comparison shows a significant reduction in random noise and an enhancement in signal stability. This processing method effectively retains the key information related to tool wear, providing a signal basis that facilitates subsequent feature engineering analysis. To further address potential high-frequency interference, especially vibrations from the spindle and motor above 500 Hz, a band-pass filter was applied during signal processing. This filtering process helped isolate the frequency range most relevant to tool wear dynamics, minimizing the impact of noise and ensuring the integrity of the cutting force data. When addressing machine learning prediction problems for CNC milling tool wear, the preprocessing of the data is a vital step. This study employed a series of feature engineering techniques to optimize the data to meet the needs of the CNN-LSTM-Attention-PSA model. 
By employing analysis methods from the time-domain, frequency-domain, and time–frequency domain, this study comprehensively extracted signal features to capture their essential attributes. Specifically, this includes 10 statistical measures in the time domain: mean, standard deviation, peak-to-peak value, root mean square, skewness, kurtosis, shape factor, impulse factor, margin factor, and crest factor. These statistical measures not only reflect the peak forces on the milling tool and the stability of the mechanical system but also may reveal the correlation between the tool wear status and the contact area of the cutting edge. In the frequency-domain analysis, by transforming into the frequency space, four frequency-domain features were extracted, namely average amplitude, centroid frequency, frequency variance, and mean square frequency, to deeply analyze the signal’s frequency and energy distributions. Frequency-domain features were obtained through Fourier transform (FFT), revealing spectral details in multi-channel cutting force and vibration signals, such as machine tool dynamic response and tool-workpiece interaction characteristic frequencies. In this study, wavelet packet decomposition (WPD) combined with Meyer wavelet further analyzed the signals, optimizing their time–frequency localization, thereby capturing dynamic characteristics of the signal in different frequency bands, resulting in a comprehensive and precise description of the signal energy distribution. For a three-level wavelet packet decomposition, the energy of eight frequency bands was retained. Through the above feature processing procedures, the 7-channel signal data collected in this study were processed via feature engineering, extracting 22 features from each channel, forming a set of 154 features. In this study, the entire cycle of milling tool failure was analyzed over 377 cuts, thus obtaining a feature matrix of 377 × 154 and a wear target matrix of 377 × 1. 
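As an illustration of the time-domain part of this feature set, the sketch below computes the ten statistical measures for one channel. The paper does not state the exact formulas, so the commonly used definitions are assumed here (population moments, and factors normalized by the mean absolute value or the RMS); the function name `time_domain_features` and the constant names are hypothetical.

```python
import math

N_CHANNELS = 7                      # 3 force + 3 vibration + 1 current channel
FEATURES_PER_CHANNEL = 10 + 4 + 8   # time-domain + frequency-domain + WPD band energies


def time_domain_features(x):
    """Return the ten time-domain features of one signal segment.

    Assumed definitions (not taken from the paper): population variance,
    non-excess kurtosis, and the usual shape/impulse/margin/crest factors.
    A constant or all-zero segment would cause a division by zero.
    """
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rms = math.sqrt(sum(v * v for v in x) / n)
    abs_mean = sum(abs(v) for v in x) / n
    peak = max(abs(v) for v in x)
    sqrt_amp = (sum(math.sqrt(abs(v)) for v in x) / n) ** 2
    return {
        "mean": mean,
        "std": std,
        "peak_to_peak": max(x) - min(x),
        "rms": rms,
        "skewness": sum((v - mean) ** 3 for v in x) / (n * std ** 3),
        "kurtosis": sum((v - mean) ** 4 for v in x) / (n * std ** 4),
        "shape_factor": rms / abs_mean,
        "impulse_factor": peak / abs_mean,
        "margin_factor": peak / sqrt_amp,
        "crest_factor": peak / rms,
    }


features = time_domain_features([1.0, -1.0, 1.0, -1.0])
total_features = N_CHANNELS * FEATURES_PER_CHANNEL   # 7 * 22 = 154
```

The feature count follows directly from the text: 10 time-domain, 4 frequency-domain, and 8 wavelet-packet energy features give 22 per channel, and 7 channels give 154 columns in the feature matrix.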
To limit model complexity and guard against overfitting, this study first standardized each feature to eliminate the influence of differing scales on the model and then assessed the information gain of the features. A random forest algorithm was used to select the features that contribute significantly to the prediction of tool wear. As shown in Figure 8, the importance of each of the 154 features with respect to the target variable was calculated using the random forest algorithm. Figure 8 shows that the cutting force in the x-direction is strongly correlated with tool wear, with a total importance exceeding 50% of all features, while the sum of all current signal features accounts for only 0.013. This indicates that single-sensor data may not be sufficient to provide all the feature information required for machine learning. The results also show that the cutting force in the x-direction has a stronger correlation with tool wear than the cutting force in the y-direction. This can be attributed to the nature of the down milling operation, where the primary cutting load is concentrated in the x-direction. The y-direction forces are less sensitive to wear, as they mainly reflect the lateral forces during the milling process, which results in the observed weaker correlation between the y-direction forces and tool wear. By setting the importance threshold at 0.01, the 15 most informative features were selected, and the total importance of the filtered feature set reached 90%. This method not only reduces the complexity of the model but also retains as much of the original data information as possible, laying the foundation for building accurate predictive models. By preprocessing and engineering features from the raw data, a feature fusion matrix of 377 × 15 was constructed. Considering that CNC milling operation parameters (such as milling speed, feed rate, and cutting depth) continuously change over time, and that the current wear state may be affected by previous operational states, a sliding window method was introduced to capture such sequential dependencies. Since there are seven signal channels in this research scenario, the window size was set to 7, which captures the time dependence effectively while controlling the computational complexity and thus yields better prediction performance. Accordingly, the time-series sliding window feature matrix has dimensions of 370 × 7 × 15, making it suitable for training models such as LSTM or CNN that rely on time-series data.

This study constructed an experimental platform using a VMC850S CNC machine. The experiments were conducted using a four-flute square end mill made from tungsten carbide, with a diameter of 10 mm, sourced from Xiamen Golden Egret Special Alloy Co., Ltd., located in Xiamen, China. The tool was secured in an HSK taper holder, and the manufacturer specifies that the system runout is maintained within a tolerance of 5 microns. The workpiece material was AISI 1045, which has a Brinell hardness of 197 HBW, an ultimate tensile strength of 585 MPa, an elongation at fracture of 16%, and an ASTM grain size number of 6. These attributes render it a favorable choice across diverse manufacturing applications. To account for batch variability, this study conducted full-life-cycle experiments with three square milling cutters under the same operating conditions. A multi-sensor system was deployed to capture the cutting signal data used for training the CNN-LSTM-Attention-PSA model. The data collection covered cutting force, vibration, and spindle current signals. Specific information about the sensor equipment is shown in Table 2. The KCDW three-dimensional force sensor was used to measure cutting force data and can accurately capture the forces exerted on the milling cutter during the cutting process.
The data transmission method for the three-dimensional force sensor involved outputting digital signals directly through an RS485 cable to a computer, which were then converted into force data by specialized software provided by the manufacturer. Vibration data were collected by the WT9011DCL-BT50 Vibration Sensor, ensuring high sensitivity in detecting minor vibrations. The data transmission for the vibration sensor was facilitated through a Bluetooth connection to the PC for direct data transmission. The JLK-36 Open-ended Hall Current Sensor was tasked with real-time collection of spindle current, thereby reflecting the load variations experienced by the spindle. The current sensor data transmission method employed the JLK-36 Open-ended Hall Current Sensor for direct measurement, and the data were subsequently acquired by the DAM3154 Current Acquisition Module and transmitted to the computer. After each milling pass cycle, the JVB250 industrial imaging measuring instrument was used to accurately measure the wear of the milling cutter, facilitating subsequent analysis of tool wear. The experimental equipment setup is shown in Figure 9. An example measurement of a worn tool is shown in Figure 10, where the process of detecting the amount of tool wear is documented in detail. In the milling experiment, to better demonstrate tool wear, the process parameters were set according to the recommendations of the tool manufacturer, which include a spindle speed of 1500 rpm, feed rate of 1000 mm/min, cutting depth of 0.5 mm, and cutting width of 1 mm. According to the calculation, the cutting speed of the tool was 785.46 mm/s. Down milling was employed, and no cutting fluid was used throughout the process. The operator followed a predetermined path to cut the tool from the starting point to the end point of the workpiece, with each complete pass having a travel length of 300 mm. 
After each pass, the milling cutter was removed and its wear was measured using the JVB250 industrial imaging measuring instrument. Tool wear was measured in terms of bottom edge wear width (VB), as specified by the ISO 3685:1993 standard [ 33]. After each wear measurement, the tool was remounted on the milling machine for the next pass under the same conditions, and this procedure was repeated until the measured wear reached the set upper limit. Under these experimental conditions, the observed VB is shown in Figure 11, following the typical behavior pattern of tool wear. More specifically, in the initial stage of machining (phase A), the rate of tool wear was notably high, and the degree of wear increased significantly as operational cycles accumulated. Upon entering phase B, tool wear progressed at a stable rate. In the late stage of machining (phase C), tool wear intensified progressively, and the wear rate increased sharply as the cutting cycles continued.

For constructing the prediction model, an integrated CNN-LSTM-Attention architecture was employed. The CNN layer extracts local features from the time-series data, while the LSTM layer captures the long-term dependencies within it. The attention mechanism layer further strengthens the model’s identification of key time-series segments, thereby improving prediction accuracy. However, this deep learning model involves a multitude of parameters and hyperparameters, such as the learning rate, batch size, number of units, and dropout rate, which must be carefully tuned to optimize model performance. To this end, the study introduced PSA to optimize the model’s hyperparameters. In the PSA, a series of parameters were established, including a population size (N) of 20, a number of variables (nvars) of 4, and a number of iterations (T) of 100.
According to the recommendations of the PSA algorithm’s proposer, $K_p$, $K_i$, and $K_d$ were set to 1, 0.5, and 1.2, respectively. This paper defines the search space for the following hyperparameters: a learning rate ranging from 0.001 to 1, a batch size ranging from 8 to 128, a number of hidden layer units ranging from 16 to 128, and a dropout rate ranging from 0.1 to 0.5. These parameters collectively work in the PSA to guide the search process and find the optimal combination of hyperparameters. The optimization process is shown in Figure 12. Specifically, the algorithm adopts a PID control-based approach, iteratively searching and adjusting the impact of each parameter on the objective function and continually approaching the optimal solution. The model’s loss function, mean squared error (MSE), is used as the optimization objective, meaning that the parameters are adjusted to minimize the value of the loss function. The optimal configuration was found during the 76th global iteration, with a learning rate of 0.0068, a batch size of 13, a number of units of 127, and a dropout rate of 0.3620. The specific structure and related parameters of the model are shown in Figure 13.

This study evaluated the trained model on the test set using the coefficient of determination ($R^2$), mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), as defined in Equations (21)–(24). The model’s performance on each metric reflects the accuracy and stability of its predictions.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}{\sum_{i=1}^{n} \left( \bar{y} - y_i \right)^2} \tag{21}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \tag{22}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2} \tag{23}$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{24}$$

where $n$ is the total number of samples, $y_i$ is the actual value, $\bar{y}$ is the mean of the actual observed values, and $\hat{y}_i$ is the predicted value. This study compared the performance of the RUL prediction model as shown in Table 3.
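For readers who wish to reproduce the evaluation, the four metrics of Equations (21)–(24) can be computed as in the following minimal sketch. The function name `regression_metrics` and the sample values are illustrative assumptions and are not taken from the study's data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE, RMSE and MAPE (in percent) for paired observations.

    Assumes y_true contains no zeros (required by MAPE) and is not constant
    (required by R^2).
    """
    n = len(y_true)
    y_bar = sum(y_true) / n
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((y_bar - t) ** 2 for t in y_true)               # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)),
    }


# Hypothetical wear values in mm, for illustration only.
metrics = regression_metrics([0.10, 0.20, 0.40], [0.11, 0.19, 0.42])
```

Because MAE and RMSE are expressed in the units of the target (millimetres of wear width here) while $R^2$ and MAPE are dimensionless, the four values complement one another when models are compared.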
For comparison, four distinct sets of manually chosen hyperparameters were established for the CNN-LSTM-Attention model, and their corresponding evaluation metrics are reported. The last row of the table presents the evaluation metrics of the CNN-LSTM-Attention model optimized with PSA. Table 3 shows that the CNN-LSTM-Attention-PSA model performs better than the CNN-LSTM-Attention model on the RUL prediction problem. Its $R^2$ value reaches 0.9942, indicating that the model explains 99.42% of the variance in the sample wear amounts. The MAE and RMSE are $4.7557 \times 10^{-3}$ and $6.0846 \times 10^{-3}$, respectively, meaning that the model’s mean absolute error and root mean squared error in predicting each sample’s wear amount are both below 0.01 mm. Finally, the MAPE value is 4.4429%, indicating that the model’s average percentage error in predicting wear amounts is less than 5%. Conversely, the performance of the four manually tuned CNN-LSTM-Attention models was highly unstable, and even the best-performing set was inferior to the CNN-LSTM-Attention-PSA model. The prediction errors of the CNN-LSTM-Attention-PSA model are shown in Figure 14. Specifically, Figure 14a presents the prediction errors for all samples, and Figure 14b presents the prediction errors for the test samples. No sample’s prediction error exceeds 0.0175 mm against a total wear amount of 0.42 mm, demonstrating the strong robustness and reliability of the CNN-LSTM-Attention-PSA model in RUL prediction.
To further elucidate the role of each component in the CNN-LSTM-Attention-PSA model for RUL prediction, this study conducted a comparative analysis of various model combinations, as presented in Table 4, examining the performance of the CNN model, LSTM model, CNN-Attention, LSTM-Attention, CNN-LSTM, and CNN-LSTM-Attention models in predicting the RUL of the tool, ultimately focusing on evaluating the performance of the CNN-LSTM-Attention-PSA model optimized using the PSA. Upon examining Figure 15 in conjunction with Table 4, it can be seen that the LSTM model demonstrates superior performance over the CNN model in key metrics such as R2, MAE, RMSE, and MAPE, showcasing its distinctive advantages in addressing time-series issues. With the introduction of the attention mechanism, both the CNN and LSTM models exhibit marked improvements in performance, which suggests that the attention mechanism enables the model to concentrate on the most critical data sequences, thus improving the accuracy of predictions. On the foundation of the original CNN-LSTM model, this study employed the attention mechanism to strengthen the model’s capability to capture key information within a time series. Moreover, by integrating the PSA algorithm for hyperparameter optimization, the study not only enhanced the prediction accuracy but also boosted the model’s adaptability to tool wear characteristics across various conditions. The PSA algorithm, as a PID-based search algorithm, offers the advantage of dynamically tuning key hyperparameters (such as the learning rate and the number of units in the hidden layers), enabling real-time model optimization during training. This dynamic adjustment mechanism enables the model to sustain an optimal learning state throughout the training process, thereby ensuring high predictive accuracy and robustness for complex tasks like tool wear prediction. 
The CNN-LSTM-Attention-PSA model not only achieves an R2 evaluation metric of 0.9942 but also demonstrates significantly lower MAE and RMSE metrics compared to other models. These findings underscore the pivotal role of the PSA algorithm in enhancing model performance. After conducting an in-depth comparative performance analysis, it is evident that relying solely on either CNN or LSTM models, although they perform well in tool wear prediction, may not be adequate for all scenarios. The introduction of the CNN-LSTM-Attention model significantly enhances its deep recognition and analytical capabilities for time-series data. However, the model’s most significant performance optimization comes from applying the PSA algorithm. The PSA algorithm not only optimizes the model’s training process but also significantly improves its ability to recognize complex patterns, which is essential for enhancing both the precision and generalization of tool wear predictions. In summary, by examining the performance disparities among various deep learning models on the tool RUL prediction task, the substantial benefits of applying the PSA to the optimization of the CNN-LSTM-Attention model are demonstrated. The proposed method achieved a minimum reduction of 34.32% in R 2, 18.81% in MAE, 20.55% in RMSE, and 19.74% in MAPE compared to other methods. CNN-LSTM-Attention-PSA outperformed other models in terms of RUL prediction effect. This study constructed an innovative hybrid CNN-LSTM-Attention-PSA model with the aim of achieving high-precision predictions of remaining useful life (RUL) for CNC milling cutters. Through experiments, the following conclusions were drawn: (1) The fusion of multi-channel data features significantly impacts model prediction, enhancing the model’s generalization ability and prediction accuracy and avoiding the problem where single-sensor data might be insufficient to provide all the feature information needed by machine learning. 
The results obtained demonstrate that multi-channel signal integration, including force, vibration, and current signals, is essential for capturing the full range of tool wear characteristics; (2) Optimizing the model’s hyperparameters with the PID-based search algorithm (PSA) significantly improves the model’s predictive performance. The PSA optimization further enhances the robustness of the model by fine-tuning hyperparameters, thus yielding more reliable predictions compared to other baseline models. In the comparison of CNN-LSTM-Attention models with manually randomized hyperparameters to the hybrid CNN-LSTM-Attention-PSA model, while keeping the algorithmic structure constant, the hybrid model demonstrates superior performance; (3) Compared to CNN, CNN-LSTM, LSTM, CNN-Attention, LSTM-Attention, and CNN-LSTM-Attention models, the hybrid CNN-LSTM-Attention-PSA model outperforms others in predicting tool RUL. CNN algorithms lack in-depth processing of time-series data in RUL prediction. The LSTM algorithm lacks feature extraction and attention mechanism and has limited generalization ability. CNN-LSTM lacks attention mechanism and hyper-parameter optimization and has insufficient prediction accuracy and generalization.
In conclusion, the proposed hybrid CNN-LSTM-Attention-PSA model demonstrates excellent performance in the RUL prediction tasks of CNC milling cutters. This not only offers a novel research perspective for related fields but also holds practical value for future production and maintenance activities. Conceptualization, M.Z., J.Z. and N.M.; methodology, M.Z. and J.Z.; software, J.Z.; validation, J.Z.; formal analysis, M.Z.; investigation, M.Z; resources, J.Z., L.B., S.N., Y.Z. and N.M.; data curation, M.Z.; writing—original draft preparation, M.Z.; writing—review and editing, J.Z.; visualization, M.Z.; supervision, J.Z., L.B., S.N., Y.B., Y.Z. and N.M.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript. This research was funded by the Shandong Provincial Natural Science Foundation (ZR2023ME146) and the National Natural Science Foundation of China (NO. 52401352, 52476094).
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors. Author Lingfan Bu was employed by the company Shandong Wangxin Security Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

(1) Initialize PSA parameters: Define the initial settings of the algorithm, including population size, variable boundaries, maximum number of iterations, and PID controller parameters. These parameters lay the foundation for the algorithm’s operation, affecting the scope and efficiency of the search.
(2) Initialize the population: The construction of the initial population offers a set of random samples for exploration by the algorithm, establishing the starting point of diversity, laying the foundation for global optimization, and thereby ensuring a broad and unbiased search process.
(3) Calculate objective function value: Calculate the objective function values for each individual within the population and assess the fitness of each individual, quantifying the performance of each solution and offering a basis for decision making in subsequent selections and iterations.
(4) Find the historical best solution: Identify the best individuals and their fitness values in the current population. Provide a reference benchmark for subsequent iterations, ensure that the optimization direction is consistent with the historical best performance, and effectively guide the population towards a better solution space.
(5) Calculate system deviations: Calculate system deviation based on the best individual and the current population, providing feedback information for the PID controller, guiding the population to dynamically adjust the search strategy, and ensuring the algorithm maintains a balance between global search and local refinement.
(6) Calculate the external disturbance of Levy flight: Introduce a non-local random exploration mechanism to enhance global exploration capabilities, helping the population escape local optima and improve solution quality.
(7) Calculate PID control quantity: Calculate PID control quantities based on system deviation. Use a proportional-integral-derivative (PID) control strategy to dynamically adjust the deviation of individuals relative to the optimal position. The proportional part adjusts according to the magnitude of the deviation, the integral part considers the historical accumulation of deviations, and the derivative part predicts and adjusts based on the trend of deviation changes.
(8) Update candidate solutions: Update the positions of individuals in the population based on PID adjustment output values and external disturbances. This step not only enhances the algorithm’s exploration capability in the global solution space but also effectively guides the population evolution towards the global optimum through a dynamic balance of exploration and exploitation, significantly improving the algorithm’s search efficiency and optimization quality.
(9) Check and correct boundary violations: Once any search boundaries are exceeded, corresponding measures are taken to adjust and ensure that the population members always remain within the legal solution space.
(10) Termination conditions: Determine whether the termination conditions are met, such as reaching the preset maximum number of iterations or fitness threshold. If met, the algorithm stops running and outputs the best solution and fitness value; otherwise, it continues iterating.
(11) Output optimized PSA parameters: Construct a deep learning model using the optimized hyperparameters.

First, handle missing values. For continuous signal data, an effective method for handling missing values is interpolation.
Linear interpolation is a simple and practical choice, and its mathematical expression is as follows:

$$Y = Y_1 + \frac{(Y_2 - Y_1)(X - X_1)}{X_2 - X_1} \tag{A1}$$

where $(X_1, Y_1)$ and $(X_2, Y_2)$ are the coordinates of two known points, and $Y$ is the estimated value at any point $X$ between them.

Next, denoise the signal. In this study, a combined method of ensemble empirical mode decomposition (EEMD) and wavelet threshold denoising was employed. The aim is to adaptively decompose the signal and apply wavelet analysis to denoise each resulting intrinsic mode function (IMF). First, the signal is decomposed into a series of IMF components using the EEMD method, and this process can be represented as follows:

$$x(t) = \sum_{i=1}^{n} \mathrm{IMF}_i(t) + r_n(t) \tag{A2}$$

where $x(t)$ is the original signal, $\mathrm{IMF}_i(t)$ is the $i$-th IMF component, $n$ is the total number of IMFs, and $r_n(t)$ is the residual component that contains the trend information of the signal. Subsequently, wavelet threshold denoising is applied to each IMF component, which involves wavelet decomposition of the IMFs and applying a soft threshold to the decomposed wavelet coefficients. The mathematical expression for soft thresholding is given below:

$$d'_{jk} = \operatorname{sign}(d_{jk}) \cdot \max\left( \left| d_{jk} \right| - \lambda,\ 0 \right) \tag{A3}$$

where $d_{jk}$ is the $k$-th coefficient of the $j$-th level wavelet decomposition, $\lambda$ is the threshold, and $d'_{jk}$ is the denoised coefficient. The determination of the threshold $\lambda$ is crucial for the denoising effect. This study employs the universal threshold formula proposed by Donoho and Johnstone:

$$\lambda = \sigma \sqrt{2 \log N} \tag{A4}$$

where $\sigma$ is the estimated noise level, and $N$ is the number of data points. The noise level estimation is obtained by analyzing the high-frequency IMF components of the original signal.
Keywords: remaining useful life; CNN-LSTM-Attention-PSA; multi-channel feature extraction; milling cutter wear