Bearing Fault Identification Based on Deep Convolution Residual Network

Tong ZHOU*, Yuan LI**, Yijia JING***, Yifei TONG**** *School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210000, People’s Republic of China **Jiangyin Campus, Nanjing University of Science and Technology, Nanjing 214434, People’s Republic of China ***School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210000, People’s Republic of China ****School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210000, People’s Republic of China, E-mail: tyf51129@aliyun.com


Introduction
Bearings are important parts of industrial equipment, and bearing failure is one of the main factors leading to the shutdown of mechanical equipment [1]. Therefore, it is important to monitor the vibration signal of a bearing in real time and to identify the fault type from the signal promptly.
Traditional fault identification is mostly based on signal processing. These classical models include two modules: feature extraction, and fault classification based on machine learning (ML). Feature extraction maps the original signal to statistical parameters that convey information about the state of the machine. To obtain high-precision recognition results, the design of the feature extractor plays a very important role [2], because it directly determines the performance of the subsequent classification algorithm. Based on statistics [3], wavelet transform [4] and higher-order spectra, impurities in the original data are removed as much as possible. In this way, the specificity of the features is highlighted and the accuracy of fault diagnosis is improved. However, these traditional methods often depend on professional prior knowledge [5], and the shallow structure of conventional ML algorithms has very limited ability to learn nonlinear relationships among features [6].
In recent years, the rise of deep learning (DL) has promoted the development of artificial intelligence, and DL has produced rich research results in speech recognition, image processing, and recommendation systems. Compared with traditional ML models, deep neural networks (DNN) contain more neural units and a deeper architecture, which allows them to mine more valuable information from raw data. There are already many related results in the field of bearing fault identification. As deep feature extraction models, the autoencoder (AE) [7,8] and its variants, such as the stacked autoencoder (SAE) [9] and the denoising autoencoder (DAE) [10], are used to extract fault features; the extracted features are then fed into a classification model, which achieves high identification accuracy. In this setting, although the DNN obtains accurate feature representations, the learning mode is no different from the traditional one. In fact, DNN models have a distinctive way of learning called end-to-end learning, in which the DNN gives prediction results directly from the original data and avoids separate feature extraction [11].
The convolutional neural network (CNN), a commonly used DNN, has achieved great success in the field of image recognition. According to the current literature, CNN has also been used for bearing fault identification. Because CNN is very good at processing two-dimensional images (2-D data), the vibration signal (1-D data) is usually converted into 2-D data. A simple method is to treat the time-domain vibration signal of the bearing directly as an image with a width of 1 and to train with one-dimensional convolution [12,13]. Sometimes segments of the time-domain signal are stacked to form a matrix [14]. Another method is to convert the one-dimensional time-domain signal into two-dimensional data in the frequency spectrum [15] or into another 2-D format such as a WPE image [16]. In terms of operational difficulty, the first method requires simpler data processing and does not require too much prior knowledge of signal processing. Moreover, one-dimensional CNN is computationally efficient and can be implemented easily and cheaply on hardware systems [17].
With the advancement of research on DL, the architecture of DNN has been continuously deepened, which causes the problems of vanishing and exploding gradients and makes the DNN difficult to train. To solve this problem, He et al. [18] proposed the residual network (ResNet), which is based on skip connections. A skip connection gives the DNN a degradation path: through it, a deep network can degrade to a shallower one, so the DNN can obtain performance equivalent to that of the shallower network. ResNet has also been applied in the field of fault identification. Ma et al. [19] used a multi-objective optimization algorithm to fuse ResNet with other neural networks and obtained high accuracy. Jin et al. [20] proposed a decoupling attentional residual network for compound fault diagnosis and reached very high accuracy on the test set.
In the studies above, the network architectures are mostly shallow, which may prevent the DNN from exerting its powerful fitting ability. In addition, the overall network designs are very complicated.
In this paper, a deep convolution residual network (DCRN) is proposed for bearing fault recognition, and the contributions of the proposed method are summarized below.
1. To ensure sufficient sample data, overlap sampling is used.
2. End-to-end learning is adopted, and the input data is the one-dimensional vibration signal in the time domain, avoiding tedious signal processing and feature extraction.
3. By adding skip connections and stacking residual blocks, DCRN achieves superior generalization performance: the accuracy reaches about 99 % on the drive end data and 100 % on the fan end data.

Convolution
Different from the fully connected neural network, CNN has the characteristics of local connection and weight sharing [21], which greatly reduces the number of network parameters and the difficulty of training. In this paper, 1-D convolution is applied, as shown in Fig. 1. According to Eq. (1), the feature map can be obtained.
$$y = f(w \cdot x + b) \qquad (1)$$
where: $x$ is the data points overlapping with the convolution kernel; $w$ is the weights of the convolution kernel; $b$ is the bias; $f(\cdot)$ is the nonlinear activation function, usually the ReLU (rectified linear unit) function.
Usually, when a convolution is set to keep the size of the input unchanged, padding is necessary. Suppose the kernel size is K×1; the padding is then $(K-1)/2$. To ensure that the padding is an integer, the kernel size must be an odd number.
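The padding rule can be illustrated with a minimal pure-Python sketch. It assumes zero padding of (K − 1)/2 points on each side, a single channel, a stride of 1 and a ReLU activation; the function name `conv1d_same` and the example values are hypothetical and are not taken from the experiments in this paper.

```python
def conv1d_same(signal, kernel, bias=0.0):
    """1-D convolution (cross-correlation form) with zero padding that keeps the length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd so that (K - 1) / 2 is an integer")
    pad = (k - 1) // 2
    # Pad (K - 1) / 2 zeros on each side of the signal.
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        window = padded[i:i + k]  # data points overlapping with the kernel
        pre_activation = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, pre_activation))  # ReLU
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
same_3 = conv1d_same(signal, [1.0, 0.0, -1.0])  # K = 3, padding 1
same_5 = conv1d_same(signal, [0.2] * 5)         # K = 5, padding 2
```

In both cases the output has the same length as the input, while an even kernel size is rejected because the padding would not be an integer.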
In Fig. 2, part (a) shows two stacked 3*3 convolutions on a 5*5 image. Two 3*3 convolutions give representation power similar to that of one 5*5 convolution. Both approaches reach the same receptive field, but the first one, two 3*3 convolutions, is more popular, because it requires only 3*3*2=18 parameters while one 5*5 convolution requires 5*5=25 parameters. Generally speaking, fewer parameters make it easier to update the parameters of a neural network. Moreover, two convolutional layers apply more nonlinear transformations than one convolutional layer.
The VGG network uses a large number of 3*3 convolution kernels and obtains better performance on ImageNet. Thus, the 3*3 convolution has gradually become mainstream.
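The parameter counts and receptive fields quoted above can be checked with a short sketch. It assumes stride-1 layers with one input and one output channel and ignores biases, which matches the simplified counting in the text; the helper name `stacked_conv_stats` is hypothetical.

```python
from math import prod


def stacked_conv_stats(kernel_shapes):
    """Return (weights, receptive_field) for stride-1 conv layers applied in sequence.

    Weights are counted for a single input and output channel, without biases.
    """
    weights = sum(prod(shape) for shape in kernel_shapes)
    dims = len(kernel_shapes[0])
    # Each stride-1 layer enlarges the receptive field by (kernel size - 1) per axis.
    receptive = tuple(
        1 + sum(shape[d] - 1 for shape in kernel_shapes) for d in range(dims)
    )
    return weights, receptive


two_3x3 = stacked_conv_stats([(3, 3), (3, 3)])  # (18, (5, 5))
one_5x5 = stacked_conv_stats([(5, 5)])          # (25, (5, 5))
```

The same helper accepts K×1 kernels, so it can also be used to compare stacked 3*1 layers with a single 5*1 layer on 1-D signals.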
However, when the sample data is a 1-D signal, the situation is different. In Fig. 2b, two 3*1 convolution layers reach the same receptive field as one 5*1 convolution layer. The former requires 3*1*2=6 parameters, while the latter requires only 5*1=5 parameters, so the former has more parameters than the latter. Therefore, for 1-D signals, it is hard to say which kernel size is more suitable. In this paper, both convolution kernels are tested.
Based on the design concept of the residual network, a skip connection is added between the input and the output of a neural layer, forming a residual block (ResBlock), as shown in Fig. 3. After passing through the neural layer, the input x is mapped to F(x); the original input x is then added to F(x), giving the output H(x) = F(x) + x.
Fig. 3 Residual block
Sometimes the skip connection does not carry the original input x, but x', which is transformed from x. Then, H(x) = F(x) + x'.
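The two forms of the skip connection, H(x) = F(x) + x and H(x) = F(x) + x', can be sketched in pure Python as follows. This is only an illustration under simplifying assumptions: a single channel, zero-padded stride-1 convolutions, a ReLU between the convolution layers, and no batch normalization. The names `residual_block` and `shortcut`, as well as the example values, are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Zero-padded stride-1 1-D convolution that keeps the signal length."""
    k = len(kernel)
    pad = (k - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * x for w, x in zip(kernel, padded[i:i + k]))
        for i in range(len(signal))
    ]


def residual_block(x, kernels, shortcut=None):
    """Return H(x) = F(x) + x, or F(x) + x' when a shortcut transform is given."""
    fx = list(x)
    for index, kernel in enumerate(kernels):
        fx = conv1d_same(fx, kernel)
        if index < len(kernels) - 1:
            fx = [max(0.0, v) for v in fx]  # ReLU between stacked conv layers
    skip = shortcut(x) if shortcut is not None else list(x)
    if len(skip) != len(fx):
        raise ValueError("F(x) and the skip path must have the same length")
    return [f + s for f, s in zip(fx, skip)]


x = [1.0, -2.0, 3.0, 0.5]
# With all-zero kernels F(x) = 0, so the block reduces to the identity mapping.
identity_out = residual_block(x, [[0.0] * 3, [0.0] * 3])
# A transformed skip path x' = 2x, standing in for a projection shortcut.
projected_out = residual_block(x, [[0.0] * 3], shortcut=lambda v: [2.0 * e for e in v])
```

The first call shows why a residual block can fall back to an identity mapping: when the weighted path contributes nothing, the input passes through unchanged.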
In Section 2.1, two kernel sizes, 3*1 and 5*1, are discussed. In this section, residual blocks with different convolution kernel sizes are designed, as shown in Fig. 4. Because two 3*1 conv layers reach the same receptive field as one 5*1 conv layer, two 3*1 conv layers are stacked in part (a) and one 5*1 conv layer is used in part (c). For further comparison, part (b) is also designed. Meanwhile, a batch normalization (BN) layer is added after each conv layer. The effect of the BN layer is to produce a smoother optimization landscape, which improves optimization efficiency [22]. Besides, it also acts as a regularization method, which improves the generalization performance of the network [23]. Thus, dropout is not used in this paper, following the practice in [24].
In CNN, the commonly used down sampling methods are pooling and convolution with a stride of 2. Pooling on the feature map can achieve a more accurate feature representation, but it is very time consuming in training. In ResNet [18], few pooling layers are used in the convolutional part.
Inspired by ResNet, we build deep convolution residual networks for bearing fault identification by stacking residual blocks, as shown in Fig. 5. There are three options for the residual block, as shown in Fig. 4. The convolution layers follow two simple design rules [18]: i) layers with the same output feature map size have the same number of filters; ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity. In Fig. 5, blue cells represent weighted layers and green cells represent unweighted layers. Down sampling is mostly performed by convolution with a stride of 2. The network ends with a global average pooling layer and a dense layer with softmax. The number of units in the dense layer is the same as the number of fault types.
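The output stage described above, global average pooling followed by a dense layer with softmax, can be illustrated with a minimal pure-Python sketch. The feature maps, the weights and the choice of four output units are hypothetical values for illustration only; they are not the trained parameters or the class count of the experiments.

```python
import math


def global_average_pooling(feature_maps):
    """Average each channel over the time axis: (channels, length) -> (channels,)."""
    return [sum(channel) / len(channel) for channel in feature_maps]


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(features, weights, biases):
    """Dense layer with one row of weights per output unit, followed by softmax."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical feature maps: 3 channels, 2 time steps each.
feature_maps = [[1.0, 3.0], [0.0, 2.0], [4.0, 4.0]]
pooled = global_average_pooling(feature_maps)  # one value per channel

# Hypothetical dense layer with 4 output units (one per class).
weights = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, 0.0],
    [0.2, 0.1, 0.1],
    [0.0, 0.0, 0.5],
]
biases = [0.0, 0.0, 0.0, 0.0]
probs = dense_softmax(pooled, weights, biases)
predicted_class = probs.index(max(probs))
```

Because global average pooling reduces every channel to a single value, the dense layer only needs one weight per channel and per class, regardless of the signal length.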

Updating weights of network
The DCRN in Fig. 5 is a feedforward neural network (FNN), and for this kind of network, the weights are commonly updated with the backpropagation algorithm. With this algorithm, the weights are updated according to Eq. (2) until the network converges.
$$\theta' = \theta - \alpha \frac{\partial L}{\partial \theta} \qquad (2)$$
where: $\theta$ and $\theta'$ are the weights before and after updating, respectively; $L$ is the loss function between the predicted value and the ground-truth value; $\alpha$ is the learning rate, which decays over time.
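The update rule can be illustrated on a toy problem. The sketch below applies the plain gradient step θ' = θ − α ∂L/∂θ to a quadratic loss, with a learning rate that decays over time. The loss function, the decay schedule `decayed_rate` and all numerical values are assumptions chosen for illustration, not the settings used to train the DCRN.

```python
def sgd_step(theta, grad, learning_rate):
    """One update: theta' = theta - alpha * dL/dtheta, element by element."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]


def loss(theta):
    """Toy quadratic loss with its minimum at theta = (1, 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


def loss_gradient(theta):
    return [2.0 * (t - 1.0) for t in theta]


def decayed_rate(initial_rate, step, decay=0.05):
    """Hypothetical schedule: the learning rate shrinks as training proceeds."""
    return initial_rate / (1.0 + decay * step)


theta = [4.0, -2.0]
history = [loss(theta)]
for step in range(50):
    alpha = decayed_rate(0.1, step)
    theta = sgd_step(theta, loss_gradient(theta), alpha)
    history.append(loss(theta))
```

On this toy loss, every step moves the weights closer to the minimum, so the recorded loss values decrease until the updates become negligible.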

Data pre-process
The data in this paper is part of the open-source bearing data from Case Western Reserve University. The drive end fault vibration signals and the normal bearing vibration signals, collected at 12000 samples/second, are used. The original dataset contains 3 types of bearing faults: inner raceway fault, rolling element fault and outer raceway fault. The four kinds of signal are shown in Fig. 6. All faults were artificially manufactured before the experiment.
Among all working conditions, the lowest speed is 1730 rpm. Because the data is sampled at 12000 samples/second, about 416 data points (60/1730×12000 ≈ 416) are collected when the shaft rotates for one cycle; that is, one data period is 416 data points. The time span of a single sample should be related to the data period. Generally speaking, the time span of a sample needs to be longer than one data period, and a length of $k \times 2^n$ data points is preferred. Therefore, in this paper, 512 consecutive data points (k=1, n=9) are extracted as a single sample.
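The arithmetic behind the sample length can be reproduced with a short sketch. It computes the number of data points per shaft revolution and then searches for the smallest length of the form k×2^n that is longer than one data period; the function names are hypothetical.

```python
def points_per_revolution(sampling_rate_hz, shaft_speed_rpm):
    """Number of data points collected during one shaft revolution."""
    return sampling_rate_hz * 60.0 / shaft_speed_rpm


def smallest_window(period_points, k=1):
    """Smallest k * 2**n that is longer than one data period, returned with n."""
    n = 0
    while k * 2 ** n <= period_points:
        n += 1
    return k * 2 ** n, n


period = points_per_revolution(12000, 1730)  # about 416.2 points
window, n = smallest_window(period)          # 512 points with n = 9
```

At the lowest speed of 1730 rpm the data period is the longest, so a window that covers one period at this speed also covers at least one period at every higher speed.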
In order to obtain as many samples as possible, overlapping sampling is used, as shown in Fig. 7. In this figure, stride represents the interval between two adjacent samples. Overlapping sampling damages the independence between samples, and the smaller the stride is, the weaker the independence between samples will be. Therefore, the stride needs to be set to an appropriate value. The number of samples obtained from one data file is given by Eq. (3).
$$N = \left\lfloor \frac{L - 512}{\mathrm{stride}} \right\rfloor + 1 \qquad (3)$$
where: N is the number of samples, L is the number of data points in the mat file, and stride is the interval between two adjacent samples.
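Overlapping sampling can be sketched as a sliding window over the raw signal. The sketch assumes the usual count of floor((L − window) / stride) + 1 windows for a signal of L points; the stride of 64 and the synthetic signal are hypothetical, since the stride used in the experiments is not stated here.

```python
def overlapping_samples(signal, window=512, stride=64):
    """Cut a 1-D signal into windows whose start points are `stride` apart."""
    if len(signal) < window:
        return []
    count = (len(signal) - window) // stride + 1
    return [signal[start:start + window] for start in range(0, count * stride, stride)]


signal = list(range(1000))  # stand-in for one vibration record
samples = overlapping_samples(signal, window=512, stride=64)
non_overlapping = overlapping_samples(signal, window=512, stride=512)
```

With 1000 data points, a stride of 64 gives 8 samples, whereas a stride equal to the window length gives only 1, which shows how overlapping increases the number of samples at the cost of their independence.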

Fig. 7 Overlapping sampling
In order to improve the performance of the network, a larger proportion of the sample data is used as the training set. After the sample data set is shuffled, 90 % of the samples are randomly selected for training and the remaining 10 % for validation. The results of the division are shown in Table 1.
Fig. 5 shows the rough network architecture, and some details can be seen in Table 2. In this experiment, three architectures are designed. TensorFlow 2.3.0, a Python-based deep learning framework, is used to write the program. For classification problems, the commonly used loss function is the cross-entropy loss, as shown in Eq. (4).
$$L = -\sum_{i} y_i \log \hat{y}_i \qquad (4)$$
where: $y_i$ is the real label of the sample, encoded in one-hot encoding; $\hat{y}_i$ is the predicted probability that the sample belongs to the $i$-th fault type.
Weights and biases are initialized as in [25]. The Adam optimizer with a mini-batch size of 512 is selected, and the two moving average decay rates are 0.9 and 0.999. The learning rate starts from 0.01 and follows a piecewise constant decay, as Fig. 8 shows. In this experiment, each architecture is trained for 100 epochs three times, and one of the results is shown in Fig. 9. The upper part of the figure is the accuracy on the validation set, and the bottom part is the confusion matrix on the validation set. From these three training procedures, we can see that all three architectures achieve very high accuracy on the validation set; the average accuracy is shown in Table 3.
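As an illustration of the cross-entropy loss used for training, the following sketch evaluates it for one sample with a one-hot label. The four-class label and the predicted probabilities are hypothetical values, and the small constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(one_hot_label, predicted_probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(
        y * math.log(max(p, eps))
        for y, p in zip(one_hot_label, predicted_probs)
    )


label = [0.0, 1.0, 0.0, 0.0]  # the sample belongs to the second class
confident = cross_entropy(label, [0.01, 0.97, 0.01, 0.01])
uncertain = cross_entropy(label, [0.25, 0.25, 0.25, 0.25])
```

Because the label is one-hot, only the probability assigned to the true class contributes to the loss, so a confident correct prediction yields a much smaller value than a uniform one.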
From the results, for convolution on 1-D signals, a 5*1 kernel is more suitable than a 3*1 kernel. Both architecture 2 and architecture 3 have slightly better generalization performance than architecture 1. Besides, the generalization performance of architecture 3 is slightly superior to that of architecture 2; this is because architecture 2 has three times as many parameters as architecture 3 and is therefore prone to slight overfitting.
Fig. 9 Training procedure and prediction of the three architectures

Discussion
Although the deep convolution residual network achieves very good results, its internal operation mechanism cannot be observed directly. A popular explanation is that a deep neural network has a very powerful feature extraction ability, so the network can be seen as a combination of a feature extractor and a classifier. Taking architecture 3 as an example, the dense layer is dropped and the output of the global average pooling layer is visualized with the t-SNE algorithm [26], as shown in Fig. 10. We can clearly see that, after passing through several residual blocks, the various fault types form clusters in high-dimensional space, which makes it easy for the dense layer to identify the fault types.
In order to further verify the robustness of the proposed method, the fan end data is also used for testing. We use the trained model of architecture 2. The model is trained on the drive end data, so it has already learned how to identify the fault type. With transfer learning [27], we do not need many samples from the fan end, and only 882 samples are required to obtain an excellent model (the batch size is 128, and the learning rate starts from 0.01 with piecewise constant decay). The result on the validation set is shown in Fig. 11. Compared with the curve in Fig. 9, the curve in Fig. 11 shows high accuracy in the initial stage of training, because the initialized model has already learned relevant knowledge. Throughout the entire training process, the highest identification accuracy reaches 100 %.

Comparison with other methods
In [20], multiple methods are tested on the fan end data. Li et al. [28] used the wavelet packet transform (WPT) to extract features and applied a Support Vector Machine (SVM) to classify faults from these features. Dhamande et al. [29] applied an artificial neural network (ANN) to the signal in the time domain. Some other DNN models are also tested, such as CNN [30] without skip connections and long short-term memory (LSTM), and the results are shown in Fig. 12. Compared with these methods, the method proposed in this paper has the highest recognition accuracy.
2. In this paper, three different DCRN architectures are designed, and all three achieve very high accuracy: 99.60 %, 99.71 % and 99.81 %, respectively. Furthermore, the model is tested on the fan end signal and the final result reaches 100 %.

3. By simply stacking residual blocks and using skip connections, DCRN can take better advantage of the DNN. Therefore, DCRN has good application prospects in the field of bearing fault identification.

Statement
The authors declare that there is no conflict of interest and that the funding mentioned in the Acknowledgement section did not lead to any conflict of interest regarding the publication of this manuscript.