Deep Learning


POST ON 2024-03-05 BY WOLVES

UPDATE ON 2025-01-06 BY WOLVES


You can find the code for this blog on GitHub.

1. Neurons and Artificial Neural Networks

| Biological Neurons | Artificial Neural Networks |
| --- | --- |
| Composed of cell body, dendrites, and axon | Composed of nodes (artificial neurons) |
| Dendrites receive signals, the axon transmits signals | Nodes receive input signals and perform weighted summation |
| Signals are transmitted through synapses | Signals are transmitted through connections (weights) |
| Use chemical or electrical signals | Use mathematical functions and algorithms |
| Complex biological structure | Computational model, divided into input layer, hidden layers, and output layer |

This table shows the comparison between biological neurons and artificial neural networks, helping to understand their similarities and differences.

2. Forward Propagation

In a neural network, forward propagation refers to the process of signal transmission from the input layer to the output layer. Each node (neuron) receives input signals, performs a weighted summation, and generates output signals through an activation function. The basic steps of forward propagation are as follows:

  1. Input Layer: Receives input data $\vec{x}$.
  2. Hidden Layer: Each node calculates the weighted sum $z = \sum (w_i \cdot x_i) + b$, where $w_i$ is the weight and $b$ is the bias.
  3. Activation Function: The weighted sum $z$ is passed through an activation function $a = f(z)$ to generate the output.
  4. Output Layer: Outputs the final result.

The purpose of forward propagation is to compute the output of the neural network for prediction or classification.
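As a concrete illustration, here is a minimal NumPy sketch of forward propagation through a single hidden layer. The layer sizes, the random weights, and the use of sigmoid everywhere are arbitrary choices for the example, not a prescription.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy network: 3 inputs -> 4 hidden units -> 1 output (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                        # input vector x
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden-layer weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output-layer weights and biases

z1 = W1 @ x + b1      # weighted sum in the hidden layer
a1 = sigmoid(z1)      # hidden-layer activation
z2 = W2 @ a1 + b2     # weighted sum in the output layer
a2 = sigmoid(z2)      # final output (e.g. a probability)
print(a2)
```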

3. Backward Propagation

Backpropagation is a key algorithm used in training neural networks. It involves propagating the error from the output layer back through the network to update the weights and biases, minimizing the error in predictions. The basic steps of backpropagation are as follows:

  1. Calculate Error: Determine the error at the output layer by comparing the predicted output with the actual target values.
  2. Output Layer: Compute the gradient of the loss function with respect to the output of the network.
  3. Hidden Layers: Propagate the error back through the network, calculating the gradient of the loss function with respect to each layer’s weights and biases.
  4. Update Weights and Biases: Adjust the weights and biases using the calculated gradients and a learning rate to minimize the error.

The purpose of backpropagation is to optimize the neural network’s parameters, improving its accuracy in making predictions or classifications.
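Continuing the toy network from the previous section, the sketch below performs one backpropagation step with a squared-error loss. The data, layer sizes, and learning rate are made-up example values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])       # one training example (toy data)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
lr = 0.1                                          # learning rate (arbitrary)

# forward pass
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# backward pass (squared-error loss 0.5 * (a2 - y)^2)
dz2 = (a2 - y) * a2 * (1 - a2)        # error at the output layer
dW2, db2 = np.outer(dz2, a1), dz2     # gradients for output weights/bias
dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # error propagated back to the hidden layer
dW1, db1 = np.outer(dz1, x), dz1      # gradients for hidden weights/bias

# gradient-descent update
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```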

4. Activation Function

Activation functions are used to introduce non-linearity into the neural network, allowing it to handle complex patterns in data. Common activation functions include:

  • Sigmoid function: $a = \frac{1}{1 + e^{-z}}$
  • ReLU (Rectified Linear Unit): $a = \max(0, z)$
  • Tanh (Hyperbolic Tangent): $a = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

These functions help the neural network learn and generalize better.
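All three functions are one-liners in NumPy; the sketch below just evaluates them on a few sample inputs.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)   # same as (e^z - e^-z) / (e^z + e^-z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), tanh(z))
```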

4.1 How to choose activation function

  • Select the activation function based on the value of y

  • The most common activation function is ReLU

  • For the output layer

    • Binary classification (y is 0 or 1): use the sigmoid function
    • Regression where y can be negative or positive: use a linear activation
    • Regression where y is always equal to or greater than 0: ReLU can be used
  • For hidden layers

    • ReLU is the most common activation function

5. Multiclass Classification

  • From binary (positive/negative) classification to multiple classes

  • Now we need a new activation function that can handle multiple classes, such as the softmax function

5.1 Softmax Function

$$ z_i = w_i \cdot x + b_i $$

$$ a_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} $$

  • Sparse categorical cross-entropy loss: $$ loss = - \sum_{i=1}^{n} y_i \log(a_i) = - \log(a_{true}) $$ where $y_i$ is the one-hot label, so only the term of the true class remains.
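A minimal NumPy sketch of softmax plus the sparse categorical cross-entropy loss; the logits and the label are arbitrary example values.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sparse_categorical_cross_entropy(z, true_class):
    a = softmax(z)
    return -np.log(a[true_class])   # only the true class contributes to the loss

z = np.array([2.0, 1.0, 0.1])       # example logits
print(softmax(z))                               # probabilities summing to 1
print(sparse_categorical_cross_entropy(z, 0))   # loss when class 0 is the label
```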

5.2 Numerical RoundOff Error

  • Numerical Stability: when computing the cross-entropy loss, feeding the raw logits into the loss function rather than the probabilities produced by the sigmoid/softmax activation improves numerical stability. This avoids underflow or overflow caused by extreme probability values (close to 0 or 1).

  • Use from_logits=True in the loss function. With from_logits=True, the loss function (such as BinaryCrossentropy or CategoricalCrossentropy) handles the sigmoid or softmax internally. The last layer of the model therefore only needs to output logits, without a manually added activation function. This simplifies the model definition and makes the code more concise.

It is equivalent to skipping the intermediate probabilities and computing the final result directly, thereby improving numerical stability.
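As a sketch of how this looks in Keras (the layer sizes and input shape are arbitrary example values): the last Dense layer outputs raw logits and the loss is told so via from_logits=True.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),   # no softmax here: the layer outputs raw logits
])

model.compile(
    optimizer="adam",
    # from_logits=True: the loss handles the softmax internally, in a numerically stable way
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```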

5.3 RMSProp Optimizer

  • In deep learning, RMSprop (Root Mean Square Propagation) is a commonly used optimization algorithm, mainly designed to address learning-rate adjustment in gradient descent. By adaptively adjusting the learning rate of each parameter, it can effectively accelerate convergence and reduce oscillation.

$$ E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2 $$

$$ w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t $$
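A minimal NumPy sketch of the two update equations above; the gradient values and hyperparameters are arbitrary example numbers.

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step; `cache` stores the running average E[g^2]."""
    cache = beta * cache + (1 - beta) * grad ** 2    # E[g^2]_t
    w = w - lr * grad / np.sqrt(cache + eps)         # parameter update
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
grad = np.array([0.5, -0.3])          # gradient from some loss (example values)
w, cache = rmsprop_update(w, grad, cache)
```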

5.4 Adam Optimizer (automatically adapts the learning rate)

Adam Optimizer

Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the advantages of two other extensions of stochastic gradient descent, namely adaptive gradient algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). It computes adaptive learning rates for each parameter.

Advantages of Adam Optimizer

  • Adaptive Learning Rates: Adam computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

  • Efficient: It is computationally efficient and has low memory requirements.

  • Invariance to Diagonal Rescaling of Gradients: The algorithm is invariant to diagonal rescaling of the gradients.

  • Suitable for Non-stationary Objectives: It works well with problems that are large in terms of data/parameters or non-stationary.

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$ $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$

$$ w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$
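The same equations written as a small NumPy sketch; the parameter values and the constant fake gradient are placeholders for illustration.

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts from 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                            # a few steps with a fake constant gradient
    grad = np.array([0.5, -0.3])
    w, m, v = adam_update(w, grad, m, v, t)
```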

6. Some Concepts

6.1 Parameter & Hyperparameter

  • Parameter: The weights and biases in the model

  • Hyperparameter: The learning rate, batch size, number of epochs, etc.

  • In the training process, the model parameters are adjusted through backpropagation, while the hyperparameters are set manually by the user.

  • There are methods for searching for good hyperparameters automatically, but in practice they are often still chosen by experience.

6.2 Human Brain and Deep Learning

There are many similarities between the human brain and deep learning. Deep learning models are inspired by the structure and function of the human brain, particularly the way neurons and synapses work. But today, deep learning models are still far from the complexity and flexibility of the human brain.

  • Neurons and Artificial Neurons: The human brain is composed of billions of neurons connected through synapses. Similarly, deep learning models consist of many artificial neurons connected by weights.

  • Learning Process: The human brain continuously adjusts the strength of connections between neurons through experience and learning. Deep learning models adjust weights using training data and backpropagation algorithms to minimize the loss function.

  • Hierarchical Structure: Neurons in the human brain are distributed across different regions and layers, responsible for processing different types of information. Deep learning models also consist of multi-layer neural networks, with each layer extracting different levels of features.

Although deep learning models perform excellently on certain tasks, they are still far from the complexity and flexibility of the human brain. Future research may further draw on the principles of the human brain to enhance the performance and adaptability of deep learning models.

6.3 Train/Dev/Test Sets

At the beginning of a deep learning project, we need to choose the number of layers, hidden units, learning rate, activation function, etc. (the hyperparameters).

  • train set: used for training the model
  • dev set: used for tuning hyperparameters and comparing models
  • test set: used for testing the model

6.4 Vanishing/Exploding Gradient

  • Vanishing gradient: the gradient becomes too small, especially in deep hidden layers that use sigmoid/tanh activation functions
    • with gradients near zero, the early layers barely learn anything
    • training stalls and the model can get stuck at a poor solution
  • Exploding gradient: the gradient becomes too large, which can happen in very deep networks with poorly scaled weights
    • the parameter updates blow up
    • training becomes unstable and the loss may diverge

6.5 Weight Initialization

  • random initialization
  • He initialization - for ReLU activation function $$W \sim \mathcal{N}(0, \sqrt{\frac{2}{n_{in}}})$$
  • Xavier initialization - for sigmoid/tanh activation function $$W \sim \mathcal{N}(0, \sqrt{\frac{1}{n_{in}}})$$

$n_{in}$ is the number of input neurons of the current layer
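A small NumPy sketch of both initializations; the layer sizes are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128          # example layer sizes

# He initialization (for ReLU layers): standard deviation sqrt(2 / n_in)
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

# Xavier initialization (for sigmoid/tanh layers): standard deviation sqrt(1 / n_in)
W_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
```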

6.6 Mini-batch Gradient Descent

  • mini-batch gradient descent is a variant of gradient descent that uses a small subset of the training data (mini-batch) to update the model parameters.

  • It is used to speed up the training process and reduce the memory usage.

  • Advantages

    • lower memory usage
    • faster convergence
    • better generalization
  • e.g. if the data set has 10,000 samples, we can use 100 of them at a time to update the model parameters
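A minimal sketch of how the mini-batches in that example could be drawn (the data here is random placeholder data and the actual gradient update is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10000, 5)), rng.normal(size=(10000,))  # fake data set
batch_size = 100

perm = rng.permutation(len(X))                  # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]
    X_batch, y_batch = X[idx], y[idx]
    # ...compute gradients on this mini-batch and update the parameters...
```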

6.7 Gradient Checking

  • Gradient Checking is a technique used to verify the correctness of the gradients computed by the backpropagation algorithm.
  • It compares the analytic gradients with numerical approximations; if the two are very close, the backpropagation implementation is almost certainly correct.
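A minimal sketch of gradient checking with centered finite differences, using a toy function whose analytic gradient is known:

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-7):
    """Approximate df/dw with centered differences (w is a 1-D parameter vector)."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# example: f(w) = sum(w^2), whose analytic gradient is 2w
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w
numeric = numerical_gradient(lambda v: np.sum(v ** 2), w)
rel_error = np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))
print(rel_error)   # should be tiny (~1e-9) if the analytic gradient is correct
```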

7. Bias and Variance

  • In the typical illustration

    • a model with high bias and low variance underfits the data
    • a model with low bias and high variance overfits the data
  • Bias: the difference between the model's average prediction and the true value

  • Variance: how much the model's predictions vary around their average across different training sets

8. Regularization

  • Regularization is a technique used to prevent overfitting in deep learning models. It adds a penalty term to the loss function to encourage the model to learn simpler patterns.

  • L1 Regularization: $loss = loss + \lambda ||w||_1$

  • L2 Regularization: $loss = loss + \lambda \sum w^2$

  • (L2 Regularization is also called weight decay)

  • $\lambda$ is the regularization parameter, which controls the strength of the regularization.
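As a sketch of how L2 regularization is typically attached to a layer in Keras (the layer sizes and the value of lambda are arbitrary example choices):

```python
import tensorflow as tf

lam = 0.01   # the regularization parameter lambda (example value)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    # L2 regularization (weight decay): adds lam * sum(w^2) to the loss for this layer
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```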

8.1 Dropout
  • Dropout is a regularization technique that randomly drops out some neurons during training to prevent overfitting.

  • During each training iteration, different neurons may be randomly deactivated, which enhances the learning strength of other neurons and improves the model’s generalization ability.
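A minimal Keras sketch with dropout between the hidden layers; the layer sizes and the 0.5 drop rate are arbitrary example choices.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly drops 50% of the units, only during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```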

8.2 Other Regularization Techniques
  • Early Stopping

9. Computer Vision

From this part, we will learn how to use DL to solve some problems in computer vision.

9.1 Some problems

  • Instead of a 64x64x3 image, we may have to handle a 1000x1000x3 image; a fully connected input layer would then have an enormous number of parameters, which costs a lot of time and memory in computation.

  • So we need methods that reduce the number of parameters required to process such inputs, rather than feeding the raw pixels into fully connected layers.

9.2 Some concepts

  • Gray Image
    • Every pixel has only one channel, which represents the brightness of that pixel
    • the value range is [0, 255]
    • It is often used in edge detection or binary classification
  • RGB Image
    • Every pixel has three channels, representing the brightness of the pixel in the red, green, and blue channels
    • the value range of each channel is [0, 255]
    • It is often used in object detection or multi-class classification

9.3 SVM

  • Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression.

10. Convolutional Neural Network

  • Convolutional Neural Network (CNN) is a type of neural network that is commonly used in computer vision tasks.

  • It is a type of feedforward neural network that is specifically designed for image processing and analysis.

  • It reduces the number of parameters and the amount of computation in the model, and it can also extract features from the image

10.1 Edge detection

  • use a 3x3 filter to detect the edge of the image

  • filters

    • Sobel filter
      • $G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$
      • $G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$
    • Prewitt filter
      • $G_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$
      • $G_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$
    • Scharr filter
      • $G_x = \begin{bmatrix} -3 & 0 & 3 \\ -10 & 0 & 10 \\ -3 & 0 & 3 \end{bmatrix}$
      • $G_y = \begin{bmatrix} -3 & -10 & -3 \\ 0 & 0 & 0 \\ 3 & 10 & 3 \end{bmatrix}$
  • Then we convolve the filter with the image: at each position, the filter is multiplied element-wise with the corresponding image patch and the results are summed, producing a new image (feature map) that is smaller than the original image

  • The resulting feature map highlights the edges of the image
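A small sketch of vertical-edge detection with the Sobel $G_x$ filter, using SciPy on a toy 6x6 image (note that convolve2d flips the kernel; deep-learning "convolutions" are actually cross-correlations, which here only changes the sign of the response):

```python
import numpy as np
from scipy.signal import convolve2d

# a toy 6x6 grayscale image with a vertical edge down the middle
image = np.zeros((6, 6))
image[:, 3:] = 255.0

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# 'valid' mode: the 3x3 filter only visits positions fully inside the image,
# so the result is a 4x4 feature map
feature_map = convolve2d(image, sobel_x, mode="valid")
print(feature_map.shape)   # (4, 4); large-magnitude values mark the vertical edge
```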

10.2 Padding

  • Padding is a technique used to keep the size of the image after a convolution

  • When we process a 6x6 image with a 3x3 filter, we get a 4x4 feature map, because there are only 4x4 positions where the filter fits entirely inside the image

  • Pixels near the border are covered by far fewer filter positions than pixels in the middle, so information near the edges of the image is partly lost

  • Padding addresses this by adding a border (usually of zeros) around the image; the border pixels then contribute to more filter positions, and the output can keep the original size


    • $n_{out} = \frac{n_{in} - n_{filter} + 2 \times padding}{stride} + 1$
    • $n_{out}$ is the output size
    • $n_{in}$ is the input size
    • $n_{filter}$ is the filter size
    • $padding$ is the padding size
    • $stride$ is the stride size
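The formula is easy to check with a tiny helper function (the sizes below are just example values):

```python
def conv_output_size(n_in, n_filter, padding=0, stride=1):
    """Output size of a convolution: floor((n_in - n_filter + 2*padding) / stride) + 1."""
    return (n_in - n_filter + 2 * padding) // stride + 1

print(conv_output_size(6, 3))                       # 4 -> 6x6 image, 3x3 filter, no padding
print(conv_output_size(6, 3, padding=1))            # 6 -> "same" padding keeps the size
print(conv_output_size(7, 3, padding=0, stride=2))  # 3 -> a larger stride shrinks the output
```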

10.3 Stride

  • Stride is the step size of the filter when moving in the image

  • when we use a 3x3 filter to process a 6x6 image, we will get a 4x4 feature map, because the filter moves 1 pixel each time

  • We can also use a stride of 2 to reduce the size of the feature map

10.4 Three Dimensions

  • RGB image - three stacked channels

    • 3 channels
    • 3x3x3 filter
    • 4x4x1 feature map
  • For the same example, a 6x6 image with 3 channels: the filter is 3x3x3, like a small cube. Each layer of the filter is convolved with the corresponding channel of the image; this gives three 4x4 results, which are then added together to produce a single 4x4x1 feature map.

10.5 One layer of a Convolutional Neural Network

Convolution kernel + bias + activation function + (others)

  • In the previous part we used a single 3x3x3 filter on a 6x6x3 image to get one feature map. In practice we use many filters, obtain many feature maps, and stack them up.

  • The result looks like this:

    • 6x6x3 image

    • two 3x3x3 filters

    • 4x4x2 feature map

  • And that is one simple layer of a Convolutional Neural Network.

  • Here is a case:

    • 6x6x3 image
    • 10 filters with 3x3x3
  • the output will be:

    • 4x4x10 feature maps
    • (3x3x3 + 1) x 10 = 280 parameters (we also have a bias $b$ for each filter)
  • In this case the model is not prone to overfitting, because the number of parameters stays small; we can use more filters to process images that are very complex and large.
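The same counts can be verified with a short Keras sketch (the 6x6x3 input and ten 3x3 filters mirror the example above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 3)),
    tf.keras.layers.Conv2D(filters=10, kernel_size=3),  # 'valid' padding, stride 1
])

model.summary()                # output shape: (None, 4, 4, 10)
print(model.count_params())    # (3*3*3 + 1) * 10 = 280
```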

10.6 Convolutional Neural Network

  • We can stack many such layers to obtain richer feature maps, then unroll (flatten) the final feature maps into a long vector and feed it into logistic regression or softmax regression, just like a normal neural network. That is a Convolutional Neural Network.

10.7 Pooling layer

  • Pooling layer is a layer that uses a filter to process the image for reducing the size of the feature map

  • The most common pooling is max pooling, it will select the maximum value in the filter’s range, and then we will get a new feature map, which is smaller than the original feature map.

  • The result of the pooling is like this:

    • original 4x4x10 feature maps
    • result 2x2x10 feature maps
  • Pooling also slides a window over the feature map, but it has no weights or biases; it simply selects the maximum value within the window, and it keeps the same number of channels.

  • Another kind of pooling is average pooling, which takes the average value within the window.

10.8 Fully Connected Layer

  • The Fully Connected Layer (FC Layer) is a common type of layer in neural networks, typically used to map the output from the previous layer to the final prediction result.

  • As in a normal neural network, we can stack several FC layers after the flattened feature vector and finish with logistic regression or softmax regression to produce the prediction.

10.9 Hyperparameter

  • Hyperparameter is a parameter that is not learned from the data, it is set by the user.

  • We can find good hyperparameters by consulting papers or using search tools, instead of tuning everything ourselves from scratch.

11. Classic Networks

11.1 LeNet-5

  • LeNet-5 is a classic network in the field of computer vision, which is used for handwritten digit recognition.

  • You can use max pooling instead of average pooling.

  • It has about 60,000 parameters, which is very small compared with modern networks.

11.2 AlexNet

  • AlexNet is a classic network in the field of computer vision, which is used for image classification.

  • It has about 60 million parameters

  • Local Response Normalization (LRN) - normalizes activations across neighboring channels at each position of the feature map (rarely used in modern networks)

11.3 VGG

  • VGG is a classic network in the field of computer vision, which is used for image classification.

  • VGG-16 has about 138 million parameters and VGG-19 more than 140 million

  • 3x3 filters with stride 1, using padding to keep the size of the feature map

  • 2x2 max pooling with stride 2

11.4 GoogLeNet

  • GoogLeNet is a classic network in the field of computer vision, which is used for image classification.

  • It is built from Inception modules (see the Inception Network section below), which apply convolution kernels of several sizes in parallel.

  • Thanks to 1x1 "bottleneck" convolutions, it has far fewer parameters than AlexNet or VGG (only a few million).

12. Modern Network

12.1 ResNet

  • Very deep neural networks are difficult to train because of vanishing and exploding gradients.

  • ResNet uses skip connections to address this problem: the activation from one layer is fed directly into a much later layer, which makes it possible to train much deeper networks.

12.1.1 Residual Block

  • Instead of only following the main path of the network, a residual block adds a skip connection (shortcut) that bypasses one or more layers, which allows the network to go much deeper.

  • eg.

before:

$$a^{[l+2]} = g(z^{[l+2]}) = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]})$$

after:

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

  • The normal path computes $a^{[l+2]} = g(z^{[l+2]})$; adding the shortcut that skips these layers turns them into a Residual Block.
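A minimal Keras sketch of such a block, using two 3x3 convolutions on the main path and assuming the input already has the right number of channels so that the shapes match; the input shape and filter count are arbitrary example values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Minimal residual block: the input is added back after two conv layers.
    Assumes x already has `filters` channels so the shapes match."""
    shortcut = x                                                      # a^[l]
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                  # z^[l+2]
    y = layers.Add()([y, shortcut])                                   # z^[l+2] + a^[l]
    return layers.Activation("relu")(y)                               # g(z^[l+2] + a^[l])

inputs = tf.keras.Input(shape=(32, 32, 16))   # example input shape
outputs = residual_block(inputs, 16)
model = tf.keras.Model(inputs, outputs)
```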

12.1.2 ResNet

  • In ResNet diagrams, dashed skip connections indicate that the dimensions on the two sides of the connection differ, so special processing (for example, a 1x1 convolution) is required to match them

12.1.3 1x1 Convolution

  • A 1x1 convolution is a convolution used to change the number of channels of the feature map

  • It is used to reduce the number of parameters and the amount of computation in the model

  • It acts like a fully connected layer applied independently at every position of the feature map

  • In practice, a pooling layer is used to reduce the spatial size of the feature map, while a 1x1 convolution is used to change its channel configuration.

12.1.4 Inception Network

  • An Inception module is a collection of different layers: since we don't know in advance which layer is best, we apply all of them and merge the results into a new feature map.

  • The Inception Network uses a parallel structure to process the image; using convolution kernels of different sizes (e.g. 1x1, 3x3, 5x5) in the same layer enables simultaneous extraction of local details and global context information

Input
├── 1x1 convolution (64 filters)
├── 3x3 convolution (128 filters)
├── 5x5 convolution (32 filters)
└── 3x3 max pooling → 1x1 convolution (32 filters)
Output (concatenation of the results of all branches)

  • But such a module has many parameters and a huge amount of computation. To address this, a 1x1 convolution is applied first to reduce the number of channels (a bottleneck layer), which cuts the computation by roughly an order of magnitude.

13. Transfer Learning

  • Many models have already been trained by others on huge datasets; we can use such a pre-trained model to solve a new problem.

  • The newly added layers learn the features of the new task, while the pre-trained layers keep the features they already learned.

  • The more new data you have, the more of the pre-trained layers you can retrain or replace. A minimal sketch of this workflow is shown below.
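A minimal Keras sketch of transfer learning, using MobileNetV2 pre-trained on ImageNet as the frozen base; the input size, the 5-class head, and the pooling layer are arbitrary example choices:

```python
import tensorflow as tf

# load a model pre-trained on ImageNet, without its classification head
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(160, 160, 3))
base.trainable = False          # freeze the pre-trained layers (keep the old features)

# add a new head for the new task (here: 5 classes, an arbitrary example)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```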

14. Object Detection

Image Classification -> Classification with Localization -> Object Detection

14.1 Classification with Localization

  • The output is not only the classification result but also a localization result: a bounding box.

  • The dataset needs to provide the bounding box of each object, and it needs examples of the objects as well as of pure background.

  • Sliding Window Detection

  • A window slides over the image, and the classification model classifies the crop inside each window.

  • The window has a fixed size in each pass and is enlarged for the next round, so the method is very slow.

  • The resulting bounding boxes are often not accurate, and the method may even fail to detect the object at all.

14.2 landmark detection

  • Landmark detection is a technique that detects landmarks of an object, such as the eyes, mouth, and nose of a face. It outputs the key points of the object.

  • It can be used in face detection, object detection, pose estimation, etc.

14.3 Yolo

  • YOLO means "You Only Look Once"; it is a method for detecting objects in an image.

  • First, the image is divided into a grid.

  • Then, for each grid cell, we predict the bounding boxes of the objects and their classes.

  • Each object is represented by a key point, the center of its bounding box; YOLO uses this point to assign the object to a grid cell.

  • The output is an S x S x (bounding box + class) tensor, where S x S is the number of grid cells.

14.4 Intersection Over Union

  • Intersection Over Union is a metric that is used to evaluate the performance of the object detection model.

  • It is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth bounding box.

  • A detection is counted as "correct" if IoU > 0.5; the 0.5 threshold is just a manually chosen convention.
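A small sketch of the IoU computation for axis-aligned boxes given as corner coordinates (the example boxes are made up):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7 ≈ 0.143
```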

14.5 Non-max suppression

  • Non-max suppression is a technique for making sure that the algorithm detects each object only once.

  • Several grid cells may each propose a bounding box for the same object, so the same object can be detected more than once.

  • Non-max suppression removes these duplicate detections.

  • It keeps the bounding box with the highest confidence score and suppresses the other boxes that overlap it heavily (high IoU) but have lower confidence scores.

14.6 Anchor Box

  • Anchor boxes are a technique that lets one grid cell detect more than one object.

  • A set of anchor-box shapes is predefined; each detected object is assigned to the anchor box whose shape it overlaps most (highest IoU), so objects with different shapes (for example, a tall pedestrian and a wide car) can be predicted by the same grid cell.

14.7 R-CNN (regions with CNN)

  • It picks a few candidate regions that look promising and runs a conv-net classifier only on those regions.

  • A segmentation algorithm finds blobs, i.e. the prominent areas of the image, and the conv net then processes these region proposals.

  • Based on the scale of the segmented regions, suitable candidate regions are selected and then processed by convolution.

  • Advantages

    • High accuracy
  • Disadvantages

    • Slow
    • Computationally expensive
  • Fast R-CNN

    • Uses a convolutional implementation instead of sliding windows
    • Uses RoI pooling to map the candidate regions onto the feature map
    • Uses a multi-task loss function instead of a separate classification loss
  • Faster R-CNN

    • Uses a Region Proposal Network (RPN) with sliding anchor windows, instead of selective search, to generate candidate regions
    • Uses anchor boxes instead of selective search

15. Face Recognition

  • 1. Detect the face
  • 2. Liveness detection (check whether it is a live person, not a photo)

15.1 Concepts

15.1.1 Face Verification vs. Face Recognition
  • Face Verification

    • input: an image and a claimed identity (a reference face image)
    • output: whether they are the same person
  • Face Recognition

    • input: a database of K persons and a new image
    • output: which of the K persons the image shows (if any)

15.2 One-shot learning

  • A plain CNN classifier does not work well here: we have only one (or a few) images per person, which is not enough to train a good classifier, and whenever a new person is added we would have to retrain the model.

  • similarity function

    • $d(\text{img1}, \text{img2})$ = degree of difference between the two images
    • if $d(\text{img1}, \text{img2}) < \delta$, then they are the same person

15.3 Siamese Network

  • A Siamese network runs the same encoding network $f$ on both images and compares the encodings:

$$d(x^{(1)}, x^{(2)}) = ||f(x^{(1)}) - f(x^{(2)})||^2$$

  • if the distance is small, the two images show the same person

15.4 Triplet Loss

  • Anchor, Positive, Negative

  • Anchor: the image of the person

  • Positive: the image of the same person

  • Negative: the image of the other person

  • want: $||f(Anchor) - f(Positive)||^2 - ||f(Anchor) - f(Negative)||^2 + \alpha \leq 0$

  • so the formula is:

$$L = \max(0, ||f(Anchor) - f(Positive)||^2 - ||f(Anchor) - f(Negative)||^2 + \alpha)$$

$$ J = \sum_{i=1}^{m} L(a^{(i)}, p^{(i)}, n^{(i)}) $$

  • We need a lot of data to train the model, and choosing informative (anchor, positive, negative) triplets is not easy, so the model is difficult to train.
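A minimal NumPy sketch of the triplet loss on precomputed embeddings; the 4-dimensional embeddings and the margin value are arbitrary example values.

```python
import numpy as np

def triplet_loss(f_anchor, f_positive, f_negative, alpha=0.2):
    """Triplet loss on precomputed embeddings f(A), f(P), f(N); alpha is the margin."""
    pos_dist = np.sum((f_anchor - f_positive) ** 2)   # ||f(A) - f(P)||^2
    neg_dist = np.sum((f_anchor - f_negative) ** 2)   # ||f(A) - f(N)||^2
    return max(0.0, pos_dist - neg_dist + alpha)

# toy 4-dimensional embeddings (real systems use e.g. 128-dimensional encodings)
a = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.1, 0.25, 0.3, 0.35])   # same person: close to the anchor
n = np.array([0.9, 0.1, 0.7, 0.2])     # different person: far from the anchor
print(triplet_loss(a, p, n))
```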

15.5 Face Recognition

$$ \hat{y} = \sigma\left(\sum_{k=1}^{K} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b\right) $$

or

$$ \hat{y} = \sigma\left(\sum_{k=1}^{K} w_k \frac{\left( f(x^{(i)})_k - f(x^{(j)})_k \right)^2}{f(x^{(i)})_k + f(x^{(j)})_k} + b\right) $$

  • $w_k$ is the weight of the $k$-th component of the encoding
  • $b$ is the bias
  • $\sigma$ is the sigmoid function

16. Neural Style Transfer

  • Neural Style Transfer is a technique that is used to transfer the style of one image to another image.

  • It will use a content image and a style image, and then use a neural network to transfer the style of the style image to the content image.