- Bias is the same for all neurons in one layer
- Bias is learnable too
- TF: weights = kernel, bias = bias
- good for
- good for
- $$y = \max(0, z)$$, meaning z for z greater than 0, otherwise 0
- $$\dfrac{\delta y}{\delta z} = \begin{cases} 0 & \text{for } z \leq 0 \\ 1 & \text{for } z > 0 \end{cases}$$
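A minimal NumPy sketch of the ReLU function and its derivative as written above. The function names `relu` and `relu_grad` are chosen here for illustration and are not taken from any library.

```python
import numpy as np

def relu(z):
    """ReLU activation: z where z > 0, otherwise 0."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 0 for z <= 0, 1 for z > 0."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # activation values
print(relu_grad(z))  # slope at each input
```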
a_i = label, -> change to y_j; Maybe use the uncomplicated version from categorical cross entropy below..
- The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range [0,1] that add up to 1.
- Because its output is a probability distribution, softmax is suitable for probabilistic interpretation in classification tasks.
- $$ p_i = \dfrac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} $$
- for float64 the upper bound is about $$10^{308}$$, so $$e^{a_i}$$ overflows for large inputs. Make it numerically stable by multiplying numerator and denominator with a constant C: $$ p_i = \dfrac{C \cdot e^{a_i}}{C \cdot\sum_{k=1}^N e^{a_k}} $$ which results in $$ p_i = \dfrac{e^{a_i + \log(C)}}{\sum_{k=1}^N e^{a_k + \log(C)}}$$. C can be chosen freely, but normally you use $$ \log(C) = -\max(a)$$
- $$\dfrac{\delta p_i}{\delta a_j} = \begin{cases} p_i(1-p_j) & \text{if } i = j \\ -p_j p_i & \text{if } i \neq j \end{cases}$$
- source: https://deepnotes.io/softmax-crossentropy#cross-entropy-loss
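The stability trick and the derivative above can be sketched in NumPy as follows. The helper names `softmax` and `softmax_jacobian` and the example inputs are assumptions made for illustration. The inputs are deliberately large so that a naive `np.exp(a)` would overflow.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax using log(C) = -max(a)."""
    shifted = a - np.max(a)      # largest exponent becomes 0
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def softmax_jacobian(p):
    """J[i, j] = p_i * (1 - p_j) if i == j, else -p_i * p_j."""
    return np.diag(p) - np.outer(p, p)

a = np.array([1000.0, 1001.0, 1002.0])  # exp(1000) would overflow float64
p = softmax(a)
J = softmax_jacobian(p)
print(p, p.sum())
```

Shifting by the maximum does not change the result: `softmax([1000, 1001, 1002])` equals `softmax([1, 2, 3])`.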
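- t = label/truth/actual; y = predicted
- for one sample: $$E_i = \dfrac{1}{2}(t_i -y_i)^2$$ (the factor $$\dfrac{1}{2}$$ is a convention that cancels in the derivative)
- for many samples: $$E = \dfrac{1}{n} \sum^n_{i = 1} \dfrac{1}{2}(t_i - y_i)^2$$
- derivative for one sample: $$\dfrac{\delta E_i}{\delta y_i} = 2 \cdot \dfrac{1}{2}(t_i-y_i) \cdot (-1) = -(t_i - y_i) = y_i - t_i$$
- good for
- $$RMSE = \sqrt{MSE}$$
- MAE = $$\dfrac{1}{n} \sum^n_{i = 1} |t_i - y_i|$$ (L1 loss)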
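A short NumPy sketch of these losses. The arrays `t` and `y` are made-up example values. The gradient line assumes the half-squared-error convention, for which the derivative with respect to the prediction is `y - t`.

```python
import numpy as np

t = np.array([1.0, 0.0, 2.0, 4.0])   # labels (illustrative values)
y = np.array([0.5, 0.0, 3.0, 2.0])   # predictions (illustrative values)

mse = np.mean((t - y) ** 2)          # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error
mae = np.mean(np.abs(t - y))         # mean absolute error (L1 loss)

half_mse = np.mean(0.5 * (t - y) ** 2)
grad_half_mse = (y - t) / len(t)     # d(half_mse) / dy_i

print(mse, rmse, mae, half_mse)
```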
aka: Logistic loss, Multinomial Logistic Loss
Classification problems can be divided into multi-class classification problems and multi-label classification problems. In multi-class classification one sample is assigned to exactly ONE class. In multi-label classification one sample can be assigned to multiple classes.
The cross-entropy loss is a family of cross-entropy functions used to solve those problems.
The cross entropy is defined as:
- for many samples: $$CE = - \dfrac{1}{N} \sum\limits_{k=1}^N \sum\limits_{i=1}^C y_{k,i} \log(\hat{y}_{k,i}) $$
- for one sample: $$CE = - \sum\limits_{i=1}^C y_i \log(\hat{y_i}) $$
https://www.youtube.com/watch?v=tRsSi_sqXjI
https://gombru.github.io/2018/05/23/cross_entropy_loss/
https://towardsdatascience.com/intuitive-explanation-of-cross-entropy-5d45fc9fd240
This is used, when ...
- there are only two labels and you have one single output neuron: 0 for C_0, 1 for C_1
- we split a multi-label problem into C binary classification problems => every output neuron (of the C neurons) is handled as an individual binary classification problem (is in class C_i / not in class C_i)
In a binary classification problem, there are only 2 classes! Thus we can simplify to:
- for one sample: $$CE = - y \log(\hat{y}) - (1-y) \log(1-\hat{y})$$
- for many samples: $$CE = - \dfrac{1}{N} \sum\limits_{k=1}^N \left[ y_k \log(\hat{y}_k) + (1-y_k) \log(1-\hat{y}_k) \right]$$
If you do not use an activation function like sigmoid -> so it is linear, you use:
Tensorflow: https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/losses/log_loss
If you add an activation function like sigmoid, the
Tensorflow: https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy
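A plain NumPy sketch of binary cross entropy with a sigmoid output, independent of the TensorFlow functions linked above. The names `sigmoid` and `binary_cross_entropy`, the clipping constant `eps`, and the example values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -y*log(y_hat) - (1-y)*log(1-y_hat) over all samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])     # labels
logits = np.array([2.0, -1.0, 0.0])    # raw outputs of the single neuron
loss = binary_cross_entropy(y_true, sigmoid(logits))
print(loss)
```

In the multi-label case the same function can be applied to each of the C output neurons separately.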
aka: Softmax loss, softmax activation plus cross entropy loss
Here a dense output layer with softmax activation function is used and we process the output with a cross entropy loss function.
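As a hedged sketch, the combination of softmax output and cross entropy loss could look like this in NumPy. The helper names, the one-hot labels and the logits are illustrative assumptions.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean over samples of -sum_i y_i * log(y_hat_i)."""
    log_probs = np.log(np.clip(y_hat, eps, 1.0))
    return -np.mean(np.sum(y_onehot * log_probs, axis=1))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])        # outputs of the dense layer
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])      # one class per sample
y_hat = softmax_rows(logits)
loss = categorical_cross_entropy(y_onehot, y_hat)
print(y_hat, loss)
```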
- $$\text{grad} f = \nabla f(x_1, ..., x_n) = (\dfrac{\delta f}{\delta x_1}, ..., \dfrac{\delta f}{\delta x_n} ) $$
- always points in the direction of the steepest ascent
- usage: negative gradient for minimization of loss function
t_i = label, y_i = prediction -> change to y_i and \hat{y_i}
- Training of neural network by using gradient descent method
- Algorithm: for a single neuron! This can be done with matrices for every layer!
Step 1: Forward pass: Calculate output by multiplying input matrix with layers
- net function: weighted sum $$ z = \sum^n_{i=1} x_i \cdot w_i$$
- activation function: e. g. sigmoid $$y_i(z_i)= \dfrac{1}{1+e^{-z_i}} $$
- output function: usually identity function: $$f_{out}(a) = a$$, this will not be mentioned further, because it is always the same as the activation value
- Process: $$z_i = x_i w_i $$, then $$y_i = sigmoid(z_i)$$
Step 2: Calculate loss and the gradient for the weight
- Loss: $$\mathcal{L}_i = \dfrac{1}{2} \cdot (t_i - y_i)^2$$ store all losses for plots and testing (and for loss over all samples, which is the goal to minimize)
- update rule: $$w_{k+1} = w_k + \Delta w_{k+1}$$ with $$\Delta w = -\eta \cdot \dfrac{\delta E}{\delta w}$$ (negative gradient, because the loss is minimized)
- error term: $$\delta_i = \dfrac{\delta E_i}{\delta z_i} = \begin{cases} \dfrac{\delta E_i}{\delta y_i} \cdot \dfrac{\delta y_i}{\delta z_i} & \text{for output layer} \\ \left( \sum_l \delta_l \cdot w_{l,i} \right) \cdot \dfrac{\delta y_i}{\delta z_i} & \text{for hidden layer} \end{cases} $$
- for output layer: derivative of loss fct multiplied with derivative of activation function; for the loss above $$\dfrac{\delta E_i}{\delta y_i} = y_i - t_i$$
- for hidden layer: weighted sum of the error terms $$\delta_l$$ of the next layer multiplied with derivative of activation function
- gradient: $$\nabla_{\theta} = \dfrac{\delta E_i}{\delta w_i} = x_i \cdot \delta_i$$
Step 3: Backpropagate it
- Steps 1 to 3 can be repeated for e epochs or until the average loss is smaller than $$\epsilon$$ or early stopping ($$L_{t-1} - L_{t} < \epsilon$$)
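The three steps can be sketched for a single sigmoid neuron as follows. This is a minimal illustration under stated assumptions: the loss is $$\dfrac{1}{2}(t - y)^2$$, the gradient is subtracted from the weights (descent), the constant second input column plays the role of a bias, and the names `train_neuron`, `eta`, `epochs` and `eps` as well as the toy data are invented for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(x, t, eta=0.5, epochs=2000, eps=1e-4):
    """Train one sigmoid neuron with the loss 1/2 * (t - y)^2."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=x.shape[1])
    losses = []
    for _ in range(epochs):
        # Step 1: forward pass (weighted sum, then activation)
        z = x @ w
        y = sigmoid(z)
        # Step 2: loss and gradient via the chain rule
        losses.append(np.mean(0.5 * (t - y) ** 2))
        delta = (y - t) * y * (1.0 - y)     # dE/dy * dy/dz
        grad = x.T @ delta / len(t)         # dz/dw = x, averaged over samples
        # Step 3: update the weights against the gradient
        w = w - eta * grad
        if losses[-1] < eps:                # stop when the loss is small enough
            break
    return w, losses

# Toy data: the label is 1 for larger inputs; second column is a constant 1
x = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w, losses = train_neuron(x, t)
print(w, losses[0], losses[-1])
```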
t_i = label, y_i = prediction -> change to y and \hat{y}
- update rule in backpropagation for weights w: $$w_{k+1} = w_k + \Delta w_{k+1}$$ with $$\Delta w = -\eta \cdot \dfrac{\delta E}{\delta w}$$
- Calculate gradient for loss function with respect to the weights:
  - need to apply chain rule: $$\dfrac{\delta E_i}{\delta w} = \dfrac{\delta E_i}{\delta y_i} \cdot \dfrac{\delta y_i}{\delta z_i} \cdot \dfrac{\delta z_i}{\delta w}$$
  - $$E_i$$ = loss = binary cross entropy = $$E(t_i, y_i) = -t_i \cdot \log(y_i)-(1-t_i) \cdot \log(1-y_i)$$
  - $$y_i$$ = activation = sigmoid = $$y_i = \dfrac{1}{1+e^{-z_i}}$$
  - $$z_i$$ = weighted sum = $$x_i \cdot w^T$$ with $$\dfrac{\delta z_i}{\delta w} = x_i$$
  - with $$\dfrac{\delta E_i}{\delta y_i} = \dfrac{y_i - t_i}{y_i (1-y_i)}$$; $$\dfrac{\delta y_i}{\delta z_i} = y_i(1-y_i)$$ and $$\dfrac{\delta z_i}{\delta w} = x_i$$
  - $$\dfrac{\delta E_i}{\delta w} = \dfrac{\delta E_i}{\delta y_i} \cdot \dfrac{\delta y_i}{\delta z_i} \cdot \dfrac{\delta z_i}{\delta w} = {\dfrac{y_i - t_i}{y_i (1-y_i)}} \cdot y_i(1-y_i) \cdot x_i = (y_i - t_i) \cdot x_i$$
  - with $$y_i$$ being the predicted value, $$t_i$$ being label/truth and $$x_i$$ being input value
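The result $$(y_i - t_i) \cdot x_i$$ can be checked numerically. The sketch below compares it with a central finite difference of the binary cross entropy for one sample. The weights, the input, the label and the step size `h` are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, x, t):
    """Binary cross entropy of one sample for a sigmoid neuron."""
    y = sigmoid(x @ w)
    return -t * np.log(y) - (1.0 - t) * np.log(1.0 - y)

def analytic_grad(w, x, t):
    """Gradient from the chain rule: (y - t) * x."""
    return (sigmoid(x @ w) - t) * x

def numeric_grad(w, x, t, h=1e-6):
    """Central finite difference, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (bce(w + step, x, t) - bce(w - step, x, t)) / (2 * h)
    return grad

w = np.array([0.3, -0.2, 0.1])
x = np.array([1.0, 2.0, -1.5])
t = 1.0
print(analytic_grad(w, x, t))
print(numeric_grad(w, x, t))
```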
See in the web:
- Neural network implementation - classification: https://peterroelants.github.io/posts/neural-network-implementation-part02/
- Deeper explanation of the math: https://peterroelants.github.io/posts/cross-entropy-logistic/
- The gradient $$\nabla$$ points in the direction of the steepest slope. Gradient descent follows its negative $$-\nabla$$ to find the steepest negative slope (to find minima).
- Backpropagation is an effective algorithm for applying gradient descent to a neural network in supervised learning. Gradient descent is then used to minimize the loss of a loss function.
- With backpropagation one calculates the gradients for the weights in each layer; with the optimizer one optimizes (= adjusts) those weights.
- Common parameters for optimizers are: learning_rate and regularization (which penalizes a weight that dominates by becoming very big, to counter overfitting). Often SGD (Stochastic Gradient Descent) is used.
- Optimizers change the neural network with respect to the result of the loss function in order to minimize the loss (or error)
- Some popular optimizers, based on gradient descent, are listed below:
- https://algorithmia.com/blog/introduction-to-optimizers
parameters: learning rate
- gradient descent is the granddaddy of all gradient descent based optimizers
- it has variants:
  - "full batch" gradient descent: take all samples for one gradient descent step (too costly for large datasets)
  - Stochastic Gradient Descent (SGD): only take one sample per gradient descent step
  - mini-batch SGD: use a batch of n independent, identically distributed samples per gradient descent step
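The three variants differ only in the batch size. A minimal sketch on a noise-free linear regression problem, where the function name `minibatch_sgd`, the data and all hyperparameters are assumptions for illustration:

```python
import numpy as np

def minibatch_sgd(x, t, batch_size, eta=0.1, epochs=300, seed=0):
    """Gradient descent on the loss 1/2 * (t - y)^2 with a linear model y = x @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    n = len(t)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # samples of this batch
            y = x[idx] @ w
            grad = x[idx].T @ (y - t[idx]) / len(idx)
            w -= eta * grad                         # one descent step per batch
    return w

rng = np.random.default_rng(1)
x = np.column_stack([rng.uniform(-1, 1, 64), np.ones(64)])
t = x @ np.array([2.0, -1.0])                       # noise-free targets

w_full = minibatch_sgd(x, t, batch_size=64)  # "full batch": 1 step per epoch
w_mini = minibatch_sgd(x, t, batch_size=8)   # mini-batch: 8 steps per epoch
w_sgd = minibatch_sgd(x, t, batch_size=1)    # SGD: 64 steps per epoch
print(w_full, w_mini, w_sgd)
```

With the same number of epochs, smaller batches perform more weight updates, at the price of noisier individual steps on real (noisy) data.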
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum.
Left: without momentum, Right: with momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the update vector of the past time step to the current update vector.
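A small sketch of the momentum idea on a ravine-shaped toy loss. It assumes the common formulation `v = gamma * v + eta * grad` followed by `w = w - v`. The loss function, `gamma = 0.9`, `eta` and the start point are illustrative choices, not values from these notes. Setting `gamma = 0` gives plain gradient descent for comparison.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w0^2 + 25 * w1^2)."""
    return np.array([1.0, 25.0]) * w

def descend(gamma, eta=0.03, steps=100):
    """Gradient descent with momentum; gamma = 0 disables momentum."""
    w = np.array([10.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + eta * grad(w)   # keep a fraction of the last update
        w = w - v
    return w

w_plain = descend(gamma=0.0)
w_momentum = descend(gamma=0.9)
print(np.linalg.norm(w_plain), np.linalg.norm(w_momentum))
```

In this setup the run with momentum ends closer to the minimum at the origin, because the accumulated velocity speeds up progress along the flat direction of the ravine.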
parameters: fill in
- adapts the learning rate specifically to individual features
- thus, different weights will have different learning rates
- works well for sparse data
- Problem: learning rate tends to get really small over time
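A hedged sketch of an Adagrad-style update that illustrates the points above: each weight gets its own effective learning rate, and that rate can only shrink because squared gradients keep accumulating. The function name `adagrad`, the toy gradient and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.5, steps=50, eps=1e-8):
    """Adagrad: divide the learning rate by the root of the summed squared gradients."""
    w = np.array(w0, dtype=float)
    g_sq_sum = np.zeros_like(w)
    rates = []
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_sum += g ** 2                         # accumulates over all steps
        rate = eta / (np.sqrt(g_sq_sum) + eps)     # one learning rate per weight
        w -= rate * g
        rates.append(rate.copy())
    return w, np.array(rates)

def toy_grad(w):
    """Gradient of the toy loss 0.5 * (w0^2 + 25 * w1^2)."""
    return np.array([1.0, 25.0]) * w

w, rates = adagrad(toy_grad, [10.0, 1.0])
print(rates[0], rates[-1])
```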
parameters: fill in
- special version of Adagrad
- Instead of letting all of the gradients accumulate for momentum, it only accumulates gradients in a fixed window
- RMSprop is similar to Adadelta
parameters: fill in
- uses past gradients to calculate current gradients
- utilizes the concept of momentum by adding fractions of previous gradients to the current one
Left: SGD optimization on loss surface contours Right: SGD optimization on saddle point
still open to read: https://ruder.io/optimizing-gradient-descent/
distributions: Normal, Exponential, Gamma ... (and many more)
- The normal distribution gives the probability density for a point x with given mean (average) $$\mu$$ and standard deviation $$\sigma$$.
- The mean describes the point with the highest probability density, and the standard deviation how strongly the points differ from this mean point (it is the width of the curve)
- for normal distribution:
  - interval of +/- $$\sigma$$: 68.27 % of all measurements
  - interval of +/- $$2\sigma$$: 95.45 % of all measurements
  - interval of +/- $$3\sigma$$: 99.73 % of all measurements
- Distributions are used to describe the probability of where points from experiments will be measured. Therefore a distribution is chosen that is as similar as possible to the occurrence of the measured points.
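The interval percentages of the normal distribution can be reproduced with the error function from the Python standard library. The helper name `prob_within` is chosen here for illustration.

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"+/- {k} sigma: {100 * prob_within(k):.2f} %")
```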
goal: Find the optimal way to fit a distribution to the data
- Look at the data and choose the right distribution (here normal distribution chosen)
- shift the normal distribution from left to right to find the right position for $$\mu$$, let $$\sigma$$ be constant for that
- the maximum of the likelihood is the best position for the given data
- this is the maximum likelihood estimate for the mean
- Change $$\sigma$$ to change the width of the curve. Observe how the likelihood changes. Take the $$\sigma$$ where the likelihood is maximized.
Now the distribution fits best to the observed data, by using the maximum likelihood estimates for mean and standard deviation
Youtube: https://www.youtube.com/watch?v=XepXtl9YKwc
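The "shift, then widen" procedure can be sketched as a simple grid search over the log-likelihood of a normal distribution. The data values and the grids are made up for illustration. The log is used only because it is numerically more convenient and has its maximum at the same position.

```python
import numpy as np

def log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a normal distribution N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

data = np.array([2.1, 2.9, 3.2, 3.8, 4.0, 4.6])   # illustrative measurements

# Shift the curve: try many values for mu while sigma stays constant
mus = np.linspace(0.0, 6.0, 601)
mu_hat = mus[np.argmax([log_likelihood(data, m, 1.0) for m in mus])]

# Change the width: try many values for sigma at the best mu
sigmas = np.linspace(0.1, 3.0, 291)
sigma_hat = sigmas[np.argmax([log_likelihood(data, mu_hat, s) for s in sigmas])]

print(mu_hat, sigma_hat)
```

Up to the grid resolution, the results agree with the arithmetic mean of the data and with the standard deviation computed with the factor 1/n.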
Probability for data is measured as the area under the curve for a given mean and standard deviation
This is used to estimate in which area the new data point will be:
likelihood takes the point on the curve for given data and we can shift the distribution to the right (by changing
Youtube: https://www.youtube.com/watch?v=pYxNSUDSFH4
Summary:
- Normal distribution: $$P(x|\mu,\sigma) = \dfrac{1}{\sqrt{2 \pi \sigma^2}}e^{\dfrac{-(x-\mu)^2}{2\sigma^2}}$$
- with arithmetic mean $$\bar{x} = \mu = \dfrac{1}{n} \sum_{i=1}^n x_i$$
- and standard deviation $$\sigma = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^n (x_i - \mu )^2}$$
- smaller $$\mu$$ shifts mean to the left, greater $$\mu$$ shifts to the right
- smaller $$\sigma$$ makes amplitude higher and width smaller, and reverse
Find maximum for
- Take log of likelihood function to make derivatives easier. For simplification, use the following rules:
  - $$\log(a \cdot b) = \log(a) + \log(b)$$
  - $$\log(a^b) = b \cdot \log(a)$$
See math in Youtube ...
Take partial derivatives:
Now solve the partial derivatives for 0
Youtube: https://www.youtube.com/watch?v=Dn6b9fCIUpM
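As a worked sketch of these steps for the normal distribution with n independent measurements $$x_1, ..., x_n$$ (the symbol $$L$$ for the likelihood is introduced here for illustration):

$$\log L(\mu, \sigma) = \sum_{i=1}^n \log P(x_i|\mu,\sigma) = -\dfrac{n}{2} \log(2\pi) - n \log(\sigma) - \dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

$$\dfrac{\delta \log L}{\delta \mu} = \dfrac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 ~\Rightarrow~ \mu = \dfrac{1}{n} \sum_{i=1}^n x_i$$

$$\dfrac{\delta \log L}{\delta \sigma} = -\dfrac{n}{\sigma} + \dfrac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2 = 0 ~\Rightarrow~ \sigma = \sqrt{\dfrac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$$

Note that the maximum likelihood estimate for $$\sigma$$ uses the factor $$\dfrac{1}{n}$$, while the sample standard deviation uses $$\dfrac{1}{n-1}$$.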
- process of adding additional information to prevent overfitting
- is used in loss function
- Model gets parsimonious
1-Norm (L1 Norm)
2-Norm (L2 Norm)
squared 2-Norm (squared L2 Norm) (see it as complexity of model, Google-Developers)
P-Norm
- Lasso regression = Regression that uses L1 Norm for regularization
- Ridge regression = Regression that uses L2 Norm for regularization
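The norms listed above, computed for a small weight vector in NumPy. The vector `w` and the helper `p_norm` are illustrative. `np.linalg.norm` is used only as a cross-check.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])     # illustrative weight vector

l1 = np.sum(np.abs(w))             # 1-norm: sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))       # 2-norm: Euclidean length
l2_squared = np.sum(w ** 2)        # squared 2-norm: sum of squares

def p_norm(w, p):
    """General p-norm: (sum |w_i|^p)^(1/p)."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

print(l1, l2, l2_squared, p_norm(w, 3))
print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))   # cross-check
```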
Example for L0, L1, L2 with linear regression, with only one parameter
- Loss function without regularization is prone to overfitting
- $\lambda$ makes the loss shift away from the ideal weights for the training data, so overfitting is reduced
- $\lambda$ is independent from the model, thus it prevents overfitting to very little data
- See this as complexity: the L2-Norm adds the complexity of the model to the loss
Example:
Source for "Complexity view": https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization
Deeper reading: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
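A minimal sketch of the effect of $\lambda$ for linear regression with only one parameter and the squared L2 norm as penalty. It assumes the loss `mean((t - w*x)^2) + lam * w^2`, whose minimizer has the closed form used in `ridge_weight`. The function name and the data are invented for this example.

```python
import numpy as np

def ridge_weight(x, t, lam):
    """Minimizer of mean((t - w*x)^2) + lam * w^2 for a single weight w."""
    return np.sum(x * t) / (np.sum(x ** 2) + len(x) * lam)

x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])      # generated by the ideal weight w = 2

lams = (0.0, 0.1, 1.0, 10.0)
weights = [ridge_weight(x, t, lam) for lam in lams]
for lam, w in zip(lams, weights):
    print(f"lambda = {lam}: w = {w:.3f}")
```

Without regularization the weight fits the data exactly. A larger $\lambda$ pulls the weight away from that fit and towards 0.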
- L1 penalizes the sum of the absolute values of the weights.
- L1 has a sparse solution
- L1 has multiple solutions
- L1 has built-in feature selection
- L1 is robust to outliers
- L1 generates models that are simple and interpretable but cannot learn complex patterns
- L2 regularization penalizes the sum of the squared weights.
- L2 has a non-sparse solution
- L2 has one solution
- L2 has no feature selection
- L2 is not robust to outliers
- L2 gives better prediction when the output variable is a function of all input features
- L2 regularization is able to learn complex data patterns
List from: https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2




















