What I learned so far on neural networks
Neural Network, Recurrent Neural Network, Generative AI
“Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.” - Yann LeCun.
Preface
Deep learning has undoubtedly established itself as the outstanding machine learning technique of recent times. This dominant position has been claimed through a series of overwhelming successes in widely different application areas such as speech processing, autonomous driving, medical AI and security.
Deep learning refers to techniques where deep neural networks are trained with gradient-based methods.
In “supervised” deep learning we can equate the whole problem to \(X \mapsto Y\). In other words, we have \(X\) as the input space and \(Y\) as the output space, and learning is the process of finding a mapping from \(X\) to \(Y\). The learner receives a set of labeled examples as training data and makes predictions for all unseen points.
Data as Vectors
We assume that our data can be read by a computer, and represented adequately in a numerical format. Our data can come from tabular data, genomic sequences, text, image or audio. Each input \(x_{n}\) is a \(D\)-dimensional vector of real numbers, which are called features, attributes, or covariates.
For a supervised learning problem, we have a label \(y_{n}\) associated with each example \(x_{n}\). The label \(y_{n}\) has various other names, including target, response variable, and annotation. A dataset is written as a set of example-label pairs \(\{(x_1,y_1),\ldots,(x_N,y_N)\}\).
Since we have vector representations of data, we can manipulate data to find potentially better representations of it. From the original feature vector, we can move to lower-dimensional approximations (e.g. principal components). Alternatively, we can move to a higher-dimensional representation: an explicit feature map \(\phi(\cdot)\) allows us to represent inputs \(x_{n}\) using a higher-dimensional representation \(\phi(x_{n})\). The main motivation for higher-dimensional representations is that we can construct new features as non-linear combinations of the original features, which in turn may make the learning problem easier.
Before venturing further into the wild, let’s review the concept of a vector. You can think of a vector in simple terms as a list of numbers where the position of each item in this structure matters.
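To make the idea of a feature map concrete, here is a minimal sketch. The function `phi` and the choice of degree-2 terms are assumptions made for illustration; the notes do not prescribe a particular map.

```python
import numpy as np

def phi(x):
    """Hypothetical degree-2 feature map for a 2-dimensional input x = (x1, x2)."""
    x1, x2 = x
    # Keep the original features and add non-linear combinations of them.
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

x_n = np.array([2.0, 3.0])
phi_x_n = phi(x_n)  # 2 original features become 5 features
```

A model that is linear in `phi_x_n` can already express products and squares of the original features, which is the sense in which the higher-dimensional representation may make learning easier.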
Consider the following column vector, where \(\nu_{1}\) and \(\nu_{2}\) each represent “some” feature.
\[\vec{\nu} = \begin{bmatrix} \nu_{1} \\ \nu_{2} \\ \end{bmatrix} = \begin{bmatrix} 1.64 \\ 64 \\ \end{bmatrix} \]
Vector addition is straightforward: we add the corresponding elements.
\[ \vec{z} = \vec{\nu} + \vec{w} \]
\[\vec{z} = \begin{bmatrix} 2 \\ 3 \\ \end{bmatrix} + \begin{bmatrix} 2 \\ 1 \\ \end{bmatrix} \]
\[\vec{z} = \begin{bmatrix} 2 + 2 \\ 3 + 1 \\ \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \\ \end{bmatrix} \]
There are two ways to multiply vectors. The first is the dot product. The other is the cross product, which won’t be covered in this note. The main difference between the two is the result: the dot product gives a scalar, while the cross product gives another vector.
So, for example, consider the vectors:
\[\vec{\nu} = \begin{bmatrix} 2 \\ 3 \\ \end{bmatrix} , \vec{w} = \begin{bmatrix} 2 \\ 1 \\ \end{bmatrix} \]
We can then calculate the dot product between \(\vec{\nu}\) and \(\vec{w}\) as such:
\[\vec{\nu} \cdot \vec{w}= \begin{bmatrix} 2 \\ 3 \\ \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 1 \\ \end{bmatrix} = 2 \cdot 2 + 3 \cdot 1 = 7 \]
The general definition, for any two vectors \(\vec{\nu}\) and \(\vec{w}\) that each have \(n\) elements (that is, vectors of size \(n\)), is:
\[\vec{\nu} \cdot \vec{w} = \sum_{i=1}^{n} {\nu_{i} \cdot w_{i}} \]
So,
\[ \sum_{i=1}^{n} {\nu_{i} \cdot w_{i}} = \nu_{1}\cdot w_{1} + \nu_{2}\cdot w_{2} + \ldots + \nu_{n}\cdot w_{n} \]
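The addition and dot product examples above can be checked with NumPy. This is only a small sketch using the same numbers as the worked examples; the variable names are chosen here for illustration.

```python
import numpy as np

v = np.array([2, 3])
w = np.array([2, 1])

z = v + w  # element-wise addition
dot_builtin = int(np.dot(v, w))  # NumPy's dot product
# The same dot product written out as the sum over element-wise products.
dot_manual = sum(int(v_i * w_i) for v_i, w_i in zip(v, w))
```

Here `z` holds the vector from the addition example, and both `dot_builtin` and `dot_manual` hold the scalar from the dot product example.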
Models as Functions
Once we have data in an appropriate vector representation, we can get to the business of constructing a predictive function (known as a predictor).
By a computable function (or computable predictor) we mean a function whose values can be calculated in some kind of automatic or effective way.1
Often in computability we shall encounter functions, or expressions involving functions, that are not always defined. In such situations the following notation is very useful. Suppose that \(\alpha(x)\) and \(\beta(x)\) are expressions involving the variables \(x=(x_1,\ldots , x_n)\). Then we write
\[\alpha(x) \simeq \beta(x)\] to mean that for any \(x\), the expressions \(\alpha(x)\) and \(\beta(x)\) are either both defined, or both undefined, and if defined they are equal. Thus, for example, if \(f, g\) are functions, writing \(f(x) \simeq g(x)\) is another way of saying that \(f = g\).
We also use a phrase such as ‘Let \(f(x_1,\ldots , x_n)\) be a function …’ as a means of indicating that \(f\) is an \(n\)-ary function.
A predictor is a function that, when given a particular input example (in our case, a vector of features), produces an output. Let’s consider the non-linear case. Instead of considering a predictor as a single function, we could consider predictors to be probabilistic models, i.e., models describing the distribution of possible functions.
Learning from Data
As mentioned earlier, neural networks are trained with gradient-based learning, and this “learning” can be modelled with functions.
The learning machine computes a function \(Y^p = F(Z^p, W)\) where \(Z^p\) is the \(p\)-th input pattern and \(W\) represents the collection of adjustable parameters in the system. In a pattern recognition setting, the output \(Y^p\) may be interpreted as the recognized class label of pattern \(Z^p\), or as scores or probabilities associated with each class.
A loss function \(E^p = D(D^p,F(Z^p, W))\) measures the discrepancy between \(D^p\), the “correct” or desired output for pattern \(Z^p\), and the output produced by the system. The average loss function \(E_{train}(W)\) is the average of the errors \(E^p\) over a set of labeled examples called the training set \(\{(Z^{1}, D^{1}), \ldots, (Z^{P}, D^{P})\}\).
In the simplest setting, the learning problem consists in finding the value of \(W\) that minimizes \(E_{train}(W)\). This minimization is usually called empirical risk minimization.
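As an illustration of minimizing \(E_{train}(W)\) with a gradient-based method, here is a minimal sketch. The linear form of `F`, the squared-error discrepancy, the synthetic data and the learning rate are all assumptions made for this example; the notes do not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set {(Z^1, D^1), ..., (Z^P, D^P)}: P patterns with 3 features each.
P = 100
Z = rng.normal(size=(P, 3))
W_true = np.array([1.5, -2.0, 0.5])  # hypothetical parameters used to generate the data
D = Z @ W_true                       # desired outputs

def F(Z, W):
    """The learning machine; a linear map is assumed here for simplicity."""
    return Z @ W

def E_train(W):
    """Average of the per-pattern squared errors E^p."""
    return np.mean((D - F(Z, W)) ** 2)

W = np.zeros(3)
initial_loss = E_train(W)
learning_rate = 0.1
for _ in range(200):
    # Gradient of the mean squared error with respect to W.
    grad = -2.0 * Z.T @ (D - F(Z, W)) / P
    W = W - learning_rate * grad  # gradient descent step
final_loss = E_train(W)
```

After the loop, `final_loss` is far below `initial_loss` and `W` is close to `W_true`, because the toy data is noise-free and the model can represent it exactly.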
1 Neural Network
A single neuron is the fundamental building block of a neural network (NN). Consider a single neuron as in Figure 1, where we have one input (represented as \(p\)), one neuron and one output (represented as \(a\)).
The neuron is represented as a summation followed by a transfer function (a type of non-linear activation function). \(w\) is the weight and \(b\) is the bias (another weight, usually included so that \(wp + b\) can be non-zero even when \(p = 0\)). So, the output of the single neuron is calculated as
\[ a = f(wp +b)\]
And if for example, \(w=3, p =2,\) and \(b =-1.5\), then
\[ a = f(3(2)-1.5) = f(4.5)\]
From a vector perspective, we can write the above as follows:
input vector, \(\vec{i} = \begin{bmatrix} p \\ 1 \\ \end{bmatrix}\)
parameter vector, \(\vec{r} = \begin{bmatrix} w \\ b \\ \end{bmatrix}\)
So, to compute:
\[wp + b = \begin{bmatrix} w & b \end{bmatrix} \begin{bmatrix} p \\ 1 \\ \end{bmatrix} \qquad(1)\]
From Equation 1, we can deduce that:
\[wp + b = \vec{r}^T \cdot \vec{i} \]
So, the dot product between the parameter vector and the input vector is taken, a function is applied to it, and the result is given as output. Geometrically, the dot product of two vectors is the length of the projection of one vector onto the other, scaled by the length of the other. We can then say that a neuron maps the similarity (measured by this projection) between the parameter vector and the input vector to the output through the transfer function: the output reflects how much the parameter vector agrees with the input vector.
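The vector form of the single neuron can be sketched as follows, using the same numbers as the worked example. The logistic sigmoid is an assumed choice for the transfer function \(f\); the notes leave \(f\) unspecified.

```python
import numpy as np

def f(n):
    """Transfer function; the logistic sigmoid is an assumed choice."""
    return 1.0 / (1.0 + np.exp(-n))

w, p, b = 3.0, 2.0, -1.5

r = np.array([w, b])    # parameter vector
i = np.array([p, 1.0])  # input vector, extended with a constant 1 for the bias

n = r @ i  # dot product, equal to w*p + b
a = f(n)   # neuron output
```

The pre-activation `n` matches the value \(4.5\) from the worked example, and `a` is that value passed through the assumed sigmoid.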
Output \(a\) from Figure 1 is called a representation of the input \(p\).
Now, consider the following:
\[ f(X) = \sum_{m=1}^{M}g_{m}(\omega_{m}^TX) \]
where we have an input vector \(X\) with \(p\) components, and a target \(Y\). Let \(\omega_{m}\), \(m=1,2,...,M\), be unit \(p\)-vectors of unknown parameters. The model above approximates a general function of \(p\) variables by a sum of nonlinear functions of linear combinations.2
The scalar variable \(V_{m}=\omega_{m}^TX\) is the projection of \(X\) onto the unit vector \(\omega_{m}\), and we want \(\omega_{m}\) so that the model fits well.
To visualize the statement above, consider Figure 2 and Figure 3 below.
Another way
Thus, we can say that the central idea of NN is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features.
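The model \(f(X) = \sum_{m=1}^{M} g_{m}(\omega_{m}^TX)\) can be sketched in a few lines. The sizes, the random unit vectors and the use of `tanh` for every \(g_{m}\) are assumptions for illustration only; in practice both \(\omega_{m}\) and \(g_{m}\) would be fitted to data.

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 4, 3  # input dimension and number of terms (arbitrary choices)

# Unit p-vectors omega_m, stored as the rows of a matrix.
omega = rng.normal(size=(M, p))
omega /= np.linalg.norm(omega, axis=1, keepdims=True)

def f(X):
    V = omega @ X              # derived features V_m = omega_m^T X
    return np.sum(np.tanh(V))  # g_m = tanh for every m (an assumption)

X = rng.normal(size=p)
y_hat = f(X)
```

Each entry of `V` is a linear combination of the inputs, and the nonlinearity is applied only after that projection, which is the same pattern a layer of neurons follows.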
So, an NN is just a very complex geometric transformation of the input data in a high-dimensional space. Uncrumpling the “input data” is what machine learning is about: finding neat representations for complex, highly folded data manifolds in high-dimensional spaces.5
For a classification problem, a “decision boundary” (a hyperplane) is built in the feature space, where the features are “separable”.
2 Recurrent neural network (RNN)
RNNs have already been applied to a wide variety of problems involving time sequences of events and ordered data such as characters in words.
A recurrent neural network (RNN) is any network whose neurons send feedback signals to each other.
The idea behind an RNN is straightforward: to predict the next word in a sentence (or text sequence), we need to know the previous words. If we pass the sentence “Kuala Lumpur is the capital …” to the RNN, then the result will be “… capital of Malaysia”.
Figure 5 shows a generic neural architecture that models sequence information called a recurrent neural network (RNN). Each of the blocks in the architecture is called a cell. Each cell is assigned to one individual element in the sequence. The job of cell \(i\) is to produce an output vector \(y_{i}\) using as input an input vector \(x_{i}\) and a state vector \(S_{i-1}\), which captures the information that has “flowed” through the sequence until this cell. The state and output vectors are computed using the \(R\) and \(O\) functions, respectively. That is, in general:
\[S_{i} = R(S_{i-1},x_{i}) \qquad(2)\]
\[ y_{i} = O(S_{i}) \qquad(3)\]
Note that even though Equation 3 states that the output vector \(y_{i}\) depends only on the current state vector \(S_{i}\), this state vector encodes both the current input \((x_{i})\) and the information that has flowed through the network thus far \((S_{i-1})\).
The simplest RNN uses the following implementations for \(R\) and \(O\):
\[S_{i} = R(S_{i-1},x_{i}) = f(W^s \cdot S_{i-1} + W^x \cdot x_{i} + b) \qquad(4)\]
\[ y_{i} = O(S_{i}) = S_{i} \qquad(5)\]
where \(W^x\), \(W^s\) and \(b\) are the parameters that are shared between all cells in the network. Thus, the \(R\) function is very similar to one layer of a feed-forward neural network, but it has two sets of weights: one that operates over the input vector \(x_{i}\) (\(W^x\)), and one that operates over the state vector \(S_{i-1}\) (\(W^s\)). \(b\) contains the bias weights, and \(f\) is a nonlinear function such as a sigmoid or hyperbolic tangent.
From Equation 4 and Equation 5, we can see that the network captures both input information, which flows in through the vertical arrows in Figure 5, and information about the sequence, which travels through the horizontal arrows in the figure. Thus, an RNN is just another neural network architecture. We can train it, which in this case means learning the cell parameters \(W^x\), \(W^s\) and \(b\), using essentially the same gradient-based method!
The mathematical description of the RNN above can be summarized in the code below (the parameter values and sizes are toy choices for illustration):
```{python}
import numpy as np
W_x, W_s, b = np.ones((3, 2)) * 0.1, np.eye(3) * 0.5, np.zeros(3)  # toy parameters
input_sequence = np.ones((4, 2))  # 4 time steps, 2 features each
state_t = np.zeros(3)  # initial state S_0
for input_t in input_sequence:
    # Equation 4: W_s acts on the previous state, W_x on the current input
    output_t = np.tanh(W_s @ state_t + W_x @ input_t + b)
    state_t = output_t  # Equation 5: the output is the new state
```
It is worth mentioning that RNNs can be composed into more complex architectures in multiple ways.
RNNs can be used in three different ways:
- as acceptors, where a classification layer is added on top of the last network cell;
- as transducers, which add a classification layer for each cell; and
- as encoder-decoders, in which case two RNNs are combined: an encoder that codes an input sequence into a single vector, and a decoder that generates one element at a time from an output sequence.
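The first two usages in the list above can be sketched on top of the simple RNN cell. Everything here is an illustrative assumption: the sizes, the random parameters, and the softmax classification layer `W_out` are not taken from the notes. The encoder-decoder case is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, input_dim, n_classes, seq_len = 4, 3, 2, 6  # arbitrary sizes

W_x = rng.normal(scale=0.5, size=(state_dim, input_dim))
W_s = rng.normal(scale=0.5, size=(state_dim, state_dim))
b = np.zeros(state_dim)
W_out = rng.normal(scale=0.5, size=(n_classes, state_dim))  # hypothetical classification layer

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

def run_rnn(inputs):
    """Return the state vector produced by every cell."""
    state, states = np.zeros(state_dim), []
    for x in inputs:
        state = np.tanh(W_s @ state + W_x @ x + b)
        states.append(state)
    return states

states = run_rnn(rng.normal(size=(seq_len, input_dim)))
# Acceptor: a single prediction for the whole sequence, from the last cell.
acceptor_output = softmax(W_out @ states[-1])
# Transducer: one prediction per element of the sequence.
transducer_outputs = [softmax(W_out @ s) for s in states]
```

The only difference between the two is which states are fed to the classification layer: the last one only, or all of them.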
2.1 Problem with Simple RNNs
The simple RNN cell introduced in Figure 5 suffers from the vanishing gradient problem: gradient values become too small to impact the parameter updates in a meaningful way. In other words, vanishing gradients cause learning to stop prematurely. This can be seen from Equation 4, where the computation of the state vector chains the nonlinear function \(f\) once per cell, so the gradient with respect to an early state is a product of one factor per cell, each involving the derivative of \(f\). Since the derivatives of typical activation functions are small (at most 0.25 for the sigmoid and at most 1 for the hyperbolic tangent), multiplying several such values quickly yields very small values, which in turn cause vanishingly small parameter updates.
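The effect can be seen numerically in a scalar version of Equation 4. This is a toy sketch: the scalar parameters, the constant zero input and the sigmoid activation are assumptions chosen to keep the arithmetic transparent.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

w_s, w_x, b = 1.0, 1.0, 0.0  # toy scalar parameters (assumed)
state, grad = 0.0, 1.0       # grad tracks dS_i / dS_0
grads = []
for x in np.zeros(50):       # 50 time steps of a constant input
    state = sigmoid(w_s * state + w_x * x + b)
    # Chain rule: dS_i/dS_{i-1} = f'(.) * w_s, and f' = S_i * (1 - S_i) for the sigmoid.
    grad *= state * (1.0 - state) * w_s
    grads.append(grad)
```

Each step multiplies `grad` by a factor of at most 0.25, so the entries of `grads` shrink geometrically and the influence of the initial state on late states becomes negligible.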
2.2 Long short-term memory networks (LSTM)
Long short-term memory networks, or LSTMs, are RNNs that address the vanishing gradient problem. LSTMs replace the multiplicative architecture used by the vanilla RNNs with an additive architecture – that is, transitions between cells are handled (mostly) with additions and subtractions rather than multiplications.
The intuition behind LSTMs is simple: imagine that cells in an RNN are connected by a “conveyor belt” that carries information throughout the whole sequence (see Figure 6). Each cell subtracts information that is no longer needed from the conveyor belt and adds new information from the current input. Then, the state vector \(S_{i}\) for cell \(i\) is computed using the information available at this moment on the conveyor belt.
The LSTMs use three types of gates:
- Forget gate – controls how much of the content on the “conveyor belt” to preserve in the current cell,
- Input gate – decides how much of the input local to the current cell to add to the “conveyor belt”, and
- Output gate – controls how much of the “conveyor belt” vector to include in the hidden state vector for each cell.
These gates are simple neural mechanisms that control access to multidimensional vectors.
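One step of an LSTM cell with the three gates can be sketched as below. The notes do not give the equations, so this follows the commonly used LSTM formulation as an assumption; the sizes, the random weights and the names `W`, `b` and `lstm_step` are illustrative only.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # arbitrary sizes

# One weight matrix and bias per gate (f, i, o) plus one for the candidate content (c).
# Each acts on the concatenation [previous hidden state, current input].
W = {k: rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim)) for k in "fioc"}
b = {k: np.zeros(hidden_dim) for k in "fioc"}

def lstm_step(x, h_prev, c_prev):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ hx + b["f"])        # forget gate: how much of the belt to keep
    i = sigmoid(W["i"] @ hx + b["i"])        # input gate: how much new content to add
    o = sigmoid(W["o"] @ hx + b["o"])        # output gate: how much of the belt to expose
    c_tilde = np.tanh(W["c"] @ hx + b["c"])  # candidate content from the current input
    c = f * c_prev + i * c_tilde             # additive update of the "conveyor belt"
    h = o * np.tanh(c)                       # hidden state for this cell
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c)
```

The line that updates `c` is the additive step: old content is scaled by the forget gate and new content is added, rather than the whole state being pushed through a nonlinearity at every cell.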
2.3 Drawbacks of Recurrent Neural Networks
Despite the fact that RNN architectures such as LSTMs are designed to capture arbitrarily long sequences, in practice, they become “fuzzy far away” – that is, they tend to forget about word order for long-range contexts such as beyond 50 words.