Peeking into Tensorflow-Keras


Author

Amir Fawwaz

Published

November 11, 2023

1 Introduction

Machine learning algorithms can be broadly categorized as unsupervised or supervised by the kind of “experience” they are allowed to have during the learning process. In this case, the “experience” is a dataset; its individual examples are sometimes called data points. Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Supervised learning algorithms also experience a dataset containing features, but each example is additionally associated with a label or target.

1.1 Neural Network

Figure 1: Depiction of ANN

Neural networks or connectionist architectures provide an alternative computational paradigm, and can be seen as a step towards the understanding of intelligence. They depart from traditional von Neumann serial processing and are instead based on distributed processing via connections between simple elements.

The goal of a neural network is to approximate some function by learning the parameters that result in the best approximation. Another way of saying this is that the network minimises the difference between the expected output and the actual one, where a loss function is used to measure that difference.

1.2 Modelling

Models are abstractions of reality to which experiments can be applied to improve our understanding of phenomena in the world. They are at the heart of science in which models can be used to process data to predict future events or to organise data in ways that allow information to be extracted from it. There are two common approaches to constructing models.

The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that can be expressed by accepted relationships and physical laws. These subsystems are typically arranged in the form of simulation blocks and sets of differential equations. The model is consequently obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short.

In general, a neural network consists of layers of neurons where each neuron computes the following activation function:

\[ f(x) = \phi(\mathbf{w}^Tx+b) \]

where \(x\) is the input to the neuron, \(\mathbf{w}\) is a weight vector, \(b\) is a bias term and \(\phi\) is a nonlinearity function. Each neuron receives potentially many inputs and outputs a single number. The nonlinearity is important because it allows layers of neurons to learn non-linear functions. In these layered structures, the output of one layer of units becomes the input to the next layer of units.
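As a concrete illustration, the sketch below evaluates this formula for a single neuron in TensorFlow; the input, weights, bias and the choice of a sigmoid nonlinearity are arbitrary example values.

import tensorflow as tf

# arbitrary example values for one neuron with three inputs
x = tf.constant([1.0, 2.0, 3.0])    # input vector
w = tf.constant([0.5, -0.2, 0.1])   # weight vector
b = tf.constant(0.3)                # bias term

# f(x) = phi(w^T x + b), with phi chosen here as the sigmoid nonlinearity
z = tf.tensordot(w, x, axes=1) + b
output = tf.sigmoid(z)
print(output.numpy())               # a single number between 0 and 1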

We need to find the weights and biases so that the outputs of the network come as close as possible to their true values. Since the loss function measures this closeness, the weights and biases are adjusted by an optimizer.

1.3 Tensor

Mathematically, a tensor is a generalization of vectors and matrices. In the context of TensorFlow, a tensor is treated as a multidimensional array.

(a) tensor visualization 1

(b) tensor visualization 2

Figure 2: Depiction of tensor
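For example, tensors of different ranks can be created directly with tf.constant and tf.zeros; the values below are arbitrary.

import tensorflow as tf

scalar = tf.constant(3.0)                  # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])      # rank-1 tensor
matrix = tf.constant([[1, 2], [3, 4]])     # rank-2 tensor
cube = tf.zeros([2, 3, 4])                 # rank-3 tensor

print(matrix.shape)    # (2, 2)
print(tf.rank(cube))   # 3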

2 Tensorflow

  • TensorFlow 2.x has adopted the Keras API as the standard way of writing neural networks
  • TensorFlow 2.x uses eager execution by default

When writing a TensorFlow program, the main object that is manipulated and passed around is the tf.Tensor. TensorFlow supports eager execution and graph execution. In eager execution, operations are evaluated immediately. In graph execution, a computational graph is constructed for later evaluation.

tf.Tensor computations can be accelerated on GPUs and TPUs!
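A minimal sketch of the two modes: in eager execution the result is available immediately, while wrapping a Python function in tf.function traces it into a graph that TensorFlow executes later; the toy function below is only for illustration.

import tensorflow as tf

# eager execution (the default in TensorFlow 2.x): values are computed immediately
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(a, a))          # result printed right away

# graph execution: tf.function traces the Python function into a graph
@tf.function
def matmul_twice(x):
    return tf.matmul(x, x)

print(matmul_twice(a))          # executed as a compiled graph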

(a) Typical Keras API

(b) Typical Tensorflow architecture

Figure 3: Where do Keras and TensorFlow fit in?

2.1 Available optimizers in Tensorflow

Figure 4: Comparison of different optimizers
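As a sketch, a few of the built-in optimizers can be instantiated as follows and later passed to model.compile; the learning rates shown are arbitrary.

import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)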

2.2 Available loss function in Tensorflow

Probabilistic losses

  • BinaryCrossentropy class
  • CategoricalCrossentropy class
  • SparseCategoricalCrossentropy class
  • Poisson class
  • binary_crossentropy function
  • categorical_crossentropy function
  • sparse_categorical_crossentropy function
  • poisson function
  • KLDivergence class
  • kl_divergence function

Regression losses

  • MeanSquaredError class
  • MeanAbsoluteError class
  • MeanAbsolutePercentageError class
  • MeanSquaredLogarithmicError class
  • CosineSimilarity class
  • mean_squared_error function
  • mean_absolute_error function
  • mean_absolute_percentage_error function
  • mean_squared_logarithmic_error function
  • cosine_similarity function
  • Huber class
  • huber function
  • LogCosh class
  • log_cosh function

more here
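A minimal sketch of using these losses, both as a class instance (the form usually passed to model.compile) and as a plain function; the labels and predictions are made-up values.

import tensorflow as tf

# class form: an object that can be passed to model.compile(loss=...)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

y_true = [1, 2]                                    # integer class labels
y_pred = [[0.05, 0.90, 0.05], [0.10, 0.20, 0.70]]  # predicted probabilities
print(loss_fn(y_true, y_pred).numpy())

# function form: called directly on labels and predictions
print(tf.keras.losses.mean_squared_error([1.0, 2.0], [1.5, 1.5]).numpy())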

Figure 5: Loss function minimization via the learning rate

3 Tensorflow in Action

3.1 Data preparation

Data is usually formatted in 3 dimensions: (60000, 28, 28)

These images are stored in a 3D tensor with 3 axes, whose shape represents 60,000 matrices of 28×28 integers.

Figure 6: Data tensor visualization
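For instance, the MNIST dataset bundled with Keras has exactly this layout; a minimal sketch:

import tensorflow as tf

# 60,000 training images, each a 28x28 grid of integer pixel values
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(x_train.shape)   # (60000, 28, 28)
print(x_train.ndim)    # 3 axes
print(x_train.dtype)   # uint8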

3.2 Neural Network Stacking

Multi Layer Perceptron

Figure 7: Simple MLP
model_2 = tf.keras.models.Sequential(name="simple-MLP")
model_2.add(tf.keras.layers.Dense(2, input_shape = (1,)))
model_2.add(tf.keras.layers.Dense(1, activation='sigmoid'))

MLP with Feature Extraction

model = tf.keras.models.Sequential(name="simple-CNN")
model.add(tf.keras.layers.Conv2D(filters = 32, kernel_size = (5, 5), activation='relu', padding='same', input_shape = (IMG_SIZE,IMG_SIZE,1)))
model.add(tf.keras.layers.MaxPooling2D(pool_size = (2, 2)))

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

Simple Autoencoder

input_layer = tf.keras.Input(shape=(height, width, 1))
# encoding
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(input_layer)
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding='same')(x)
x = tf.keras.layers.Dropout(0.5)(x)

# decoding
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)

output_layer = tf.keras.layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

model = tf.keras.Model(inputs=[input_layer], outputs=[output_layer])
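An autoencoder is trained to reconstruct its own input, so the same images serve as both input and target. A minimal training sketch, assuming x_train holds images of shape (height, width, 1) scaled to [0, 1]; the optimizer, loss, epoch count and batch size are arbitrary choices:

# assumes x_train contains images of shape (height, width, 1) scaled to [0, 1]
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train,       # input and target are the same images
          epochs=10,
          batch_size=128,
          shuffle=True)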

When we are dealing with a network that performs feature extraction, the convolution operation is used.

(a) convolve stride with no padding

(b) convolve stride with padding

Figure 8: Convolution operation
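To see the effect of padding shown in Figure 8, a single convolution can be run directly; a minimal sketch with an arbitrary one-channel input:

import tensorflow as tf

image = tf.random.normal([1, 6, 6, 1])    # a batch of one 6x6 single-channel image
kernel = tf.random.normal([3, 3, 1, 1])   # one 3x3 filter

valid = tf.nn.conv2d(image, kernel, strides=1, padding='VALID')  # no padding
same = tf.nn.conv2d(image, kernel, strides=1, padding='SAME')    # zero padding

print(valid.shape)   # (1, 4, 4, 1) -- output shrinks without padding
print(same.shape)    # (1, 6, 6, 1) -- output keeps the input size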

Model: "simple-CNN"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 180, 180, 32)      832       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 90, 90, 32)       0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 259200)            0         
                                                                 
 dense (Dense)               (None, 128)               33177728  
                                                                 
 activation (Activation)     (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 3)                 387       
                                                                 
=================================================================
Total params: 33,178,947
Trainable params: 33,178,947
Non-trainable params: 0
_________________________________________________________________

3.3 Training

(a) Geometrical view of loss function over weight space

(b) 3D view of loss function over weight space

Figure 9: Depiction of how \(E(w)\) takes its smallest value

Batch size defines the number of samples used to compute a single gradient update while training a neural network (a minimal training sketch follows the list). With respect to the batch size, there are three types of gradient descent:

  • Batch gradient descent – uses all samples from the training set for each update.
  • Stochastic gradient descent – uses only one random sample from the training set for each update.
  • Mini-batch gradient descent – uses a predefined number of samples from the training set for each update.
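A minimal training sketch, assuming a model such as the simple-CNN above and x_train, y_train arrays matching its input shape and number of classes; the optimizer, loss, epoch count and batch size are arbitrary choices, and batch_size is what selects between the three variants (the full training-set size for batch gradient descent, 1 for stochastic, anything in between for mini-batch):

# assumes x_train, y_train match the model's input shape and number of classes
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=32,        # mini-batch gradient descent
                    validation_split=0.1)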

Figure 10: 3D view of loss function over weight space