Peeking into Tensorflow-Keras


Author

Amir Fawwaz

Published

November 11, 2023

1 Introduction

Machine learning algorithms can be broadly categorized as unsupervised or supervised by the kind of “experience” they are allowed to have during the learning process. In this case, the “experience” is a dataset; its individual examples are sometimes called data points. Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Supervised learning algorithms also experience a dataset containing features, but each example is additionally associated with a label or target.

1.1 Neural Network

Figure 1: Depiction of ANN

Neural networks or connectionist architectures provide an alternative computational paradigm, and can be seen as a step towards the understanding of intelligence. They depart from traditional von Neumann serial processing and are instead based on distributed processing via connections between simple elements.

The goal of a neural network is to approximate some function by learning the parameters that result in the best approximation. Another way of saying this is that the network minimises the difference between the expected output and the actual one, where a loss function is used to measure that difference.

1.2 Modelling

Models are abstractions of reality to which experiments can be applied to improve our understanding of phenomena in the world. They are at the heart of science in which models can be used to process data to predict future events or to organise data in ways that allow information to be extracted from it. There are two common approaches to constructing models.

The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that can be expressed by accepted relationships and physical laws. These subsystems are typically arranged in the form of simulation blocks and sets of differential equations. The model is consequently obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short.

In general, a neural network consists of layers of neurons where each neuron computes the following activation function:

\[ f(x) = \phi(\mathbf{w}^Tx+b) \]

where \(x\) is the input to the neuron, \(\mathbf{w}\) is a weight vector, \(b\) is a bias term and \(\phi\) is a nonlinearity function. Each neuron receives potentially many inputs and outputs a single number. The nonlinearity is important because it allows layers of neurons to learn non-linear functions. In these layered structures, the output of one layer of units becomes the input to the next layer of units.
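As a concrete illustration, the sketch below evaluates this formula for a single neuron in TensorFlow; the input, weights, bias and the choice of a sigmoid nonlinearity are arbitrary example values.

import tensorflow as tf

# arbitrary example values for one neuron with three inputs
x = tf.constant([1.0, 2.0, 3.0])    # input vector
w = tf.constant([0.5, -0.2, 0.1])   # weight vector
b = tf.constant(0.3)                # bias term

# f(x) = phi(w^T x + b), with phi chosen here as the sigmoid nonlinearity
z = tf.tensordot(w, x, axes=1) + b
output = tf.sigmoid(z)
print(output.numpy())               # a single number between 0 and 1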

We need to find the weights and biases so that the outputs of the network come as close as possible to their true values. Since the loss function measures this closeness, the weights and biases are adjusted by an optimizer.

1.3 Tensor

Mathematically, a tensor is a generalization of vectors and matrices. In the context of TensorFlow, a tensor is treated as a multidimensional array.

(a) tensor visualization 1

(b) tensor visualization 2

Figure 2: Depiction of tensor
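For example, tensors of different ranks can be created directly with tf.constant and tf.zeros; the values below are arbitrary.

import tensorflow as tf

scalar = tf.constant(3.0)                  # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])      # rank-1 tensor
matrix = tf.constant([[1, 2], [3, 4]])     # rank-2 tensor
cube = tf.zeros([2, 3, 4])                 # rank-3 tensor

print(matrix.shape)    # (2, 2)
print(tf.rank(cube))   # 3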

2 Tensorflow

  • TensorFlow 2.x has adopted the Keras API as the standard way of writing neural networks
  • TensorFlow 2.x uses eager execution by default

When writing a TensorFlow program, the main object that is manipulated and passed around is the tf.Tensor. TensorFlow supports eager execution and graph execution. In eager execution, operations are evaluated immediately. In graph execution, a computational graph is constructed for later evaluation.

tf.Tensor computations can be accelerated on GPUs and TPUs!
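A minimal sketch of the two modes: in eager execution the result is available immediately, while wrapping a Python function in tf.function traces it into a graph that TensorFlow executes later; the toy function below is only for illustration.

import tensorflow as tf

# eager execution (the default in TensorFlow 2.x): values are computed immediately
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(a, a))          # result printed right away

# graph execution: tf.function traces the Python function into a graph
@tf.function
def matmul_twice(x):
    return tf.matmul(x, x)

print(matmul_twice(a))          # executed as a compiled graph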

(a) Typical Keras API

(b) Typical Tensorflow architecture

Figure 3: Where do Keras and TensorFlow fit in?

2.1 Available optimizers in Tensorflow

Figure 4: Comparison of different optimizers
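As a sketch, a few of the built-in optimizers can be instantiated as follows and later passed to model.compile; the learning rates shown are arbitrary.

import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)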

2.2 Available loss function in Tensorflow

Probabilistic losses

  • BinaryCrossentropy class
  • CategoricalCrossentropy class
  • SparseCategoricalCrossentropy class
  • Poisson class
  • binary_crossentropy function
  • categorical_crossentropy function
  • sparse_categorical_crossentropy function
  • poisson function
  • KLDivergence class
  • kl_divergence function

Regression losses

  • MeanSquaredError class
  • MeanAbsoluteError class
  • MeanAbsolutePercentageError class
  • MeanSquaredLogarithmicError class
  • CosineSimilarity class
  • mean_squared_error function
  • mean_absolute_error function
  • mean_absolute_percentage_error function
  • mean_squared_logarithmic_error function
  • cosine_similarity function
  • Huber class
  • huber function
  • LogCosh class
  • log_cosh function

more here
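A minimal sketch of using these losses, both as a class instance (the form usually passed to model.compile) and as a plain function; the labels and predictions are made-up values.

import tensorflow as tf

# class form: an object that can be passed to model.compile(loss=...)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

y_true = [1, 2]                                    # integer class labels
y_pred = [[0.05, 0.90, 0.05], [0.10, 0.20, 0.70]]  # predicted probabilities
print(loss_fn(y_true, y_pred).numpy())

# function form: called directly on labels and predictions
print(tf.keras.losses.mean_squared_error([1.0, 2.0], [1.5, 1.5]).numpy())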

Figure 5: Loss function minimization via the learning rate

3 Tensorflow in Action

3.1 Data preparation

Data is usually formatted in 3 dimensions: (60000, 28, 28)

These images are stored in a 3D tensor with 3 axes, whose shape represents 60,000 matrices of 28×28 integers.

Figure 6: Data tensor visualization
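For instance, the MNIST dataset bundled with Keras has exactly this layout; a minimal sketch:

import tensorflow as tf

# 60,000 training images, each a 28x28 grid of integer pixel values
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(x_train.shape)   # (60000, 28, 28)
print(x_train.ndim)    # 3 axes
print(x_train.dtype)   # uint8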

3.2 Neural Network Stacking

Multi Layer Perceptron

Figure 7: Simple MLP
model_2 = tf.keras.models.Sequential(name="simple-MLP")
model_2.add(tf.keras.layers.Dense(2, input_shape = (1,)))
model_2.add(tf.keras.layers.Dense(1, activation='sigmoid'))

MLP with Feature Extraction

model = tf.keras.models.Sequential(name="simple-CNN")
model.add(tf.keras.layers.Conv2D(filters = 32, kernel_size = (5, 5), activation='relu', padding='same', input_shape = (IMG_SIZE,IMG_SIZE,1)))
model.add(tf.keras.layers.MaxPooling2D(pool_size = (2, 2)))

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

Simple Autoencoder

input_layer = tf.keras.Input(shape=(height, width, 1))
# encoding
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(input_layer)
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding='same')(x)
x = tf.keras.layers.Dropout(0.5)(x)

# decoding
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)

output_layer = tf.keras.layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

model = tf.keras.Model(inputs=[input_layer], outputs=[output_layer])
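An autoencoder is trained to reconstruct its own input, so the same images serve as both input and target. A minimal training sketch, assuming x_train holds images of shape (height, width, 1) scaled to [0, 1]; the optimizer, loss, epoch count and batch size are arbitrary choices:

# assumes x_train contains images of shape (height, width, 1) scaled to [0, 1]
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train,       # input and target are the same images
          epochs=10,
          batch_size=128,
          shuffle=True)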

When we are dealing with a network that performs feature extraction, the convolution operation is used.

(a) convolve stride with no padding

(b) convolve stride with padding

Figure 8: Convolution operation
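To see the effect of padding shown in Figure 8, a single convolution can be run directly; a minimal sketch with an arbitrary one-channel input:

import tensorflow as tf

image = tf.random.normal([1, 6, 6, 1])    # a batch of one 6x6 single-channel image
kernel = tf.random.normal([3, 3, 1, 1])   # one 3x3 filter

valid = tf.nn.conv2d(image, kernel, strides=1, padding='VALID')  # no padding
same = tf.nn.conv2d(image, kernel, strides=1, padding='SAME')    # zero padding

print(valid.shape)   # (1, 4, 4, 1) -- output shrinks without padding
print(same.shape)    # (1, 6, 6, 1) -- output keeps the input size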

Model: "simple-CNN"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 180, 180, 32)      832       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 90, 90, 32)       0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 259200)            0         
                                                                 
 dense (Dense)               (None, 128)               33177728  
                                                                 
 activation (Activation)     (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 3)                 387       
                                                                 
=================================================================
Total params: 33,178,947
Trainable params: 33,178,947
Non-trainable params: 0
_________________________________________________________________

3.3 Training

(a) Geometrical view of loss function over weight space

(b) 3D view of loss function over weight space

Figure 9: Depiction of how \(E(w)\) takes its smallest value

Batch size defines the number of samples used to compute a single gradient update while training a neural network (a minimal training sketch follows the list). With respect to the batch size, there are three types of gradient descent:

  • Batch gradient descent – uses all samples from the training set for each update.
  • Stochastic gradient descent – uses only one random sample from the training set for each update.
  • Mini-batch gradient descent – uses a predefined number of samples from the training set for each update.
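A minimal training sketch, assuming a model such as the simple-CNN above and x_train, y_train arrays matching its input shape and number of classes; the optimizer, loss, epoch count and batch size are arbitrary choices, and batch_size is what selects between the three variants (the full training-set size for batch gradient descent, 1 for stochastic, anything in between for mini-batch):

# assumes x_train, y_train match the model's input shape and number of classes
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=32,        # mini-batch gradient descent
                    validation_split=0.1)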

Figure 10: 3D view of loss function over weight space