- Initialize a single-hidden-layer neural network with Xavier initialization. Train it for 400 iterations and plot the weight norms over time for all trainable parameters (W1, b1, W2, b2). Change the initialization to Uniform[-3,3] and plot the same (see the first sketch after this list).
- Take a single-hidden-layer neural network and initialize it with Uniform[-3,3]. Train it for 300 iterations and plot the gradient norm of the bias vectors. Add BatchNorm without removing the bias parameters of the hidden layer, and plot the gradient norm again. Do the bias gradients become zero? (A sketch follows the list.)
- Take a two-hidden-layer neural network with 100 neurons in each hidden layer and batch normalization applied to both layers. Run the experiments on the Iris dataset and always train for 200 iterations.
  Implement a grid search to determine the best combination of initialization (Kaiming fan_in, Kaiming fan_out, Xavier, Uniform[-3,3]) and learning rate for Adam (0.001, 0.003, 0.01, 0.03). Create a 4x4 matrix that holds the final loss: one row per initialization, one column per learning rate. Print or plot the matrix (see the last sketch below).
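
A minimal sketch for the first exercise, assuming PyTorch, scikit-learn's Iris loader, plain SGD, and a hidden width of 32; the dataset, optimizer, and width are illustrative choices the exercise leaves open. In the `nn.Sequential` naming, W1/b1 correspond to `0.weight`/`0.bias` and W2/b2 to `2.weight`/`2.bias`.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

def make_net(init_mode):
    net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
    for m in net.modules():
        if isinstance(m, nn.Linear):
            if init_mode == "xavier":
                # Xavier is defined for weight matrices; biases keep the
                # PyTorch default initialization.
                nn.init.xavier_uniform_(m.weight)
            elif init_mode == "uniform":
                nn.init.uniform_(m.weight, -3.0, 3.0)
                nn.init.uniform_(m.bias, -3.0, 3.0)
    return net

def train_and_track(init_mode, iters=400):
    net = make_net(init_mode)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    norms = {name: [] for name, _ in net.named_parameters()}
    for _ in range(iters):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
        # Record the L2 norm of every trainable parameter after each step.
        for name, p in net.named_parameters():
            norms[name].append(p.detach().norm().item())
    return norms

for mode in ("xavier", "uniform"):
    norms = train_and_track(mode)
    plt.figure()
    for name, values in norms.items():
        plt.plot(values, label=name)
    plt.title(f"Weight norms over training ({mode} init)")
    plt.xlabel("iteration")
    plt.ylabel("L2 norm")
    plt.legend()
plt.show()
```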
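
A minimal sketch for the second exercise under the same assumed Iris/SGD setup. Because BatchNorm subtracts the per-batch mean, the output is invariant to a constant shift applied directly before it, so one would expect the hidden-layer bias gradient to collapse to (numerically) zero once BatchNorm follows the linear layer, while the output-layer bias, not followed by BatchNorm, keeps a nonzero gradient.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

def make_net(with_bn):
    layers = [nn.Linear(4, 32)]
    if with_bn:
        # BatchNorm is inserted after the linear layer; its bias is kept.
        layers.append(nn.BatchNorm1d(32))
    layers += [nn.ReLU(), nn.Linear(32, 3)]
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.uniform_(m.weight, -3.0, 3.0)
            nn.init.uniform_(m.bias, -3.0, 3.0)
    return net

def bias_grad_norms(with_bn, iters=300):
    net = make_net(with_bn)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    linears = [m for m in net if isinstance(m, nn.Linear)]
    history = [[] for _ in linears]
    for _ in range(iters):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        # Record the gradient norm of each linear layer's bias vector.
        for hist, layer in zip(history, linears):
            hist.append(layer.bias.grad.norm().item())
        opt.step()
    return history

for with_bn in (False, True):
    tag = "with BN" if with_bn else "no BN"
    for i, hist in enumerate(bias_grad_norms(with_bn), start=1):
        plt.plot(hist, label=f"b{i} ({tag})")
plt.xlabel("iteration")
plt.ylabel("bias gradient norm")
plt.legend()
plt.show()
```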
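
A minimal sketch of the grid search from the last exercise, again assuming the Iris setup. Kaiming and Xavier are defined for weight matrices, so the biases keep their PyTorch defaults in those branches; only the Uniform[-3,3] branch also initializes the biases, matching the earlier exercises.

```python
import torch
import torch.nn as nn
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

INITS = ["kaiming_fan_in", "kaiming_fan_out", "xavier", "uniform"]
LRS = [0.001, 0.003, 0.01, 0.03]

def make_net(init_mode):
    # Two hidden layers of 100 neurons, BatchNorm applied to both.
    net = nn.Sequential(
        nn.Linear(4, 100), nn.BatchNorm1d(100), nn.ReLU(),
        nn.Linear(100, 100), nn.BatchNorm1d(100), nn.ReLU(),
        nn.Linear(100, 3),
    )
    for m in net.modules():
        if isinstance(m, nn.Linear):
            if init_mode == "kaiming_fan_in":
                nn.init.kaiming_uniform_(m.weight, mode="fan_in")
            elif init_mode == "kaiming_fan_out":
                nn.init.kaiming_uniform_(m.weight, mode="fan_out")
            elif init_mode == "xavier":
                nn.init.xavier_uniform_(m.weight)
            else:  # uniform
                nn.init.uniform_(m.weight, -3.0, 3.0)
                nn.init.uniform_(m.bias, -3.0, 3.0)
    return net

def final_loss(init_mode, lr, iters=200):
    net = make_net(init_mode)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(iters):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
    net.eval()  # use BatchNorm running statistics for the final evaluation
    with torch.no_grad():
        return loss_fn(net(X), y).item()

# 4x4 matrix: one row per initialization, one column per learning rate.
results = torch.tensor([[final_loss(i, lr) for lr in LRS] for i in INITS])
print("rows:", INITS)
print("cols:", LRS)
print(results)
```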