Conv net architecture

Support/Figures/Pasted image 20241002045115.png

Basically, ResNets make it easy to learn the identity map from $a^{[l]}$ to $a^{[l + 2]}$, so the network can effectively skip the hidden layers in between when they do not help.
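A minimal sketch of why the identity is easy to learn, assuming a fully connected residual block with ReLU activations and the skip connection added before the last nonlinearity, i.e. $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$. The helper names (`dense`, `residual_block`) are made up for illustration. If the block's weights and biases shrink to zero, the output is just $a^{[l]}$ (for non-negative $a^{[l]}$, which holds after a ReLU).

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def dense(W, b, v):
    """Compute z = W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a skip connection: a[l+2] = relu(z[l+2] + a[l])."""
    a_l1 = relu(dense(W1, b1, a_l))          # a[l+1]
    z_l2 = dense(W2, b2, a_l1)               # z[l+2]
    return relu([z + a for z, a in zip(z_l2, a_l)])


# With all weights and biases at zero, the block reduces to the identity.
n = 3
a_l = [0.5, 0.0, 2.0]                        # non-negative, as after a ReLU
zero_W = [[0.0] * n for _ in range(n)]
zero_b = [0.0] * n
a_l2 = residual_block(a_l, zero_W, zero_b, zero_W, zero_b)
```

So the "do nothing" solution sits at zero weights, which weight decay already pushes toward; a plain network would have to learn the identity explicitly.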

Support/Figures/Pasted image 20241002060841.png

Support/Figures/Pasted image 20241002060920.png

Support/Figures/Pasted image 20241002061339.png

Support/Figures/Pasted image 20241002133717.png

Essentially, in addition to a one-hot vector over pedestrian, car, motorcycle, or background (remember that localization means only 1 object), the target has 4 more regression-like features: $b_x$ and $b_y$ for the midpoint of the bounding box, $b_w$ for its width, and $b_h$ for its height. This lets us do supervised learning on these new ground-truth vectors.
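A small sketch of how such a target vector could be built. The layout (one-hot first, then the box) and the assumption that the box values are normalized to the image size are mine, and `make_target` is a hypothetical helper, not something from the course.

```python
CLASSES = ["pedestrian", "car", "motorcycle", "background"]


def make_target(label, box=None):
    """Return [one-hot over CLASSES..., bx, by, bw, bh] as a flat list."""
    if label not in CLASSES:
        raise ValueError(f"unknown label: {label}")
    one_hot = [1.0 if c == label else 0.0 for c in CLASSES]
    if label == "background":
        # No object: the box entries are placeholders that the loss should ignore.
        return one_hot + [0.0, 0.0, 0.0, 0.0]
    bx, by, bw, bh = box
    return one_hot + [bx, by, bw, bh]


# A car centered slightly below the middle of the image.
y_car = make_target("car", (0.5, 0.7, 0.4, 0.3))
y_background = make_target("background")
```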

Support/Figures/Pasted image 20241002140025.png

Support/Figures/Pasted image 20241002140433.png

The labels and landmark positions must be consistent across all target vectors (the same landmark always goes in the same slot).

The next idea, for detecting multiple objects, is roughly: train a classifier on closely cropped images, then slide a window over the full image and classify each crop. This can be trained with localization too.
Support/Figures/Pasted image 20241003044915.png

There is a large issue with computational cost: we have to choose the granularity of the sliding windows, and pushing each crop through a conv net might take a while haha :)
![[Support/Figures/Pasted image 20241003050234.png#invert_B]]
The cool idea is: if our conv net works specifically on 14x14 images, we think of 14x14 as our sliding window size! Then if we pass in, for example, a 28x28 image,
the final 8x8 outputs are equivalent to taking 14x14 chunks with a stride of two (the 2x2 max pool layer means a stride of 2) and passing each of them through our conv net!
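To sanity-check the 8x8 claim, here is a rough size calculation. The layer stack (5x5 conv, 2x2 max pool with stride 2, a 5x5 conv standing in for the first fully connected layer, then 1x1 convs) is an assumed architecture for illustration; the notes above only fix the 14x14 input and the 2x2 max pool.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1


def network_output_size(input_size):
    """Spatial size of the final output for the assumed layer stack."""
    s = conv_out(input_size, 5)        # 5x5 conv
    s = conv_out(s, 2, stride=2)       # 2x2 max pool, stride 2
    s = conv_out(s, 5)                 # fully connected layer written as a 5x5 conv
    s = conv_out(s, 1)                 # remaining layers as 1x1 convs
    return s


def sliding_window_count(input_size, window=14, stride=2):
    """Number of window positions per axis for explicit sliding windows."""
    return conv_out(input_size, window, stride)


out_14 = network_output_size(14)       # one prediction for one 14x14 window
out_28 = network_output_size(28)       # grid of predictions for a 28x28 image
windows_28 = sliding_window_count(28)  # explicit 14x14 crops at stride 2
```

Under these assumptions `out_28` and `windows_28` agree, so one convolutional pass gives the same grid of predictions as cropping every window separately, while sharing the computation on overlapping regions.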
Better bounding boxes:

Support/Figures/Pasted image 20241003052316.png

Choose a reasonable grid size. In each grid cell we expect at most 1 object, represented by its bounding box midpoint, and each cell gets its own target vector. Even if the object spans multiple grid cells, as long as the grid is fine enough, the midpoint of each bounding box falls in exactly one grid cell.
Support/Figures/Pasted image 20241003053316.png
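A quick sketch of the midpoint-to-cell assignment described above, assuming the midpoint is given in image coordinates normalized to $[0, 1]$ and the grid is square. `assign_to_cell` is a hypothetical helper; it also returns the midpoint relative to the cell's top-left corner, which is one common way to encode it in the cell's target vector.

```python
def assign_to_cell(x_mid, y_mid, grid_size):
    """Map a normalized box midpoint to (row, col) plus its offset inside that cell."""
    # Clamp so a midpoint exactly on the right/bottom edge stays inside the grid.
    col = min(int(x_mid * grid_size), grid_size - 1)
    row = min(int(y_mid * grid_size), grid_size - 1)
    bx = x_mid * grid_size - col   # horizontal offset within the cell, in cell units
    by = y_mid * grid_size - row   # vertical offset within the cell, in cell units
    return row, col, bx, by


# Midpoint in the horizontal center, near the bottom of a 3x3 grid.
row, col, bx, by = assign_to_cell(0.5, 0.9, 3)
```

Only the cell at `(row, col)` gets the object in its target vector; every other cell the object overlaps is labeled as having no object.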

We might get detections from many neighboring grid cells for the same car, so duplicate detections need to be merged or suppressed.
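One common way to deal with duplicate detections is non-max suppression based on intersection over union (IoU). The sketch below is an assumption about what the note is pointing at, not a transcription of it: boxes are `(x1, y1, x2, y2)` corners, each detection is a `(score, box)` pair, and the 0.5 threshold is an arbitrary illustrative value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),      # car, strongest detection
    (0.8, (12, 12, 52, 52)),      # same car seen from a neighboring cell
    (0.7, (100, 100, 140, 140)),  # a different car elsewhere
]
kept = non_max_suppression(detections)
```

Here the second box overlaps the first heavily and is dropped, while the third box does not overlap at all and survives.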

It might be useful to feed more detailed, low-level features from the leftmost hidden layer, as well as some higher-level spatial/contextual features from the previous layer, into the last hidden layer.
Support/Figures/Pasted image 20241003070732.png