Conv net architecture

![[Support/Figures/Pasted image 20241002045115.png]]
![[Support/Figures/Pasted image 20241002045350.png]]
![[Support/Figures/Pasted image 20241002045516.png]]
![[Support/Figures/Pasted image 20241002060532.png]]
![[Support/Figures/Pasted image 20241002060602.png]]
Basically, ResNets make it easy to learn the identity map from a[l] to a[l+2]: the skip connection lets the network bypass a layer whose weights turn out not to help.
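A minimal sketch of such a residual block (Keras here; assumes the shortcut and the block output have the same shape so they can be added directly):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                         # a[l], carried forward unchanged
    out = layers.Conv2D(filters, 3, padding="same")(x)
    out = layers.BatchNormalization()(out)
    out = layers.Activation("relu")(out)                 # a[l+1]
    out = layers.Conv2D(filters, 3, padding="same")(out)
    out = layers.BatchNormalization()(out)               # z[l+2]
    out = layers.Add()([out, shortcut])                  # z[l+2] + a[l]
    return layers.Activation("relu")(out)                # a[l+2] = g(z[l+2] + a[l])
```

If the two conv layers end up learning weights near zero, the block collapses to roughly the identity on a[l], which is why stacking extra residual blocks rarely hurts training.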
![[Support/Figures/Pasted image 20241002060735.png]]

![[Support/Figures/Pasted image 20241002060841.png]]

![[Support/Figures/Pasted image 20241002060920.png]]

![[Support/Figures/Pasted image 20241002061339.png]]
![[Support/Figures/Pasted image 20241002062002.png]]
![[Support/Figures/Pasted image 20241002062041.png]]
![[Support/Figures/Pasted image 20241002062118.png]]
![[Support/Figures/Pasted image 20241002062232.png]]
![[Support/Figures/Pasted image 20241002062357.png]]
![[Support/Figures/Pasted image 20241002062419.png]]
![[Support/Figures/Pasted image 20241002062537.png]]
![[Support/Figures/Pasted image 20241002062701.png]]
![[Support/Figures/Pasted image 20241002062758.png]]
![[Support/Figures/Pasted image 20241002062923.png]]
![[Support/Figures/Pasted image 20241002063012.png]]
![[Support/Figures/Pasted image 20241002063314.png]]
![[Support/Figures/Pasted image 20241002063333.png]]
![[Support/Figures/Pasted image 20241002063414.png]]
![[Support/Figures/Pasted image 20241002063445.png]]
![[Support/Figures/Pasted image 20241002065927.png]]
![[Support/Figures/Pasted image 20241002070116.png]]
![[Support/Figures/Pasted image 20241002071455.png]]
![[Support/Figures/Pasted image 20241002072813.png]]
![[Support/Figures/Pasted image 20241002080742.png]]
![[Support/Figures/Pasted image 20241002080950.png]]
![[Support/Figures/Pasted image 20241002081106.png]]
![[Support/Figures/Pasted image 20241002081528.png]]
![[Support/Figures/Pasted image 20241002082214.png]]
![[Support/Figures/Pasted image 20241002082252.png]]

![[Support/Figures/Pasted image 20241002133717.png]]
![[Support/Figures/Pasted image 20241002133957.png]]
Essentially, in addition to a one-hot vector over the classes (pedestrian, car, motorcycle, or background; remember, localization means there is only one object), we add four more regression-like outputs: bx, by for the midpoint of the bounding box, bw for its width, and bh for its height. This lets us do supervised learning against these extended ground-truth vectors.
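A sketch of how such a ground-truth vector could be assembled (layout [pc, bx, by, bh, bw, c1, c2, c3]; the helper name and the background-as-None convention are my assumptions):

```python
import numpy as np

def make_localization_target(class_id, bx, by, bh, bw, num_classes=3):
    """Target y = [pc, bx, by, bh, bw, c1..cK]; class_id=None means background."""
    y = np.zeros(5 + num_classes)
    if class_id is None:
        return y                   # pc = 0; bbox and class entries are don't-cares
    y[0] = 1.0                     # pc: an object is present
    y[1:5] = [bx, by, bh, bw]      # midpoint and size, relative to the image
    y[5 + class_id] = 1.0          # one-hot class (pedestrian/car/motorcycle)
    return y
```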

![[Support/Figures/Pasted image 20241002140025.png]]

![[Support/Figures/Pasted image 20241002140433.png]]
![[Support/Figures/Pasted image 20241003044540.png]]
The labels and landmark positions must be consistent across all target vectors: landmark k has to refer to the same point (e.g. the left eye corner) in every example.
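A sketch of a matching landmark target vector (the helper and the K-landmark layout are my assumptions):

```python
import numpy as np

def make_landmark_target(is_face, landmarks):
    """y = [p_face, l1x, l1y, ..., lKx, lKy]; landmark k must mean the
    same point (e.g. 'left eye corner') in every training example."""
    K = len(landmarks)
    y = np.zeros(1 + 2 * K)
    if is_face:
        y[0] = 1.0
        y[1:] = np.asarray(landmarks, dtype=float).ravel()  # (x, y) pairs, fixed order
    return y
```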

The next idea, for detecting multiple objects, is roughly: train a classifier on closely cropped images, then at test time slide a window over the full image and classify each crop (the classifier can be trained with localization targets too).
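A sketch of the naive version, assuming some `classifier` callable that scores a single crop; this is exactly the expensive approach discussed below:

```python
def naive_sliding_window(image, classifier, win=14, stride=2):
    """Run a crop classifier at every window position: one full forward
    pass per crop, so cost grows with the number of positions."""
    H, W = image.shape[:2]
    detections = []
    for top in range(0, H - win + 1, stride):
        for left in range(0, W - win + 1, stride):
            crop = image[top:top + win, left:left + win]
            score = classifier(crop)
            detections.append((top, left, score))
    return detections
```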
![[Support/Figures/Pasted image 20241003044915.png]]
![[Support/Figures/Pasted image 20241003045035.png]]
![[Support/Figures/Pasted image 20241003045144.png]]

There is a large issue with computational cost: choosing a fine granularity for the sliding window and pushing each crop through a conv net independently is very expensive, since the crops overlap heavily and all the shared work gets recomputed for every window.
![[Support/Figures/Pasted image 20241003050234.png#invert_B]]
The cool idea: if our conv net works on 14x14 images, we think of 14x14 as our sliding window size. Then if we pass, for example, a 28x28 image,
![[Support/Figures/Pasted image 20241003050742.png]]
the resulting 8x8 grid of outputs is equivalent to taking 14x14 chunks with a stride of 2 (the 2x2 max-pool layer is what makes the stride 2) and passing each one through our conv net, except all the shared computation is done only once.
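A sketch of this trick, with the fully connected layers rewritten as convolutions so the net accepts any input size (layer sizes chosen to match the 14x14 → 1x1 / 28x28 → 8x8 example above; the exact filter counts are assumptions):

```python
from tensorflow.keras import layers, Model

def fully_conv_net(num_classes=4):
    """Conv net trained on 14x14 crops, FC layers rewritten as convolutions.
    On a 14x14 input the output is 1x1xC; on a 28x28 input it is 8x8xC,
    i.e. one prediction per 14x14 window at stride 2, with shared compute."""
    inp = layers.Input(shape=(None, None, 3))             # any spatial size
    x = layers.Conv2D(16, 5, activation="relu")(inp)      # 14x14 -> 10x10
    x = layers.MaxPooling2D(2)(x)                         # 10x10 -> 5x5 (stride-2 source)
    x = layers.Conv2D(400, 5, activation="relu")(x)       # "FC" layer as a 5x5 conv -> 1x1
    x = layers.Conv2D(400, 1, activation="relu")(x)       # second "FC" layer as a 1x1 conv
    out = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return Model(inp, out)
```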
Better bounding boxes:

![[Support/Figures/Pasted image 20241003052316.png]]
![[Support/Figures/Pasted image 20241003052346.png]]
Choose a reasonable grid size. In each grid cell we expect at most one object, represented by its bounding-box midpoint, so each cell gets its own target vector. Even if an object spans multiple grid cells, as long as the grid is fine enough, the midpoint of each bounding box lands in exactly one cell.
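A sketch of this midpoint-based assignment (image-relative coordinates in [0,1] and an SxS grid are my assumptions):

```python
def assign_to_grid(boxes, S=3):
    """Each box (class_id, x, y, w, h), coords in [0,1], goes to the one
    grid cell containing its midpoint."""
    grid = {}
    for class_id, x, y, w, h in boxes:
        col = min(int(x * S), S - 1)        # cell column containing the midpoint
        row = min(int(y * S), S - 1)        # cell row containing the midpoint
        bx, by = x * S - col, y * S - row   # midpoint as an offset inside the cell
        bw, bh = w * S, h * S               # size in cell units (may exceed 1)
        grid[(row, col)] = (class_id, bx, by, bh, bw)
    return grid
```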
![[Support/Figures/Pasted image 20241003053316.png]]
![[Support/Figures/Pasted image 20241003053454.png]]
We might find that many grid cells all claim to contain the same car.
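The usual fix is non-max suppression: keep the most confident box and drop the ones that overlap it heavily, as measured by intersection over union (IoU). A minimal sketch, assuming boxes given as (x1, y1, x2, y2) corners:

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-score box, discard boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```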
![[Support/Figures/Pasted image 20241003053653.png]]
![[Support/Figures/Pasted image 20241003054026.png]]
![[Support/Figures/Pasted image 20241003054147.png]]
![[Support/Figures/Pasted image 20241003054220.png]]
![[Support/Figures/Pasted image 20241003054633.png]]
![[Support/Figures/Pasted image 20241003054722.png]]
![[Support/Figures/Pasted image 20241003054804.png]]
![[Support/Figures/Pasted image 20241003055030.png]]
![[Support/Figures/Pasted image 20241003055116.png]]
![[Support/Figures/Pasted image 20241003055355.png]]
![[Support/Figures/Pasted image 20241003055431.png]]
![[Support/Figures/Pasted image 20241003063945.png]]
![[Support/Figures/Pasted image 20241003065521.png]]
![[Support/Figures/Pasted image 20241003065535.png]]
![[Support/Figures/Pasted image 20241003070255.png]]
![[Support/Figures/Pasted image 20241003070354.png]]
![[Support/Figures/Pasted image 20241003070535.png]]
It might be useful to bring more detailed, low-level features from the leftmost hidden layer into the last hidden layer, alongside the higher-level spatial/contextual features coming from the previous layer.
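One way to sketch that fusion is a skip connection that upsamples the deeper feature map and concatenates it with the earlier one (the upsampling factor and fusion-by-concatenation are assumptions, not read off the figures):

```python
from tensorflow.keras import layers

def fuse_skip(low_level, high_level, up_factor=2):
    """Combine detailed low-level features from an early layer with
    contextual high-level features from a deeper, coarser layer.
    up_factor is an assumption; it must make the spatial sizes match."""
    up = layers.UpSampling2D(size=up_factor)(high_level)
    return layers.Concatenate()([low_level, up])
```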
![[Support/Figures/Pasted image 20241003070732.png]]
![[Support/Figures/Pasted image 20241004101645.png]]
![[Support/Figures/Pasted image 20241004101722.png]]
![[Support/Figures/Pasted image 20241004101955.png]]
![[Support/Figures/Pasted image 20241004102112.png]]
![[Support/Figures/Pasted image 20241004102254.png]]
![[Support/Figures/Pasted image 20241004102508.png]]
![[Support/Figures/Pasted image 20241004102737.png]]
![[Support/Figures/Pasted image 20241004102828.png]]
![[Support/Figures/Pasted image 20241004102910.png]]
![[Support/Figures/Pasted image 20241004102951.png]]
![[Support/Figures/Pasted image 20241004103110.png]]
![[Support/Figures/Pasted image 20241004103201.png]]
![[Support/Figures/Pasted image 20241004103250.png]]
![[Support/Figures/Pasted image 20241004103428.png]]
![[Support/Figures/Pasted image 20241004103753.png]]
![[Support/Figures/Pasted image 20241004103946.png]]
![[Support/Figures/Pasted image 20241004104106.png]]
![[Support/Figures/Pasted image 20241004104203.png]]
![[Support/Figures/Pasted image 20241004104402.png]]
![[Support/Figures/Pasted image 20241004104624.png]]
![[Support/Figures/Pasted image 20241004104915.png]]
![[Support/Figures/Pasted image 20241004104935.png]]