Learning like a toddler, from few instances or by imagination

Excuse the typos and bad grammar; my fluency is inversely proportional to my grammar

Deep learning beats human performance on tasks where abundant labelled data is available, but collecting that data is both painful and time-consuming. It's not foolproof either: manual annotators mislabel things for various reasons, or sometimes the categories are so fine-grained, or so subjective, or both, that labelling is not feasible at all.

There are two possible solutions to this: a) semi-supervised learning, which has been getting attention lately, and b) few-shot learning, which is the subject of this post.

What is few-shot learning?

It's learning from "few shots", i.e. a few samples: the model is given one or two looks at each labelled image, and that's all the labelled training data it gets (of course you still have to train for many iterations, but I ran 5k epochs in 1.5 hours, not bad!).

How can it classify based on just two samples?

What ConvNets generally learn is a template of a class/object: all that image data passing through the layers leaves some class-specific signal from each particular image, and aggregating those signals gives a general picture of what the class should look like, even under departures from normal conditions.

What FSL does is collect the features of the images (features are not very well defined here) and group them together, making a small cluster per class, so for 10 different classes you get 10 different clusters. Then, for a new image, the distance of its feature vector to each cluster is calculated, and the image is assigned the label of whichever cluster is nearest (something like k-NN).
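A minimal sketch of this nearest-cluster idea, using made-up 2-D "embeddings" (in a real setup a ConvNet would produce the feature vectors, and the class representative is just the mean of each class's support embeddings):

```python
import numpy as np

def prototypes(support_feats, support_labels):
    """Average each class's support embeddings into one cluster center."""
    classes = sorted(set(support_labels))
    centers = np.stack([
        np.mean([f for f, y in zip(support_feats, support_labels) if y == c], axis=0)
        for c in classes
    ])
    return classes, centers

def classify(query_feat, classes, centers):
    """Assign the label of the nearest cluster center (squared L2 distance)."""
    dists = np.sum((centers - query_feat) ** 2, axis=1)
    return classes[int(np.argmin(dists))]

# Toy two-way, two-shot task: two labelled samples per class.
feats = [np.array([0.0, 0.0]), np.array([0.2, 0.1]),   # "cat" support set
         np.array([3.0, 3.0]), np.array([2.8, 3.1])]   # "dog" support set
labels = ["cat", "cat", "dog", "dog"]

classes, centers = prototypes(feats, labels)
print(classify(np.array([0.1, 0.0]), classes, centers))  # → cat
```

The key point is that no weights are updated for the new classes; classification is purely a distance comparison in feature space.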

Does this work?

Surprisingly, very well. When I first used it I wasn't sure it was going to work, but seeing accuracies above 90% was very surprising, and then the feeling of the power of deep learning sets in, a feeling you don't get when walking through standard datasets like CIFAR or ImageNet.

Why does it work?

Now that we know it works, many theories can be made as to why. At first glance it makes sense: ConvNets are powerful feature extractors, so they can produce discriminative features very efficiently, and that (IMO) is the central reason this works. Once you have features that are quite definitive, all you do is compare distances, which brings us to another performance-affecting choice: what kind of distance?

Is it L1, L2, cosine, or some other metric? First it was cosine, and then squared L2; just switching the metric gives a measurable performance boost. Why? The usual argument is that squared L2 is a Bregman divergence, and for Bregman divergences the class mean is the best single representative of a cluster, while cosine distance is not a Bregman divergence (if that makes sense to you, please explain it to me too).

In further posts of this series, we will get into a few more details: which model is the AlexNet of FSL, which dataset is its ImageNet, the progress since then, and the current SOTA.