AnoVL: Adapting Vision-Language Models for Unified Zero-shot...
Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision....
https://arxiv.org/abs/2308.15939