Do Vision-Language Models Understand CultureMix?
Culture mix refers to scenes where cultural cues such as food, attire, or architecture from different cultures appear together, for example a global potluck or eating food from your own culture while traveling abroad.
Finding: Frontier VLMs are less accurate when cultural cues are mixed, with drops of up to:
food mixtures: -14%
geographical cues: -41%
person/face cues: -58%
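As a rough illustration of how a drop like the ones above could be measured, the sketch below compares accuracy on single-culture scenes with accuracy on mixed scenes. It assumes the reported figures are differences in accuracy expressed in percentage points, which the page does not state explicitly. The function names, labels, and predictions are hypothetical toy values, not data or code from the CultureMix projects.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_change_points(acc_single: float, acc_mixed: float) -> float:
    """Change in accuracy from single-culture to mixed scenes, in percentage points.

    Assumption: the drops on this page are reported this way (mixed minus single).
    """
    return 100.0 * (acc_mixed - acc_single)


# Hypothetical toy data: the same four targets, shown alone and in mixed scenes.
labels = ["kimchi", "tacos", "sushi", "pierogi"]
single_preds = ["kimchi", "tacos", "sushi", "pierogi"]  # model output, single-culture scenes
mixed_preds = ["kimchi", "tacos", "ramen", "pierogi"]   # model output, mixed scenes

change = accuracy_change_points(
    accuracy(single_preds, labels),
    accuracy(mixed_preds, labels),
)
print(f"{change:+.0f} points")
```

A negative value means the model is less accurate on mixed scenes than on single-culture scenes.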
How the three projects complement each other
Each project pairs a cultural target with a different perturbation axis. Together, they cover the main ways cultural cues collide inside a scene.
Unified CultureMix corpus
30,957 samples test whether VLMs preserve cultural identity under mixed visual cues.
Citations
Contributors
Maintained by collaborators across institutions.