Do Vision-Language Models Understand CultureMix?

Culture mix refers to scenes in which cultural cues such as food, attire, or architecture from different cultures appear together, for example a global potluck or a dish from your own culture eaten while traveling abroad.

Figures: three examples of culture-mixed food and context.

Why is this important?

Culture mixing reduces VLM accuracy

Finding: Frontier VLMs are less accurate when cultural cues are mixed, with accuracy changes of up to -14% for food mixtures, -41% for geographical cues, and -58% for person/face cues.

Models often rely on background, geography, and appearance instead of the target cultural item.

Our three papers ask one shared question

When cultural cues conflict in a single scene, which cues do models rely on most?

Each paper isolates a different failure mode and reveals where models collapse under culturally mixed context.

Explore our dataset

Copy a prompt and image to test whether your model is robust.
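
One way to turn such a spot check into a number is sketched below. It is only an illustration. The `Sample` fields, the file names, the substring-match scoring, and `toy_model` (a stand-in for a real VLM call) are all assumptions, not the dataset's actual format or the papers' evaluation protocol. The gap is reported in percentage points as a convenience.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    image_path: str  # hypothetical path to an image copied from the dataset
    prompt: str      # question posed to the model
    answer: str      # expected cultural item
    mixed: bool      # True if the scene mixes cues from several cultures


def accuracy(model: Callable[[str, str], str], samples: List[Sample]) -> float:
    """Fraction of samples whose expected answer appears in the model's reply."""
    if not samples:
        raise ValueError("no samples to score")
    hits = sum(
        s.answer.lower() in model(s.image_path, s.prompt).lower() for s in samples
    )
    return hits / len(samples)


def robustness_gap(model: Callable[[str, str], str], samples: List[Sample]) -> float:
    """Accuracy on mixed scenes minus accuracy on unmixed scenes, in points."""
    plain = [s for s in samples if not s.mixed]
    mixed = [s for s in samples if s.mixed]
    return 100.0 * (accuracy(model, mixed) - accuracy(model, plain))


def toy_model(image_path: str, prompt: str) -> str:
    # Stand-in for a real VLM call: it answers from the file name alone.
    return "kimchi" if "kimchi" in image_path else "not sure"


# Toy samples for illustration only; these are not taken from the corpus.
samples = [
    Sample("kimchi_plain.jpg", "What dish is this?", "kimchi", mixed=False),
    Sample("kimchi_in_paris.jpg", "What dish is this?", "kimchi", mixed=True),
    Sample("taco_in_tokyo.jpg", "What dish is this?", "taco", mixed=True),
]
gap = robustness_gap(toy_model, samples)
print(f"accuracy change on mixed scenes: {gap:+.1f} points")
```

Replacing `toy_model` with a function that sends the image and prompt to your own model gives the same comparison on real data. A negative gap means the model does worse once cultural cues are mixed.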

How the three projects complement each other

Each project pairs a cultural target with a different perturbation axis, together covering the main ways cultural cues collide inside a scene.

Unified CultureMix corpus

30,957 samples test whether VLMs preserve cultural identity under mixed visual cues.

4 target families, 5 perturbation cues, 3 complementary benchmarks.

Shared takeaways across the papers

Citations

Contributors

Maintained by collaborators across institutions.