IEEE Croatia Section's Computer Chapter would like to invite you to the following lecture:

How Many Words is A Picture Really Worth? On Training and Evaluating Large Vision-Language Models

which will be presented by Goran Glavaš, PhD, on Wednesday, 29th May 2024, at 10:00 in the Grey Hall of the Faculty of Electrical Engineering and Computing.

All those interested are welcome, especially students. The lecture will be held in English.

The author's biography and the lecture's summary can be found below.

Lecture summary:

Large Vision-Language Models (LVLMs), commonly obtained by aligning a pretrained visual encoder (e.g., a Vision Transformer, ViT) to a pretrained large language model (LLM), have recently led to impressive results not only in image captioning, but also on a wide range of visual understanding and reasoning tasks (e.g., visual question answering). Nonetheless, a number of factors, ranging from the architecture of the alignment module to the exact "training mix" (i.e., training tasks and data), strongly determine the effectiveness of the resulting LVLM. Moreover, LVLMs (much like their text-only counterparts) are not inherently multilingual and suffer from hallucination. In this talk, I'll explore training and evaluation protocols for LVLMs, focusing in particular on (i) efficiently training competitive massively multilingual LVLMs, (ii) training with grounding objectives, reported to reduce the hallucination tendencies of LVLMs, and (iii) pitfalls of existing LVLM evaluation and possible remedies.


About the author:

Goran Glavaš is a Full Professor for Natural Language Processing at the University of Würzburg (Germany), Center for AI and Data Science (CAIDAS). He obtained his Ph.D. at the Text Analysis and Knowledge Engineering Lab (TakeLab), Faculty of Electrical Engineering and Computing, University of Zagreb. His research interests are in the areas of Natural Language Processing and Information Retrieval, with a focus on multilingual NLP and IR and cross-lingual transfer, vision-and-language models and multimodal representation learning, information extraction, and NLP applications (primarily for social sciences and humanities). He has (co-)authored over 120 publications in the areas of NLP and IR, publishing regularly at top-tier NLP and IR venues (ACL, EMNLP, NAACL, EACL, TACL, SIGIR, ECIR). He is a prominent member of the Association for Computational Linguistics (ACL), where he served as an Editor-in-Chief of the ACL Rolling Review (ARR), a central reviewing service of the ACL, and regularly serves as a (Senior) Area Chair for top-tier conferences. He is also a member of the German Society for Computational Linguistics (GSCL).

Author: Lucija Petricioli
