ViSen Prepositions Dataset

Overview

We explore the task of predicting the preposition that best expresses the relation between two visual entities.

More specifically, given a trajector entity and a landmark entity, together with their locations and sizes in an image, the task is to predict the most suitable preposition connecting the two entities (as used in human-authored image descriptions).

For example, for the instance "boy ___ sled", where "boy" is the trajector and "sled" the landmark, select the best preposition to fill in the blank given the category labels and their corresponding bounding boxes.
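To make the inputs concrete, the following is a minimal sketch of how the two bounding boxes might be turned into a few simple geometric cues. It is an illustration only, not the feature set used in the paper: the `Box` layout (top-left corner plus width and height, in pixels), the function names, the chosen features and the example coordinates are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (assumed layout: top-left corner plus size, in pixels)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height

    @property
    def centre(self) -> tuple[float, float]:
        return (self.x + self.width / 2.0, self.y + self.height / 2.0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of overlap between two boxes, or 0.0 if they do not overlap."""
    overlap_w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    overlap_h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return max(0.0, overlap_w) * max(0.0, overlap_h)


def geometric_features(trajector: Box, landmark: Box,
                       image_w: float, image_h: float) -> dict[str, float]:
    """Illustrative geometric cues; hypothetical, not the paper's exact features."""
    tx, ty = trajector.centre
    lx, ly = landmark.centre
    inter = intersection_area(trajector, landmark)
    union = trajector.area + landmark.area - inter
    return {
        # Centre offset of the trajector relative to the landmark,
        # normalised by the image size (image y grows downwards).
        "dx": (tx - lx) / image_w,
        "dy": (ty - ly) / image_h,
        # Relative size of the two entities.
        "area_ratio": trajector.area / landmark.area if landmark.area > 0 else 0.0,
        # Intersection over union of the two boxes.
        "iou": inter / union if union > 0 else 0.0,
    }


# Made-up coordinates for the "boy ___ sled" instance in a 400x400 image.
boy = Box(x=100, y=50, width=100, height=200)
sled = Box(x=80, y=200, width=200, height=100)
features = geometric_features(boy, sled, image_w=400, image_h=400)
```

Here a negative `dy` means the trajector's centre lies above the landmark's centre, which is the kind of cue a classifier could combine with the category labels when choosing between prepositions such as "on" and "under".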

Dataset

This dataset has two main appeals:

It is extracted automatically from two large-scale image datasets with human-authored descriptions, and therefore contains a reasonable amount of noise.
The prepositions reflect real-world usage by humans in image descriptions, making the dataset attractive for exploring prepositional usage in that setting specifically.

The ZIP archive contains instances of triples <trajector, preposition, landmark> extracted from MSCOCO and Flickr30k Entities, as used in the experiments described in our EMNLP 2015 paper and its supplementary material. We only provide the IDs of the images and captions from the source datasets. Please obtain the images and annotations directly from the original datasets.
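As an illustration of how such triples might be used, the sketch below builds a simple frequency table and returns the most common preposition for a given trajector and landmark pair. This is a hypothetical baseline, not the method from the paper. The triples are made-up in-memory examples, the function names are invented for illustration, and nothing here assumes the actual file format inside the archive.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory triples (trajector, preposition, landmark).
# These are invented examples; the archive's file format is not assumed here.
triples = [
    ("boy", "on", "sled"),
    ("boy", "on", "sled"),
    ("boy", "with", "sled"),
    ("dog", "in", "snow"),
]


def build_prior(triples):
    """Count how often each preposition occurs for each (trajector, landmark) pair."""
    counts = defaultdict(Counter)
    for trajector, preposition, landmark in triples:
        counts[(trajector, landmark)][preposition] += 1
    return counts


def most_frequent_preposition(counts, trajector, landmark):
    """Return the most frequent preposition for the pair, or None if the pair is unseen."""
    pair_counts = counts.get((trajector, landmark))
    if not pair_counts:
        return None
    return pair_counts.most_common(1)[0][0]


prior = build_prior(triples)
prediction = most_frequent_preposition(prior, "boy", "sled")
```

A label-only baseline of this kind ignores the bounding boxes entirely, so it can serve as a reference point for judging how much location and size information adds.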

Dataset R201509: Download (3.5MB)

Citation

If you use this dataset, please cite the following work:

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions [ Paper | Supplementary Material ]
Arnau Ramisa*, Josiah Wang*, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer, Robert Gaizauskas (* = equal contribution)
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).

@InProceedings{Ramisa-EtAl:2015:EMNLP,
  author    = {Ramisa, Arnau  and  Wang, Josiah  and  Lu, Ying  and  Dellandrea, Emmanuel  and  Moreno-Noguer, Francesc  and  Gaizauskas, Robert},
  title     = {Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions},
  booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2015},
  address   = {Lisbon, Portugal},
  publisher = {Association for Computational Linguistics},
  pages     = {214--220},
  url       = {https://aclanthology.org/D15-1022/},
  doi       = {10.18653/v1/D15-1022},
}

Related Publications

Arnau Ramisa*, Josiah Wang*, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer, Robert Gaizauskas (* = equal contribution)
Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).
[ Paper | Supplementary Material ]

Contact

For any enquiries, please contact Josiah Wang.

Acknowledgements

This work was funded by the EU CHIST-ERA D2K 2011 Visual Sense (ViSen) project.

It was also partly funded by the Spanish MINECO RobInstruct project.