
Every Picture Tells a Story: Generating Sentences from Images

  • Conference paper
  • Computer Vision – ECCV 2010 (ECCV 2010), pp. 15–29
  • Ali Farhadi,
  • Mohsen Hejrati,
  • Mohammad Amin Sadeghi,
  • Peter Young,
  • Cyrus Rashtchian,
  • Julia Hockenmaier &
  • David Forsyth

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 6314)

Included in the following conference series:

  • European Conference on Computer Vision
  • 20k Accesses
  • 907 Citations
  • 6 Altmetric

Abstract

Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
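The scoring idea in the abstract can be pictured with a small sketch. The paper represents the "estimate of meaning" on each side as an ⟨object, action, scene⟩ triple; the toy code below stands in for the learned discriminative mappings and the WordNet-based similarity with plain string matching, so every name here (`Meaning`, `triple_score`, the example strings) is illustrative, not the authors' implementation.

```python
# Toy sketch of image-sentence scoring via meaning triples.
# Assumption: meaning is an <object, action, scene> triple, as in the paper;
# the learned mappings and lexical similarity are replaced by exact matching.

from dataclasses import dataclass

@dataclass(frozen=True)
class Meaning:
    obj: str
    action: str
    scene: str

def triple_score(image_meaning: Meaning, sentence_meaning: Meaning) -> float:
    """Toy similarity: fraction of matching slots.

    The actual system maps both sides into a shared meaning space and uses
    lexical similarity, so near-synonyms and synecdoche can still score well;
    exact string equality is only a stand-in here.
    """
    pairs = [
        (image_meaning.obj, sentence_meaning.obj),
        (image_meaning.action, sentence_meaning.action),
        (image_meaning.scene, sentence_meaning.scene),
    ]
    return sum(a == b for a, b in pairs) / len(pairs)

# Attaching a sentence to an image: pick the highest-scoring candidate.
image = Meaning("dog", "run", "grass")
candidates = {
    "A dog runs across the grass.": Meaning("dog", "run", "grass"),
    "A cat sleeps on a sofa.": Meaning("cat", "sleep", "sofa"),
}
best = max(candidates, key=lambda s: triple_score(image, candidates[s]))
```

Because the score is symmetric in its two arguments, the same comparison can rank images for a given sentence, which is how the abstract's illustration use case works.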



Author information

Authors and Affiliations

1. Computer Science Department, University of Illinois at Urbana-Champaign

    Ali Farhadi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier & David Forsyth

2. Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (IPM)

    Mohsen Hejrati & Mohammad Amin Sadeghi


Editor information

Editors and Affiliations

  1. GRASP Laboratory, University of Pennsylvania, 3330 Walnut Street, 19104, Philadelphia, PA, USA

    Kostas Daniilidis

  2. School of Electrical and Computer Engineering, National Technical University of Athens, 15773, Athens, Greece

    Petros Maragos

  3. Department of Applied Mathematics, Ecole Centrale de Paris, Grande Voie des Vignes, 92295, Chatenay-Malabry, France

    Nikos Paragios


Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Farhadi, A. et al. (2010). Every Picture Tells a Story: Generating Sentences from Images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds) Computer Vision – ECCV 2010. ECCV 2010. Lecture Notes in Computer Science, vol 6314. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15561-1_2

  • DOI: https://doi.org/10.1007/978-3-642-15561-1_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15560-4

  • Online ISBN: 978-3-642-15561-1

  • eBook Packages: Computer Science, Computer Science (R0)


Keywords

  • Machine Translation
  • Head Noun
  • Node Feature
  • Generate Sentence
  • Edge Potential

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
