
Efficiently Scaling up Crowdsourced Video Annotation

A Set of Best Practices for High Quality, Economical Video Labeling

International Journal of Computer Vision

Abstract

We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented for their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and demonstrate an inherent trade-off between the mix of human and cloud computing used and the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
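
To make the key-frame pipeline concrete, the sketch below illustrates the simplest interpolation strategy the abstract alludes to: a worker draws bounding boxes only on sparse key frames, and the frames in between are filled in automatically. This is a minimal illustrative sketch rather than the released vatic implementation; the Box fields and the purely linear interpolation are assumptions for exposition, whereas the paper's stronger strategies choose the intermediate boxes using pixel-based features and dynamic programming.

```python
# Minimal illustrative sketch (not the released vatic code): fill in bounding
# boxes between two worker-labeled key frames by linear interpolation.
# The Box fields below are assumptions chosen for exposition.

from dataclasses import dataclass


@dataclass
class Box:
    frame: int     # frame index within the video
    x: float       # top-left corner, pixels
    y: float
    width: float
    height: float


def interpolate(start: Box, end: Box) -> list[Box]:
    """Linearly interpolate boxes for every frame strictly between two key frames."""
    span = end.frame - start.frame
    boxes = []
    for frame in range(start.frame + 1, end.frame):
        t = (frame - start.frame) / span  # fraction of the way from start to end
        boxes.append(Box(
            frame=frame,
            x=start.x + t * (end.x - start.x),
            y=start.y + t * (end.y - start.y),
            width=start.width + t * (end.width - start.width),
            height=start.height + t * (end.height - start.height),
        ))
    return boxes


if __name__ == "__main__":
    # A worker labels frames 0 and 10; frames 1-9 are estimated automatically.
    for box in interpolate(Box(0, 50.0, 40.0, 100.0, 200.0),
                           Box(10, 150.0, 60.0, 110.0, 210.0)):
        print(box)
```

Because human effort is spent only on the key frames, annotation cost grows with the number of key frames rather than with the length of the video, which is what makes the budget versus accuracy trade-off explicit.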




Notes

  1. The software and data sets can be downloaded from our website at http://mit.edu/vondrick/vatic.

References

  • Agarwala, A., Hertzmann, A., Salesin, D., & Seitz, S. (2004). Keyframe-based tracking for rotoscoping and animation. ACM Transactions on Graphics, 23, 584–591.

  • Ali, K., Hasler, D., & Fleuret, F. (2011). FlowBoost: appearance learning from sparsely annotated video. In IEEE conference on computer vision and pattern recognition.

  • Anonymous (2012). http://www.visint.org/.

  • Aydemir, A., Henell, D., Jensfelt, P., & Shilkrot, R. (2012). Kinect@Home: crowdsourcing a large 3D dataset of real environments. In 2012 AAAI spring symposium series.

  • Bailey, B., & Konstan, J. (2006). On the need for attention-aware systems: measuring effects of interruption on task performance, error rate, and affective state. Computers in Human Behavior, 22(4), 685–708.

  • Bellman, R. (1956). Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences of the United States of America, 42(10), 767.

  • Buchanan, A., & Fitzgibbon, A. (2006). Interactive feature tracking using k-d trees and dynamic programming. In CVPR 2006 (Vol. 1, pp. 626–633).

  • Chen, J., Zou, W., & Ng, A. (2011). Personal communication.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Demiröz, B., Salah, A., & Akarun, L. (2012). Multi-omnidirectional cameras for ambient intelligence (Çevresel zeka uygulamaları için tüm yönlü kamera kullanımı).

  • Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In Proc. CVPR (pp. 710–719).

  • Endres, I., Farhadi, A., Hoiem, D., & Forsyth, D. (2010). The benefits and challenges of collecting richer object annotations. In CVPR workshop on advancing computer vision with humans in the loop. New York: IEEE Press.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fan, R., Chang, K., Hsieh, C., Wang, X., & Lin, C. (2008). LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

  • Felzenszwalb, P., & Huttenlocher, D. (2004). Distance transforms of sampled functions (Cornell Computing and Information Science Technical Report TR2004-1963).

  • Fisher, R. (2004). The PETS04 surveillance ground-truth data sets. In Proc. 6th IEEE international workshop on performance evaluation of tracking and surveillance (pp. 1–5).

  • Huber, D. (2011). Personal communication.

  • Kahle, B. (2010). http://www.archive.org/details/movies.

  • Kumar, N., Berg, A. C., Belhumeur, P. N., & Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In ICCV.

  • Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In IEEE conference on computer vision and pattern recognition, CVPR 2008 (pp. 1–8). New York: IEEE Press.

  • Liu, C., Freeman, W., Adelson, E., & Weiss, Y. (2008). Human-assisted motion annotation. In IEEE conference on computer vision and pattern recognition, CVPR 2008 (pp. 1–8).

  • Liu, W., & Lazebnik, S. (2011). Personal communication.

  • Mark, G., Gonzalez, V., & Harris, J. (2005). No task left behind? Examining the nature of fragmented work. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 321–330). New York: ACM Press.

  • Mihalcik, D., & Doermann, D. (2003). The design and implementation of ViPER (Technical report).

  • Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38.

  • Oh, S. (2011). Personal communication.

  • Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C. C., Lee, J. T., Mukherjee, S., Aggarwal, J. K., Lee, H., Davis, L., Swears, E., Wang, X., Ji, Q., Reddy, K., Shah, M., Vondrick, C., Pirsiavash, H., Ramanan, D., Yuen, J., Torralba, A., Song, B., Fong, A., Roy-Chowdhury, A., & Desai, M. (2011). A large-scale benchmark dataset for event recognition in surveillance video. In CVPR.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.

  • Pirsiavash, H., & Ramanan, D. (2012). Detecting activities of daily living in first-person camera views. In CVPR.

  • Ramanan, D., Baker, S., & Kakade, S. (2007). Leveraging archival video for building face datasets. In IEEE 11th international conference on computer vision, ICCV 2007 (pp. 1–8). New York: IEEE Press.

  • Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In alt.chi session of CHI 2010 extended abstracts on human factors in computing systems.

  • Russell, B., Torralba, A., Murphy, K., & Freeman, W. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1), 157–173.

  • Schwartz, B. (2005). The paradox of choice: why more is less. New York: Harper Perennial.

  • Smeaton, A., Over, P., & Kraaij, W. (2006). Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM international workshop on multimedia information retrieval (pp. 321–330). New York: ACM Press.

  • Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. Urbana, 51, 61, 820.

  • Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large dataset for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

  • Torralba, A., Russell, B., & Yuen, J. (2010). LabelMe: online image annotation and applications. Proceedings of the IEEE, 98(8), 1467–1484.

  • Vijayanarasimhan, S., & Grauman, K. (2009). What’s it going to cost you? Predicting effort vs. informativeness for multi-label image annotations. In CVPR.

  • Vijayanarasimhan, S., Jain, P., & Grauman, K. (2010). Far-sighted active learning on a budget for image and video recognition. In CVPR.

  • Vittayakorn, S., & Hays, J. (2011). Quality assessment for crowdsourced object annotations. In J. Hoey, S. McKenna, & E. Trucco (Eds.), Proceedings of the British machine vision conference (pp. 109–110).

  • Von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 319–326). New York: ACM Press.

  • Von Ahn, L., Liu, R., & Blum, M. (2006). Peekaboom: a game for locating objects in images. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 55–64). New York: ACM Press.

  • Vondrick, C., & Ramanan, D. (2011). Video annotation and tracking with active learning. In NIPS.

  • Vondrick, C., Ramanan, D., & Patterson, D. (2010). Efficiently scaling up video annotation with crowdsourced marketplaces. In Computer vision–ECCV 2010 (pp. 610–623).

  • Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In Neural information processing systems conference (NIPS) (Vol. 6, p. 8).

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: large-scale scene recognition from abbey to zoo. In CVPR.

  • Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: a survey. ACM Computing Surveys (CSUR).

  • Yuen, J., Russell, B., Liu, C., & Torralba, A. (2009). LabelMe video: building a video database with human annotations. In International conference on computer vision.


Acknowledgements

We thank Sangmin Oh, Allie Janoch, Sergey Karayev, Kate Saenko, Jenny Yuen, Antonio Torralba, Justin Chen, Will Zou, Barış Evrim Demiröz, Marco Antonio Valenzuela Escárcega, Alper Aydemir, David Owens, Hamed Pirsiavash, our user study participants, and the thousands of annotators for testing our software and offering invaluable insight throughout this study. Funding for this research was provided by NSF grants 0954083 and 0812428, ONR-MURI Grant N00014-10-1-0933, DARPA Contract No. HR0011-08-C-0135, an NSF Graduate Research Fellowship, support from Intel, and an Amazon AWS grant.

Author information

Corresponding author: Carl Vondrick.

Additional information

A preliminary version of this work appeared in ECCV 2010 by Vondrick et al.


About this article

Cite this article

Vondrick, C., Patterson, D. & Ramanan, D. Efficiently Scaling up Crowdsourced Video Annotation. Int J Comput Vis 101, 184–204 (2013). https://doi.org/10.1007/s11263-012-0564-1
