University of Southampton

web application assessment
Image by Josie Fraser
Annotating Multimedia for Community Folksonomy and Ontology Building


To make multimedia resources easier to access, search, manage, and exploit for all students and teachers by supporting the creation of annotations, metadata and schema (ontologies) based on social interactions and freeform tagging.

Problem Area

Many exciting opportunities for learning will occur in ‘Web 2.0’ and ‘social software’ environments when information is being communicated through speech and video rather than just text. While multimedia has become technically easier to create and offers many potential benefits for learning and teaching, it can also be difficult to access, search, manage, and exploit. The growing amount of informal knowledge available in speech and audio-visual form rather than text has therefore yet to achieve the level of interconnection and manipulation achieved for textual documents via the World Wide Web. This makes it particularly difficult for systems to reason about multimedia content. Deaf students can also be greatly disadvantaged by the lack of captioning (subtitling) of multimedia, with the very high cost of manual captioning often cited as the reason for non-compliance with web accessibility guidelines and national legislation1.

Innovative Application

The provision of multimedia consisting of text captions synchronised with recorded speech and images/video enables all their communication qualities and strengths to be available as appropriate for different contexts, content, tasks, learning styles and preferences. Text can reduce the memory demands of spoken language; speech can better express subtle emotions; while images can communicate moods, relationships and complex information holistically. Since there is little evidence that students’ preferred learning media can be predicted reliably through learning style instruments2, the availability of text captions and spoken output of text would enhance students’ choice of media.

Speech Recognition (SR) can provide a cost-effective way of automatically creating text captions synchronised with audio and video, allowing audio-visual material to be manipulated through searching, tagging, annotating, bookmarking and hyperlinking. Students and teachers can create tags, folksonomies, metadata, taxonomies and ontologies to support the structuring of elements of the multimedia. Rather than creating copies, links can be made to sections of the original multimedia, either for reuse as learning objects or, as appropriate, to provide evidence for assessment and e-portfolios.
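As a rough illustration of the idea, time-synchronised captions can be treated as segments that carry freeform tags and can be linked to directly. The sketch below is not the project's implementation: the names `CaptionSegment`, `tag_segment`, `segments_with_tag` and `link_to_segment` are hypothetical, and the `#t=start,end` link form is an assumed convention for addressing a time span within a media file.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionSegment:
    """One caption plus the time span (in seconds) of the speech it transcribes."""
    start: float
    end: float
    text: str
    tags: set[str] = field(default_factory=set)


def tag_segment(segment: CaptionSegment, tag: str) -> None:
    """Attach a freeform tag, normalised to lower case with outer spaces removed."""
    segment.tags.add(tag.strip().lower())


def segments_with_tag(segments: list[CaptionSegment], tag: str) -> list[CaptionSegment]:
    """Return every segment that carries the given tag."""
    wanted = tag.strip().lower()
    return [s for s in segments if wanted in s.tags]


def link_to_segment(media_url: str, segment: CaptionSegment) -> str:
    """Build a link to a section of the original media instead of copying it.

    The '#t=start,end' fragment is an assumption made for this sketch.
    """
    return f"{media_url}#t={segment.start:g},{segment.end:g}"
```

A teacher could, for example, tag the segment where a concept is explained and then share the link returned by `link_to_segment` as a reusable pointer into the original recording.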

The next generation of Web applications emphasises social interaction and user participation. This can take the form of social networks, collaborative content creation, and freeform metadata creation in the form of tagging3. Recently we have also seen the emergence of Semantic Wikis, which allow Wiki pages and links to be typed4. Such typing and tagging is currently used to aid search and to support recommendations, but it is necessarily unconstrained. This makes it a lightweight activity, but limits the usefulness of the tags because the relationships between tags are not understood (for example, synonyms are not modelled).

Schemas or ontologies have a relatively high cost of creation, but resources described with them can be manipulated more powerfully than those tagged in a freeform way5. It is possible to examine the use of free tags and typing and extrapolate such a schema, called a Folksonomy, which reflects the evolving view of the community rather than the perspective of a particular design team6. This has the advantage that it is still lightweight, but also affords the kind of relationships necessary for advanced search and personalisation. The Folksonomy approach allows a community to reflect on its activities and develop tag richness, increasing the utility and reusability of resources. Folksonomy construction combines the advantages of the Web 2.0 model with the utility of more traditional approaches. It is an important consideration for the e-Framework, which must strike a balance between freeform tagging and structured annotation, and support the agile and evolutionary development of information models.
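One way to picture how a schema might be extrapolated from free tags is a simple co-occurrence heuristic: if nearly every resource tagged B is also tagged A, but not the reverse, A is treated as the broader term. The sketch below uses that subsumption heuristic purely as an illustration; it is not the method described in this project, and the function name `infer_broader_tags` and the `threshold` value are assumptions.

```python
from collections import Counter
from itertools import permutations


def infer_broader_tags(tagged_resources, threshold=0.8):
    """Return sorted (broader, narrower) tag pairs inferred from co-occurrence.

    tagged_resources: iterable of tag collections, one per resource.
    A tag 'a' is taken to be broader than 'b' when at least `threshold` of the
    resources tagged 'b' are also tagged 'a', while fewer than `threshold` of
    the resources tagged 'a' are also tagged 'b'.
    """
    tag_counts = Counter()   # number of resources carrying each tag
    pair_counts = Counter()  # number of resources carrying both tags of an ordered pair
    for tags in tagged_resources:
        unique = set(tags)
        tag_counts.update(unique)
        pair_counts.update(permutations(sorted(unique), 2))

    relations = []
    for (a, b), both in pair_counts.items():
        share_of_b_with_a = both / tag_counts[b]
        share_of_a_with_b = both / tag_counts[a]
        if share_of_b_with_a >= threshold and share_of_a_with_b < threshold:
            relations.append((a, b))
    return sorted(relations)
```

Pairs of tags that almost always appear together in both directions are deliberately left out by this rule; such pairs are candidates for synonyms rather than broader/narrower relationships and would need separate handling.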

An overview of the system is shown in figure 1.


The Learning Societies Lab research group in Southampton has extensive experience in information modelling, the social web, and mobile and ubiquitous computing. The members of the group have been involved in Hypertext, Web and Knowledge research for over fifteen years, and are internationally recognised for their application of these technologies to the domain of e-learning. In addition, the group has advised HEFCE, BECTA, and JISC on accessibility, disability and technology issues, and has worked with IBM and the International Liberated Learning (LL) consortium on researching and implementing the use of Speech Recognition (SR) engines to automatically create synchronised captions from live or recorded audio and video. A prototype application is currently under development to enable text captions synchronised with audio and video to be annotated by students and teachers.

Multimedia Presentation

A multimedia presentation (audio playback requires Internet Explorer) that uses text captions synchronised with audio and PowerPoint slides, and demonstrates some of the ideas, can be found at:

The text is highlighted automatically in time with the audio, and selecting ‘Where am I?’ will ensure the text also scrolls automatically with the audio
You can use the audio controls to move backwards or forwards through the presentation or to pause it
You can use the ‘find’ facility in your browser to search for text and play from that position
Clicking on a PowerPoint slide image inline with the text will open it full size in a separate window
Clicking on a PowerPoint slide thumbnail image will move backwards or forwards through the presentation to that position
You can resize the frames (and so enlarge the thumbnails)
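
The highlighting and ‘find and play’ behaviour listed above can be sketched as a lookup over caption start times. This is only an illustration under stated assumptions, not the presentation's actual code: captions are assumed to be `(start_seconds, text)` pairs sorted by start time, and the function names `caption_index_at` and `find_start_time` are hypothetical.

```python
from bisect import bisect_right


def caption_index_at(start_times, playback_time):
    """Index of the caption to highlight at playback_time.

    start_times must be sorted in ascending order. Returns None when
    playback_time is earlier than the first caption.
    """
    index = bisect_right(start_times, playback_time) - 1
    return index if index >= 0 else None


def find_start_time(captions, query):
    """Start time of the first caption containing query (case-insensitive).

    Returns None when no caption matches, so the caller can leave the
    playback position unchanged.
    """
    needle = query.lower()
    for start, text in captions:
        if needle in text.lower():
            return start
    return None
```

A player would call `caption_index_at` on each time update to decide which caption to highlight, and would seek the audio to the value returned by `find_start_time` when the user searches for text.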