Respeaking

Live subtitling (or “real-time captioning”, as it is known in the United States) is the real-time transcription of spoken words, sound effects, important musical cues, and other relevant audio information to enable deaf or hard-of-hearing persons to follow a live audiovisual programme. Commonly regarded (since it was introduced in the United States and Europe in the early 1980s) as one of the most challenging modalities within media accessibility, it can be produced through different methods: standard QWERTY keyboards, Velotype, and the two most common approaches, namely stenography and respeaking.

At GALMA we are involved in research (and practice) on stenography and, especially, on respeaking, which may be defined as:


a technique in which a respeaker listens to the original sound of a (live) programme or event and respeaks it, including punctuation marks and some specific features for the deaf and hard-of-hearing audience, to a speech recognition software, which turns the recognized utterances into subtitles displayed on the screen with the shortest possible delay
Romero-Fresco, 2011:1

In many ways, respeaking is to subtitling what interpreting is to translation, namely a leap from the written to the oral modality without the safety net of time. Although respeakers are normally encouraged to repeat the original soundtrack, and hence produce verbatim subtitles, the fast-paced delivery of speech in media content often makes this difficult. The challenges arising from high speech rates are compounded by other constraints. These include the need to incorporate punctuation marks through dictation while the respeaking of the original soundtrack is unfolding; and the expectation that respoken output will abide by standard viewers’ reading rates. Consequently, respeakers often end up paraphrasing, rather than repeating or shadowing, the original soundtrack. At GALMA we are conducting leading research on the quality of live subtitles with several governments, universities and companies around the world using our NER model. We are also delivering face-to-face and online training on intra- and interlingual respeaking and we have set up LiRICS (Live Reporting International Certification Standard), the first worldwide certification process for professional respeakers.


Quality in live subtitling: The NER model

The NER model is a tried-and-tested tool for assessing the accuracy of intralingual live subtitles. It is currently used by regulators, broadcasters, and subtitling companies in Australia, Spain, the UK, Germany, Switzerland, Italy, France, and Belgium (Romero-Fresco, 2020).

The NER formula explained

The NER model calculates the accuracy rate with the following formula (Romero-Fresco & Martínez, 2015): accuracy rate = (N − E − R) / N × 100.

The acronym ‘NER’ reflects the formula used in the model to calculate the accuracy rate. Below is an explanation of each of the model’s components:

  • N: the number of words in the respoken text;
  • E: the edition errors caused by strategies applied by the respeaker (deletions, substitutions, insertions);
  • R: the recognition errors, usually caused by mispronunciations on the respeaker's part or by errors made by the speech recognition software.
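Under the standard formula, accuracy = (N − E − R) / N × 100, the calculation can be sketched as follows. The function name and signature are illustrative, not part of the model itself:

```python
# Minimal sketch of the NER accuracy calculation, based on the formula
# accuracy = (N - E - R) / N * 100 (Romero-Fresco & Martínez, 2015).
# Function and parameter names are illustrative assumptions.

def ner_accuracy(n_words: int, edition_errors: float, recognition_errors: float) -> float:
    """Return the NER accuracy rate as a percentage."""
    if n_words <= 0:
        raise ValueError("N must be a positive word count")
    return (n_words - edition_errors - recognition_errors) / n_words * 100

# Example: 1,000 respoken words with 5 points of edition errors and
# 3 points of recognition errors -> (1000 - 5 - 3) / 1000 * 100
print(round(ner_accuracy(1000, 5, 3), 2))  # 99.2
```

Note that E and R are not raw error counts but penalty points, weighted by error severity as explained below.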

The need for human intervention is highlighted through two additional elements of the model:

  • CE: correct editions, i.e. any deviation from the original text that does not lead to a loss of information, e.g. the use of synonyms or the deletion of repetitions and filler words.
  • Assessment: the assessor can comment on aspects that are not included in the NER formula, such as latency and ease of reading.

Error severity explained

The NER model accounts for minor, standard, and serious edition and recognition errors, and respoken texts are expected to reach an accuracy rate of at least 98% to be considered suitable for live broadcast.

In terms of the error coding scheme, minor, standard, and serious edition and recognition errors are penalised at -0.25, -0.5 and -1, respectively.

  • Minor errors (-0.25) can be recognised as errors by a viewer. The text can still be followed, but its meaning or flow may occasionally be interrupted, making it difficult to recognise the original words.
  • Standard errors (-0.5) cannot be recognised by a viewer, as they cause confusion and/or a loss of information. They do not create new meanings, but they do omit ideas from the original text.
  • Serious errors (-1) introduce factual mistakes or misleading information that create a new meaning in the respoken text. They go unrecognised by a viewer because they appear to be correct.
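The severity weights above feed into the E and R terms of the formula as penalty points rather than raw counts. A small sketch of that tallying, where the data structure and names are assumptions for illustration:

```python
# Illustrative sketch of how severity weights produce the E and R scores,
# using the penalties stated above (minor 0.25, standard 0.5, serious 1).
# The tally format and function name are assumptions, not part of the model.

PENALTIES = {"minor": 0.25, "standard": 0.5, "serious": 1.0}

def error_score(errors: dict) -> float:
    """Sum the penalty points for a tally of errors by severity."""
    return sum(PENALTIES[severity] * count for severity, count in errors.items())

# Example: 4 minor, 2 standard and 1 serious edition errors
e = error_score({"minor": 4, "standard": 2, "serious": 1})  # 4*0.25 + 2*0.5 + 1 = 3.0

# Combined with recognition errors and N, the NER formula gives the accuracy rate:
n = 500
r = error_score({"minor": 2})  # R = 0.5
accuracy = (n - e - r) / n * 100
print(round(accuracy, 2))  # 99.3
```

With these weights, a text of 500 words can absorb only a handful of standard or serious errors before dropping below the 98% threshold, which is why severity classification matters as much as error counting.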

Related information and teaching resources

Below is a list of further information on the NER model and resources that can be used for self-guided or instructed teaching.

  • Advanced Intralingual Respeaking: Quality Assessment (ILSA project video)
    An ILSA project video that explains how to assess the quality of intralingual respoken subtitles. The speaker touches upon different methods of live subtitling, the development of quality assessment models, and automatic quality assessment before explaining the NER model and giving real-life examples of edition and recognition errors.
  • AppTek’s NER Infographic
    A NER infographic that explains the NER model and illustrates how it can be applied to edition and recognition errors.


References:

Eugeni, C. and G. Mack (eds.) (2006). Intralinea, Special Issue on New Technologies in Real Time Intralingual Subtitling. Available online: http://www.intralinea.org/specials/respeaking [last accessed 20 December 2017].

Romero-Fresco, P. (2011). Subtitling through speech recognition: Respeaking. Manchester: Routledge.

Romero-Fresco, P. (2016). Accessing communication: The quality of live subtitles in the UK. Language & Communication, 49, 56–69.

Romero-Fresco, P. (2020). Negotiating quality assessment in media accessibility: The case of live subtitling. Universal Access in the Information Society, 33, Special Issue on Quality of Media Accessibility Products and Services.

Romero-Fresco, P. and J. Martínez (2015). Accuracy rate in live subtitling: The NER model. In J. Díaz-Cintas and R. Baños (eds.) Audiovisual Translation in a Global Context: Mapping an Ever-changing Landscape. London: Palgrave Macmillan, pp. 28-50.

