When AI meets sign language

Opinion
Even as streaming gathers pace, accessibility remains an issue. However, AI could help bridge this gap and make streaming available to more people.
Streaming has revolutionised how people watch TV — yet accessibility remains an issue. According to research by Scope, 20% of disabled people have cancelled a streaming subscription because of accessibility issues.
People who are deaf or hard of hearing face unique barriers in accessing streamed content. Captions are not always available, and when they are, they are often incomplete or of poor quality, which degrades the viewing experience.
Also, some individuals with hearing loss, particularly those who communicate primarily through sign language, may struggle to process written text at the speed it is displayed on screen.
While captions and subtitles may be helpful for some, they lack the depth and richness of speech. Sign language, on the other hand, is more expressive and can convey the nuances of speech, including tone and emotion, that are easily missed in subtitles. Yet sign-language interpreters are rarely seen in streamed content, largely because of the associated cost and technical difficulties.
However, recent developments have shown that AI technology could bridge the accessibility gap for deaf and hard-of-hearing individuals.
Plugging the accessibility gap
AI is already being used to transcribe speech to written text to generate automatic captions in real time. But what is perhaps even more interesting is that it can also be leveraged to incorporate signing avatars into video streams.
This approach, under development by Bitmovin, integrates AI- and machine-learning-driven natural language processing with 3D animation technologies to convert text representations of American Sign Language (ASL) poses into animated client-side avatars.
Under this concept, a text-based representation of sign-language poses is created and packaged as an additional subtitle track. The video player can then recognise and play this track using a customisable 3D avatar overlay, which signs the dialogue alongside the main video content. Initial experiments used the Hamburg Notation System (HamNoSys), which transcribes the individual components of signs, such as hand shape, orientation, location and movement.
The avatar's animation is driven by timing and content cues embedded in the sign-language subtitle track. As long as a video player can access these cues, just as it does for standard subtitles, it can be extended to overlay a signing avatar.
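To make the idea concrete, here is a minimal sketch, not Bitmovin's actual implementation, of how a web player could consume such a track. It assumes the sign-language cues arrive as an ordinary hidden text track whose payload is a HamNoSys-style pose string; the track label and the AvatarRenderer interface are hypothetical stand-ins for whatever 3D animation layer the player integrates.

```typescript
// Sketch only: wire a sign-language text track to an avatar renderer.
// The "ASL avatar" label and AvatarRenderer interface are assumptions.
interface AvatarRenderer {
  // Animate the pose sequence for one cue, synchronised to the video timeline.
  play(poseNotation: string, startTime: number, endTime: number): void;
}

function attachSigningAvatar(video: HTMLVideoElement, avatar: AvatarRenderer): void {
  // Locate the track carrying sign-language cues; in practice it would be
  // identified by manifest metadata rather than by label.
  const track = Array.from(video.textTracks).find((t) => t.label === "ASL avatar");
  if (!track) {
    return; // no sign-language track available for this asset
  }

  // "hidden" delivers cue events without rendering the cues as captions.
  track.mode = "hidden";

  track.addEventListener("cuechange", () => {
    const cues = track.activeCues;
    if (!cues) return;
    for (const cue of Array.from(cues)) {
      const vttCue = cue as VTTCue;
      // The cue text carries the pose notation; its timing drives the avatar.
      avatar.play(vttCue.text, vttCue.startTime, vttCue.endTime);
    }
  });
}
```

Because the avatar is rendered on the client, its appearance, size and position become player-level settings rather than changes to the stream itself.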
Technical advantages
Treating sign language as its own subtitle track, much like adding a foreign-language subtitle, means it can be incorporated into existing workflows without requiring major changes to video players or delivery mechanisms.
Whether content is being streamed via DASH (Dynamic Adaptive Streaming over HTTP), HTTP Live Streaming (HLS) or another common protocol, sign-language tracks can be delivered alongside the video using standard streaming formats. This makes it easier to bring sign-language support to a wide range of platforms and devices without major upheaval.
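As an illustration, the sketch below embeds a fragment of a DASH manifest in TypeScript and shows how a player might detect a sign-language track via the Role descriptor, which includes a "sign" value in the urn:mpeg:dash:role:2011 scheme. The "ase" language tag, the representation id and the payload format are assumptions for this sketch, not an established profile.

```typescript
// Illustrative only: an AdaptationSet flagged as a sign-language text component,
// plus the kind of check a player could use to find it. Values are assumptions.
const mpdExcerpt = `
  <AdaptationSet contentType="text" mimeType="application/ttml+xml" lang="ase">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="sign"/>
    <Representation id="asl-avatar-track" bandwidth="1000"/>
  </AdaptationSet>`;

// Pick out adaptation sets carrying the "sign" role, exactly as a player
// already does for "caption" or "subtitle" roles.
const doc = new DOMParser().parseFromString(mpdExcerpt, "application/xml");
const signTracks = Array.from(doc.getElementsByTagName("AdaptationSet")).filter((set) =>
  Array.from(set.getElementsByTagName("Role")).some(
    (role) =>
      role.getAttribute("schemeIdUri") === "urn:mpeg:dash:role:2011" &&
      role.getAttribute("value") === "sign"
  )
);
console.log(`Sign-language tracks found: ${signTracks.length}`);
```

Because detection hinges only on manifest metadata, the same approach can be wired into any player that already parses subtitle tracks.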
Another major advantage is that it eliminates the need for additional video channels or picture-in-picture (PiP) windows, helping to reduce complexity and cost. Whereas updating PiP content means re-recording and re-encoding the video, a sign-language subtitle track can be edited and re-uploaded quickly.
Further considerations
While AI offers exciting possibilities for improving accessibility through automated sign-language generation, there are still important issues to be addressed.
These include ethical questions around who owns the training data used for translation, and the practical challenge of assembling sufficient datasets for every sign language and for the dialects within those languages.
Another key challenge lies in the use of “gloss”, a simplified form of transcription that provides a word-for-sign translation. Glossing lacks the grammatical depth of sign languages like ASL, which have their own syntax, structure and rules. As a result, gloss-based systems often produce translations that are literal and linguistically incomplete.
Similarly, while systems like HamNoSys are valuable tools for analysing sign language linguistically, they fall short when it comes to driving natural, fluid animation. For instance, HamNoSys does not currently support transitions between signs or overlapping gestures — both of which are essential for capturing the expressive flow of real-life signing.
Quality is another critical factor. Are avatars of high enough quality for the user, both in resolution and signing accuracy? For signing avatars to feel authentic and easy to follow, the rhythm and pacing of gestures must align closely with the spoken content.
This includes not just hand and arm movements, but also facial expressions, which play a vital role in conveying tone, emotion and grammatical cues in sign language. Media organisation NHK Group, for instance, has developed KiKi, a photorealistic signing digital avatar to help address these issues.
What’s next?
While this approach will not replace real-life signers for high-value live events, it may help to fill the gaps, enabling whole back catalogues to be signed where doing so would otherwise be unfeasible. Bringing this kind of technology into video streaming opens up an entirely new way to deliver sign language.
As the approach evolves, there’s room to explore alternative representations beyond HamNoSys that might better reflect the natural flow and grammar of sign language and lead to more accurate subtitle generation.
To generate more accurate and meaningful signing, future systems will likely need to draw from a wider range of inputs, including audio and video metadata, to better capture context and intention.
Fortunately, this is well within the realm of near-term feasibility, and it will be exciting to watch how this space evolves.
Stefan Lederer is CEO and co-founder of Bitmovin