Foundational models are the answer for TV-specific AI-powered applications, says Bitmovin
An important debate in the TV tech market is whether vendors should rely on foundational models to build and improve applications, or train their own.
Jacob Arends, senior product manager, AI scene analysis & playback at Bitmovin, believes foundational models will be the best long-term bet.
His company is harnessing AI to improve applications ranging from video scene analysis to streaming quality assurance.
He says foundational models are future-proofed: although they are not developed with a specific application in mind, they are improving so fast that their generic intelligence will outpace legacy domain-specific models that were trained for a narrow purpose and initially delivered superior performance.
“Foundational models are being developed at such speed that you could not fathom one year ago what they can do today,” Arends (pictured) declares.
“Even a couple of years ago, it was still hard to bring video, audio and text together in multi-modal models.”
He argues that ‘train your own’ domain-specific models demand more time, research and expertise to develop. They also come with longer-term maintenance and talent costs.
Content metadata enrichment – which helps improve content search and recommendation and contextual ad insertion – is one application that exemplifies the choice between foundational models, such as Anthropic's Claude or those available through Amazon Bedrock, and vendor-developed domain-specific models, the Bitmovin exec says.
He acknowledges the performance of the latter but reckons that using foundational models – with the freedom to switch between them – will yield more detailed outputs.
“The challenge is to balance speed and adaptability against the investment, time and research expertise required to build something unique,” he explains.
NAB AI showcase
Arends will be taking his message to NAB Show in Las Vegas next month, where Bitmovin is showcasing its recent AI-powered innovations.
These include scene analysis, which creates metadata that, in turn, can be used for recommendations or to surface key moments in a video.
Bitmovin provides the Bitmovin Player to decode and present streaming video on devices. In a recent innovation, consumers can navigate a programme or movie using voice commands such as ‘Replay the car chase scene’ or ‘Jump to the interview’.
Bitmovin’s AI scene analysis can be leveraged by a variety of AI agents to support different applications, one of which is the ‘summarisation companion’.
This feature creates personalised catch-up summaries of a show when a viewer requests one, for example by asking “What did I miss?”
Bitmovin is also using NAB to demonstrate its highlight clipping agent. This can take long-form horizontal video (as is typical for TV) and convert it into the short-form vertical video used on social media and video-sharing platforms like TikTok.
Back-office teams for streaming services can use natural language prompts to automatically generate short-form clips at scale. The tool detects highlights, then reframes the video for vertical screens while preserving key actions with subject-aware cropping.
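The geometric core of the reframing step can be sketched simply: given a detected subject's horizontal position, compute a 9:16 crop window inside a 16:9 frame, keeping the subject centred where possible. The function and parameter names below are illustrative assumptions, not Bitmovin's tooling; subject detection itself is assumed to have already happened upstream.

```python
# Illustrative sketch of subject-aware vertical reframing:
# a full-height 9:16 crop, centred on the subject and clamped
# so it never leaves the source frame.

def vertical_crop(frame_w: int, frame_h: int, subject_x: int) -> tuple[int, int]:
    """Return (left, right) pixel bounds of a 9:16 crop."""
    crop_w = round(frame_h * 9 / 16)            # full height, 9:16 width
    left = subject_x - crop_w // 2              # centre crop on subject
    left = max(0, min(left, frame_w - crop_w))  # clamp inside the frame
    return left, left + crop_w

# 1920x1080 source, subject detected near the right edge:
# the crop slides right but stops at the frame boundary.
print(vertical_crop(1920, 1080, 1800))
```

Per-frame crops like this would then be smoothed over time so the virtual camera does not jitter as the subject moves.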
Another use for Bitmovin’s AI scene analysis is context-aware ad placement.
Solutions like this are especially useful where content was created for ad-free environments but is now being monetised with ads: they avoid interrupting key moments when inserting ad breaks.
Metadata can also be used to align scenes with IAB taxonomies, leading to more ads that are relevant to the content shown.
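Mechanically, this alignment amounts to mapping free-form scene labels onto standardised IAB category codes that ad systems understand. The tiny mapping below is an illustrative subset only; a production system would use the full IAB Tech Lab Content Taxonomy and richer matching than an exact-label lookup.

```python
# Minimal sketch of aligning scene labels with IAB categories.
# The mapping is a small illustrative subset, not a complete taxonomy.

IAB_MAP = {
    "cooking": ("IAB8", "Food & Drink"),
    "football": ("IAB17", "Sports"),
    "car chase": ("IAB2", "Automotive"),
}

def iab_categories(scene_labels: list[str]) -> list[tuple[str, str]]:
    """Map free-form scene labels to IAB categories, skipping unknowns."""
    return [IAB_MAP[label] for label in scene_labels if label in IAB_MAP]

print(iab_categories(["cooking", "sunset"]))  # [('IAB8', 'Food & Drink')]
```

Once scenes carry IAB codes, an ad server can match ad inventory to the surrounding content without needing access to the video itself.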
Bitmovin will also showcase its AI observability agent and MCP (Model Context Protocol) server in Las Vegas, which together bring AI assistance to video stream monitoring.
Billions of data points are analysed and users can interrogate the data using natural language.
Bitmovin also has AI-powered video streaming testing infrastructure for the Bitmovin Player. This supports mass-scale testing of video streaming performance and user experience across 180+ device types, including set-top boxes, smart TVs and mobile phones.
AI analysis is used to identify the causes of on-device playback issues faster and with more accuracy.
