AI-Powered Voice Acting Predictions by 2028: Obstacles and Progress Insights
A Brave New World for Video Dubbing
In today's connected digital world, video dubbing goes beyond mere translation: it's about universal understanding, inclusivity, and stories that resonate across languages. At Kapwing, we've been immersed in AI dubbing for the past year. Here's where we stand today, the roadblocks still ahead, and our predictions for AI dubbing by 2028.
The AI Dubbing Landscape in 2025
A few years back, high-quality dubbing demanded expensive voice actors, elaborate production processes, and long turnaround times. Today, thanks to rapid advancements in large language models (LLMs), text-to-speech synthesis, and voice cloning, creators can dub videos into multiple languages with near-human voices in minutes, and at a fraction of the cost.
The AI Dubbing Process
AI dubbing is backed by intelligent automation. Inside our AI Dubbing tool at Kapwing, we've broken down the process into five steps (sketched in code below):
- Transcription: Converting the video's audio into text with time-stamps.
- Translation: Using an LLM to accurately translate the text while preserving the tone and timing.
- Text-to-Speech: Producing synthetic speech in the target language through voice cloning.
- Playback Adjustments: Adjusting video and audio timing for natural pacing.
- Collaboration: Allowing teams to refine translations, tweak timing, and perfect the speech before export.
Kapwing's editor makes collaboration effortless: with native speakers, stakeholders, and reviewers on board, refining the dubbed video becomes a team effort before it hits publish.
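To make the flow concrete, here's a minimal, self-contained Python sketch of the five steps. The Segment structure and the transcribe/translate/synthesize stubs are illustrative placeholders, not Kapwing's actual pipeline or API; a production system would call real speech-to-text, LLM, and voice-cloning services at each step.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float            # segment start time in seconds
    end: float              # segment end time in seconds
    speaker: str
    text: str               # transcribed source-language text
    translation: str = ""   # filled in by the translation step
    speed: float = 1.0      # playback-speed factor applied to the dubbed clip

def transcribe(video_path: str) -> list[Segment]:
    # Placeholder for a real speech-to-text call that returns timestamped text.
    return [Segment(0.0, 2.5, "speaker_1", "Welcome to the show.")]

def translate(text: str, target_lang: str) -> str:
    # Placeholder for an LLM call prompted to preserve tone and rough length.
    return f"[{target_lang}] {text}"

def synthesize(text: str, speaker: str) -> float:
    # Placeholder for voice-cloned text-to-speech; returns a fake clip length in seconds.
    return 0.4 * len(text.split())

def dub(video_path: str, target_lang: str) -> list[Segment]:
    segments = transcribe(video_path)                        # 1. Transcription
    for seg in segments:
        seg.translation = translate(seg.text, target_lang)   # 2. Translation
        clip_len = synthesize(seg.translation, seg.speaker)  # 3. Text-to-speech
        seg.speed = clip_len / (seg.end - seg.start)         # 4. Playback adjustment
    return segments                                          # 5. Hand off for team review

print(dub("demo.mp4", "es"))
```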
How accurate is AI dubbing today?
We evaluate our AI dubbing with human reviewers every two weeks across 13 languages. Our current quality score stands at 7.2 out of 10, with plenty of room for improvement, but we're hopeful that first-pass generations will reach 85%+ accuracy by year-end.
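As a rough illustration of how a headline number like that comes together, per-clip ratings from human reviewers can be averaged per language and then overall. The ratings below are made-up example data, not our actual results.

```python
# Made-up evaluator ratings (1-10 per dubbed clip), grouped by target language.
ratings = {
    "es": [8, 7, 7],
    "fr": [7, 8, 6],
    "ja": [6, 7, 8],
}

# Average per language, then average the language means into one overall score.
per_language = {lang: sum(scores) / len(scores) for lang, scores in ratings.items()}
overall = sum(per_language.values()) / len(per_language)
print(f"overall quality score: {overall:.1f} / 10")
```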
Advancements that have steered this improvement include:
- Engineered prompts for improved translations
- Real-time adjustment of speech pace
- Enhanced multi-speaker recognition
- Smart playback speed changes to make voiceovers feel natural (sketched below)
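The last of these deserves a closer look. Below is a minimal sketch of how a playback-speed factor might be chosen: speed the dubbed clip up or down to fit the original segment, but clamp the factor so the speech still sounds natural. The 0.85-1.15 range is an illustrative assumption, not a fixed Kapwing setting.

```python
def playback_speed(dub_duration: float, slot_duration: float,
                   lo: float = 0.85, hi: float = 1.15) -> float:
    """Speed factor for the dubbed clip, clamped to a natural-sounding range."""
    raw = dub_duration / slot_duration   # > 1.0 means the dub runs longer than its slot
    return max(lo, min(hi, raw))

# A 6.5 s dub must fit a 5.0 s slot: the raw factor of 1.3 is capped at 1.15,
# and any remaining overrun is handled by stretching the video segment instead.
print(playback_speed(6.5, 5.0))  # 1.15
```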
These improvements have been driven by rapidly growing demand for video localization tools, which have found particular favor in the education, marketing, and news sectors, where global reach is crucial.
The Obstacles Yet Unvanquished
Despite our progress, AI dubbing still grapples with four substantial challenges:
- Timing Mismatches: Phrase length and cadence differ between languages. We mitigate this by stretching video segments and fine-tuning translations.
- Emotional Expression: Current text-to-speech voices often lack emotional depth. We're exploring multiple voice clones, voice-to-voice modeling, and fine-grained inflection tuning.
- Translation Precision: Idioms, slang, and proper nouns can lead to mistranslations. We're building tools like glossaries and inline editors to help creators fix these issues quickly (see the glossary sketch after this list).
- Speaker Identification: Assigning the right voice to each speaker when the audio changes tone or is interrupted is challenging. We offer manual overrides and continually improve our detection models.
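As an example of the glossary idea mentioned above, here's a minimal sketch of one common approach: protect proper nouns and fixed terms with placeholder tokens before translation, then restore the approved renderings afterwards. The glossary entries and the translate() stub are illustrative only, not Kapwing's implementation.

```python
# Hypothetical glossary: source term -> approved rendering in the target language.
GLOSSARY = {"Kapwing": "Kapwing", "closed captions": "subtítulos"}

def protect(text: str) -> tuple[str, dict[str, str]]:
    """Swap glossary terms for placeholder tokens the translator won't touch."""
    mapping = {}
    for i, term in enumerate(GLOSSARY):
        token = f"__TERM_{i}__"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = GLOSSARY[term]
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the approved renderings back after translation."""
    for token, rendering in mapping.items():
        text = text.replace(token, rendering)
    return text

def translate(text: str) -> str:
    # Stand-in for the real LLM translation call.
    return text.upper()

protected, mapping = protect("Edit the clip in Kapwing before export.")
print(restore(translate(protected), mapping))  # EDIT THE CLIP IN Kapwing BEFORE EXPORT.
```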
We're actively tackling these hurdles through continuous research and product refinement. Although the technology still has a way to go, it's exhilarating to witness the growing possibilities as LLMs and related technologies evolve.
Cost Trends: The Now and The Future
AI dubbing costs today run from zero to $100 per finished hour, depending on the language, quality, and control. The most expensive part of the process is the text-to-speech generation, which can demand multiple voice clones. That's already a fraction of traditional dubbing costs, which can run into thousands per hour.
By 2028, we project costs to drop another 50-70% due to:
- More efficient models
- Scalable infrastructure
- Commoditized voice libraries
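As a quick back-of-the-envelope using the figures above (today's roughly $100 upper bound per finished hour and the projected 50-70% drop):

```python
today_high = 100.0  # today's upper-bound cost per finished hour, in USD
for drop in (0.50, 0.70):
    print(f"{int(drop * 100)}% drop -> ${today_high * (1 - drop):.0f} per finished hour")
# 50% drop -> $50 per finished hour
# 70% drop -> $30 per finished hour
```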
Lower costs will open the door not only to high-volume users like large media companies but also to educators and creators with limited budgets who have been priced out of professional dubbing services.
2028: The Future for AI Dubbing
As innovation in this field accelerates, we expect the next few years to bring radical transformation:
- Universal Dubbing: Dubbed audio will become as standard as closed captions are today.
- Tailored Vocal Signatures: Users will create personalized voice clones that emulate accent, tone, and age.
- Live Dubbing: Real-time streams will support multilingual dubbing with minimal latency.
- Universal Review Tools: Editors will automatically flag potential errors in translation, tone, and synchronization.
- Cultural Editing Layers: AI will adapt jokes, references, and tone for each market with ease.
We believe that AI dubbing will not just become more accessible but more culturally intelligent. Join us in shaping a future where every story can be heard in every language with Kapwing's AI Dubbing tool.