Product introduction:

Tencent Cloud Text To Speech (TTS) is an AI voice solution built by Tencent to address three pain points: inefficient voice content production, stiff human-computer interaction, and the high cost of customized voices. It positions itself as "the core engine of enterprise voice interaction and content production." The product relies on Tencent's self-developed Text To Speech model, which integrates emotional modeling and prosody optimization, to move beyond the "mechanical sound" of traditional TTS and produce speech that is human-like, natural, and smooth. It also offers a three-tier service system of "basic synthesis, sound reproduction, and timbre transformation" that serves both general voice needs (such as notification broadcasts) and in-depth enterprise customization (such as an exclusive timbre for a digital human). It integrates quickly into various terminals through an API/SDK, forming a closed loop of "text input, voice output, scenario deployment," and is a key tool for voice interaction in fields such as intelligent customer service, audio content, and smart hardware.

Core functions:

1. Three core service modules

  1. Basic Text To Speech

    • Wide timbre selection: Provides 100+ high-quality timbres, covering both general scenarios and industry scenarios:
      • General-purpose timbres: Standard male, female, and children's voices (such as "Yun Xiaoning" and "Yun Xiaoduo"), suitable for basic scenarios such as notification broadcasts and smart assistants;
      • Stylized timbres: Emotional voices (cordial, serious, lively) and character voices (anime characters, dialect styles). For example, a customer service scenario can use a "cordial female voice" to build goodwill with users, and an audio novel can use character voices to make the story more immersive;
      • Multilingual and dialect support: Covers Chinese (Mandarin), English, Cantonese, Sichuan dialect, Northeast dialect, and more, serving users in different regions (such as Cantonese broadcasts for users in Guangdong);
    • Flexible parameter configuration: Voice attributes can be customized through SSML (Speech Synthesis Markup Language), including volume (0-100 dB), speed (0.5-3x), and pitch (0.5-2x). Pauses and stress can also be set so that the voice fits the scenario (for example, a steady pace for news broadcasts and a slightly faster pace for promotional notices to create a sense of urgency);
    • Multi-format output: Supports offline audio files (MP3, WAV) and real-time audio streams (PCM, AAC), fitting scenarios such as telephone customer service (real-time streams), audiobooks (offline files), and smart hardware (low-bitrate audio).
  2. Sound reproduction (custom timbre)

    • Low-threshold customization process: Provide a recording of a real human voice of a specified length (for example, 10 minutes of clean recording). The AI model learns the timbre characteristics and generates an exclusive custom timbre in 3-7 days, avoiding the large amount of material and high cost that traditional customization requires;
    • High fidelity and security: The reproduced timbre closely resembles the real speaker and supports emotional transfer (the reproduced timbre can speak with different emotions). A copyright protection mechanism is also provided: the custom timbre is licensed only to the enterprise that commissioned it, to prevent abuse;
    • Core application scenarios: Exclusive voices for digital humans (such as a corporate virtual spokesperson), brand IP voices (such as an anime IP character), and personalized assistants (such as a proprietary voice for smart hardware).
  3. Real-time timbre transformation

    • Real-time voice processing: Transforms live input speech (such as a person speaking) into the target timbre with a delay as low as 100 ms, fitting real-time interactive scenarios (such as converting agents' voices into a unified brand timbre during customer service calls to improve brand consistency);
    • Multi-scenario adaptation: Can be applied to intelligent customer service (a unified customer service voice style), live streaming (voice changing for hosts), and games (real-time character voice changing). No audio files need to be generated in advance, so it responds flexibly to real-time needs.
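
The SSML parameter configuration described under Basic Text To Speech can be illustrated with a minimal sketch. It builds generic W3C SSML markup (`<speak>`, `<prosody>`, `<break>`) with Python's standard library. The helper name `build_ssml` is hypothetical, and which tags and attribute values Tencent Cloud TTS actually accepts is an assumption here (the source only states that SSML is supported), so check the official SSML documentation before use.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, rate: str = "medium", volume: str = "medium",
               pitch: str = "medium", pause_ms: int = 0) -> str:
    """Wrap text in generic W3C SSML, optionally followed by a pause.

    build_ssml is a hypothetical helper. The rate/volume/pitch labels are
    W3C SSML keywords; the values accepted by a given TTS service may differ.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "volume": volume, "pitch": pitch}
    )
    prosody.text = text
    if pause_ms > 0:
        # <break> inserts a silence of the given duration after the text
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


# Steady pace for a news broadcast, with a short pause afterwards
news = build_ssml("Here are today's headlines.", rate="medium", pause_ms=300)
# Faster and louder for a promotional notice
promo = build_ssml("The sale ends tonight!", rate="fast", volume="loud")

print(news)
print(promo)
```

Building the markup with an XML library rather than string concatenation means characters such as `&` and `<` in the input text are escaped automatically.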

2. Technical advantages and experience guarantee

  • High realism: Deep learning models optimize speech prosody and emotional expression, so the synthesized speech sounds natural, avoids the "mechanical sound" problem, and is well accepted by users;
  • High stability: Supports highly concurrent calls (peak QPS can reach 100,000+) with 99.9% service availability, suitable for high-frequency scenarios such as large-scale event notifications and customer service during e-commerce promotions;
  • Low integration cost: Provides detailed API documentation, SDK sample code, and debugging tools (such as an online Text To Speech demo). Developers can complete integration in 10 minutes without needing to understand the technical details of Text To Speech in depth.
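
To put the 99.9% availability figure in concrete terms, the short calculation below converts an availability percentage into the maximum downtime it permits. This is plain arithmetic on the number quoted above, not a statement of Tencent Cloud's SLA terms; the function name is illustrative.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Maximum downtime (in hours) permitted in a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24
HOURS_PER_30_DAYS = 30 * 24

yearly_hours = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly_minutes = allowed_downtime_hours(99.9, HOURS_PER_30_DAYS) * 60

print(f"{yearly_hours:.2f} hours per year")            # 8.76 hours per year
print(f"{monthly_minutes:.1f} minutes per 30 days")    # 43.2 minutes per 30 days
```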

Typical application scenarios:

| Scenario type | Core applications | Implementation logic and value |
| --- | --- | --- |
| Intelligent customer service and assistants | Telephone customer service IVR, smart assistant voice responses | The customer service system connects to TTS to convert text scripts (such as "Your order has been shipped") into voice, providing 24/7 automatic responses and reducing human agent costs by 30%+ |
| Audio content production | Audio novels, paid knowledge audio, children's stories | Authors or platforms batch-convert text content (such as novel chapters and course scripts) into voice to generate audio products. Production is 10 times more efficient than recording with a human narrator, and the cost is lower |
| Voice broadcast notifications | Government notices, corporate announcements, traffic announcements | Government platforms and enterprises convert notification texts (such as vaccination reminders and meeting notices) into voice and push them by phone and app, reaching users of all ages (especially elderly users) |
| Smart hardware | Smart speakers, in-car voice, smart home | Hardware devices use TTS to convert text (such as weather reminders and device status) into voice output, enabling screenless interaction and improving ease of use |
| Digital human interaction | Virtual anchors, enterprise digital humans | A digital human is configured with a custom reproduced timbre; its dialogue text is converted into speech in real time and synchronized with lip movements to make the interaction more lifelike (such as a digital human presenting product information in a live stream) |
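
For the batch audio production scenario, long text such as a novel chapter usually has to be split into smaller requests before synthesis. The sketch below shows one way to do this at sentence boundaries. The function name `split_for_tts` and the 150-character default are assumptions for illustration only; the source does not state the service's actual per-request text limit, and the synthesis call itself is omitted.

```python
import re
from typing import List


def split_for_tts(text: str, max_chars: int = 150) -> List[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    max_chars=150 is an assumed limit, not a documented one.
    """
    # Break after Western punctuation followed by whitespace, or directly
    # after Chinese sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s and s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that exceeds the limit on its own
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Sentences are rejoined with a space (suits space-delimited languages)
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chapter = "It was late. The rain had stopped! Who was at the door? Nobody answered."
chunks = split_for_tts(chapter, max_chars=40)
for index, chunk in enumerate(chunks, start=1):
    print(index, chunk)
```

Each chunk would then be sent as a separate synthesis request and the resulting audio segments concatenated in order.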

Applicable groups and industries:

  • Enterprise IT and development teams: Technical teams in industries such as finance (bank customer service), e-commerce (promotion notices), government affairs (public service broadcasts), and education (audio courses) that need to integrate Text To Speech into business systems;
  • Content creators and organizations: Independent media creators, audiobook platforms, and paid-knowledge organizations that need to mass-produce audio content at low cost and avoid the time and expense of recording with human narrators;
  • Smart hardware manufacturers: Makers of smart speakers, in-vehicle equipment, and smart home products that need voice output in their hardware to enable screenless interaction;
  • Small, medium, and micro enterprises and individual developers: Small businesses without technical teams (such as individual merchants sending order notifications) and individual developers (such as those building children's story apps) can add voice features through simple API access, lowering the development threshold.

Unique advantages:

  1. "General-purpose + custom" coverage of all scenarios: From free basic timbres to highly customized reproduced timbres, it meets different budgets and scenario needs, unlike competing products that offer only general-purpose timbres;
  2. Deep integration with the Tencent ecosystem: It connects seamlessly with Tencent Cloud intelligent customer service, Digital Human (IVH), Speech Recognition (ASR), and other products to form a complete human-computer interaction chain of "speech recognition, semantic understanding, text to speech," reducing cross-platform integration costs;
  3. Compliance and security: All timbres have passed copyright review, custom timbres come with exclusive authorization, and the service complies with domestic data security regulations (such as local processing of voice data) to reduce legal risk;
  4. High cost-effectiveness: Basic timbres come with a sufficient free quota, and paid resource packages have a low unit price, so small, medium, and micro enterprises and individuals can use the service at low cost. Resource packages can also reduce costs for large enterprises in high-concurrency scenarios.
Disclaimer: Tool information is based on public sources for reference only. Use of third-party tools is at your own risk. See full disclaimer for details.