Simultaneous interpretation: demystifying Sogou Simultaneous Interpretation

Anyone who follows international news may have noticed that at every session of the United Nations General Assembly, many diplomats wear headphones while listening to speeches by delegates from other countries. The technique at work is simultaneous interpretation, which lets listeners follow a speech in another language almost as it is delivered. In the past, simultaneous interpretation relied almost entirely on human interpreters translating on the fly, but as intelligent technology has developed, machine translation is increasingly being applied to this field.

Simultaneous interpretation is not just simple machine translation

Speaking of machine translation, many readers probably already use the online translation services offered by Baidu, Youdao and other companies. We simply open the translation page, enter the phrase to be translated, and select the target language, and the translation is done in moments. If we enter the phrase by voice instead of typing it, we have a simple model of simultaneous interpretation (Figure 1).


Figure 1 Simultaneous interpretation model

However, online machine translation has been criticized for its low accuracy, mechanical phrasing, and semantic ambiguity. Real simultaneous interpretation has two requirements. The first is “simultaneous voice”: the speaker’s voice must be recognized as it is spoken, with the shortest possible delay. The second is “interpretation”: the translation must be as accurate as possible. With the development of artificial intelligence and deep learning, simultaneous interpretation technology is gradually meeting both requirements. Sogou’s “Sogou Simultaneous Interpretation” technology, for example, handles both “simultaneous voice” and “interpretation” well (Figure 2).


Figure 2 “Sogou Simultaneous Interpretation” technology demonstration

Simultaneous voice + interpretation, the secret behind Sogou’s simultaneous interpretation

As mentioned above, simultaneous interpretation technology is not simply a combination of voice input and machine translation. Simultaneous interpretation is all about “simultaneous voice” + “interpretation”, so how does Sogou’s simultaneous interpretation technology achieve these requirements?

First, “simultaneous voice”. As the demonstration during Wang Xiaochuan’s speech shows, as soon as the speaker finishes a sentence, the large screen behind the speaker displays the transcribed text almost at the same moment. This looks like simple speech-to-text conversion, but it is a practical demonstration of Sogou’s speech recognition technology.

To recognize a speaker’s utterances efficiently and in real time, Sogou Simultaneous Interpretation first needs accurate voice breaking, that is, it must determine where each spoken sentence ends so that it can recognize what the user really intends to express. Because speech is continuous, inaccurate breaks easily lead to recognition errors. As a simple example, a sentence like “Xiao Wang beat Xiao Li and won the championship” can have completely different meanings depending on where the speaker pauses: “A: Xiao Wang was beaten, Xiao Li won the championship” or “B: Xiao Wang defeated Xiao Li and won the championship”.

To improve its ability to break speech into sentences, Sogou’s simultaneous interpretation algorithm classifies the speech signal into speech and silence using energy detection and a deep learning model. Silent fragments can be skipped to improve decoding efficiency, while speech fragments can be divided into multiple sentences and recognized in parallel, which greatly improves the efficiency of speech recognition. With the help of the deep learning model, Sogou Simultaneous Interpretation can then place sentence breaks accurately. In the example above, if the preceding text describes how strong Xiao Wang is, the context leads it to choose reading B, in which Xiao Wang defeats Xiao Li and wins the championship (Figure 3).


Figure 3 Speech phrase illustration
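
The energy detection half of this step can be illustrated with a short Python sketch. It is not Sogou's implementation: the function names, the frame length, the threshold and the minimum silence length are all assumptions chosen for illustration, and the deep learning model that the article says works alongside energy detection is left out entirely.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def split_on_silence(samples, frame_len=160, threshold=0.01, min_silence_frames=3):
    """Return (start, end) sample index pairs, one per speech segment.

    A frame counts as speech when its energy exceeds `threshold` (an assumed
    value). A segment is closed once `min_silence_frames` silent frames in a
    row follow it, so short pauses inside a sentence do not split it.
    """
    segments = []
    seg_start = None   # frame index where the current segment began
    silent_run = 0     # consecutive silent frames seen inside a segment
    energies = frame_energies(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy > threshold:
            if seg_start is None:
                seg_start = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end_frame = i - silent_run + 1  # first frame of the silence
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start = None
                silent_run = 0
    if seg_start is not None:
        # Audio ended inside a segment; trim any trailing silent frames.
        end_frame = len(energies) - silent_run
        segments.append((seg_start * frame_len, end_frame * frame_len))
    return segments
```

Each returned pair marks one stretch of speech. Silent stretches never appear in the result, so they can be skipped, and the speech segments could in principle be handed to separate recognizers in parallel, as described above.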

In the speech recognition stage, Sogou Simultaneous Interpretation combines a CLDNN+CTC acoustic model with an RNNLM language model to convert each fragment produced by voice breaking into text. In this way, with the help of “speech break algorithm + acoustic model + RNNLM language model”, it can accurately recognize the user’s speech, thus achieving efficient “simultaneous” input recognition (Figure 4).


Figure 4 Simultaneous input illustration
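
To give a feel for what CTC contributes, here is a minimal Python sketch of greedy CTC decoding: take the best label for every audio frame, merge repeated labels, then drop the blank symbol. The article does not say how Sogou decodes, so greedy decoding is an assumption made here for simplicity, and the function name and toy alphabet are hypothetical. A production system would more likely run a beam search and score candidates with the language model.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Collapse per-frame best labels into a string.

    frame_scores: one list of scores per audio frame, one score per label.
    alphabet: maps label index to character; index `blank` is the CTC blank.
    """
    # Best label for each frame (the "best path").
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = blank
    for label in best_path:
        # Emit a character only when the label changes and is not blank,
        # so "a a" becomes "a" while "a blank a" stays "aa".
        if label != blank and label != previous:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)
```

The blank symbol is what allows a genuinely doubled character to survive the merge step: two identical labels separated by a blank are kept as two characters.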

Next is “interpretation”: once the user’s speech has been recognized, it must be translated at the same time. The key to accurate translation is breaking the text into sentences. Sogou’s text phrase module first removes meaningless words with the help of content smoothing technology so that the sentences read smoothly, and then divides and punctuates the sentences using two methods: rules and models. A bidirectional GRU is used to build the encoder. Through the attention mechanism, the source and target text are aligned, and a sentence-level vector representation for the current time step is generated and sent to the decoder, which outputs the translation word by word. The result is a fluent translation that lets listeners understand the speaker in another language (Figure 5).
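
The smoothing and rule-based splitting steps can be pictured with a toy Python sketch. Everything in it is an assumption for illustration: the filler list, the boundary words and the function names are hypothetical, the example is in English rather than Chinese, and the model-based splitting the article mentions is not shown.

```python
FILLERS = {"um", "uh", "er", "hmm"}      # hypothetical filler-word list
BOUNDARY_WORDS = {"but", "so", "then"}   # hypothetical clause-boundary rule


def smooth(tokens):
    """Drop filler words and immediate repetitions of the same word."""
    cleaned = []
    for tok in tokens:
        low = tok.lower()
        if low in FILLERS:
            continue
        if cleaned and cleaned[-1].lower() == low:
            continue
        cleaned.append(tok)
    return cleaned


def split_clauses(tokens):
    """Start a new clause in front of each boundary word (rule-based only)."""
    clauses = [[]]
    for tok in tokens:
        if tok.lower() in BOUNDARY_WORDS and clauses[-1]:
            clauses.append([])
        clauses[-1].append(tok)
    return [clause for clause in clauses if clause]
```

Run on the tokens of "um I I think uh we we should go but it is late", `smooth` leaves "I think we should go but it is late", and `split_clauses` then cuts that in front of "but".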


Figure 6 Illustration of simultaneous interpretation
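
The attention step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the article does not say which scoring function Sogou uses, so dot-product scoring is assumed here, the function names are hypothetical, and the encoder states are plain lists standing in for the outputs of the bidirectional GRU.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    decoder_state: the decoder's current hidden vector.
    encoder_states: one vector per source word (stand-ins for bidirectional
    GRU outputs); each must have the same length as decoder_state.
    """
    # Assumed scoring function: dot product of decoder and encoder vectors.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of encoder states, one entry per dimension.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context
```

The weights play the role of the alignment between source and target text, and the context vector is the representation for the current time step that gets passed to the decoder before it emits the next word.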

Simultaneous interpretation, making our communication more convenient

As the country opens up further to the outside world, both foreign trade (such as collaboration with foreign partners) and ordinary users’ contact with people abroad (such as Skype conversations with foreign netizens) require us to communicate with speakers of other languages, and the language barrier makes that communication extremely difficult.

However, with the development of technologies like Sogou’s simultaneous interpretation, we can communicate with foreign users and colleagues without barriers, which greatly improves the efficiency of our communication. For example, in companies with branches abroad, employees from different countries can read and understand slides written in one language on the conference room projector. Internet users who wish to learn other languages can take remote online classes with the help of simultaneous interpretation: even if they are in China, they can still follow a foreign teacher’s presentation despite the limits of their native language, which greatly improves the efficiency of online learning.

Seeing the potential of simultaneous interpretation, the major IT giants are now developing their own simultaneous interpretation technologies. Google, for example, is developing neural network machine translation technology and using it to provide instant translated subtitles for YouTube videos (Figure 7).


Figure 7 Google neural network machine translation

As AI technology develops, these simultaneous interpretation technologies will bring more convenience to our communication with the world (Figure 8).


Figure 8 Tencent simultaneous interpretation
