Voice cloning has become the new favorite of scammers and meme makers - Lauri Juvela returned to academia because synthetic speech needs to be discussed

Since the beginning of the year, TikTok and Instagram have been full of humorous videos and songs made with artificial voices. Voice cloning is the latest addition to the artificial intelligence product family, whose possibilities are being revealed day by day.

For some reason that will surely puzzle future cultural researchers, voice cloning made its final breakthrough with the help of presidents playing video games.

The humorists’ idea is simple: serious real-life figures spouting meme references and swear words while playing video games.

Another type of social media content that has drawn a lot of attention this year is a very particular kind of artificial intelligence remix of hit songs. With voice cloning technology, any artist can be made to cover another artist’s songs.

A couple of weeks ago, a two-minute song appeared on the video app TikTok in which Drake performed with his compatriot, pop star The Weeknd. The caption revealed that both artists had been artificially synthesized.

The presidents’ gaming videos and the AI remixes, however, are probably just the beginning. Voice-producing artificial intelligence applications of many kinds have now hit the ground running: there are entire articles on the Internet comparing different Trump voice clones, and AI applications for reworking music now cover artists’ voices as well as compositions.

Speech produced by artificial intelligence is now seriously interesting.

20 seconds of speech is enough for cloning

Behind the catchy meme fun is a technological leap. Many may feel déjà vu, recalling the sudden popularity of the DALL-E image generator last summer. With artificial intelligence models that produce images from text, too, the focus was at first largely on humor.

– It is important to understand that voice AI models are not samplers: they do not copy, cut and paste the sound fed into them.

The speaker’s voice is fed to the voice AI for cloning. However, the model does not copy the recording. Instead it builds a probabilistic model from it, as if analyzing how that person’s speech might sound when reading different texts. The models’ synthetic “speech” is itself generated as a digital signal by the synthesizer on the basis of the acoustic model, so the end result contains no actual human voice.

Like artificial intelligence models that produce images and text, voice cloning models are generative. This means that they can reproduce the random variation found in the data. Random variation is key to creating natural-sounding speech.

– If a person repeats the same sentence five times, it sounds a little different each time. This needs to be modeled if you want believable voice cloning, says Juvela.
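The point about random variation can be illustrated with a small toy sketch. It is not how any real voice cloning model works: the per-syllable pitch statistics in `PITCH_MODEL`, and the function names, are invented purely for illustration. A "model" that always returns its average produces the same contour every time, while one that samples around that average produces five slightly different takes of the same sentence.

```python
import random

# Hypothetical "learned" statistics: (mean pitch in Hz, spread) per syllable.
# These numbers are invented for illustration only.
PITCH_MODEL = [(120.0, 6.0), (135.0, 8.0), (128.0, 5.0), (110.0, 7.0)]


def speak_deterministic(model):
    """Always return the mean pitch, so every repetition is identical."""
    return [mean for mean, _ in model]


def speak_generative(model, rng):
    """Draw each pitch from a normal distribution around the learned mean."""
    return [rng.gauss(mean, std) for mean, std in model]


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
takes = [speak_generative(PITCH_MODEL, rng) for _ in range(5)]

for i, take in enumerate(takes, start=1):
    print(f"take {i}:", [round(pitch, 1) for pitch in take])
```

Each printed take follows the same overall contour but differs in detail, which is the kind of variation Juvela describes; real systems model far richer features than a single pitch value per syllable.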

Where voice cloning previously required long stretches of recorded speech, today a dozen sentences can be enough. The most advanced models need only 20 seconds of human speech to create a reasonably believable voice clone.

The difference between a real person and a voice clone can still be heard, at least in some meme videos that use cheap voice cloning models. Finnish-language meme videos are not to be expected right away, as most of the publicly available voice cloning models have been trained to speak widely spoken languages such as English.

In addition to meme videos, voice cloning can be used for anything, both good and bad, says Juvela.

A voice prosthesis could restore human speech

Voice cloning could help people in many different ways. Juvela cites, for example, diseases that cause people to lose their voice or the ability to speak.

– A voice prosthesis means cloning a person’s voice so that they can continue to use it with the help of a computer once they can no longer speak themselves, Juvela envisions.

One significant application of AI voices would of course be a virtual assistant. A naturally conversing assistant connected to a text-producing artificial intelligence model could interpret text on request and handle less important conversations on a person’s behalf.

– However, virtual assistants are already being developed by such large companies that it does not make sense for us at Aalto to focus on the issue.

Juvela’s own research aims to improve the basic characteristics of voice models: making them more energy efficient, lowering their latency, that is, the delay in transferring information, and developing real-time functions.

Lauri Juvela in the acoustic studio. According to Juvela, websites like GitHub, through which artificial intelligence code is distributed freely for anyone interested to use and develop, have played a really important role in the development of voice AI.

Real-time functions would be one of the key uses of voice clones. Even now, voice cloning models with sufficiently extensive resources can make anyone’s voice speak any language. As the accuracy of the models increases and the latency decreases, we can eventually reach models capable of simultaneous interpretation.

In business meetings and peace negotiations, for example, parties representing different nations could then hear each other through their earbuds in real time, each in their own language. ElevenLabs, the industry leader mentioned earlier, cites real-time interpretation as a long-term goal.

What happens in real time, then, can only be altered in the future, but the past can already be edited quite well with voice clones.

From Fucking to Fricking

Last year the movie *Fall* was released in the USA. In it, the two protagonists get stuck at the top of a very high radio mast. Being stranded at that height brings the main characters a couple of hours of suspense and extreme frustration, which they channel especially into the English language’s F-word.

After the release of the film, the production company decided that it also wanted a PG-13 version for distribution. Swearing is not suitable for that age rating. How do you take back obscenities that have already been unleashed?

With voice cloning. The production company hired the artificial intelligence startup FlawlessAI, which made voice clones of both actors, modeled their facial movements, and then used a combination of voice clones and re-animation to change the “fuckings” into milder “frickings”.

The result was a clean PG-13 version for stores. Editing recorded history, whether video or audio, is quickly becoming very easy. That poses a dilemma about reality: what really happened?

– It may be that we soon have to assume that most of the content on the Internet is a scam or manipulation, says Juvela.

Verifiably authentic content may in the future need to carry a certificate, perhaps with online banking credentials required to publish it, Juvela muses half-seriously.

The Pope’s flashy style was admired on the internet over the weekend, until it turned out that the image was a product of the Midjourney artificial intelligence, created by an American construction worker who had been playing with AI. It took a public disclosure and hundreds of corrections to bring even the public perception of the Pope’s lifestyle choices back in line with reality.

Good and bad AI developers

Juvela sees the makings of an endless game of cat and mouse, in which some artificial intelligence developers build certificates and models to detect fakes, while developers on the other side modify their models to avoid being “caught”.

These threat scenarios are one of the reasons why Juvela returned to academia after a period in the private sector. Artificial intelligence and voice will affect our lives, and Juvela wants a platform for talking about these effects.

In the future, an unsuspecting consumer may get a call from a cloned number, in which a voice-cloned bank clerk asks for their personal identity code and key codes on the pretext of an urgent banking matter. Already in 2019, a British CEO was allegedly defrauded of 220,000 euros when a voice clone of his boss instructed him over the phone to transfer money to a Hungarian bank.

Inauthentic co-workers or authorities may be just the beginning: a fraudster equipped with good enough speech material and artificial intelligence could in the future contact a target using the voice of, for example, the target’s sibling, parent or partner. The U.S. Federal Trade Commission, which is responsible for consumer protection, recently published a warning about voice clone scams in which people receive calls that appear to come from relatives asking for money, for example to post bail.

Although criminals are great innovators, according to Juvela, scientists are also actively working on countermeasures. For example, the artificial intelligence assistant that comes with a phone could become people’s primary firewall against various scams.

– Creating shields is a big research trend at the moment. However, there is a question mark as to how effective countermeasures will become when new types of attacks are invented so quickly, says Juvela.