Attackers only need 10 to 20 seconds of audio of the target person's voice. CSA-Printstock/DigitalVision Vectors/Getty Images

Humans vocalize by forcing air over the various structures of the vocal tract, including vocal folds, tongue, and lips. By rearranging these structures, you alter the acoustical properties of your vocal tract, allowing you to create over 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these different phonemes, resulting in a relatively small range of correct sounds for each.

In contrast, audio deepfakes are created by first allowing a computer to listen to audio recordings of a targeted victim speaker. Depending on the exact techniques used, the computer might need to listen to as little as 10 to 20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim's voice.

The attacker selects a phrase for the deepfake to speak and then, using a modified text-to-speech algorithm, generates an audio sample that sounds like the victim saying the selected phrase. This process of creating a single deepfaked audio sample can be accomplished in a matter of seconds, potentially giving attackers enough flexibility to use the deepfake voice in a conversation.

Detecting deepfakes

By estimating the anatomy responsible for creating the observed speech, it's possible to identify whether the audio was generated by a person or a computer. The first step in differentiating speech produced by humans from speech generated by deepfakes is understanding how to acoustically model the vocal tract. Luckily, scientists have techniques to estimate what someone - or some being such as a dinosaur - would sound like based on anatomical measurements of its vocal tract.

By inverting many of these same techniques, we were able to extract an approximation of a speaker's vocal tract during a segment of speech.
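To make the idea of "inverting" an acoustic model concrete, here is a minimal Python sketch, not the researchers' actual system: it uses linear predictive coding (via numpy and librosa) to estimate formant frequencies frame by frame, then inverts the textbook uniform-tube model of the vocal tract to get a rough tract-length estimate. The file name sample.wav, the LPC order, the formant-selection thresholds, and the 13-20 cm "plausible human" range are all illustrative assumptions, not values from the article.

```python
# Minimal sketch (assumptions noted above): per-frame formant estimation via LPC,
# followed by a uniform-tube inversion to approximate vocal tract length.
import numpy as np
import librosa

SPEED_OF_SOUND_CM_S = 35_000  # approximate speed of sound in warm, humid air


def estimate_formants(frame, sr, lpc_order=12):
    """Estimate formant frequencies (Hz) of one windowed speech frame from LPC poles."""
    a = librosa.lpc(frame, order=lpc_order)       # all-pole model of the vocal tract filter
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # one root per complex-conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)    # pole angle -> resonance frequency
    bandwidths = -(sr / np.pi) * np.log(np.abs(roots))
    # Keep sharp, speech-range resonances as formant candidates (thresholds are assumptions).
    return sorted(f for f, bw in zip(freqs, bandwidths) if 90 < f < 5000 and bw < 400)


def vocal_tract_length_cm(formants):
    """Invert the uniform-tube model: F_n = (2n - 1) * c / (4 * L), so L = (2n - 1) * c / (4 * F_n)."""
    if not formants:
        return None
    # Treat the lowest candidates as F1..F3 of a uniform tube closed at one end (a crude approximation).
    estimates = [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4 * f)
                 for n, f in enumerate(formants[:3], start=1)]
    return float(np.mean(estimates))


# Usage sketch: frame the recording, estimate a tract length for each voiced frame,
# and count frames whose implied anatomy falls outside an assumed human range.
y, sr = librosa.load("sample.wav", sr=16000)          # "sample.wav" is a placeholder
frame_len, hop = int(0.025 * sr), int(0.010 * sr)     # 25 ms frames, 10 ms hop
window = np.hamming(frame_len)

lengths = []
for frame in librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T:
    if np.sqrt(np.mean(frame ** 2)) < 0.01:           # skip near-silent frames (LPC is unstable there)
        continue
    length = vocal_tract_length_cm(estimate_formants(frame * window, sr))
    if length is not None:
        lengths.append(length)

implausible = [L for L in lengths if not 13.0 <= L <= 20.0]  # assumed adult human range, illustrative
print(f"{len(implausible)} of {len(lengths)} frames imply an anatomically implausible vocal tract")
```

A real detector would rely on far richer acoustic models than a single uniform tube, but the pattern is the same: estimate the anatomy implied by the audio, then ask whether a human speaker could plausibly have produced it.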