The references, due to their formatting and heavy use of hyperlinks and unusual symbols, tend to cause moderation issues, so I've omitted them for now. I'll try to include them after tweaking their formatting a bit.
In the wake of the NSA spying revelations (for those not familiar with them), the German BND decided to return to typewriters instead of computer-based documents and communications. It won't make a difference.
Here's why:
INTRODUCTION
This paper reports on recovering keystrokes typed on a keyboard from a sound recording of the user typing. Emanations produced by electronic devices have long been a topic of concern in the security and privacy communities. Both electromagnetic and optical emanations have been used as sources for attacks. For example, Kuhn was able to recover the display on CRT and LCD monitors using indirectly reflected optical emanations. Acoustic emanations are another source of data for attacks. Researchers have shown that acoustic emanations of matrix printers carry substantial information about the printed text. Some researchers suggest it may be possible to discover CPU operations from acoustic emanations. In ground-breaking research, Asonov and Agrawal showed that it is possible to recover text from the acoustic emanations of typing on a keyboard.
Most emanations, including acoustic keyboard emanations, are not uniform across different instances, even when the same device model is used, and they are affected by the environment. Different users on a single keyboard, or different keyboards (even of the same model), emit different sounds, making reliable recognition hard. Asonov and Agrawal achieved a relatively high recognition rate (approximately 80%) when they trained neural networks with text-labeled sound samples of the same user typing on the same keyboard. Their attack is analogous to a known-plaintext attack on a cipher: the cryptanalyst has a sample of plaintext (the keys typed) and the corresponding ciphertext (the recording of acoustic emanations). This labeled training sample requirement suggests a limited attack, because the attacker needs to obtain training samples of significant length. Presumably these could be obtained from video surveillance or network sniffing. However, video surveillance in most cases should render the acoustic attack irrelevant, because even if passwords are masked on the screen, a video shot of the keyboard could directly reveal the keys being typed.
In this paper we argue that the labeled training sample requirement is unnecessary for an attacker, which implies that keyboard emanation attacks are more serious than previous work suggests. The key insight in our work is that the typed text is often not random. When one types English text, the finite set of commonly used English words limits the possible temporal combinations of keys, and English grammar limits word combinations. One can first cluster (using unsupervised methods) keystrokes into a number of acoustic classes based on their sound. Given sufficient (unlabeled) training samples, a most-likely mapping between these acoustic classes and the actual typed characters can be established using these language constraints.
This task is not trivial. Challenges include:
1) How can one mathematically model language constraints and mechanically apply them?
2) In the first sound-based clustering step, how can one address the problem of different keys clustered into the same acoustic class, and of a single key clustered into multiple acoustic classes?
3) Can we improve the accuracy of the guesses by the algorithm to match the level achieved with labeled samples?
Our work answers these challenges using a combination of machine learning and speech recognition techniques. We show how to build a keystroke recognizer with a better recognition rate than the labeled-sample recognizers of previous work, using only a sound recording of the user typing.
Our method can be viewed as a machine learning version of classic attacks on simple substitution ciphers. Assuming the ideal case in which a key produces exactly the same sound each time it is pressed, each keystroke could easily be assigned an acoustic class according to its sound. The acoustic class assignment would be a permutation of the key labels, which is exactly an instance of a substitution cipher. Early cryptographers developed methods for cryptanalyzing substitution ciphers. Our attack can be viewed as an extension of these methods, but our problem is harder because the sound of a particular keystroke varies even when it is produced by the same typist.
We built a prototype that can bootstrap the recognizer from about 10 minutes of English text typing, using about 30 minutes of computation on a desktop computer with a Pentium IV 3.0 GHz CPU and 1 GB of memory. After the bootstrap step, it could recognize language-independent keystrokes in real time, including random keystrokes occurring in passwords, with an accuracy rate of about 90%. When language-dependent constraints are applied to English text, we achieve a 90-96% accuracy rate for characters and a 75-90% accuracy rate for words.
We posit that our framework also applies to other types of emanations with inherent statistical constraints, such as power consumption or electromagnetic radiation; one need only adapt the methods for extracting features and modeling constraints. Our work implies that emanation attacks are far more challenging, serious, and realistic than previously realized, and they deserve greater attention in the computer security community.
OVERVIEW OF PREVIOUS ATTACKS
We briefly review two related previous research studies examining recovery of keystrokes, each using a different type of side channel.
To the best of our knowledge, Asonov and Agrawal were the first researchers to publish a concrete attack exploiting keyboard acoustic emanations. They note that the sound of keystrokes differs slightly from key to key, and they give a concrete method to recover information about typing on keyboards, using neural networks as acoustic classifiers. Their approach is to first “teach” the neural networks what the different keys sound like. To do this, each key is typed 100 times, and the neural network is trained with the label (the key being typed) and the corresponding sound. The raw digitized sound input is too large for their neural networks, so each keystroke is represented as a vector of Fast Fourier Transform (FFT) features. The trained neural network can then be used to recognize subsequent keystrokes.
Based on the supervised learning approach above, Asonov and Agrawal show:
A wide variety (e.g. different keyboards of the same model, different models, different brands) of keyboards have keys with distinct acoustic properties.
Sound recordings from as far away as 15 meters suffice for neural network supervised learning if sophisticated microphones such as parabolic microphones are used.
Their neural network supervised learning is sensitive to training errors: if the input labels are inaccurate, recognition rates drop sharply. The effectiveness of the approach also depends heavily on the comprehensiveness of the training samples, i.e. whether they contain enough samples for each key.
Asonov and Agrawal’s work opened a new field. However, there are limitations in their approach:
Their attack requires labeled acoustic recordings. It works well only with the same settings (i.e. the same keyboard, person, recording environment, etc.) as the training recording, and such training data are hard to obtain in many cases. Training on one keyboard and recognizing on another keyboard of the same model yields much lower accuracy rates, at around 25%. Even if we count all occasions when the correct key is among the top four candidates, the accuracy rate is still only about 50%. Lower recognition rates are also observed when the system is trained for one typist and then applied to another typist.
The set of acoustic classification techniques used leaves room for improvement. In our work, we found features superior to FFT and acoustic classifiers superior to neural networks. We compared FFT and cepstrum features, and also compared three classifiers: linear classification, neural networks and Gaussian mixtures. Each classifier is trained on the training set and then used to classify the training set itself and two other data sets. The character recognition rate using cepstrum features (discussed below) is on average better than with FFT features, across all data sets and classification methods, and neural networks perform worse than linear classification on the two test sets. In this experiment we could only approximate the settings of the original work, but the significant performance differences indicate that there are better alternatives to the FFT and neural network combination.
Timing information is a different type of side channel related to keyboard typing. It includes the time between two keystrokes, the time from keystroke push to keystroke release, etc. Song, Wagner and Tian showed how to extract information based on the time between two consecutive keystrokes. They considered interactive login shells encrypted with the SSH protocol, a scenario in which an eavesdropper can detect the time between consecutive keys. Statistical analysis shows that the distribution of the time between a pair of keys varies for different key pairs. Contributing factors include whether the keys are typed with alternating hands or the same hand, with different fingers or the same finger, etc. The types of pairs defined in their work capture the physical distances between keys and also human response time. However, many different pairs may belong to the same type, e.g. two letters typed by alternating hands, and timing information is generally not helpful in distinguishing different pairs of the same type. Their work gives some analysis of the amount of information leaked by timing.
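To make the timing side channel concrete, here is a minimal sketch of how an eavesdropper might summarize inter-keystroke latency distributions per key pair. This is not from the paper; the function name and the input format are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def latency_by_pair(keystrokes):
    """keystrokes: list of (key, press_time_ms) tuples in typing order.
    Returns the mean and standard deviation of the inter-keystroke
    latency for each observed key pair; pairs typed with alternating
    hands tend to show shorter latencies than same-finger pairs."""
    pairs = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(keystrokes, keystrokes[1:]):
        pairs[(k1, k2)].append(t2 - t1)
    return {pair: (mean(ts), pstdev(ts)) for pair, ts in pairs.items()}
```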
THE ATTACK
We take a recording of a user typing English text on a keyboard and produce a recognizer that can, with high accuracy, determine subsequent keystrokes from sound recordings, provided they are made of the same person typing on the same keyboard under the same recording conditions. These conditions can easily be satisfied by, for example, placing a wireless microphone in the user's work area or by using a parabolic or laser microphone from a distance. Although we do not necessarily know in advance whether a user is typing English text, in practice we can record continuously, try to apply the attack, and see if meaningful text is recovered.
The attack consists of the following steps.
Feature extraction.
We use cepstrum features, a technique developed by researchers in voice recognition. As we discuss below, cepstrum features give better results than FFT.
Unsupervised key recognition
Using unlabeled training data, we cluster each keystroke into one of K acoustic classes, using standard data clustering methods. K is chosen to be slightly larger than the number of keys on the keyboard. If these acoustic classes corresponded exactly to different keys in a one-to-one mapping, we could easily determine the mapping between keys and acoustic classes. However, clustering algorithms are imprecise: keystrokes of the same key are sometimes placed in different acoustic classes, and conversely, keystrokes of different keys can land in the same acoustic class. We therefore let the acoustic class be a random variable conditioned on the actual key typed, so a particular key falls in each acoustic class with a certain probability. In well-clustered data, the probabilities of one or a few acoustic classes will dominate for each key. Once the conditional distributions of the acoustic classes are determined, we try to find the most likely sequence of keys given the sequence of acoustic classes for each keystroke. Naively, one might think picking the letter with the highest probability for each keystroke yields the best estimate and declare the job done.
But we can do better. We use a Hidden Markov Model (HMM). HMMs model a stochastic process with state. They capture the correlation between keys typed in sequence. For example, if the current key can be either “h” or “j” (e.g. because they are physically close on the keyboard) and we know the previous key is “t”, then the current key is more likely to be “h” because “th” is more common than “tj”. Using these correlations in English, both the keys and the key-to-class mapping distributions can be efficiently estimated using standard HMM algorithms. This step yields accuracy rates of slightly over 60% for characters, which in turn yields accuracy rates of over 20% for words.
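To illustrate this step, the sketch below decodes a sequence of acoustic class labels into the most likely key sequence with the Viterbi algorithm, given an English bigram transition matrix and a key-to-class emission matrix η. This is a minimal reconstruction under stated assumptions, not the authors' code; the function name and the smoothing constant are ours.

```python
import numpy as np

def viterbi(class_seq, keys, trans, eta):
    """Most likely key sequence for a sequence of acoustic classes.
    trans[i, j]: P(key j follows key i), from English bigram statistics.
    eta[i, c]:   P(acoustic class c | key i), estimated by EM."""
    n, T = len(keys), len(class_seq)
    logp = np.log(eta[:, class_seq[0]] + 1e-12)  # uniform prior over the first key
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of reaching key j via previous key i
        scores = (logp[:, None] + np.log(trans + 1e-12)
                  + np.log(eta[:, class_seq[t]] + 1e-12)[None, :])
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]                  # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [keys[i] for i in reversed(path)]
```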
Spelling and grammar checking
We use dictionary-based spelling correction and a simple statistical model of English grammar. These two approaches, spelling and grammar, are combined in a single Hidden Markov Model. This increases the character accuracy rate to over 70%, yielding a word accuracy rate of about 50% or more. At this point, the text is quite readable.
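A minimal sketch of the dictionary side of this step (the paper folds spelling and grammar into one HMM; this standalone helper and its name are assumptions for illustration): pick the same-length dictionary word that differs from the recognized word in the fewest characters.

```python
def correct_word(word, dictionary):
    """Return the same-length dictionary word with the fewest character
    mismatches, falling back to the input if no candidate has that length."""
    candidates = [w for w in dictionary if len(w) == len(word)]
    if not candidates:
        return word
    return min(candidates, key=lambda w: sum(a != b for a, b in zip(w, word)))
```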
Feedback-based training
Feedback-based training produces a keystroke acoustic classifier that does not require an English spelling and grammar model, enabling random text recognition, including password recognition. In this step, we use the previously obtained corrected results as labeled training samples. Note that our corrected results are not 100% correct, so we use heuristics to select the words that are more likely to be correct. For example, a word that is not spell-corrected, or one that changes only slightly during correction in the last step, is more likely to be correct than one that needed more changes. In our experiments, we pick out those words with fewer than 1/4 of their characters corrected and use them as labeled samples to train an acoustic classifier, as sketched below. The recognition phase then recognizes the training samples again, and this second recognition typically yields a higher keystroke accuracy rate. We use the number of corrections made in the spelling and grammar correction step as a quality indicator: fewer corrections indicate better results. The same feedback procedure is repeated until no significant improvement is seen; in our experiments, we perform three feedback cycles. Our experiments indicate that both linear classification and Gaussian mixtures perform well as classification algorithms, and both are better than the neural networks used in prior work. In our experiments, character accuracy rates (without a final spelling and grammar correction step) reach up to 92%.
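The selection heuristic might be sketched as follows; this is our reconstruction of the "fewer than 1/4 of characters corrected" rule, and the names are hypothetical.

```python
def select_training_words(recognized, corrected):
    """Keep corrected words where fewer than a quarter of the characters
    changed during spelling and grammar correction; these are likely
    correct enough to serve as labeled training samples."""
    keep = []
    for rec, cor in zip(recognized, corrected):
        changed = sum(a != b for a, b in zip(rec, cor))
        if changed < len(cor) / 4:
            keep.append(cor)
    return keep
```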
The second phase, the recognition phase, uses the trained keystroke acoustic classifier to recognize new sound recordings. If the text consists of random strings, such as passwords, the result is output directly. For English text, the spelling and grammar language model above is used to further correct the result. To distinguish between the two types of input, random text or English, we apply the correction and see if reasonable text is produced. In practice, a human attacker can typically determine whether text is random. An attacker can also identify occasions when the user types user names and passwords; for example, password entry typically follows a URL for a password-protected website. Meaningful text recovered from the recognition phase during an attack can also be fed back to the first phase, and these new samples, together with the existing ones, can be used to increase the accuracy of the keystroke classifier.
Below, we describe in detail the steps of our attack. Some steps (feature extraction and supervised classification) are used in both the training phase and the recognition phase.
Keystroke Feature Extraction
Keystroke Extraction
Typical users can type up to about 300 characters per minute, which averages at least 200 milliseconds per keystroke. Keystrokes consist of a push and a release. Our experiments confirm Asonov and Agrawal's observation that the period from push to release is typically about 100 milliseconds, so there is usually more than 100 milliseconds between consecutive keystrokes, which is large enough to distinguish them. We need to detect the start of a keystroke, which is essentially the start of the push peak in the keystroke acoustic signal.
We distinguish between keystrokes and silence using energy levels in time windows. In particular, we calculate the windowed discrete Fourier transform of the signal and use the sum of all FFT coefficients as the energy, then apply a threshold to detect the start of each keystroke.
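A minimal sketch of such a detector follows; the window length, threshold value, and enforced gap are illustrative assumptions, not values from the paper.

```python
import numpy as np

def keystroke_starts(signal, sr, win_ms=10, threshold=1.0, gap_ms=100):
    """Return sample indices where windowed FFT energy first crosses a
    threshold, enforcing a minimum gap between consecutive keystrokes."""
    win = int(sr * win_ms / 1000)
    gap = int(sr * gap_ms / 1000)
    starts, last = [], -gap
    for i in range(0, len(signal) - win, win):
        energy = np.abs(np.fft.rfft(signal[i:i + win])).sum()
        if energy > threshold and i - last >= gap:
            starts.append(i)
            last = i
    return starts
```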
Features: Cepstrum vs. FFT
Given the start of each keystroke (call it wavPosition), features of that keystroke are extracted from the audio signal over the period from wavPosition to wavPosition + Δt. Our experiments compared two different types of features. First we used FFT features, as in previous work. This time period roughly corresponds to the touch peak of the keystroke, which is when the finger touches the key. An alternative would be to use the hit peak, when the key hits the supporting plate, but the hit peak is harder to pinpoint in the signal, so our experiments used the touch peak.
Next, we used cepstrum features, which are widely used in speech analysis and recognition and have been empirically verified to be more effective than plain FFT coefficients for voice signals. In particular, we used Mel-Frequency Cepstral Coefficients (MFCCs). In our experiments, we set the number of channels in the Mel-Scale Filter Bank to 32 and used the first 16 MFCCs, computed using 10 ms windows shifted by 2.5 ms each time. MFCCs of a keystroke were extracted from the period from wavPosition to wavPosition + Δt′, where Δt′ = 40 ms, which covers the whole push peak.
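With a library such as librosa, the MFCC extraction just described might look like the following sketch; the parameter mapping is our interpretation, since the paper does not specify an implementation.

```python
import numpy as np
import librosa

def keystroke_mfcc(signal, sr, start, dur_ms=40):
    """First 16 MFCCs over the 40 ms push peak: 32 mel channels,
    10 ms analysis windows shifted by 2.5 ms, and frequencies above
    12 kHz discarded."""
    seg = np.asarray(signal[start:start + int(sr * dur_ms / 1000)],
                     dtype=np.float32)
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=16, n_mels=32,
                                n_fft=int(sr * 0.010),
                                hop_length=int(sr * 0.0025), fmax=12000)
    return mfcc.flatten()  # one fixed-length feature vector per keystroke
```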
Asonov and Agrawal observed that high-frequency acoustic data provides limited value, so we ignore data above 12 kHz. After feature extraction, each keystroke is represented as a vector of features (FFT coefficients or MFCCs).
Unsupervised Single Keystroke Recognition
As discussed above, the unsupervised recognition step recognizes keystrokes using audio recording data only and no training or language data.
The first step is to cluster the feature vectors into K acoustic classes. Possible algorithms include k-means and Expectation-Maximization (EM) on Gaussian mixtures. Our experiments tested values of K from 40 to 55, and K = 50 yielded the best results. We use thirty keys, so K must be equal to or larger than 30. A larger K captures more information from the sound samples, but it also makes the system more sensitive to noise. It would be interesting to experiment with Dirichlet processes, which might predict K automatically.
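As a sketch, this clustering step could be done with scikit-learn's k-means; the use of scikit-learn and the function name are assumptions, since the paper only names k-means and EM on Gaussian mixtures as candidate algorithms.

```python
from sklearn.cluster import KMeans

def cluster_keystrokes(features, k=50, seed=0):
    """features: one row per keystroke (e.g. its MFCC vector).
    Returns an acoustic class label in [0, k) for each keystroke;
    K = 50 gave the best results in the experiments above."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(features)
```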
The second step is to recover the text from these classes. For this we use a Hidden Markov Model (HMM). HMMs are often used to model finite-state stochastic processes in which the next state depends only on the current state; examples of processes that are close to Markov chains include sequences of words in a sentence, weather patterns, etc. For processes modeled with an HMM, the true state of the system is unknown and is therefore represented with hidden random variables; what is known are observations that depend on the state, represented with known output variables. One common problem of interest in an HMM is the inference problem, where the unknown state variables are inferred from a sequence of observations. This is often solved with the Viterbi algorithm. Another is the parameter estimation problem, where the parameters of the conditional distribution of the observations are estimated from the sequence of observations. This can be solved with the EM algorithm.
We use the EM algorithm for parameter estimation. It goes through a number of rounds, alternately improving the key estimates q_i and the output distribution matrix η. The output of this step is the η matrix. After that, the Viterbi algorithm is used to infer q_i, i.e. the best sequence of keys.
EM is a randomized algorithm, and good initial values improve the chance of getting satisfactory results. We found that initializing the row of η corresponding to the Space key with an informed guess makes the EM results more stable, probably because spaces delimit words and strongly affect the distribution of the keys before and after them. This task is performed manually: Space keystrokes are easy to distinguish by ear in the recording because of the key's distinctive sound and frequency of use. We marked several dozen Space keystrokes, looked at the classes the clustering algorithm assigned to them, calculated their estimated class-membership probabilities, and put these into η. This approach yields good results for most runs. However, it is not strictly necessary: even without the Space key guess, running EM with different random initial values will eventually yield a good set of parameters. All other keys used in our study, including punctuation keys, are initialized to random values in η. We believe that the initialization of η can be completely automated, and we hope to explore this idea in future work.
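A simplified sketch of this estimation loop follows, using "hard" EM (Viterbi training) rather than the full EM the paper describes, with the Space key row of η optionally seeded by hand as above; all names are ours.

```python
import numpy as np

def estimate_eta(class_seq, keys, trans, k, rounds=20, eta_init=None, seed=0):
    """Alternate Viterbi decoding of the keys with re-estimation of eta.
    eta_init lets the caller seed rows (e.g. the Space key row) with an
    informed guess; otherwise all rows start at random values."""
    rng = np.random.default_rng(seed)
    eta = rng.random((len(keys), k)) if eta_init is None else eta_init.copy()
    eta /= eta.sum(axis=1, keepdims=True)
    index = {key: i for i, key in enumerate(keys)}
    for _ in range(rounds):
        guess = viterbi(class_seq, keys, trans, eta)  # sketch shown earlier
        counts = np.full((len(keys), k), 1e-3)        # light smoothing
        for key, c in zip(guess, class_seq):
            counts[index[key], c] += 1
        eta = counts / counts.sum(axis=1, keepdims=True)
    return eta
```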
The remainder of this section discusses possible improvements to the attack.
One challenge we met in our work was marking keystroke starting points in the sound signal. This is not trivial, because the sound varies in energy level, timing, frequency distribution, etc., depending on the typist and the recording environment. We use energy levels and timing constraints between consecutive keys to mark the starting positions of keystrokes, with detection rules created manually based on past experience. Our detection program based on this approach has difficulty marking keystroke positions in recordings of fast typists. However, there is additional information we can use, namely frequency, which appears to vary from the push peak to the release peak; we hope to explore this in future work. A robust and consistent keystroke position detection algorithm may also improve the recovery rate of typed characters.
Currently, we assume the Space key, Enter key and punctuation keys are detected correctly and use them to divide characters into words. We use candidate words of the same length as the “words” separated in this way. In future work, we will explore better ways to choose candidate words for correction, with the goal of high quality correction even when there are mistakes in separating words.
An alternative method for feedback training is Hierarchical Hidden Markov Models (HHMMs). In an HHMM, HMMs at multiple levels (the grammar level and the spelling level in this case) are built into a single model. Algorithms that maximize the global joint probability may improve the effectiveness of the feedback training procedure. This approach merits further investigation.
Our experiments tested FFT and cepstrum features, but there are other types of features for representing sound signals, and for each type of feature there are multiple parameters controlling the extracted information. Currently, we used ad hoc methods to select these parameters. An entropy-based metric defined specifically for measuring acoustic features may provide a better, more systematic way to compare features and parameters. Such a metric may also allow us to compare the information leaked by individual keys. Given current PC keyboard layouts, is the leakage uniform across keys, or should we pay more attention to specific keys? Is it possible to know which typing styles leak more information and whether different typists leak different amounts of information?
In a controlled environment where we can record isolated typing sounds, the recovery rate is now high. However, in most realistic situations, environmental background noise is an issue. In many work spaces, we have multiple users simultaneously typing. Isolating the sound of a single typist is difficult. We propose to test recording with multiple microphones, including arrays of directional microphones. We could get the sound signal of multiple channels in this way. Similarly, we have shown that the recognition rate is lower in noisy environments. Attacks will be less successful when the user is playing music or talking to others while typing. However, we may be able to use signal processing techniques (especially in multichannel recordings) to isolate the sound of a single typist.
We hope to explore a variety of recording devices including parabolic microphones, laser microphones, telephone receiver microphones, acoustic chat connections such as Skype, etc.
In future work, it would be particularly interesting to try to detect keystrokes typed in a particular application, such as a visual editor (e.g. emacs) or a software development environment (e.g. Eclipse). Examining text typed in these environments presents challenges because more keys may be used and special keys may be used more often; furthermore, the bigram or transition matrix will be different. Nonetheless, we believe our techniques may be applicable to detecting keystrokes in these applications, and may even cover input as different as other small-alphabet languages, such as Russian or Arabic, large-alphabet languages, such as Chinese or Japanese, and even programming languages.
Defenses
Since our attack passively eavesdrops on acoustic signals, it is more difficult to detect than active attacks in which attackers interact with victims. Here are some preliminary areas for potential defenses:
Reduce the possibility of leaking acoustic signals. Soundproofing may help, but given the effectiveness of modern parabolic and laser microphones, the standards are very high.
Quieter keyboards as suggested by Asonov and Agrawal may reduce vulnerability. However, the two so-called “quiet” keyboards we used in our experiments proved ineffective against the attack. Asonov and Agrawal also suggest that keyboard makers could produce keyboards having keys that sound so similar that they are not easily distinguishable. They claim that one reason keys sound different today is that the plate underneath the keys makes different sounds when hit at different places. If this is true, using a more uniform plate may alleviate the attack. However, it is not clear whether these kinds of keyboards are commercially viable. Also, there is the possibility that more subtle differences between keys can still be captured by an attacker. Further, keyboards may develop distinct keystroke sounds after months of use.
Another approach is to reduce the quality of the acoustic signal that attackers can acquire. We could add masking noise while typing, although we are not sure that masking noise could not be easily separated out. As we discussed above, an array of directional microphones may be able to record and separate sound into multiple channels according to the locations of the sound sources, and this defense could also become ineffective when attackers collect more data. Reducing the annoyance of masking is also an issue. Perhaps a short window of noise could be added at every predicted push peak, which may be more acceptable to users than continuous masking noise. Alternatively, perhaps we could randomly insert noise windows that sound like the push peaks of keystrokes.
The practice of relying only on typed passwords, or even long passphrases, should be reexamined. One alternative is two-factor authentication that combines passwords or passphrases with smart cards, one-time-password tokens, biometric authentication, etc. However, two-factor authentication does not solve all the problems: typed text other than passwords is also valuable to attackers.
CONCLUSION
Our new attack on keyboard emanations needs only an acoustic recording of typing on a keyboard and recovers the typed content. Compared to previous work, which requires clear-text labeled training data, our attack is more general and more serious. More importantly, the techniques we use to exploit inherent statistical constraints in the input and to perform feedback training can be applied to other emanations with similar properties.
@Oscar @Slav Defence @Jungibaaz @Gufi @Nihonjin1051 @AMDR @AUSTERLITZ @levina