Abstract:
In January 2022 I analyzed 2.5+ billion passwords by their letter frequency.
I found distinct letter distributions for different domain extensions.
This knowledge may speed up password-cracking programs that use (intelligent) brute-force attacks.
Why:
I was just curious. And in ancient times, when codebreakers battled codemakers, language analysis and letter frequency were a sharp sword against encrypted texts (especially against a monoalphabetic substitution such as a Caesar shift). If you are interested in historical cryptanalysis, you may be interested in HistoCrypt 2022, which takes place in Amsterdam in June 2022. Knowing the frequency of the letters used in passwords may (or may not) increase the speed of cracking passwords with a brute-force attack. Instead of testing each letter at every position in the order ABC...XYZ, it may (or may not) be faster to sort the letters like this: ENIRS...JXYQ
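To make the ordering idea concrete, here is a minimal sketch in Python. It is not my analysis script: the function names are hypothetical and the counts in the usage note are made-up toy numbers, not results. It sorts characters by how often they were counted and estimates, for a single position only, how many tries are needed on average if the character really follows the counted distribution.

```python
def frequency_order(counts):
    """Return the characters sorted from most to least frequent.

    Ties are broken alphabetically so the result is deterministic.
    """
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def expected_tries(order, counts):
    """Average number of tries to hit ONE character when testing in `order`.

    Assumption: the character is drawn with probability proportional to
    `counts`, and every counted character appears in `order`.
    """
    total = sum(counts.values())
    weighted = sum((order.index(ch) + 1) * n for ch, n in counts.items())
    return weighted / total
```

With the toy counts `{"A": 1, "E": 5, "N": 4}`, `frequency_order` gives `"ENA"`, and `expected_tries` drops from 2.3 for the order `"AEN"` to 1.6 for `"ENA"`. Whether this carries over to whole passwords is exactly the "may (or may not)" part.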
Disclaimer:
Cracking passwords is illegal in many countries. Don't do it!
You will not see any cleartext passwords here or elsewhere on my website. Only numbers.
Findings:
Here are some cool findings that I made. You will find more. If you want to share them, let me know.
Method:
bigDB is a huge text file that contains billions of e-mail addresses and cleartext passwords, which were stolen in various data breaches over the last years. It was released in November 2017. The file is more or less freely available on various websites. I used it for my analysis.
I wrote a small Python script that extracted only the passwords and grouped them by the top-level domain (tld) of each associated e-mail address.
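The grouping step could look roughly like the sketch below. It is an illustration, not the original script. It assumes one record per line in the form `email:password`; the separator, the function names and the rule for skipping malformed lines are all assumptions.

```python
from collections import defaultdict


def tld_of(email):
    """Return the lowercased top-level domain of an e-mail address, or None."""
    _, at, domain = email.rpartition("@")
    if not at or "." not in domain:
        return None
    tld = domain.rsplit(".", 1)[1].lower()
    return tld or None


def group_passwords_by_tld(lines, sep=":"):
    """Group passwords by the tld of the e-mail address on the same line.

    Assumption: each line is "<email><sep><password>". The line is split at
    the FIRST separator, so a password may itself contain the separator.
    Lines without a separator, password or recognisable tld are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        email, found, password = line.rstrip("\r\n").partition(sep)
        if not found or not password:
            continue
        tld = tld_of(email)
        if tld is None:
            continue
        groups[tld].append(password)
    return dict(groups)
```

For a file of this size the lines would be streamed from disk, and it is probably wiser to update the counters per tld directly than to keep every password in memory.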
Then, for each tld, I simply counted the occurrences of letters, numbers and special characters as follows:
reference = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890!§$%&/()=?+*#-_.:,;<>[]^' \"\\{}|~`"
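A minimal sketch of such a count is shown below. It is not the original script. It assumes that lowercase letters are folded to uppercase before counting (the reference string only lists uppercase letters) and that characters outside the reference are ignored. `REFERENCE` here is a shortened stand-in with letters and digits only, and the function name is hypothetical.

```python
from collections import Counter
import string

# Shortened stand-in for the full reference string (assumption).
REFERENCE = string.ascii_uppercase + string.digits


def count_characters(passwords, reference=REFERENCE):
    """Count how often each reference character occurs in the passwords.

    Assumption: passwords are uppercased first, so "a" and "A" are counted
    together. Characters that are not in `reference` are ignored.
    """
    wanted = set(reference)
    counts = Counter()
    for password in passwords:
        for ch in password.upper():
            if ch in wanted:
                counts[ch] += 1
    # Report in reference order; dict.fromkeys drops repeated characters.
    return {ch: counts[ch] for ch in dict.fromkeys(reference)}
```

The result contains every reference character, including those with a count of zero, which makes the tables for different tlds directly comparable.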
In addition, I calculated some statistics such as number of passwords per tld, average length etc. For a few languages (German, English, Spanish, French, Italian, Swedish) I also compared the frequency of each letter per tld (de, com, es, fr, it, se) to their respective frequency in general texts.
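The statistics and the comparison can be sketched as follows. Again, this is an illustration with hypothetical function names, not the original script. `text_shares` stands for a table of letter frequencies in general texts of a language, which has to come from an external source and is not included here.

```python
import string


def password_stats(passwords):
    """Return the number of passwords and their average length."""
    n = len(passwords)
    total_length = sum(len(p) for p in passwords)
    return {"count": n, "average_length": total_length / n if n else 0.0}


def letter_shares(counts, letters=string.ascii_uppercase):
    """Convert absolute counts into shares of all counted letters."""
    total = sum(counts.get(ch, 0) for ch in letters)
    if total == 0:
        return {ch: 0.0 for ch in letters}
    return {ch: counts.get(ch, 0) / total for ch in letters}


def share_difference(password_shares, text_shares):
    """Per letter: share in passwords minus share in general texts.

    A positive value means the letter is more common in passwords.
    """
    return {
        ch: share - text_shares.get(ch, 0.0)
        for ch, share in password_shares.items()
    }
```

Shares rather than absolute counts are compared because the number of passwords differs a lot from tld to tld.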
Limitations:
There are some limitations in the source (bigDB) that influence the findings and distort the statistics.
Thanks for reading! And if you like this, why not buy me a coffee :-)
And then ...?
Here are the files. Have fun!
If you want all the data in an Excel sheet, then click here
Version 1.00 - 22 Feb 2022