Password analysis

Analysis of 2.5b+ passwords in terms of letter frequency by top-level-domain

Tobias Schroedel, Munich, Germany // www.comedyhacker.com // schroedel (at) sichere (dot) it // @comedyhacker

Abstract:
In January 2022 I analyzed 2.5+ billion passwords by their letter frequency.
I found distinct letter distributions for different domain extensions.
This knowledge may increase the speed of password-cracking programs that use (intelligent) brute-force attacks.

Why:
I was just curious, and ... in ancient times, when codebreakers battled codemakers, language analysis and letter frequency were a sharp sword against encrypted texts (especially against a monoalphabetic substitution such as a Caesar shift). If you are interested in historical cryptanalysis, you may be interested in HistoCrypt 2022, which takes place in Amsterdam in June 2022.

Knowing the frequency of used letters in passwords may (or may not) increase the speed when cracking passwords using a brute-force attack.

Instead of testing each letter on all positions from ABC...XYZ it may (or may not) be faster to sort the letters like this: ENIRS...JXYQ
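
To illustrate the idea, here is a minimal sketch in Python. The letter order above is abbreviated in the text, so the sketch uses a toy five-letter alphabet with a made-up frequency ranking; `candidates` and `position_of` are illustrative names, not part of any existing tool.

```python
from itertools import product


def candidates(alphabet, length):
    """Yield every string of the given length, varying the last position fastest."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)


def position_of(target, alphabet):
    """Return the 1-based number of guesses needed to reach target."""
    for guesses, candidate in enumerate(candidates(alphabet, len(target)), start=1):
        if candidate == target:
            return guesses
    raise ValueError("target uses characters outside the alphabet")


# Toy alphabet; the frequency order is an assumption for illustration only.
alphabetical = "aeinr"
by_frequency = "enira"

print(position_of("nie", alphabetical))  # 87
print(position_of("nie", by_frequency))  # 36
```

The gain only shows up when the target consists mostly of frequent letters. The total number of candidates stays the same, so a target made of rare letters is reached later, which is why the speed-up "may (or may not)" materialize.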

Disclaimer:
Cracking passwords is illegal in many countries. Don't do it!
You will not see any cleartext passwords here or elsewhere on my website. Only numbers.

Findings:
Here are some cool findings that I made. You will find more. If you want to share them, let me know.

  • the most frequent letter in German and English texts is 'e' while in passwords it's 'a'
  • x, y and z are used up to 50 times more often in passwords than in texts
  • Asians use 2-3 times as many numbers in passwords as the rest of the world
  • the least used number in passwords worldwide is 7, probably as it stands for bad luck in China
  • # is not in the top 5 used special characters
  • the letter 't' is rare (-50%) in English passwords compared to English texts

    Method:
    bigDB is a huge text file that contains billions of e-mail addresses and cleartext passwords, which were stolen in various data breaches over the last years. It was released in November 2017. The file is more or less available for free on various websites. I used it for my analysis.
    I wrote a small Python script that extracted only the passwords and grouped them by the top-level domain (tld) of each associated e-mail address.
    Then, for each tld, I simply counted the occurrences of letters, numbers and special characters as follows:
    reference = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890!§$%&/()=?+*#-_.:,;<>[]^' \"\\{}|~`"
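
    The counting step could look roughly like the following sketch. It is not the original script: the record format (one `email:password` pair per line) is an assumption, and `tld_of` and `count_by_tld` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Characters to count, mirroring the reference string in the text (kept as a set).
REFERENCE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!§$%&/()=?+*#-_.:,;<>[]^' \"\\{}|~`")


def tld_of(email):
    """Return the lowercase top-level domain of an e-mail address, or None."""
    _, at, host = email.rpartition("@")
    if not at or "." not in host:
        return None
    return host.rsplit(".", 1)[1].lower()


def count_by_tld(lines):
    """Count reference characters in passwords, grouped by the e-mail's tld.

    Assumes one 'email:password' record per line (an assumed format).
    """
    counts = defaultdict(Counter)
    for line in lines:
        # Split at the first colon only, so colons inside the password survive.
        email, sep, password = line.rstrip("\n").partition(":")
        tld = tld_of(email)
        if not sep or tld is None:
            continue  # skip malformed records
        # Upper and lower case are merged into one count per letter.
        counts[tld].update(ch for ch in password.upper() if ch in REFERENCE)
    return counts


# Made-up sample records, for illustration only.
sample = [
    "alice@example.de:anna1990",
    "bob@example.com:hello world",
    "not a valid record",
]
result = count_by_tld(sample)
print(result["de"].most_common(3))
```

    Keeping one `Counter` per tld means the per-domain tables can be built in a single pass over the file, without first splitting it into one file per tld.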

    In addition, I calculated some statistics such as the number of passwords per tld and the average password length. For a few languages (German, English, Spanish, French, Italian, Swedish) I also compared the frequency of each letter per tld (de, com, es, fr, it, se) to its frequency in general texts of that language.
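
    A comparison of this kind can be sketched as follows. The function names are hypothetical and both inputs are tiny made-up samples; real reference frequencies would come from a large text corpus per language, which is not reproduced here. Only the letters a-z are considered, so umlauts and accented letters are ignored in this sketch.

```python
from collections import Counter
import string


def letter_shares(text):
    """Return each letter's share (0..1) among the letters a-z found in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    counts = Counter(letters)
    return {ch: counts[ch] / total for ch in string.ascii_lowercase} if total else {}


def average_length(passwords):
    """Return the mean password length, or 0.0 for an empty list."""
    return sum(len(p) for p in passwords) / len(passwords) if passwords else 0.0


def relative_difference(password_share, text_share):
    """Return (password - text) / text, e.g. -0.5 for 'half as frequent'."""
    return (password_share - text_share) / text_share


# Toy inputs; both are made up for illustration only.
passwords = ["anna", "data", "tea"]
text = "the tea tasted better than the rest"

pw = letter_shares("".join(passwords))
tx = letter_shares(text)
print(average_length(passwords))
print(round(relative_difference(pw["t"], tx["t"]), 2))
```

    A relative difference of -0.5 corresponds to the "-50%" style of figure used in the findings: the letter is half as frequent in passwords as in ordinary text.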

    Limitations:
    There are some limitations in the source (bigDB) that influence the findings and distort the statistics.

  • tlds such as .com, .net or .org are language-free, as they are used more or less worldwide and do not (necessarily) reflect one language such as English.
  • almost no uppercase letters occur in the source. Where they do, it is only on the initial or on all characters. Therefore I do not differentiate between upper and lowercase.
  • a huge number of passwords seem not to be cleartext passwords. They may be a counter, database id, unique identifier or a hash with/without salt
  • the bigDB is 5 years old. Many passwords in it are even older. They do not represent current password requirements and specifications that are in use nowadays.
  • a huge number of passwords from Ireland seem to be the e-mail without host
  • a huge number of passwords from Belgium are the complete e-mail address written twice (a default password?)
  • a huge number of passwords from Mexico are hash and salt
  • way more passwords than I imagined contain a space (25%). This might be an error in the source or in my imagination.
  • I added a 'region' to each country in my table, which is not defined in the same way all over the world. If you think I am wrong, no offense was intended!
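
    Some of the artifacts listed above could be flagged before counting. The following sketch is not part of the original analysis: `classify` is a hypothetical helper, and the rules (hex strings with the length of an MD5, SHA-1 or SHA-256 digest, a password equal to the e-mail's local part, the address written twice) are assumed heuristics. Salted formats vary widely and are not covered.

```python
import re

# Lengths of hexadecimal MD5, SHA-1 and SHA-256 digests.
HEX_DIGEST_LENGTHS = {32, 40, 64}
HEX_RE = re.compile(r"[0-9a-fA-F]+")


def classify(email, password):
    """Return a label for a suspicious entry, or 'plain' if no rule matches."""
    email = email.lower()
    candidate = password.lower()
    local_part = email.partition("@")[0]
    if len(password) in HEX_DIGEST_LENGTHS and HEX_RE.fullmatch(password):
        return "hash-like"
    if candidate == email + email:
        return "email-twice"
    if candidate == local_part:
        return "email-local-part"
    return "plain"


# Made-up records, for illustration only.
print(classify("someone@example.ie", "someone"))
print(classify("someone@example.be", "someone@example.besomeone@example.be"))
print(classify("someone@example.mx", "0123456789abcdef" * 2))
print(classify("someone@example.de", "anna1990"))
```

    Counting how many entries per tld fall into each label would show how strongly these artifacts distort a given domain's statistics.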


    Thanks for reading! And if you like this, why not buy me a coffee :-)

    And then ...?
    Here are the files. Have fun!

    All the details (PDF)

    Number of passwords, length, characters, numbers, special chars and frequency of A..Z, 0..9 and special chars.


    Password frequency compared to texts (PDF)

    For German, English, Swedish, Spanish, Italian and French.


    Password frequency compared to texts as graph (PNG)

    For .com, .uk, .de, .fr, .it, .es and .se
               

    Overview (PDF)

    All tlds with percentage on characters vs numbers vs special chars


    Password length (PDF)

    Some tlds with number of passwords, number of characters and average password length


    Numbers (PDF)

    Which tlds use more numbers in passwords than others


    Special characters (PDF)

    Special characters increase the security of passwords. Some tlds use more than others


    If you want all the data in an Excel sheet, then click here

    Version 1.00 - 22 Feb 2022