Bruteforcing IDNs for fun and profit
IDNA is a mechanism to allow non-ASCII characters in domain names.
With this mechanism, we could register the domain name bücher.com, and it would actually be encoded as xn--bcher-kva.com, which is made of:
- the prefix "xn--", which was picked at random for the IDNA standard,
- and the rest of the name, encoded using Punycode.
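To make the two parts above concrete, here is a minimal sketch using the "punycode" codec that ships with Python. The helper name to_ace is mine, and the sketch deliberately skips the normalisation and validity checks that full IDNA performs before encoding, so treat it as an illustration rather than a complete implementation.

```python
def to_ace(label: str) -> str:
    """Encode a single label as prefix + Punycode (no IDNA normalisation)."""
    if label.isascii():
        # Plain ASCII labels are left untouched.
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(to_ace("bücher"))  # xn--bcher-kva
```

Python also has an "idna" codec that adds the prefix and applies the preparation step itself; the version above just keeps the two parts visible.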
Since the encoded names are made of characters from [a-z0-9-], they can potentially mean something in their encoded form, as well as in their decoded, Unicode form.
Wouldn't it be cool to own a nice domain name like that?
Things are not that easy though: not every combination of ASCII characters is valid Punycode...
Punycode is far too clever for me, but I can bruteforce it to try and find all the valid ASCII strings with a cool or meaningful valid Unicode representation. So that's what I did.
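For the curious, the bruteforce can be sketched roughly like this. It is not the exact script I ran, just a reconstruction of the idea under a few assumptions: it relies on Python's built-in "punycode" codec, it leaves the hyphen out of the alphabet for simplicity, and it drops any result containing a code point that has no Unicode name (unassigned, control or surrogate code points). The function name decode_candidates is made up for the example.

```python
import itertools
import string
import unicodedata

# Characters tried in each position (hyphen left out for simplicity).
ALPHABET = string.ascii_lowercase + string.digits


def decode_candidates(length):
    """Yield (ascii_label, unicode_text) for each candidate that decodes cleanly."""
    for chars in itertools.product(ALPHABET, repeat=length):
        candidate = "".join(chars)
        try:
            decoded = candidate.encode("ascii").decode("punycode")
        except UnicodeError:
            # Not valid Punycode: skip it.
            continue
        # Keep only strings where every code point has a Unicode name.
        if all(unicodedata.name(ch, "") for ch in decoded):
            yield candidate, decoded


if __name__ == "__main__":
    for label, text in decode_candidates(3):
        if label in {"x2a", "chi", "meh"}:
            names = ", ".join(unicodedata.name(ch) for ch in text)
            print(f"xn--{label} -> {text} ({names})")
```

Three characters means 36^3 = 46,656 candidates, which takes no time at all. Each extra character multiplies that by 36, so longer labels get slow quickly.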
If you didn't know, the majority of assigned Unicode code points are used to encode CJK scripts (Chinese, Japanese, Korean) and other Asian scripts, so a lot of the resulting strings are Chinese characters.
(xn--somuchchinese.com is actually decoded as 㰇㰍㰏㰉㯨㰄㰌㰏㰊㰇.com...)
Aside from Chinese-looking characters, I found that most of the results either decode to meaningless random Unicode garbage, or are encoded as totally meaningless and unpronounceable sequences of ASCII characters.
Here are a few of the interesting results I got in the 3-character range:
xn--x2a→ ѫ (CYRILLIC SMALL LETTER BIG YUS) (0x2A == 42. Don't judge me)
xn--chi→ ➿ (DOUBLE CURLY LOOP)
xn--eye→ ᛰ (RUNIC BELGTHOR SYMBOL)
xn--bye→ ᛭ (RUNIC CROSS PUNCTUATION)
xn--exe→ ᛍ (RUNIC LETTER C)
xn--hoh→ ⏰ (ALARM CLOCK)
xn--meh→ ⊗ (CIRCLED TIMES)
I don't know what I was thinking, but I quickly realised there were in fact still a lot of valid combinations, most of them useless, and I didn't want to waste much more time on that, so I stopped there...
I actually registered ѫ.net (the .com version was already taken), only to realise I had wasted 11€ in the process. Since browsers became aware of the potential for abuse of IDNs, the rules that decide whether a URL is displayed as its Unicode string or as its Punycode-encoded version have become much stricter, and ѫ.net now shows as xn--x2a.net in Chrome and Firefox, most likely because it is an obsolete letter...
(This does not stop the guy who owns ѫ.com from asking 2,300 USD for it...)
If I wasn't the first to have that idea, at least now I know why I have never seen anybody with a domain that means something in both its encoded and decoded forms...