I’ve been wondering… how many URLs are out there on the internet? Not just domain names, but real URLs, including files and parameters. Different protocols as well. I suspect there are quite a few.
The thought occurred to me while working on the concept of lonks. For the community edition I want to save URLs in a separate table and just refer to them through IDs, so that they are not directly connected to the bookmark entries. That also reflects the idea of a somewhat normalized database and makes anonymizing referrers easier.
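To make the table idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and the sample ID are all assumptions made up for illustration; nothing here is the actual lonks schema.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical layout: URLs live in their own table and bookmarks
    # only point at them through the (random) ID.
    conn.executescript(
        """
        CREATE TABLE urls (
            id  TEXT PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE bookmarks (
            bookmark_id INTEGER PRIMARY KEY,
            title       TEXT NOT NULL,
            url_id      TEXT NOT NULL REFERENCES urls(id)
        );
        """
    )


def demo():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # The ID and URL below are invented sample values.
    conn.execute(
        "INSERT INTO urls (id, url) VALUES (?, ?)",
        ("7f3a9c01", "http://example.com/page?x=1"),
    )
    conn.execute(
        "INSERT INTO bookmarks (title, url_id) VALUES (?, ?)",
        ("Example", "7f3a9c01"),
    )
    # A bookmark reaches its URL only via the ID.
    row = conn.execute(
        "SELECT b.title, u.url FROM bookmarks b JOIN urls u ON u.id = b.url_id"
    ).fetchone()
    conn.close()
    return row
```

Calling `demo()` builds the two tables in memory, stores one URL and one bookmark, and reads them back through the join.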
But the (random) IDs have to be the right size from the start so that they last for eternity (or at least close to that). Otherwise, if the IDs had to grow longer later on, the longer ones would give away that their URLs were created after a certain point in time. On the other hand, they should be short enough not to waste storage space or make the referrer URLs too long.
Just using numbers looks lame. But I can’t use all characters either, or sooner or later there’ll be an ID that makes sense as a word. Maybe even a swear word. You don’t want http://lonks/nr1idiot to direct to your site, do you? Going hex is a bit restricted as well, but it is the most common system.
In addition, I thought of a system that splits the alphabet into chunks, which will make it virtually impossible to create a word. I still have to figure out if that system is any good and how many IDs I can squeeze out of it with a decent number of digits. If that doesn’t work out, I guess I’ll stick to 4-16 digit hex (64 bit).
Okay, let’s do the math with 4-16 digits (always including numbers) just for fun.
- hex: 18,446,744,073,709,486,080
- 3 no-vowel chunks: 194,644,767,472,667,473,927
- No vowels: 727,423,121,747,185,262,904,960
- All characters: 7,958,661,109,946,400,882,712,320
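For anyone who wants to redo the number crunching, here is a small Python sketch. The hex, no-vowel and all-character figures above match base^16 - base^4, with base 16, 31 (10 digits plus 21 consonants) and 36, so that is what `id_space` computes. The function names are made up for this example, and the chunk system is left out because its rules are not spelled out here. As a comparison, `id_space_sum` counts every string of length 4 to 16, which gives a somewhat larger number.

```python
def id_space(base, min_digits=4, max_digits=16):
    # Formula that reproduces the listed figures: base^max - base^min.
    return base ** max_digits - base ** min_digits


def id_space_sum(base, min_digits=4, max_digits=16):
    # Alternative count: every string with min_digits..max_digits characters.
    return sum(base ** n for n in range(min_digits, max_digits + 1))


if __name__ == "__main__":
    # Alphabet sizes are assumptions: 16 hex digits, 10 digits + 21
    # consonants, and 10 digits + 26 letters.
    for name, base in [("hex", 16), ("no vowels", 31), ("all characters", 36)]:
        print(f"{name}: {id_space(base):,} (sum over lengths: {id_space_sum(base):,})")
```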
Maybe a case-sensitive character system will help to reduce the number of digits and/or increase the possible number of IDs. But maybe hex is enough… considering there won’t be any need to save every URL on the internet anyway.
Am I thinking too much? Or am I just a megalomaniac? Still the question remains… how many URLs are there?
Update: Just did the number crunching on a case-sensitive version of the 3 no-vowel chunks: 36,349,704,372,835,319,666,931. A nice intermediate number. Looks mysterious and leet as well, so I might go for that. With that many possibilities, a freshly generated random ID will almost never be in use already, so there will rarely be a need to generate a second one. The speed of the queries and searches will be an interesting factor in the end, but I guess that problem will be solved when it arises.
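The "generate another one if it is already in use" part could look roughly like the Python sketch below. It is only an illustration: the alphabet is a plain lowercase no-vowel one standing in for the chunk system, a set stands in for the database lookup, and the function names, the ID length of 8 and the retry limit are all assumptions.

```python
import secrets

# Stand-in alphabet: 10 digits + 21 consonants (no vowels), 31 symbols.
ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz"


def random_id(length=8):
    # Pick each character independently from the alphabet.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def new_unique_id(used, length=8, max_tries=10):
    # `used` is a set standing in for a lookup against the URL table.
    for _ in range(max_tries):
        candidate = random_id(length)
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("no free ID found, the ID space may be too small")
```

With an ID space this large the loop should almost always succeed on the first try; the retry limit is only there so that a misconfigured, tiny ID space fails loudly instead of looping forever.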