Yahoo Pipes

For a net citizen like me, who wants to keep track of many sites but has limited time (or is just lazy), the next best thing to sliced bread is RSS feeds. You don’t have to visit every site every other day to see if something new has been posted. A program on your computer (your browser or a dedicated reader) or a website (Bloglines, for example; try the beta) does this for you, and you just get the results.

Now, what do you do if you are only interested in a few specific items that appear in a particular feed? Well, you could ignore the unimportant posts, mark them as read, or delete them. That’s what one usually does. But there has to be another way. Some sites, like this one, offer a variety of feeds for every category and tag, so it’s easy to pick the topics one is interested in. And for the others? Well, there’s Yahoo Pipes.

With Pipes you can do a lot of things. For example, you can filter a feed for different search terms. You could also combine several feeds into one. Or build both functions into a whole pipeline that filters the feeds of many sites and combines the results into a single feed. And after you’re done with that, you can put the result through a translator to get the information in your language.
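
Pipes is a visual editor, so no programming is required, but the idea behind such a pipeline is easy to sketch. Here’s a rough Python equivalent of a filter-and-combine pipe, using the third-party feedparser library; the feed URLs and the keyword are placeholders I made up:

    import feedparser

    # Feeds to watch and a topic to filter for -- both made-up examples.
    FEEDS = [
        "http://example.com/blog/feed",
        "http://example.org/news/rss",
    ]
    KEYWORD = "pipes"

    def filtered_entries(feed_urls, keyword):
        """Fetch each feed and keep only entries whose title mentions the keyword."""
        for url in feed_urls:
            for entry in feedparser.parse(url).entries:
                if keyword.lower() in entry.get("title", "").lower():
                    yield entry

    # Combine the filtered results into a single feed, newest first
    # (entries without a date simply sort last).
    merged = sorted(
        filtered_entries(FEEDS, KEYWORD),
        key=lambda e: e.get("published_parsed") or (),
        reverse=True,
    )

    for entry in merged:
        print(entry.get("title"), "-", entry.get("link"))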

And those are just examples with feeds. You can also use it to actively filter sites like internet auctions for specific items or price ranges, or manage your whole social networking life via a single pipe. And thanks to the growing API trend on the web, you’ll be able to do a lot more over time.

So anyone who is actively using the web should check Pipes out. It’s really helpful.

How many URLs are there?

I’ve been wondering… how many URLs are out there on the internet? Not just domain names, but real URLs, including files and parameters. Different protocols as well. I think there are quite a few.

The thought occurred to me while working on the concept of lonks. For the community edition I want to save URLs into a separate table and just refer to them through IDs, so that they are not directly connected to the bookmark entries. That also reflects the idea of a somewhat normalized database and makes it easier to anonymize referers.
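
A minimal sketch of that layout, assuming SQLite and with made-up table and column names:

    import sqlite3

    conn = sqlite3.connect("lonks.db")
    conn.executescript("""
        -- URLs live in their own table, keyed by a random short ID.
        CREATE TABLE IF NOT EXISTS urls (
            id  TEXT PRIMARY KEY,    -- the random ID
            url TEXT NOT NULL UNIQUE
        );
        -- Bookmarks reference the ID, never the URL itself.
        CREATE TABLE IF NOT EXISTS bookmarks (
            bookmark_id INTEGER PRIMARY KEY AUTOINCREMENT,
            url_id      TEXT NOT NULL REFERENCES urls(id),
            title       TEXT
        );
    """)
    conn.commit()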

But the (random) IDs have to be the right size from the start to last for eternity (or at least close to it). Otherwise some URLs could be identified as having been created after a certain timestamp. On the other hand, they should be short enough not to waste storage space and not make the referer URLs too long.

Just using numbers looks lame. But I can’t use all characters either, or sooner or later there’ll be an ID that makes sense as a word. Maybe even a swear word. You don’t want http://lonks/nr1idiot to point to your site, do you? Going hex is a bit restrictive as well, but it’s the best of the common systems.

In addition, I thought of a system that splits the alphabet into chunks, which makes it virtually impossible to create a word. I still have to figure out if that system is any good and how many IDs I can squeeze out of it with a decent number of digits. If that doesn’t work out, I guess I’ll stick to 4-16 digit hex (64 bit).
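
I haven’t pinned the system down yet, but one possible version of it looks like this in Python: split the 21 consonants into three chunks, never draw two consecutive letters from the same chunk, and allow the ten digits everywhere. The chunk boundaries here are arbitrary.

    import random

    # Three arbitrary chunks of the 21 consonants; vowels are banned entirely.
    CHUNKS = ["bcdfghj", "klmnpqr", "stvwxyz"]
    DIGITS = "0123456789"

    def random_id(length=8):
        """Build an ID where no two consecutive letters share a chunk."""
        out = []
        last = None  # index of the chunk the previous letter came from
        for _ in range(length):
            pool = "".join(c for i, c in enumerate(CHUNKS) if i != last) + DIGITS
            ch = random.choice(pool)
            out.append(ch)
            if not ch.isdigit():
                last = next(i for i, c in enumerate(CHUNKS) if ch in c)
        return "".join(out)

    print(random_id())  # e.g. "k7tqb2ms"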

Okay, let’s do the math with 4-16 digits (always including numbers), just for fun.

  • hex
    18,446,744,073,709,486,080
  • 3 no-vowel chunks
    194,644,767,472,667,473,927
  • No vowels
    727,423,121,747,185,262,904,960
  • All characters
    7,958,661,109,946,400,882,712,320
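
As a sanity check: the hex, no-vowel and all-character figures each come out to b^16 − b^4 for an alphabet of size b, i.e. every value that fits into 16 digits minus those that already fit into 4 (the chunk system follows its own rules). Quick to verify in Python:

    # The hex, no-vowel and all-character figures above are base**16 - base**4.
    for name, base in [("hex", 16), ("no vowels", 21 + 10), ("all characters", 26 + 10)]:
        print(f"{name}: {base**16 - base**4:,}")
    # hex: 18,446,744,073,709,486,080
    # no vowels: 727,423,121,747,185,262,904,960
    # all characters: 7,958,661,109,946,400,882,712,320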

Maybe a case-sensitive character system will help to reduce the digits and/or increase the possible number of IDs. But maybe hex is enough… considering there won’t be a need to save every URL of the internet anyway.

Am I thinking too much? Or am I just a megalomaniac? Still, the question remains… how many URLs are there?

Update: Just did the number crunching on a case-sensitive version of the 3 no-vowel chunks: 36,349,704,372,835,319,666,931. A nice intermediate number. It looks mysterious and leet as well, so I might go for that. There are so many possibilities that most of the time there won’t be a need to generate another random ID because the first one is already in use. The speed of the queries and searches will be an interesting factor in the end, but I guess that problem will be solved when it arises.
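
And if a collision does happen, a simple retry loop takes care of it. Building on the (made-up) SQLite schema and the random_id() generator sketched above:

    import sqlite3

    def store_url(conn, url):
        """Return the ID for a URL, inserting it under a fresh random ID if it's new."""
        row = conn.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()
        if row:
            return row[0]
        while True:
            candidate = random_id()  # the chunk-based generator from above
            try:
                conn.execute("INSERT INTO urls (id, url) VALUES (?, ?)", (candidate, url))
                conn.commit()
                return candidate
            except sqlite3.IntegrityError:
                continue  # ID already taken -- with this ID space, almost never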