DMOZ Expired Domain Finder

Back in the days of yore, right after Wikipedia started adding nofollow to all their outbound links, I built a bunch of small scripts that could be piped together to find expired domains in DMOZ (the Open Directory Project, which Google’s directory is built from). The idea was to download a copy of the directory, scrape all the domains out of it, see which ones were expired, register them and put up the old content from archive.org. It’s a shame that these domains go to waste, since each one has a Google directory link, and usually more backlinks than that.

Now that Google pays attention to domain expiry, I don’t think the tools are of much use for the original purpose. They are, however, a good example of how a bunch of small, special-purpose utilities can be combined on the UNIX command line to accomplish tasks in parallel. The scripts could also be used for something else, like checking which domains on your list are expiring, or quickly scanning through DMOZ for domains to approach about link exchanges.

Here’s the link to the DMOZ directory as a single gzipped file. It’s over 300MB compressed, and almost 2GB uncompressed. The source code for finding expired DMOZ domains is as follows:

  • parsedmoz.rb – Ruby script to extract domains from the content.rdf.u8 file
  • findMistakes.pl – Perl script that checks each domain for a DNS ‘A’ record and prints the domain if it lacks one (a missing record is a good indication that the domain no longer exists). It uses memoization (caching) to remember which domains it has already seen, since there are lots of dupes in the DMOZ dump.
  • checkDomain.sh – bash shell script that prints out which domains are available for registration.

Note: No, I don’t know why I wrote all three scripts in different languages. I think someone made a stupid remark to me that day about which language was Teh bEsT 3v4r! I should’ve written the whole thing in Haskell for fun.

Also, each script takes its input from stdin and sends its output to stdout. That way you can chain them if you want. For example:

$ cat content.rdf.u8 | ./parsedmoz.rb

Will give you a list of all host names in the DMOZ dump.
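
To give a feel for what that first stage does, here is a minimal sketch in Python. It is not the original parsedmoz.rb. It assumes only that URLs show up in the dump as http:// or https:// strings inside attribute values, and it pulls the host name out of each one. The names URL_RE and extract_hosts are made up for this example.

```python
import re
import sys
from urllib.parse import urlsplit

# Assumption: every URL of interest appears as an http(s):// string that ends
# at whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r'https?://[^\s"<>]+')


def extract_hosts(lines):
    """Yield the lowercased host name of every URL found in the given lines."""
    for line in lines:
        for url in URL_RE.findall(line):
            try:
                host = urlsplit(url).hostname
            except ValueError:
                # Malformed URL (for example a broken IPv6 literal); skip it.
                continue
            if host:
                yield host.lower()


if __name__ == "__main__":
    # Behave like a filter: dump in on stdin, one host name per line on stdout.
    for host in extract_hosts(sys.stdin):
        print(host)
```

Duplicates are deliberately left in here, since the next stage in the chain is the one that remembers what it has already seen.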

$ cat content.rdf.u8 | ./parsedmoz.rb | ./findMistakes.pl

Will give you a list of all domains from the DMOZ dump that don’t resolve anymore (perhaps a trip to archive.org is in order?). Piping these through checkDomain.sh will then give you a list of domains that are available for registration.
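
The DNS stage of that chain can be sketched roughly as follows. Again, this is a Python approximation and not the original findMistakes.pl. It uses socket.gethostbyname, which goes through the system resolver, as a stand-in for a real 'A' record query, and a plain set plays the role of the memoization cache. The function names and the resolve parameter are assumptions for this example.

```python
import socket
import sys


def has_a_record(host, resolve=socket.gethostbyname):
    """Return True if the host resolves to an IPv4 address.

    Approximation: gethostbyname asks the system resolver, so entries in the
    local hosts file count too. It is not a raw DNS 'A' query.
    """
    try:
        resolve(host)
        return True
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError covers host
        # names that cannot be encoded for lookup.
        return False


def unresolved(hosts, resolve=socket.gethostbyname):
    """Yield each distinct host that fails to resolve, checking it only once."""
    seen = set()  # remembers hosts already checked, since the dump has dupes
    for host in hosts:
        host = host.strip().lower()
        if not host or host in seen:
            continue
        seen.add(host)
        if not has_a_record(host, resolve):
            yield host


if __name__ == "__main__":
    for host in unresolved(sys.stdin):
        print(host, flush=True)
```

Passing the resolver in as a parameter is only there so the logic can be tried out without touching the network.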

Unfortunately, the whois servers for .org domains tend to limit the number of queries that you can make per hour. To stay under those limits you can run your whois lookups through SOCKS proxies, split your list and run it on multiple servers, or just code a delay between lookups.
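
The simplest of those options, a delay between lookups, might look something like the sketch below. It is not the original checkDomain.sh. It assumes a whois command is on the PATH, and both the five-second delay and the marker strings used to guess availability are placeholders: registries word their "not registered" replies differently, so check the real output before trusting it.

```python
import subprocess
import sys
import time

DELAY_SECONDS = 5.0  # assumed value; tune it to the whois server's real limit
AVAILABLE_MARKERS = ("no match", "not found")  # assumed; varies by registry


def throttled(items, delay, sleep=time.sleep):
    """Yield items one at a time, pausing for `delay` seconds between them."""
    first = True
    for item in items:
        if not first:
            sleep(delay)
        first = False
        yield item


def looks_available(domain):
    """Guess whether a domain is unregistered from the text of its whois reply."""
    try:
        result = subprocess.run(
            ["whois", domain],
            capture_output=True,
            text=True,
            errors="replace",  # whois replies are not always valid UTF-8
            timeout=30,
        )
    except subprocess.TimeoutExpired:
        return False
    reply = result.stdout.lower()
    return any(marker in reply for marker in AVAILABLE_MARKERS)


if __name__ == "__main__":
    domains = (line.strip() for line in sys.stdin if line.strip())
    for domain in throttled(domains, DELAY_SECONDS):
        if looks_available(domain):
            print(domain, flush=True)
```

The pause happens between lookups only, so a single domain is checked immediately and a list of N domains costs N-1 delays.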