Crawling websites for Facebook IDs and the competition

This is a technical show-and-tell post. I had a list of all 10,000 colleges in the United States, thick with esoteric details such as whether they had land grants from the government, whether they were affiliated with a hospital, and so on. The list also included a web address for each one. If I could find them on Facebook, I could get more user-friendly information on each college. This post is about how I built a web crawler in Node.js that hit every one of their websites looking for Facebook IDs.

While I was at it, I crawled for any mention of our Upswing competitors, companies that might be selling online tutoring or academic support services to these schools.

You can get the list of colleges from the DOE; it’s called the postsecondary school survey, and it was most recently done in 2012.

Here’s my GitHub repository: college-facebook-id-finder

The Node.js crawler I used was node-simplecrawler. It first hits the URL you provide, then grabs other links off that page to continue crawling the site – and it keeps doing so until it cannot find another unique URL.
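That “keep going until no unique URL remains” loop is the heart of any crawler. Here’s a self-contained sketch of the idea – a toy in-memory link graph stands in for real HTTP and for simplecrawler itself, so the paths are made up:

```javascript
// Toy model of what node-simplecrawler does: start from one URL, collect
// links off each fetched page, and keep going until no unique URL remains.
// The "site" here is a hard-coded in-memory link graph, not real HTTP.
var site = {
    "/":           ["/about", "/admissions"],
    "/about":      ["/", "/contact"],
    "/admissions": ["/about"],
    "/contact":    []
};

function crawl(startPath) {
    var seen = {};
    var queue = [startPath];
    var order = [];
    while (queue.length > 0) {
        var path = queue.shift();
        if (seen[path]) continue;   // fetch each unique URL only once
        seen[path] = true;
        order.push(path);
        (site[path] || []).forEach(function(link) {
            if (!seen[link]) queue.push(link);
        });
    }
    return order;
}

console.log(crawl("/"));  // visits every reachable page exactly once
```

Every page reachable from the start URL gets visited once; pages with no inbound links never get found, which is also true of the real crawler.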

I used two “fetch conditions” to make sure that the crawler wouldn’t bother with any URL containing a number (otherwise you might end up combing through a calendar or something) or with any type of document/image/object/etc.

crawler.addFetchCondition(function(parsedURL) {
    // skip any path that contains a digit
    return parsedURL.path.match(/^([^0-9]*)$/i);
});

crawler.addFetchCondition(function(parsedURL) {
    // skip documents, images, stylesheets, scripts, etc.
    if (parsedURL.path.match(/\.(css|doc|xls|ppt|xml|cgi|mso|avi|wmz|zip|wmv|swf|jpg|pdf|gif|docx|js|png|ico)/i)) {
        return false;
    }
    return true;
});
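To see what those two conditions actually do, here they are as plain functions run against a few sample paths (the paths are illustrative, not from any real college site, and the extension regex’s dot is escaped so it matches a literal “.”):

```javascript
// The two fetch conditions as standalone predicates.
function noDigits(path) {
    // same pattern as the first fetch condition: reject any digit
    return /^([^0-9]*)$/i.test(path);
}
function notAsset(path) {
    // same pattern as the second: reject document/image/script extensions
    return !/\.(css|doc|xls|ppt|xml|cgi|mso|avi|wmz|zip|wmv|swf|jpg|pdf|gif|docx|js|png|ico)/i.test(path);
}

console.log(noDigits("/academics/tutoring"));      // true
console.log(noDigits("/calendar/2014/05/events")); // false - has digits
console.log(notAsset("/about/contact"));           // true
console.log(notAsset("/img/logo.png"));            // false - an image
```

The digit rule is blunt – it also throws away legitimate pages like `/top-10-majors` – but for 10,000 sites, erring toward fetching less is the right trade.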

These were my regular expressions:

var instaeduregex = /instaedu/ig;
var nettutorregex = /nettutor/ig;
var smarthinkingregex = /smarthinking/ig;
var tutordotcomregex = /tutor\.com/ig;
// capture group 1: the Facebook username or page ID
var fbregex = /(?:https?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/;
// matches bare www.-style URLs found in page text
var regex = new RegExp("^(www\\.){1}([0-9A-Za-z-\\.@:%_+~#=]+)+((\\.[a-zA-Z]{2,3})+)(/(.)*)?(\\?(.)*)?");
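The Facebook matcher is the widely circulated facebook.com URL regex; its capture group is the last path segment, which for /pages/ URLs is the page’s numeric ID. A couple of illustrative (invented) URLs show what it pulls out:

```javascript
// The facebook.com URL matcher; group 1 captures the username or,
// for /pages/ URLs, the numeric page ID. Sample URLs are made up.
var fbregex = /(?:https?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/;

console.log("https://www.facebook.com/examplecollege".match(fbregex)[1]);
// -> "examplecollege"
console.log("http://www.facebook.com/pages/Example-College/123456789".match(fbregex)[1]);
// -> "123456789"
```

Either form is enough to link a college to its Facebook presence, since the Graph API accepts usernames and numeric IDs interchangeably.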

Now, this thing will run forever. Then, inexplicably, it will occasionally just stop – or maybe it keeps working without making progress; either way, it breaks silently, the crawler never emits its ‘timeout’ event, and the server sits there useless until you ssh in and discover it.

My solution to that problem was to start a timeout that gives the crawler 20 seconds to complete each page before, assuming the worst, it just ends the Node process altogether.

var promise;  // watchdog timer handle

crawler.on("fetchcomplete", function(item, buffer, response) {
    // every completed fetch resets the watchdog
    if (promise) clearTimeout(promise);
    promise = setTimeout(function() {
        process.exit();  // assume the crawler has stalled and kill the process
    }, 20000);
});
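Here is the same reset-the-timer watchdog isolated so the behavior is visible – a `stalled` flag stands in for `process.exit()`, and the window is shrunk to milliseconds so the sketch runs quickly:

```javascript
// Watchdog pattern: every completed fetch resets a timer; if no fetch
// completes within the window, the stall handler fires.
function makeWatchdog(windowMs, onStall) {
    var timer = null;
    return function reset() {
        if (timer) clearTimeout(timer);
        timer = setTimeout(onStall, windowMs);
    };
}

var stalled = false;
var reset = makeWatchdog(100, function() { stalled = true; });

reset();                 // first "page fetched" at t=0
setTimeout(reset, 50);   // another fetch at t=50 restarts the 100ms window

setTimeout(function() {
    // the window expired at ~t=150, long before this check
    console.log("stalled:", stalled);  // -> stalled: true
}, 300);
```

In the real crawler the handler is `process.exit()`, which only makes sense because something outside the process – the cron job described below – brings it back up.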

Now here is the real gold. You want that server to start up again fresh, right?

So, go to Amazon and start up a Node server using a Bitnami image. It helps to have a bunch of credit via their Amazon Activate program for startups. Create a key pair so you can ssh into your server, and set it up as a remote for your git repository.

Once that is done, here’s how to set up your server to take care of itself.

crontab -e

What we have below is a cron job that runs the script every minute – that’s what the five asterisks mean (in my actual crontab the line is commented out with a leading #).

* * * * * /bin/bash

What the script does is check whether there is a node process running; if not, it starts one by running server.js. It logs all the console output into server.log; if there is an error, it gets logged to server.error.log.

#!/bin/bash
if [ "$(pidof node)" == "" ]; then
    cd /home/dev/college-facebook-id-finder && node server.js > server.log 2> server.error.log
fi

The name of my user was ‘dev’, by the way.
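The core of that script is the empty-string test on `pidof`: it prints nothing when no matching process exists. A sketch of the same guard, with a variable standing in for `$(pidof node)` so it runs anywhere without touching real processes:

```shell
#!/bin/bash
# Same guard as the cron script: start the server only when pidof
# finds no node process. "running" stands in for "$(pidof node)".
running=""   # pidof prints nothing when no matching process exists

if [ "$running" = "" ]; then
    msg="no node process found - would run: node server.js"
else
    msg="node already running with pid(s): $running"
fi
echo "$msg"
```

Because cron fires every minute, a crashed (or watchdog-killed) crawler is back within sixty seconds, which is what makes the `process.exit()` trick above safe.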

git hook

This way, whenever I push to this server, my changes automatically get deployed.

#!/bin/bash
cd /home/dev/prerender
unset GIT_DIR
git pull /home/dev/prerender.git
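For completeness, that script lives at `hooks/post-receive` inside the repository you push to, and it has to be executable. A sketch of wiring it in – the layout here is an assumption, using a temp directory instead of the real server-side repo, with the hook body copied from the post:

```shell
#!/bin/bash
# Sketch: install a post-receive hook into a repo's hooks directory.
# A temp dir stands in for the real server-side repository.
REPO=$(mktemp -d)
mkdir -p "$REPO/hooks"

cat > "$REPO/hooks/post-receive" <<'EOF'
#!/bin/bash
cd /home/dev/prerender
unset GIT_DIR
git pull /home/dev/prerender.git
EOF

chmod +x "$REPO/hooks/post-receive"   # git only runs executable hooks
ls -l "$REPO/hooks/post-receive"
```

The `unset GIT_DIR` matters: git sets that variable while running hooks, and without unsetting it the `git pull` in the working directory would operate on the wrong repository.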

Then, shut that instance down. Create your own personal AMI – an image of the instance – and launch 19 more of them. See how long it takes to get through that list of 10,000 (spoiler alert: with 20 micro instances, about 3 days).