Archive for the ‘Programming’ Category

Writing A Web Analytics Engine From Scratch

Thursday, April 2nd, 2009

If you’re building a system that needs to track affiliate sales, you’ll need to integrate some form of analytics into your software. Your affiliates will want to see how many total visitors hit their link, total uniques, how they got there (search terms or referrer URL) and if they made a product purchase.

There are a few ways of doing web analytics – writing tracking code right into your application (affiliate landing page), using JavaScript, or looking at web logs. I’ll be focusing on the JavaScript version. I won’t even go into web log processing here, since although there is interesting information there, it’s not real time enough for our use but is a powerful way to “check” the other methods or even gather information on spider visits (frequency, times of day, etc.)

Our tracking code will be fairly simple:

  1. Grab any information from the URL (GET parameters) and server data (user agent, remote IP, referer [sic] URL)
  2. Record the information in a database
  3. Continue processing the page

Decide on a Tracking Method

If you’re embedding the tracking code directly into your application, it’s a matter of adding some code to your controller and creating a model (and associated tables) to store the visitor data. The reporting backend will work exactly the same. The pros here are you don’t have to deal with JavaScript and/or cross-browser problems, and there may be a performance benefit since there are fewer HTTP requests being made to your server. The cons are that any time you want to change the tracking code, you need to change the controller, and you lose the ability to use the same tracking code on different sites, or sites that aren’t yours. Typically you set up one application (and domain) for doing analytics and reporting, and you have multiple websites. If you only have one website, and don’t mind running your analytics and reporting there, I’d recommend embedding the tracking code in your controller.

Using JavaScript to record visitor information is relatively simple. We need to write a controller to handle the requests to record visitor information, and a model to do the actual recording. The client side is a small JavaScript snippet, which will extract some variables and make a GET request to our controller. We won’t be using any AJAX here, since we need to deploy this code to multiple sites and have only one analytics site (i.e. we run the code on www.domain1.com but have our analytics requests hitting analytics.anotherdomain.com) – this is cross-site scripting (XSS), and although we want to allow it in this case, your browser won’t! Pros of this method are the ability to deploy to multiple sites and consolidate analytics/reporting to one server, and the ability to change tracking code without re-deploying your application. Cons are JavaScript browser incompatibility and increased complexity and load due to many (small) requests.

My Analytics Solution

We’ll be writing a controller and model using the Kohana PHP framework, and the client-side JavaScript without a framework, since all it does is generate a request for a 1×1 pixel GIF. This is the same way Google analytics and Mint do it. So, on to the code.


Web Analytics Model

Our model will store time, IP, request and referer [sic] URL information. Here is the MySQL table:

CREATE TABLE IF NOT EXISTS Hits (
    id                  INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    recorded_time        TIMESTAMP NOT NULL, -- Time the record was created
    ip                  INTEGER UNSIGNED NOT NULL,
    ua          VARCHAR( 200 ) NOT NULL DEFAULT '-', -- User agent
    request         VARCHAR( 200 ) NOT NULL DEFAULT '/',
    referer        VARCHAR( 200 ) NOT NULL DEFAULT '-', -- Referer
    is_unique           BOOLEAN NOT NULL DEFAULT TRUE
);

Hopefully there isn’t anything unclear there. I’ve created fields to record the time at which the hit happened, the IP (stored as an integer for compactness), the user agent string, the original request, the referer URL and whether this is a unique or not (has this person already visited our site.)

The model is equally simple:

class Click_Model extends Model {

    function __construct() {
        parent::__construct();
    }

    function create( $ip, $ua, $request, $referer, $is_unique=1) {
        $ret = false;
        $row = array();

        // Convert IP to integer
        $ip = $this->_ip_to_integer( $ip );

        // Either 0 or 1
        if( $is_unique > 0 ) {
            $is_unique = 1;
        } else {
            $is_unique = 0;
        }
        if( $ip > 0  ) {
            $row[ 'ip' ] = (int)$ip;
            $row[ 'ua' ] = $ua;
            $row[ 'request' ] = $request;
            $row[ 'referer' ] = $referer;
            $row[ 'is_unique' ] = $is_unique;
            $ret = $this->_create_if_not_exists( 'Clicks', $row );
        }
        return $ret;
    }

    /**
     * Converts a text IP address to an integer.
     **/

    function _ip_to_integer( $ip ) {
        $octets = split( '\.', $ip );
        return (int)( $octets[ 3 ] + $octets[2]*256 +
                      $octets[1]*256*256 + $octets[0]*256*256*256 );
    }

    /**
     * Inserts the row if it's new and returns the ID, or just returns the
     * ID if it already exists. The table must have a column called 'id'
     * that is the INTEGER AUTO_INCREMENT PRIMARY KEY style.
     **/

    function _create_if_not_exists( $table, $row ) {
        // Try to insert - if it doesn't exist we'll get an ID of zero
        $columns = join( ',', array_keys( $row ) );
        $placeholders = join( ',', array_fill( 0, count( $row ), '?' ) );
        $q = $this->db->query( "INSERT IGNORE INTO $table ($columns) ".
                               "VALUES ($placeholders)", array_values( $row ) );
        $ret = $q->insert_id();
        if( $ret == 0 ) {
            $q = $this->db->getwhere( $table, $row );
            if( $q->count() > 0 ) {
                $result = $q->result_array( false );
                $ret = $result[ 0 ][ 'id' ];
            }
        }
        return $ret;
    }
}

The model class is pretty straightforward. Since Kohana doesn’t support “INSERT IGNORE”, I had to roll my own version. The model only handles inserts – actual reporting and such are left out.


Web Analytics Controller

The controller only does one thing – validate and record the data passed to it, then return a 1×1 pixel GIF:

class Hit_Controller extends Controller {

    private $gif_data = "\x47\x49\x46\x38\x39\x61\x01\x00\x01".
                                "\x00\x80\xFF\x00\xFF\xFF\xFF\x00\x00".
                                "\x00\x2C\x00\x00\x00\x00\x01\x00\x01".
                                "\x00\x00\x02\x02\x44\x01\x00\x3B\x00";

    function __construct() {
        parent::__construct();
    }

    /**
     * Basically grab all the parameters, record in the database and return
     * some content.
     **/

    function index() {
        if( isset( $_GET[ 'ru' ] ) ) {

            $h_model = new Hit_Model();
            $ip = $this->input->server( 'REMOTE_ADDR' );
            $ua = $this->input->server( 'HTTP_USER_AGENT' );
            $h_model->record_click( $ip,
                                    $ua,
                                    $this->_get_elem( $_GET, 'ru' ),
                                    $this->_get_elem( $_GET, 'rf' ),
                                    $this->_get_elem( $_GET, 'u' ) );
        }

        // Return a 1x1 pixel transparent gif
        header( 'Content-Type: image/gif' );
        echo( $this->gif_data );
    }

    function _get_elem( $a, $k ) {
        $ret = '';
        if( isset( $a[ $k ] ) ) {
            $ret = $a[ $k ];
        }
        return $ret;
    }
}

The only validation we do here is check that the referer URL was passed (the ru variable in the GET string.)


Client-side JavaScript

The JavaScript that acts as our view (although nothing is displayed) and executes in the user’s browser is quite simple. It marshals the require parameters, then munges this into a request for a GIF. In order to tell the difference between a unique visitor and a pageview, we set a cookie upon first visit, which is then checked upon subsequent pageviews. Here’s our JavaScript:

function track() {
    var days = 7; // Number of days to keep cookie alive
    var ru = document.location.href;
    var rf = document.referrer;

    var rest = '';
    if( ru.length > 0 ) {
        if( rf == '' ) {
            rf = '-';
        } else {
            rf = urlencode( rf );
        }

        // If there's a query string, grab it and stick all the parameters on the
        // end.
        var qstring = ru.split( '?' );
        if( qstring.length > 1 ) {
            rest = qstring[ 1 ];
        }

        ru = urlencode( ru );
        rf = urlencode( rf );
        var clicked_time = new Date();
        clicked_time = Math.round(clicked_time.getTime()/1000);

        // Build data.
        var d = 'rf=' + rf;
        if( ru.length > 0 ) {
            d += '&ru=' + ru;

       }
        if( rest.length > 0 ) {
            d += '&' + rest;
        }
        d += '&ct=' + clicked_time;

        // If the cookie already exists for this bonus code, this isn't a unique hit
        var unique = 1;
        old_cookie = readCookie( 'analytics_unique' );
        if( old_cookie != null && old_cookie != "" ) {
            unique = 0;
        }

        // Set cookie.
        setCookie( 'analytics_unique', 'visited', days, '/' ); // For uniqueness

        d += '&u=' + unique;
        // Now request the 1x1 pixel gif to record the click.
        (new Image()).src =  'http://your.analytics.site.com/click.gif?' + d;
    }
    return true;
}

function setCookie( name, value, days, path ) {
    var date = new Date();
    date.setTime( date.getTime() + ( days*24*60*60*1000 ) );
    var expires = "; expires=" + date.toGMTString();
    document.cookie = name + '=' + value + expires + '; path=' + path;
}

function readCookie(cookieName) {
    var theCookie=""+document.cookie;
    var ind=theCookie.indexOf(cookieName);
    if (ind==-1 || cookieName=="") return "";
    var ind1=theCookie.indexOf(';',ind);
    if (ind1==-1) ind1=theCookie.length;
    return unescape(theCookie.substring(ind+cookieName.length+1,ind1));
}

function deleteCookie( cookieName ) {
    if( readCookie( cookieName ) ) {
        setCookie( cookieName, '', 0, '/' );
    }
}

function urlencode(str) {
    str = escape(str);
    str = str.replace(/\+/g, '%2B');
    str = str.replace(/%20/g, '+');
    str = str.replace(/\*/g, '%2A');
    str = str.replace(/\//g, '%2F');
    str = str.replace(/@/g, '%40');
    return str;
}

There are a few convenience methods for reading/writing cookies and encoding the data so things don’t get screwed up when we request the image. The final piece is to add a rewrite rule so our controller gets hit with any requests to click.gif:

RewriteEngine on
RewriteBase /
RewriteCond %{REQUEST_URI} ^/click\.gif\?
RewriteRule ^/click\.gif\?(.*) /hit?$1 [L]

The above just strips off all our GET parameters and feeds them to our hit controller, which we know returns a 1×1 pixel gif.


Extensions

You could extend the above to include more information about the user’s browser such as platform, Java-enabled, Flash version, JavaScript version or screen resolution. With some post-processing you’d be able to do geolocation on the user’s IP, and strip out keywords from search engines or PPC campaign variables. If you added a little more information to the uniqueness cookie, you’d be able to record bounce rate and time on page.

I’ve completely glossed over how the data should be presented to the users (your affiliates.) Most affiliate systems show total clicks, uniques and sales grouped by date, time of day or campaign ID. Of course, the main benefit of writing your own engine from scratch is you can offer affiliates things that other programs don’t show them such as referrer URL, search terms, PPC campaign variables and geographic location.

INSERT IGNORE For Kohana

Tuesday, March 24th, 2009

I’ve been using the Kohana framework for my PHP projects for a while now. One thing that’s annoying is lack of support for INSERT IGNORE and REPLACE INTO. Now, I understand that these are MySQL-specific, and not supported in other RDBMS, but there are equivalents in Oracle. INSERT IGNORE is commonly used on tables where there is an auto incrementing ID field or just a unique index on a particular column.

Anyway, here’s the code that acts like the Database library’s insert() function but ignores any duplicate key exceptions:

/**
* Inserts the row if it's new and returns the ID, or just returns the
* ID if it already exists. The table must have a column called 'id'
* that is the INTEGER AUTO_INCREMENT PRIMARY KEY style in
* order to return an ID.
**/

function _create_if_not_exists( $table, $row, $needs_id=true ) {
    $ret = true;
    // Try to insert - if it doesn't exist we'll get an ID of zero
    $columns = join( ',', array_keys( $row ) );
    $placeholders = join( ',', array_fill( 0, count( $row ), '?' ) );
    $q = $this->db->query( "INSERT IGNORE INTO $table ($columns) ".
                                              "VALUES ($placeholders)", array_values( $row ) );
    if( $needs_id == true ) {
        $ret = $q->insert_id();
        if( $ret == 0 ) {
            $q = $this->db->getwhere( $table, $row );
            if( $q->count() > 0 ) {
                $result = $q->result_array( false );
                $ret = $result[ 0 ][ 'id' ];
            }
        }
    }
    return $ret;
}

Madlibs Site Generator Code

Monday, November 10th, 2008

Writing content sucks. So what if there were a way to generate locale-specific content (and ads!) using a template and a few small tricks? The idea is to write a small amount of content as a template, then populate that template using a database. The source code I’m providing here uses a “database” (really just a list) of US states, cities and keywords. These three variables are used to generate both a search-engine friendly URL and the content that shows location-specific ads as sponsored results.

This code is 2 years old, but shows a few interesting tricks like putting keywords into subdomains and then extracting them. It generates a site with thousands of pages using only one php script and some html templates. The URLs generated are SEO friendly, and distinct titles and meta descriptions are generated on each page. I used the following URI scheme:

http://<state>.<domain>/<keyword>/<city>/

You can change this by modifying the rewrites in the .htaccess file, and the getState, getCity and getKeyword methods in util.php. The subdomains actually worked really well in terms of indexing – Google loved them. Obviously you’ll need to write new templates, too. One modification that would really help is to have multiple templates, or make the templates PHP-based so they can generate different content for all the leaf pages at the very bottom (i.e. that include keyword, city and state.) The content is so obviously generated that it might not even get indexed these days. If you want to actually use this code, just search for YOURDOMAIN and the word “widget” (case insensitive) to find all the places you need to modify. If you’re super stuck, email me and I’ll try to give you a hand. All I would ask is that you let me know where you’ve used it and how [un]successful it is.

Here’s the madlibs site generator source code.

DMOZ Expired Domain Finder

Monday, November 10th, 2008

Back in the days of yore, right after wikipedia started added nofollow to all their outbound links, I built a bunch of small scripts that could be piped together to find domains in dmoz (Google’s directory) that were expired. The idea was to download a copy of the directory, scrape all the domains out of it, see which ones were expired, register them and put up the old content from archive.org. It’s a shame that these domains go to waste, since they have a Google directory link. Usually more backlinks than that.

Now that Google pays attention to domain expiry, I don’t think the tools are of much use for the original purpose. They are, however, a good example of how a bunch of small, special purpose utilities can be combined on the UNIX command line to accomplish tasks in parallel. Also, these scripts could be used for something else, like checking which of your list of domains is expiring, or quickly scanning through DMOZ for domains for doing link exchanges, etc.

Here’s the link to the DMOZ directory as a single gzipped file. It’s over 300MB compressed, and almost 2GB uncompressed. The source code for finding expired DMOZ domains is as follows:

  • parsedmoz.rb – ruby script to extract domains from the content.rdf.utf8 file
  • findMistakes.pl – perl script that checks for a DNS ‘A’ record (good indication that it doesn’t exist) and prints out the domain if it lacks one. Uses memoization (caching) to remember if it has already seen a domain. There are lots of dupes in the DMOZ dump.
  • checkDomain.sh – bash shell script that prints out which domains are available for registration.

Note: No, I don’t know why I wrote all three scripts in different languages. I think someone made a stupid remark to me that day about which language was Teh bEsT 3v4r! I should’ve written the whole thing in Haskell for fun.

Also, each script takes its input from stdin, and send its output to stdout. That way you can chain them if you want. For example:

$ cat content.rdf.u8 | ./parsedmoz.rb

Will give you a list of all host names in the DMOZ dump.

$ cat content.rdf.u8 | ./parsedmoz.rb | ./findMistakes.pl

Will give you a list of all domains from the DMOZ dump that don’t resolve anymore (perhaps a trip to archive.org is in order?) Piping these through checkDomain.sh will then give you a list of domains that are available for registration.

Unfortunately, the whois servers for .org domains tend to limit the number of queries that you can make per hour. You can either run your whois lookups through SOCKS proxies, split your list and run on multiple servers, or even just code delay between lookups to get around them.