Own web analytics for startups – Part VI

Part VI – what we will collect and how. How we use sessions and cookies in our web analytics journey.

If you have been following our train of thought, you have probably figured out that we are ready to open ‘a nice Italian restaurant in the web space’, as DHH called web services that are profitable, self-sufficient, grow organically, satisfy their customers and are not preoccupied with catching up with and overtaking Facebook. Besides the actual web service, which will launch under the name Nerrvana, we are working on a marketing website, a forum that will most likely* run on phpBB, ‘Ideas’ and ‘Answers’ sections inside the forum, and a blog on the WordPress platform.

Alex is working on integrating authorization between our product and the phpBB forum. We have decided not to integrate the blog with the forum and the product, and only to make commenting on the blog easier for authorized users: we will fill in their email address in the comment form when they want to write a comment. If the blog becomes popular and gets lots of comments (we hope so), we will integrate it with the forum in a way that brings the forum's order and visualization power to blog discussions. We will write more about tight integration between the blog and the forum when, and if, it becomes a requirement.

So the first Deep Shift Labs product build consists of:
- product
- forum (Forum + Ideas + Answers live in one place)
- blog
- marketing site

Registration of users and their access rights in the system are described in detail in a previous post. In short: the product and forum registration pages (Ideas and Answers live inside the forum) redirect an unauthorized user to a single registration page. A new user is thus unambiguously informed that registration gives access to the forum, Ideas and Answers. In addition, the user gets access to our product on a 30-day fully functional trial.

During the 30-day trial, or on your first visit after it expires, you can:
(1) switch to the free version with limited resources
(2) upgrade to the full version
(3) delete your account

Only when the user chooses the third option does he lose access not only to our product but also to the forum and, as a consequence, to Ideas and Answers.

If the user has not made a decision after the 30-day trial expires, we will not send an automatically generated email but will write to him individually. If the user never replies, his account, already deactivated by then, will eventually be deleted. If the user logs in after the trial period has ended, he is offered a choice between (1), (2) and (3). After a user's long absence, we will take care of the account cleanup (3) ourselves.

For our web analytics we will need:

- a unique identifier linking a continuous chain of a user's transitions from page to page. It works until the browser is closed or cookies are cleared. We will use the standard PHP session identifier for it. In PHP it is created by calling session_start(). If we want to be confident in the uniqueness of session identifiers we can use this approach.

- a unique identifier tied to a browser on a single computer. Of course, several people could be using the same account on the same computer. This identifier works until cookies are cleared or, strictly speaking, until its expiration date in 2037 is reached.

In PHP it is created this way:

 setcookie('vz', $very_unique_hash, 2114398800, '/', 'nerrvana.com', 0, 1);
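
One way to produce the $very_unique_hash value is sketched below. This is a minimal sketch under our own assumptions, not the code we run: the function names generate_visitor_id() and ensure_visitor_cookie() are hypothetical, and random_bytes() requires PHP 7 or later.

```php
<?php
// Hypothetical helper: returns a 32-character hex string built from
// 16 cryptographically secure random bytes (requires PHP 7+).
function generate_visitor_id()
{
    return bin2hex(random_bytes(16));
}

// Hypothetical helper: reuses the visitor ID from the 'vz' cookie if the
// browser sent one, otherwise generates a new ID and sets the cookie.
// $cookies is passed in (normally $_COOKIE) so the logic is easy to test.
function ensure_visitor_cookie(array $cookies)
{
    if (!empty($cookies['vz'])) {
        return $cookies['vz'];
    }
    $visitorId = generate_visitor_id();
    // Same arguments as the setcookie() call above (0 and 1 written as
    // booleans); skipped on the command line or once headers are sent.
    if (PHP_SAPI !== 'cli' && !headers_sent()) {
        setcookie('vz', $visitorId, 2114398800, '/', 'nerrvana.com', false, true);
    }
    return $visitorId;
}
```

Calling ensure_visitor_cookie($_COOKIE) early in a common include file would return the same value on every later request from that browser, for as long as the cookie survives.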

Now we add this code to the common include files of the marketing site, blog, forum and product:

if (isset($_COOKIE['vz']))
    apache_note('visitor_id', $_COOKIE['vz']);
if (isset($_COOKIE[session_name()]))
    apache_note('session_id', $_COOKIE[session_name()]);
if (isset($_SESSION['user_id']))
    apache_note('user_id', $_SESSION['user_id']);

As you can see from the code, we use apache_note() to pass analytics data to the Apache log file. To get it written to the log, we set the Apache log format as follows:

LogFormat "%h %l %u %t \"%r\" %>s %b visitor_id:%{visitor_id}n session_id:%{session_id}n user_id:%{user_id}n" combined

As a result our log file will contain something like this:

127.0.0.1 - - [14/Dec ...] "GET /test.php HTTP/1.1" 200 1520 visitor_id:- session_id:- user_id:-
127.0.0.1 - - [14/Dec ...] "GET /test.php HTTP/1.1" 200 1520 visitor_id:0365febba979c9d2640e1cee5890c1ff session_id:r2rjjetasl1bidece4f4p0jik5 user_id:-
127.0.0.1 - - [14/Dec ...] "POST /test.php HTTP/1.1" 200 1968 visitor_id:0365febba979c9d2640e1cee5890c1ff session_id:r2rjjetasl1bidece4f4p0jik5 user_id:341890

Look at the first line: this is the visitor's first hit. If we leave the code as is, the visitor's entry point is logged without any identifiers and cannot be analyzed, and this, at least to us, is very important. To remedy the situation, so that we do not have to wrestle later with identifying the first log entry of a visitor without a cookie, we will change the code a little. Moreover, since such a first hit is easy to detect, we can conveniently mark it to simplify later parsing:

if (isset($_COOKIE['vz']) && !empty($_COOKIE['vz'])) {
     apache_note('visitor_id',$_COOKIE['vz']);
} else { // First visit
     // shown here for demo purposes only ...
     // ... the real value will be generated above the IF statement and a variable will be used here
     apache_note(
         'visitor_id',
         'smth._like_http://stackoverflow.com/questions/181159/generating-a-unique-id-in-php'
      );
}
 
if (isset($_COOKIE[session_name()]) && !empty($_COOKIE[session_name()])) {
    apache_note('session_id', $_COOKIE[session_name()]);
} else {
    apache_note('session_id', 'NEW#' .session_id());
}
 
if (isset($_SESSION['user_id'])) {
    apache_note('user_id', $_SESSION['user_id']);
}

After this correction we will no longer see lines like this in the logs:

127.0.0.1 - - [14/Dec ...] "GET /test.php HTTP/1.1" 200 1520 visitor_id:- session_id:- user_id:-

However, search engine spiders will leave plenty of such “NEW#” marks while crawling, because they do not send cookies back; we separate them from human visitors using the User-Agent and IP address.
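
A minimal sketch of how such hits could be classified during log analysis. The function names and the list of User-Agent substrings are our assumptions for illustration only; a real list would be longer and would be combined with known crawler IP addresses. The sketch also assumes the User-Agent is available to the analysis, for example by adding %{User-Agent}i to the LogFormat, which the format shown earlier does not include.

```php
<?php
// True when the logged session_id field carries the "NEW#" prefix,
// i.e. the request arrived without a session cookie.
function is_first_hit($sessionField)
{
    return strncmp($sessionField, 'NEW#', 4) === 0;
}

// Crude bot check on the User-Agent string. The substrings below are
// illustrative assumptions, not a complete list.
function looks_like_bot($userAgent)
{
    $needles = array('bot', 'crawler', 'spider', 'slurp');
    foreach ($needles as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// A hit counts as a human entry point when it is a first hit
// and the User-Agent does not look like a crawler.
function is_human_entry_point($sessionField, $userAgent)
{
    return is_first_hit($sessionField) && !looks_like_bot($userAgent);
}
```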

See? The application begins to tell us its story. We only need to learn to listen and interpret.
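
Listening starts with pulling the three fields back out of each log line. Below is a minimal sketch, assuming the LogFormat shown earlier; the function name parse_analytics_fields() is hypothetical, and a field logged as "-" is returned as null.

```php
<?php
// Extracts visitor_id, session_id and user_id from one access log line
// written with the LogFormat above. Returns null if the line does not
// end with the three fields; a field logged as "-" becomes null.
function parse_analytics_fields($line)
{
    $pattern = '/visitor_id:(\S+) session_id:(\S+) user_id:(\S+)\s*$/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    $fields = array(
        'visitor_id' => $m[1],
        'session_id' => $m[2],
        'user_id'    => $m[3],
    );
    foreach ($fields as $name => $value) {
        if ($value === '-') {
            $fields[$name] = null;
        }
    }
    return $fields;
}
```

Grouping the parsed rows by visitor_id and then by session_id gives the chain of page transitions described at the start of this post.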

Comment

Analysis is not that easy. I asked a question here expecting a quick answer, but I was wrong. I think there is a gap between people able to answer questions like ‘How do I transform the parameters of the F distribution?’ and those who use GA results without really understanding them. There is a lack of people who can pose simple questions and answer them correctly using heuristics and statistical methods.

To make it all work, we first need to set the cookies properly. Let’s summarize what will be stored in the cookies:

- version number of the site
- product version – with a planned addition of Events to our web analytics we no longer need to store this information in a cookie
- secret hash for autologin (more details in comment block below)
- session ID
- visitor ID

Comment

setcookie('uz', $hash_value_identifying_user, 2114398800, '/', 'nerrvana.com', 1, 1);

The secret autologin hash makes it possible to implement the ‘Remember me on this computer’ functionality. The cookie storing it is created during authorization when the ‘Remember me on this computer’ checkbox is checked. This cookie is deleted when you explicitly log out, or when the hash value turns out to be no longer valid. Example: we logged in with autologin on computer 1, then logged in from computer 2 and changed the password. Now navigating from page to page on computer 1 will first redirect you to the login page, because the autologin cookie hash on computer 1 has become invalid. If you provide a valid password, you will be taken to the page you were heading to.
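
The invalidation on password change can be sketched as follows. This is only one possible scheme and an assumption on our part, not a description of the code behind the 'uz' cookie: here the cookie value is the user ID plus an HMAC over the user ID and the current password hash, so changing the password automatically invalidates every older cookie. All function names and the $secret server key are hypothetical; hash_hmac() and hash_equals() are standard PHP functions (hash_equals() needs PHP 5.6 or later).

```php
<?php
// Builds the value stored in the autologin cookie: "<user id>:<hmac>".
function make_autologin_value($userId, $passwordHash, $secret)
{
    $mac = hash_hmac('sha256', $userId . '|' . $passwordHash, $secret);
    return $userId . ':' . $mac;
}

// Returns the user ID if the cookie value is still valid, null otherwise.
// $loadPasswordHash is a callback that returns the user's current
// password hash (or null for an unknown user), e.g. from the database.
function check_autologin_value($cookieValue, $loadPasswordHash, $secret)
{
    $parts = explode(':', $cookieValue, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return null;
    }
    $userId = (int) $parts[0];
    $currentHash = call_user_func($loadPasswordHash, $userId);
    if ($currentHash === null) {
        return null;
    }
    $expected = make_autologin_value($userId, $currentHash, $secret);
    // Constant-time comparison to avoid leaking the expected value.
    return hash_equals($expected, $cookieValue) ? $userId : null;
}
```

With such a scheme, the cookie left on computer 1 in the example above fails the check as soon as the password is changed from computer 2, and the user is sent to the login page.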

The cookies logic during page generation on a server looks as follows:

In the next post of our web analytics series, according to our plan (last paragraph), we will talk about integrating our system with the forum (phpBB), the blog (WP) and the marketing site, and will introduce the code injected into these systems to collect analytics data.

* – while writing the previous post, I stumbled again upon the Vanilla forum, which has become considerably prettier and got a voting plug-in that can easily turn into Ideas in our hands. So now we are looking at Vanilla, comparing it with phpBB, and will make a final decision about the forum platform. The next post will be on just this topic.

Own web analytics for startups – Part I | Part II | Part III | Part IV | Part V
