Own web analytics for startups – Part VII

Part VII – our plans have changed, and we start implementing it.

In our last post we promised to tell you about the integration of our product (Nerrvana) with our forum (phpBB), blog (WP) and marketing site. We were also planning to show how we added analytics collection code to each component. Our plans have changed since then.
We decided to launch the beta version of our product without a forum, which takes the forum, as well as Ideas and Answers, off our priority list for now. We will add these parts during the beta-testing period. But we need analytics anyway, and it is time to start implementing it. Rephrasing ‘I sing what I see’ into ‘I code what I see’, we get to work.

To warm up and tune in, we will complete a few tasks supporting our project.

1. Import the GeoIP City database into PostgreSQL

We will need it to get visitors’ locations. If you have read our previous posts, you know this is not something we desperately want from web analytics. However, it can be handy for figuring out our clients’ business hours. Keep in mind that city-level info can be very wrong: according to this database I am in Perth, which is not true. I am in Sydney, but my ISP is in Perth. Country-level info is more reliable.

We take the source data from GeoLite City and read ‘Loading the GeoIP City database into PostgreSQL’.

Create tables:

CREATE TABLE locations
(
  id bigint NOT NULL,
  country character(2) NOT NULL,
  region character(2),
  city character varying(75),
  postal_code character varying(15),
  latitude numeric(6,4) NOT NULL,
  longitude numeric(7,4),
  metro_code integer,
  area_code integer,
  CONSTRAINT locations_pkey PRIMARY KEY (id)
);
 
CREATE TABLE blocks
(
  start_ip bigint NOT NULL,
  end_ip bigint NOT NULL,
  location_id bigint NOT NULL
);

Execute:

-- Data is in ISO-8859-15 format
SET client_encoding to "ISO-8859-15";

Put the CSV files in C:\. Delete the first line of each file, which contains the copyright notice, and import them into the database:

COPY locations (id,country,region,city,postal_code,latitude,longitude,metro_code,area_code)
FROM 'C:/GeoLiteCity-Location.csv' WITH CSV HEADER;
 
COPY blocks
FROM 'C:/GeoLiteCity-Blocks.csv' WITH CSV HEADER;

Note that we use the Unix slash (/) in the paths even though we are on Windows.
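
With the data loaded, finding a visitor’s location is a range lookup over blocks. Here is a minimal sketch (the sample address 203.0.113.7 and the dotted-quad-to-bigint arithmetic are our illustration; GeoLite stores addresses as 32-bit integers):

-- Sketch: find the location for a visitor's IP address.
WITH ip AS (
  SELECT split_part('203.0.113.7', '.', 1)::bigint * 16777216
       + split_part('203.0.113.7', '.', 2)::bigint * 65536
       + split_part('203.0.113.7', '.', 3)::bigint * 256
       + split_part('203.0.113.7', '.', 4)::bigint AS n  -- the IP as one number
)
SELECT l.country, l.region, l.city
FROM ip, blocks b
JOIN locations l ON l.id = b.location_id
WHERE ip.n BETWEEN b.start_ip AND b.end_ip;

If you plan to run this on every request, an index on blocks (start_ip, end_ip) is worth adding.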

2. Enhance and improve log formats for Apache and ProFTPD

Before writing code, we make a short stop and create our own Apache log format, based on the standard “combined” type, to make it friendlier for our home-grown web analytics. Here is the “combined” format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

And here is our own type:

LogFormat "%h %{%F %T}t  \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \
visitor_id:%{visitor_id}n session_id:%{session_id}n user_id:%{user_id}n" nerrvana

Differences:

- dropped %l. According to the docs, this is the ‘Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.’ The value is always empty and useless for us.

- dropped %u. It is also always empty for us.

- changed the date/time format from the standard [24/Jul/2011:08:49:07 +0000] to 2011-07-24 08:49:07.

- added visitor_id:%{visitor_id}n session_id:%{session_id}n user_id:%{user_id}n. We wrote about these before. The values will be empty in the logs for now, as we haven’t modified the marketing site and Nerrvana UI pages yet. But this will not stop us – we have many other things to do before we start looking into these values. For now we will insert dummy values when we parse the logs into database tables. A sample hit in the new format is shown below.
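
For illustration, a hit in the nerrvana format might look like this (an invented example – unset Apache notes are logged as a dash):

192.0.2.44 2011-07-24 08:49:07 "GET /features.php HTTP/1.1" 200 8311 "http://www.nerrvana.com/" "Mozilla/5.0 (Windows NT 6.1)" visitor_id:- session_id:- user_id:-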

Now about a few specific things we have to deal with.

The first is the FTP server (ProFTPD), which is part of Nerrvana. It allows users to upload Selenium tests and download results. Nerrvana can work with multiple FTP servers – we built it this way to make the system more flexible and potentially distributed. Each machine runs an FTP server (currently we have only one) and also runs Apache with authorization overlooking the FTP folders, which lets us provide HTTPS access to test results from the UI. Since the FTP servers run on different virtual or physical machines, their Apache logs live there too. When a user clicks a link to browse test results, we lose him in the Nerrvana logs. To solve this problem we use redirect pages, which record the click in the Nerrvana logs and then send the user on to the page he wanted to see.

The second thing is also related to having an FTP server as part of our system. The normal workflow we expect after registration is: create a place to upload files to (we call it a Space), connect to it with an FTP client and upload Selenium tests, then create a schedule and run the tests. In our analytics we want to know how the real workflow deviates from the ideal one. It’s one thing if a client created his Space but never uploaded files, and quite another if he uploaded and then deleted files without ever attempting to launch tests. We need to know this to proactively help a particular client before he abandons our service, or to spot a wider problem many clients face. So we are going to parse a second type of log – the xferlog.

It looks like this:

Tue May 31 16:08:17 2011 0 ::ffff:192.168.1.254 303 /var/lib/trillium/ftproot/demo433/sf_tests/_files/build/class/com/deepshiftlabs/sf_tests/SftestException.class b _ i r demo433 ftp 0 * c

From the manual:

current-time transfer-time remote-host file-size filename transfer-type special-action-flag direction access-mode username service-name authentication-method authenticated-user-id completion-status

Everything is pretty straightforward: client demo433 uploaded (i) the file SftestException.class on the 31st of May at 16:08:17. We are not building a security tracking system, which is why we will store the bare minimum of information to protect our clients’ privacy. With FTP we only need to know the type of activity (upload, download, delete). If files are constantly downloaded, deleted and uploaded again, they are most likely being edited right in the FTP client – we need to hurry up with enabling SSH access for client accounts. If we store just the file extensions, we will be able to see the preferences of a particular client on the one hand, and have statistics on which languages free/paid/heavy/light Nerrvana users work with on the other. A sketch of the minimal table is shown below.
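
Something like this could hold that minimum (a sketch – the table and column names are our assumption, not a final schema; the direction flags are the xferlog ones):

-- Minimal FTP activity table: who did what, when, and to what kind of file.
CREATE TABLE ftp_events
(
  event_time timestamp NOT NULL,
  username   character varying(32) NOT NULL,
  direction  character(1) NOT NULL,  -- i = upload, o = download, d = delete
  file_ext   character varying(16)   -- extension only, no paths or file names
);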

Analytics? Yes. Web analytics? Hard to say – we craft analytics of a special kind, not blindly following the fundamentals of this art, but strongly adhering to common sense, customer care and a commitment to excellence.

On this optimistic note I will finish this post. Next post: log parsing and storage, crawler traffic recognition, log exceptions to ignore, and extracting key phrases from search engine referrers. Did I miss something? E-mail me – help me out.
