Logging your Oyster trips

Since I moved to London at the start of 2010, I've been using a Oyster card to pay for my journeys on buses and Underground trains. I've bought annual travelcards, because they free me from having to think about "topping up", and also because they seemed to be cost-effective with respect to the "pay as you go" option.

Then, about a year ago, I noticed that I could get a list of my recent trips from the TfL website. It was thus necessary that I wrote a program to save that data on a machine under my control. So I did.

Storing the journeys

Let's look at the source for the oyster program.

It uses a SQLite database to store both the actual journey data and the credentials for the website account, so I don't have to write them on the command line, and I don't need a separate configuration file. The database, by default, is called oyster.db and resides in the same directory as the script (line 16). Command-line options (lines 18-31) allow to specify a different database, and to set the credentials on the first run (lines 76-83).

The check for the presence of the database (lines 39-51) is not particularly smart: try to read data, and if it fails, assume the db is new. The database schema (lines 54-75) is also very straightforward: a table for the credentials, and a table for the journeys. The columns for the latter come from the CSV that TfL provides.

To get to the CSV, the script needs to pretend to be a person browsing the website, so we use WWW::Mechanize and log into the site (lines 85-103). After reaching the "Journey History" page (line 105), we should click a link that runs some JavaScript, which is hard to do without an actual browser. Instead, we extract the URL from the JavaScript that would be run (lines 107-113), and download it directly (lines 116-118).

Now we only have to parse the CSV: we use Text::CSV_XS, and DateTime::Format::Strptime for the times (lines 120-129). Looping over all the lines, we convert the times into DateTime object, adjusting for a few minor issues: bus journeys don't have end times, since you don't "tap out" of buses (line 139); also, if you tap in before midnight, but tap out after, the CSV still shows a single date, but the end time will appear to be about 23 hours before the start time (line 144-146).

Inserting into the database is trivial (lines 149-158). Since we set a unique on conflict replace constraint on the journeys table, we don't have to worry about inserting twice the same trip: SQLite will take care of it.

Using the data

As interesting as it is to just look back at where I've been, I found a better use for all those database records: I wanted to check how much I would have spent if I had not been using the annual travelcard.

Calculating that is not exactly easy: the rules are convoluted, not obvious to find, and probably not completely documented. Also, I may have interpreted them wrong. Anyway, I wrote another program to read the database and give me the total cost.

Of course it uses the same defaults and command line options as the other one (lines 130-148).

The main execution flow loops over all the journeys (lines 174-200), grouping them by day; for each day, we store the journey and some aggregate information: the minimum and maximum zones travelled, if there were any tube journeys, and if there were any journeys during peak time. Parsing the "journey description" is annoying but simple (lines 185-192); extracting the day of a journey is trickier (lines 15-24), since TfL considers a day to start at 4:30am. Similarly, a journey is during peak time (lines 26-35) if it starts between 4:30am and 9:30am. I could find no mention of an evening peak time on TfL's website. Mapping a station to its zone should be easy, but TfL does not provide a nice text file with this information, so I had to write it by hand (lines 37-118) taking the data from Wikipedia; the program currently does not account for those stations that, being on the border between two zones, count differently depending on which side you're arriving from.

After that first loop, we compute the cost of each day (lines 257-287): we add the cost of each journey in the day, using the bus fare or the tube fare (peak or non-peak) for the zones travelled (fares on lines 202-254 taken for the website). If the sum for a day is above the daily cap for the set of journeys (only bus, or non-peak for the zone range travelled in the day, or peak &c), we cap it. Adding all the daily costs gives us the total.

I'll have to get a couple more months of data, since I moved from zone 3 to zone 5 last June, but it looks like that, for the journeys I make, the annual travelcard is about 15% more expensive than the "pay as you go" option. Either that, or I got my calculations wrong…

DatesCreated: 2013-04-12 16:53:02 Last modification: 2023-02-10 12:45:24