Monday, March 10, 2014

Skytrain Reliability, or How TransLink Made My Head Explode

Alright, this is going to be a long one, and a little mathy, so grab a cup of coffee or three. 

What Started This Off:

One of the benefits of living at Quayside is easy access to New Westminster skytrain station.  I take the train (and a bus) to work every day.  Generally the skytrain service is quite good.  Major delays seem quite rare, but I was getting a feeling that they had been more frequent this winter.  Reliability is my line of work, so I figured I’d take a look-see.  

Disclosure – I have no affiliation with Translink or either Skytrain operating company.  My only interaction with the system is as a rider.  While I have spent nearly 10 years working in heavy industry as a maintenance and reliability professional and have a degree in Mechanical Engineering, I am not certified as a Professional Engineer (yet), and none of this should be considered professional analysis or advice.  This is just an overactive blogger’s exploration of a collision of interests.

At first I looked at a few news articles discussing skytrain reliability (note: in this article Skytrain refers to both the Expo and Millennium lines run by BC Rapid Transit Company, and the Canada Line run by inTransitBC).  These were just regurgitated customer complaints and Translink talking points, so I decided, “no, I need to go to the source” and dug into the actual reports from Translink. Big mistake.  What they reported was so infuriatingly vague and useless I made a project out of this to try and get a real sense of skytrain reliability. 

Highlights from the 2013 Q2 report.  The highlights are colour-coded indicating my rage level at reading them.  Keep in mind this is only for half a year.  

Expo/Millennium Lines:
  • Service reliability (% service hours delivered): 99.6% (Target 99.8%)
  • On-time performance: 95.7% (Target 94.5%)
  • Complaints per million boarded passengers: 6.4 (Target  5.1)
  • Service Hours:  550,415
  • Collisions and derailments per million km:  0
  • Utilization: 18%
  • Service km: 22,181,836
  • Mean Distance Between Vehicle Failures: 563,654 (no units given, presumably km.  If it’s meters, we’re in trouble)
  • Boardings per Service Hour:  68.7
  • Boardings (BpSH * Service Hours) = 38,723,030

Canada Line: 
  • Service reliability (% service hours delivered): Not reported
  • On-time performance: Not reported
  • Complaints per million boarded passengers: 12.3 (Target 2.7)
  • Service Hours:  96,622
  • Collisions and derailments per million km:  Not reported
  • Utilization: 35.9%
  • Service km: 3,381,770
  • Mean Distance Between Vehicle Failures: Not reported
  • Boardings per Service Hour:  209.2
  • Boardings (BpSH * Service Hours) = 20,213,322
  • Contract Adherence Monitoring: 95.8% (Target 98.5%) – How’s that for a clear metric?

So by the time the 2013 reports are out, I should have about 3 heart attacks from reading Translink’s uselessness.  Reporting for the bus fleet was even worse, but that’s a story for another day. 

Why Is The Report So Bad?

The fundamental problem with the reported figures is, “What the bleep does any of this mean?”   The first question that comes up in any proper discussion of reliability is, “How are you defining reliability?”  As Translink does not define it for us, I am left to guess.  There are a tremendous number of questions that impact how the metric is reported vs. what people actually experience. 

On-time performance: there is no publicly available skytrain schedule that I can find – all the schedules I found only define first and last train times, and service frequency during blocks of time.  Customer service tweets regularly say there is no set schedule, just a train frequency.  Does BCRTC/inTransitBC have a set schedule for when a train is due to arrive at each station, or is “on-time” service measured by trains meeting frequencies?  Who knows?!?  In theory you could back out a schedule from the first train time and service frequency, however the frequencies are variable (e.g. 3-4 minutes) and change through the day, so this gets ugly quickly.

So is a late trip one that departs or arrives at a station later than a set clock time?  Or is a late trip defined as a train that misses its headway (e.g. the train is supposed to be four minutes behind the train ahead of it, but is actually running five minutes behind)?  Is it a late trip every time a train is late into a station?  Or is it calculated over an entire run? Or a day?

For example, suppose our train is on the Expo line heading downtown.  It departs King George on time, but due to a mechanical malfunction is late into Surrey Central, falling behind the train in front, and consequently is late into every station down the line.  Is this counted as one late trip or 19?  If it is calculated over the run, at what point or points is lateness determined?  If it’s late to Metrotown but makes up the time by Waterfront, is that trip on time, or not?  The longer the timeframe a trip is measured against, the easier it is to miss small delays, as they get averaged out so long as the train meets the larger schedule.

Are particular kinds of problems not counted against on-time performance?  That is, is performance measured strictly against mechanical or operational issues (i.e. things within Translink’s control), or does any late train count regardless of the reason (e.g. medical emergencies, Godzilla attacks)?  While I certainly sympathize with medical incidents, if somebody having a heart attack on a platform shuts down the train seven times in a week, I want to see that somewhere.

Why is there a two-minute buffer in the on-time performance?  Given that Expo Line trains are often running at 2 minute headways or less, depending on how performance is measured you could rig this calculation to the point that service could be cut in half and it would still measure as perfectly on time.  At least it’s not the airline industry’s 15 minute buffer, but on-time is on-time, not two minutes late. 

The service hours metric raises many similar questions.  Does it mean the stations were open 99.7% of their scheduled times?  That at least one train was running 99.7% of the time?  That trains were on the tracks and not immobilized 99.7% of the time?  And this doesn’t even start us down the road of cost-effectiveness: how does train availability compare with utilization, and what are the opportunity costs due to lateness, or the call-out and emergency costs due to breakdowns?

On-time performance of 96.5%, and 99.7% service hour delivery.  What do those two numbers mean to me as a transit user?  I have no idea.  These are meaningless management metrics meant to look favourable in the annual public reporting and to give Translink a defence whenever Skytrain reliability is questioned.  Is Skytrain reliability bad this year?  We had an on-time performance of 96.5%.  Ninety-six point five!

What I care about as a customer is when I go to a station, will a train show up in 2-4 minutes (or whatever the designated frequency is), and will there be any major delays en route to my stop.  I want to have an idea how frequently the system is significantly slowed down or delayed.  I don’t care as much about a train being held at a station an extra minute for a police incident, or other minor localized delays.  I build buffers into my trip time to deal with that, much like I add buffers to deal with traffic if I’m driving.  What I want to know is how often the whole system is grinding down – a train malfunction forcing single-tracking that delays a whole line, station shut-downs, bus bridges, the works.  I want to know how likely it is my 10 minute train ride is going to take 30 minutes.  As a point of reference, my worst experience was a nominal 15 minute ride taking 1.5 hours.  Translink’s reporting gives me no sense of this at all.

So What Did I Do?

The proper way to do this would be to dig through reams of data from the Skytrain Operations and Maintenance systems, detail every incident regarding the who, what, when, where, how, and why, and do some analysis to figure out how often the system is delayed, why those delays are occurring, and recommend what to do about it.

I don’t have access to this data.  I could likely make a Freedom of Information Request for it, but it would be massive, unwieldy, missing 2/3rds of what I need, and without direct access to their systems to delve into it further and access to the personnel to answer questions around context, I really couldn’t do a proper job of it anyways. 

However, to get a higher level estimate of major problems, I don’t need access to the down and dirty data.  Through Translink’s amazing customer service twitter feed (seriously, if you take transit in Metro Vancouver, follow them!) they announce when there are major problems.  If there’s an issue on any Skytrain line, these guys and gals will be tweeting it.  My giant make-an-ass-out-of-you-and-me assumption for this whole analysis is this:

If the delay isn’t bad enough for someone to tweet about, it’s not bad enough for me to care about.

This goes back to my earlier point.  I have buffers in my travel schedule to deal with the minor problems. It’s the major ones that concern me. 

Now just because Translink tweeted it doesn’t necessarily mean I will count an incident as a major delay, and vice versa.  Some judgment is involved.   Also, I did see at least one tweet mention that the Customer Service folks don’t get proactively notified unless the delay is estimated at 10 minutes or longer. 

So basically, I scoured Translink’s twitter feed for skytrain problems.  I looked for a tweet indicating a major skytrain incident had occurred and was causing delays.  Then I looked for a corresponding tweet indicating it had ended.  The time difference between the two tweets gives the approximate duration of the event.  Knowing the duration of events and the scheduled hours of service, I can estimate how often the skytrain system is under some kind of major slowdown or delay.  Even better, many of the tweets indicate where the incident occurred and a reason for the delay, letting me dive into things to a small degree. 
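For the curious, the pairing step boils down to something like the sketch below.  The keyword lists and the sample tweets are made up for illustration (the real feed needed plenty of human judgment, as described next), and the 10-minute floor from my later assumptions is baked in:

```python
from datetime import datetime

# Rough sketch of turning a tweet stream into incident records.
# START_WORDS / END_WORDS and the sample tweets are hypothetical.
START_WORDS = ("delay", "problem", "issue")
END_WORDS = ("resolved", "cleared", "regular service")

def pair_incidents(tweets):
    """tweets: list of (timestamp, text) pairs, oldest first.
    Returns (start, end, duration_in_minutes) records."""
    incidents = []
    start = None
    for when, text in tweets:
        low = text.lower()
        if start is None and any(w in low for w in START_WORDS):
            start = when                      # incident announced
        elif start is not None and any(w in low for w in END_WORDS):
            minutes = (when - start).total_seconds() / 60
            incidents.append((start, when, max(minutes, 10)))  # 10-minute floor
            start = None                      # incident cleared
    return incidents

feed = [
    (datetime(2013, 1, 7, 8, 2), "Expo Line delay at Columbia due to a train issue"),
    (datetime(2013, 1, 7, 8, 41), "Expo Line: regular service has resumed"),
]
print(pair_incidents(feed))  # one incident, 39 minutes long
```

In reality no keyword list survives contact with a live customer-service feed, which is why the assumptions below exist.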

There are lots and lots and lots and lots of simplifications and assumptions going on here.  Even with these, there are a number of challenges with the data and analysis.  Namely: 

1. Downloading someone else’s twitter feed is hard.  Seriously – try it some time.  I looked at Translink’s twitter feed early on November 15th, and they apparently had posted 133,822 tweets from the beginning of time.  My google skills were not sufficient to find an easy way to download all of their tweets: one tool worked well for the first 3,500 tweets, but wouldn’t give me anything beyond that.  An old html/xml command didn’t work.  All the other solutions involved learning programming and twitter’s API.  I’m already spending enough time on this.
Solution?   Make Translink do all the hard work for me.  I submitted a Freedom of Information request for all of their tweets up to November 15th, 2013.  After 60ish days I got the tweets from November 21st, 2011 through February 11th, 2014.  According to Translink, Twitter could or would not give them the earlier data (about another 1.5 years’ worth), but they were continuing to pursue it.  Rather than wait, and given assumption 7, I ran with what was provided – November 21st, 2011, through February 11th, 2014 – plus a few days of hand-gathered data from twitter to round it out to February 15th, 2014.
2. Coverage Gaps:  The customer service tweeting hours do not line up exactly with Skytrain operating hours.  On weekdays, the trains start about an hour earlier than customer service, and end 1-1.5 hours later than them. 
Assumption:  Assume any delay that is in progress when Customer Service starts up in the morning has been ongoing from the start of service.  Likewise, if a delay is still ongoing when they close, assume the delay continues until the skytrain shuts down for the night.  If a major delay happened entirely within the pre or post-customer service hours and was not tweeted, for the purpose of this analysis it never happened.  Starting early or running late was an issue in about 5% of all incidents.  There is some fudge factor in here as I frequently found tweets indicating a problem before the official opening time of the twitter service, and so ran with the tweet times there.
3. Delays in announcements.  There is some delay between the time an incident occurs and when customer service tweets the announcement.  Similarly there is a delay between when the system is restored and when the announcement is made. 
Assumption:  assume that the delay before announcing a problem and delay announcing a resolution are the same, and so there is no effect on the incident duration. 
4. Restored means something different to Translink than it does to me.  Translink announces when the technical problem is resolved – the problem train is off the tracks, the medical emergency is cleared, etc.  That doesn’t mean the system is instantly back to normal – trains are still bunched up and overcrowded, stations might still have passups.  Depending on the type and length of problem it could take significant time for the system to stretch out and get back to normal.  
Assumption: when Translink announces the problem is resolved, the system is magically restored to normality, but with a minimum incident time of 10 minutes. 
5. Inconsistent terminology.  Skytrain delays.  Expo Line problems.  C-line issues.  MAJOR SYSTEM WIDE DELAYS.  There is no standard terminology for describing Skytrain problems.  Customer Service is pretty good at putting #Skytrain on really big issues, but not perfect, and definitely not for everything I was after.  I used keywords in various combinations to pull up relevant tweets, and even those required judgment on when it was a major problem vs. a minor single train delay.  Did I miss some incidents because of this?  Most likely. 
6. Other Data Sources.  If I personally experienced a delay or had news reports discussing a major delay, I went with that over what I could estimate from the tweets. 
7. Reporting consistency.  The way the twitter feed has been used has likely changed over the 3.5 years it’s been running.  The account has gone from zero to over 40,000 followers in its time.  I have no way to account for that, so I won’t, but I’m missing the entire first year of usage, so I would expect things were reasonably settled down by the time my data starts.   
8. Anything that says Skytrain is Expo and/or Millennium line.  Translink seems diligent about calling the Canada Line out separately, so there might be a slight bias in favour of the C-line, but I don’t believe it is much.    
9. Sometimes they will report a problem cleared, but no start of a problem, or vice versa.  In that case I assumed the problem started or ended 10 minutes before or after the relevant tweet. 
10. Delays or slowdowns for planned maintenance are not counted as incidents.  So all those evenings for the past year and a half where things slow down for the power rail replacement project don’t affect the numbers.  When the Canada Line goes to single tracking after 11 pm for regular maintenance, it is not included.  If an actual incident occurs and is reported during that time, that is still counted.  I think maintenance announcements represented nearly half the tweets I had to wade through.  
This can get fuzzy.  On January 31st, 2014, the Canada Line had a major problem.  The next three days there was ‘track maintenance’ causing delays.  That maintenance was likely a result of the failure that occurred on the 31st – do I count that as planned maintenance and ignore it, or do I count it against the system?  In this case I was nice and considered it planned maintenance, but realistically delays like that should be counted against system performance. 
11. When a problem was reported over a range or area without specifying where exactly it occurred, I tracked it against the station closest to Waterfront Station on the system map.  If not reported at all, I classified it against the Expo/Millennium Line or the Canada Line in general.  Incidents were tracked with the following data:
a. Date
b. Failure Mode, if any information was given (why it failed, e.g. train issue, medical incident, track intrusion alarm)
c. Train Line and Location
d. Start and End time of the incident, giving a duration

So What Does All That Mean?

Throw all those assumptions and ~107,000 tweets in a blender.  Note that for the remainder of this article, “Expo” or “Expo Line” refers to the combined Expo and Millennium Lines (both lines through Burnaby and the line to Surrey) and “Canada” or “Canada Line” includes both the Richmond and Vancouver Airport segments.  Here is what we get for November 21st, 2011, through February 15th, 2014…

Table 1: Summary of incidents.  All times reported as hh:mm.  MTBF = Mean Time Between Failures

This gives us a little bit of information.  Clearly, there are more incidents on the Expo Lines.  This makes sense – older tracks, older trains, more service, and much less of the system is underground and thus it is more exposed to weather and morons compared to the Canada Line.  What surprised me the most was the overall frequency – a major incident every three days on the Expo Line!
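The “every three days” figure is just the Mean Time Between Failures from Table 1: calendar time in the study divided by the incident count.  A quick sketch (the incident count here is a hypothetical stand-in; Table 1 has the real numbers):

```python
# MTBF = calendar time / number of incidents.
study_days = 817          # Nov 21, 2011 through Feb 15, 2014 (from the text)
expo_incidents = 270      # hypothetical count for illustration
mtbf_days = study_days / expo_incidents
print(f"one major incident every {mtbf_days:.1f} days")  # → every 3.0 days
```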

We can also look at the total duration of delays. 

Table 2: Incident Durations and Overall System Reliability

In this case the reliability is calculated as the percentage of time the system is not operating under an incident, based on an average of 20 service hours per day and 817 days in the study.  I don’t have any significant transit experience to compare this against, but my gut reaction to these numbers is they feel pretty good.  Not that it compares directly, but it is below the 2013 Q2 reported Service Reliability of 99.6% for the Expo Line. 
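The arithmetic behind that reliability figure is simple.  The 817 days and 20 service hours per day are as stated above; the downtime total below is a stand-in number chosen only to illustrate the calculation:

```python
# Reliability = fraction of scheduled service time NOT spent under an incident.
service_hours = 817 * 20              # 16,340 scheduled service hours
downtime_hours = 245.0                # hypothetical total incident duration
reliability = 1 - downtime_hours / service_hours
print(f"{reliability:.1%}")           # → 98.5%
```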

Curiously, the Canada Line tends to have significantly longer incidents, with both Average and Median delays roughly double those of the Expo Line.  The Canada Line had about 80% fewer incidents; it runs about 20% of the service hours and 15% of the service km of the Expo Line, so that matches up nicely.  However, the total duration of its incidents was fully half that of the Expo Line, waaaay higher than the incident counts would suggest.  Both lines had 10 minutes, the shortest delay counted, as the most frequent delay (the Mode).

Figure 1: Histogram of Incident Durations

55% of incidents were resolved in 20 minutes or less, and once you get past 30 minutes the incident durations seem pretty flat.  Let’s plot this out to see what else we can tell from the delay times.

 Figure 2: Median Ranks of Incidents vs. Duration

This is a fairly ugly, but fairly typical chart of incident times from a maintenance and reliability perspective. What we would like to see here is a nice straight, steep line.  That would indicate the delays are short (the steep part) and distributed evenly (the straight part).

What we actually see here, and in most other organizations, is a straight, steep part for the start of the chart, with a kink in the upper half.  That indicates the bulk of the incidents are being resolved in a regular period of time – the Expo Lines have 80% of their incidents resolved in 50 minutes or less, the Canada Line has 70% in an hour or less.  However, once things stretch beyond that mark, the duration gets very uncertain and potentially very lengthy. 
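For anyone wanting to reproduce the chart: the median ranks were computed the standard way, sorting the incident durations and applying Bernard’s approximation (a common estimator; I’m assuming that choice here, and the durations below are made-up examples, not the real incident data):

```python
# Median rank for the i-th of n sorted durations, via Bernard's approximation:
# MR(i) = (i - 0.3) / (n + 0.4).  Plot duration (x) vs. median rank (y).
durations = sorted([10, 10, 12, 15, 18, 25, 30, 45, 70, 180])  # minutes
n = len(durations)
for i, d in enumerate(durations, start=1):
    median_rank = (i - 0.3) / (n + 0.4)
    print(f"{d:4d} min -> {median_rank:5.1%}")
```

A straight, steep run of points means short, evenly distributed delays; the “kink” shows up where the longest durations pull away from that line.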

What To Do About It

Translink’s goal should be twofold.
  1. Get rid of all the stuff that’s happening after the kink.  Long duration incidents are a killer – they cost Translink money in terms of the direct costs of the incident, lost fare revenue, and lost goodwill.  Also they create a public relations problem – once skytrain incidents reach two hours, or stretch through rush hour, they become headline news (but we have 96.5% on-time service!). 
  2. Once the long duration issues are well handled, focus on shortening up the “regular” durations, i.e. cut it from almost all incidents being resolved in under 50 minutes to being resolved in under 30 minutes. 
Interestingly, the Canada Line is less consistent in its repair times (the kink in the chart comes lower down, and the tail stretches out much further).  We’ll get into that as we continue.  First, let’s delve into those long duration incidents, specifically those lasting at least 45 minutes.  Why 45?
  1. That’s right around the kink in the chart – once incidents hit the 45 minute mark, it becomes much less predictable how long it will last. 
  2. Travelling the Expo Line takes roughly 40 minutes end to end.  The Canada Line is shorter.  That means if an incident lasts 45 minutes, it will by definition affect the entire line. 
  3. Major incidents this long are much more likely to get reported and updated on twitter, so the data is likely better for longer incidents. 
Table 3: Summary of Incidents Lasting at Least 45 Minutes

I have to say, this really surprised me.  Basically, Translink has a major Skytrain issue that lasts 45 minutes or longer nearly three times a month!  The long incidents represent only 24% of the total number of incidents, but 76% of the total incident time.  Both the average and median are well removed from the lower limit of 45 minutes – it’s not a raft of issues at 45 minutes with a few outliers lasting longer – the incident duration is very widely distributed, as we saw in Figure 1.  How serious a problem is this? From our earlier 2013 Q2 data:
  • ~58.9 million boardings in Q2 (38.7 million Expo/Millennium plus 20.2 million Canada Line)
  • 3,650 system operating hours, based on 20 hours per day for ½ a year
  • 16,000 boardings per hour (this number seems really high to me, but works out to about 300 boardings per hour per station which feels reasonable.  It also gives ~320,000 boardings per day, which is in line with other Translink reports of ridership)
  • An average fare of $1.86 (from 2012 annual budget)
  • 188.5 hours of lost service
  • $100,000 in repair labour (say four tradesmen @ $75 per hour for 188.5 hours plus two hours for every incident)
  • $100,000 in repair parts (guesstimate: parts $$$ = labour $$$ - probably low but on the right order of magnitude)
Add that all up to get $5.8 million over roughly two years, of which $5.6 million is lost fare revenue.  Now, this overstates the costs, probably to a large degree.  Boardings are not the same as fare-paying passengers.  A lot of the fares will be delayed, or shift to another Translink service, or are monthly passes and so don’t represent lost revenue.  And while not chump change, $5.6 million over two years fits into ~$900 million of total fare revenue for Translink in that period.   Still, we’re likely talking seven figures between costs and lost revenue, and that doesn’t touch the lost goodwill or scaring away potential users.
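Spelled out in code, the estimate looks like this.  All inputs are the rough figures from the bullet list above, with all the same caveats:

```python
# Back-of-envelope cost of the 45+ minute incidents.
boardings_per_hour = 16_000
avg_fare = 1.86            # dollars, from the 2012 annual budget
lost_hours = 188.5         # total duration of the 45+ minute incidents
lost_fares = boardings_per_hour * lost_hours * avg_fare
repair_labour = 100_000    # four tradesmen @ $75/hr, plus call-out time
repair_parts = 100_000     # guesstimate: parts $$ ~ labour $$
total = lost_fares + repair_labour + repair_parts
print(f"lost fares ~ ${lost_fares / 1e6:.1f}M, total ~ ${total / 1e6:.1f}M")
# → lost fares ~ $5.6M, total ~ $5.8M
```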

Switching It Up

All this emphasizes what I said above – to improve skytrain reliability, first Translink needs to get a handle on why long incidents occur and work to prevent them.  Let’s look at some Pareto Charts for these incidents.
Figure 3: Pareto of Skytrain Failure Modes

What strikes me most about the Pareto are the Rail and Switch incidents, given their large durations and relatively small counts.  Interestingly, they cascade down to both the Expo and Canada Lines, though in slightly different manners.

Figure 4: Pareto of Expo Line Failure Modes

Figure 5: Pareto of Canada Line Failure Modes

When we break it down by line, “Trains” pop up higher in the Expo Line.  Again, this is expected – a good portion of the trains on that line are nearing 30 years old, and even the new trains are younger than the Canada Line trains.  On the Canada Line, a single major problem with the control system resulted in it appearing near the top of the Pareto. 

Rail and switch items show a very high average duration – that is they don’t break as often as the trains, but when they do it takes a long time to fix the problem.  This seems intuitive – a train can be pulled off the tracks, while a rail or switch issue needs to be dealt with in place, and also has higher overhead when being fixed – you need to lock out the power system, workers need to travel to the problem, and then clear the tracks.
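The Pareto charts themselves are simple to build: rank the failure modes by total delay time, then track the cumulative share as you walk down the ranking.  A sketch with made-up stand-in tallies (the real numbers come from the incident data):

```python
# Pareto ranking of failure modes by total delay minutes.
# The counts and minutes here are hypothetical, for illustration only.
incidents = {                    # mode: (count, total minutes of delay)
    "Train": (111, 3300),
    "Intrusion": (58, 1330),
    "Medical": (43, 1330),
    "Rail/Switch": (12, 1600),
    "Police": (16, 820),
}
ranked = sorted(incidents.items(), key=lambda kv: kv[1][1], reverse=True)
grand_total = sum(mins for _, mins in incidents.values())
running = 0
for mode, (count, mins) in ranked:
    running += mins
    print(f"{mode:12s} n={count:3d}  {mins:5d} min  cum {running / grand_total:.0%}")
```

Note how in this toy data Rail/Switch ranks second on minutes despite the lowest count, which is exactly the pattern discussed above: rare failures with very long repair times.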

This helps explain the longer tail of the Canada Line incidents compared to the Expo Line.  The Canada Line has proportionally more Rail and Switch issues, as well as the major Control issue.  Given that the Canada Line is less than five years old, I wasn’t expecting such a high rate of those issues.  Had this been the year 2010/2011 I would have explained these issues as infant mortality.  You would expect a higher incident rate as you worked out the kinks in a new system.

But nearing five years in, those kinks should be worked out, particularly since the Canada Line uses traditional electric train design, not fancy Linear Propulsion.  It seems more likely that there are some design issues with the Canada Line track/switching system.  The switch incidents are split evenly over the years, but fall exclusively within the months of January-May.  There’s not enough information to say what the problem is, but if I had to take a guess I would venture there’s an issue in terms of robustness regarding weather and/or cold, compounded by the complicated track geometry near Bridgeport.

Two incidents accounted for nearly 1/3 of the total time for rail and switching issues: a problem at Richmond-Brighouse on January 31st, 2014 and a problem at New Westminster on April 25th, 2013.  Of the remaining issues, nearly half happened at Bridgeport and Columbia stations, with the rest scattered about.  Again, this makes sense.  As confluence points, both stations have a high switching frequency, and Bridgeport has complicated track geometry plus the additional complications of the maintenance centre there.

These stations have had a switch/rail failure on average every four months between them.  We know the rail system undergoes nightly inspection, but it appears a significant portion of those inspections is directed towards the linear propulsion rail and not necessarily the power or traction rails.  Switches and rail are not new technology, and there are numerous papers on rail and switch maintenance.  Other transit agencies certainly have rapid transit lines with frequent switching.  How has Translink adopted best practices from the rail industry in general and from other transit agencies to manage this?  Given the high stress at these stations, maybe best practices are not enough and Translink needs to go beyond them to maintain a reliable system. 


Trains are by far the most frequent cause of grief, both for long duration (20 incidents) and when counting everything (111 incidents).  The average duration for Train incidents is 30 minutes, with a median of 15 minutes. 

Train Incidents are all over the map. As mentioned, many Expo Line trains are getting old and that is reflected in the higher number of issues on the Expo Line.  There are certainly analyses that can be done on the trains to ensure they are being maintained to optimal reliability (Reliability Centered Maintenance being the traditional one).  However, there are many ways to optimize maintenance, and no one involved says to what end they are directing their maintenance.  For example, you could optimize maintenance to
  • Maximize train reliability to minimize service disruptions
  • Minimize direct maintenance costs while maintaining a set level of reliability
  • Minimize total system costs (including emergency repair costs, lost fare revenue, etc)
  • No optimization, just follow the schedule in the maintenance manual
All of these will result in different train reliabilities, direct costs, and extended costs.  Which is the right one to use?  Any of them are valid choices (except for just follow the manual – that’s lazy and rarely optimal for anything) and can be justified.  But it would be nice to know what philosophy is being used and how train (or track for that matter) maintenance has been changed to deal with trains that are 30 years old now. 

Another problem with Trains is many of the train issues are time-outs from people holding the doors open too long at stations.  This was reported about 20% of the time for the long duration incidents, and 25% for the incidents less than 45 minutes (these are likely low – train incidents were not always specified beyond being a train incident).  The best solution to this problem I can think of is to replace the door seals with giant blades that sever anything caught between the doors as they close.  The cleanup crews would have a few more limbs to sweep up, but it would cut down on time-outs significantly. 

Indictables, Ills, and Idiots

People do other stupid things too, like criminal acts and trespassing onto the tracks.  They also get sick sometimes.

Intrusion Alarms are the second most frequent issue after trains, with 58 incidents during the two years.  They tend to be short, with an average duration of 23 minutes but a median of 15 minutes.  There were only 8 incidents longer than 45 minutes.

Depending on the nature of the Intrusion Alarms, improved station design may be able to reduce these incidents.  I find it absolutely insane that you can have a passenger platform open to tracks energized to 600 VDC, and where trains blow past at 30+ kph, with no safeguards other than a yellow strip of paint and a safety switch on the tracks. 

Medical Incidents are also frequent but short – 43 incidents, 31 minute average, and seven lasting at least 45 minutes. There’s not much that can be done about Medical Incidents other than having more Skytrain Attendants scattered around to reduce the response time.  I don’t know for certain, but I assume all attendants are trained in first aid.  I suppose we could add biometrics to the Compass Card and do a medical scan before you enter the station.   

Police incidents are less frequent still, with only 16 incidents (including the single incident that occurred at YVR-Airport Station).  They tend to be longer, though, with the average duration 51 minutes.  One of the supposed benefits of faregates is a reduction in criminal activity – that should show up in a reduction in police incidents in the years to come. 

Location, Location, Location

Not too many surprises here:  the incident rate along the combined portion of the Expo/Millennium line is about twice as high as the other segments.  It has older tracks, older trains, and it runs twice as many trains as the other lines – a recipe for trouble.  Columbia and Bridgeport stand out due to their track/switching issues.  Edmonds gets a lot of train incidents, presumably since the maintenance centre is there – how many of those truly occurred at Edmonds compared to how many are just being reported there is anyone’s guess, particularly as Bridgeport doesn’t show this.    

Figure 6: Number and Type of Incidents by Location - click for  larger version.  Map by Matt Lorenzi

Somehow Brentwood, Braid, Lake City Way, Surrey Central (seriously?) and King Edward escaped scott free.  Scott Road, however, was another story – this was a big surprise when looking at the location data.  It tied Columbia for the most incidents, but I’m not sure what is driving it.  It’s largely Train incidents by count, so people holding the doors perhaps, but Intrusions, Trains and Police Incidents all had about the same total duration. 

Other surprises were Patterson and Olympic Village.  With Patterson the rest of the line is fairly consistent at about 10 incidents per station along that section, while Patterson has only two.  Olympic Village reversed that, with seven incidents compared to the average of two along the rest of that section of the Canada Line. 

Now, onto what prompted this insanity in the first place.

How Bad Was This Winter?

Figure 7: Incidents by Month and average monthly incident rate for winter months.  Non-winter months greyed out.  Note:  November 2011 not included, February 2014 estimated from 1/2 month of data

Well, if we look at the average monthly incident count and incident length over winter (loosely defined as November to February), both are trending up.  The number of monthly incidents of at least 45 minutes decreased slightly in 2014, but not back to 2012 levels.  This is where the missing data from November 2011 would have come in handy – though I don’t know if I could survive combing through another year of twitter data – since it would have made a nice comparison against the massive spikes there in 2012 and 2013.  Why is there such a spike in November?  I’m not sure – the incidents are a typical mish-mash, mostly Trains, then Intrusions, and a smattering of others.  Maybe it reflects a surge of ridership with the first winter storms.

Incident duration follows the same trend: winter 2012 averaged about 40 minutes, while 2013 and 2014 were right around 60 minutes.  Something else interesting on the chart is the uptick of incidents in springtime.  The incidents occurring in March and April of 2012 and 2013 appear to be a general mish-mash, with no particular type standing out.  Spring storms, mayhaps?
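For the curious, the winter grouping used above can be sketched in a few lines of Python.  The incident list here is a hypothetical placeholder standing in for the log parsed from Translink’s twitter data – these timestamps and durations are made up for illustration, not real figures:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident log: (start timestamp, duration in minutes)
incidents = [
    ("2012-11-05 08:10", 35),
    ("2012-12-17 17:40", 70),
    ("2013-11-22 07:55", 50),
    ("2014-01-09 16:30", 65),
]

WINTER_MONTHS = {11, 12, 1, 2}  # November through February, per the post

by_winter = defaultdict(list)
for start, minutes in incidents:
    dt = datetime.strptime(start, "%Y-%m-%d %H:%M")
    if dt.month in WINTER_MONTHS:
        # Assign Jan/Feb incidents to the winter that began the previous November
        season = dt.year if dt.month >= 11 else dt.year - 1
        by_winter[season].append(minutes)

for season in sorted(by_winter):
    durations = by_winter[season]
    print(f"Winter {season}/{season + 1}: {len(durations)} incidents, "
          f"avg {sum(durations) / len(durations):.0f} min")
```

The only subtlety is rolling January and February back into the winter that started the previous November, so each “winter” is counted as one season rather than split across two calendar years.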

Figure 8: Failures during winter

  • Longest stretch of time with no incidents:  19 days: July 18th to August 5th, 2013
  • Longest stretch of time with no incidents > 45 minutes:  61 days (twice): May 3rd to July 2nd, 2012 and August 27th to October 26th, 2012. 
  • Incidents occur most commonly on Mondays and in the afternoon rush.  Sundays appear eerily quiet.  
Figure 9: Incidents by Start Time

Figure 10: Incidents by Day of the Week

So What?

So, to answer the question: was winter 2013/2014 worse than previous years?  Yes, it was.  More incidents, and longer ones, mean more grief for passengers.

How good is Skytrain reliability overall?  Well, it really depends on how you look at the numbers.  The optimist would say the combined system is operating normally 98.5% of the time, and the individual lines even better than that.  The pessimist would say there’s a significant issue every 2.5 days, and that major issues lasting more than 45 minutes occur nearly three times a month.  Knowing all this makes me want to add a bit more buffer to my schedule, but it doesn’t fundamentally alter how I will use the Skytrain system.
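The optimist’s and pessimist’s framings come from the same arithmetic.  Here’s a quick back-of-envelope check using round numbers assumed from the post – one significant incident every 2.5 days, with a duration in the middle of the 40–60 minute winter averages:

```python
# Back-of-envelope availability from the post's round numbers.
# Both inputs are assumptions taken from the text, not TransLink figures:
mean_time_between_incidents_days = 2.5   # "a significant issue every 2.5 days"
mean_incident_duration_min = 55          # winters averaged roughly 40-60 min

# Availability = 1 - (downtime per incident / time per incident cycle)
minutes_per_cycle = mean_time_between_incidents_days * 24 * 60
availability = 1 - mean_incident_duration_min / minutes_per_cycle
print(f"System-wide availability ≈ {availability:.1%}")
```

An incident every 2.5 days sounds dire, but because each one averages under an hour, the same numbers work out to roughly 98.5% uptime – which is how both the optimist and the pessimist can be reading the same data.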

There is nothing here that Translink and the operating companies don’t (or at least shouldn’t) already know.  Trains are an issue on the Expo Line.  Track and system issues are a problem on the Canada Line.  My questions for Translink are about what is being done to address these issues.
  • What is being done to fix the problems with the Bridgeport tracks and switches?
  • What parts of the Mark I train refurbishment program will address train reliability issues?
  • What actions are being taken to reduce the number and duration of intrusion incidents?
And similar questions for the other issues.  I would also encourage Translink to be more transparent in their reliability reporting.  Show us what’s really happening; tell us why and what you’re doing about it.  It shouldn’t take a blogger with too much time on his hands to do this.  Finally, my advice if you want to avoid Skytrain issues:  Travel only on Sunday nights during the summer. 

I would like to give a great big sloppy wet kiss to Translink’s customer service team, firstly for providing such a great service, and secondly for providing me a giant set of data to use against them.  Uber thanks to Matt Lorenzi for the killer skytrain map, and my editor wife for paring down my raging rant into a semi-coherent line of thought.  You guys are all the best. 

1 comment:

  1. Interesting and nice work! I’ve read about half of the text so far, and want to ask a question (and suggest some more work, if you’re up for it :) ...
    Would it make more sense to do your Skytrain incident analysis based on the number of riders affected? A 45-minute delay during the rush hour will certainly impact more people than one in the late evening?