**Analyzing train travel times and fares in Singapore**

*November 5, 2018*

I set out to try to understand the accessibility of different train stations in Singapore and how travel times and fares vary by stations. The purpose of the analysis is to identify areas that are more convenient, based on train as the sole mode of transport. There are many ways to do the analysis:

- Looking at the number of stationsÂ reachable within X mins

- Looking at the range/ distribution of travel times

- Looking at the range/ distribution of travel faresÂ

Â

Â

I placed equal weights on each station, i.e. each station is as important as another. This might not be realistic as some locations are not as frequently accessed as the rest. However, this is a goodÂ starting point and a general approach. I have also focused more on using travel times as the basis for comparison instead ofÂ travel fares as fares are going to change 29 Dec 2018. It is also important to note that the travel times here do not include waiting and transfer times, nor delays. This might not be trivial butÂ I'm assuming an equal probability of occurrence for each station which is probably not true. There is a need to analyze the probability of delay events before we can build in this prior, of which,Â I do not have this dataÂ at the moment. How I collected the travel time and fare data can be found in a previous post.

Â

So, IÂ did a dashboardÂ for users to compare the reach from a particular station with another within a certain time limit to help suggest possible meeting points through overlapping regions.Â The interactive dashboardÂ can be accessed here.Â It is interesting to note that Little India has the most number of stations reachable within 15 mins. For the most number of stations reachable withinÂ 8Â to 18 mins, the top 3 stations are Dhoby Ghaut, Bugis and Little India. Serangoon became the top station with most number of reachable stations within 19 mins. This could help in the planning of a property purchase, as convenience is a major consideration. It would also be interesting to correlate this accessibility with property prices.Â Â

I did a second dashboard to present a macro- and micro-level view of the distributions of travel times and fares.Â There seemed to be a trend towardsÂ fancy visualizations and I don't quite endorse that. Most of the time, simple charts such as bar charts, line graphs or scatterplots would be able to present findings effectively. Nonetheless I decided to give radial charts a go. If you look at it, it is really only useful if we're interested inÂ looking at the big picture, instead of relying on it for statistics or numbers. WeÂ can see the overall spread of travel times for different train lines, indicated by the different colorsÂ (I have grouped some together eg. SW and SE, PW and PE, CE and CG). Also, we can tell that the BP line (indicated by grey) has the least number of stations followed by the NE line (indicated by purple).Â Further, the stations on SW/ SE line (indicated by pink) and PW/ PE line (indicated by teal) have a pretty similar distribution of travel times, while the EW line (indicated by green) has the greatest variance, where the spread looked like aÂ heart-shaped figure.Â A more homogeneous spread would resemble a triangle shape, more similar to the BP line (indicated by grey).Â Through adjusting the maximum travel time, IÂ foundÂ ALL stations on the CC line (indicated by yellow) being able toÂ reach all other stations within 62 mins.Â Again, there are many limitations of such a chart. It's not obvious how many (or proportion of) stations can be reached within 15 mins, and it's hard to filter onÂ a particular station to look at. Hence it really depends on one's objectives to be achieved when designing visualizations.

Â

Moving on from the radial chart, the two charts beside provide a granular breakdown on the travel times and fares station by station. It is easier to see the distribution for each line and each station. The distributions of the travel fares are pretty interesting; eachÂ has aÂ distinctÂ shape, and they are especially differentÂ across different lines. The faresÂ are definitely not normally distributed and hence computing travelÂ fares on average for someone traveling from a particular station might not be meaningful. It would be useful to study how much a person staying in each region is spending on train travel and perhaps this could better influence the design of different tiers of welfareÂ packages in supplementing transportation costs. Of course, there are many ways to do this, either through a cap or a percent discount on each ride; in terms of implementation, this will probably have to be tied to an identifiableÂ card.Â The interactive dashboard for the below can be foundÂ here.

Â Â Â

Â

Â

Separately, before I finished collecting data on all 157 stations, I decided to do a preliminary analysis on the distributions after collecting data onÂ 40 stations. Boxplots are very useful in assessing range/ distribution where the median is more robust to outliers and hence making comparisons more meaningful. It is also easy to see which stations have lower median travel times and fares than the others. Again, this isÂ based on the assumption of equal importance ofÂ each station. Should we place higher weightings on certain stations, the distributions will change.

Â

On the other hand, scatterplots are useful to help us understand general trends and also in identifying outliersÂ (and possible data entry errors?). We can see a logarithm relationship with a plateau effect. There is definitely aÂ potential to study the impact of differentÂ pricing strategies onÂ revenue, such as a stepped increase or a time-based instead of distance-based feeÂ structure. This would make fare estimation easier but also I think this would translate to higher costs for commuters.Â Â

Â

Â

This post was also published in Towards Data Science on Medium.Â

Â

Recent Posts