Over the past few days, I've been digging even deeper into the taxi data, resulting in: 1) some nifty visualizations, and 2) a first cut at using machine learning to find optimal driver strategies. I'll discuss these results below, but for the purposes of presentation, I'll keep the Python code to a minimum. If you'd like to take a peek under the hood, you can check out the IPython notebooks used to generate the results below in my GitHub repository here (warning: somewhat hacky and uncommented code!).
For this analysis, we'll use the second chunk of the data from http://www.andresmh.com/nyctaxitrips/. First, some basic facts about the data:
Using date range 2013-02-01 00:00:00 to 2013-03-01 01:15:48.
Total of 13,990,176 trips and 32,062 drivers.
Total of 38,355,637.53 miles and 2,727,526.51 hours.
Total of $196,346,621.37 = $163,877,407.41 in fares, $18,311,045.64 in tips, and $14,158,168.32 in tolls/fees.
Next, let's make a simple scatter plot of the pickups and dropoffs. Here, I only use a subset of the data consisting of 500,000 trips (i.e., 1,000,000 pickups and dropoffs):
We can see that the GPS location data is impressively detailed! We can zoom in to some areas of interest---for example, LaGuardia Airport:
We see that the pickups clearly cluster around the taxi stands at each terminal, while the dropoffs trace LaGuardia Road and Central Terminal Drive.
We can also compare the areas around Madison Square Garden/Penn Station and Barclays Center to get an idea of how Manhattan taxi activity compares to that in Brooklyn. Here, I show all trips occurring in the 3-day period from 2/4/13 10AM to 2/6/13 10AM:
We can see that the number of pickups and dropoffs in midtown Manhattan is a few orders of magnitude larger than that in Brooklyn. Also interesting to note is that the GPS accuracy noticeably suffers in Manhattan (most likely due to the signals bouncing off the skyscrapers), while in Brooklyn the resolution is fine enough to show that passengers get picked up and dropped off on opposite sides of Flatbush!
Of course, the data also contains detailed temporal information. We can thus plot the time series of pickups/dropoffs in these two areas:
Note that the scale of the y-axes differs by a factor of 10. We see that sports events and concerts induce clear spikes in the number of pickups at Barclays after they end, while the lub-dub heartbeat of Penn Station is fairly constant (and is relatively unperturbed by the Pistons vs. Knicks game on 2/4/13).
We can also combine both the spatial and temporal information to make some neat animations of the activity in this 3-day period:
Pretty cool! Now let's move on to a more serious and quantitative analysis of the data.
We can combine this spatiotemporal information with the fare information to optimize strategies for drivers. That is, we can use machine learning to determine when and where the highest/lowest performing drivers make their pickups, using the spatiotemporal pickup frequencies as features.
However, this raises the question of how to label driver performance. One might guess that we should use total earnings (fares + tips) per hour, but this doesn't take into account that drivers need to pay for their own gas. We should thus penalize drivers by the total distance they travel.
This then raises another question. The taxi data only gives detailed information (i.e., trip time and distance) for periods when the taxi was occupied, not for those when it was empty. So we can't determine the total distance traveled directly from the data. Instead, we'll approximate the distance traveled while empty as the distance from the previous dropoff point to the next pickup point (using the Euclidean distance between these two points, rather than trying to determine the distance along a realistic route). This doesn't account for any meandering that drivers might do after each dropoff, but it is probably a good indicator of the average scenario.
By assuming the price of gas is roughly $3/gallon and that taxis are averaging around 15 mpg, we can then determine the corrected earnings per hour and find the corresponding 5th and 95th percentiles of drivers:
Limiting to only drivers with more than 300 trips under their belt, this gives us 1153 good and 1153 bad drivers.
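For the curious, here is a minimal sketch of how the gas correction and percentile labeling described above might look. It is not the notebook code: the record layout (`fare`, `tip`, `miles`, `hours`, `pickup`, `dropoff`) and both function names are hypothetical, positions are assumed to be already projected to flat (x, y) coordinates in miles, and the hourly rate is assumed to be taken over occupied hours only.

```python
import math

GAS_PRICE_PER_GALLON = 3.0  # dollars, the rough figure quoted above
MILES_PER_GALLON = 15.0     # rough taxi fuel economy quoted above


def corrected_hourly_earnings(trips):
    """Gas-corrected earnings per hour for one driver.

    `trips` is a time-ordered list of dicts with hypothetical keys:
    "fare" and "tip" in dollars, "miles" and "hours" for the occupied
    trip, and "pickup"/"dropoff" as (x, y) points in miles on a flat plane.
    """
    earnings = sum(t["fare"] + t["tip"] for t in trips)
    occupied_miles = sum(t["miles"] for t in trips)
    # Empty mileage: straight-line hop from each dropoff to the next pickup.
    empty_miles = sum(
        math.dist(prev["dropoff"], nxt["pickup"])
        for prev, nxt in zip(trips, trips[1:])
    )
    gallons = (occupied_miles + empty_miles) / MILES_PER_GALLON
    hours = sum(t["hours"] for t in trips)  # assumption: occupied hours only
    return (earnings - gallons * GAS_PRICE_PER_GALLON) / hours


def label_extremes(hourly_by_driver, percent=5):
    """Return (lowest, highest) driver ids, `percent` percent from each end."""
    ranked = sorted(hourly_by_driver, key=hourly_by_driver.get)
    k = len(ranked) * percent // 100
    return ranked[:k], ranked[len(ranked) - k:]


# Two made-up trips: the driver travels 4 empty miles between them.
example_trips = [
    {"fare": 10.0, "tip": 2.0, "miles": 3.0, "hours": 0.25,
     "pickup": (0.0, 0.0), "dropoff": (3.0, 0.0)},
    {"fare": 20.0, "tip": 3.0, "miles": 6.0, "hours": 0.5,
     "pickup": (3.0, 4.0), "dropoff": (9.0, 4.0)},
]
example_rate = corrected_hourly_earnings(example_trips)
```

In practice one would filter to drivers with more than 300 trips before calling `label_extremes`, so that the two groups are drawn from the same pool of experienced drivers.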
Now that we've got our targets labeled, how should we determine our features? We can do this by splitting up our data into spatial and temporal bins. For the latter, we'll bin in 2-hour intervals each day, and further distinguish between weekdays (Monday-Thursday) and weekends (Friday-Sunday). For the former, we'll first split up the city into spatial zones, using a 40x40 grid:
We'll then focus only on the 250 most active zones, lumping all other zones together (we'll call this Zone 0, or "the Boonies"):
Combining these with our temporal bins, we arrive at 2 (weekday/weekend) x 12 (2-hour time bins) x 251 (spatial zones) = 6024 features.
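The binning above can be sketched in a few lines. Treat this as an illustration under assumptions rather than the code behind the plots: the bounding box passed as `bounds`, the function names, and the choice to normalize each driver's counts into frequencies are all mine.

```python
from collections import Counter
from datetime import datetime

GRID = 40          # 40x40 spatial grid, as in the text
N_TOP_ZONES = 250  # most active zones kept; everything else is Zone 0
N_TIME_BINS = 12   # 2-hour bins per day


def grid_cell(lon, lat, bounds):
    """Map a point to a cell id in 0..GRID*GRID-1, or None if out of bounds."""
    lon_min, lon_max, lat_min, lat_max = bounds
    if not (lon_min <= lon < lon_max and lat_min <= lat < lat_max):
        return None
    col = min(int((lon - lon_min) / (lon_max - lon_min) * GRID), GRID - 1)
    row = min(int((lat - lat_min) / (lat_max - lat_min) * GRID), GRID - 1)
    return row * GRID + col


def build_zone_index(pickup_cells):
    """Rank cells by pickup count; the busiest get zone ids 1..N_TOP_ZONES."""
    counts = Counter(c for c in pickup_cells if c is not None)
    ranked = counts.most_common(N_TOP_ZONES)
    return {cell: i + 1 for i, (cell, _) in enumerate(ranked)}


def feature_index(when, cell, zone_index):
    """Flat index into the 2 x 12 x 251 feature vector."""
    weekend = 1 if when.weekday() >= 4 else 0  # Friday-Sunday
    time_bin = when.hour // 2
    zone = zone_index.get(cell, 0)             # 0 = "the Boonies"
    return (weekend * N_TIME_BINS + time_bin) * (N_TOP_ZONES + 1) + zone


def driver_features(pickups, zone_index, bounds):
    """Pickup frequency per spatiotemporal bin for one driver.

    `pickups` is an iterable of (datetime, lon, lat) tuples.
    """
    vec = [0.0] * (2 * N_TIME_BINS * (N_TOP_ZONES + 1))
    for when, lon, lat in pickups:
        vec[feature_index(when, grid_cell(lon, lat, bounds), zone_index)] += 1
    total = sum(vec)
    return [v / total for v in vec] if total else vec
```

Out-of-bounds pickups and pickups in less active cells both fall through to Zone 0 here, which matches the "lump everything else together" idea but is a simplification.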
We then split our 2306 drivers into training and test sets (with a simple 50-50 split) and fit a linear support vector classifier, which learns one weight for each of the 2 x 12 x 251 spatiotemporal bins. We can also use 5-fold cross-validation to evaluate the classifier's performance; the model achieves an accuracy of 91±3% in predicting the performance of drivers in the test set.
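As a rough sketch of this fitting step, the snippet below runs the same recipe with scikit-learn on synthetic data. The feature matrix is made up (200 fake drivers, 60 fake bins, with the first 5 bins favored by the "good" group), and the value of C is simply the library default, since the value used for this first fit isn't stated above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real 2306 x 6024 feature matrix.
n_drivers, n_bins = 200, 60
y = np.repeat([0, 1], n_drivers // 2)  # 0 = low performer, 1 = high performer
X = rng.random((n_drivers, n_bins))
X[y == 1, :5] += 1.0                   # high performers favor the first 5 bins

# Simple 50-50 split, stratified so both halves keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

clf = LinearSVC(C=1.0, dual=False, max_iter=10000)  # C is an assumed default
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# 5-fold cross-validation as a second look at performance.
cv_scores = cross_val_score(
    LinearSVC(C=1.0, dual=False, max_iter=10000), X, y, cv=5
)

weights = clf.coef_.ravel()  # one weight per spatiotemporal bin
print(f"test accuracy: {test_accuracy:.2f}")
print(f"cv accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```

The `weights` array is what gets mapped back onto the space/time grid; each entry lines up with one column of the feature matrix.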
Plotting these feature weights as a function of space and time then gives us some more cool animations. Here, red spots show zones frequented by high performing drivers at various times of the day, while blue spots show those zones frequented by low performing drivers:
More precisely, the color of each spot corresponds to the weight that each location/time combination contributes to the determination of driver performance. Red indicates positive weight (i.e., making frequent pickups in location/time combinations indicated by red spots contributes to positive driver performance), while blue indicates negative weight (i.e., drivers should avoid those locations at those particular times).
However, there's a lot going on in these animations. To simplify the picture, we can use L1-norm regularization to extract only the most important spatiotemporal features; setting C=0.06 gives us a model with only 284 features with non-zero weights. Despite having fewer moving parts, this simplified model also achieves about 90% accuracy, and it allows us to focus on the locations/times that most strongly determine driver performance:
Interestingly enough, we see that the only time it is profitable to make the trip out to Brooklyn to find fares is after midnight on a weekend---perhaps inebriated hipsters are inclined to tip a little more? Similarly, it looks like there are optimal times to pick up fares from LaGuardia during the week.
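To give a feel for what the L1 penalty does, here is a small sketch comparing L2- and L1-regularized linear SVCs at C=0.06, again on synthetic data rather than the taxi features. The data, the variable names, and the use of scikit-learn's `LinearSVC` are assumptions for illustration; only the value of C comes from the analysis above.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Synthetic data: only the first 5 of 60 bins carry any signal.
n_drivers, n_bins = 200, 60
y = np.repeat([0, 1], n_drivers // 2)
X = rng.random((n_drivers, n_bins))
X[y == 1, :5] += 1.0

# The L1 penalty requires dual=False in LinearSVC.
dense = LinearSVC(penalty="l2", dual=False, C=0.06, max_iter=10000).fit(X, y)
sparse = LinearSVC(penalty="l1", dual=False, C=0.06, max_iter=10000).fit(X, y)

n_dense = int(np.count_nonzero(dense.coef_))
n_sparse = int(np.count_nonzero(sparse.coef_))

# Rank bins by weight, largest positive first, to read off the "top habits".
top_bins = np.argsort(sparse.coef_.ravel())[::-1][:7]

print(f"non-zero weights: L2 = {n_dense}, L1 = {n_sparse}")
print("bins with the largest positive weights:", top_bins)
```

The L1 penalty drives the weights of uninformative bins to exactly zero, which is why the surviving features are easier to interpret. Sorting the remaining weights in each direction is one simple way to produce ranked lists of the most helpful and most harmful bins.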
Examining only the features with large positive weights, we arrive at the top 7 habits of highly effective hacks (i.e., the 7 pickup times/areas that most strongly determine high performing drivers):
1) Weekend 6PM to 8PM Upper East Side
2) Weekday 8PM to 10PM SoHo
3) Weekday 10PM to 12AM SoHo
4) Weekday 6PM to 8PM City Hall
5) Weekend 8PM to 10PM East Village
6) Weekend 4PM to 6PM Yorkville
7) Weekday 6PM to 8PM Lenox Hill
These results seem pretty reasonable! Unsurprisingly, we see many well-to-do and nightlife areas making the list, but it seems that picking up workers at City Hall at the end of the workday also results in profitable trips. How about the 7 habits of less effective hacks?
1) Weekday 4AM to 6AM The Boonies
2) Weekend 6AM to 8AM The Boonies
3) Weekday 12AM to 2AM Port Authority
4) Weekday 2AM to 4AM The Boonies
5) Weekday 8PM to 10PM Lenox Hill
6) Weekday 8AM to 10AM Upper West Side
7) Weekday 2AM to 4AM Grand Central
Again, these results seem reasonable. It looks like late-night pickups at Port Authority and Grand Central aren't very fruitful. And it's definitely not very smart to be hunting for fares out in the Boonies during the wee hours of the morning; we don't need a fancy machine-learning analysis to tell us that!
Of course, one could argue that we didn't actually get that fancy here. We simply used the most straightforward features (location/time of pickups); it is easy to imagine more sophisticated analyses are possible with the data. For example, we could try to determine whether it is better for hacks to 1) hunt for fares, or 2) hang out and wait for fares. This was previously studied in a nice paper, which used full GPS trajectories of taxis in Hangzhou (as opposed to just the pickup/dropoff data, which we are limited to here).
We could also try answering a more difficult question. We've found what the best real taxi drivers do---what about the perfect driver? We can imagine playing the game of optimizing net driver profit, using the spatiotemporal grid of pickup/dropoff frequencies to abstract the game as a time-dependent Markov Decision Process, as was done with Beijing taxi trajectories in this paper.
Clearly, there are many intriguing directions to pursue---so stay tuned for future posts!