Monday, March 30, 2015

Week 1: Some Nifty Visualizations + 7 Habits of Highly Effective Hacks

Over the past few days, I've been digging even deeper into the taxi data, resulting in: 1) some nifty visualizations, and 2) a first cut at using machine learning to find optimal driver strategies. I'll discuss these results below, but for the purposes of presentation, I'll keep the Python code to a minimum. If you'd like to take a peek under the hood, you can check out the IPython notebooks used to generate the results below in my GitHub repository here (warning: somewhat hacky and uncommented code!).

For this analysis, we'll use the second chunk of the data from http://www.andresmh.com/nyctaxitrips/. First, some basic facts about the data:

Using date range 2013-02-01 00:00:00 to 2013-03-01 01:15:48.
Total of 13,990,176 trips and 32,062 drivers.
Total of 38,355,637.53 miles and 2,727,526.51 hours.
Total of $196,346,621.37 = $163,877,407.41 in fares, $18,311,045.64 in tips, and $14,158,168.32 in tolls/fees.
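These totals come straight out of the merged DataFrame. Here's a minimal sketch of how one might compute them (the `summarize` helper is mine; the column names are the standard ones from the trip/fare files, e.g. `hack_license`, `trip_distance`, `trip_time_in_secs`):

```python
import pandas as pd

def summarize(taxi_pd):
    """Compute basic totals from a merged trip/fare DataFrame."""
    return {
        'trips': len(taxi_pd),
        'drivers': taxi_pd.hack_license.nunique(),
        'miles': taxi_pd.trip_distance.sum(),
        'hours': taxi_pd.trip_time_in_secs.sum() / 3600.0,
        'fares': taxi_pd.fare_amount.sum(),
        'tips': taxi_pd.tip_amount.sum(),
        'tolls_fees': (taxi_pd.tolls_amount + taxi_pd.surcharge).sum(),
    }
```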

Next, let's make a simple scatter plot of the pickups and dropoffs. Here, I only use a subset of the data consisting of 500,000 trips (i.e., 1,000,000 pickups and dropoffs):
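The actual plots live in the notebooks linked above; as a rough sketch of the approach (with synthetic coordinates standing in for the real pickup/dropoff longitude/latitude columns):

```python
import matplotlib
matplotlib.use('Agg')  #render off-screen
import matplotlib.pyplot as plt
import numpy as np

#synthetic stand-ins for the real pickup/dropoff coordinates
rng = np.random.default_rng(0)
pickups = rng.normal(loc=[-73.98, 40.75], scale=0.02, size=(5000, 2))
dropoffs = rng.normal(loc=[-73.95, 40.77], scale=0.02, size=(5000, 2))

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(pickups[:, 0], pickups[:, 1], s=0.5, color='gold', label='pickups')
ax.scatter(dropoffs[:, 0], dropoffs[:, 1], s=0.5, color='deepskyblue', label='dropoffs')
ax.set_xlabel('longitude (degrees)')
ax.set_ylabel('latitude (degrees)')
ax.legend(markerscale=20)
fig.savefig('pickup_dropoff_scatter.png', dpi=150)
```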

We can see that the GPS location data is impressively detailed! We can zoom in to some areas of interest---for example, LaGuardia Airport:

We see that the pickups clearly cluster around the taxi stands at each terminal, while the dropoffs trace LaGuardia Road and Central Terminal Drive.

We can also compare the areas around Madison Square Garden/Penn Station and Barclays Center to get an idea of how Manhattan taxi activity compares to that in Brooklyn. Here, I show all trips occurring in the 3-day period from 2/4/13 10AM to 2/6/13 10AM:

We can see that the number of pickups and dropoffs in midtown Manhattan is a few orders of magnitude larger than in Brooklyn. Also interesting to note is that the GPS accuracy noticeably suffers in Manhattan (most likely due to signals bouncing off the skyscrapers), while in Brooklyn the resolution is fine enough to show that passengers get picked up and dropped off on opposite sides of Flatbush!

Of course, the data also contains detailed temporal information. We can thus plot the time series of pickups/dropoffs in these two areas:

Note that the scale of the y-axes differs by a factor of 10. We see that sports events and concerts induce clear spikes in the number of pickups at Barclays after they end, while the lub-dub heartbeat of Penn Station is fairly constant (and is relatively unperturbed by the Pistons vs. Knicks game on 2/4/13).

We can also combine both the spatial and temporal information to make some neat animations of the activity in this 3-day period:

Pretty cool! Now let's move on to a more serious and quantitative analysis of the data.

7 Habits of Highly Effective Hacks

We can combine this spatiotemporal information with the fare information to optimize strategies for drivers. That is, we can use machine learning to determine when and where the highest/lowest performing drivers make their pickups, using the spatiotemporal pickup frequencies as features.

However, this raises the question of how to label driver performance. One might guess that we should use total earnings (fares + tips) per hour, but this doesn't take into account that drivers need to pay for their own gas. We should thus penalize drivers by the total distance they travel.

This raises another question. The taxi data only gives detailed information (i.e., trip time and distance) for periods when the taxi was occupied---not for those when it was empty. So we can't determine the total distance traveled directly from the data. Instead, we'll approximate it by assuming the distance traveled while empty is simply the straight-line (Euclidean) distance from the previous dropoff point to the next pickup point, rather than trying to determine the distance along a realistic route. This doesn't account for the meandering that drivers might do after each dropoff, but it is probably a good indicator of the average scenario.
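A minimal sketch of this approximation (the function name and column layout are my own; I convert degrees to miles with a rough factor of 69 miles per degree of latitude, scaling longitude by cos(latitude)):

```python
import numpy as np
import pandas as pd

MILES_PER_DEG_LAT = 69.0  #rough miles per degree of latitude near NYC

def empty_leg_miles(driver_trips):
    """Approximate the distance a driver covers while empty: the straight-line
    distance from each dropoff to that driver's next pickup, with trips
    sorted by pickup time and longitude scaled by cos(latitude)."""
    t = driver_trips.sort_values('pickup_datetime')
    dlat = t.pickup_latitude.values[1:] - t.dropoff_latitude.values[:-1]
    dlng = t.pickup_longitude.values[1:] - t.dropoff_longitude.values[:-1]
    coslat = np.cos(np.radians(t.dropoff_latitude.values[:-1]))
    return float((MILES_PER_DEG_LAT * np.hypot(dlat, dlng * coslat)).sum())
```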

By assuming the price of gas is roughly $3/gallon and that taxis are averaging around 15 mpg, we can then determine the corrected earnings per hour and find the corresponding 5th and 95th percentiles of drivers:

Limiting ourselves to drivers with more than 300 trips under their belts, this gives us 1153 good and 1153 bad drivers.
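In code, the labeling step might look like this (a sketch assuming per-driver aggregates have already been computed; `label_drivers` and its column names are mine, and the $3/gallon at 15 mpg figure works out to $0.20/mile):

```python
import pandas as pd

GAS_COST_PER_MILE = 3.0 / 15.0  #~$3/gallon at ~15 mpg = $0.20/mile

def label_drivers(per_driver, min_trips=300, pct=5):
    """Label top/bottom drivers by gas-corrected earnings per hour.

    per_driver: DataFrame indexed by driver with columns
    'earnings' (fares + tips), 'miles' (occupied + estimated empty),
    'hours', and 'n_trips'.
    """
    d = per_driver[per_driver.n_trips > min_trips].copy()
    d['corrected_per_hour'] = (d.earnings - GAS_COST_PER_MILE * d.miles) / d.hours
    lo = d.corrected_per_hour.quantile(pct / 100.0)
    hi = d.corrected_per_hour.quantile(1 - pct / 100.0)
    good = d[d.corrected_per_hour >= hi]
    bad = d[d.corrected_per_hour <= lo]
    return good, bad
```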

Now that we've got our targets labeled, how should we determine our features? We can do this by splitting up our data into spatial and temporal bins. For the latter, we'll bin each day into 2-hour intervals, and further distinguish between weekdays (Monday-Thursday) and weekends (Friday-Sunday). For the former, we'll first split up the city into spatial zones, using a 40x40 grid:

We'll then focus only on the 250 most active zones, lumping all other zones together (we'll call this Zone 0, or "the Boonies"):

Combining these with our temporal bins, we arrive at 2 (weekday/weekend) x 12 (2-hour time bins) x 251 (spatial zones) = 6024 features.
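One possible way to index these bins (my own sketch, not necessarily the exact encoding used in the notebooks):

```python
N_ZONES = 251      #zone 0 ("the Boonies") + the 250 most active zones
N_TIME_BINS = 12   #twelve 2-hour bins per day

def feature_index(is_weekend, hour, zone):
    """Map a pickup's (weekend flag, hour of day, zone id) to a position
    in the flat vector of 2 x 12 x 251 = 6024 spatiotemporal features."""
    time_bin = hour // 2  #0..11
    return (int(is_weekend) * N_TIME_BINS + time_bin) * N_ZONES + zone
```

A driver's feature vector is then just the count of their pickups falling into each of the 6024 slots.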

We then split our 2306 drivers into training and test sets (a simple 50-50 split) and fit feature weights for each of the 2 x 12 x 251 spatiotemporal bins using a linear support vector classifier. Evaluating with 5-fold cross-validation, the model achieves an accuracy of 91±3% in predicting the performance of drivers in the test set.

Plotting these feature weights as a function of space and time then gives us some more cool animations. Here, red spots show zones frequented by high performing drivers at various times of the day, while blue spots show those zones frequented by low performing drivers:

More precisely, the color of each spot corresponds to the weight that each location/time combination contributes to the determination of driver performance. Red indicates positive weight (i.e., making frequent pickups in the indicated location/time combinations contributes to positive driver performance), while blue indicates negative weight (i.e., drivers should avoid those locations at those particular times). However, there's a lot going on in these animations. To simplify the picture, we can use L1-norm regularization to extract only the most important spatiotemporal features; setting C=0.06 gives us a model with only 284 non-zero feature weights. This simplified model also achieves about 90% accuracy---despite having fewer moving parts---and allows us to focus on the locations/times that most strongly determine driver performance:
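Switching to an L1 penalty is a one-line change in scikit-learn. Here's a small demonstration of the induced sparsity (the C value follows the text, but the data is again synthetic, so the non-zero counts are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

#synthetic data: 100 features, only the first 5 informative
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
y = np.repeat([0, 1], 100)
X[y == 1, :5] += 1.5

#L2 (default) vs. L1 penalty; C=0.06 follows the text
dense = LinearSVC(C=0.06, dual=False).fit(X, y)
sparse = LinearSVC(penalty='l1', C=0.06, dual=False).fit(X, y)

n_dense = np.count_nonzero(dense.coef_)
n_sparse = np.count_nonzero(sparse.coef_)  #far fewer non-zero weights
```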

Interestingly enough, we see that the only time it is profitable to make the trip out to Brooklyn to find fares is after midnight on a weekend---perhaps inebriated hipsters are inclined to tip a little more? Similarly, it looks like there are optimal times to pick up fares from LaGuardia during the week.

Examining only the features with large positive weights, we arrive at the top 7 habits of highly effective hacks (i.e., the 7 pickup times/areas that most strongly determine high performing drivers):

1) Weekend  6PM to  8PM Upper East Side
2) Weekday  8PM to 10PM SoHo
3) Weekday 10PM to 12AM SoHo
4) Weekday  6PM to  8PM City Hall
5) Weekend  8PM to 10PM East Village
6) Weekend  4PM to  6PM Yorkville
7) Weekday  6PM to  8PM Lenox Hill

These results seem pretty reasonable! Unsurprisingly, we see many well-to-do and nightlife areas making the list, but it seems that picking up workers at City Hall at the end of the workday also results in profitable trips. How about the 7 habits of less effective hacks?

1) Weekday  4AM to  6AM The Boonies
2) Weekend  6AM to  8AM The Boonies
3) Weekday 12AM to  2AM Port Authority
4) Weekday  2AM to  4AM The Boonies
5) Weekday  8PM to 10PM Lenox Hill
6) Weekday  8AM to 10AM Upper West Side
7) Weekday  2AM to  4AM Grand Central

Again, these results seem reasonable. It looks like picking up passengers from late night trips to Port Authority and Grand Central isn't very fruitful. And it's definitely not very smart to be hunting for fares out in the Boonies during the wee hours of the morning---we don't need a fancy machine-learning analysis to tell us that!

Of course, one could argue that we didn't actually get that fancy here. We simply used the most straightforward features (the location/time of pickups); it is easy to imagine that more sophisticated analyses of the data are possible. For example, we could try to determine whether it is better for hacks to 1) hunt for fares, or 2) hang out and wait for fares. This was previously studied in a nice paper, which used full GPS trajectories of taxis in Hangzhou (as opposed to just the pickup/dropoff data that we are limited to here).

We could also try answering a more difficult question. We've found what the best real taxi drivers do---what about the perfect driver? We can imagine playing the game of optimizing net driver profit, using the spatiotemporal grid of pickup/dropoff frequencies to abstract the game as a time-dependent Markov Decision Process, as was done with Beijing taxi trajectories in this paper.

Clearly, there are many intriguing directions to pursue---so stay tuned for future posts!

Sunday, March 29, 2015

Week 0: Exploratory Analysis

To learn how to use pandas, I did an exploratory analysis of the public NYC taxi data set, which contains detailed fare and trip information---including things like tip amount and precise GPS/time data for pickups/dropoffs. I worked with the first chunk of the data available at http://www.andresmh.com/nyctaxitrips/. You can also read more about the data set at http://chriswhong.com/open-data/foil_nyc_taxi/.

First, we import the fare and trip data separately, then join them. I had to clean up some of the column header names (removing whitespace).

In [1]:
import pandas as pd
from pylab import *  #the cells below use bare pylab names (hist, linspace, around, ...) throughout

fares_pd = pd.read_csv('trip_fare_1.csv.gz', compression='gzip', nrows=10000000, usecols=[0, 1, 2, 4, 5, 6, 7, 8, 9, 10])
trips_pd = pd.read_csv('trip_data_1.csv.gz', compression='gzip', nrows=10000000, usecols=range(5, 14))

#clean up column headers
fares_old_columns = fares_pd.columns
fares_pd.columns = [header.strip() for header in fares_old_columns]

taxi_pd = pd.merge(fares_pd, trips_pd, left_index=True, right_index=True, how='outer')
del fares_pd, trips_pd
taxi_CRD_pd = pd.DataFrame(taxi_pd[taxi_pd.payment_type == 'CRD']).drop('payment_type', axis=1)
del taxi_pd

Now to massage the tip and fare information to get tip percentages. Since cash fares are missing tip amounts, we'll work with credit-card fares. There are a few interesting complications to note.

First, when paying by credit card in a NYC taxi, the default tip options presented are 20%, 25%, and 30%. This leads to huge peaks in the histogram of tip percentages at these amounts.

In [2]:
hist((100.*taxi_CRD_pd.tip_amount / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)).values, bins=50, range=(0,50))
xlabel('tip percentage')
ylabel('number of fares')
show()

However, the spikes are not as sharp as you might expect. This is because there are two different credit-card payment vendors, which use different formulas to calculate tip amounts---one vendor includes taxes and tolls in the total fare amount, while the other does not. After some digging, I discovered that this was known previously; see, e.g., http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2

There are also rounding errors that smear out the spikes. We can correct for both of these factors, which allows us to distinguish between "manual" tips and "automatic" tips---the latter being those that yield a tip percentage of exactly 20%, 25%, or 30% after corrections. (This almost certainly misclassifies some "manual" tips, but we'll assume that New Yorkers aren't that good at doing math in their heads...) We'll filter out the automatic tips for some of the heatmaps we generate later, so we can see the patterns underlying the large peaks they introduce.

In [3]:
taxi_CRD_pd['tip_frac'] = (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                          (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
#round up/dn for 25% tip truncation of 0.005
taxi_CRD_pd['tip_frac_up'] = (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
taxi_CRD_pd['tip_frac_dn'] = (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
In [4]:
#filter out auto tips for heatmaps
def is_not_auto_tip(taxi_CRD_pd, tip_frac, thresh):
    if tip_frac == 20:
        return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh))
    return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh) &
            (abs(around(taxi_CRD_pd.tip_frac_up, 2) - tip_frac) > thresh) & 
            (abs(around(taxi_CRD_pd.tip_frac_dn, 2) - tip_frac) > thresh))

taxi_CRD_dist_pd = pd.DataFrame(taxi_CRD_pd[(taxi_CRD_pd.trip_distance <= 50) & 
                                            (taxi_CRD_pd.trip_distance > 0) &
                                            (taxi_CRD_pd.trip_time_in_secs > 0) &
                                            (taxi_CRD_pd.fare_amount < 50) &
                                            (taxi_CRD_pd.fare_amount*2 % 1 == 0) &
                                            (taxi_CRD_pd.tip_frac >= 0) &
                                            (taxi_CRD_pd.tip_frac < 50) &
                                            #include only manual tips
                                            is_not_auto_tip(taxi_CRD_pd, 20, 0.001) &
                                            is_not_auto_tip(taxi_CRD_pd, 25, 0.01) &
                                            is_not_auto_tip(taxi_CRD_pd, 30, 0.01)])

Now we bin the data in trip time, total distance, and tip percentage.

In [5]:
dist_max = 12.
time_max = 35.*60.
tip_frac_max = 30.
ndistbins = 15
ntimebins = 15
ntipbins = 30
trip_dist_bins = linspace(0, dist_max, ndistbins)
trip_time_bins = linspace(0, time_max, ntimebins)
tip_frac_bins = linspace(0, tip_frac_max, ntipbins)
In [6]:
cdist = pd.cut(taxi_CRD_dist_pd.trip_distance.values, trip_dist_bins)
ctime = pd.cut(taxi_CRD_dist_pd.trip_time_in_secs.values, trip_time_bins)
ctip = pd.cut(taxi_CRD_dist_pd.tip_frac.values, tip_frac_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)

gb_dist_tip_cnts = s_tipfrac.groupby([cdist.codes, ctip.codes]).count()
gb_tip_time_cnts = s_tipfrac.groupby([ctip.codes, ctime.codes]).count()
gb_dist_time_mean = s_tipfrac.groupby([cdist.codes, ctime.codes]).mean()

mi1 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntipbins-1)], names=['dist', 'tip'])
mi2 = pd.MultiIndex.from_product([range(-1, ntipbins-1), range(-1, ntimebins-1)], names=['tip', 'time'])
mi3 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntimebins-1)], names=['dist', 'time'])

gb_dist_tip_cnts_ri = gb_dist_tip_cnts.reindex(mi1).fillna(0)
gb_tip_time_cnts_ri = gb_tip_time_cnts.reindex(mi2).fillna(0)
gb_dist_time_mean_ri = gb_dist_time_mean.reindex(mi3).fillna(0)

Now for some plots! First we'll explore the relations between trip time, distance, and tip percentage.

In [7]:
figsize(12, 12)
rcParams.update({'font.size': 14})

#frequency of tip percentage vs. distance
subplot(2, 2, 1)
extent = [trip_dist_bins[0], trip_dist_bins[-1], tip_frac_bins[0], tip_frac_bins[-1]]
imshow(np.log10(reshape(gb_dist_tip_cnts_ri.values, (ndistbins, ntipbins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=dist_max/tip_frac_max, interpolation='none', vmin=0, vmax=6)
xlabel('trip distance (miles)')
ylabel('tip percentage')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#frequency of tip percentage vs. time
subplot(2, 2, 4)
extent = [tip_frac_bins[0], tip_frac_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(np.log10(reshape(gb_tip_time_cnts_ri.values, (ntipbins, ntimebins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=tip_frac_max/(time_max/60.), interpolation='none', vmin=0, vmax=6)
xlabel('tip percentage')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#mean tip percentage vs. distance and time
subplot(2, 2, 3)
extent = [trip_dist_bins[0], trip_dist_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(reshape(gb_dist_time_mean_ri.values, (ndistbins, ntimebins)).T[1:, 1:],
       origin='lower', extent=extent, aspect=dist_max/(time_max/60.), interpolation='bicubic', vmin=15, vmax=20)
xlabel('trip distance (miles)')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[15, 16, 17, 18, 19, 20])
cbar.set_label('mean tip percentage', rotation=270, labelpad=13)
cbar.set_ticklabels([r'$\leq$15', '16', '17', '18', '19', r'$\geq$20'])

tight_layout(h_pad=-6, w_pad=2)
# plt.savefig('dist_time_auto_tip.png')
show()

We can see that our corrections really cleaned up the peaks from automatic tips---they now show up as sharp spikes rather than spread-out bumps. For manual tips, we can also see some interesting trends in the dependence of tip percentage on both time and distance: trips that are short in both distance and duration garner larger tips on average, as do trips of around 10 miles, which are most likely from LaGuardia to Manhattan (and are probably being expensed...)

We now use the GPS data to plot heatmaps of pickup locations. We'll look at both the number of pickups and the mean tip percentage as a function of pickup location. First, we'll bin in longitude and latitude of the pickup location. But before doing so, we'll filter out autotips so we can see the trends in the manual tips underlying those spikes (which would otherwise dominate the mean tip percentage). We can do this by uncommenting lines in a previous cell and reevaluating.

In [8]:
#bin in lat and lng for pickup and mean tip percentage heatmaps

center_lat = 40.76
center_lng = -73.925
dlat = 0.125
dlng = 0.125
nlatbins = 100
nlngbins = 100

def drop_mean(ary):
    if ary.size < 50:
        return 0.
    return mean(ary)

lng_bins = linspace(center_lng - dlng, center_lng + dlng, nlngbins)
lat_bins = linspace(center_lat - dlat, center_lat + dlat, nlatbins)

clng = pd.cut(taxi_CRD_dist_pd.pickup_longitude.values, lng_bins)
clat = pd.cut(taxi_CRD_dist_pd.pickup_latitude.values, lat_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)
gb_mean = s_tipfrac.groupby([clng.codes, clat.codes]).apply(drop_mean)
gb_cnts = s_tipfrac.groupby([clng.codes, clat.codes]).count()
mi = pd.MultiIndex.from_product([range(-1, nlngbins-1), range(-1, nlatbins-1)], names=['lng', 'lat'])
gb_mean_ri = gb_mean.reindex(mi).fillna(0)
gb_cnts_ri = gb_cnts.reindex(mi).fillna(0)

Now we plot the corresponding heatmaps.

In [9]:
figsize(16, 8)
rcParams.update({'font.size': 12})
subplot(1, 2, 1)
weighted_coord_bins_cnts = reshape(gb_cnts_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(log10(weighted_coord_bins_cnts.T[1:, 1:] + 0.1), interpolation='none', origin='lower', 
       extent=extent, vmin=0, vmax=4.5)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[0, 1, 2, 3, 4], pad=0.025)
cbar.set_label('number of fares', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$\geq10^4$'])

subplot(1, 2, 2)
weighted_coord_bins_mean = reshape(gb_mean_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(weighted_coord_bins_mean.T[1:, 1:], interpolation='bicubic', origin='lower', 
       extent=extent, vmin=10, vmax=15)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[10, 11, 12, 13, 14, 15], pad=0.025)
cbar.set_label('mean tip percentage', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq$10', '11', '12', '13', '14', r'$\geq$15'])

tight_layout(w_pad=6)
# plt.savefig('heatmaps-pixel.png')
show()

On the left, we can clearly see the geography of NYC traced out by the frequency of taxi pickups---LaGuardia and Queens Boulevard stand out, and if you squint you can even see Penn Station... From the plot on the right, it looks like there are big tippers on the Lower East Side, near LaGuardia, etc.! One could imagine that ridesharing companies like Uber or Lyft could use data like this to improve dynamic pricing, vehicle dispatch, etc. You could even try to make predictive models of taxi pickups, correlating with other public NYC transportation data sets, weather data, or even social data...one can imagine many intriguing possibilities!

A postscript: here's some code to overlay the heatmaps on a map of NYC, using the Google Maps API and the JavaScript example found at https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap. Unfortunately, even after tweaking, the resolution of the resulting heatmap is not as good as that of the plots above. Nevertheless, you can see the result here.

In [10]:
#code to plot heatmap using Google Maps API...resolution not as good

import os.path

def write_weighted_coords(html_file, lat_bins, lng_bins, weighted_coord_bins, max_intensity):
    if not os.path.exists(html_file):
        open(html_file, 'w').close()
    else:
        os.remove(html_file)
        
    weights_scaled = weighted_coord_bins
    
    with open('heatmap_template.html') as f_old, open(html_file, 'w') as f_new:
        for line in f_old:
            f_new.write(line)
            if '// LAT LNG COORDS GO HERE' in line:
                #join entries with commas so the last one has no trailing comma
                entries = []
                for lat_bin in range(1, len(lat_bins)):
                    for lng_bin in range(1, len(lng_bins)):
                        entries.append('  {{location: new google.maps.LatLng({:.6f}, {:.6f}), weight: {:.2f}}}'\
                                       .format(lat_bins[lat_bin], lng_bins[lng_bin],
                                               max(weights_scaled[lng_bin][lat_bin], 0.)))
                f_new.write(',\n'.join(entries) + '\n')
            if '// SET MAX INTENSITY HERE' in line:
                f_new.write('var maxIntensity = {:.2f};\n'.format(max_intensity))

write_weighted_coords('mean_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      weighted_coord_bins_mean, 1.)
write_weighted_coords('counts_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      log10(weighted_coord_bins_cnts), 1.)