Sunday, March 29, 2015

Week 0: Exploratory Analysis

To learn how to use pandas, I did an exploratory analysis of the public NYC taxi data set, which contains detailed fare and trip information---including things like tip amount and precise GPS/time data for pickups/dropoffs. I worked with the first chunk of the data available at http://www.andresmh.com/nyctaxitrips/ You can also read more about the data set at http://chriswhong.com/open-data/foil_nyc_taxi/

First, we import the fare and trip data seperately, then join. I had to clean up some of the column header names (removing whitespace).

In [1]:
import pandas as pd

fares_pd = pd.read_csv('trip_fare_1.csv.gz', compression='gzip', nrows=10000000, usecols=[0, 1, 2, 4, 5, 6, 7, 8, 9, 10])
trips_pd = pd.read_csv('trip_data_1.csv.gz', compression='gzip', nrows=10000000, usecols=range(5, 14))

#clean up column headers
fares_old_columns = fares_pd.columns
fares_pd.columns = [header.strip() for header in fares_old_columns]

taxi_pd = pd.merge(fares_pd, trips_pd, left_index=True, right_index=True, how='outer')
del fares_pd, trips_pd
taxi_CRD_pd = pd.DataFrame(taxi_pd[taxi_pd.payment_type == 'CRD']).drop('payment_type', 1)
del taxi_pd

Now to massage the tip and fare information to get tip percentages. Since cash fares are missing tip amounts, we'll work with credit-card fares. There are a few interesting complications to note.

First, when paying by credit card in a NYC taxi, the default tip options presented are 20%, 25%, and 30%. This leads to huge peaks in the histogram of tip percentages at these amounts.

In [2]:
hist((100.*taxi_CRD_pd.tip_amount / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)).values, bins=50, range=(0,50))
xlabel('tip percentage')
ylabel('number of fares')
show()

However, the spikes are not as sharp as you might expect. This is because there are two different credit-card--payment vendors, which use different formulas to calculate tip amounts---one vendor includes taxes and tolls in the total fare amount, the other does not. I discovered after some digging that this was known previously, see e.g. http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2

There are also rounding errors that smear out the spikes. We can correct for both of these factors. This will allow us to delineate between "manual" tips and "automatic" tips---the latter are those that yield a tip percentage of exactly 20%, 25%, or 30% after corrections (this almost definitely includes some "manual" tips erroneously, but we'll assume that New Yorkers aren't that good at doing math in their heads...) We'll filter out the automatic tips for some of the heatmaps we generate later so we can see the patterns underlying the large peaks they introduce.

In [3]:
taxi_CRD_pd['tip_frac'] = (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                          (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
#round up/dn for 25% tip truncation of 0.005
taxi_CRD_pd['tip_frac_up'] = (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
taxi_CRD_pd['tip_frac_dn'] = (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
In [4]:
#filter out auto tips for heatmaps
def is_not_auto_tip(taxi_CRD_pd, tip_frac, thresh):
    if tip_frac == 20:
        return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh))
    return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh) &
            (abs(around(taxi_CRD_pd.tip_frac_up, 2) - tip_frac) > thresh) & 
            (abs(around(taxi_CRD_pd.tip_frac_dn, 2) - tip_frac) > thresh))

taxi_CRD_dist_pd = pd.DataFrame(taxi_CRD_pd[(taxi_CRD_pd.trip_distance <= 50) & 
                                            (taxi_CRD_pd.trip_distance > 0) &
                                            (taxi_CRD_pd.trip_time_in_secs > 0) &
                                            (taxi_CRD_pd.fare_amount < 50) &
                                            (taxi_CRD_pd.fare_amount*2 % 1 == 0) &
                                            (taxi_CRD_pd.tip_frac >= 0) &
                                            (taxi_CRD_pd.tip_frac < 50) &
                                            #include only manual tips
                                            is_not_auto_tip(taxi_CRD_pd, 20, 0.001) &
                                            is_not_auto_tip(taxi_CRD_pd, 25, 0.01) &
                                            is_not_auto_tip(taxi_CRD_pd, 30, 0.01)])

Now we bin the data in trip time, total distance, and tip percentage.

In [5]:
dist_max = 12.
time_max = 35.*60.
tip_frac_max = 30.
ndistbins = 15
ntimebins = 15
ntipbins = 30
trip_dist_bins = linspace(0, dist_max, ndistbins)
trip_time_bins = linspace(0, time_max, ntimebins)
tip_frac_bins = linspace(0, tip_frac_max, ntipbins)
In [6]:
cdist = pd.cut(taxi_CRD_dist_pd.trip_distance.values, trip_dist_bins)
ctime = pd.cut(taxi_CRD_dist_pd.trip_time_in_secs.values, trip_time_bins)
ctip = pd.cut(taxi_CRD_dist_pd.tip_frac.values, tip_frac_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)

gb_dist_tip_cnts = s_tipfrac.groupby([cdist.codes, ctip.codes]).count()
gb_tip_time_cnts = s_tipfrac.groupby([ctip.codes, ctime.codes]).count()
gb_dist_time_mean = s_tipfrac.groupby([cdist.codes, ctime.codes]).mean()

mi1 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntipbins-1)], names=['dist', 'tip'])
mi2 = pd.MultiIndex.from_product([range(-1, ntipbins-1), range(-1, ntimebins-1)], names=['tip', 'time'])
mi3 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntimebins-1)], names=['dist', 'time'])

gb_dist_tip_cnts_ri = gb_dist_tip_cnts.reindex(mi1).fillna(0)
gb_tip_time_cnts_ri = gb_tip_time_cnts.reindex(mi2).fillna(0)
gb_dist_time_mean_ri = gb_dist_time_mean.reindex(mi3).fillna(0)

Now for some plots! First we'll explore the relations between trip time, distance, and tip percentage.

In [7]:
figsize(12, 12)
rcParams.update({'font.size': 14})

#frequency of tip percentage vs. distance
subplot(2, 2, 1)
extent = [trip_dist_bins[0], trip_dist_bins[-1], tip_frac_bins[0], tip_frac_bins[-1]]
imshow(np.log10(reshape(gb_dist_tip_cnts_ri.values, (ndistbins, ntipbins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=dist_max/tip_frac_max, interpolation='none', vmin=0, vmax=6)
xlabel('trip distance (miles)')
ylabel('tip percentage')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#frequency of tip percentage vs. time
subplot(2, 2, 4)
extent = [tip_frac_bins[0], tip_frac_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(np.log10(reshape(gb_tip_time_cnts_ri.values, (ntipbins, ntimebins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=tip_frac_max/(time_max/60.), interpolation='none', vmin=0, vmax=6)
xlabel('tip percentage')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#mean tip percentage vs. distance and time
subplot(2, 2, 3)
extent = [trip_dist_bins[0], trip_dist_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(reshape(gb_dist_time_mean_ri.values, (ndistbins, ntimebins)).T[1:, 1:],
       origin='lower', extent=extent, aspect=dist_max/(time_max/60.), interpolation='bicubic', vmin=15, vmax=20)
xlabel('trip distance (miles)')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[15, 16, 17, 18, 19, 20])
cbar.set_label('mean tip percentage', rotation=270, labelpad=13)
cbar.set_ticklabels([r'$\leq$15', '16', '17', '18', '19', r'$\geq$20'])

tight_layout(h_pad=-6, w_pad=2)
# plt.savefig('dist_time_auto_tip.png')
show()

We can see that our corrections really cleaned up the peaks from automatic tips---they now show up as sharp spikes, rather than spread-out bumps. We can also see some interesting trends in the dependence of tip percentage vs. both time and distance for manual tips. It also looks like trips that are short in distance and duration garner larger tips on average. The same is true for trips around 10 miles, which most likely are from LaGuardia to Manhattan (and are probably being expensed...)

We now use the GPS data to plot heatmaps of pickup locations. We'll look at both the number of pickups and the mean tip percentage as a function of pickup location. First, we'll bin in longitude and latitude of the pickup location. But before doing so, we'll filter out autotips so we can see the trends in the manual tips underlying those spikes (which would otherwise dominate the mean tip percentage). We can do this by uncommenting lines in a previous cell and reevaluating.

In [8]:
#bin in lat and lng for pickup and mean tip percentage heatmaps

center_lat = 40.76
center_lng = -73.925
dlat = 0.125
dlng = 0.125
nlatbins = 100
nlngbins = 100

def drop_mean(ary):
    if ary.size < 50:
        return 0.
    return mean(ary)

lng_bins = linspace(center_lng - dlng, center_lng + dlng, nlngbins)
lat_bins = linspace(center_lat - dlat, center_lat + dlat, nlatbins)

clng = pd.cut(taxi_CRD_dist_pd.pickup_longitude.values, lng_bins)
clat = pd.cut(taxi_CRD_dist_pd.pickup_latitude.values, lat_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)
gb_mean = s_tipfrac.groupby([clng.codes, clat.codes]).apply(drop_mean)
gb_cnts = s_tipfrac.groupby([clng.codes, clat.codes]).count()
mi = pd.MultiIndex.from_product([range(-1, nlngbins-1), range(-1, nlatbins-1)], names=['lng', 'lat'])
gb_mean_ri = gb_mean.reindex(mi).fillna(0)
gb_cnts_ri = gb_cnts.reindex(mi).fillna(0)

Now we plot the corresponding heatmaps.

In [9]:
figsize(16, 8)
rcParams.update({'font.size': 12})
subplot(1, 2, 1)
weighted_coord_bins_cnts = reshape(gb_cnts_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(log10(weighted_coord_bins_cnts.T[1:, 1:] + 0.1), interpolation='none', origin='lower', 
       extent=extent, vmin=0, vmax=4.5)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[0, 1, 2, 3, 4], pad=0.025)
cbar.set_label('number of fares', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$\geq10^4$'])

subplot(1, 2, 2)
weighted_coord_bins_mean = reshape(gb_mean_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(weighted_coord_bins_mean.T[1:, 1:], interpolation='bicubic', origin='lower', 
       extent=extent, vmin=10, vmax=15)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[10, 11, 12, 13, 14, 15], pad=0.025)
cbar.set_label('mean tip percentage', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq$10', '11', '12', '13', '14', r'$\geq$15'])

tight_layout(w_pad=6)
# plt.savefig('heatmaps-pixel.png')
show()

On the left, we can clearly see the geography of NYC traced out by the frequency of taxi pickups---LaGuardia and the Queens Boulevard stand out, and if you can even squint and see Penn Station... From the plot on the right, it looks like there are big tippers on the Lower East Side, near LaGuardia, etc.! One could imagine that ridesharing companies like Uber or Lyft could use data like this to improve dynamic pricing, vehicle dispatch, etc. You could even try to make predictive models of taxi pickups, correlating with other public NYC transportation data sets, weather data, or even social data...one can imagine many intriguing possibilities!

A postscript script: here's some code to make the heatmaps overlaid on a map of NYC with the Google Maps API and the JavaScript example found at https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap. Unfortunately, even after tweaking, the resolution of the resulting heatmap is not as good as those above. Nevertheless, you can see the result here.

In [10]:
#code to plot heatmap using Google Maps API...resolution not as good

import os.path

def write_weighted_coords(html_file, lat_bins, lng_bins, weighted_coord_bins, max_intensity):
    if not os.path.exists(html_file):
        file(html_file, 'w').close()
    else:
        os.remove(html_file)
        
    weights_scaled = weighted_coord_bins
    
    with open('heatmap_template.html') as f_old, open(html_file, 'w') as f_new:
        for line in f_old:
            f_new.write(line)
            if '// LAT LNG COORDS GO HERE' in line:
                for lat_bin in range(1, len(lat_bins)):
                    for lng_bin in range(1, len(lng_bins)):
                        if (lat_bin != len(lat_bins)) & (lng_bin != len(lng_bins)):
                            f_new.write('  {{location: new google.maps.LatLng({:.6f}, {:.6f}), weight: {:.2f}}},\n'\
                                        .format(lat_bins[lat_bin], lng_bins[lng_bin], 
                                                max(weights_scaled[lng_bin][lat_bin], 0.)))
                f_new.write('  {{location: new google.maps.LatLng({:.6f}, {:.6f}), weight: {:.2f}}}'\
                            .format(lat_bins[lat_bin], lng_bins[lng_bin],
                                    max(weights_scaled[lng_bin][lat_bin], 0.)))
            if '// SET MAX INTENSITY HERE' in line:
                f_new.write('var maxIntensity = {:.2f};\n'.format(max_intensity))

write_weighted_coords('mean_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      weighted_coord_bins_mean, 1.)
write_weighted_coords('counts_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      log10(weighted_coord_bins_cnts), 1.)

No comments:

Post a Comment