Monday, April 20, 2015

Weeks 3-4: Deep Yellow

I've spent most of the past couple of weeks putting together a web-based user interface to present my results. This required me to learn/relearn Flask, Bootstrap/CSS templates, JavaScript, Leaflet/MapBox APIs, etc. See the result here!

Tuesday, April 7, 2015

Week 2: Taxis in Spacetime + Grandmaster Cabbies

Last week, we used the taxi data to determine what strategies the highest performing drivers were using. There, the best "strategy" essentially boiled down to a list of locations and times that led to profitable pickups. (More precisely, we found that whether or not a particular driver was in the top 5 percent of all cabbies depended strongly on the number of pickups that driver made at these specific spots.)

However, this list of hot spots is ultimately of limited use to a driver. If I'm a cabbie with a weekend shift, it doesn't do me much good to know that picking up fat cats at City Hall during the work week is a good "strategy." Even if I know which spots are hopping on the weekend after midnight, that leaves me little guidance for the rest of my shift. And finally, even if I know all the best spots at every hour of the day, what happens if I need to drop a passenger off far away from those spots? Do I waste gas driving back to them, or do I try to make another pickup nearby? These are the issues that we'll try to address with our analysis this week.

So what do we actually want the end product of our analysis to be? We're trying to play the game of maximizing profit as a taxi driver, so a "strategy" for this game consisting of a completely general set of rules to follow would clearly be quite useful! That is, if I'm a cabbie who happens to find himself at location X at time T, these rules should tell me exactly which action to take next. If there's a passenger available, should I pick her up and take her to wherever she wants to go? Or should I leave her on the curb---and if so, where should I go next to give myself the best chance at making a more profitable pickup (without wasting too much time or gas)?

The taxi data gives us the board on which this game is played---it tells us the probabilities of making pickups at each location and time, as well as the payoffs or costs for making each move. As an astrophysicist, it's most natural for me to think of this game board as existing in a sort of spacetime. Recall that we split our 2 spatial dimensions into 251 zones:

We can further split our time dimension into different slices:

From this diagram, we can see that a taxi trip consists of a move on this game board from one point on the spacetime grid to another. Again, the data tells us the probability of making a pickup at each point, as well as the probability distribution for making trips to subsequent points if a pickup is made. It also tells us the distribution of rewards (if a pickup is made) and costs (for gas) associated with moves between all of the points. Note that some moves are not "legal"---cabbies have to obey the speed limit, and they can only move forward in time! (Physicists will recognize this requirement as a consequence of being confined to the future "light cone," which contains all future points that can be causally connected to a current point given the finite speed of light. Here, our "cabbie cone" is determined by the typical taxi speed, which we'll take to be around 15 mph.)
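Concretely, the cabbie-cone check boils down to comparing the straight-line distance between two zone centers to the distance a taxi can cover in the elapsed time. Here's a minimal sketch, assuming a 15 mph speed and 10-minute slices; the zone-center coordinates (in miles) are hypothetical:

```python
# Sketch of the "cabbie cone" constraint: a move from one zone to another,
# some number of time slices ahead, is legal only if the destination lies
# within the distance a taxi can cover in that time. Coordinates are
# hypothetical zone centers measured in miles.
TAXI_MPH = 15.0
SLICE_MIN = 10.0  # minutes per time slice

def in_cabbie_cone(xy_from, xy_to, slices_ahead):
    """xy_from/xy_to: zone-center coordinates in miles; slices_ahead >= 1."""
    if slices_ahead < 1:          # cabbies can only move forward in time
        return False
    dx = xy_to[0] - xy_from[0]
    dy = xy_to[1] - xy_from[1]
    dist = (dx * dx + dy * dy) ** 0.5
    return dist <= TAXI_MPH * (slices_ahead * SLICE_MIN / 60.0)
```

At 15 mph a taxi covers 2.5 miles per 10-minute slice, so a zone 3 miles away is outside the cone one slice ahead but inside it two slices ahead.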

Clearly, our game on this spacetime board contains some element of chance. Whether or not we score a passenger at each spacetime point is random, as are the moves that passengers force us to make. The problem is then to find an optimal strategy for this game of chance, given that we know the probabilities for these random processes. It turns out that our game is simply a case of a Markov Decision Process:

Markov Decision Processes can typically be solved by writing the problem as a matrix equation (involving matrices for move probabilities and expected payoffs), and then using linear or dynamic programming to solve the equation to find the optimal strategy. However, this is difficult to do in our case, because: 1) the size of the spacetime grid is large (251 spatial zones x many 10-minute time slices), and 2) the number of possible actions is large (given any spacetime point, there are many points contained in the cabbie cone that we could choose to go to next). This means that our matrices are large and unwieldy.
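To see where those matrices come from, here's a minimal sketch of finite-horizon value iteration on a toy MDP---3 states and 2 actions with made-up transition and payoff matrices, instead of our thousands of spacetime points and many actions:

```python
import numpy as np

# Toy finite-horizon value iteration, to show the kind of matrix computation
# that blows up on the full 251-zone spacetime grid. P[a, s, s'] is the
# transition probability and R[a, s] the expected immediate payoff; all
# numbers here are made up.
n_states, n_actions, horizon = 3, 2, 4
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # make each row a probability distribution
R = rng.random((n_actions, n_states))

V = np.zeros(n_states)                     # value at the final time slice
for _ in range(horizon):                   # back up one time slice at a time
    Q = R + P @ V                          # Q[a, s] = payoff + expected future value
    policy = Q.argmax(axis=0)              # best action in each state
    V = Q.max(axis=0)
```

With 251 zones times many 10-minute slices as states, and every zone in the cabbie cone as a candidate action, both `P` and `R` become enormous---hence the sampling approach below.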

Instead, we'll use a different approach to find near-optimal strategies on our spacetime game board. First, we note that we can use the data to test strategies, by simply simulating trips and finding their corresponding payoffs under each strategy. For example, here's a simulated taxi shift that starts at Penn Station at 7AM (a.k.a. "spacetime zone 10914") and ends near City Hall at 7PM (a.k.a. "spacetime zone 28414"):

Starting in spacetime zone 10914
Make pickup: moving to 11428 and making 10.04
Make pickup: moving to 12046 and making 27.42
No   pickup: moving to 12797 and losing -0.84
No   pickup: moving to 13295 and losing -0.84
No   pickup: moving to 14029 and losing -0.84
Make pickup: moving to 13982 and making  6.18
Skip pickup: moving to 14487 and losing -0.98
Make pickup: moving to 15060 and making 39.57
Make pickup: moving to 15060 and making  6.59
Make pickup: moving to 15060 and making  6.59
No   pickup: moving to 15813 and losing -0.69
Make pickup: moving to 16315 and making 20.12
Make pickup: moving to 16901 and making 20.78
Make pickup: moving to 17179 and making  9.72
Make pickup: moving to 17179 and making  3.52
No   pickup: moving to 17428 and losing -0.28
No   pickup: moving to 18228 and losing -1.14
Make pickup: moving to 18689 and making 12.54
Make pickup: moving to 18946 and making 10.15
No   pickup: moving to 19731 and losing -1.19
Make pickup: moving to 20227 and making 18.36
Make pickup: moving to 20634 and making 15.68
Make pickup: moving to 21507 and making 21.34
Make pickup: moving to 21788 and making  7.15
Make pickup: moving to 21995 and making 10.48
No   pickup: moving to 22244 and losing -0.28
Skip pickup: moving to 23021 and losing -0.44
Make pickup: moving to 23788 and making 33.12
Skip pickup: moving to 24277 and losing -0.44
Make pickup: moving to 24688 and making 28.50
No   pickup: moving to 24939 and losing -0.00
Make pickup: moving to 24938 and making  4.70
No   pickup: moving to 25706 and losing -1.41
Make pickup: moving to 26293 and making 19.95
Skip pickup: moving to 27026 and losing -0.44
Make pickup: moving to 27563 and making 29.77
Skip pickup: moving to 28041 and losing -0.89
Make pickup: moving to 28414 and making 32.33
Total / hour = 31.99
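Schematically, the simulator that produces shifts like the one above just walks the spacetime grid, flipping a (data-driven) coin for a pickup at each point. Here's a stripped-down sketch; the three sampler functions are stand-ins for lookups into the empirical pickup/trip distributions, and all numbers in them are invented:

```python
import random

# Schematic of the shift simulator. The samplers below are placeholders for
# draws from the empirical distributions; all of their numbers are invented.
random.seed(1)

def pickup_prob(zone):        # placeholder: P(pickup) at a spacetime zone
    return 0.6

def sample_trip(zone):        # placeholder: (next zone, fare) if a pickup happens
    return zone + random.randint(400, 800), round(random.uniform(5, 40), 2)

def drive_empty(zone):        # placeholder: (next zone, gas cost) with no pickup
    return zone + random.randint(200, 900), round(random.uniform(0.2, 1.5), 2)

def simulate_shift(start_zone, end_zone, hours=12.0):
    zone, total = start_zone, 0.0
    while zone < end_zone:
        if random.random() < pickup_prob(zone):
            zone, fare = sample_trip(zone)
            total += fare
        else:
            zone, cost = drive_empty(zone)
            total -= cost
    return total / hours      # $/hour for the shift

rate = simulate_shift(10914, 28414)
```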

To search for near-optimal strategies, we can use sparsely sampled look-ahead trees (as discussed in this paper). That is, given a starting spacetime point, we can randomly sample combinations of possible future passenger trips and actions. We then calculate the corresponding payoffs for these combinations, and choose the action that yields the highest payoff on average. The set of such actions at all possible spacetime points then constitutes our strategy; since we don't sample all possible actions infinitely far into the future, the overall strategy is only near-optimal.
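The idea can be sketched recursively: estimate each candidate action's value by averaging a handful of sampled futures, recursing a few moves ahead, then act greedily. Here's a hedged toy version---the real game samples trips and payoffs from the taxi data, while the `step` function and its numbers below are invented:

```python
import random

# Sketch of a sparsely sampled look-ahead tree: estimate each candidate
# action's value by averaging `width` sampled futures, recursing `depth`
# moves ahead, then act greedily. step() is a stand-in for a draw from the
# empirical trip/payoff distributions; all numbers are invented.
random.seed(2)
ACTIONS = [0, 1, 2]

def step(state, action):
    """Placeholder environment: returns (sampled reward, next state)."""
    return random.gauss(float(action), 1.0), state + 1

def estimate_value(state, depth, width):
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in ACTIONS:
        total = 0.0
        for _ in range(width):
            r, s2 = step(state, a)
            total += r + estimate_value(s2, depth - 1, width)
        best = max(best, total / width)
    return best

def best_action(state, depth=3, width=8):
    def sampled_value(a):
        total = 0.0
        for _ in range(width):
            r, s2 = step(state, a)
            total += r + estimate_value(s2, depth - 1, width)
        return total / width
    return max(ACTIONS, key=sampled_value)
```

Since the cost grows roughly like (number of actions × width) to the power of depth, the sampling has to stay modest---which is exactly why the resulting strategy is only near-optimal.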

As one might expect, the performance of the near-optimal strategies found using sparsely sampled look-ahead trees depends on the tree breadth (the number of possible moves and actions we sample) and depth (the number of moves we look ahead). Ideally, we'd like to compare our near-optimal strategies to those employed by real-life drivers (see the histogram of driver performance from last week). However, the fact that our spacetime game board is only an approximation to the real-life one makes the comparison a bit unfair---for example, our simulated drivers are only allowed to make a single move every 10 minutes. So we'll have to do a little scaling in order to compare apples to apples. We can also compare our near-optimal strategies to a random strategy, in which the driver simply moves at random whenever no pickup is made.

Interestingly enough, we see that typical real-life drivers perform no better than a random strategy! Furthermore, it's clear that increasing the width and depth of our trees does indeed lead to better strategies---even sampling just 32 possible actions at each of 4 moves ahead (i.e., at most 32^4) already yields about a $10/hour boost over a real or random strategy. However, it does seem that the variance in performance that arises from our strategies is a little larger than the real variance.

Finally, we also see that for strategies with a depth of 2 or more, the average performance is comparable to that of the top 5th percentile of real drivers. When asked how far he thinks ahead in a typical chess position, Kasparov stated: "Normally, I would calculate three to five moves. You don't need more..." Perhaps the best NYC cabbies do the same.

Monday, March 30, 2015

Week 1: Some Nifty Visualizations + 7 Habits of Highly Effective Hacks

Over the past few days, I've been digging even deeper into the taxi data, resulting in: 1) some nifty visualizations, and 2) a first cut at using machine learning to find optimal driver strategies. I'll discuss these results below, but for the purposes of presentation, I'll keep the Python code to a minimum. If you'd like to take a peek under the hood, you can check out the IPython notebooks used to generate the results below in my GitHub repository here (warning: somewhat hacky and uncommented code!).

For this analysis, we'll use the second chunk of the data from http://www.andresmh.com/nyctaxitrips/. First, some basic facts about the data:

Using date range 2013-02-01 00:00:00 to 2013-03-01 01:15:48.
Total of 13,990,176 trips and 32,062 drivers.
Total of 38,355,637.53 miles and 2,727,526.51 hours.
Total of $196,346,621.37 = $163,877,407.41 in fares, $18,311,045.64 in tips, and $14,158,168.32 in tolls/fees.

Next, let's make a simple scatter plot of the pickups and dropoffs. Here, I only use a subset of the data consisting of 500,000 trips (i.e., 1,000,000 pickups and dropoffs):

We can see that the GPS location data is impressively detailed! We can zoom in to some areas of interest---for example, LaGuardia Airport:

We see that the pickups clearly cluster around the taxi stands at each terminal, while the dropoffs trace LaGuardia Road and Central Terminal Drive.

We can also compare the areas around Madison Square Garden/Penn Station and Barclays Center to get an idea of how Manhattan taxi activity compares to that in Brooklyn. Here, I show all trips occurring in the 3-day period from 2/4/13 10AM to 2/6/13 10AM:

We can see that the number of pickups and dropoffs in midtown Manhattan is a few orders of magnitude larger than that in Brooklyn. Also interesting to note is that the GPS accuracy noticeably suffers in Manhattan (most likely due to the signals bouncing off of all of the skyscrapers), while in Brooklyn the resolution is fine enough to show that passengers get picked up from and dropped off on opposite sides of Flatbush!

Of course, the data also contains detailed temporal information. We can thus plot the time series of pickups/dropoffs in these two areas:

Note that the scale of the y-axes differs by a factor of 10. We see that sports events and concerts induce clear spikes in the number of pickups at Barclays after they end, while the lub-dub heartbeat of Penn Station is fairly constant (and is relatively unperturbed by the Pistons vs. Knicks game on 2/4/13).

We can also combine both the spatial and temporal information to make some neat animations of the activity in this 3-day period:

Pretty cool! Now let's move on to a more serious and quantitative analysis of the data.

7 Habits of Highly Effective Hacks

We can combine this spatiotemporal information with the fare information to optimize strategies for drivers. That is, we can use machine learning to determine when and where the highest/lowest performing drivers make their pickups, using the spatiotemporal pickup frequencies as features.

However, this raises the question of how to label driver performance. One might guess that we should use total earnings (fares + tips) per hour, but this doesn't take into account that drivers need to pay for their own gas. We should thus penalize drivers by the total distance they travel.

This then raises another question. The taxi data only gives detailed information (i.e., trip time and distance) for periods when the taxi was occupied---not for those when it was empty. So we can't determine the total distance traveled directly from the data. Instead, we'll approximate the distance traveled while empty by the distance from the previous dropoff point to the next pickup point (using the Euclidean distance between the two, rather than trying to determine the distance along a realistic route). This doesn't account for any meandering drivers might do after each dropoff, but it's probably a good indicator of the average scenario.

By assuming the price of gas is roughly $3/gallon and that taxis are averaging around 15 mpg, we can then determine the corrected earnings per hour and find the corresponding 5th and 95th percentiles of drivers:

Limiting to only drivers with more than 300 trips under their belt, this gives us 1153 good and 1153 bad drivers.
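The correction itself is simple arithmetic. Here's a sketch under the stated assumptions ($3/gallon, 15 mpg, empty legs approximated by the straight-line dropoff-to-pickup distance); the trips below are made up, with coordinates already in miles:

```python
# Sketch of the gas-corrected $/hour metric. Each made-up trip is
# (fare + tip, metered miles, hours, dropoff_xy, next_pickup_xy),
# with coordinates already in miles.
GAS_PER_GALLON = 3.0
MPG = 15.0

def euclid(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def corrected_rate(trips):
    earnings = sum(t[0] for t in trips)
    occupied_miles = sum(t[1] for t in trips)
    # approximate each empty leg by the straight-line dropoff-to-pickup distance
    empty_miles = sum(euclid(t[3], t[4]) for t in trips)
    hours = sum(t[2] for t in trips)
    gas_cost = (occupied_miles + empty_miles) * GAS_PER_GALLON / MPG
    return (earnings - gas_cost) / hours

trips = [(14.50, 3.0, 0.25, (0.0, 0.0), (1.0, 0.0)),
         (22.00, 5.0, 0.40, (1.0, 1.0), (1.0, 3.0))]
rate = corrected_rate(trips)
```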

Now that we've got our targets labeled, how should we determine our features? We can do this by splitting up our data into spatial and temporal bins. For the latter, we'll bin in 2-hour intervals each day, and further distinguish between weekdays (Monday-Thursday) and weekends (Friday-Sunday). For the former, we'll first split up the city into spatial zones, using a 40x40 grid:

We'll then focus only on the 250 most active zones, lumping all other zones together (we'll call this Zone 0, or "the Boonies"):

Combining these with our temporal bins, we arrive at 2 (weekday/weekend) x 12 (2-hour time bins) x 251 (spatial zones) = 6024 features.
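One way to lay these out as a single flat feature vector is sketched below; the ordering (weekend block, then 2-hour bin, then zone) is an illustrative choice, not necessarily the one used in the notebooks:

```python
# One possible flat layout of the 2 x 12 x 251 = 6024 features.
N_TIME_BINS = 12   # 2-hour bins per day
N_ZONES = 251      # 250 most active zones + Zone 0 ("the Boonies")

def feature_index(is_weekend, hour, zone):
    time_bin = hour // 2                       # 0..11
    return (int(is_weekend) * N_TIME_BINS + time_bin) * N_ZONES + zone
```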

We then split our 2306 drivers into training and test sets (with a simple 50-50 split) and fit feature weights for each of the 2 x 12 x 251 spatiotemporal bins using a linear support vector classifier. Using 5-fold cross validation to evaluate the classifier, the model achieves an accuracy of 91±3% in predicting the performance of drivers in the test set.
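As a rough sketch of this fit, scikit-learn's LinearSVC plus cross_val_score covers both steps. Since the real features live in the notebooks, the data below is synthetic: the dimensions and the number of informative features are invented:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the driver classification: 200 "drivers", 50
# pickup-frequency features, of which only the first 5 actually matter.
rng = np.random.RandomState(0)
X = rng.poisson(3, size=(200, 50)).astype(float)
w_true = np.zeros(50)
w_true[:5] = [2.0, -2.0, 1.5, -1.5, 1.0]
score = X @ w_true + rng.randn(200)            # latent "earnings" score
y = (score > np.median(score)).astype(int)     # top half = "good" drivers

clf = LinearSVC(C=1.0, max_iter=10000)
acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold cross validation
```

For the L1-regularized variant used below to prune features, pass penalty='l1', dual=False (and a small C, such as the C=0.06 quoted in the text); the surviving non-zero weights pick out the most informative spatiotemporal bins.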

Plotting these feature weights as a function of space and time then gives us some more cool animations. Here, red spots show zones frequented by high performing drivers at various times of the day, while blue spots show those zones frequented by low performing drivers:

More precisely, the color of each spot corresponds to the weight that each location/time combination contributes to the determination of driver performance. Red indicates positive weight (i.e., making frequent pickups in location/time combinations indicated by red spots contributes to positive driver performance), while blue indicates negative weight (i.e., drivers should avoid those locations at those particular times). However, there's a lot going on in these animations. To simplify the picture, we can use L1-Norm regularization to extract only the most important spatiotemporal features; setting C=0.06 gives us a model with only 284 features with non-zero weights. This simplified model also achieves about 90% accuracy---despite having fewer moving parts---and allows us to focus on the locations/times that most strongly determine driver performance:

Interestingly enough, we see that the only time it is profitable to make the trip out to Brooklyn to find fares is after midnight on a weekend---perhaps inebriated hipsters are inclined to tip a little more? Similarly, it looks like there are optimal times to pick up fares from LaGuardia during the week.

Examining only the features with large positive weights, we arrive at the top 7 habits of highly effective hacks (i.e., the 7 pickup times/areas that most strongly determine high performing drivers):

1) Weekend  6PM to  8PM Upper East Side
2) Weekday  8PM to 10PM SoHo
3) Weekday 10PM to 12AM SoHo
4) Weekday  6PM to  8PM City Hall
5) Weekend  8PM to 10PM East Village
6) Weekend  4PM to  6PM Yorkville
7) Weekday  6PM to  8PM Lenox Hill

These results seem pretty reasonable! Unsurprisingly, we see many well-to-do and nightlife areas making the list, but it seems that picking up workers at City Hall at the end of the workday also results in profitable trips. How about the 7 habits of less effective hacks?

1) Weekday  4AM to  6AM The Boonies
2) Weekend  6AM to  8AM The Boonies
3) Weekday  12AM to 2AM Port Authority
4) Weekday  2AM to  4AM The Boonies
5) Weekday  8PM to 10PM Lenox Hill
6) Weekday  8AM to 10AM Upper West Side
7) Weekday  2AM to  4AM Grand Central

Again, these results seem reasonable. It looks like picking up passengers from late night trips to Port Authority and Grand Central isn't very fruitful. And it's definitely not very smart to be hunting for fares out in the Boonies during the wee hours of the morning---we don't need a fancy machine-learning analysis to tell us that!

Of course, one could argue that we didn't actually get that fancy here. We simply used the most straightforward features (location/time of pickups); it is easy to imagine more sophisticated analyses are possible with the data. For example, we could try to determine whether it is better for hacks to 1) hunt for fares, or 2) hang out and wait for fares. This was previously studied in a nice paper, which used full GPS trajectories of taxis in Hangzhou (as opposed to just the pickup/dropoff data, which we are limited to here).

We could also try answering a more difficult question. We've found what the best real taxi drivers do---what about the perfect driver? We can imagine playing the game of optimizing net driver profit, using the spatiotemporal grid of pickup/dropoff frequencies to abstract the game as a time-dependent Markov Decision Process, as was done with Beijing taxi trajectories in this paper.

Clearly, there are many intriguing directions to pursue---so stay tuned for future posts!

Sunday, March 29, 2015

Week 0: Exploratory Analysis

To learn how to use pandas, I did an exploratory analysis of the public NYC taxi data set, which contains detailed fare and trip information---including things like tip amount and precise GPS/time data for pickups/dropoffs. I worked with the first chunk of the data available at http://www.andresmh.com/nyctaxitrips/. You can also read more about the data set at http://chriswhong.com/open-data/foil_nyc_taxi/.

First, we import the fare and trip data separately, then join them. I had to clean up some of the column header names (removing whitespace). (Note that these notebooks run in pylab mode, so NumPy and matplotlib functions appear unprefixed below.)

In [1]:
import pandas as pd

fares_pd = pd.read_csv('trip_fare_1.csv.gz', compression='gzip', nrows=10000000, usecols=[0, 1, 2, 4, 5, 6, 7, 8, 9, 10])
trips_pd = pd.read_csv('trip_data_1.csv.gz', compression='gzip', nrows=10000000, usecols=range(5, 14))

#clean up column headers
fares_old_columns = fares_pd.columns
fares_pd.columns = [header.strip() for header in fares_old_columns]

taxi_pd = pd.merge(fares_pd, trips_pd, left_index=True, right_index=True, how='outer')
del fares_pd, trips_pd
taxi_CRD_pd = pd.DataFrame(taxi_pd[taxi_pd.payment_type == 'CRD']).drop('payment_type', 1)
del taxi_pd

Now to massage the tip and fare information to get tip percentages. Since cash fares are missing tip amounts, we'll work with credit-card fares. There are a few interesting complications to note.

First, when paying by credit card in a NYC taxi, the default tip options presented are 20%, 25%, and 30%. This leads to huge peaks in the histogram of tip percentages at these amounts.

In [2]:
hist((100.*taxi_CRD_pd.tip_amount / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)).values, bins=50, range=(0,50))
xlabel('tip percentage')
ylabel('number of fares')
show()

However, the spikes are not as sharp as you might expect. This is because there are two different credit-card payment vendors, which use different formulas to calculate tip percentages: one vendor includes taxes and tolls in the total fare amount, while the other does not. After some digging, I discovered that this was already known; see e.g. http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2

There are also rounding errors that smear out the spikes. We can correct for both of these factors. This will allow us to delineate between "manual" tips and "automatic" tips---the latter are those that yield a tip percentage of exactly 20%, 25%, or 30% after corrections (this almost definitely includes some "manual" tips erroneously, but we'll assume that New Yorkers aren't that good at doing math in their heads...) We'll filter out the automatic tips for some of the heatmaps we generate later so we can see the patterns underlying the large peaks they introduce.

In [3]:
taxi_CRD_pd['tip_frac'] = (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                          (100.*taxi_CRD_pd.tip_amount \
                           / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
#round up/dn for 25% tip truncation of 0.005
taxi_CRD_pd['tip_frac_up'] = (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount + 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
taxi_CRD_pd['tip_frac_dn'] = (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.total_amount - taxi_CRD_pd.tip_amount)) * (taxi_CRD_pd.vendor_id == 'CMT') + \
                             (100.*(taxi_CRD_pd.tip_amount - 0.005) \
                              / (taxi_CRD_pd.fare_amount + taxi_CRD_pd.surcharge)) * (taxi_CRD_pd.vendor_id != 'CMT')
In [4]:
#filter out auto tips for heatmaps
def is_not_auto_tip(taxi_CRD_pd, tip_frac, thresh):
    if tip_frac == 20:
        return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh))
    return ((abs(around(taxi_CRD_pd.tip_frac, 2) - tip_frac) > thresh) &
            (abs(around(taxi_CRD_pd.tip_frac_up, 2) - tip_frac) > thresh) & 
            (abs(around(taxi_CRD_pd.tip_frac_dn, 2) - tip_frac) > thresh))

taxi_CRD_dist_pd = pd.DataFrame(taxi_CRD_pd[(taxi_CRD_pd.trip_distance <= 50) & 
                                            (taxi_CRD_pd.trip_distance > 0) &
                                            (taxi_CRD_pd.trip_time_in_secs > 0) &
                                            (taxi_CRD_pd.fare_amount < 50) &
                                            (taxi_CRD_pd.fare_amount*2 % 1 == 0) &
                                            (taxi_CRD_pd.tip_frac >= 0) &
                                            (taxi_CRD_pd.tip_frac < 50) &
                                            #include only manual tips
                                            is_not_auto_tip(taxi_CRD_pd, 20, 0.001) &
                                            is_not_auto_tip(taxi_CRD_pd, 25, 0.01) &
                                            is_not_auto_tip(taxi_CRD_pd, 30, 0.01)])

Now we bin the data in trip time, trip distance, and tip percentage.

In [5]:
dist_max = 12.
time_max = 35.*60.
tip_frac_max = 30.
ndistbins = 15
ntimebins = 15
ntipbins = 30
trip_dist_bins = linspace(0, dist_max, ndistbins)
trip_time_bins = linspace(0, time_max, ntimebins)
tip_frac_bins = linspace(0, tip_frac_max, ntipbins)
In [6]:
cdist = pd.cut(taxi_CRD_dist_pd.trip_distance.values, trip_dist_bins)
ctime = pd.cut(taxi_CRD_dist_pd.trip_time_in_secs.values, trip_time_bins)
ctip = pd.cut(taxi_CRD_dist_pd.tip_frac.values, tip_frac_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)

gb_dist_tip_cnts = s_tipfrac.groupby([cdist.codes, ctip.codes]).count()
gb_tip_time_cnts = s_tipfrac.groupby([ctip.codes, ctime.codes]).count()
gb_dist_time_mean = s_tipfrac.groupby([cdist.codes, ctime.codes]).mean()

mi1 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntipbins-1)], names=['dist', 'tip'])
mi2 = pd.MultiIndex.from_product([range(-1, ntipbins-1), range(-1, ntimebins-1)], names=['tip', 'time'])
mi3 = pd.MultiIndex.from_product([range(-1, ndistbins-1), range(-1, ntimebins-1)], names=['dist', 'time'])

gb_dist_tip_cnts_ri = gb_dist_tip_cnts.reindex(mi1).fillna(0)
gb_tip_time_cnts_ri = gb_tip_time_cnts.reindex(mi2).fillna(0)
gb_dist_time_mean_ri = gb_dist_time_mean.reindex(mi3).fillna(0)

Now for some plots! First we'll explore the relations between trip time, distance, and tip percentage.

In [7]:
figsize(12, 12)
rcParams.update({'font.size': 14})

#frequency of tip percentage vs. distance
subplot(2, 2, 1)
extent = [trip_dist_bins[0], trip_dist_bins[-1], tip_frac_bins[0], tip_frac_bins[-1]]
imshow(np.log10(reshape(gb_dist_tip_cnts_ri.values, (ndistbins, ntipbins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=dist_max/tip_frac_max, interpolation='none', vmin=0, vmax=6)
xlabel('trip distance (miles)')
ylabel('tip percentage')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#frequency of tip percentage vs. time
subplot(2, 2, 4)
extent = [tip_frac_bins[0], tip_frac_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(np.log10(reshape(gb_tip_time_cnts_ri.values, (ntipbins, ntimebins)).T[1:, 1:]),
       origin='lower', extent=extent, aspect=tip_frac_max/(time_max/60.), interpolation='none', vmin=0, vmax=6)
xlabel('tip percentage')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[0, 1, 2, 3, 4, 5, 6])
cbar.set_label('number of fares', rotation=270, labelpad=5)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$10^4$', r'$10^5$', r'$\geq10^6$'])

#mean tip percentage vs. distance and time
subplot(2, 2, 3)
extent = [trip_dist_bins[0], trip_dist_bins[-1], trip_time_bins[0]/60., trip_time_bins[-1]/60.]
imshow(reshape(gb_dist_time_mean_ri.values, (ndistbins, ntimebins)).T[1:, 1:],
       origin='lower', extent=extent, aspect=dist_max/(time_max/60.), interpolation='bicubic', vmin=15, vmax=20)
xlabel('trip distance (miles)')
ylabel('trip time (minutes)')
cbar = colorbar(shrink=0.6, ticks=[15, 16, 17, 18, 19, 20])
cbar.set_label('mean tip percentage', rotation=270, labelpad=13)
cbar.set_ticklabels([r'$\leq$15', '16', '17', '18', '19', r'$\geq$20'])

tight_layout(h_pad=-6, w_pad=2)
# plt.savefig('dist_time_auto_tip.png')
show()

We can see that our corrections really cleaned up the peaks from automatic tips---they now show up as sharp spikes, rather than spread-out bumps. We can also see some interesting trends in how manual tip percentages depend on both trip time and distance. It looks like trips that are short in distance and duration garner larger tips on average. The same is true for trips of around 10 miles, which most likely run from LaGuardia to Manhattan (and are probably being expensed...)

We now use the GPS data to plot heatmaps of pickup locations. We'll look at both the number of pickups and the mean tip percentage as a function of pickup location. First, we'll bin in longitude and latitude of the pickup location. But before doing so, we'll filter out autotips so we can see the trends in the manual tips underlying those spikes (which would otherwise dominate the mean tip percentage). We can do this by uncommenting lines in a previous cell and reevaluating.

In [8]:
#bin in lat and lng for pickup and mean tip percentage heatmaps

center_lat = 40.76
center_lng = -73.925
dlat = 0.125
dlng = 0.125
nlatbins = 100
nlngbins = 100

def drop_mean(ary):
    if ary.size < 50:
        return 0.
    return mean(ary)

lng_bins = linspace(center_lng - dlng, center_lng + dlng, nlngbins)
lat_bins = linspace(center_lat - dlat, center_lat + dlat, nlatbins)

clng = pd.cut(taxi_CRD_dist_pd.pickup_longitude.values, lng_bins)
clat = pd.cut(taxi_CRD_dist_pd.pickup_latitude.values, lat_bins)

s_tipfrac = pd.Series(taxi_CRD_dist_pd.tip_frac)
gb_mean = s_tipfrac.groupby([clng.codes, clat.codes]).apply(drop_mean)
gb_cnts = s_tipfrac.groupby([clng.codes, clat.codes]).count()
mi = pd.MultiIndex.from_product([range(-1, nlngbins-1), range(-1, nlatbins-1)], names=['lng', 'lat'])
gb_mean_ri = gb_mean.reindex(mi).fillna(0)
gb_cnts_ri = gb_cnts.reindex(mi).fillna(0)

Now we plot the corresponding heatmaps.

In [9]:
figsize(16, 8)
rcParams.update({'font.size': 12})
subplot(1, 2, 1)
weighted_coord_bins_cnts = reshape(gb_cnts_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(log10(weighted_coord_bins_cnts.T[1:, 1:] + 0.1), interpolation='none', origin='lower', 
       extent=extent, vmin=0, vmax=4.5)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[0, 1, 2, 3, 4], pad=0.025)
cbar.set_label('number of fares', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq1$', r'$10^1$', r'$10^2$', r'$10^3$', r'$\geq10^4$'])

subplot(1, 2, 2)
weighted_coord_bins_mean = reshape(gb_mean_ri.values, (nlngbins, nlatbins))
extent = [lng_bins[0], lng_bins[-1], lat_bins[0], lat_bins[-1]]
imshow(weighted_coord_bins_mean.T[1:, 1:], interpolation='bicubic', origin='lower', 
       extent=extent, vmin=10, vmax=15)
xlabel('longitude (degrees)', size=14)
ylabel('latitude (degrees)', size=14)
cbar = colorbar(shrink=0.7, ticks=[10, 11, 12, 13, 14, 15], pad=0.025)
cbar.set_label('mean tip percentage', rotation=270, labelpad=13, size=14)
cbar.set_ticklabels([r'$\leq$10', '11', '12', '13', '14', r'$\geq$15'])

tight_layout(w_pad=6)
# plt.savefig('heatmaps-pixel.png')
show()

On the left, we can clearly see the geography of NYC traced out by the frequency of taxi pickups---LaGuardia and Queens Boulevard stand out, and if you squint you can even see Penn Station... From the plot on the right, it looks like there are big tippers on the Lower East Side, near LaGuardia, etc.! One could imagine ridesharing companies like Uber or Lyft using data like this to improve dynamic pricing, vehicle dispatch, etc. You could even try to build predictive models of taxi pickups, correlating with other public NYC transportation data sets, weather data, or even social data...one can imagine many intriguing possibilities!

A postscript: here's a script to overlay the heatmaps on a map of NYC, using the Google Maps API and the JavaScript example found at https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap. Unfortunately, even after tweaking, the resolution of the resulting heatmap isn't as good as the plots above. Nevertheless, you can see the result here.

In [10]:
#code to plot heatmap using Google Maps API...resolution not as good

import os.path

def write_weighted_coords(html_file, lat_bins, lng_bins, weighted_coord_bins, max_intensity):
    #remove any stale output (the open(..., 'w') below recreates the file)
    if os.path.exists(html_file):
        os.remove(html_file)
        
    weights_scaled = weighted_coord_bins
    
    with open('heatmap_template.html') as f_old, open(html_file, 'w') as f_new:
        for line in f_old:
            f_new.write(line)
            if '// LAT LNG COORDS GO HERE' in line:
                for lat_bin in range(1, len(lat_bins)):
                    for lng_bin in range(1, len(lng_bins)):
                        #omit the trailing comma on the final entry only
                        last = (lat_bin == len(lat_bins) - 1) and (lng_bin == len(lng_bins) - 1)
                        f_new.write('  {{location: new google.maps.LatLng({:.6f}, {:.6f}), weight: {:.2f}}}{}'\
                                    .format(lat_bins[lat_bin], lng_bins[lng_bin],
                                            max(weights_scaled[lng_bin][lat_bin], 0.),
                                            '\n' if last else ',\n'))
            if '// SET MAX INTENSITY HERE' in line:
                f_new.write('var maxIntensity = {:.2f};\n'.format(max_intensity))

write_weighted_coords('mean_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      weighted_coord_bins_mean, 1.)
write_weighted_coords('counts_pickup_heatmap_all.html', lat_bins, lng_bins, 
                      log10(weighted_coord_bins_cnts), 1.)