Jeff Breeding-Allison


NYC Taxi Data Analysis

(c) Jeff Breeding-Allison, updated May 4, 2016.


Description

In this project, I analyzed NYC taxi data from January 2015. The data was analyzed with IPython in a Jupyter Notebook.

I plotted the pickup locations for both yellow cabs and green cabs. I also compared the tip amount against the total fare and the total amount paid (which included the tip) for yellow cabs. The data was obtained from the NYC Taxi & Limousine Commission's website:

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

The data files I used were

 "green_tripdata_2015-01.csv" (243.3 MB, 1,508,501 rows) and
 "yellow_tripdata_2015-01.csv" (1.99 GB, 12,748,986 rows).


Data

I used the packages numpy, pandas, matplotlib.pyplot, urllib, os, csv, and scipy.stats.
#!/usr/bin/python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
import os
import csv
import scipy.stats

%matplotlib inline
I saved the taxi data files locally.
#################
# Load taxi data files located in same directory as this notebook
#################
ParentDirectory = os.getcwd()
inputfile_g = os.path.join(ParentDirectory, "green_tripdata_2015-01.csv")
inputfile_y = os.path.join(ParentDirectory, "yellow_tripdata_2015-01.csv")

# Pass the paths directly so pandas opens and closes the files itself
ydf = pd.read_csv(inputfile_y)
gdf = pd.read_csv(inputfile_g)
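Since the yellow cab file is close to 2 GB, it can be useful to check the column names and the row count before loading everything into pandas. The following is a minimal sketch, written for Python 3, that streams the file with the csv module. The helper name count_rows is hypothetical and not part of the original notebook.

```python
import csv


def count_rows(path):
    """Return (header, number of data rows) for a CSV file, reading it as a stream."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # first line holds the column names
        n_rows = sum(1 for _ in reader)  # count the remaining rows one at a time
    return header, n_rows
```

For example, header, n = count_rows("green_tripdata_2015-01.csv") would give the column names and the number of rows after the header without holding the whole file in memory.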
In case you want to load these files directly from the web (not recommended!):
#################
# Load taxi data files from the web
#################
ypage = urlopen("https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-01.csv")
ydf = pd.read_csv(ypage)

gpage = urlopen("https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-01.csv")
gdf = pd.read_csv(gpage)
Here is what the data looks like:
#################
# Show the first rows of the yellow taxi data
#################
ydf.head()
yellowcabhead.png
#################
# Show the first rows of the green taxi data
#################
gdf.head()
greencabhead


Pickup locations

I made the following scatter plots of the pickup locations for the yellow cabs and green cabs. There were many outliers outside New York's five boroughs, so I restricted the plots to a box close to the bounding box of the New York City limits as stated here. The box determined by this source has
Lat/Lon Northwest: 40.917577, -74.25909
Lat/Lon Southeast: 40.477399, -73.700009

I used the box with the following corners for my scatter plots:
Lat/Lon Northwest: 40.997577, -74.259090
Lat/Lon Southeast: 40.517399, -73.600272
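
The plots below apply this box through the axis limits. As an alternative sketch (not what the notebook does), the same corners can be turned into an explicit membership test that reports whether a pickup point falls inside the box. The function name in_plot_box and the sample points are made up for illustration.

```python
# Corners of the plotting box quoted above
NORTH, WEST = 40.997577, -74.259090
SOUTH, EAST = 40.517399, -73.600272


def in_plot_box(lat, lon):
    """Return True when a (latitude, longitude) point lies inside the plotting box."""
    return SOUTH <= lat <= NORTH and WEST <= lon <= EAST


# Made-up sample points: one in midtown Manhattan, one far outside the box
pickups = [(40.7549, -73.9840), (0.0, 0.0)]
inside = [p for p in pickups if in_plot_box(*p)]
```

Here inside keeps only the first point. On a DataFrame, the same four comparisons could presumably be combined into a boolean mask on the latitude and longitude columns.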

#################
# Extract the coordinates of each pick-up
#################
pydf = ydf[['pickup_latitude','pickup_longitude']]
pgdf = gdf[['Pickup_latitude','Pickup_longitude']]
Here are the first rows of the yellow taxi pickup location data:
#################
# Show the first rows of the yellow taxi location data
#################
print("Yellow taxis")
print("Number of rows: %i" % pydf.shape[0])
pydf.head()
yellowcabpickuphead.png
Here are the first rows of the green taxi pickup location data:
#################
# Show the first rows of the green taxi location data
#################
print("Green taxis")
print("Number of rows: %i" % pgdf.shape[0])
pgdf.head()
greencabpickuphead.png
Now we make our scatter plots. Example code for plotting the green and yellow pickup locations together:
#################
# Make a scatter plot of the yellow and green taxi location data
#################
fig = plt.figure(figsize=(20,20))
plt.scatter(pgdf['Pickup_longitude'], pgdf['Pickup_latitude'], color='green', s=0.5, label='Green taxis')
plt.scatter(pydf['pickup_longitude'], pydf['pickup_latitude'], color='yellow', s=0.5, label='Yellow taxis')
plt.title('NYC Taxi Pickup Locations January 2015')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
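# Axis limits: west/east longitude and south/north latitude of the plotting box given above
plt.xlim(-74.259090, -73.600272)
plt.ylim(40.517399, 40.997577)
#plt.grid()
# set_facecolor replaces set_axis_bgcolor, which newer matplotlib versions no longer provide
plt.gca().set_facecolor('black')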
plt.legend()
plt.show()
This and similar code yields the following scatter plots. Note we colored the background black for aesthetics.


Green and yellow taxi pickup locations:

greenandyellowtaxi.png

Yellow taxi pickup locations:

yellowtaxi.png

Green taxi pickup locations:

greentaxi.png

Tips

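I next compared the tip amount to the total amount for yellow cabs. I cleaned the data to keep only trips whose total amount was at most $200. Note that the filter used, isin(range(1, 201)), matches whole-dollar totals from $1 to $200, so totals that include cents are dropped as well.
#################
# Extract the tip and total amount data
#################
tiptotalamountsydf = ydf[['tip_amount','total_amount']]

# Keep rows whose total_amount equals one of the integers 1, 2, ..., 200
cleantiptotalamountsydf = tiptotalamountsydf.loc[tiptotalamountsydf['total_amount'].isin(range(1,201))]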

#################
# Make a scatter plot of the tip and total amount data
#################
x = cleantiptotalamountsydf['total_amount'] - cleantiptotalamountsydf['tip_amount']
y = cleantiptotalamountsydf['tip_amount']

fig = plt.figure(figsize=(20,20))
plt.scatter(x, y)
plt.title('NYC Taxi Tips by Total Amount, January 2015')
plt.xlabel('Total Amount without Tip')
plt.ylabel('Tip')
plt.xlim(0,200)
plt.ylim(0,200)
plt.show()
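
The cleaning step in this code selects rows with isin(range(1, 201)), which is a membership test against the integers 1 through 200 rather than a numeric range check. The sketch below shows the difference on a few made-up totals. It uses plain Python membership as a stand-in for the pandas call, and the lower bound of 0 in the range version is an assumption.

```python
# Made-up total amounts in dollars
totals = [12.0, 12.5, 200.0, 200.5, 0.0]

# Membership in the integers 1..200: only whole-dollar totals pass
whole_dollar = [t for t in totals if t in range(1, 201)]

# Numeric range comparison: any total above 0 and up to 200 passes
in_range = [t for t in totals if 0 < t <= 200]
```

With these values, whole_dollar is [12.0, 200.0] while in_range is [12.0, 12.5, 200.0]. The plot and the regression figures on this page come from the membership version.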


Here is a scatter plot of pairs (total_amount-tip, tip):

tiptotal.png


In New York cabs, a payment screen appears at the end of your trip. It has options to tip 20%, 25%, or 30% as indicated in the following image from The Huffington Post:

2016-02-08-1454963673-3366100-taxi_dark_pattern.jpg
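
For a sense of scale, the preset options can be turned into dollar amounts. This is a small sketch only: the helper name preset_tips is hypothetical, and the amount the payment screen actually applies the percentage to is not stated here, so base_amount is an assumption.

```python
def preset_tips(base_amount, rates=(0.20, 0.25, 0.30)):
    """Return the dollar tip for each preset rate, rounded to cents."""
    return [round(base_amount * rate, 2) for rate in rates]


# Tips offered on a made-up $20.00 base amount
tips = preset_tips(20.00)
```

For a $20.00 base, the three buttons would correspond to tips of $4.00, $5.00, and $6.00.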


This was also the case in January 2015. Let's fit a regression line to see what people actually tip.
#################
# Make a scatter plot of the tip and total amount data and a linear regression model
#################
x = cleantiptotalamountsydf['total_amount'] - cleantiptotalamountsydf['tip_amount']
y = cleantiptotalamountsydf['tip_amount']

fit = np.polyfit(x, y, deg=1)
print('Slope = %s' % fit[0])
print('y-intercept = %s' % fit[1])

fig = plt.figure(figsize=(20,20))
plt.scatter(x, y)
plt.plot(x, fit[0] * x + fit[1], color='red')
plt.title('NYC Taxi Tips by Total Amount, January 2015')
plt.xlabel('Total Amount without Tip')
plt.ylabel('Tip')
plt.xlim(0,200)
plt.ylim(0,200)
plt.show()
Here is a plot of the linear regression model for the pairs (total_amount-tip, tip). This linear regression model has

slope = 0.17064017535 and y-intercept = 0.126236967211, that is, roughly 17 cents of tip for each additional dollar of pre-tip total:

tiptotal_line.png
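
To make the fit concrete, here is a minimal sketch of the closed-form least-squares line that a degree-1 np.polyfit computes, written in plain Python 3. The function name fit_line and the sample data are made up for illustration. The last lines apply the slope and y-intercept reported above to a $20 pre-tip total.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up pre-tip totals and tips lying exactly on the line y = 0.2 * x
xs = [10.0, 20.0, 30.0, 40.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(xs, ys)

# Apply the fit reported above to a $20 pre-tip total
REPORTED_SLOPE = 0.17064017535
REPORTED_INTERCEPT = 0.126236967211
predicted_tip = REPORTED_SLOPE * 20.0 + REPORTED_INTERCEPT
```

On the made-up data the fit recovers a slope of about 0.2 and an intercept of about 0. With the reported coefficients, the predicted tip on a $20 pre-tip total is about $3.54.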