NYC Taxi Data Analysis

(c) Jeff Breeding-Allison, updated May 4, 2016.

Description

In this project, I analyzed NYC taxi data from January 2015. The data was analyzed with IPython in a Jupyter Notebook.

I plotted the pickup locations for both yellow cabs and green cabs. I also compared the tip amount against the total fare and the total amount paid (which included the tip) for yellow cabs. The data was obtained from the NYC Taxi & Limousine Commission's website:

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

The data files I used were

"green_tripdata_2015-01.csv"

"yellow_tripdata_2015-01.csv"

Data

I used the packages

    #!/usr/bin/python
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib import urlopen  # Python 2
    import os
    import csv
    import scipy.stats
    %matplotlib inline

    #################
    # Load taxi data files located in same directory as this notebook
    #################
    ParentDirectory = os.getcwd()
    inputfile_g = os.path.join(ParentDirectory, "green_tripdata_2015-01.csv")
    inputfile_y = os.path.join(ParentDirectory, "yellow_tripdata_2015-01.csv")
    ydf = pd.read_csv(inputfile_y)
    gdf = pd.read_csv(inputfile_g)

    #################
    # Load taxi data files from the web
    #################
    ypage = urlopen("https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-01.csv")
    ydf = pd.read_csv(ypage)
    gpage = urlopen("https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-01.csv")
    gdf = pd.read_csv(gpage)
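The monthly trip files can be large, so it may help to read only the columns the analysis needs. The following is a minimal sketch, assuming the yellow-cab column names used later in this README (pickup_latitude, pickup_longitude, tip_amount, total_amount). The in-memory CSV and the names sample_csv, wanted, and ydf_small are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Made-up stand-in for yellow_tripdata_2015-01.csv (illustrative values only).
sample_csv = io.StringIO(
    "VendorID,pickup_longitude,pickup_latitude,tip_amount,total_amount\n"
    "1,-73.99,40.75,2.00,12.30\n"
    "2,-73.95,40.78,0.00,8.80\n"
)

# Columns used in the analysis (yellow-cab naming).
wanted = ["pickup_latitude", "pickup_longitude", "tip_amount", "total_amount"]

# usecols tells read_csv to parse and keep only the listed columns.
ydf_small = pd.read_csv(sample_csv, usecols=wanted)
print(ydf_small.shape)
```

With the real data, the same usecols argument could be passed alongside the file path or the opened URL.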

    #################
    # Show the first rows of the yellow taxi data
    #################
    ydf.head()

    #################
    # Show the first rows of the green taxi data
    #################
    gdf.head()

Pickup locations

I made the following scatter plots of the pickup locations for the yellow cabs and green cabs. There were many outliers outside New York's five boroughs, so I restricted the plots to a box that more tightly contains the New York City limits as stated here. The box determined by this source has

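I used the box with the following corners for my scatter plots: longitude from -74.259090 to -73.600272 and latitude from 40.517399 to 40.997577 (the plt.xlim and plt.ylim values in the plotting code).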

    #################
    # Extract the coordinates of each pick-up
    #################
    pydf = ydf[['pickup_latitude','pickup_longitude']]
    pgdf = gdf[['Pickup_latitude','Pickup_longitude']]

    #################
    # Show the first rows of the yellow taxi location data
    #################
    print("Yellow taxis")
    print("Number of rows: %i" % pydf.shape[0])
    pydf.head()

Here are the first rows of the green taxi pickup location data:

    #################
    # Show the first rows of the green taxi location data
    #################
    print("Green taxis")
    print("Number of rows: %i" % pgdf.shape[0])
    pgdf.head()

Now we make our scatter plots. Example code for plotting the green and yellow pickup locations together:

    #################
    # Make a scatter plot of the yellow and green taxi location data
    #################
    fig = plt.figure(figsize=(20,20))
    plt.scatter(pgdf['Pickup_longitude'], pgdf['Pickup_latitude'], color='green', s=0.5, label='Green taxis')
    plt.scatter(pydf['pickup_longitude'], pydf['pickup_latitude'], color='yellow', s=0.5, label='Yellow taxis')
    plt.title('NYC Taxi Pickup Locations January 2015')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.xlim(-74.259090, -73.600272)  # -74.259090, -73.700272
    plt.ylim(40.517399, 40.997577)  # 40.477399, 40.917577
    # plt.grid()
    # Black axes background
    plt.gca().set_facecolor('black')
    plt.legend()
    plt.show()

Green and yellow taxi pickup locations:

Yellow taxi pickup locations:

Green taxi pickup locations:
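The axis limits in the plotting code above only crop the view; the outlying rows are still passed to plt.scatter. A minimal sketch of dropping those rows before plotting, assuming the same limits as the plt.xlim and plt.ylim calls. The helper inside_box and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Plot limits taken from the plt.xlim / plt.ylim calls above.
LON_MIN, LON_MAX = -74.259090, -73.600272
LAT_MIN, LAT_MAX = 40.517399, 40.997577


def inside_box(df, lat_col, lon_col):
    """Return only the rows whose pickup point lies inside the box."""
    mask = (df[lon_col].between(LON_MIN, LON_MAX)
            & df[lat_col].between(LAT_MIN, LAT_MAX))
    return df[mask]


# Made-up sample rows: one inside the box, one at (0, 0), one too far north.
sample = pd.DataFrame({
    "pickup_latitude": [40.75, 0.0, 45.00],
    "pickup_longitude": [-73.99, 0.0, -73.99],
})
kept = inside_box(sample, "pickup_latitude", "pickup_longitude")
print(len(kept))
```

The column names are passed as arguments because the yellow and green files capitalize them differently (pickup_latitude versus Pickup_latitude).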

Tips

Next, I compared the tip amount to the total amount for yellow cabs. I cleaned the data to keep only trips whose total amount was less than or equal to $200.

    #################
    # Extract the tip and total amount data
    #################
    tiptotalamountsydf = ydf[['tip_amount','total_amount']]
    # Keep trips whose total amount is between $1 and $200, inclusive
    cleantiptotalamountsydf = tiptotalamountsydf.loc[tiptotalamountsydf['total_amount'].between(1, 200)]

    #################
    # Make a scatter plot of the tip and total amount data
    #################
    x = cleantiptotalamountsydf['total_amount'] - cleantiptotalamountsydf['tip_amount']
    y = cleantiptotalamountsydf['tip_amount']
    fig = plt.figure(figsize=(20,20))
    plt.scatter(x, y)
    plt.title('NYC Taxi Tips by Total Amount, January 2015')
    plt.xlabel('Total Amount without Tip')
    plt.ylabel('Tip')
    plt.xlim(0,200)
    plt.ylim(0,200)
    plt.show()

Here is a scatter plot of pairs (total_amount-tip, tip):

In New York cabs, a payment screen appears at the end of your trip. It has options to tip 20%, 25%, or 30% as indicated in the following image from The Huffington Post:

This was also the case in January 2015. Let's calculate a regression line to see what people actually tip.

    #################
    # Make a scatter plot of the tip and total amount data and a linear regression model
    #################
    x = cleantiptotalamountsydf['total_amount'] - cleantiptotalamountsydf['tip_amount']
    y = cleantiptotalamountsydf['tip_amount']
    # Fit a degree-1 polynomial (a line): fit[0] is the slope, fit[1] the intercept
    fit = np.polyfit(x, y, deg=1)
    print('Slope = %s' % fit[0])
    print('y-intercept = %s' % fit[1])
    fig = plt.figure(figsize=(20,20))
    plt.scatter(x, y)
    plt.plot(x, fit[0] * x + fit[1], color='red')
    plt.title('NYC Taxi Tips by Total Amount, January 2015')
    plt.xlabel('Total Amount without Tip')
    plt.ylabel('Tip')
    plt.xlim(0,200)
    plt.ylim(0,200)
    plt.show()
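np.polyfit with deg=1 performs an ordinary least-squares line fit. The following is a minimal pure-Python sketch of that calculation, not the notebook's own code: fit_line and the sample fares are hypothetical names and values introduced for illustration. If every rider tipped exactly 20% of the pre-tip amount, the slope would be 0.2 and the intercept 0, so the fitted slope can be read as a rough overall tip rate.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up fares where every rider tips exactly 20% of the pre-tip amount.
fares = [10.0, 20.0, 30.0, 40.0]
tips = [0.2 * f for f in fares]

slope, intercept = fit_line(fares, tips)
print("Slope = %.3f" % slope)
print("y-intercept = %.3f" % intercept)
```

On real trip data the points will not fall on a single line, since trips with a recorded tip of zero pull the slope below the preset percentages.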