The NYC public transportantion system - Metro Transit Authority - provides data for download via csv files. Part of the information available are data from the subway turnstiles, containing weekly logs for cumulative entries and exits by turnstile and by subway station during a provided timeframe.
For this project, we will only use the information available at: http://web.mta.info/developers/turnstile.html.
For this project, you will apply the knowledge acquired in the first month of this course. We will practice basic data acquisition and data cleaning tasks to find out fundamental stuff about the data using what we learned in the Statistics course.
The goal of this project is to explore the relationship between data from the NYC Subway turnstiles and the city weather. For this, besides data from the subway, we will also need data from the weather in NYC.
Here are the main points that will be considered in this work:
- Gathering data from the Internet
- Using Statistics for Data Analysis
- Data handling and simple graphics creation with
Pandas
How to find help: We suggest that you try the following channels, in the following order:
Type of Question\Channels Google Forum Slack Email
Pandas and Python Programming 1 2 3
Projects Requiriments 1 2 3
Projects Specific Parts 1 2 3
Here is the address for each of these channels:
- Forum: https://discussions.udacity.com/c/ndfdsi-project
- Slack: Big Data Foundations
- Email: [email protected]
The student is expected to submit this report including:
- All TODO's completed, as they are crucial for the code to run accordingly
- The ipynb file, exported as html
To submit this project, go to the
classroom,
and submit your zipped .ipynb
and html.
Before we start, there are a few things you must have in mind while using iPython notebooks:
- Remember you can see, in the left side of a code cell, when was the last time it ran, if there is a number inside the keys.
- When starting a new session in the notebook, please make sure to run all cells up to the point where you last left it. Even if the output can still be viewed from the moment you ran your cells in the previews session, the kernel starts in a new state, so you will need to reload all data, etc. in a new session.
- The previous point is useful to have in mind if your answers do not match what is expected from the quizzes in the classroom. Try reloading the data and running all processing steps, one by one, to make sure you're working with the same variables and data from each step of the quizz.
Let's do it!! Now it's your turn to gather data. Please write bellow a Python code to access the link http://web.mta.info/developers/turnstile.html and download all files from June 2017. The file must be named turnstile_100617.txt, where 10/06/17 is the file's date.
Please see below a few commands that might help you:
Use the urllib library to open and redeem a webpage. Use the command below, where url is the webpage path to the following file:
u = urllib.urlopen(url)
html = u.read()
Use the BeautifulSoup library to search for the link to the file you want to donwload in the page. Use the command below to create your soup object and search for all 'a' tags in the document:
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
A tip to only download the files from June is to check data in the name of the file. For instance, to donwload the 17/06/2017 file, please see if the link ends with "turnstile_170610.txt". If you forget to do this, you will download all files from that page. In order to do this, you can use the following command:
if '1706' in link.get('href'):
Our final tip is to use the command bellow to download the txt file:
urllib.urlretrieve(link_do_arquivo, filename)
Please remember - you first have to load all packages and functions that will be used in your analysys.
In [1]:
import urllib
import requests
from bs4 import BeautifulSoup
url="http://web.mta.info/developers/"
urlLink = requests.get(url + "turnstile.html")
html = urlLink.text
soup = BeautifulSoup(html,"html.parser")
links = soup.find_all('a')
for link in links:
u = str(link.get('href'))
if '1706' in u:
getFileName = link.get('href').split('/')[-1].split('_')
print(u)
date = getFileName[1][0:2]
month = getFileName[1][2:4]
year = getFileName[1][4:6]
string = getFileName[0] + "_" + date + month + year + ".txt"
print(string)
urllib.request.urlretrieve(url + '/' + u, string)
data/nyct/turnstile/turnstile_170624.txt
turnstile_170624.txt
data/nyct/turnstile/turnstile_170617.txt
turnstile_170617.txt
data/nyct/turnstile/turnstile_170610.txt
turnstile_170610.txt
data/nyct/turnstile/turnstile_170603.txt
turnstile_170603.txt
Write down a function that takes the list of all names of the files you downloaded in Exercise 1.1 and compile them into one single file. There must be only one header line in the output file.
For example, if file_1 has: line 1... line 2...
and the other file, file_2, has: line 3... line 4... line 5...
We must combine file_1 and file_2 into one master file, as follows:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn' line 1... line 2... line 3... line 4... line 5...
In [2]:
def create_master_turnstile_file(filenames, output_file):
with open(output_file, 'w') as master_file:
master_file.write('C/A,UNIT,SCP,STATION, LINENAME, DIVISION, DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
for filename in filenames:
file1 = open(filename,'r')
count = 0
for row in file1:
if(count == 1):
master_file.writelines(row)
else:
count = 1
create_master_turnstile_file(["turnstile_170624.txt", "turnstile_170617.txt", "turnstile_170610.txt", "turnstile_170603.txt"], "master_file.txt")
For this exercise, you will write a function that reads the master_file created in the previous exercise and load it into a Pandas Dataframe. This function can be filtered, so that the Dataframe only has lines where column "DESCn" has the value "Regular".
For example, if the Pandas Dataframe looks like this:
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
The Dataframe must look like the following, after filtering only the lines where column DESCn has the value REGULAR:
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
In [6]:
import pandas
def filter_by_regular(filename):
turnstile_data = pandas.read_csv(filename)
turnstile_data = pandas.DataFrame(turnstile_data)
turnstile_data = turnstile_data[turnstile_data.DESCn == 'REGULAR']
return turnstile_data
file1 = filter_by_regular("master_file.txt")
file1.head()
Out[6]:
C/A
UNIT
SCP
STATION
LINENAME
DIVISION
DATEn
TIMEn
DESCn
ENTRIESn
EXITSn
0
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
00:00:00
REGULAR
6224816
2107317
1
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
04:00:00
REGULAR
6224850
2107322
2
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
08:00:00
REGULAR
6224885
2107352
3
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
12:00:00
REGULAR
6225005
2107452
4
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
16:00:00
REGULAR
6225248
2107513
The NYC Subway data has cumulative entry and exit data in each line. Let's assume you have a Dataframe called df, which contains only lines for one particular turnstile (unique SCP, C/A, and UNIT). The following function must change these cumulative entries for counting all entries since the last reading (entries from the last line of the Dataframe).
More specifically, there are two things you should do:
1 - Create a new column, called ENTRIESn_hourly 2 - Insert in this column the difference between ENTRIESn in the current and the previous column. If the line has any NAN, fill it out/replace by 1.
Tip: The funtions shift() and fillna() in Pandas might be usefull for this exercise.
Below you will find and example of how your Dataframe should look by the end of this exercise:
C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly
0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 1
1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23
2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18
3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71
4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170
5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214
6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87
7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10
8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36
9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153
10 A002 R051 02-00-00 05-02-11 16:00:00 REGULAR 3145337 1088823 243
In [9]:
import pandas
def get_hourly_entries(df):
df['ENTRIESn_hourly'] = df.ENTRIESn.diff(1)
df.ENTRIESn_hourly.fillna(1,inplace = True)
return df
file2 = get_hourly_entries(file1)
file2.head()
Out[9]:
C/A
UNIT
SCP
STATION
LINENAME
DIVISION
DATEn
TIMEn
DESCn
ENTRIESn
EXITSn
ENTRIESn_hourly
0
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
00:00:00
REGULAR
6224816
2107317
1.0
1
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
04:00:00
REGULAR
6224850
2107322
34.0
2
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
08:00:00
REGULAR
6224885
2107352
35.0
3
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
12:00:00
REGULAR
6225005
2107452
120.0
4
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
16:00:00
REGULAR
6225248
2107513
243.0
Do the same thing you did in the previous exercise, but taking into account the exits, column EXITSn. For this, you need to create a column called EXITSn_hourly and insert the difference between the column EXITSn in the current line vs he previous line. If there is any NaN, fill it out/replace by 0.
In [11]:
import pandas
def get_hourly_exits(df):
df['EXITSn_hourly'] = df.EXITSn.diff(1)
df.EXITSn_hourly.fillna(1,inplace = True)
return df
file3 = get_hourly_exits(file1)
file3.head()
Out[11]:
C/A
UNIT
SCP
STATION
LINENAME
DIVISION
DATEn
TIMEn
DESCn
ENTRIESn
EXITSn
ENTRIESn_hourly
EXITSn_hourly
0
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
00:00:00
REGULAR
6224816
2107317
1.0
1.0
1
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
04:00:00
REGULAR
6224850
2107322
34.0
5.0
2
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
08:00:00
REGULAR
6224885
2107352
35.0
30.0
3
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
12:00:00
REGULAR
6225005
2107452
120.0
100.0
4
A002
R051
02-00-00
59 ST
NQR456W
BMT
06/17/2017
16:00:00
REGULAR
6225248
2107513
243.0
61.0
Given an entry variable that represents time, in the format: "00:00:00" (hour: minutes: seconds) Write a function to extract the hour part from the time in the entry variable And return it as an integer. For example: 1) if hour is 00, your code must return 0 2) if hour is 01, your code must return 1 3) if hour is 21, your code must return 21 Please return te hour as an integer.
In [2]:
import pandas
def time_to_hour(time):
hour = time.TIMEn.str[-8:-6].astype(int) # your code here
return hour
df=pandas.read_csv("master_file.txt");
print(time_to_hour(df))
0 0
1 4
2 8
3 12
4 16
5 20
6 0
7 4
8 8
9 12
10 16
11 20
12 0
13 4
14 8
15 12
16 16
17 20
18 0
19 4
20 8
21 12
22 16
23 20
24 0
25 4
26 8
27 12
28 16
29 20
..
788184 1
788185 5
788186 9
788187 13
788188 17
788189 21
788190 1
788191 5
788192 9
788193 13
788194 17
788195 21
788196 1
788197 5
788198 9
788199 13
788200 17
788201 21
788202 1
788203 5
788204 9
788205 13
788206 17
788207 21
788208 1
788209 5
788210 9
788211 13
788212 17
788213 21
Name: TIMEn, Length: 788214, dtype: int32
To understand the relationship between the Subway activity and the weather, please complete the data from the file already downloaded with the weather data. We provided you with the file containing NYC weather data and made it available with the Support Material. You can access it through the link: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
Now that we have our data in a csv file, write Python code that reads this file and saves it into a Pandas Dataframe.
Tip:
Use the command below to read the file:
pd.read_csv('output_list.txt', sep=",")
In [1]:
import pandas
filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename, sep=",")
filename1 = pandas.DataFrame(filename1)
filename1.head()
Out[1]:
Unnamed: 0
UNIT
DATEn
TIMEn
Hour
DESCn
ENTRIESn_hourly
EXITSn_hourly
maxpressurei
maxdewpti
...
meandewpti
meanpressurei
fog
rain
meanwindspdi
mintempi
meantempi
maxtempi
precipi
thunder
0
0
R001
2011-05-01
01:00:00
1
REGULAR
0.0
0.0
30.31
42.0
...
39.0
30.27
0.0
0.0
5.0
50.0
60.0
69.0
0.0
0.0
1
1
R001
2011-05-01
05:00:00
5
REGULAR
217.0
553.0
30.31
42.0
...
39.0
30.27
0.0
0.0
5.0
50.0
60.0
69.0
0.0
0.0
2
2
R001
2011-05-01
09:00:00
9
REGULAR
890.0
1262.0
30.31
42.0
...
39.0
30.27
0.0
0.0
5.0
50.0
60.0
69.0
0.0
0.0
3
3
R001
2011-05-01
13:00:00
13
REGULAR
2451.0
3708.0
30.31
42.0
...
39.0
30.27
0.0
0.0
5.0
50.0
60.0
69.0
0.0
0.0
4
4
R001
2011-05-01
17:00:00
17
REGULAR
4400.0
2501.0
30.31
42.0
...
39.0
30.27
0.0
0.0
5.0
50.0
60.0
69.0
0.0
0.0
5 rows × 22 columns
Now, create a function that calculates the number of rainy days. For this, return the count of the number of days where the column "rain" is equal to 1.
Tip: You might think that interpreting numbers as integers or floats might not work at first. To handle this issue, it might be useful to convert these numbers into integers. You can do this by writting cast (column as integer). So, for example, if we want to launch the column maxtempi as an integer, we have to write something like cast (maxtempi as integer) = 76, instead of just where maxtempi = 76.
In [10]:
def num_rainy_days(df):
count = 0
for i in df:
if(int(i) == 1):
count+=1
return count
countR = num_rainy_days(filename1["rain"])
print(countR)
44104
Calculate if the day was cloudy or not (0 or 1) and the maximum temperature for fog (i.e. the maximum temperature for cloudy days).
In [19]:
import pandas as pd
def max_temp_aggregate_by_fog(df):
df = df.loc[df["fog"] == 1]
count = df["maxtempi"].max()
return count
file = "turnstile_data_master_with_weather.csv"
filename1 = pd.read_csv(file);
output = max_temp_aggregate_by_fog(filename1)
print(output)
81.0
Now, calculate the mean for 'meantempi' for the days that are Saturdays or Sundays (weekend):
In [20]:
def avg_weekend_temperature(filename):
filename["DATEn"] = pandas.to_datetime(filename.DATEn)
aday = (filename.DATEn.dt.weekday_name)
filename = filename.loc[(aday=="Saturday")|(aday=="Sunday")]
mean = filename["meantempi"].mean()
return mean
filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = avg_weekend_temperature(filename1)
print(output)
65.10066685403307
Calculate the mean of the minimum temperature 'mintempi' for the days when the minimum temperature was greater that 55 degrees:
In [21]:
def avg_min_temperature(filename):
avg_min_temp_rainy = filename.loc[filename["mintempi"]>55]["mintempi"].mean()
return avg_min_temp_rainy
filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = avg_min_temperature(filename1)
print(output)
63.2699012987013
Before you make any analysis, it might be useful to look at the data we want to analyse. More specifically, we will evaluate the entries by hour in our data from the NYC Subway to determine the data distribution. This data is stored in the column ['ENTRIESn_hourly']. Draw two histogramns in the same axis, to show the entries when it's raining vs when it's not. Below, you will find an example of how to draw histogramns with Pandas and Matplotlib:
Turnstile_weather ['column_to_graph']. Hist ()
In [2]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
plt.figure()
turnstile_weather.loc[turnstile_weather['rain'] == 0 ]['ENTRIESn_hourly'].hist(label = "Not Rainy Day")
turnstile_weather.loc[turnstile_weather['rain'] == 1 ]['ENTRIESn_hourly'].hist(label = "Rainy Day")
plt.xlabel("Number of entries")
plt.ylabel("Total")
plt.title('entries vs total_count_of_that_entry')
plt.legend()
return plt
filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = entries_histogram(filename1)
output.show()
The data you just plotted is in what kind of distribution? Is there a difference in distribution between rainy and non-rainy days?
Answer: The data is positive skewed distribution. Rainy hourly entries is almost the half of non-rainy hourly enteries
Build a function that returns:
- The mean of entries when it's raining
- The mean of entries when it's not raining
In [1]:
import numpy as np
import pandas
def means(turnstile_weather):
rainy = turnstile_weather.loc[turnstile_weather['rain'] == 1 ]['ENTRIESn_hourly']
sunny = turnstile_weather.loc[turnstile_weather['rain'] == 0 ]['ENTRIESn_hourly']
return rainy.mean(), sunny.mean()
filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = means(filename1)
print(output)
(1105.4463767458733, 1090.278780151855)
Answer to the following questions according to your functions' exits:
- What is the mean of entries when it's raining?
- What is the mean of entries when it's not raining?
Answer: Raining : 1105.4463767458733 ; Non-Raining : 1090.278780151855
The entry for this exercise is the same file from the previous session (Exercise 2). You can download the file from this link:
Now, we will create a mapper. For each entry line, the mapper exit must PRINT (not return) UNIT as a key, and the number of ENTRIESn_hourly as the value. Separate the key and the value with a tab. For example: 'R002 \ t105105.0'
Export your mapper into a file named mapper_result.txt and send it with your submission. The code for exporting your mapper is already written in the code bellow.
In [3]:
import sys
def mapper():
sys.stdin.readline()
for line in sys.stdin:
string = line.split(',')
if len(string) != 22:
continue
print("{0}\t{1}".format(string[1],string[6]))
sys.stdin = open('turnstile_data_master_with_weather.csv')
sys.stdout = open('mapper_result.txt', 'w')
mapper()
Now, create the reducer. Given the mapper result from the previous exercise, the reducer must print (not return) one line per UNIT, with the total number of ENTRIESn_hourly during May (which is our data duration), separated by a tab. An example of exit line from the reducer may look like this: 'R001 \ t500625.0'
You can assume that the entry for the reducer is ordered in a way that all lines corresponding to a particular unit are grouped. However, the reducer exit will have repetition, as there are stores that appear in different files' locations.
Export your reducer into a file named reducer_result.txt and send it with your submission.
In [5]:
import sys
def reducer():
key = None
countEntries = 0
for line in sys.stdin:
string = line.split('\t')
if len(string)!=2:
continue
key2,string1 = string
if key and key!=key2:
print(key,'\t',countEntries)
key = key2
countEntries=0
key = key2
countEntries = countEntries + float(string1)
if key != None:
print(key,"\t",countEntries)
sys.stdin = open('mapper_result.txt', 'r').readlines()
sys.stdout = open('reducer_result.txt', 'w')
reducer()