Subway Data Analysis

Introduction

The NYC public transportation system, run by the Metropolitan Transportation Authority (MTA), provides data for download via CSV files. Part of the information available is data from the subway turnstiles, containing weekly logs of cumulative entries and exits by turnstile and by subway station during a given timeframe.

For this project, we will only use the information available at: http://web.mta.info/developers/turnstile.html.

About this project

For this project, you will apply the knowledge acquired in the first month of this course. We will practice basic data acquisition and data cleaning tasks to uncover fundamental facts about the data, using what we learned in the Statistics course.

The goal of this project is to explore the relationship between data from the NYC Subway turnstiles and the city's weather. For this, besides the subway data, we will also need data on the weather in NYC.

Here are the main points that will be considered in this work:

  • Gathering data from the Internet
  • Using Statistics for Data Analysis
  • Data handling and simple graphics creation with Pandas

How to find help: We suggest that you try the following channels, in the following order:

Type of Question \ Channel       Google   Forum   Slack   Email
Pandas and Python Programming       1       2       3
Projects Requirements               1       2       3
Projects Specific Parts             1       2       3

Here is the address for each of these channels:

The student is expected to submit this report including:

  • All TODOs completed, as they are crucial for the code to run properly
  • The ipynb file, exported as html

To submit this project, go to the classroom, and submit your zipped .ipynb and html.

Reminders

Before we start, there are a few things you must keep in mind while using IPython notebooks:

  • Remember that you can see, on the left side of a code cell, when it was last run, if there is a number inside the brackets.
  • When starting a new session in the notebook, please make sure to run all cells up to the point where you last left off. Even if the output can still be viewed from the moment you ran your cells in the previous session, the kernel starts in a new state, so you will need to reload all data, etc., in a new session.
  • The previous point is useful to keep in mind if your answers do not match what is expected from the quizzes in the classroom. Try reloading the data and running all processing steps, one by one, to make sure you're working with the same variables and data from each step of the quiz.

Session 1 - Data Gathering

Exercise 1.1

Let's do it!! Now it's your turn to gather data. Please write below Python code to access the link http://web.mta.info/developers/turnstile.html and download all files from June 2017. Each file must be named turnstile_100617.txt, where 10/06/17 is the file's date.

Please see below a few commands that might help you:

Use the urllib library to open and read a webpage. Use the commands below, where url is the path to the webpage containing the file:

u = urllib.request.urlopen(url)
html = u.read()

Use the BeautifulSoup library to search the page for the link to the file you want to download. Use the commands below to create your soup object and search for all 'a' tags in the document:

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

A tip to download only the files from June is to check the date in the name of the file. For instance, to download the 10/06/2017 file, check whether the link ends with "turnstile_170610.txt". If you forget to do this, you will download all files from that page. For this check, you can use the following command:

if '1706' in link.get('href'):

Our final tip is to use the command below to download the txt file:

urllib.request.urlretrieve(link_do_arquivo, filename)

Please remember - you first have to load all packages and functions that will be used in your analysis.

In [1]:

import urllib.request
import requests
from bs4 import BeautifulSoup

url = "http://web.mta.info/developers/"
urlLink = requests.get(url + "turnstile.html")
html = urlLink.text
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
for link in links:
    u = str(link.get('href'))
    if '1706' in u:  # keep only the June 2017 files
        getFileName = link.get('href').split('/')[-1].split('_')
        print(u)
        # the file name encodes the date as YYMMDD, e.g. 170610
        year = getFileName[1][0:2]
        month = getFileName[1][2:4]
        day = getFileName[1][4:6]
        string = getFileName[0] + "_" + year + month + day + ".txt"
        print(string)
        urllib.request.urlretrieve(url + u, string)

data/nyct/turnstile/turnstile_170624.txt
turnstile_170624.txt
data/nyct/turnstile/turnstile_170617.txt
turnstile_170617.txt
data/nyct/turnstile/turnstile_170610.txt
turnstile_170610.txt
data/nyct/turnstile/turnstile_170603.txt
turnstile_170603.txt

Exercise 1.2

Write a function that takes the list of all names of the files you downloaded in Exercise 1.1 and compiles them into one single file. There must be only one header line in the output file.

For example, if file_1 has: line 1... line 2...

and the other file, file_2, has: line 3... line 4... line 5...

We must combine file_1 and file_2 into one master file, as follows:

'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn' line 1... line 2... line 3... line 4... line 5...

In [2]:

def create_master_turnstile_file(filenames, output_file):
    with open(output_file, 'w') as master_file:
        master_file.write('C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
        for filename in filenames:
            with open(filename, 'r') as file1:
                next(file1)  # skip each input file's own header line
                for row in file1:
                    master_file.write(row)

create_master_turnstile_file(["turnstile_170624.txt", "turnstile_170617.txt",
                              "turnstile_170610.txt", "turnstile_170603.txt"],
                             "master_file.txt")

Exercise 1.3

For this exercise, you will write a function that reads the master_file created in the previous exercise and loads it into a Pandas DataFrame. The function should filter the DataFrame so that it only keeps lines where the column "DESCn" has the value "REGULAR".

For example, if the Pandas Dataframe looks like this:

,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231

The Dataframe must look like the following, after filtering only the lines where column DESCn has the value REGULAR:

0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177

In [6]:

import pandas

def filter_by_regular(filename):
    # read_csv already returns a DataFrame; keep only the REGULAR audit rows
    turnstile_data = pandas.read_csv(filename)
    turnstile_data = turnstile_data[turnstile_data.DESCn == 'REGULAR']
    return turnstile_data

file1 = filter_by_regular("master_file.txt")
file1.head()

Out[6]:

    C/A  UNIT       SCP STATION LINENAME DIVISION       DATEn     TIMEn    DESCn  ENTRIESn   EXITSn
0  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  00:00:00  REGULAR   6224816  2107317
1  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  04:00:00  REGULAR   6224850  2107322
2  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  08:00:00  REGULAR   6224885  2107352
3  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  12:00:00  REGULAR   6225005  2107452
4  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  16:00:00  REGULAR   6225248  2107513

Exercise 1.4

The NYC Subway data has cumulative entry and exit counts in each line. Let's assume you have a DataFrame called df, which contains only lines for one particular turnstile (unique SCP, C/A, and UNIT). The following function must turn these cumulative entries into counts of entries since the last reading (i.e., the difference between the current line's ENTRIESn and the previous line's).

More specifically, there are two things you should do:

1 - Create a new column, called ENTRIESn_hourly
2 - Insert in this column the difference between the ENTRIESn values in the current and the previous row. If the line has any NaN, fill/replace it with 1.

Tip: The functions shift() and fillna() in Pandas might be useful for this exercise; an equivalent sketch follows the example below.

Below you will find an example of how your DataFrame should look by the end of this exercise:

    C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly
0     A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                1
1     A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23
2     A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18
3     A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71
4     A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170
5     A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214
6     A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87
7     A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10
8     A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36
9     A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153
10    A002  R051  02-00-00  05-02-11  16:00:00  REGULAR   3145337   1088823              243
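
As the tip suggests, shift() and fillna() can compute the same per-row difference explicitly (the solution below uses the equivalent diff()). A minimal sketch on a toy Series with made-up cumulative counts:

import pandas as pd

s = pd.Series([3144312, 3144335, 3144353])  # toy cumulative entry counts
hourly = s - s.shift(1)    # same result as s.diff(1): [NaN, 23.0, 18.0]
hourly = hourly.fillna(1)  # replace the leading NaN with 1, per the exercise
print(hourly.tolist())     # [1.0, 23.0, 18.0]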

In [9]:

import pandas

def get_hourly_entries(df):
    # difference between each reading and the previous one; fill the leading NaN with 1
    df['ENTRIESn_hourly'] = df.ENTRIESn.diff(1)
    df.ENTRIESn_hourly.fillna(1, inplace=True)
    return df

file2 = get_hourly_entries(file1)
file2.head()

Out[9]:

    C/A  UNIT       SCP STATION LINENAME DIVISION       DATEn     TIMEn    DESCn  ENTRIESn   EXITSn  ENTRIESn_hourly
0  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  00:00:00  REGULAR   6224816  2107317              1.0
1  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  04:00:00  REGULAR   6224850  2107322             34.0
2  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  08:00:00  REGULAR   6224885  2107352             35.0
3  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  12:00:00  REGULAR   6225005  2107452            120.0
4  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  16:00:00  REGULAR   6225248  2107513            243.0

Exercise 1.5

Do the same thing you did in the previous exercise, but this time for the exits, column EXITSn. For this, you need to create a column called EXITSn_hourly and insert the difference between the column EXITSn in the current line vs. the previous line. If there is any NaN, fill/replace it with 0.

In [11]:

import pandas

def get_hourly_exits(df):
    # difference between each reading and the previous one
    df['EXITSn_hourly'] = df.EXITSn.diff(1)
    df.EXITSn_hourly.fillna(0, inplace=True)  # the exercise asks for 0 here
    return df

file3 = get_hourly_exits(file1)
file3.head()

Out[11]:

    C/A  UNIT       SCP STATION LINENAME DIVISION       DATEn     TIMEn    DESCn  ENTRIESn   EXITSn  ENTRIESn_hourly  EXITSn_hourly
0  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  00:00:00  REGULAR   6224816  2107317              1.0            0.0
1  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  04:00:00  REGULAR   6224850  2107322             34.0            5.0
2  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  08:00:00  REGULAR   6224885  2107352             35.0           30.0
3  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  12:00:00  REGULAR   6225005  2107452            120.0          100.0
4  A002  R051  02-00-00   59 ST  NQR456W      BMT  06/17/2017  16:00:00  REGULAR   6225248  2107513            243.0           61.0

Exercise 1.6

Given an entry variable that represents time, in the format "00:00:00" (hour:minutes:seconds), write a function to extract the hour part from the time in the entry variable and return it as an integer. For example:

  1) if hour is 00, your code must return 0
  2) if hour is 01, your code must return 1
  3) if hour is 21, your code must return 21

Please return the hour as an integer.

In [2]:

import pandas

def time_to_hour(time):
    # TIMEn is formatted "HH:MM:SS"; take the two hour digits and cast to int
    hour = time.TIMEn.str[-8:-6].astype(int)
    return hour

df = pandas.read_csv("master_file.txt")
print(time_to_hour(df))

0          0
1          4
2          8
3         12
4         16
5         20
6          0
7          4
8          8
9         12
10        16
11        20
12         0
13         4
14         8
15        12
16        16
17        20
18         0
19         4
20         8
21        12
22        16
23        20
24         0
25         4
26         8
27        12
28        16
29        20
          ..
788184     1
788185     5
788186     9
788187    13
788188    17
788189    21
788190     1
788191     5
788192     9
788193    13
788194    17
788195    21
788196     1
788197     5
788198     9
788199    13
788200    17
788201    21
788202     1
788203     5
788204     9
788205    13
788206    17
788207    21
788208     1
788209     5
788210     9
788211    13
788212    17
788213    21
Name: TIMEn, Length: 788214, dtype: int32
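
The exercise statement describes a single time string, while the cell above maps the whole TIMEn column. For completeness, a per-value version (a minimal sketch; the function name is hypothetical):

def time_to_hour_single(time_str):
    # "21:00:00" -> 21
    return int(time_str.split(':')[0])

print(time_to_hour_single("21:00:00"))  # prints 21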

Session 2 - Data Analysis

Exercise 2.1

To understand the relationship between Subway activity and the weather, please combine the data from the file you already downloaded with the weather data. We have provided the file containing NYC weather data and made it available with the Support Material. You can access it through this link: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv

Now that we have our data in a CSV file, write Python code that reads this file and saves it into a Pandas DataFrame.

Tip:

Use the command below to read the file:

pd.read_csv('output_list.txt', sep=",")

In [1]:

import pandas

filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename, sep=",")  # read_csv already returns a DataFrame
filename1.head()

Out[1]:

   Unnamed: 0  UNIT       DATEn     TIMEn  Hour    DESCn  ENTRIESn_hourly  EXITSn_hourly  maxpressurei  maxdewpti  ...  meandewpti  meanpressurei  fog  rain  meanwindspdi  mintempi  meantempi  maxtempi  precipi  thunder
0           0  R001  2011-05-01  01:00:00     1  REGULAR              0.0            0.0         30.31       42.0  ...        39.0          30.27  0.0   0.0           5.0      50.0       60.0      69.0      0.0      0.0
1           1  R001  2011-05-01  05:00:00     5  REGULAR            217.0          553.0         30.31       42.0  ...        39.0          30.27  0.0   0.0           5.0      50.0       60.0      69.0      0.0      0.0
2           2  R001  2011-05-01  09:00:00     9  REGULAR            890.0         1262.0         30.31       42.0  ...        39.0          30.27  0.0   0.0           5.0      50.0       60.0      69.0      0.0      0.0
3           3  R001  2011-05-01  13:00:00    13  REGULAR           2451.0         3708.0         30.31       42.0  ...        39.0          30.27  0.0   0.0           5.0      50.0       60.0      69.0      0.0      0.0
4           4  R001  2011-05-01  17:00:00    17  REGULAR           4400.0         2501.0         30.31       42.0  ...        39.0          30.27  0.0   0.0           5.0      50.0       60.0      69.0      0.0      0.0

5 rows × 22 columns

Exercise 2.2

Now, create a function that calculates the number of rainy days. For this, return the count of the number of days where the column "rain" is equal to 1.

Tip: You might think that interpreting numbers as integers or floats might not work at first. To handle this issue, it might be useful to convert these numbers into integers. You can do this by writing cast(column as integer). So, for example, if we want to compare the column maxtempi as an integer, we have to write something like cast(maxtempi as integer) = 76, instead of just where maxtempi = 76.
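
The tip is phrased in SQL terms (the original course material used pandasql); in plain Pandas the equivalent of cast(column as integer) is astype(int). A minimal sketch on a hypothetical toy frame:

import pandas as pd

toy = pd.DataFrame({'rain': [0.0, 1.0, 1.0, 0.0]})
# cast the float flags to integers before comparing, as the tip recommends
rainy_rows = (toy['rain'].astype(int) == 1).sum()
print(rainy_rows)  # prints 2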

In [10]:

def num_rainy_days(df):
    # count the readings where the 'rain' flag equals 1
    count = 0
    for i in df:
        if int(i) == 1:
            count += 1
    return count

countR = num_rainy_days(filename1["rain"])
print(countR)

44104

Exercise 2.3

Calculate whether the day was foggy or not (0 or 1) and the maximum temperature for fog (i.e., the maximum temperature on foggy days).

In [19]:

import pandas as pd

def max_temp_aggregate_by_fog(df):
    # keep only the foggy days, then take the highest maximum temperature
    df = df.loc[df["fog"] == 1]
    max_temp = df["maxtempi"].max()
    return max_temp

file = "turnstile_data_master_with_weather.csv"
filename1 = pd.read_csv(file)
output = max_temp_aggregate_by_fog(filename1)
print(output)

81.0

Exercise 2.4

Now, calculate the mean for 'meantempi' for the days that are Saturdays or Sundays (weekend):

In [20]:

def avg_weekend_temperature(filename):
    filename["DATEn"] = pandas.to_datetime(filename.DATEn)
    aday = filename.DATEn.dt.day_name()  # day_name() replaces the deprecated weekday_name
    filename = filename.loc[(aday == "Saturday") | (aday == "Sunday")]
    mean = filename["meantempi"].mean()
    return mean

filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = avg_weekend_temperature(filename1)
print(output)

65.10066685403307

Exercise 2.5

Calculate the mean of the minimum temperature 'mintempi' for the days when the minimum temperature was greater than 55 degrees:

In [21]:

def avg_min_temperature(filename):
    # mean of the minimum temperatures, restricted to readings above 55 degrees
    avg_min_temp = filename.loc[filename["mintempi"] > 55]["mintempi"].mean()
    return avg_min_temp

filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = avg_min_temperature(filename1)
print(output)

63.2699012987013

Exercise 2.6

Before you make any analysis, it might be useful to look at the data we want to analyze. More specifically, we will evaluate the hourly entries in our NYC Subway data to determine their distribution. This data is stored in the column ['ENTRIESn_hourly']. Draw two histograms on the same axes, to show the entries when it's raining vs. when it's not. Below you will find an example of how to draw histograms with Pandas and Matplotlib:

turnstile_weather['column_to_graph'].hist()

In [2]:

import numpy as np
import pandas
import matplotlib.pyplot as plt

def entries_histogram(turnstile_weather):
    # overlay the hourly-entries histograms for dry and rainy readings
    plt.figure()
    turnstile_weather.loc[turnstile_weather['rain'] == 0]['ENTRIESn_hourly'].hist(label="Not Rainy Day")
    turnstile_weather.loc[turnstile_weather['rain'] == 1]['ENTRIESn_hourly'].hist(label="Rainy Day")
    plt.xlabel("Number of entries")
    plt.ylabel("Total")
    plt.title('Hourly entries, rainy vs. non-rainy days')
    plt.legend()
    return plt

filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = entries_histogram(filename1)
output.show()

Exercise 2.7

The data you just plotted is in what kind of distribution? Is there a difference in distribution between rainy and non-rainy days?

Answer: The data follows a positively skewed (right-skewed) distribution. The count of rainy hourly entries is almost half the count of non-rainy hourly entries.
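
To back the skewness claim numerically, Pandas' skew() returns a positive value for right-skewed data. A quick check (a sketch, assuming filename1 from Exercise 2.6 is still loaded):

print(filename1['ENTRIESn_hourly'].skew())  # a clearly positive value indicates right skew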

Exercise 2.8

Build a function that returns:

  1. The mean of entries when it's raining
  2. The mean of entries when it's not raining

In [1]:

import numpy as np
import pandas

def means(turnstile_weather):
    rainy = turnstile_weather.loc[turnstile_weather['rain'] == 1]['ENTRIESn_hourly']
    sunny = turnstile_weather.loc[turnstile_weather['rain'] == 0]['ENTRIESn_hourly']
    return rainy.mean(), sunny.mean()

filename = "turnstile_data_master_with_weather.csv"
filename1 = pandas.read_csv(filename)
output = means(filename1)
print(output)

(1105.4463767458733, 1090.278780151855)

Answer the following questions according to your functions' outputs:

  1. What is the mean of entries when it's raining?
  2. What is the mean of entries when it's not raining?

Answer: Raining : 1105.4463767458733 ; Non-Raining : 1090.278780151855

Session 3 - MapReduce

Exercise 3.1

The input for this exercise is the same file from the previous session (Session 2). You can download the file from this link:

https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv

Now, we will create a mapper. For each input line, the mapper output must PRINT (not return) the UNIT as a key and the number of ENTRIESn_hourly as the value. Separate the key and the value with a tab. For example: 'R002\t105105.0'

Export your mapper result into a file named mapper_result.txt and send it with your submission. The code for exporting your mapper result is already written in the cell below.

In [3]:

import sys

def mapper():
    sys.stdin.readline()  # skip the CSV header line

    for line in sys.stdin:
        string = line.split(',')
        if len(string) != 22:
            continue
        # column 1 is UNIT, column 6 is ENTRIESn_hourly
        print("{0}\t{1}".format(string[1], string[6]))

sys.stdin = open('turnstile_data_master_with_weather.csv')
sys.stdout = open('mapper_result.txt', 'w')
mapper()
sys.stdout.close()           # flush the mapper output to disk
sys.stdout = sys.__stdout__  # restore the console for later cells

Exercise 3.2

Now, create the reducer. Given the mapper result from the previous exercise, the reducer must print (not return) one line per UNIT, with the total number of ENTRIESn_hourly during May (the timespan of our data), separated by a tab. An example of an output line from the reducer may look like this: 'R001\t500625.0'

You can assume that the input for the reducer is ordered in a way that all lines corresponding to a particular unit are grouped. However, the reducer output may contain repetitions, as the same unit can appear at different locations in the input.

Export your reducer output into a file named reducer_result.txt and send it with your submission.

In [5]:

import sys

def reducer():
    key = None
    countEntries = 0

    for line in sys.stdin:
        string = line.split('\t')
        if len(string) != 2:
            continue
        key2, string1 = string
        if key and key != key2:
            # a new unit begins: emit the finished unit's total and reset
            print("{0}\t{1}".format(key, countEntries))
            countEntries = 0
        key = key2
        countEntries = countEntries + float(string1)
    if key is not None:
        print("{0}\t{1}".format(key, countEntries))

sys.stdin = open('mapper_result.txt', 'r').readlines()
sys.stdout = open('reducer_result.txt', 'w')
reducer()
sys.stdout.close()           # flush the reducer output to disk
sys.stdout = sys.__stdout__  # restore the console
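
As a quick sanity check of the file produced above, every reducer line should now be 'UNIT<tab>total'. A minimal sketch (assumes the cells above have already run):

# Every line should split into exactly a unit id and a numeric, non-negative total
with open('reducer_result.txt') as f:
    for line in f:
        unit, total = line.strip().split('\t')  # raises if a line is malformed
        assert float(total) >= 0.0
print("reducer_result.txt looks well-formed")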