Rat Sightings in New York City¶

Project Goals¶

Link to our GitHub

The team, Miranda Diaz and Raiya Dhawala, will investigate datasets related to rat sightings in New York City and how sightings are affected by different factors such as location and COVID-19.

Why Rats?¶

Rats carry disease and lower the overall quality of life in a neighborhood, and in a city in general. They are also personally disgusting to us! Clearly New Yorkers care about the presence of rats. The Mayor of New York, Eric Adams, even hired a "rat czar," Kathleen Corradi, whose specific job is to get rid of rats.

Project Dataset¶

Our project revolves around analyzing a dataset of rat sightings in New York City, using records from 2010 to the present, compiled from community-initiated service requests logged in New York City's 311 system. Each row in the dataset represents a distinct rat sighting, identified by a unique key assigned to the service request. In theory, each key corresponds to a different sighting, although there is no way to be certain the same rat isn't reported more than once. Columns in the dataset include details such as the sighting's address, cross streets, and borough. Additionally, a column marks when the service request was created, allowing us to divide the rat sightings by day, month, and year.

This dataset was chosen for its relevance to a diverse set of questions, particularly understanding the factors influencing rat sightings across different areas of New York City. Our analysis incorporates the impact of COVID-19 by bringing in supplementary data on daily case counts by borough. We also included population data from the 2020 United States Census to evaluate rat sightings per 1000 people in each borough. Future work could look at the population demographics of different neighborhoods, and whether certain neighborhoods make more 311 requests than others.

Collaboration Plan¶

Our collaboration plan is to meet on Thursdays around 4:30 to work on the final project. We will communicate through text messages to keep each other updated and organized. We will draft in Google Docs and then transfer our work into Google Colab.

ETL (Extraction, Transform, and Load)¶

We loaded our dataset into the dataset folder. It originally had 38 columns, including the date the request was created, the agency name, the location (including neighborhood and address), and the complaint status. We tidied the data by removing the repetitive location columns, such as Cross Street 1 and 2, keeping only the incident address and neighborhood. We then created day, month, and year columns and set the date as the index.

In [ ]:
#import pandas and matplotlib

import pandas as pd
import matplotlib.pyplot as plt

#mount drive
from google.colab import drive
drive.mount('/content/drive')

#read in rat csv (low_memory=False avoids a mixed-type DtypeWarning on one column)
rat_df = pd.read_csv('/content/drive/My Drive/Rat_Sightings.csv', low_memory=False)

#display it
rat_df.head()
Mounted at /content/drive
Out[ ]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 15636547 01/02/2010 09:38:29 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN ... NaN NaN NaN NaN NaN NaN NaN 40.703342 -73.800203 (40.70334174980328, -73.80020318978804)
1 15636689 01/02/2010 07:09:56 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN ... NaN NaN NaN NaN NaN NaN NaN 40.608288 -73.977775 (40.60828832742201, -73.97777515791093)
2 15636710 01/02/2010 09:04:46 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD ... NaN NaN NaN NaN NaN NaN NaN 40.726060 -73.848329 (40.726059538480236, -73.84832900182792)
3 15636731 01/02/2010 06:27:59 PM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET ... NaN NaN NaN NaN NaN NaN NaN 40.677957 -73.918921 (40.67795748580213, -73.91892129635086)
4 15636907 01/02/2010 12:50:16 PM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.866008 -73.919401 (40.86600810878767, -73.9194006525354)

5 rows × 38 columns

We made the formatted date the index of the rat dataframe so that it was easier to manipulate with other datasets such as the COVID data.

In [ ]:
#get rid of unnecessary columns
rat_df.drop(columns = ['Closed Date','Agency', 'Cross Street 1', 'Cross Street 2', 'Intersection Street 1', 'Intersection Street 2', 'Address Type', 'Landmark', 'Status','Due Date','Resolution Action Updated Date', 'Street Name', 'City', 'Facility Type', 'Park Facility Name', 'Park Borough', 'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment', 'Location' ], inplace=True)

#use to_datetime function to change Created Date to Datetime format so we can extract month and year
rat_df['Created Date'] = pd.to_datetime(rat_df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')

#Extract month and year into new columns
rat_df['Month'] = rat_df['Created Date'].dt.month
rat_df['Year'] = rat_df['Created Date'].dt.year
rat_df['Day'] = rat_df['Created Date'].dt.day

#normalize to midnight: keep the date, drop the time of day, for daily grouping and merging
rat_df['formatted_date'] = rat_df['Created Date'].dt.normalize()

#set Created Date as index
rat_df.set_index("formatted_date", inplace=True)
rat_df.head()
Out[ ]:
Unique Key Created Date Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Community Board Borough Latitude Longitude Month Year Day
formatted_date
2010-01-02 15636547 2010-01-02 09:38:29 Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN 12 QUEENS QUEENS 40.703342 -73.800203 1 2010 2
2010-01-02 15636689 2010-01-02 07:09:56 Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN 11 BROOKLYN BROOKLYN 40.608288 -73.977775 1 2010 2
2010-01-02 15636710 2010-01-02 09:04:46 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD 06 QUEENS QUEENS 40.726060 -73.848329 1 2010 2
2010-01-02 15636731 2010-01-02 18:27:59 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET 03 BROOKLYN BROOKLYN 40.677957 -73.918921 1 2010 2
2010-01-02 15636907 2010-01-02 12:50:16 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE 12 MANHATTAN MANHATTAN 40.866008 -73.919401 1 2010 2

We also documented the data types of our columns to facilitate data manipulation when necessary.

In [ ]:
#display datatypes
display(rat_df.dtypes)
Unique Key                   int64
Created Date        datetime64[ns]
Agency Name                 object
Complaint Type              object
Descriptor                  object
Location Type               object
Incident Zip               float64
Incident Address            object
Community Board             object
Borough                     object
Latitude                   float64
Longitude                  float64
Month                        int64
Year                         int64
Day                          int64
dtype: object

We created a data table displaying the daily count of rat sightings to compare it with the daily COVID cases.

In [ ]:
#Make dataframe which has amount of rats spotted per day
rat_per_day_df = rat_df.groupby('formatted_date').size().reset_index(name='rat_sightings_count')
rat_per_day_df.head()
Out[ ]:
formatted_date rat_sightings_count
0 2010-01-01 9
1 2010-01-02 12
2 2010-01-03 3
3 2010-01-04 24
4 2010-01-05 14

EDA (Exploratory Data Analysis)¶

Exploring Rat Populations Across Boroughs¶

Here, we see the cumulative rat reports in each borough since 2010, highlighting Brooklyn with the highest rat count. To gain a more comprehensive perspective, why not visualize this data on a map?

In [ ]:
#show frequency of rat sightings by borough
rat_df['Borough'].value_counts()
Out[ ]:
BROOKLYN         84208
MANHATTAN        61172
BRONX            42182
QUEENS           34808
STATEN ISLAND     9250
Unspecified         25
Name: Borough, dtype: int64

Rat sightings are categorized by location type. Note that the same location type sometimes appears under several spellings (e.g., "3+ Family Apt. Building" and "3+ Family Apartment Building" are counted as distinct entries).
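
One way to handle these near-duplicate labels is to map known variants to a canonical spelling before counting. A minimal sketch (the mapping below is our own guess from the counts above, not part of the dataset, and the toy Series stands in for rat_df['Location Type']):

```python
import pandas as pd

# Hypothetical mapping of duplicate spellings to one canonical label.
# It would need to be extended by hand after inspecting value_counts().
canonical = {
    '3+ Family Apartment Building': '3+ Family Apt. Building',
    '3+ Family Apt': '3+ Family Apt. Building',
    '3+ Family Apt.': '3+ Family Apt. Building',
    '3+Family Apt.': '3+ Family Apt. Building',
    '1-2 FamilyDwelling': '1-2 Family Dwelling',
}

def consolidate(series: pd.Series) -> pd.Series:
    """Replace known duplicate spellings with a canonical label."""
    return series.replace(canonical)

# Toy example standing in for rat_df['Location Type']
s = pd.Series(['3+ Family Apt. Building',
               '3+ Family Apartment Building',
               '1-2 FamilyDwelling'])
print(consolidate(s).value_counts())
```

Applied to the real column, this would fold the stray variants into the dominant spellings before any grouping or modeling.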

In [ ]:
value_counts = rat_df['Location Type'].value_counts()
print(value_counts)
3+ Family Apt. Building          95573
1-2 Family Dwelling              43520
Other (Explain Below)            36116
3+ Family Mixed Use Building     15217
Commercial Building              12864
Vacant Lot                        6216
Construction Site                 5254
1-2 Family Mixed Use Building     3840
Vacant Building                   3719
Parking Lot/Garage                2439
Catch Basin/Sewer                 2258
Public Garden                      981
3+ Family Apartment Building       658
Government Building                495
Street Area                        434
School/Pre-School                  397
Single Room Occupancy (SRO)        266
Day Care/Nursery                   235
Office Building                    221
Residential Building               215
Public Stairs                      197
Hospital                           152
Apartment                           82
Other                               76
Residence                           72
Building (Non-Residential)          23
Abandoned Building                  16
1-3 Family Dwelling                 15
Store                               13
Residential Property                12
Restaurant                          11
Summer Camp                          8
3+ Family Apt.                       7
Ground                               7
Restaurant/Bar/Deli/Bakery           6
School                               3
1-3 Family Mixed Use Building        3
Commercial Property                  2
Beach                                2
Grocery Store                        2
Vacant Lot/Property                  2
3+ Family Apt                        1
3+Family Apt.                        1
1-2 FamilyDwelling                   1
Private House                        1
Cafeteria - Public School            1
Retail Store                         1
Street Fair Vendor                   1
Catering Service                     1
Name: Location Type, dtype: int64
In [ ]:
#new dataframe showing boroughs
#(sighting counts below were recorded by hand from an earlier pull of the data, so they differ slightly from the value_counts above)
borough_df = pd.DataFrame({'Borough':['Brooklyn','Manhattan','Bronx', 'Queens', 'Staten Island', 'Unspecified'], 'Rat Sightings': [83461, 60699, 41917, 34507,9213,33]})
borough_df.head(5)
Out[ ]:
Borough Rat Sightings
0 Brooklyn 83461
1 Manhattan 60699
2 Bronx 41917
3 Queens 34507
4 Staten Island 9213

Now, the figure below offers a clearer visual representation of the rat sightings reported since 2010. Each color corresponds to a different borough, and each dot marks an individual report. But where exactly have the most rat sightings occurred within each borough? An even better question might be: which areas in NYC should you avoid?

In [ ]:
import plotly.express as px

#don't include rows with no latitude or longitude
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])

#don't include unspecified
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']

#plotting the map with different colors for each borough
fig = px.scatter_mapbox(filtered_rat_df, lat='Latitude', lon='Longitude', color='Borough', mapbox_style='open-street-map')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

(Image of the figure in case it is not visible after the file is converted to HTML)

newplot.png

Now, where exactly are the locations with the highest frequency of rat sightings in each borough? The following figure utilizes the latitude and longitude of each rat sighting to determine just that.

In [ ]:
import pandas as pd
import plotly.express as px

#remove rows with missing latitude or longitude values
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])

#do not include reports with an unspecified borough
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']

#group data by Borough and then find latitude and longitude with the most sightings
sightings_count = filtered_rat_df.groupby(['Borough', 'Latitude', 'Longitude']).size().reset_index(name='Sightings')
idx = sightings_count.groupby('Borough')['Sightings'].transform('max') == sightings_count['Sightings']
most_sightings = sightings_count[idx].copy()  #copy so adding a column below doesn't warn

#make map showing which latitude and longitude had the most rat sightings for each borough
most_sightings['MarkerSize'] = 10
fig = px.scatter_mapbox(most_sightings, lat='Latitude', lon='Longitude', hover_name='Borough',
                        zoom=10, color='Borough', mapbox_style='open-street-map', size='MarkerSize',
                        hover_data={'Sightings': True,})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

(Image of the figure in case it is not visible after the file is converted to HTML)

newplot.png

Over time, there has been a noticeable increase in rat sightings, as shown in the plot below. This rise could be attributed to various factors, including the population density and the tendency of individuals to report rat sightings through 311 calls. A significant decline is evident in 2020 and 2021, likely due to the impact of COVID-19, which led to reduced outdoor activities. In 2023, we've observed the highest number of rat sightings reported to date!

In [ ]:
#display rat frequency per year
rat_df.Year.plot.hist()
Out[ ]:
<Axes: ylabel='Frequency'>
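
A histogram of the Year column bins years arbitrarily; counting sightings per year directly gives exactly one bar per year. A minimal sketch of that alternative (the toy Series below stands in for rat_df['Year']):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Toy stand-in for rat_df['Year']
year = pd.Series([2010, 2010, 2011, 2020, 2023, 2023, 2023])

# One bar per year, in chronological order
counts = year.value_counts().sort_index()
ax = counts.plot.bar()
ax.set_xlabel('Year')
ax.set_ylabel('Rat Sightings')
plt.tight_layout()
```

On the real data this makes the 2020-2021 dip and the 2023 peak easier to read than histogram bins that may straddle year boundaries.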

But understanding the rat count alone doesn't provide a meaningful comparison without considering the size or population of each borough. To make one, we need to incorporate additional data about each borough's population and calculate the ratio of rat sightings per 1000 people.

In [ ]:
from google.colab import drive

#Bringing in Population data
population_df = pd.read_csv('/content/drive/My Drive/new_pop_again.csv')
population_df = population_df.drop([0]).reset_index(drop=True)
population_df['Population'] = population_df['Population'].str.replace(',', '').astype(int)

population_df.head(10)
Out[ ]:
Index GeoType Borough GeoID Name Population
0 2 Boro Manhattan 1 Manhattan 1694251
1 3 Boro Bronx 2 Bronx 1472654
2 4 Boro Brooklyn 3 Brooklyn 2736074
3 5 Boro Queens 4 Queens 2405464
4 6 Boro Staten Island 5 Staten Island 495747
In [ ]:
#combine borough and population then calculate per capita
result_df = pd.merge(borough_df, population_df, on='Borough')

#unspecified doesn't matter, it is so minimal
#delete unnecessary columns
result_df= result_df.drop(columns=['Index', 'GeoID', 'GeoType', 'Name'])

#rename rat sightings
result_df = result_df.rename(columns={'Rat Sightings': 'Rat_Sightings'})

#set borough as index
result_df.set_index("Borough", inplace=True)

result_df.head(7)
Out[ ]:
Rat_Sightings Population
Borough
Brooklyn 83461 2736074
Manhattan 60699 1694251
Bronx 41917 1472654
Queens 34507 2405464
Staten Island 9213 495747
In [ ]:
#Creating new column calculating the amount of rat sightings per 1000 people.

result_df['per_capita_1000'] = (result_df['Rat_Sightings']/result_df['Population'])*1000
result_df.head(8)
Out[ ]:
Rat_Sightings Population per_capita_1000
Borough
Brooklyn 83461 2736074 30.503926
Manhattan 60699 1694251 35.826451
Bronx 41917 1472654 28.463577
Queens 34507 2405464 14.345257
Staten Island 9213 495747 18.584076

With these per-1000 figures in hand, we can plot and compare the boroughs.

In [ ]:
#Plot rat sightings by borough
result_df.Rat_Sightings.plot.bar()

#Label axes
plt.xlabel('Borough')
plt.ylabel('Rat Sightings')
plt.title('Rat Sightings by Borough')
plt.show()

#Create another graph to compare rat sightings per 1000 people in each borough.
result_df.per_capita_1000.plot.bar()
plt.xlabel('Borough')
plt.ylabel('Per 1000 people Rat Sightings')
plt.title('Rat Sightings per 1000 people by Borough')
plt.show()

So, despite Brooklyn having the largest population, when calculating per 1000 people rat sightings, Manhattan records the highest count! We can also see that despite Staten Island's smaller population, it surpasses Queens in rat sightings per 1000 people. Normalizing the rat sightings offers insight into the actual rat impact across each borough.

Comparing Rat Sightings with COVID cases¶

The next point of interest we will be tackling is how rat sightings have changed before and after the pandemic. To explore this, we're integrating COVID-19 data showcasing case counts by date and borough in New York City. We initiated the process by loading the COVID case data divided by borough.

In [ ]:
# Read the COVID daily counts CSV
COVID_df = pd.read_csv('/content/drive/My Drive/COVID_Daily_Counts.csv')

# Choosing the columns we want to focus on
columns = ['date_of_interest', 'CASE_COUNT', 'BX_CASE_COUNT', 'BK_CASE_COUNT', 'MN_CASE_COUNT', 'QN_CASE_COUNT', 'SI_CASE_COUNT']
new_COVID_df = COVID_df[columns].copy()  # copy so formatting the date column below doesn't warn

# Formatting date_of_interest column
new_COVID_df['date_of_interest'] = pd.to_datetime(new_COVID_df['date_of_interest'], errors='coerce', format='%m/%d/%Y')

new_COVID_df.head()

Out[ ]:
date_of_interest CASE_COUNT BX_CASE_COUNT BK_CASE_COUNT MN_CASE_COUNT QN_CASE_COUNT SI_CASE_COUNT
0 2020-02-29 1 0 0 1 0 0
1 2020-03-01 0 0 0 0 0 0
2 2020-03-02 0 0 0 0 0 0
3 2020-03-03 1 0 0 0 1 0
4 2020-03-04 5 0 1 2 2 0

We then made a new dataset that merged the daily rat sightings data with the corresponding COVID-19 case counts by date to facilitate our analysis.

In [ ]:
#merge COVID cases and rat sightings dataframes
#(date_of_interest is already in datetime format from the previous cell)

rat_COVID_df = pd.merge(rat_per_day_df, new_COVID_df, left_on='formatted_date', right_on='date_of_interest', how='left')

#Drop the now-redundant date_of_interest column after the merge
rat_COVID_df = rat_COVID_df.drop('date_of_interest', axis=1)

#display
rat_COVID_df.head()

Out[ ]:
formatted_date rat_sightings_count CASE_COUNT BX_CASE_COUNT BK_CASE_COUNT MN_CASE_COUNT QN_CASE_COUNT SI_CASE_COUNT
0 2010-01-01 9 NaN NaN NaN NaN NaN NaN
1 2010-01-02 12 NaN NaN NaN NaN NaN NaN
2 2010-01-03 3 NaN NaN NaN NaN NaN NaN
3 2010-01-04 24 NaN NaN NaN NaN NaN NaN
4 2010-01-05 14 NaN NaN NaN NaN NaN NaN

The plot below shows rat sightings alongside daily COVID case counts from September 1st, 2019, to August 30th, 2023. This visualization lets us observe how rat sightings varied before and after the COVID-19 outbreak. Sightings rose after the initial outbreak, in contrast with the lower average in 2019. Notably, during waves when COVID cases spiked, rat sightings dipped, likely because people spent less time outdoors during periods with higher case counts.

In [ ]:
#filter using the formatted_date column
date_column = 'formatted_date'

#determining the dates we want to view
start_date = '2019-09-01'
end_date = '2023-08-30'
rat_COVID_zoomed = rat_COVID_df[(rat_COVID_df[date_column] >= start_date) & (rat_COVID_df[date_column] <= end_date)]

plt.figure(figsize=(18, 12))

#plot data and covid case count
plt.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['CASE_COUNT'], label='COVID Cases', color='blue')
plt.xlabel('Date')

#setting the color of the left y-axis for COVID cases
plt.ylabel('COVID Cases', color='blue')

#Second y-axis for rat sightings so they aren't dwarfed by the COVID case counts
ax2 = plt.gca().twinx()
ax2.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['rat_sightings_count'], label='Rat Sightings per Day', color='red')
ax2.set_ylabel('Rat Sightings per Day', color='red')

#Making plot look nice
plt.title('COVID Cases and Rat Sightings per day From 2019-2023')
plt.legend()
plt.grid(True)
plt.show()

Model¶

We thought it would be interesting to see whether the top community districts for rat sightings in each borough shared any common features. We created a top_five variable indicating whether a rat sighting fell in one of those top five community districts. We used a random forest model to see whether location type could predict whether a rat would be sighted in a top_five community district. At first we tried to build a decision tree with different variables, but the tree was too large to read. We then switched to a random forest, though training took several minutes. We decided to look at how the different features factored into the random forest model. Interestingly, when looking at how location type factored into top_five or not, "commercial building" was by far the top feature. This is interesting because "commercial building" is not the top location type for rat sightings in general.

In [ ]:
#create column top_five labeling rat sightings as in or not in the top 5 community districts
community_boards_to_match = ['03 BROOKLYN', '07 MANHATTAN', '05 QUEENS', '04 BRONX', '01 STATEN ISLAND']

# Create a new column based on the conditions
rat_df['top_five'] = rat_df['Community Board'].isin(community_boards_to_match).astype(int)

# Display the modified DataFrame
rat_df.head()
Out[ ]:
Unique Key Created Date Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Community Board Borough Latitude Longitude Month Year Day top_five
formatted_date
2010-01-02 15636547 2010-01-02 09:38:29 Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN 12 QUEENS QUEENS 40.703342 -73.800203 1 2010 2 0
2010-01-02 15636689 2010-01-02 07:09:56 Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN 11 BROOKLYN BROOKLYN 40.608288 -73.977775 1 2010 2 0
2010-01-02 15636710 2010-01-02 09:04:46 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD 06 QUEENS QUEENS 40.726060 -73.848329 1 2010 2 0
2010-01-02 15636731 2010-01-02 18:27:59 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET 03 BROOKLYN BROOKLYN 40.677957 -73.918921 1 2010 2 1
2010-01-02 15636907 2010-01-02 12:50:16 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE 12 MANHATTAN MANHATTAN 40.866008 -73.919401 1 2010 2 0
In [ ]:
rat_df = rat_df[(rat_df["Month"].notna()) & (rat_df["Location Type"].notna())]
rat_df.head()
Out[ ]:
Unique Key Created Date Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Community Board Borough Latitude Longitude Month Year Day top_five
formatted_date
2010-01-02 15636547 2010-01-02 09:38:29 Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN 12 QUEENS QUEENS 40.703342 -73.800203 1 2010 2 0
2010-01-02 15636689 2010-01-02 07:09:56 Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN 11 BROOKLYN BROOKLYN 40.608288 -73.977775 1 2010 2 0
2010-01-02 15636710 2010-01-02 09:04:46 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD 06 QUEENS QUEENS 40.726060 -73.848329 1 2010 2 0
2010-01-02 15636731 2010-01-02 18:27:59 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET 03 BROOKLYN BROOKLYN 40.677957 -73.918921 1 2010 2 1
2010-01-02 15636907 2010-01-02 12:50:16 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE 12 MANHATTAN MANHATTAN 40.866008 -73.919401 1 2010 2 0
In [ ]:
value_counts = rat_df['top_five'].value_counts()
print(value_counts)
0    195153
1     34226
Name: top_five, dtype: int64
In [ ]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt


#Select feature location type and target variable top_five
X = rat_df[['Location Type']]
y = rat_df['top_five']

#dummies for location type
X = pd.get_dummies(X, columns=['Location Type'])

#training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Create random forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#predict on test set
y_pred = model.predict(X_test)

#test model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#results
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", classification_rep)

#Plot feature importances
feature_importances = model.feature_importances_
feature_names = X.columns
sorted_idx = feature_importances.argsort()

plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx])
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.show()
Accuracy: 0.8491586014473799

Confusion Matrix:
 [[38956     1]
 [ 6919     0]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      1.00      0.92     38957
           1       0.00      0.00      0.00      6919

    accuracy                           0.85     45876
   macro avg       0.42      0.50      0.46     45876
weighted avg       0.72      0.85      0.78     45876
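
A side note on the numbers above: the confusion matrix shows the single-feature model predicts class 1 essentially never, so its ~85% accuracy is just the majority-class base rate. One common mitigation, sketched below on a toy imbalanced dataset (not our actual data), is scikit-learn's class_weight='balanced', which reweights classes during training:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy imbalanced problem: ~15% positives, one weakly informative feature
n = 4000
y = (rng.random(n) < 0.15).astype(int)
X = (y + rng.normal(0.0, 1.5, n)).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees so the forest cannot simply memorize the training set
plain = RandomForestClassifier(n_estimators=50, max_depth=3,
                               random_state=0).fit(X_train, y_train)
balanced = RandomForestClassifier(n_estimators=50, max_depth=3,
                                  class_weight='balanced',
                                  random_state=0).fit(X_train, y_train)

# Reweighting pushes the model to predict the minority class more often
print(plain.predict(X_test).sum(), balanced.predict(X_test).sum())
```

Reweighting trades overall accuracy for recall on the rare class; whether that trade is worthwhile depends on the question being asked.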

We then tried to incorporate other variables into the random forest, but the results weren't as interesting. Intriguingly, "month" was the most important feature, which makes sense because there seem to be fewer rat sightings in colder months, probably because fewer people go outside.

In [ ]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

#encode top_five target variable as 0 and 1 (it is already 0/1, so this is just a safeguard)
le = LabelEncoder()
rat_df['top_five'] = le.fit_transform(rat_df['top_five'])

#choose features we want to include
features = ['Location Type','Borough', 'Month', 'Year']
X = rat_df[features].copy()  #copy so the fillna below doesn't trigger a SettingWithCopyWarning
y = rat_df['top_five']

categorical_features = ['Location Type', 'Borough']
X[categorical_features] = X[categorical_features].fillna('missing')

#One-hot encode categorical features
X = pd.get_dummies(X, columns=categorical_features)

#Training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Create random forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#predict the test set
y_pred = model.predict(X_test)

#evaluate
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#results
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", classification_rep)

#Plot feature importances
feature_importances = model.feature_importances_
feature_names = X.columns
sorted_idx = feature_importances.argsort()

plt.figure(figsize=(10, 10))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx])
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.show()

Accuracy: 0.8475673554799895

Confusion Matrix:
 [[38491   466]
 [ 6527   392]]

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.99      0.92     38957
           1       0.46      0.06      0.10      6919

    accuracy                           0.85     45876
   macro avg       0.66      0.52      0.51     45876
weighted avg       0.79      0.85      0.79     45876

We also attempted to build a model predicting rat sightings per month from past monthly sightings. Unfortunately, our model was not successful. The quantity of monthly data proved insufficient, particularly given the sharp increase in rat sightings since 2022. Our model captured some of the month-to-month dips and rises, but overall failed to track the level of rat sightings in 2023.

In [ ]:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

#time series with monthly rat sightings
time_series = rat_df.groupby(pd.Grouper(key='Created Date', freq='M')).size()

#training and testing sets
train_size = int(len(time_series) * 0.8)
train, test = time_series[:train_size], time_series[train_size:]

#fit ARIMA model
order = (3, 1, 3)
model = ARIMA(train, order=order)
model_fit = model.fit()

#predict
predictions = model_fit.forecast(steps=len(test))

#evaluate model with mse and rmse
mse = mean_squared_error(test, predictions)
rmse = mse**0.5

#Plot the predicted sightings
plt.figure(figsize=(10, 6))
plt.plot(train.index, train, label='Train')
plt.plot(test.index, test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title(f'ARIMA Predictions (RMSE: {rmse:.2f})')
plt.ylabel('Rat Sightings')
plt.xlabel('Year')
plt.legend()
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:978: UserWarning:

Non-invertible starting MA parameters found. Using zeros as starting parameters.

/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

Final Thoughts¶

Ideas for Future Research¶

Unfortunately, our data was not comprehensive enough to create a useful model. While we found some minor points of interest, such as "commercial building" being the most important feature when predicting top_five, we couldn't generate a reliable random forest model.

While the rat data is interesting, the real question is who is picking up the phone to call in a rat sighting. Rats are everywhere in the city; most people never call a sighting in. This could have skewed our data, because people are more likely to call in a rat sighting near their residence than one they pass on a random street. In the future, we could look at the demographics of community districts, such as average income, age, and racial makeup, and see how they relate to each district's rat sightings, helping to build a more comprehensive model of rat sightings, or rather, of called-in rat sightings.

We might also want to classify "location type" more efficiently, perhaps with three options: residential, commercial, and industrial. This would let us use the location type data more effectively. These approaches could pave the way for a more insightful understanding of reported rat sightings in the future.
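
The coarser three-way grouping could start from a keyword pass over the existing labels. A sketch (the keyword lists are our own guesses, not an official taxonomy, and labels matching none of them fall into an 'other' bucket):

```python
def coarse_location(label: str) -> str:
    """Map a 311 'Location Type' string to residential/commercial/industrial.

    The keyword lists below are illustrative guesses and would need review
    against the full set of labels in the data.
    """
    text = label.lower()
    if any(k in text for k in ('family', 'residen', 'apartment', 'apt',
                               'dwelling', 'house', 'occupancy', 'sro')):
        return 'residential'
    if any(k in text for k in ('commercial', 'store', 'restaurant', 'office',
                               'deli', 'bakery', 'grocery', 'catering')):
        return 'commercial'
    if any(k in text for k in ('construction', 'industrial')):
        return 'industrial'
    return 'other'

print(coarse_location('3+ Family Apt. Building'))   # residential
print(coarse_location('Commercial Building'))       # commercial
print(coarse_location('Construction Site'))         # industrial
```

Applying this with rat_df['Location Type'].map(coarse_location) would collapse the roughly fifty raw labels into a handful of dummy columns, which should make feature importances easier to interpret.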