As it currently stands, there are disturbing differences in the survival rates of newborns. Ideally there would be no neonatal deaths at all, but that is not currently possible. What is more disturbing is that a child's odds of survival can depend on factors like race or income. According to the Centers for Disease Control and Prevention, the mortality rate of Black newborns is approximately three times higher than that of white newborns. This raises the question: what causes this difference? As the public has become aware of these disparities, there has been increased interest in the care provided within hospitals. Many believe there is racial bias within hospitals, and there is strong evidence to support these claims. However, few have examined factors affecting childbirth outcomes prior to birth. Are there environmental factors in the United States, linked to race and income, that can also be linked to the increase in neonatal deaths among minority and low-income populations? That is what I attempt to determine in this analysis.
The purpose of this project is to examine factors that may contribute to the birth death rates of certain counties. I chose to examine this issue using information on race, poverty, and pollution levels, based on the assumption that each of these factors will have an effect on the rate of birth deaths in an area.
Environmental Protection Agency: RSEI Data -
https://www.epa.gov/rsei/ways-get-rsei-results#Microdata
United States Department of Agriculture: Poverty Data -
https://data.ers.usda.gov/reports.aspx?ID=17826
Centers for Disease Control and Prevention: Infant Mortality Data -
https://wonder.cdc.gov/cmf-icd10.html
United States Census Bureau: Demographics Data -
https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
In this project, I hypothesize that I will be able to observe the following trends:
-A higher percentage minority population in a county is associated with higher birth death rates
-Higher pollution levels in a county are associated with higher birth death rates
-Higher levels of poverty in a county are associated with higher birth death rates
We begin with some housekeeping and library work. Here, I import all of the relevant libraries for the analysis I will perform during this project.
%matplotlib inline
import pandas as pd
import numpy as np
import os
import re
import matplotlib
import matplotlib.pyplot as plt
import sqlite3
from scipy import stats
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import model_selection #needed so the sklearn.model_selection.* calls later in the notebook resolve
matplotlib.style.use('fivethirtyeight')
Next I acquire and load in data on neonatal mortality rates. This data is provided by the CDC and is publicly available on their website. The data is formatted as a text file, and contains the number of birth deaths per county, along with some basic data on the population of each county. I then clean and tidy the data so that it can be more readily used in an analysis.
#load in data from file
comp_mort = pd.read_csv("../data/comp_mort_99-16.txt",
                        delimiter='\t',
                        header=0,
                        names=["Notes", "county", "county_Code", "Infant_Age_Groups",
                               "Infant Age Groups Code", "Deaths", "Population", "Crude_Rate"],
                        nrows=1925)
Now we need to clean and tidy that data, so that it is workable and can be combined into a larger dataset. First we adjust the information on location, so that we can accurately represent each county.
#delete empty or redundant columns
del comp_mort['Notes']
del comp_mort['Infant Age Groups Code']
#split the county column into the form [county, state]
count = comp_mort.county.str.split(',')
#create a list of counties and a list of states
counties = np.empty(0)
states = np.empty(0)
for i in count:
    counties = np.append(counties, i[0])
    states = np.append(states, i[1])
#strip the trailing 'County' from each county name
counties_temp = []
for i in counties:
    temp_split = i.replace('County', '')
    counties_temp = counties_temp + [temp_split]
#delete prior county column
del comp_mort['county']
#add in tidier county and state columns
comp_mort['county'] = counties_temp
comp_mort['state'] = states
Next, we need to adjust the information on the crude rate. The crude rate here is the number of deaths per 100,000 population, which is the standardization CDC WONDER uses for this type of data. Rates based on small counts are flagged as "Unreliable", and that flag appears as extra text in this column, which we record separately.
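As a quick worked example of this standardization (the numbers below are made up for illustration, not taken from the dataset):
#hypothetical example of the crude-rate standardization described above
deaths = 57 #hypothetical infant deaths in a county over the period
population = 2000000 #hypothetical population for that county
crude_rate = deaths / population * 100000 #deaths per 100,000 population
print(crude_rate) #2.85, on the same scale as the Crude_Rate column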
#split the Crude_Rate column; flagged rates carry extra text, e.g. "3.2 (Unreliable)"
crude = comp_mort.Crude_Rate.str.split()
unreliable = np.empty(0)
crude_rate = np.empty(0)
for line in crude:
    if len(line) == 1:
        unreliable = np.append(unreliable, 0)
    else:
        unreliable = np.append(unreliable, 1)
    crude_rate = np.append(crude_rate, line[0])
del comp_mort['Crude_Rate']
comp_mort['Crude_Rate'] = crude_rate
comp_mort['Unreliable'] = unreliable
comp_mort['Crude_Rate'] = pd.to_numeric(comp_mort['Crude_Rate'], errors='coerce')
comp_mort.head()
Next, I check the dtypes of all the columns, to make sure they are registered correctly.
comp_mort.dtypes
From this point, we also need to gather data on the pollution levels in each county. For this project, I chose to use data provided by the EPA; the dataset is a CSV file, and is loaded and tidied in the following cells. However, it is important to understand what the data presented actually means.
The EPA has a system in place, referred to as the Risk-Screening Environmental Indicators (RSEI), which estimates the risk an area faces of damage from pollution. The RSEI score aggregates all the data on pollutants the EPA gathers in a certain location; that data is then weighted and summed to give a score for the region. Therefore, the RSEI is a measure of risk, not a raw measure of pollutant levels in an area. This means that certain locations can have an RSEI of 0, meaning they are at incredibly low modeled risk; a score of 0 does not imply that there are zero pollutants found in an area.
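To make the "weighted and summed" idea concrete, here is a toy sketch of a weighted-sum risk score. The chemicals, release amounts, and weights below are all invented for illustration; the EPA's actual RSEI model is far more involved, accounting for fate, transport, exposure, and toxicity.
#toy illustration of a weighted-sum risk score (NOT the EPA's actual RSEI model)
#each pollutant measurement is scaled by a toxicity/exposure weight, then summed
toy_releases = {'chem_a': 120.0, 'chem_b': 0.0, 'chem_c': 45.5} #hypothetical release amounts
toy_weights = {'chem_a': 0.8, 'chem_b': 5.0, 'chem_c': 0.1} #hypothetical risk weights
toy_score = sum(toy_releases[c] * toy_weights[c] for c in toy_releases)
print(toy_score) #96.0 + 0.0 + 4.55 = 100.55; a county with no weighted releases scores 0,
#which is why an RSEI of 0 means low modeled risk, not a pollutant-free area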
poll_data = pd.read_csv("../data/rsei-data-2018.csv",
                        header=0,
                        names=['year', 'state', 'county',
                               'rsei_score', 'rsei_model_hazard'])
poll_data.head()
For the purposes of this project, I chose to pull data from Alabama, Louisiana, Mississippi, North Carolina, Texas, Virginia, and Washington. These states were chosen because they all have significant populations drawn from a variety of demographics, and they have all been shown in recent years to have issues with pollution, so the RSEI score should be a more meaningful reflection of pollution levels. The data was also restricted to these states because of the large size of the files covering the entirety of the USA.
#trim the 'county, ST' values down to just 'county', and turn the state names into their abbreviations
poll_data['county'] = poll_data['county'].str[:-4].str.lower()
poll_data['state'] = poll_data['state'].map({'Louisiana': 'LA', 'Alabama': 'AL', 'Mississippi': 'MS',
                                             'North Carolina': 'NC', 'Texas': 'TX', 'Virginia': 'VA',
                                             'Washington': 'WA'})
#remove the commas from the string values for rsei_score and rsei_model_hazard
poll_data['rsei_score'] = (poll_data['rsei_score'].str.split()).apply(lambda x: float(x[0].replace(',', '')))
poll_data['rsei_model_hazard'] = (poll_data['rsei_model_hazard'].str.split()).apply(lambda x: float(x[0].replace(',', '')))
#convert the types
poll_data = poll_data.astype(dtype={'year': float, 'rsei_score': float, 'rsei_model_hazard': float}, copy=False)
poll_data.head()
Now we check the dtypes again, to make sure our data is accurately represented.
poll_data.dtypes
At this point, the last dataset we need to load is data on demographics in each county. This data was obtained from Census.gov, and is publicly available. The data is presented as a CSV file.
It is important to note that the census records a significantly larger, and more specific, number of racial demographics than are used colloquially. For simplicity, and the scope of this project, I separate the demographics into what is colloquially thought of as the minority and white populations.
demo_data = pd.read_csv("../data/cc-est2019-alldata.csv", encoding="ISO-8859-1",
                        header=0,
                        names=['sumlev','state_code','county_code','state','county','year','age_group','tot_pop','tot_male','tot_female',
                               'wa_male','wa_female','ba_male','ba_female','ia_male','ia_female','aa_male','aa_female',
                               'na_male','na_female','tom_male','tom_female','wac_male','wac_female','bac_male','bac_female','iac_male',
                               'iac_female','aac_male','aac_female','nac_male','nac_female','nh_male','nh_female','nhwa_male','nhwa_female',
                               'nhba_male','nhba_female','nhia_male','nhia_female','nhaa_male','nhaa_female','nhna_male','nhna_female',
                               'nhtom_male','nhtom_female','nhwac_male','nhwac_female','nhbac_male','nhbac_female',
                               'nhiac_male','nhiac_female','nhaac_male','nhaac_female','nhnac_male','nhnac_female','h_male','h_female',
                               'hwa_male','hwa_female','hba_male','hba_female','hia_male','hia_female','haa_male','haa_female','hna_male',
                               'hna_female','htom_male','htom_female','hwac_male','hwac_female','hbac_male','hbac_female','hiac_male',
                               'hiac_female','haac_male','haac_female','hnac_male','hnac_female'])
demo_data.head()
#change the name of each county from "X County" to "X". This will make it easier to compare data later
for i in range(len(demo_data.county)):
    temp = demo_data.county[i].replace('County', '')
    demo_data.at[i, 'county'] = temp
demo_data.head()
The key for the year variable is as follows:
1 = 4/1/2010 Census population
:::
12 = 7/1/2019 population estimate
The key for age_group is as follows:
0 = Total
1 = Age 0 to 4 years
2 = Age 5 to 9 years
:::
18 = Age 85 years or older
I chose to leave the data in this format because it is actually easier to filter using this method.
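For example, pulling the 7/1/2018 estimates (year code 11) for the all-ages totals (age_group code 0) is a one-line filter:
#example of filtering with the coded keys: 7/1/2018 estimates (year == 11), totals across all ages (age_group == 0)
demo_2018_totals = demo_data[(demo_data.year == 11) & (demo_data.age_group == 0)]
demo_2018_totals.head()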
#change the state names to their abbreviation
demo_data['state'] = demo_data['state'].map({
'Alabama': 'AL','Alaska': 'AK','American Samoa': 'AS','Arizona': 'AZ','Arkansas': 'AR','California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE',
'District of Columbia': 'DC','Florida': 'FL','Georgia': 'GA','Guam': 'GU','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL','Indiana': 'IN','Iowa': 'IA','Kansas': 'KS',
'Kentucky': 'KY','Louisiana': 'LA','Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN','Mississippi': 'MS','Missouri': 'MO',
'Montana': 'MT','Nebraska': 'NE','Nevada': 'NV','New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC','North Dakota': 'ND',
'Northern Mariana Islands':'MP','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA','Puerto Rico': 'PR','Rhode Island': 'RI','South Carolina': 'SC',
'South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX','Utah': 'UT','Vermont': 'VT','Virgin Islands': 'VI','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV',
'Wisconsin': 'WI','Wyoming': 'WY'
})
demo_data.head()
I check the dtypes of the demographic data, but I will not be showing it here, as there are 80 columns in this dataset.
We begin by finding the average crude rate, and determining which counties have a crude rate above the average. This should provide us with information on how our data is skewed. It also provides a nice means of filtering by counties with higher-than-average crude rates.
#determine the average crude rate
avg_crude_state = comp_mort.groupby('state').Crude_Rate.mean()
avg_crude = comp_mort.Crude_Rate.mean()
print("Average Crude Rate: ", avg_crude)
#create a column to represent the counties with a higher-than-average crude rate
comp_mort['Crude_Above_Avg'] = comp_mort.Crude_Rate >= avg_crude
display(comp_mort.Crude_Above_Avg.value_counts())
comp_mort.head()
Now, it becomes relevant to plot the distribution of our crude rates, so that we can see how our data is skewed. In the plot below, we plot a histogram of the crude rate, with the red line representing the average crude rate.
x = comp_mort.Crude_Rate
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
result = plt.hist(x)
plt.axvline(x.mean(), color='r', linewidth=3)
The average crude rate is 2.85, and the majority of counties have a value below the average, with a long tail of counties above it. Therefore, the data is skewed to the right. In order to keep track of the counties with larger crude rates, we created the Crude_Above_Avg column above.
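As a quick numeric check on that reading of the histogram, we can compute the sample skewness directly; a positive value confirms the right skew.
#sample skewness of the crude rate; a positive value indicates right skew
print(stats.skew(comp_mort.Crude_Rate.dropna()))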
Now, we can explore more general measures, for larger populations. Therefore, I chose to create a list of the states with the largest number of deaths, compared to their popultion. I then ranked and plotted this list.
#determine the # of deaths in each state
death_by_state = comp_mort.groupby('state').Deaths.sum()
#determine the population of each state
pop_of_state = comp_mort.groupby('state').Population.sum()
avg_by_pop = death_by_state/pop_of_state
#plot the graph
fig = plt.figure(num=None, figsize=(12, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
avg_by_pop.sort_values(ascending=False).plot.bar()
This is rather crowded, however, so let's examine a plot of just the 15 states with the highest rates.
states_highest_dpp = avg_by_pop.sort_values(ascending=False).index[:15]
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
avg_by_pop.loc[states_highest_dpp].plot.bar()
This graph shows us the 15 states with the largest number of deaths, given their population. This becomes interesting when we compare this graph to a plot of the demographic distribution of each state.
#determine populations of each state, using the 7/1/2018 estimates (year code 11)
year11 = demo_data.set_index('year').loc[11]
state_pop = year11.groupby('state').tot_pop.sum()
#determine the minority (non-white) and white populations of each state
minor = year11.groupby('state').tot_pop.sum() - (year11.groupby('state').wa_male.sum() + year11.groupby('state').wa_female.sum())
white = year11.groupby('state').wa_male.sum() + year11.groupby('state').wa_female.sum()
#percent minority for the same 15 states plotted above (stripping the leading space left over from the earlier split)
percent_min = (minor/state_pop).loc[[s.strip() for s in states_highest_dpp]]
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
percent_min.sort_values(ascending = False).plot.bar()
Now, through just a cursory analysis, there seems to be a correlation between the states with the largest minority populations and the states with the largest crude rates.
So, now that we have some basic information on our data, we need to gain a more in-depth understanding of how, or if, the data all works together. That means our next step is to determine whether there is a strong correlation among any of our variables. This has the dual purpose of informing our next steps and showing us, quantitatively, how certain factors may affect others.
First, we need to isolate the data that is most relevant. This means filtering for the states that we are examining, dropping some columns, and ultimately creating a new dataset of relevant information.
#isolate the states that I chose to look at the pollution data for
#(note the leading space in the comp_mort state names, left over from the split on ',')
part_mort = comp_mort[comp_mort.state.isin([' LA', ' AL', ' MS', ' NC', ' TX', ' VA', ' WA'])]
part_demo = demo_data[demo_data.state.isin(['LA', 'AL', 'MS', 'NC', 'TX', 'VA', 'WA'])]
#isolate the year-code 11 rows (the 7/1/2018 estimates), and group by county, so that we have the total demographic data for each county.
#This is important, because the data is also organized and separated by age group, which is less relevant here
part_demo_18 = part_demo[part_demo.year == 11]
part_demo_18 = part_demo_18.groupby(['county','state']).sum()
part_demo_18 = part_demo_18.drop(labels = ['sumlev','state_code','county_code','year','age_group'],axis = 1) #remove useless columns
minority18 = part_demo_18.tot_pop - (part_demo_18.wa_male + part_demo_18.wa_female) #determine the minority population, chosen to be the non-white population for simplicity
white18 = part_demo_18.wa_male+ part_demo_18.wa_female #determine white population
per_min18 = minority18/part_demo_18.tot_pop #determine the percentage of individuals, in each county, that are a minority
demograph = pd.DataFrame({'county':part_demo_18.reset_index().county, #create dataframe with relevant data
'state':part_demo_18.reset_index().state,
'total_pop':part_demo_18.reset_index().tot_pop,
'minority_pop':minority18.reset_index()[0],
'white_pop':white18.reset_index()[0],
'percent_minor':per_min18.reset_index()[0],
'percent_white':1-per_min18.reset_index()[0]
})
counties = demograph.county
counties_temp = []
for i in counties:
    temp_split = i.replace('County', '')
    counties_temp = counties_temp + [temp_split]
Lastly, we need to format our "county" and "state" columns correctly, so that they can be combined with other datasets.
#delete prior county column
del demograph['county']
#add in tidier county and state columns
demograph['county'] = counties_temp
demograph['county']=demograph['county'].str.lower() #make county names lowercase and remove any white space
demograph['county']=demograph['county'].str.strip()
demograph.head()
Now we have to combine the datasets into one frame.
#format the county column of the dataset, so it merges correctly
poll_data['county'] = poll_data['county'].str.lower()
poll_data['county'] = poll_data['county'].str.strip()
#merge the demograph and poll_data datasets
county_data = poll_data.merge(demograph, on=['county', 'state'], how='outer')
county_data.dropna(inplace=True) #remove any NaN's (without inplace=True this line would have no effect)
comp_mort['county'] = comp_mort['county'].str.lower() #format the county column of the dataset, so it merges correctly
comp_mort['county'] = comp_mort['county'].str.strip()
comp_mort_2 = comp_mort.drop(labels=['county_Code', 'Infant_Age_Groups', 'Population'], axis=1) #drop the unneeded columns
comp_mort_2['state'] = comp_mort['state'].str.strip()
county_data = county_data.merge(comp_mort_2, on=['county', 'state'], how='left') #merge the datasets, so that we get one large dataset with all the relevant data
county_data.dropna(inplace=True)
county_data.head()
Now we have our dataset of relevant data, which we can use to perform some higher-level analysis.
However, prior to any real analysis, we should see how much of our data was lost in the merges. Losses happen because some states don't report the same information, and other states just don't keep and upload records as rigorously as others.
county_data.groupby('state').total_pop.sum()
part_demo_18.groupby('state').tot_pop.sum()
per_reporting = county_data.groupby('state').total_pop.sum() / part_demo_18.groupby('state').tot_pop.sum()
per_reporting.plot.barh()
plt.xlabel('Percent Reporting')
print(per_reporting.mean()*100)
It looks like we have a good amount of the data represented, just over 88%. The data in county_data represent the counties that reported all the data we need.
Now, we look at the correlations among our variables. This should allow us to see whether factors like the RSEI score and the percent minority population are positively correlated with the birth death rate.
to_corr = county_data.drop(labels='year', axis=1) #create correlation variable set
corr = to_corr.corr() #create correlation matrix
#plot the correlation data
fig, ax = plt.subplots(figsize=(15,15))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt = ".4f")
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlational Data")
plt.show()
There are several things we can learn from this graphic. First, there is a high correlation between the size of the population and the RSEI score, as well as with the number of deaths. This makes sense, as more people means more pollution. However, it also tells us that there is no correlation between the percent of the population that identifies as a minority and the RSEI score; by this measure, pollution is not correlated with the population demographics. A cursory search will show an abundance of information contradicting this, which means that there may be some sampling errors in our analysis. First, the RSEI score is not a value for actual pollution, but an aggregated and weighted pollution measure used to create a score of perceived danger. Furthermore, it could be a sampling error due to something specific to the states we are examining.
However, the data does show us that there is a correlation between the crude rate of birth deaths and the percent minority population. Essentially, this data is telling us that there is a correlation between demographics and the number of neonatal deaths, but suggests that it might not be the result of pollution in these areas. There is value in eliminating possible causes of death, as it helps us narrow in on the cause of the issue.
Given the correlations shown above, we need to move away from the RSEI score as a variable linked to the birth death rate. There are, however, other factors to consider. We'll start by reading in data on the poverty rates in these states.
Below, we load in data on poverty rates by county. The data is a CSV file, and includes the percentage of the population, and the percentage of children, living in poverty. The data was acquired from the United States Department of Agriculture. The USDA only offers downloads on a state-by-state basis, so I chose just the states relevant to our analysis.
poverty = pd.read_csv("../data/PovertyReport.csv", #load the dataset, and format the county column for merging purposes
                      names=['fips', "county", 'ruc_code', "percent_poverty", "lower_bound",
                             "upper_bound", "child_percent", "child_upper", "child_lower"])
poverty.drop(labels=['fips', 'ruc_code', 'lower_bound', 'upper_bound', 'child_upper', 'child_lower'], axis=1, inplace=True)
poverty['county']= poverty['county'].str.lower()
poverty['county']= poverty['county'].str.strip()
poverty.head()
Now, to get a better understanding of the distribution of our data, we plot the distribution of percentage of a population living under the poverty line. As before, the red line represents the average percentage of people below the poverty line, among the dataset, which is 17.3%.
x = poverty.percent_poverty
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
result = plt.hist(x)
plt.axvline(x.mean(), color='r', linewidth=3)
print(x.mean())
This gives us the percentage of each county's population, and of its children, that lives under the poverty line.
Here, we merge the poverty data with our county data, giving us a more complete picture of the makeup of each county.
county_data= county_data.merge(poverty,on= ['county'], how = 'left') #merge the poverty dataset into the county_data dataset
county_data.dropna(inplace = True)
county_data.head()
From here, I want to get a good sense of the number and percentage of people in each state who live under the poverty line. The first graph below shows the percentage of people who live under the poverty line, and the second shows the number. It is worth noting that the first graph is a more accurate representation of the relative wealth of a state.
county_data.groupby('state').percent_poverty.mean().plot.barh()
plt.title('Percentage of People Under Poverty Line')
plt.figure() #start a new figure so the two plots don't share axes
#percent_poverty is on a 0-100 scale, so divide by 100 before multiplying by population
(county_data.groupby('state').percent_poverty.mean()/100*county_data.groupby('state').total_pop.mean()).plot.barh()
plt.title("Number of People Under Poverty Line")
Here, we determine the correlations among our variables. This will tell us whether the percentage of people living under the poverty line is correlated with the birth death rate, the RSEI score (just to inform our decisions), and the minority percentage of the population.
to_corr = county_data.drop(labels=['year', 'rsei_model_hazard', 'white_pop', 'percent_white', 'Unreliable'], axis=1) #drop the less relevant columns
corr = to_corr.corr() #create correlation matrix
#determine and plot the correlations among the most relevant data
fig, ax = plt.subplots(figsize=(15,15))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt = ".4f")
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlational Data")
plt.show()
What this new correlation matrix shows us is that there is a correlation between the neonatal death rate and the percent of people who live in poverty. There is also a similar correlation between the percent minority and percent poverty populations.
Now, it is important to mention that correlation does not imply causation. However, it is becoming increasingly obvious that the reasons for neonatal deaths are complex, and there are many confounding variables. Therefore, it becomes important to see whether combinations of certain variables, and which variables at that, lead to the birth death rates that we observe.
Lastly, at this point, we need to make some determinations about the data itself. The question I felt most relevant at this time was: is the data precise enough to allow us to make real-to-life determinations?
To answer these questions, I plot the variables under the most scrutiny, as they relate to Crude Rate, below.
#generate a figure with one subplot per variable under scrutiny
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
#create scatter plots for the observations I am examining
observations = ['rsei_score', 'percent_minor', 'percent_poverty', 'child_percent']
for ax, obs in zip(axes, observations): #for each subplot
    ax.scatter(x=county_data[obs], y=county_data['Crude_Rate'])
    slope, intercept, r_value, p_value, std_err = stats.linregress(county_data[obs], county_data['Crude_Rate']) #perform linear regression
    print('Value of Line for ' + obs)
    print('Y = ' + str(slope) + 'X + ' + str(intercept))
    x_vals = np.array(ax.get_xlim()) #set x values for line
    y_vals = intercept + (slope * x_vals) #determine y values using slope, intercept, and x values
    ax.plot(x_vals, y_vals, 'r--') #plot the line using x and y values
    ax.set_title('Crude Rate Vs. ' + obs) #set title and axis labels
    ax.set_xlabel(obs)
    ax.set_ylabel('Crude Rate')
These graphs show us several things. First, they show us that there is a positive trend between the number of neonatal deaths and the percentage of people who are a minority or who live in poverty.
On the note of RSEI data: the plot also shows a negative correlation between the RSEI score and the neonatal death rate. However, there is an outlier clearly pulling the trend downward. Let's see what happens when we remove that value.
#find the index of the largest rsei_score, so we can examine the data without that outlier
outlier_idx = county_data['rsei_score'].idxmax()
county_data.loc[[outlier_idx]] #inspect the outlier row
#try again, now without the outlier
no_out = county_data.drop(index=outlier_idx)
fig, ax = plt.subplots(1, 1, figsize=(15, 5)) #create figure
ax.scatter(x=no_out['rsei_score'], y=no_out['Crude_Rate'])
slope, intercept, r_value, p_value, std_err = stats.linregress(no_out['rsei_score'], no_out['Crude_Rate']) #perform linear regression
x_vals = np.array(ax.get_xlim()) #set x values for line
y_vals = intercept + (slope * x_vals) #determine y values using slope, intercept, and x values
print('Value of Line:')
print('Y = ' + str(slope) + 'X + ' + str(intercept))
ax.plot(x_vals, y_vals, 'r--') #plot the line using x and y values
ax.set_title('RSEI Vs. Crude Rate (without largest outlier)') #set title and axis labels
ax.set_xlabel('RSEI Score')
ax.set_ylabel('Crude Rate')
So we still have a negative trend, but a much smaller one. Our slope moved all the way from -3.25x10^8 to -1.25x10^7, just by removing that outlier. This, to me, suggests that it may be important to add more data in the future; more data would make the fit less susceptible to a single outlier, and could even change the slope to a positive value. Furthermore, given the negative correlation between RSEI and crude rate, combined with what we know about pollution affecting neonates, I believe that RSEI is not a valid representation of pollution at this time. Lastly, RSEI is an aggregate over all pollutants, so in the future it may be better to find disaggregated data on the specific chemicals that have been shown to negatively affect pregnancy outcomes.
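One way to hedge against this kind of single-point sensitivity would be a rank-based measure: Spearman correlation depends only on the ordering of the values, so one extreme RSEI score cannot dominate it. A quick cross-check might look like this:
#Spearman rank correlation is insensitive to a single extreme value,
#making it a useful cross-check on the least-squares slopes above
rho, p = stats.spearmanr(county_data['rsei_score'], county_data['Crude_Rate'])
print('Spearman rho:', rho, 'p-value:', p)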
It has been found that pollution does have a negative effect on pregnancy outcomes, which can be explored further here: https://erj.ersjournals.com/content/54/suppl_63/PA297
As a last step, I chose to generate a series of predictive models. This was done to determine which variables, and which combinations of variables, had a stronger effect on our ability to generate an accurate model. The thinking was this: if we can generate an accurate model and then remove variables one by one, we can determine which variables have the strongest effect on the birth death rate.
parameters = ['Crude_Rate','rsei_score','percent_minor', 'percent_poverty','child_percent','total_pop','white_pop','percent_white']
x_vals = ['rsei_score','percent_minor', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
Here, I generate a series of different models, which are then trained and compared. This finds the most accurate method, which will be used later.
#creating several different models, to choose the most accurate
lin = sklearn.linear_model.LinearRegression()
ridge = sklearn.linear_model.Ridge()
ridge01 = sklearn.linear_model.Ridge(alpha = .01)
ridge1000 = sklearn.linear_model.Ridge(alpha = 1000)
las = sklearn.linear_model.Lasso()
las01 = sklearn.linear_model.Lasso(alpha = .01)
las0001 = sklearn.linear_model.Lasso(alpha = .0001)
Using the method from the labs in class, I created a get_rmse function. This function returns the Root Mean Squared Error for each model passed to it.
def get_rmse(model): #perform a 10-fold CV, using the method from a prior lab
    #10-fold CV was chosen because of the relatively small dataset and the low correlations, hoping that this would balance any excess error
    #note: this relies on the module-level X_values and y_values, which are redefined before each comparison below
    scores = sklearn.model_selection.cross_val_score(model, X_values, y_values, cv=10, scoring="neg_mean_squared_error")
    mse = np.mean(abs(scores))
    rmse = np.sqrt(mse)
    return rmse
lin_rmse = get_rmse(lin)
las_rmse = get_rmse(las)
las01_rmse = get_rmse(las01)
las0001_rmse = get_rmse(las0001)
ridge_rmse = get_rmse(ridge)
ridge01_rmse = get_rmse(ridge01)
ridge1000_rmse = get_rmse(ridge1000)
print('Linear Model RMSE: ', lin_rmse)
print('Lasso RMSE: ', las_rmse)
print('Lasso with alpha = .01 RMSE: ', las01_rmse)
print('Lasso with alpha = .0001 RMSE: ', las0001_rmse)
print('Ridge RMSE: ', ridge_rmse)
print('Ridge with alpha = .01 RMSE: ', ridge01_rmse)
print('Ridge with alpha = 1000 RMSE: ', ridge1000_rmse)
rmses = pd.Series([lin_rmse, las_rmse, las01_rmse,las0001_rmse,ridge_rmse, ridge01_rmse, ridge1000_rmse])
labels = ['linear RMSE', 'lasso RMSE', 'lasso01 RMSE','lasso0001 RMSE','ridge RMSE', 'ridge01 RMSE', 'ridge1000 RMSE']
fig, ax = plt.subplots()
rmses.plot.bar()
ax.set_xticklabels(labels)
The linear model's RMSE is the smallest, barely, so I'll move forward with that.
From this point onward, I run a series of models where I remove a single variable and check the RMSE. I do this to determine how much each variable contributes to our predictive capability. A compact sketch of the whole comparison appears below; the individual cells then spell it out step by step.
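Here is that compact sketch of the leave-one-variable-out comparison, written as a loop. It assumes the county_data frame from earlier; None stands for dropping nothing.
#compact sketch of the leave-one-variable-out comparison performed step by step below
all_vars = ['rsei_score', 'percent_minor', 'percent_poverty', 'child_percent', 'total_pop']
y = county_data['Crude_Rate'].values
for dropped in [None] + all_vars:
    kept = [v for v in all_vars if v != dropped] #drop one variable at a time
    X = county_data[kept].values
    scores = sklearn.model_selection.cross_val_score(sklearn.linear_model.LinearRegression(),
                                                     X, y, cv=10, scoring="neg_mean_squared_error")
    rmse = np.sqrt(np.mean(abs(scores))) #same RMSE-from-10-fold-CV recipe as get_rmse above
    print('without', dropped if dropped else 'nothing', ':', rmse)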
We start off with all of the variables
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
with_all = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
rmses_change = []
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','percent_minor', 'percent_poverty','child_percent','total_pop']
x_vals = ['percent_minor', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Then we remove the RSEI variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_rsei = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_poverty','child_percent','total_pop']
x_vals = ['rsei_score', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the RSEI back in and remove the Percent Minority variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_min = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_minor','child_percent','total_pop']
x_vals = ['rsei_score', 'percent_minor','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the Percent Minority back in and remove the Percent Poverty variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_pov = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_minor','percent_poverty','total_pop']
x_vals = ['rsei_score', 'percent_minor','percent_poverty','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the Percent Poverty back in and remove the Percent Child Poverty variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_kid = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
Now we plot each of these predictive models against the actual data. This should show us whether there are any glaring issues with our models.
with_all.plot.bar(figsize=(30,5), title = 'With All Data')
without_rsei.plot.bar(figsize=(30,5),title = 'Without RSEI Data')
without_per_min.plot.bar(figsize=(30,5),title = 'Without Percent Minority Data')
without_per_pov.plot.bar(figsize=(30,5),title = 'Without Percent Poverty Data')
without_per_kid.plot.bar(figsize=(30,5),title = 'Without Percent Child Poverty Data')
Lastly, we plot the RMSE values for each model, so that we can determine which variables have the greatest effect at creating an accurate predictive model.
rmses_ser = pd.DataFrame(rmses_change)
label = ['With All','W/out RSEI','W/out % Minority','W/out % Poverty','W/out % Child Poverty']
fig = plt.figure(num=None, figsize=(15, 5), dpi=80) #generate figure
print(rmses_change)
plt.bar(label, rmses_ser[0])
plt.title('Error')
There is quite a bit I learned through this project. First, there is a strong correlation between the percent of a population that identifies as a minority and the birth death rate. This has been shown before, but being able to give quantitative evidence is relevant to creating further discussion. Second, the percentage of the population living under the poverty line has an effect on birth death outcomes. However, there are many factors affecting the birth death rate in different populations, so it is unlikely that I'll be able to isolate a specific one, to a high degree of accuracy, at this time. Furthermore, there are concerns about the external validity of some variables, such as the RSEI score. At this time, I believe that there may be a correlation between the release of certain chemicals into the environment and negative pregnancy outcomes. However, an analysis of these components would require a significant amount of research into both biochemistry and the effects of pollution.
Lastly, I would just like to note what this work means. There is a significant problem in the healthcare system at this time. And while my conclusions may not allow for any immediate progress, they inform further decision making and allow for more discussion on what is relevant. If I were to redo the entire project, knowing what I learned in the work above, I would make many different decisions, and perhaps arrive at a more accurate way of predicting risk factors and outcomes.