As it currently stands, there are disturbing differences in the survival rates of newborns. Ideally there would be no neonatal deaths at all, but that is not currently possible. What is more disturbing is that a child's odds of survival can depend on factors like race or income. According to the Centers for Disease Control and Prevention, the mortality rate of Black newborns is approximately three times higher than that of white newborns. This raises the question: what causes this difference? As the public has become aware of these disparities, there has been increased interest in the care provided within hospitals. Many believe there is racial bias within hospitals, and there is strong evidence to support these claims. However, few have examined factors affecting childbirth outcomes prior to birth. Are there environmental factors in the United States, linked to race and income, that can also be linked to the increase in neonatal deaths among minority and low-income populations? That is what I attempt to determine in this analysis.
The purpose of this project is to examine factors that may contribute to the birth death rates of certain counties. I chose to examine this issue using information on race, poverty, and pollution levels, based on the assumption that each of these factors will have an effect on the rate of birth deaths in an area.
Environmental Protection Agency: RSEI Data -
https://www.epa.gov/rsei/ways-get-rsei-results#Microdata
United States Department of Agriculture: Poverty Data -
https://data.ers.usda.gov/reports.aspx?ID=17826
Centers for Disease Control and Prevention: Infant Mortality Data -
https://wonder.cdc.gov/cmf-icd10.html
United States Census Bureau: Demographics Data -
https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
In this project, I hypothesize that I will be able to observe the following trends:
-A higher percentage minority population in a county is associated with higher birth death rates
-Higher pollution levels in a county are associated with higher birth death rates
-Higher levels of poverty in a county are associated with higher birth death rates
We begin with some housekeeping and library work. Here, I import all of the relevant libraries for the analysis I will perform during this project.
%matplotlib inline
import pandas as pd
import numpy as np
import os
import re
import matplotlib
import matplotlib.pyplot as plt
import sqlite3
from scipy import stats
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import model_selection #needed so the sklearn.model_selection.* calls later in the notebook resolve
matplotlib.style.use('fivethirtyeight')
Next I acquire and load in data on neonatal mortality rates. This data is provided by the CDC and is publicly available on their website. The data is formatted as a text file, and contains the number of birth deaths per county, along with some basic data on the population of each county. I then clean and tidy the data so that it can be more readily used in an analysis.
#load in data from file
comp_mort = pd.read_csv("../data/comp_mort_99-16.txt",
                        delimiter='\t',
                        header=0,
                        names=["Notes", "county", "county_Code", "Infant_Age_Groups",
                               "Infant Age Groups Code", "Deaths", "Population", "Crude_Rate"],
                        nrows=1925)
Now we need to clean and tidy that data, so that it is workable and can be combined into a larger dataset. First we adjust the information on location, so that we can accurately represent each county.
#delete empty or redundant columns
del comp_mort['Notes']
del comp_mort['Infant Age Groups Code']
#split the county column into the form [county, state]
count = comp_mort.county.str.split(',')
#create a list of counties and a list of states
counties = np.empty(0)
states = np.empty(0)
for i in count:
    counties = np.append(counties, i[0])
    states = np.append(states, i[1])
#strip the trailing 'County' from each county name
counties_temp = []
for i in counties:
    temp_split = i.replace('County', '')
    counties_temp = counties_temp + [temp_split]
#delete prior county column
del comp_mort['county']
#add in tidier county and state columns
comp_mort['county'] = counties_temp
comp_mort['state'] = states
Next, we need to adjust the information on the crude rate. The crude rate here is the number of deaths per 100,000 population, which is the standardization CDC WONDER uses for this type of data. Rates based on small counts are flagged as "Unreliable", and that flag appears as extra text in this column, which we record separately.
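As a quick worked example of this standardization (the numbers below are made up for illustration, not taken from the dataset):
#hypothetical example of the crude-rate standardization described above
deaths = 57 #hypothetical infant deaths in a county over the period
population = 2000000 #hypothetical population for that county
crude_rate = deaths / population * 100000 #deaths per 100,000 population
print(crude_rate) #2.85, on the same scale as the Crude_Rate column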
#split the Crude_Rate column; flagged rates carry extra text, e.g. "3.2 (Unreliable)"
crude = comp_mort.Crude_Rate.str.split()
unreliable = np.empty(0)
crude_rate = np.empty(0)
for line in crude:
    if len(line) == 1:
        unreliable = np.append(unreliable, 0)
    else:
        unreliable = np.append(unreliable, 1)
    crude_rate = np.append(crude_rate, line[0])
del comp_mort['Crude_Rate']
comp_mort['Crude_Rate'] = crude_rate
comp_mort['Unreliable'] = unreliable
comp_mort['Crude_Rate'] = pd.to_numeric(comp_mort['Crude_Rate'], errors='coerce')
comp_mort.head()
Next, I check the dtypes of all the columns, to make sure they are registered correctly.
comp_mort.dtypes
From this point, we also need to gather data on the pollution levels in each county. For this project, I chose to use data provided by the EPA; the dataset is a CSV file, and is loaded and tidied in the following cells. However, it is important to understand what the data presented actually means.
The EPA has a system in place, referred to as the Risk-Screening Environmental Indicators (RSEI), which estimates the risk an area faces of damage from pollution. The RSEI score aggregates all the data on pollutants the EPA gathers in a certain location; that data is then weighted and summed to give a score for the region. Therefore, the RSEI is a measure of risk, not a raw measure of pollutant levels in an area. This means that certain locations can have an RSEI of 0, meaning they are at incredibly low modeled risk; a score of 0 does not imply that there are zero pollutants found in an area.
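To make the "weighted and summed" idea concrete, here is a toy sketch of a weighted-sum risk score. The chemicals, release amounts, and weights below are all invented for illustration; the EPA's actual RSEI model is far more involved, accounting for fate, transport, exposure, and toxicity.
#toy illustration of a weighted-sum risk score (NOT the EPA's actual RSEI model)
#each pollutant measurement is scaled by a toxicity/exposure weight, then summed
toy_releases = {'chem_a': 120.0, 'chem_b': 0.0, 'chem_c': 45.5} #hypothetical release amounts
toy_weights = {'chem_a': 0.8, 'chem_b': 5.0, 'chem_c': 0.1} #hypothetical risk weights
toy_score = sum(toy_releases[c] * toy_weights[c] for c in toy_releases)
print(toy_score) #96.0 + 0.0 + 4.55 = 100.55; a county with no weighted releases scores 0,
#which is why an RSEI of 0 means low modeled risk, not a pollutant-free area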
poll_data = pd.read_csv("../data/rsei-data-2018.csv",
                        header=0,
                        names=['year', 'state', 'county',
                               'rsei_score', 'rsei_model_hazard'])
poll_data.head()
For the purposes of this project, I chose to pull data from Alabama, Louisiana, Mississippi, North Carolina, Texas, Virginia, and Washington. These states were chosen because they all have significant populations drawn from a variety of demographics, and they have all been shown in recent years to have issues with pollution, so the RSEI score should be a more meaningful reflection of pollution levels. The data was also restricted to these states because of the large size of the files covering the entirety of the USA.
#trim the 'county, ST' values down to just 'county', and turn the state names into their abbreviations
poll_data['county'] = poll_data['county'].str[:-4].str.lower()
poll_data['state'] = poll_data['state'].map({'Louisiana': 'LA', 'Alabama': 'AL', 'Mississippi': 'MS',
                                             'North Carolina': 'NC', 'Texas': 'TX', 'Virginia': 'VA',
                                             'Washington': 'WA'})
#remove the commas from the string values for rsei_score and rsei_model_hazard
poll_data['rsei_score'] = (poll_data['rsei_score'].str.split()).apply(lambda x: float(x[0].replace(',', '')))
poll_data['rsei_model_hazard'] = (poll_data['rsei_model_hazard'].str.split()).apply(lambda x: float(x[0].replace(',', '')))
#convert the types
poll_data = poll_data.astype(dtype={'year': float, 'rsei_score': float, 'rsei_model_hazard': float}, copy=False)
poll_data.head()
Now we check the dtypes again, to make sure our data is accurately represented.
poll_data.dtypes
At this point, the last dataset we need to load is data on demographics in each county. This data was obtained from Census.gov, and is publicly available. The data is presented as a CSV file.
It is important to note that the census records a significantly larger, and more specific, number of racial demographics than are used colloquially. For simplicity, and the scope of this project, I separate the demographics into what is colloquially thought of as the minority and white populations.
demo_data = pd.read_csv("../data/cc-est2019-alldata.csv", encoding="ISO-8859-1",
                        header=0,
                        names=['sumlev','state_code','county_code','state','county','year','age_group','tot_pop','tot_male','tot_female',
                               'wa_male','wa_female','ba_male','ba_female','ia_male','ia_female','aa_male','aa_female',
                               'na_male','na_female','tom_male','tom_female','wac_male','wac_female','bac_male','bac_female','iac_male',
                               'iac_female','aac_male','aac_female','nac_male','nac_female','nh_male','nh_female','nhwa_male','nhwa_female',
                               'nhba_male','nhba_female','nhia_male','nhia_female','nhaa_male','nhaa_female','nhna_male','nhna_female',
                               'nhtom_male','nhtom_female','nhwac_male','nhwac_female','nhbac_male','nhbac_female',
                               'nhiac_male','nhiac_female','nhaac_male','nhaac_female','nhnac_male','nhnac_female','h_male','h_female',
                               'hwa_male','hwa_female','hba_male','hba_female','hia_male','hia_female','haa_male','haa_female','hna_male',
                               'hna_female','htom_male','htom_female','hwac_male','hwac_female','hbac_male','hbac_female','hiac_male',
                               'hiac_female','haac_male','haac_female','hnac_male','hnac_female'])
demo_data.head()
#change the name of each county from "X County" to "X". This will make it easier to compare data later
for i in range(len(demo_data.county)):
    temp = demo_data.county[i].replace('County', '')
    demo_data.at[i, 'county'] = temp
demo_data.head()
The key for the year variable is as follows:
1 = 4/1/2010 Census population
:::
12 = 7/1/2019 population estimate
The key for age_group is as follows:
0 = Total
1 = Age 0 to 4 years
2 = Age 5 to 9 years
:::
18 = Age 85 years or older
I chose to leave the data in this format because it is actually easier to filter using this method.
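For example, pulling the 7/1/2018 estimates (year code 11) for the all-ages totals (age_group code 0) is a one-line filter:
#example of filtering with the coded keys: 7/1/2018 estimates (year == 11), totals across all ages (age_group == 0)
demo_2018_totals = demo_data[(demo_data.year == 11) & (demo_data.age_group == 0)]
demo_2018_totals.head()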
#change the state names to their abbreviation
demo_data['state'] = demo_data['state'].map({
'Alabama': 'AL','Alaska': 'AK','American Samoa': 'AS','Arizona': 'AZ','Arkansas': 'AR','California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE',
'District of Columbia': 'DC','Florida': 'FL','Georgia': 'GA','Guam': 'GU','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL','Indiana': 'IN','Iowa': 'IA','Kansas': 'KS',
'Kentucky': 'KY','Louisiana': 'LA','Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN','Mississippi': 'MS','Missouri': 'MO',
'Montana': 'MT','Nebraska': 'NE','Nevada': 'NV','New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC','North Dakota': 'ND',
'Northern Mariana Islands':'MP','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA','Puerto Rico': 'PR','Rhode Island': 'RI','South Carolina': 'SC',
'South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX','Utah': 'UT','Vermont': 'VT','Virgin Islands': 'VI','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV',
'Wisconsin': 'WI','Wyoming': 'WY'
})
demo_data.head()
I check the dtypes of the demographic data, but I will not be showing it here, as there are 80 columns in this dataset.
We begin by finding the average crude rate, and determining which counties have a crude rate above the average. This should provide us with information on how our data is skewed. It also provides a nice means of filtering by counties with higher-than-average crude rates.
#determine the average crude rate
avg_crude_state = comp_mort.groupby('state').Crude_Rate.mean()
avg_crude = comp_mort.Crude_Rate.mean()
print("Average Crude Rate: ", avg_crude)
#create a column to represent the counties with a higher-than-average crude rate
comp_mort['Crude_Above_Avg'] = comp_mort.Crude_Rate >= avg_crude
display(comp_mort.Crude_Above_Avg.value_counts())
comp_mort.head()
Now, it becomes relevant to plot the distribution of our crude rates, so that we can see how our data is skewed. In the plot below, we plot a histogram of the crude rate, with the red line representing the average crude rate.
x = comp_mort.Crude_Rate
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
result = plt.hist(x)
plt.axvline(x.mean(), color='r', linewidth=3)
The average crude rate is 2.85, and the majority of counties have a value below the average, with a long tail of counties above it. Therefore, the data is skewed to the right. In order to keep track of the counties with larger crude rates, we created the Crude_Above_Avg column above.
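As a quick numeric check on that reading of the histogram, we can compute the sample skewness directly; a positive value confirms the right skew.
#sample skewness of the crude rate; a positive value indicates right skew
print(stats.skew(comp_mort.Crude_Rate.dropna()))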
Now, we can explore more general measures, for larger populations. Therefore, I chose to create a list of the states with the largest number of deaths, compared to their popultion. I then ranked and plotted this list.
#determine the # of deaths in each state
death_by_state = comp_mort.groupby('state').Deaths.sum()
#determine the population of each state
pop_of_state = comp_mort.groupby('state').Population.sum()
avg_by_pop = death_by_state/pop_of_state
#plot the graph
fig = plt.figure(num=None, figsize=(12, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
avg_by_pop.sort_values(ascending=False).plot.bar()
This is rather crowded, however, so let's examine a plot of just the 15 states with the highest rates.
states_highest_dpp = avg_by_pop.sort_values(ascending=False).index[:15]
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
avg_by_pop.loc[states_highest_dpp].plot.bar()
This graph shows us the 15 states with the largest number of deaths, given their population. This becomes interesting when we compare this graph to a plot of the demographic distribution of each state.
#determine populations of each state, using the 7/1/2018 estimates (year code 11)
year11 = demo_data.set_index('year').loc[11]
state_pop = year11.groupby('state').tot_pop.sum()
#determine the minority (non-white) and white populations of each state
minor = year11.groupby('state').tot_pop.sum() - (year11.groupby('state').wa_male.sum() + year11.groupby('state').wa_female.sum())
white = year11.groupby('state').wa_male.sum() + year11.groupby('state').wa_female.sum()
#percent minority for the same 15 states plotted above (stripping the leading space left over from the earlier split)
percent_min = (minor/state_pop).loc[[s.strip() for s in states_highest_dpp]]
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
percent_min.sort_values(ascending = False).plot.bar()
Now, through just a cursory analysis, there seems to be a correlation between the states with the largest minority populations and the states with the largest crude rates.
So, now that we have some basic information on our data, we need to gain a more in-depth understanding of how, or if, the data all works together. That means our next step is to determine whether there is a strong correlation among any of our variables. This has the dual purpose of informing our next steps and showing us, quantitatively, how certain factors may affect others.
First, we need to isolate the data that is most relevant. This means filtering for the states that we are examining, dropping some columns, and ultimately creating a new dataset of relevant information.
#isolate the states that I chose to look at the pollution data for
#(note the leading space in the comp_mort state names, left over from the split on ',')
part_mort = comp_mort[comp_mort.state.isin([' LA', ' AL', ' MS', ' NC', ' TX', ' VA', ' WA'])]
part_demo = demo_data[demo_data.state.isin(['LA', 'AL', 'MS', 'NC', 'TX', 'VA', 'WA'])]
#isolate the year-code 11 rows (the 7/1/2018 estimates), and group by county, so that we have the total demographic data for each county.
#This is important, because the data is also organized and separated by age group, which is less relevant here
part_demo_18 = part_demo[part_demo.year == 11]
part_demo_18 = part_demo_18.groupby(['county','state']).sum()
part_demo_18 = part_demo_18.drop(labels = ['sumlev','state_code','county_code','year','age_group'],axis = 1) #remove useless columns
minority18 = part_demo_18.tot_pop - (part_demo_18.wa_male + part_demo_18.wa_female) #determine the minority population, chosen to be the non-white population for simplicity
white18 = part_demo_18.wa_male+ part_demo_18.wa_female #determine white population
per_min18 = minority18/part_demo_18.tot_pop #determine the percentage of individuals, in each county, that are a minority
demograph = pd.DataFrame({'county':part_demo_18.reset_index().county, #create dataframe with relevant data
'state':part_demo_18.reset_index().state,
'total_pop':part_demo_18.reset_index().tot_pop,
'minority_pop':minority18.reset_index()[0],
'white_pop':white18.reset_index()[0],
'percent_minor':per_min18.reset_index()[0],
'percent_white':1-per_min18.reset_index()[0]
})
counties = demograph.county
counties_temp = []
for i in counties:
    temp_split = i.replace('County', '')
    counties_temp = counties_temp + [temp_split]
Lastly, we need to format our "county" and "state" columns correctly, so that they can be combined with other datasets.
#delete prior county column
del demograph['county']
#add in tidier county and state columns
demograph['county'] = counties_temp
demograph['county']=demograph['county'].str.lower() #make county names lowercase and remove any white space
demograph['county']=demograph['county'].str.strip()
demograph.head()
Now we have to combine the datasets into one frame.
#format the county column of the dataset, so it merges correctly
poll_data['county'] = poll_data['county'].str.lower()
poll_data['county'] = poll_data['county'].str.strip()
#merge the demograph and poll_data datasets
county_data = poll_data.merge(demograph, on=['county', 'state'], how='outer')
county_data.dropna(inplace=True) #remove any NaN's (without inplace=True this line would have no effect)
comp_mort['county'] = comp_mort['county'].str.lower() #format the county column of the dataset, so it merges correctly
comp_mort['county'] = comp_mort['county'].str.strip()
comp_mort_2 = comp_mort.drop(labels=['county_Code', 'Infant_Age_Groups', 'Population'], axis=1) #drop the unneeded columns
comp_mort_2['state'] = comp_mort['state'].str.strip()
county_data = county_data.merge(comp_mort_2, on=['county', 'state'], how='left') #merge the datasets, so that we get one large dataset with all the relevant data
county_data.dropna(inplace=True)
county_data.head()
Now we have our dataset of relevant data, which we can use to perform some higher-level analysis.
However, prior to any real analysis, we should see how much of our data was lost in the merges. Losses happen because some states don't report the same information, and other states just don't keep and upload records as rigorously as others.
county_data.groupby('state').total_pop.sum()
part_demo_18.groupby('state').tot_pop.sum()
per_reporting = county_data.groupby('state').total_pop.sum() / part_demo_18.groupby('state').tot_pop.sum()
per_reporting.plot.barh()
plt.xlabel('Percent Reporting')
print(per_reporting.mean()*100)
It looks like we have a good amount of the data represented, just over 88%. The data in county_data represent the counties that reported all the data we need.
Now, we look at the correlations among our variables. This should allow us to see whether factors like the RSEI score and the percent minority population are positively correlated with the birth death rate.
to_corr = county_data.drop(labels='year', axis=1) #create correlation variable set
corr = to_corr.corr() #create correlation matrix
#plot the correlation data
fig, ax = plt.subplots(figsize=(15,15))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt = ".4f")
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlational Data")
plt.show()
There are several things we can learn from this graphic. First, there is a high correlation between the size of the population and the RSEI score, as well as with the number of deaths. This makes sense, as more people means more pollution. However, it also tells us that there is no correlation between the percent of the population that identifies as a minority and the RSEI score; by this measure, pollution is not correlated with the population demographics. A cursory search will show an abundance of information contradicting this, which means that there may be some sampling errors in our analysis. First, the RSEI score is not a value for actual pollution, but an aggregated and weighted pollution measure used to create a score of perceived danger. Furthermore, it could be a sampling error due to something specific to the states we are examining.
However, the data does show us that there is a correlation between the crude rate of birth deaths and the percent minority population. Essentially, this data is telling us that there is a correlation between demographics and the number of neonatal deaths, but suggests that it might not be the result of pollution in these areas. There is value in eliminating possible causes of death, as it helps us narrow in on the cause of the issue.
Given the correlations shown above, we need to move away from the RSEI score as a variable linked to the birth death rate. There are, however, other factors to consider. We'll start by reading in data on the poverty rates in these states.
Below, we load in data on poverty rates by county. The data is a CSV file, and includes the percentage of the population, and the percentage of children, living in poverty. The data was acquired from the United States Department of Agriculture. The USDA only offers downloads on a state-by-state basis, so I chose just the states relevant to our analysis.
poverty = pd.read_csv("../data/PovertyReport.csv", #load the dataset, and format the county column for merging purposes
                      names=['fips', "county", 'ruc_code', "percent_poverty", "lower_bound",
                             "upper_bound", "child_percent", "child_upper", "child_lower"])
poverty.drop(labels=['fips', 'ruc_code', 'lower_bound', 'upper_bound', 'child_upper', 'child_lower'], axis=1, inplace=True)
poverty['county']= poverty['county'].str.lower()
poverty['county']= poverty['county'].str.strip()
poverty.head()
Now, to get a better understanding of the distribution of our data, we plot the distribution of percentage of a population living under the poverty line. As before, the red line represents the average percentage of people below the poverty line, among the dataset, which is 17.3%.
x = poverty.percent_poverty
fig = plt.figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k') #generate figure
result = plt.hist(x)
plt.axvline(x.mean(), color='r', linewidth=3)
print(x.mean())
This gives us the percentage of each county's population, and of its children, that lives under the poverty line.
Here, we merge the poverty data with our county data, giving us a more complete picture of the makeup of each county.
county_data= county_data.merge(poverty,on= ['county'], how = 'left') #merge the poverty dataset into the county_data dataset
county_data.dropna(inplace = True)
county_data.head()
From here, I want to get a good sense of the number and percentage of people in each state who live under the poverty line. The first graph below shows the percentage of people who live under the poverty line, and the second shows the number. It is worth noting that the first graph is a more accurate representation of the relative wealth of a state.
county_data.groupby('state').percent_poverty.mean().plot.barh()
plt.title('Percentage of People Under Poverty Line')
plt.figure() #start a new figure so the two plots don't share axes
#percent_poverty is on a 0-100 scale, so divide by 100 before multiplying by population
(county_data.groupby('state').percent_poverty.mean()/100*county_data.groupby('state').total_pop.mean()).plot.barh()
plt.title("Number of People Under Poverty Line")
Here, we determine the correlations among our variables. This will tell us whether the percentage of people living under the poverty line is correlated with the birth death rate, the RSEI score (just to inform our decisions), and the minority percentage of the population.
to_corr = county_data.drop(labels=['year', 'rsei_model_hazard', 'white_pop', 'percent_white', 'Unreliable'], axis=1) #drop the less relevant columns
corr = to_corr.corr() #create correlation matrix
#determine and plot the correlations among the most relevant data
fig, ax = plt.subplots(figsize=(15,15))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt = ".4f")
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlational Data")
plt.show()
What this new correlation matrix shows us is that there is a correlation between the neonatal death rate and the percent of people who live in poverty. There is also a similar correlation between the percent minority and percent poverty populations.
Now, it is important to mention that correlation does not imply causation. However, it is becoming increasingly obvious that the reasons for neonatal deaths are complex, and there are many confounding variables. Therefore, it becomes important to see whether combinations of certain variables, and which variables at that, lead to the birth death rates that we observe.
Lastly, at this point, we need to make some determinations about the data itself. The question I felt most relevant at this time was: is the data precise enough to allow us to make real-to-life determinations?
To answer these questions, I plot the variables under the most scrutiny, as they relate to Crude Rate, below.
#generate a figure with one subplot per variable under scrutiny
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
#create scatter plots for the observations I am examining
observations = ['rsei_score', 'percent_minor', 'percent_poverty', 'child_percent']
for ax, obs in zip(axes, observations): #for each subplot
    ax.scatter(x=county_data[obs], y=county_data['Crude_Rate'])
    slope, intercept, r_value, p_value, std_err = stats.linregress(county_data[obs], county_data['Crude_Rate']) #perform linear regression
    print('Value of Line for ' + obs)
    print('Y = ' + str(slope) + 'X + ' + str(intercept))
    x_vals = np.array(ax.get_xlim()) #set x values for line
    y_vals = intercept + (slope * x_vals) #determine y values using slope, intercept, and x values
    ax.plot(x_vals, y_vals, 'r--') #plot the line using x and y values
    ax.set_title('Crude Rate Vs. ' + obs) #set title and axis labels
    ax.set_xlabel(obs)
    ax.set_ylabel('Crude Rate')
These graphs show us several things. First, they show us that there is a positive trend between the number of neonatal deaths and the percentage of people who are a minority or who live in poverty.
On the note of RSEI data: the plot also shows a negative correlation between the RSEI score and the neonatal death rate. However, there is an outlier clearly pulling the trend downward. Let's see what happens when we remove that value.
#find the index of the largest rsei_score, so we can examine the data without that outlier
outlier_idx = county_data['rsei_score'].idxmax()
county_data.loc[[outlier_idx]] #inspect the outlier row
#try again, now without the outlier
no_out = county_data.drop(index=outlier_idx)
fig, ax = plt.subplots(1, 1, figsize=(15, 5)) #create figure
ax.scatter(x=no_out['rsei_score'], y=no_out['Crude_Rate'])
slope, intercept, r_value, p_value, std_err = stats.linregress(no_out['rsei_score'], no_out['Crude_Rate']) #perform linear regression
x_vals = np.array(ax.get_xlim()) #set x values for line
y_vals = intercept + (slope * x_vals) #determine y values using slope, intercept, and x values
print('Value of Line:')
print('Y = ' + str(slope) + 'X + ' + str(intercept))
ax.plot(x_vals, y_vals, 'r--') #plot the line using x and y values
ax.set_title('RSEI Vs. Crude Rate (without largest outlier)') #set title and axis labels
ax.set_xlabel('RSEI Score')
ax.set_ylabel('Crude Rate')
So we still have a negative trend, but a much smaller one. Our slope moved all the way from -3.25x10^8 to -1.25x10^7, just by removing that outlier. This, to me, suggests that it may be important to add more data in the future; more data would make the fit less susceptible to a single outlier, and could even change the slope to a positive value. Furthermore, given the negative correlation between RSEI and crude rate, combined with what we know about pollution affecting neonates, I believe that RSEI is not a valid representation of pollution at this time. Lastly, RSEI is an aggregate over all pollutants, so in the future it may be better to find disaggregated data on the specific chemicals that have been shown to negatively affect pregnancy outcomes.
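One way to hedge against this kind of single-point sensitivity would be a rank-based measure: Spearman correlation depends only on the ordering of the values, so one extreme RSEI score cannot dominate it. A quick cross-check might look like this:
#Spearman rank correlation is insensitive to a single extreme value,
#making it a useful cross-check on the least-squares slopes above
rho, p = stats.spearmanr(county_data['rsei_score'], county_data['Crude_Rate'])
print('Spearman rho:', rho, 'p-value:', p)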
It has been found that pollution does have a negative effect on pregnancy outcomes, which can be explored further here: https://erj.ersjournals.com/content/54/suppl_63/PA297
As a last step, I chose to generate a series of predictive models. This was done to determine which variables, and which combinations of variables, had a stronger effect on our ability to generate an accurate model. The thinking was this: if we can generate an accurate model and then remove variables one by one, we can determine which variables have the strongest effect on the birth death rate.
parameters = ['Crude_Rate','rsei_score','percent_minor', 'percent_poverty','child_percent','total_pop','white_pop','percent_white']
x_vals = ['rsei_score','percent_minor', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
Here, I generate a series of different models, which are then trained and compared. This finds the most accurate method, which will be used later.
#creating several different models, to choose the most accurate
lin = sklearn.linear_model.LinearRegression()
ridge = sklearn.linear_model.Ridge()
ridge01 = sklearn.linear_model.Ridge(alpha = .01)
ridge1000 = sklearn.linear_model.Ridge(alpha = 1000)
las = sklearn.linear_model.Lasso()
las01 = sklearn.linear_model.Lasso(alpha = .01)
las0001 = sklearn.linear_model.Lasso(alpha = .0001)
Using the method from the labs in class, I created a get_rmse function. This function returns the Root Mean Squared Error for each model passed to it.
def get_rmse(model): #perform a 10-fold CV, using the method from a prior lab
    #10-fold CV was chosen because of the relatively small dataset and the low correlations, hoping that this would balance any excess error
    #note: this relies on the module-level X_values and y_values, which are redefined before each comparison below
    scores = sklearn.model_selection.cross_val_score(model, X_values, y_values, cv=10, scoring="neg_mean_squared_error")
    mse = np.mean(abs(scores))
    rmse = np.sqrt(mse)
    return rmse
lin_rmse = get_rmse(lin)
las_rmse = get_rmse(las)
las01_rmse = get_rmse(las01)
las0001_rmse = get_rmse(las0001)
ridge_rmse = get_rmse(ridge)
ridge01_rmse = get_rmse(ridge01)
ridge1000_rmse = get_rmse(ridge1000)
print('Linear Model RMSE: ', lin_rmse)
print('Lasso RMSE: ', las_rmse)
print('Lasso with alpha = .01 RMSE: ', las01_rmse)
print('Lasso with alpha = .0001 RMSE: ', las0001_rmse)
print('Ridge RMSE: ', ridge_rmse)
print('Ridge with alpha = .01 RMSE: ', ridge01_rmse)
print('Ridge with alpha = 1000 RMSE: ', ridge1000_rmse)
rmses = pd.Series([lin_rmse, las_rmse, las01_rmse,las0001_rmse,ridge_rmse, ridge01_rmse, ridge1000_rmse])
labels = ['linear RMSE', 'lasso RMSE', 'lasso01 RMSE','lasso0001 RMSE','ridge RMSE', 'ridge01 RMSE', 'ridge1000 RMSE']
fig, ax = plt.subplots()
rmses.plot.bar()
ax.set_xticklabels(labels)
The linear model's RMSE is the smallest, barely, so I'll move forward with that.
From this point onward, I run a series of models where I remove a single variable and check the RMSE. I do this to determine how much each variable contributes to our predictive capability. A compact sketch of the whole comparison appears below; the individual cells then spell it out step by step.
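Here is that compact sketch of the leave-one-variable-out comparison, written as a loop. It assumes the county_data frame from earlier; None stands for dropping nothing.
#compact sketch of the leave-one-variable-out comparison performed step by step below
all_vars = ['rsei_score', 'percent_minor', 'percent_poverty', 'child_percent', 'total_pop']
y = county_data['Crude_Rate'].values
for dropped in [None] + all_vars:
    kept = [v for v in all_vars if v != dropped] #drop one variable at a time
    X = county_data[kept].values
    scores = sklearn.model_selection.cross_val_score(sklearn.linear_model.LinearRegression(),
                                                     X, y, cv=10, scoring="neg_mean_squared_error")
    rmse = np.sqrt(np.mean(abs(scores))) #same RMSE-from-10-fold-CV recipe as get_rmse above
    print('without', dropped if dropped else 'nothing', ':', rmse)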
We start off with all of the variables
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
with_all = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
rmses_change = []
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','percent_minor', 'percent_poverty','child_percent','total_pop']
x_vals = ['percent_minor', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Then we remove the RSEI variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_rsei = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_poverty','child_percent','total_pop']
x_vals = ['rsei_score', 'percent_poverty','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the RSEI back in and remove the Percent Minority variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_min = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_minor','child_percent','total_pop']
x_vals = ['rsei_score', 'percent_minor','child_percent','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the Percent Minority back in and remove the Percent Poverty variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_pov = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
parameters = ['Crude_Rate','rsei_score', 'percent_minor','percent_poverty','total_pop']
x_vals = ['rsei_score', 'percent_minor','percent_poverty','total_pop']
county_pred = county_data[parameters]
X_values = county_pred[x_vals].values
y_values = county_pred['Crude_Rate'].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
lin = sklearn.linear_model.LinearRegression()
Put the Percent Poverty back in and remove the Percent Child Poverty variable
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
without_per_kid = pd.DataFrame({'real_data': y_test, 'predicted_data': y_pred})
lin_rmse = get_rmse(lin)
rmses_change = rmses_change + [lin_rmse]
Now we plot each of these predictive models against the actual data. This should show us whether there are any glaring issues with our models.
with_all.plot.bar(figsize=(30,5), title = 'With All Data')
without_rsei.plot.bar(figsize=(30,5),title = 'Without RSEI Data')
without_per_min.plot.bar(figsize=(30,5),title = 'Without Percent Minority Data')
without_per_pov.plot.bar(figsize=(30,5),title = 'Without Percent Poverty Data')
without_per_kid.plot.bar(figsize=(30,5),title = 'Without Percent Child Poverty Data')
Lastly, we plot the RMSE values for each model, so that we can determine which variables have the greatest effect at creating an accurate predictive model.
rmses_ser = pd.DataFrame(rmses_change)
label = ['With All','W/out RSEI','W/out % Minority','W/out % Poverty','W/out % Child Poverty']
fig = plt.figure(num=None, figsize=(15, 5), dpi=80) #generate figure
print(rmses_change)
plt.bar(label, rmses_ser[0])
plt.title('Error')
There is quite a bit I learned through this project. First, there is a strong correlation between the percent of a population that identifies as a minority and the birth death rate. This has been shown before, but being able to give quantitative evidence is relevant to creating further discussion. Second, the percentage of the population living under the poverty line has an effect on birth death outcomes. However, there are many factors affecting the birth death rate in different populations, so it is unlikely that I'll be able to isolate a specific one, to a high degree of accuracy, at this time. Furthermore, there are concerns about the external validity of some variables, such as the RSEI score. At this time, I believe that there may be a correlation between the release of certain chemicals into the environment and negative pregnancy outcomes. However, an analysis of these components would require a significant amount of research into both biochemistry and the effects of pollution.
Lastly, I would just like to note what this work means. There is a significant problem in the healthcare system at this time. And while my conclusions may not allow for any immediate progress, they inform further decision making and allow for more discussion on what is relevant. If I were to redo the entire project, knowing what I learned in the work above, I would make many different decisions, and perhaps arrive at a more accurate way of predicting risk factors and outcomes.