Dataverse

May 3, 2018

Salmon Hatcheries in Washington State

by Dane Miller

Washington State contains 1,166 dams. The Columbia River has more than 60 dams, including some of the largest in the United States. However, the fish ladders that allow salmon and steelhead to move upstream stop at Chief Joseph Dam in Bridgeport, Washington (47.9953° N, 119.6333° W). As a result, there is roughly a 500-mile stretch of the Columbia River where salmon and steelhead trout from the Pacific Ocean cannot migrate upstream to the headwaters.

Here is a map of the Columbia River watershed.
Columbia River Watershed

In this post I have mapped the salmon hatcheries in Washington State using folium in Python.

Here is the salmon hatchery interactive map.
Washington

Here is a map of the dams in Washington state.

Here is the interactive map by The Northwest Power and Conservation Council.
Dams along the Columbia River

Perhaps one day the Chief Joseph Dam will be removed and salmon can return to the upper section of the Columbia River.

April 28, 2018

Mapping with mapbox compared to folium in python

by Dane Miller

A quick comparison of mapping with mapbox and folium in python. Mapbox is a mapping program based on GeoJSON; it is very easy to use and produces maps very quickly. Folium is a python-based mapping module that requires several dependencies in order to produce a map.

If you need to produce a high-resolution map quickly, go with mapbox using GeoJSON. It will save you hours of work. If your end goal is a statistically modeled map, you will likely need to use python or R.

Click on the links below to see the difference.

GeoJSON map
This map shows the conifer cone species collected across the Western United States. The data was saved in a CSV file of latitude/longitude data.
Cone Map (fixing the link)
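For anyone starting from a CSV of coordinates, here is a minimal sketch of turning the rows into GeoJSON using only the Python standard library. The column names (Location, Latitude, Longitude) and the two sample rows are assumptions for illustration, not the actual cone data.

```python
import csv
import io
import json

# Placeholder rows standing in for the real CSV file; the column names are assumed.
SAMPLE_CSV = """Location,Latitude,Longitude
Kenosha Pass,39.4133,-105.7567
Santa Fe,35.6870,-105.9378
"""

def csv_to_geojson(csv_text):
    """Convert rows with Latitude/Longitude columns to a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude]
                "coordinates": [float(row["Longitude"]), float(row["Latitude"])],
            },
            "properties": {"name": row["Location"]},
        })
    return {"type": "FeatureCollection", "features": features}

geojson = csv_to_geojson(SAMPLE_CSV)
print(json.dumps(geojson, indent=2))
```

The printed text can be saved with a .geojson extension. Note the longitude-first order, which is the reverse of the latitude-first order folium expects.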

folium map (python)
Conifer_Map

April 27, 2018

Gas production

by Dane Miller

In this post I will be comparing the rate of two chemical reactions.

1) Hydrochloric acid and seashell
CaCO3 (s) + 2HCl (aq) –> CaCl2 (aq) + H2O (l) + CO2 (g)

2) Hydrogen peroxide and Yeast
2H2O2 (aq) –(yeast catalyst)–> 2H2O (l) + O2 (g) + heat

Both of these chemical reactions were measured with a Vernier LabQuest O2/CO2 probe. The data was then converted to a CSV file, and I ran some descriptive statistics in jupyter notebook (Python 3.6).

1) Hydrochloric acid and seashell

2) Hydrogen peroxide and Yeast

First import the necessary modules you will be using for the analysis.

# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')

df = pd.read_csv('/.../YeastO2.csv') # load csv file into python
df.info()
df.head()

est = smf.ols('Yeast_O2_ppm ~ Time', df).fit()
est.summary().tables[1]

Yeast and H2O2

               coef  std err       t  P>|t|     [0.025     0.975]
Intercept -467.1246  122.092  -3.826  0.000   -709.382   -224.867
Time        17.2861    0.527  32.779  0.000     16.240     18.333

HCl and Seashell

                coef  std err       t  P>|t|     [0.025     0.975]
Intercept -1575.5714  527.309  -2.988  0.006  -2659.471   -491.672
Time         87.8234    8.379  10.481  0.000     70.599    105.048

We can compare correlations.
                  Time  Yeast_O2_ppm
Time          1.000000      0.956887
Yeast_O2_ppm  0.956887      1.000000

                 Time  HCl_CO2_ppm
Time         1.000000     0.899226
HCl_CO2_ppm  0.899226     1.000000
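To make the correlation numbers less of a black box, here is a rough sketch of the Pearson coefficient written out by hand. The readings in it are made up for illustration and are not the probe data; only the two r values at the end come from the tables above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up readings (not the measured values)
times = [0, 10, 20, 30, 40]
gas_ppm = [5, 180, 330, 540, 690]
r = pearson_r(times, gas_ppm)
print(r)

# Squaring the reported coefficients gives the share of variance
# that a straight line in Time accounts for in each run.
r2_yeast = 0.956887 ** 2
r2_hcl = 0.899226 ** 2
print(r2_yeast, r2_hcl)
```

For a one-predictor linear fit, r squared equals the R-squared of the regression, so the yeast run is the closer of the two to a straight line.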

We can compare the linear regression models in the first two figures above.

regr = skl_lm.LinearRegression()

X = df[['Time']].values # 2-D array; as_matrix() was removed from newer pandas releases
y = df.Yeast_O2_ppm # ran this same code for the HCl dataset

regr.fit(X,y)
print(regr.coef_)
print(regr.intercept_)

Yeast and H2O2
[17.28611823]
-467.1246359930119

HCl and Seashell
[87.8234127]
-1575.571428571428
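As a sanity check on what the regression is doing, here is a minimal sketch of the closed-form least-squares line for a single predictor, followed by a comparison of the two slopes reported above. The fit_line helper and the four test points are illustrative assumptions; only the two slope values come from the output above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)

# Slopes reported above, in ppm per unit of Time
yeast_slope = 17.2861
hcl_slope = 87.8234
slope_ratio = hcl_slope / yeast_slope
print(slope_ratio)
```

The ratio suggests gas built up roughly five times faster in the HCl and seashell run, with the caveat that the two probes measure different gases (CO2 and O2) and the amounts of reactant were not matched.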

April 24, 2018

Folium mapping from CSV file

by Dane Miller

This is field data from graduate school, when I collected cones from different conifer species. Here is a link to that publication:

https://www.sciencedirect.com/science/article/pii/S0033589414000738

This post creates an interactive map from multiple latitude and longitude coordinates. The folium mapping module is very powerful and interactive.

Here is a link to the interactive map. The map allows you to zoom in and scroll over the cloud icons for additional information.
Conifer_Map

import folium
from folium import plugins
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv('/.../gradcone.csv')
df.head()

I set the start point of this map at Kenosha Pass, Colorado. I could have easily put in a different location.

m = folium.Map([39.4133, -105.7567], zoom_start=5)
m
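Instead of hard-coding a start point, the center could be derived from the data. This is a small sketch under the assumption that the coordinates are available as (latitude, longitude) pairs. The three pairs below are placeholders, and a plain average only makes sense when the points do not straddle the 180° meridian.

```python
from statistics import mean

# Placeholder (latitude, longitude) pairs standing in for the CSV rows
sites = [(39.4133, -105.7567), (35.6870, -105.9378), (37.7749, -122.4194)]

# Average latitude and longitude as a simple map center
center = [mean(lat for lat, _ in sites), mean(lon for _, lon in sites)]
print(center)
```

The resulting pair could be passed as the first argument to folium.Map in place of the fixed coordinates.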

Make sure you specify the latitude and longitude columns for each row, and put any additional information you want to show in popup.

for index, row in df.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']], 
                  popup=row['Location'],
                  icon=folium.Icon(icon='cloud')
                 ).add_to(m)
m
m.save('/.../map4.html')
# To display the map inline in jupyter notebook, comment out the m.save line so that m is the last expression in the cell.

If you are interested in digging into folium mapping with python take a look at the links below.
http://folium.readthedocs.io/en/latest/index.html
https://alysivji.github.io/getting-started-with-folium.html

April 23, 2018

Mapping with folium

by Dane Miller

Here is a very easy to use interactive mapping module in python called folium. It is a fast way to make maps, and the maps can be interactive.

Click on the link to open the map.
Santa fe map

Here is some documentation on how to work through and create your own map.
quickstart[1]

In order to create a map you will need to install folium.

https://anaconda.org/conda-forge/folium

I would also suggest installing Ipyleaflet which contains lots of mapping features.

https://anaconda.org/conda-forge/ipyleaflet

import folium

# Assign the map to a name so it can be saved afterwards
map_osm = folium.Map(location=[35.6870, -105.9378],
                     tiles='Stamen Toner',
                     zoom_start=14)
map_osm.save('/.../map3.html')

map_1 = folium.Map(location=[35.6870, -105.9378],
                   zoom_start=12,
                   tiles='Stamen Terrain')
folium.Marker([35.6892, -105.9413], popup='Georgia O Keeffe Museum').add_to(map_1)
folium.Marker([35.6865, -105.9359], popup='Cathedral Basilica of St. Francis of Assisi').add_to(map_1)
folium.Marker([35.6641, -105.9266], popup='Museum of International Folk Art').add_to(map_1)
folium.Marker([35.5889, -106.0775], popup='arroyo de los chamisos trail').add_to(map_1)
folium.Marker([35.6661433, -105.8308525], popup='Thompson Peak Trail, NM').add_to(map_1)

map_1
map_1.save('/.../map3.html')

April 20, 2018

Basemap

by Dane Miller

Here is a step by step method of using matplotlib basemap in python. Before you get started make sure you have installed matplotlib and basemap.

Links:
Matplotlib
https://matplotlib.org/

The process is a lot easier if you are using Anaconda with Jupyter notebook.
https://anaconda.org/conda-forge/matplotlib
https://anaconda.org/anaconda/basemap

Also make sure your jupyter notebook has been updated to the latest version.
http://jupyter.readthedocs.io/en/latest/projects/upgrade-notebook.html

Once you have successfully installed and updated all modules then we can get to the fun stuff! Mapping!!!

Basemap has a lot of features; in this post I am focusing on a couple of simple ones.
https://matplotlib.org/basemap/api/basemap_api.html#module-mpl_toolkits.basemap

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

m = Basemap(projection='mill',
           llcrnrlat = -90,
           llcrnrlon = -180,
           urcrnrlat = 90,
           urcrnrlon = 180, 
           resolution = 'l')

m.drawcoastlines()
m.drawcountries(linewidth=2)
m.drawrivers(color='blue')
m.fillcontinents(color='g', lake_color='blue', alpha=0.5)

plt.title('Basemap of the globe')
plt.show()

Here is our first map of the globe.

# Lambert Conformal map of lower 48 states.
m = Basemap(llcrnrlon=-119,llcrnrlat=20,urcrnrlon=-64,urcrnrlat=49,
            projection='lcc',lat_1=33,lat_2=45,lon_0=-95)

m.drawcoastlines()
m.drawcountries(linewidth=2)
m.drawrivers(color='blue')
m.fillcontinents(color='g', lake_color='blue', alpha=0.5)
m.drawstates()
# m.bluemarble()

plt.title('Basemap of the United States of America')
plt.show()

Map of the lower 48 states in the US. Note that you can see all the states and rivers drawn on the map. This is easily done by using m.drawrivers(color=select a color) and m.drawstates().

# Lambert Conformal map of California
m = Basemap(width=1284000,height=1164000,projection='lcc',lat_1=30.,lat_2=60,
             lat_0=37,lon_0=-120.5,resolution='h',rsphere=6370000.00)

m.drawcoastlines()
m.drawcountries(linewidth=2)
m.drawrivers(color='blue')
m.fillcontinents(color='g', lake_color='blue', alpha=0.5)
m.drawstates()
m.drawcounties()
# m.bluemarble()

plt.title('Basemap of California')
plt.show()

Zooming in even closer to a single state, California.
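The width and height arguments are given in meters, which can be hard to picture. Here is a rough sketch for estimating how many degrees of longitude and latitude a map of a given size covers. It treats the Earth as a sphere and ignores projection distortion, so take it as a ballpark only; the approx_span_degrees helper is my own illustration and not part of basemap.

```python
import math

EARTH_RADIUS_M = 6370000.0  # same sphere radius as the rsphere argument above

def approx_span_degrees(width_m, height_m, center_lat_deg, radius_m=EARTH_RADIUS_M):
    """Rough longitude and latitude span covered by a map of the given size."""
    # North-south: arc length divided by radius gives the angle in radians
    lat_span = math.degrees(height_m / radius_m)
    # East-west: circles of latitude shrink by cos(latitude)
    lon_span = math.degrees(
        width_m / (radius_m * math.cos(math.radians(center_lat_deg)))
    )
    return lon_span, lat_span

lon_span, lat_span = approx_span_degrees(1284000, 1164000, 37)
print(round(lon_span, 1), round(lat_span, 1))
```

Running the numbers the other way helps pick a width and height when zooming in on a smaller region.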

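# Lambert Conformal map of the San Francisco Bay
# Earth radius is about 6,370,000 m; to zoom in, shrink width and height (in meters) instead of enlarging rsphere
m = Basemap(width=128400,height=116400,projection='lcc',lat_1=36,lat_2=38,
             lat_0=37.7749,lon_0=-122.4194,resolution='h',rsphere=6370000.00)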

m.drawcoastlines()
m.drawcountries(linewidth=2)
m.drawrivers(color='blue')
m.fillcontinents(color='tan', lake_color='blue', alpha=0.5)
m.drawcounties(linewidth=1, color='black')
# m.bluemarble()

plt.title('Basemap San Francisco Bay')
plt.show()

And finally, zoomed into the San Francisco Bay.

April 16, 2018

Syrian Civil War

by Dane Miller

This is an analysis of the casualties from the Syrian Civil War between 2011 and 2018. The data can be found at the Violations Documentation Center in Syria (http://www.vdc-sy.info/index.php/en/martyrs) or on Data.World. Looking over this dataset is very depressing; it is hard to take in how many people have lost their lives in the past 6 1/2 years.

Note that the Violations Documentation Center in Syria (http://www.vdc-sy.info/index.php/en/martyrs) is likely underreporting the total number of casualties. PBS Frontline (https://www.pbs.org/wgbh/frontline/article/a-staggering-new-death-toll-for-syrias-war-470000/) reports the total at 470,000 individuals. Human Rights Watch (https://www.hrw.org/world-report/2017/country-chapters/syria) reports 470,000 casualties as well. Another website (http://www.iamsyria.org/death-tolls.html) reports casualties closer to 500,000 individuals.

This analysis looks at the data from the Violations Documentation Center in Syria, which is likely underreporting casualties. There is a terrible human rights tragedy happening in Syria.

Map of Syria with its largest cities for context.

This dataset is a 20.2 MB CSV file containing 211,910 rows of data. I will provide a link to my github page.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')

df = pd.read_csv('/.../Syria.csv') # add your location for your file in ...

Missing data

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Number of casualties by gender and age class. The vast majority of individuals who have died during the 6 year conflict have been adult males.

sns.set_style('whitegrid')
sns.countplot(x='gender',data=df,palette='rainbow')

The vast majority of individuals in this data set died by shooting, shelling, or warplane shelling.

g = sns.factorplot("deathCause", data=df, aspect=1.5, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)

Total number of casualties by province in Syria. Aleppo and the Damascus suburbs have had the highest number of casualties.

g = sns.factorplot("province", data=df, aspect=1.5, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)

There were 3 times as many civilian casualties as non-civilian. The majority of deaths have been civilians in populated and suburban environments.

g = sns.factorplot("status", data=df, aspect=1.5, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
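The count plot shows the split visually; the ratio itself can be checked with a simple tally. This is a sketch on made-up values, and the status labels used here are assumptions about how the column is coded.

```python
from collections import Counter

# Made-up status values; the real labels in the status column may differ
statuses = ["Civilian", "Civilian", "Non-Civilian", "Civilian"]

counts = Counter(statuses)
ratio = counts["Civilian"] / counts["Non-Civilian"]
print(counts, ratio)
```

On the real data the list would come from the status column of the data frame, for example with df['status'].dropna().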

g = sns.factorplot(x='province', hue='gender', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
# factorplot draws on its own figure, so set the title on that figure
g.fig.suptitle('Syria Death Toll', fontsize=20)

This figure looks at the casualties by cause of death and gender/age.

g = sns.factorplot(x='deathCause', hue='gender', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
# factorplot draws on its own figure, so set the title on that figure
g.fig.suptitle('Syria Death Toll', fontsize=20)

This looks at the cause of death by province in Syria.

g = sns.factorplot(x='province', hue='deathCause', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
# factorplot draws on its own figure, so set the title on that figure
g.fig.suptitle('Syria Death Toll', fontsize=20)

The number of casualties by year (2011-2018) by gender and age group. Even though the casualties have decreased, nearly 10,000 individuals are still dying per year. We are only four months into 2018, and I really hope the casualties do not increase.

This figure compares the cause of death by year during the Syrian Civil War. In 2012 and 2013 the vast majority of casualties were from shellings and shootings. Since 2013, there has been an increased use of chemical weapons.

g = sns.factorplot(x='year', hue='deathCause', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
# factorplot draws on its own figure, so set the title on that figure
g.fig.suptitle('Syria Death Toll', fontsize=20)

This figure compares casualties by year in the major provinces in Syria.

g = sns.factorplot(x='year', hue='province', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)
# factorplot draws on its own figure, so set the title on that figure
g.fig.suptitle('Syria Death Toll', fontsize=20)

Distribution of the number of casualties per year in Syria.

sns.distplot(df['year'].dropna(),kde=False,color='darkred',bins=50)

Cause of death sorted by the year.

g = sns.factorplot(x='deathCause', hue='year', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)

The cause of death by province during the Syrian Civil War.

g = sns.factorplot(x='deathCause', hue='province', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)

The number of casualties by gender and age group by province. The vast majority of all casualties have been adult males. However, in Aleppo ~5000 male children have died during attacks since 2011.

g = sns.factorplot(x='province', hue='gender', data=df, aspect=4.0, kind="count", palette='rainbow')
g.set_xticklabels(rotation=90)

Variables in the dataset:

Name – The reported name of the individual killed

Status – The individual’s status as a civilian or non-civilian

Gender – The individual’s gender and age category

Province – The province where death occurred

Birth place – The individual’s place of birth

Death date – The reported date when death occurred.

Death Cause – The category that best describes the proximate cause of death

Actor – The entity linked to the action which resulted in death

Year – (I added year into the data set in order to summarize the data).
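For reference, here is a minimal sketch of how a year value can be pulled out of a date string. The YYYY-MM-DD format and the sample dates are assumptions; the format string would need to match however the death date column is actually written.

```python
from datetime import datetime

def year_from_date(date_text, fmt="%Y-%m-%d"):
    """Parse a date string and return its year as an integer."""
    return datetime.strptime(date_text, fmt).year

# Made-up dates in an assumed format
sample_dates = ["2011-06-01", "2014-11-15", "2018-02-03"]
years = [year_from_date(d) for d in sample_dates]
print(years)
```

With pandas, the same idea is usually written by parsing the column with pd.to_datetime and reading its .dt.year attribute.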

Source of data:

Violations Documentation Center in Syria
http://www.vdc-sy.info/index.php/en/martyrs

Data.World
https://data.world/polymathic/casualties-of-the-syrian-civil-war/workspace/data-dictionary

April 12, 2018

Bottled Water pH

by Dane Miller

Here is a quick analysis of bottled drinking water pH plotted with seaborn. I started by looking up published material documenting bottled water pH analysis (see chart below). For my analysis I converted the pH for each brand to H+ and OH-.

Brands	pH	[H+] (aq), mol/L	[OH-] (aq), mol/L
Coca-Cola	2.24	0.0224	2.24E-13
VitaminWater	2.49	0.0249	2.49E-13
Gatorade	2.92	0.0292	2.92E-13
Ozarka water	5.16	0.0000516	5.16E-09
Aquafina	5.63	0.0000563	5.63E-09
Dasani	5.72	0.0000572	5.72E-09
Nestle Pure Life	6.24	0.00000624	6.24E-08
Evian	6.89	0.00000689	6.89E-08
Fiji	6.9	0.0000069	0.000000069
Smart Water	6.91	0.00000691	6.91E-08
Houston Tap Water	7.29	0.000000729	0.000000729
Pasadena Tap Water	7.58	0.000000758	0.000000758
Evamor	8.78	8.78E-08	0.00000878
Essentia	10.38	1.038E-09	0.001038

Here is the article if you would like more information: http://jdh.adha.org/content/89/suppl_2/6.full.pdf

Plotting [H+] and [OH-] against pH shows us a clearer picture of the relationship between pH and H/OH. High pH values are associated with higher OH values, and low pH values are associated with higher H values.

H: Hydrogen ion concentration

OH: Hydroxide ion concentration
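For comparison, here is a short sketch of the textbook conversion, where pH is defined as -log10 of the hydrogen ion concentration and pH + pOH = 14 for water at about 25 °C. The function names are mine. Values computed this way need not match the table above, which may have been derived differently; for example, pH 2.24 works out to roughly 0.0058 mol/L of H+ here.

```python
PH_PLUS_POH = 14.0  # holds for water at about 25 °C

def h_concentration(ph):
    """Hydrogen ion concentration in mol/L from pH = -log10([H+])."""
    return 10 ** (-ph)

def oh_concentration(ph):
    """Hydroxide ion concentration in mol/L from pOH = 14 - pH."""
    return 10 ** (ph - PH_PLUS_POH)

# pH values taken from three rows of the table above
for brand, ph in [("Coca-Cola", 2.24), ("Evian", 6.89), ("Essentia", 10.38)]:
    print(brand, ph, h_concentration(ph), oh_concentration(ph))
```

Because the relationship is logarithmic, each step of one pH unit changes both concentrations by a factor of ten.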

pH and [H+] – Hydrogen


pH and [OH-] Hydroxide


April 12, 2018

Python code – bottled water pH

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')

df = pd.read_csv('/.../phwater.csv')

df.head()

g = sns.lmplot(x="OH", y="pH", hue="Brands", data=df)
g.set(ylim=(0,14))

g = sns.lmplot(x="H", y="pH", hue="Brands", data=df)
g.set(ylim=(0,14))

April 10, 2018

Python code “Old faithful geyser dataset rebooted with Python”

# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')

of = pd.read_csv('/.../oldfaith.csv')
of.info()
of.head()

regr = skl_lm.LinearRegression()

# Linear fit
X = of.wait_time_min.values.reshape(-1,1)
y = of.duration_sec
regr.fit(X, y)

of['pred1'] = regr.predict(X)
of['resid1'] = of.duration_sec - of.pred1

# Quadratic fit
of['wait_time_min2'] = of.wait_time_min ** 2  # squared term needed for a quadratic fit
X2 = of[['wait_time_min', 'wait_time_min2']].values
regr.fit(X2, y)

of['pred2'] = regr.predict(X2)
of['resid2'] = of.duration_sec - of.pred2

fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot
sns.regplot(x=of.pred1, y=of.resid1, lowess=True,
            ax=ax1, line_kws={'color':'r', 'lw':1},
            scatter_kws={'facecolors':'None', 'edgecolors':'k', 'alpha':0.5})
ax1.hlines(0,xmin=ax1.xaxis.get_data_interval()[0],
           xmax=ax1.xaxis.get_data_interval()[1], linestyles='dotted')
ax1.set_title('Residual Plot for Linear Fit')

# Right plot
sns.regplot(x=of.pred2, y=of.resid2, lowess=True,
            line_kws={'color':'r', 'lw':1}, ax=ax2,
            scatter_kws={'facecolors':'None', 'edgecolors':'k', 'alpha':0.5})
ax2.hlines(0,xmin=ax2.xaxis.get_data_interval()[0],
           xmax=ax2.xaxis.get_data_interval()[1], linestyles='dotted')
ax2.set_title('Residual Plot for Quadratic Fit')

for ax in fig.axes:
    ax.set_xlabel('Fitted values')
    ax.set_ylabel('Residuals')

est = smf.ols('wait_time_min ~ duration_sec', of).fit()
est.summary().tables[1]

sns.jointplot(x='wait_time_min',y='duration_sec',data=of,kind='reg')

g = sns.jointplot(x="wait_time_min", y="duration_sec", data=of,
                  kind="kde", space=0, color="g")

g = (sns.jointplot(x="wait_time_min", y="duration_sec",
                   data=of, color="k")
        .plot_joint(sns.kdeplot, zorder=0, n_levels=6))

g = sns.jointplot(x="wait_time_min", y="duration_sec", data=of,
                  marginal_kws=dict(bins=15, rug=True),
                  annot_kws=dict(stat="r"),
                  s=40, edgecolor="w", linewidth=1)