Bottled Water pH

by Dane Miller

Here is a quick analysis of bottled drinking water pH, plotted with seaborn. I started by looking up cited material documenting bottled water pH measurements (see chart below). For my analysis I converted the pH for each brand to [H+] and [OH-] (a short conversion sketch follows the definitions below).

 

Brands               pH      [H+] (mol/L)   [OH-] (mol/L)
Coca-Cola            2.24    2.24E-02       2.24E-13
VitaminWater         2.49    2.49E-02       2.49E-13
Gatorade             2.92    2.92E-02       2.92E-13
Ozarka water         5.16    5.16E-05       5.16E-09
Aquafina             5.63    5.63E-05       5.63E-09
Dasani               5.72    5.72E-05       5.72E-09
Nestle Pure Life     6.24    6.24E-06       6.24E-08
Evian                6.89    6.89E-06       6.89E-08
Fiji                 6.90    6.90E-06       6.90E-08
Smart Water          6.91    6.91E-06       6.91E-08
Houston Tap Water    7.29    7.29E-07       7.29E-07
Pasadena Tap Water   7.58    7.58E-07       7.58E-07
Evamor               8.78    8.78E-08       8.78E-06
Essentia             10.38   1.038E-09      1.038E-03

phw.jpg

Here is the article if you would like more information:  http://jdh.adha.org/content/89/suppl_2/6.full.pdf

 

Plotting [H+] and [OH-] against pH gives a clearer picture of the relationship between pH and the two ion concentrations: high pH values correspond to higher hydroxide concentrations, while low pH values correspond to higher hydrogen ion concentrations.

H ([H+]): hydrogen ion concentration

OH ([OH-]): hydroxide ion concentration
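
As an aside, here is a minimal sketch of one way to compute these two columns from pH in Python, assuming the standard relations [H+] = 10^-pH and [OH-] = Kw / [H+] with Kw ≈ 1.0E-14 at 25 °C; the file name and the column names (Brands, pH, H, OH) follow the code block further down in this post.

import pandas as pd

# Convert pH to ion concentrations (standard relations, Kw = 1.0e-14 at 25 degrees C)
df = pd.read_csv('/.../phwater.csv')
df['H'] = 10.0 ** (-df['pH'])    # [H+] in mol/L
df['OH'] = 1.0e-14 / df['H']     # [OH-] in mol/L
print(df[['Brands', 'pH', 'H', 'OH']].head())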

 

pH and [H+] – Hydrogen

download (1).png

pH and [OH-] – Hydroxide

download (3).png

 

 

Python code – bottled water pH

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')
df = pd.read_csv('/.../phwater.csv')

df.head()
g = sns.lmplot(x="OH", y="pH", hue="Brands", data=df)
g.set(ylim=(0,14))

g = sns.lmplot(x="H", y="pH", hue="Brands", data=df)
g.set(ylim=(0,14))

Python code “Old faithful geyser dataset rebooted with Python”

# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')
of = pd.read_csv('/.../oldfaith.csv')
of.info()
of.head()
regr = skl_lm.LinearRegression()

# Linear fit
X = of.wait_time_min.values.reshape(-1,1)
y = of.duration_sec
regr.fit(X, y)

of['pred1'] = regr.predict(X)
of['resid1'] = of.duration_sec - of.pred1

# Quadratic fit: regress duration on wait time and wait time squared
of['wait_time_min_sq'] = of.wait_time_min ** 2
X2 = of[['wait_time_min', 'wait_time_min_sq']].values
regr.fit(X2, y)

of['pred2'] = regr.predict(X2)
of['resid2'] = of.duration_sec - of.pred2
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot
sns.regplot(of.pred1, of.resid1, lowess=True, 
            ax=ax1, line_kws={'color':'r', 'lw':1},
            scatter_kws={'facecolors':'None', 'edgecolors':'k', 'alpha':0.5})
ax1.hlines(0,xmin=ax1.xaxis.get_data_interval()[0],
           xmax=ax1.xaxis.get_data_interval()[1], linestyles='dotted')
ax1.set_title('Residual Plot for Linear Fit')

# Right plot
sns.regplot(of.pred2, of.resid2, lowess=True,
            line_kws={'color':'r', 'lw':1}, ax=ax2,
            scatter_kws={'facecolors':'None', 'edgecolors':'k', 'alpha':0.5})
ax2.hlines(0,xmin=ax2.xaxis.get_data_interval()[0],
           xmax=ax2.xaxis.get_data_interval()[1], linestyles='dotted')
ax2.set_title('Residual Plot for Quadratic Fit')

for ax in fig.axes:
    ax.set_xlabel('Fitted values')
    ax.set_ylabel('Residuals')
# OLS fit of wait time on duration; show the coefficient table
est = smf.ols('wait_time_min ~ duration_sec', of).fit()
est.summary().tables[1]

# Joint plot with a regression line
sns.jointplot(x='wait_time_min', y='duration_sec', data=of, kind='reg')

# KDE joint plot
g = sns.jointplot("wait_time_min", "duration_sec", data=of,
                  kind="kde", space=0, color="g")

# Scatter joint plot with KDE contours overlaid
g = (sns.jointplot("wait_time_min", "duration_sec",
                   data=of, color="k")
     .plot_joint(sns.kdeplot, zorder=0, n_levels=6))

# Scatter joint plot with styled points and rug marginals
g = sns.jointplot("wait_time_min", "duration_sec", data=of,
                  marginal_kws=dict(bins=15, rug=True),
                  annot_kws=dict(stat="r"),
                  s=40, edgecolor="w", linewidth=1)

Old faithful geyser dataset rebooted with Python

by Dane Miller – 4/9/18

Here is a popular dataset on Old Faithful geyser eruptions in Yellowstone, WY. The data come from Weisberg's (2005) Applied Linear Regression. This type of dataset can be extremely useful to National Park Service rangers for predicting eruptions for visiting tourists. I would highly recommend visiting Yellowstone and seeing Old Faithful in person; it is truly amazing!

Source of the data: http://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat

Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley, Problem 1.4.

Yellowstone NPS https://www.nps.gov/yell/planyourvisit/exploreoldfaithful.htm

seaborn.jointplot https://seaborn.pydata.org/generated/seaborn.jointplot.html

This dataset contains only two variables: the duration of the current eruption and the wait time between eruptions.

Let’s look at a theoretical model: μi = β0 + β1xi

μi: expected wait time         xi: duration

Empirical model: ŷi = b0 + b1xi

ŷi: predicted wait time        xi: observed duration

               coef    std err        t    P>|t|    [0.025    0.975]
Intercept    35.0774     1.184   29.630    0.000    32.748    37.407
duration_sec 10.7499     0.325   33.111    0.000    10.111    11.389

Wait time = 35.0774 + 10.7499 × Duration
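
As a quick check of this fitted equation, here is a minimal sketch of using the statsmodels fit (est) from the Python code block above to predict a wait time; the duration value 3.5 is purely illustrative.

# Predict wait time for a new eruption duration using the fitted model `est`
new_obs = pd.DataFrame({'duration_sec': [3.5]})   # 3.5 is an illustrative value
est.predict(new_obs)                              # ≈ 35.0774 + 10.7499 * 3.5 ≈ 72.7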

I was initially introduced to this dataset during a stats course in graduate school. My focus then was to complete the problems as quickly as possible so that I could get back to my graduate research. As a result, I missed some important subtleties in this simple dataset.

This is my rushed graduate-school attempt at the dataset in Microsoft Excel. It looks pretty crappy! What was I thinking!!!

excel.jpg

Plotting the residuals:

The data separate into two groups.

download.png

Here is the same Old Faithful dataset, now plotted using seaborn.jointplot in Python.

22.png
44.png

33.png

Focus your efforts on learning Python or R; it will drastically improve your work. And there you have it: a rebooted Old Faithful dataset plotted with seaborn.jointplot in Python.

 

 

San Francisco Police Department traffic stops 2017 – visuals

Here are the visualizations corresponding to the Python code further down.

Figure 1: Heatmap to show missing data in the data set.

missing data.png
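
As a numeric companion to the heatmap, missing values can also be counted per column; this is a minimal sketch using the same DataFrame df loaded in the code block further down.

# Number of missing values in each column, largest first
df.isnull().sum().sort_values(ascending=False)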

Figure 2: San Francisco Police Department demographics of race description by sex (gender), where M indicates Male, F indicates Female, and U indicates Unidentified. The vast majority of the traffic stops in 2017 involved white males, nearly two times as many as the next race description category.

SFPD demographic chart.png
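
To see the counts behind Figure 2 (for example, to check the roughly two-to-one gap), a simple cross-tabulation works; this is a sketch assuming the column names used in the code block further down.

# Counts of stops by race description and sex, sorted by total
counts = pd.crosstab(df['Race_description'], df['Sex'])
counts.assign(Total=counts.sum(axis=1)).sort_values('Total', ascending=False)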

Figure 3: This is a count plot with seaborn showing the distribution of race description.

histogram of race.png

Figure 4: This is a count plot with seaborn showing the distribution of sex (gender).

sex histogram.png

Figure 5: This is a histogram of age of individuals who had traffic violations in 2017.

age.png

Figure 6: Boxplot of age by race description. The horizontal line inside each box indicates the median.

race and age boxplot.png
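
The medians shown in Figure 6 can be computed directly; a minimal sketch, again assuming the column names from the code block further down.

# Median age within each race description group (the boxplot center lines)
df.groupby('Race_description')['Age'].median().sort_values()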

 

Figure 7: This is another way of displaying the race description data.

race histo.png

Figure 8: Hexbin plot of race description and age. Darker hexagons indicate a higher density of observations.

hex age.png

Figure 9: This is an lmplot created with seaborn, plotting age against time of day, with one panel per race description and colors indicating sex.

lmplot age - race - sex.png


San Francisco Police Department traffic stops data 2017 Python code

http://sanfranciscopolice.org/data#trafficstops

(See file) Stops by Race and Ethnicity – data (2017)

# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')


df = pd.read_csv('/.../sfpd2017.csv')
df.head()
# I renamed the file so that it was easier to load

df.info()
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
# to find missing data in the data set
fig = plt.figure(figsize=(15,9))
fig.suptitle('SFPD demographic chart', fontsize=20)

sns.set_style('whitegrid')
# Figure 2: count of stops by race description, split by sex
sns.countplot(x='Race_description', hue='Sex', data=df, palette='RdBu_r')
# Figure 5: histogram of age
sns.distplot(df['Age'].dropna(), kde=False, color='darkred', bins=50)
# Figures 3 and 4: count plots of race description and of sex
sns.countplot(x='Race_description', data=df)
sns.countplot(x='Sex', data=df)
# Figure 6: boxplot of age by race description
plt.figure(figsize=(12, 7))
sns.boxplot(x='Race_description', y='Age', data=df, palette='winter')
# Figure 7: bar chart of race description counts (.hist() needs numeric data)
df['Race_description'].value_counts().plot(kind='bar', color='green', figsize=(8, 4))
# Figure 8: hexbin demo on random numeric data (new variable so df is not overwritten)
df_rand = pd.DataFrame(np.random.randn(1000, 2), columns=['Race_description', 'Age'])
df_rand.plot.hexbin(x='Race_description', y='Age', gridsize=25, cmap='Oranges')
# Figure 9: age vs. time of day, one panel per race description, colored by sex
sns.lmplot(x='Time_hour', y='Age', data=df, col='Race_description', hue='Sex',
           palette='coolwarm', aspect=0.6, size=8)

Trends in emission data – Python code

# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
plt.style.use('seaborn-white')
# https://catalog.data.gov/dataset/greenhouse-gas-emissions-from-fuel-combustion-million-metric-tons-beginning-1990
df = pd.read_csv('/.../GreenhouseEmissions.csv')  # replace ... with the path to your file

df.head()
# Linear trend in commercial-sector emissions over time
sns.regplot(df.Year, df.Commercial, order=1, ci=None, scatter_kws={'color':'r', 's':9})
plt.xlim(1990, 2016)
plt.ylim(15, 40);
# Regression joint plot of transportation emissions by year
sns.jointplot(x='Year', y='Transportation', data=df, kind='reg')
# Pairwise scatter plots of all numeric columns
sns.pairplot(df)