Loading IMPROVE Data

In this tutorial we will download, read, and explore data from the IMPROVE aerosol network (http://vista.cira.colostate.edu/Improve/).

To do this we first need to download the data from the colostate.edu website. There are a few format requirements for the file to be compatible with MONET. First, go to http://views.cira.colostate.edu/fed/DataWizard/Default.aspx and make the following selections:

  • Select the Raw data

  • Set the dataset to “IMPROVE aerosol”

  • Select any number of sites

  • Select your parameters

  • Select the dates in question

  • Fields

    • Dataset (required)

    • Site (required)

    • POC (required)

    • Date (required)

    • Parameter (required)

    • Data Value (required)

    • Latitude (required)

    • Longitude (required)

    • State (optional)

    • EPACode (optional)

  • Options

    • skinny format

    • delimited (note you will need to pass the delimiter on to the read command, ‘,’ by default; see the sketch after this list)
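Before handing the file to MONET, it can be useful to peek at the download with pandas to confirm the delimiter and fields came through as expected. This is a minimal sketch; the filename is a placeholder for whatever name the Data Wizard generated for you:

import pandas as pd

# Quick sanity check of the downloaded file (filename is hypothetical).
# The delimiter must match the one chosen in the Data Wizard above.
raw = pd.read_csv('improve_download.txt', delimiter=',')
print(raw.columns.tolist())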

After downloading we can read the data. Here we included the EPACode and State fields so that we can attach additional metadata stored in the EPA auxiliary files used by the EPA AQS and AirNow datasets in MONET. Let’s make a few imports from MONET and some other libraries that will aid us later.

from monet.obs import improve
import matplotlib.pyplot as plt
from monet.util import tools
from monet.plots import *
import pandas as pd

Now we will load the data (in this case the file is ‘20186258419212M0xuxu.txt’)

df = improve.add_data('/Users/barry/Desktop/20186258419212M0xuxu.txt', add_meta=False, delimiter=',')

# Negative concentrations are not physical, so mask them as missing.
# Assigning through .loc avoids pandas' SettingWithCopyWarning.
import numpy as np
df.loc[df.obs < 0, 'obs'] = np.nan

Let’s look at the dataframe.

df.head(20)
siteid poc time variable obs unc mdl units latitude longitude elevation state_name epaid
0 ACAD1 1 2016-01-01 ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 ACAD1 1 2016-01-01 ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 ACAD1 1 2016-01-01 BRf 0.00050 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 ACAD1 1 2016-01-01 CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 ACAD1 1 2016-01-01 CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
5 ACAD1 1 2016-01-01 CLf 0.00085 0.00051 0.00081 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
6 ACAD1 1 2016-01-01 CRf 0.00015 0.00010 0.00015 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
7 ACAD1 1 2016-01-01 CUf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
8 ACAD1 1 2016-01-01 ECf 0.11758 0.01481 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
9 ACAD1 1 2016-01-01 EC1f 0.15773 0.01700 0.01270 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
10 ACAD1 1 2016-01-01 EC2f 0.08037 0.01439 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
11 ACAD1 1 2016-01-01 EC3f NaN 0.00450 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
12 ACAD1 1 2016-01-01 FEf 0.00397 0.00088 0.00139 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
13 ACAD1 1 2016-01-01 Kf 0.01480 0.00069 0.00087 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
14 ACAD1 1 2016-01-01 MF 1.38909 0.16326 0.31570 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
15 ACAD1 1 2016-01-01 MGf NaN 0.00207 0.00340 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
16 ACAD1 1 2016-01-01 MNf NaN 0.00020 0.00033 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
17 ACAD1 1 2016-01-01 MT 2.80114 0.22824 0.42441 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
18 ACAD1 1 2016-01-01 N2f NaN 0.02791 0.05438 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
19 ACAD1 1 2016-01-01 NAf NaN 0.00257 0.00412 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

Now this is in the long format, with one row per site, time, and variable. Let’s use the monet.util.tools.long_to_wide utility to reformat the dataframe into a wide format, with one column per variable.

from monet.util import tools
df1 = tools.long_to_wide(df)
df1.head()
time siteid ALf ASf BRf CAf CHLf CLf CM_calculated CRf ... variable obs unc mdl units latitude longitude elevation state_name epaid
0 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... BRf 0.0005 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

5 rows × 65 columns
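Conceptually, long_to_wide pivots the unique values of the variable column into their own columns, keyed by time and site. A rough sketch of the idea with plain pandas (the actual MONET utility also merges the original long-format columns back in, which is why the result above has 65 columns):

# Hypothetical equivalent of the core pivot step
wide = df.pivot_table(index=['time', 'siteid'],
                      columns='variable',
                      values='obs').reset_index()
wide.head()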

Let’s now plot some of the different measurements with time from a single site. In this case we will look at the ACAD1 site in Acadia National Park, Maine.
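If you downloaded more than one site, you can check which site IDs are present before selecting one:

# List the unique site identifiers in the dataframe
df1.siteid.unique()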

acad1 = df1.loc[df1.siteid == 'ACAD1']
acad1.head()
time siteid ALf ASf BRf CAf CHLf CLf CM_calculated CRf ... variable obs unc mdl units latitude longitude elevation state_name epaid
0 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... BRf 0.0005 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

5 rows × 65 columns

Trend Analysis

Let’s look at SIf as an example from ACAD1.

# the resampling and time plots below need a DatetimeIndex, so index by time
acad1.index = acad1.time

acad1.plot(y='SIf', figsize=(14, 5))
[Figure: time series of SIf at ACAD1]

Now this is good, but let’s resample to see if we can see a trend.

ax = acad1.resample('W').mean().plot(y='SIf', figsize=(14, 5), label='weekly')
ax = acad1.resample('M').mean().plot(y='SIf', ax=ax, label='monthly')
[Figure: weekly and monthly resampled SIf at ACAD1]

Simply resampling is fine, but let’s try to extract a smoother signal using a Kolmogorov–Zurbenko (KZ) filter, i.e. an iterated moving average. See https://doi.org/10.1080/10473289.2005.10464718 for more information.

q = acad1.SIf.copy()
# repeatedly applying a centered rolling mean acts as a low-pass filter,
# leaving only the slowly varying part of the signal
for _ in range(1000):
    q = q.rolling(10, min_periods=1, win_type='triang', center=True).mean()

ax = acad1.resample('M').mean().plot(y='SIf', figsize=(14, 4), label='monthly')
q.resample('M').mean().plot(ax=ax, label='KZ Filter')
plt.legend()
[Figure: monthly mean SIf with KZ-filtered signal overlaid]
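For reference, the textbook KZ(m, k) filter is simply k passes of an m-point centered simple moving average; the loop above is a variant that uses a triangular window. A minimal reusable sketch (the window and iteration counts below are illustrative, not taken from the paper):

def kz_filter(series, window, iterations):
    """Kolmogorov-Zurbenko filter: repeated centered moving averages."""
    for _ in range(iterations):
        series = series.rolling(window, min_periods=1, center=True).mean()
    return series

smoothed = kz_filter(acad1.SIf, window=15, iterations=3)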

KMeans Clustering using scikit-learn

Clustering algorithms can be very useful for finding signals within data. As an example we are going to use the KMeans algorithm from scikit-learn (http://scikit-learn.org/stable/modules/clustering.html#k-means) to analyze dust signals in the IMPROVE data. First we need to import some tools from sklearn to use in our analysis.

from sklearn.preprocessing import RobustScaler # to scale our data
from sklearn.cluster import KMeans # clustering algorithm

First we want to separate out the variables that may be useful for identifying dust: silicon (SIf), PM2.5 mass (MF), PM10 mass (MT), iron (FEf), and fine soil (SOILf). KMeans cannot handle missing values, so we will also drop any rows containing NaN.

dfkm = df1[['SIf','MF','MT','FEf','SOILf']].dropna()
dfkm.head()
SIf MF MT FEf SOILf
0 0.00553 1.38909 2.80114 0.00397 0.028611
1 0.00553 1.38909 2.80114 0.00397 0.028611
2 0.00553 1.38909 2.80114 0.00397 0.028611
3 0.00553 1.38909 2.80114 0.00397 0.028611
4 0.00553 1.38909 2.80114 0.00397 0.028611
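Note that the head above shows the same values repeated on every row: long_to_wide keeps one wide row per original long-format record, so each sampling time appears once per measured variable. If you would rather cluster on one row per sample, dropping duplicates is a simple optional step:

# Optional: keep a single row per sampling record
dfkm = dfkm.drop_duplicates()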

Usually with sklearn it is better to scale the data before putting it through the algorithm. We will use the RobustScaler to do this.

X_scaled = RobustScaler().fit(dfkm).transform(dfkm)
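With its default settings, RobustScaler centers each column on its median and scales by the interquartile range, which keeps the extreme concentrations typical of aerosol data from dominating the scaling. Roughly equivalent by hand:

# Approximately what RobustScaler does with default settings
med = dfkm.median()
iqr = dfkm.quantile(0.75) - dfkm.quantile(0.25)
X_manual = ((dfkm - med) / iqr).values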

Now we will define our clustering algorithm to have 2 clusters. Treat this as a starting point for further analysis; one common way to choose the number of clusters is sketched below.

km = KMeans(n_clusters=2).fit(X_scaled)
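A common diagnostic for choosing n_clusters is the elbow method: fit KMeans over a range of k and look for where the within-cluster sum of squares (inertia_) stops dropping sharply. A minimal sketch:

# Elbow plot: inertia vs. number of clusters
ks = range(1, 8)
inertias = [KMeans(n_clusters=k).fit(X_scaled).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('inertia')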

The clusters can be found under km.labels_. These are integers representing the different clusters.

clusters = km.labels_
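A quick check on how the samples split between the clusters:

# Count how many samples landed in each cluster
import numpy as np
labels, counts = np.unique(clusters, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))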

Let’s plot this so that we can see where there is dust.

plt.figure(figsize=(10, 6))
plt.scatter(dfkm.SOILf, dfkm.MT, c=clusters, edgecolor='k')
plt.xlabel('SOILf')
plt.ylabel('PM10')
[Figure: scatter of SOILf vs. PM10 (MT) colored by KMeans cluster]