Loading IMPROVE Data

In this tutorial we will download, read, and explore data from the IMPROVE aerosol network (http://vista.cira.colostate.edu/Improve/).

To do this we first need to download the data from the colostate.edu website. There are a few format requirements for the file to be compatible with MONET. First, go to http://views.cira.colostate.edu/fed/DataWizard/Default.aspx and make the following selections:

  • Select the Raw data

  • Set the dataset to “IMPROVE aerosol”

  • Select any number of sites

  • Select your parameters

  • Select the dates in question

  • Fields

    • Dataset (required)

    • Site (required)

    • POC (required)

    • Date (required)

    • Parameter (required)

    • Data Value (required)

    • Latitude (required)

    • Longitude (required)

    • State (optional)

    • EPACode (optional)

  • Options

    • skinny format

    • delimited (note you will need to pass the delimiter on to the read command, ‘,’ by default; see the sketch after this list)
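Before handing the file to MONET, it can be useful to peek at the download with pandas to confirm the delimiter and fields came through as expected. This is a minimal sketch; the filename is a placeholder for whatever name the Data Wizard generated for you:

import pandas as pd

# Quick sanity check of the downloaded file (filename is hypothetical).
# The delimiter must match the one chosen in the Data Wizard above.
raw = pd.read_csv('improve_download.txt', delimiter=',')
print(raw.columns.tolist())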

After downloading we can read the data. Here we included the EPACode and State fields so that we can attach additional metadata stored in the EPA auxiliary files used by the EPA AQS and AirNow datasets in MONET. Let’s make a few imports from MONET and some other libraries that will aid us later.

from monet.obs import improve
import matplotlib.pyplot as plt
from monet.util import tools
from monet.plots import *
import pandas as pd

Now we will load the data (in this case the file is ‘20186258419212M0xuxu.txt’)

df = improve.add_data('/Users/barry/Desktop/20186258419212M0xuxu.txt', add_meta=False, delimiter=',')

# Negative concentrations are not physical, so mask them as missing.
# Assigning through .loc avoids pandas' SettingWithCopyWarning.
import numpy as np
df.loc[df.obs < 0, 'obs'] = np.nan

Let’s look at the dataframe.

df.head(20)
siteid poc time variable obs unc mdl units latitude longitude elevation state_name epaid
0 ACAD1 1 2016-01-01 ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 ACAD1 1 2016-01-01 ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 ACAD1 1 2016-01-01 BRf 0.00050 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 ACAD1 1 2016-01-01 CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 ACAD1 1 2016-01-01 CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
5 ACAD1 1 2016-01-01 CLf 0.00085 0.00051 0.00081 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
6 ACAD1 1 2016-01-01 CRf 0.00015 0.00010 0.00015 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
7 ACAD1 1 2016-01-01 CUf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
8 ACAD1 1 2016-01-01 ECf 0.11758 0.01481 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
9 ACAD1 1 2016-01-01 EC1f 0.15773 0.01700 0.01270 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
10 ACAD1 1 2016-01-01 EC2f 0.08037 0.01439 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
11 ACAD1 1 2016-01-01 EC3f NaN 0.00450 0.00900 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
12 ACAD1 1 2016-01-01 FEf 0.00397 0.00088 0.00139 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
13 ACAD1 1 2016-01-01 Kf 0.01480 0.00069 0.00087 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
14 ACAD1 1 2016-01-01 MF 1.38909 0.16326 0.31570 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
15 ACAD1 1 2016-01-01 MGf NaN 0.00207 0.00340 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
16 ACAD1 1 2016-01-01 MNf NaN 0.00020 0.00033 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
17 ACAD1 1 2016-01-01 MT 2.80114 0.22824 0.42441 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
18 ACAD1 1 2016-01-01 N2f NaN 0.02791 0.05438 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
19 ACAD1 1 2016-01-01 NAf NaN 0.00257 0.00412 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

Now this is in the long format, with one row per site, time, and variable. Let’s use the monet.util.tools.long_to_wide utility to reformat the dataframe into a wide format, with one column per variable.

from monet.util import tools
df1 = tools.long_to_wide(df)
df1.head()
time siteid ALf ASf BRf CAf CHLf CLf CM_calculated CRf ... variable obs unc mdl units latitude longitude elevation state_name epaid
0 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... BRf 0.0005 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

5 rows × 65 columns
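Conceptually, long_to_wide pivots the unique values of the variable column into their own columns, keyed by time and site. A rough sketch of the idea with plain pandas (the actual MONET utility also merges the original long-format columns back in, which is why the result above has 65 columns):

# Hypothetical equivalent of the core pivot step
wide = df.pivot_table(index=['time', 'siteid'],
                      columns='variable',
                      values='obs').reset_index()
wide.head()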

Let’s now plot some of the different measurements with time from a single site. In this case we will look at the ACAD1 site in Acadia National Park, Maine.
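If you downloaded more than one site, you can check which site IDs are present before selecting one:

# List the unique site identifiers in the dataframe
df1.siteid.unique()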

acad1 = df1.loc[df1.siteid == 'ACAD1']
acad1.head()
time siteid ALf ASf BRf CAf CHLf CLf CM_calculated CRf ... variable obs unc mdl units latitude longitude elevation state_name epaid
0 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ALf NaN 0.00121 0.00199 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
1 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... ASf NaN 0.00014 0.00022 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
2 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... BRf 0.0005 0.00011 0.00016 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
3 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CAf NaN 0.00143 0.00234 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103
4 2016-01-01 ACAD1 NaN NaN 0.0005 NaN NaN 0.00085 1.41205 0.00015 ... CHLf NaN 0.00376 0.00744 ug/m^3 LC 44.3771 -68.261 157.3333 ME 230090103

5 rows × 65 columns

Trend Analysis

Let’s look at SIf as an example from ACAD1.

# the resampling and time plots below need a DatetimeIndex, so index by time
acad1.index = acad1.time

acad1.plot(y='SIf', figsize=(14, 5))
[Figure: time series of SIf at ACAD1]

Now this is good, but let’s resample to see if we can see a trend.

ax = acad1.resample('W').mean().plot(y='SIf', figsize=(14, 5), label='weekly')
ax = acad1.resample('M').mean().plot(y='SIf', ax=ax, label='monthly')
[Figure: weekly and monthly resampled SIf at ACAD1]

Simply resampling is fine, but let’s try to extract a smoother signal using a Kolmogorov–Zurbenko (KZ) filter, i.e. an iterated moving average. See https://doi.org/10.1080/10473289.2005.10464718 for more information.

q = acad1.SIf.copy()
# repeatedly applying a centered rolling mean acts as a low-pass filter,
# leaving only the slowly varying part of the signal
for _ in range(1000):
    q = q.rolling(10, min_periods=1, win_type='triang', center=True).mean()

ax = acad1.resample('M').mean().plot(y='SIf', figsize=(14, 4), label='monthly')
q.resample('M').mean().plot(ax=ax, label='KZ Filter')
plt.legend()
[Figure: monthly mean SIf with KZ-filtered signal overlaid]
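For reference, the textbook KZ(m, k) filter is simply k passes of an m-point centered simple moving average; the loop above is a variant that uses a triangular window. A minimal reusable sketch (the window and iteration counts below are illustrative, not taken from the paper):

def kz_filter(series, window, iterations):
    """Kolmogorov-Zurbenko filter: repeated centered moving averages."""
    for _ in range(iterations):
        series = series.rolling(window, min_periods=1, center=True).mean()
    return series

smoothed = kz_filter(acad1.SIf, window=15, iterations=3)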

KMeans Clustering using scikit-learn

Clustering algorithms can be very useful for finding signals within data. As an example we are going to use the KMeans algorithm from scikit-learn (http://scikit-learn.org/stable/modules/clustering.html#k-means) to analyze dust signals in the IMPROVE data. First we need to import some tools from sklearn to use in our analysis.

from sklearn.preprocessing import RobustScaler # to scale our data
from sklearn.cluster import KMeans # clustering algorithm

First we want to separate out the variables that may be useful for identifying dust: silicon (SIf), PM2.5 mass (MF), PM10 mass (MT), iron (FEf), and fine soil (SOILf). KMeans cannot handle missing values, so we will also drop any rows containing NaN.

dfkm = df1[['SIf','MF','MT','FEf','SOILf']].dropna()
dfkm.head()
SIf MF MT FEf SOILf
0 0.00553 1.38909 2.80114 0.00397 0.028611
1 0.00553 1.38909 2.80114 0.00397 0.028611
2 0.00553 1.38909 2.80114 0.00397 0.028611
3 0.00553 1.38909 2.80114 0.00397 0.028611
4 0.00553 1.38909 2.80114 0.00397 0.028611
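Note that the head above shows the same values repeated on every row: long_to_wide keeps one wide row per original long-format record, so each sampling time appears once per measured variable. If you would rather cluster on one row per sample, dropping duplicates is a simple optional step:

# Optional: keep a single row per sampling record
dfkm = dfkm.drop_duplicates()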

Usually with sklearn it is better to scale the data before putting it through the algorithm. We will use the RobustScaler to do this.

X_scaled = RobustScaler().fit(dfkm).transform(dfkm)
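With its default settings, RobustScaler centers each column on its median and scales by the interquartile range, which keeps the extreme concentrations typical of aerosol data from dominating the scaling. Roughly equivalent by hand:

# Approximately what RobustScaler does with default settings
med = dfkm.median()
iqr = dfkm.quantile(0.75) - dfkm.quantile(0.25)
X_manual = ((dfkm - med) / iqr).values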

Now we will define our clustering algorithm to have 2 clusters. Treat this as a starting point for further analysis; one common way to choose the number of clusters is sketched below.

km = KMeans(n_clusters=2).fit(X_scaled)
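A common diagnostic for choosing n_clusters is the elbow method: fit KMeans over a range of k and look for where the within-cluster sum of squares (inertia_) stops dropping sharply. A minimal sketch:

# Elbow plot: inertia vs. number of clusters
ks = range(1, 8)
inertias = [KMeans(n_clusters=k).fit(X_scaled).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('inertia')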

The clusters can be found under km.labels_. These are integers representing the different clusters.

clusters = km.labels_
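A quick check on how the samples split between the clusters:

# Count how many samples landed in each cluster
import numpy as np
labels, counts = np.unique(clusters, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))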

Let’s plot this so that we can see where there is dust.

plt.figure(figsize=(10, 6))
plt.scatter(dfkm.SOILf, dfkm.MT, c=clusters, edgecolor='k')
plt.xlabel('SOILf')
plt.ylabel('PM10')
[Figure: scatter of SOILf vs. PM10 (MT) colored by KMeans cluster]