How to read AQS PAMS data and do a quick analysis

This first tutorial for MONET loads data from the EPA Air Quality System (AQS), specifically the PAMS dataset, and shows how to reshape the dataframe and do a statistical analysis with different meteorological and composition data. For more examples, please see https://monet-arl.readthedocs.io/en/develop/; this tutorial will also be available on the GitHub page and on Read the Docs.

We will begin by importing monetio and a few helper libraries for later use.

import numpy as np                  # numerical arrays
import pandas as pd                 # dataframes
import monetio as mio               # observation data readers for MONET
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # better color palettes
import cartopy.crs as ccrs          # map projections
import cartopy.feature as cfeature  # political and geographic features
sns.set_style('ticks')              # plot style configuration
sns.set_context("talk")             # plot context for text size

Now that we have all the imports we need, let's load some data. Most of the PAMS data is reported daily, so let's add the kwarg daily=True to the call. We will retrieve data for the year 2004. Some variables that may be valuable are the VOCs, ozone, NO2, NOx, and temperature. For all of the available measurements, please see https://aqs.epa.gov/aqsweb/airdata/download_files.html

dates = pd.date_range(start='2004-01-01', end='2004-12-31')
df = mio.aqs.add_data(dates, daily=True, param=['VOC', 'OZONE'], download=True)
 Retrieving: daily_VOCS_2004.zip
https://aqs.epa.gov/aqsweb/airdata/daily_VOCS_2004.zip
 Retrieving: daily_44201_2004.zip
https://aqs.epa.gov/aqsweb/airdata/daily_44201_2004.zip
[########################################] | 100% Completed | 34.8s

Now we have the data returned in df. Let's take a look at it.

df.head()
time_local state_code county_code site_num parameter_code poc latitude longitude datum parameter_name ... first_year_of_data gmt_offset land_use location_setting monitor_type msa_code networks state_name tribe_name time
0 2004-05-15 04 013 4003 43000 10 33.40316 -112.07533 WGS84 Sum of PAMS target compounds ... NaN -7.0 RESIDENTIAL URBAN AND CENTER CITY OTHER NaN NaN AZ NaN 2004-05-15 07:00:00
1 2004-05-21 04 013 4003 43000 10 33.40316 -112.07533 WGS84 Sum of PAMS target compounds ... NaN -7.0 RESIDENTIAL URBAN AND CENTER CITY OTHER NaN NaN AZ NaN 2004-05-21 07:00:00
2 2004-05-27 04 013 4003 43000 10 33.40316 -112.07533 WGS84 Sum of PAMS target compounds ... NaN -7.0 RESIDENTIAL URBAN AND CENTER CITY OTHER NaN NaN AZ NaN 2004-05-27 07:00:00
3 2004-06-02 04 013 4003 43000 10 33.40316 -112.07533 WGS84 Sum of PAMS target compounds ... NaN -7.0 RESIDENTIAL URBAN AND CENTER CITY OTHER NaN NaN AZ NaN 2004-06-02 07:00:00
4 2004-06-08 04 013 4003 43000 10 33.40316 -112.07533 WGS84 Sum of PAMS target compounds ... NaN -7.0 RESIDENTIAL URBAN AND CENTER CITY OTHER NaN NaN AZ NaN 2004-06-08 07:00:00

5 rows × 46 columns

Notice that this printed format hides some of the dataframe columns from view. Let's see what they are!

for i in sorted(df.columns):  # loop over the sorted columns and print them
    print(i)
1st_max_hour
1st_max_value
address
airnow_flag
aqi
cbsa_name
city_name
cmsa_name
collecting_agency
county_code
county_name
date_of_last_change
datum
elevation
epa_region
event_type
first_year_of_data
gmt_offset
land_use
latitude
local_site_name
location_setting
longitude
method_code
method_name
monitor_type
msa_code
msa_name
networks
obs
observation_count
observation_percent
parameter_code
parameter_name
poc
pollutant_standard
sample_duration
site_num
siteid
state_code
state_name
time
time_local
tribe_name
units
variable

We have lots of columns, but this is the long format (the data is stacked by variable). Data analysis is often easier in a wide format, so let's use a utility function in MONET to reshape the dataframe.

from monet.util import tools
new = tools.long_to_wide(df)
new.head()
time siteid 1,1,2,2-TETRACHLOROETHANE 1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE 1,1,2-TRICHLOROETHANE 1,1-DICHLOROETHANE 1,1-DICHLOROETHYLENE 1,2,3-TRIMETHYLBENZENE 1,2,4-TRICHLOROBENZENE 1,2,4-TRIMETHYLBENZENE ... epa_region first_year_of_data gmt_offset land_use location_setting monitor_type msa_code networks state_name tribe_name
0 2004-01-01 05:00:00 090031003 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2002.0 -5.0 RESIDENTIAL SUBURBAN NaN NaN NaN CT NaN
1 2004-01-01 05:00:00 100031007 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2003.0 -5.0 AGRICULTURAL RURAL OTHER NaN NaN DE NaN
2 2004-01-01 05:00:00 100031013 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2003.0 -5.0 RESIDENTIAL SUBURBAN SLAMS NaN NaN DE NaN
3 2004-01-01 05:00:00 110010025 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1980.0 -5.0 COMMERCIAL URBAN AND CENTER CITY NaN NaN NaN District Of Columbia NaN
4 2004-01-01 05:00:00 110010041 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1993.0 -5.0 RESIDENTIAL URBAN AND CENTER CITY NaN NaN NaN District Of Columbia NaN

5 rows × 157 columns
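
For reference, long_to_wide is essentially a pivot on the species name. Below is a minimal sketch of the same reshape in plain pandas, assuming the long-format value column is named obs and the species column is named variable (both appear in the column list above); it is not the MONET internals.

# approximate long-to-wide reshape with plain pandas (sketch only)
wide = df.pivot_table(index=['time', 'siteid'],  # one row per time/site pair
                      columns='variable',        # one column per species
                      values='obs').reset_index()

Note that pivot_table averages duplicate time/site/species entries, whereas the MONET utility also carries along the site metadata columns.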

Let's see how many ISOPRENE sites there are. We will drop the NaN values in the ISOPRENE column, then find the unique siteids and look at the shape of the result.

new.dropna(subset=['ISOPRENE']).siteid.unique().shape
(140,)
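
The same pattern works for any species. A small hypothetical helper (not part of MONET) makes it reusable:

def n_sites(wide_df, species):
    """Count the unique sites reporting a given species column."""
    return wide_df.dropna(subset=[species]).siteid.nunique()

n_sites(new, 'ISOPRENE')  # 140 for this dataset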

As you can see, we now have lots of columns, sorted by time and siteid. But what measurements are included in the dataframe? Let's print all of the new columns generated from pivoting the table.

for i in sorted(new.columns):
    print(i)
1,1,2,2-TETRACHLOROETHANE
1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE
1,1,2-TRICHLOROETHANE
1,1-DICHLOROETHANE
1,1-DICHLOROETHYLENE
1,2,3-TRIMETHYLBENZENE
1,2,4-TRICHLOROBENZENE
1,2,4-TRIMETHYLBENZENE
1,2-DICHLOROBENZENE
1,2-DICHLOROPROPANE
1,3,5-TRIMETHYLBENZENE
1,3-BUTADIENE
1,3-DICHLOROBENZENE
1,4-DICHLOROBENZENE
1-BUTENE
1-PENTENE
1st_max_hour
1st_max_value
2,2,4-TRIMETHYLPENTANE
2,2-DIMETHYLBUTANE
2,3,4-TRIMETHYLPENTANE
2,3-DIMETHYLBUTANE
2,3-DIMETHYLPENTANE
2,4-DIMETHYLPENTANE
2-METHYLHEPTANE
2-METHYLHEXANE
2-METHYLPENTANE
3-CHLOROPROPENE
3-METHYLHEPTANE
3-METHYLHEXANE
3-METHYLPENTANE
ACETALDEHYDE
ACETONE
ACETONITRILE
ACETYLENE
ACROLEIN - UNVERIFIED
ACRYLONITRILE
BENZENE
BENZYL CHLORIDE
BROMOCHLOROMETHANE
BROMODICHLOROMETHANE
BROMOFORM
BROMOMETHANE
CARBON DISULFIDE
CARBON TETRACHLORIDE
CHLOROBENZENE
CHLOROETHANE
CHLOROFORM
CHLOROMETHANE
CHLOROPRENE
CIS-1,2-DICHLOROETHENE
CIS-1,3-DICHLOROPROPENE
CIS-2-BUTENE
CIS-2-PENTENE
CYCLOHEXANE
CYCLOPENTANE
DIBROMOCHLOROMETHANE
DICHLORODIFLUOROMETHANE
DICHLOROMETHANE
ETHANE
ETHYL ACRYLATE
ETHYLBENZENE
ETHYLENE
ETHYLENE DIBROMIDE
ETHYLENE DICHLORIDE
FORMALDEHYDE
FREON 113
FREON 114
HEXACHLOROBUTADIENE
ISOBUTANE
ISOPENTANE
ISOPRENE
ISOPROPYLBENZENE
M-DIETHYLBENZENE
M-ETHYLTOLUENE
M/P XYLENE
METHYL CHLOROFORM
METHYL ETHYL KETONE
METHYL ISOBUTYL KETONE
METHYL METHACRYLATE
METHYL TERT-BUTYL ETHER
METHYLCYCLOHEXANE
METHYLCYCLOPENTANE
N-BUTANE
N-DECANE
N-HEPTANE
N-HEXANE
N-NONANE
N-OCTANE
N-PENTANE
N-PROPYLBENZENE
N-UNDECANE
O-ETHYLTOLUENE
O-XYLENE
OZONE
P-DIETHYLBENZENE
P-ETHYLTOLUENE
PROPANE
PROPYLENE
STYRENE
SUM OF PAMS TARGET COMPOUNDS
TERT-AMYL METHYL ETHER
TERT-BUTYL ETHYL ETHER
TETRACHLOROETHYLENE
TOLUENE
TOTAL NMOC (NON-METHANE ORGANIC COMPOUND)
TRANS-1,2-DICHLOROETHYLENE
TRANS-1,3-DICHLOROPROPENE
TRANS-2-BUTENE
TRANS-2-PENTENE
TRICHLOROETHYLENE
TRICHLOROFLUOROMETHANE
VINYL CHLORIDE
address
airnow_flag
aqi
cbsa_name
city_name
cmsa_name
collecting_agency
county_code
county_name
date_of_last_change
datum
elevation
epa_region
event_type
first_year_of_data
gmt_offset
land_use
latitude
local_site_name
location_setting
longitude
method_code
method_name
monitor_type
msa_code
msa_name
networks
obs
observation_count
observation_percent
parameter_code
parameter_name
poc
pollutant_standard
sample_duration
site_num
siteid
state_code
state_name
time
time_local
tribe_name
units
variable

Each chemical species is now its own column, which makes direct comparisons straightforward. Let's get a description of the dataset first so we can see some averages and ranges of the different chemical species.

new.describe()
1,1,2,2-TETRACHLOROETHANE 1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE 1,1,2-TRICHLOROETHANE 1,1-DICHLOROETHANE 1,1-DICHLOROETHYLENE 1,2,3-TRIMETHYLBENZENE 1,2,4-TRICHLOROBENZENE 1,2,4-TRIMETHYLBENZENE 1,2-DICHLOROBENZENE 1,2-DICHLOROPROPANE ... obs 1st_max_value 1st_max_hour aqi method_code cmsa_name elevation first_year_of_data gmt_offset msa_code
count 714352.000000 132606.000000 665982.000000 475211.000000 704962.000000 766240.000000 407466.000000 1.105874e+06 441391.000000 713931.000000 ... 1.501618e+06 1.501618e+06 1.501618e+06 335758.000000 1.165860e+06 0.0 0.0 1.393782e+06 1.501618e+06 0.0
mean 0.020421 0.169611 0.019323 0.012979 0.020375 0.474538 0.111623 1.011792e+00 0.129964 0.030783 ... 3.764996e+00 8.070852e+00 5.336468e+00 37.632730 1.404985e+02 NaN NaN 1.993263e+03 -5.978275e+00 NaN
std 0.157866 0.215456 0.158109 0.185480 0.178133 1.307923 1.129665 2.255642e+00 0.947958 0.230669 ... 3.997054e+01 1.091979e+02 6.966935e+00 19.249021 2.685583e+01 NaN NaN 1.223728e+01 1.006215e+00 NaN
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 1.100000e+01 NaN NaN 1.959000e+03 -1.000000e+01 NaN
25% 0.000000 0.100000 0.000000 0.000000 0.000000 0.050000 0.000000 1.800000e-01 0.000000 0.000000 ... 2.000000e-02 3.000000e-02 0.000000e+00 26.000000 1.260000e+02 NaN NaN 1.982000e+03 -6.000000e+00 NaN
50% 0.000000 0.180000 0.000000 0.000000 0.000000 0.230000 0.000000 5.000000e-01 0.000000 0.000000 ... 8.000000e-02 1.000000e-01 0.000000e+00 35.000000 1.280000e+02 NaN NaN 1.997000e+03 -6.000000e+00 NaN
75% 0.010000 0.220000 0.010000 0.000000 0.010000 0.468182 0.000000 1.219583e+00 0.000000 0.020000 ... 6.600000e-01 1.000000e+00 1.000000e+01 43.000000 1.740000e+02 NaN NaN 2.003000e+03 -5.000000e+00 NaN
max 10.000000 10.000000 10.000000 10.000000 10.000000 39.266667 54.700000 1.493500e+02 59.880000 15.000000 ... 9.474708e+03 3.854257e+04 2.300000e+01 212.000000 2.110000e+02 NaN NaN 2.018000e+03 -4.000000e+00 NaN

8 rows × 127 columns
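
Since describe() on the full frame prints well over a hundred columns, it is often more readable to restrict it to the species of interest, for example:

# summary statistics for a few selected species only
new[['ISOPRENE', 'BENZENE', 'OZONE']].describe()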

This gives us a format that allows simple statistics and plots using pandas, matplotlib, and seaborn. For time series it is often useful to have the time as the index, so let's set that.

new.index = new.time                 # index by UTC time
new['OZONE_ppb'] = new.OZONE * 1000  # AQS ozone is reported in ppm; convert to ppb
new.OZONE_ppb.mean()
27.581303457170307
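
With the time index in place we can also select data by date strings directly. A quick sketch for the summer months (sorting the index first so the slice is valid):

# select June-August 2004 using the DatetimeIndex
summer = new.sort_index().loc['2004-06-01':'2004-08-31']
summer.OZONE_ppb.mean()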

Plotting

As you can see, the data is now indexed by the UTC time. Let's make a time series plot of the average ISOPRENE across all sites.

f,ax = plt.subplots(figsize=(10,4)) # this is so we can control the figure size.
new.ISOPRENE.resample('D').mean().plot(ax=ax)
[Figure: time series of daily-mean ISOPRENE]

This is quite noisy with the daily data, so let's resample the average isoprene concentration in time to weekly and monthly means.

f,ax = plt.subplots(figsize=(10,4)) # this is so we can control the figure size.
new.ISOPRENE.resample('D').mean().plot(ax=ax, label='daily')
new.ISOPRENE.resample('W').mean().plot(ax=ax, label='weekly')
new.ISOPRENE.resample('M').mean().plot(ax=ax, label='monthly')
plt.ylabel('ISOP')
plt.legend()
sns.despine()
[Figure: ISOPRENE daily, weekly, and monthly mean time series]
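
These are network-wide averages; if you want per-site series instead, group by siteid before resampling. A minimal sketch:

# monthly mean ISOPRENE for each site separately (uses the DatetimeIndex)
per_site = new.groupby('siteid')['ISOPRENE'].resample('M').mean()
per_site.head()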

Where are these measurements? Let's plot them on a map and see. We can use a utility plotting function in MONET to generate the basemap.

from monet import plots
ax = plots.draw_map(states=True, extent=[-130,-60,20,50], resolution='10m')
# get only where ISOPRENE is not NAN
isop = new.dropna(subset=['ISOPRENE'])
ax.scatter(isop.longitude, isop.latitude)
[Figure: map of ISOPRENE monitor locations over the US]

There are monitors all across the US with many in TX, CA, New England and the Mid-Atlantic.
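
Since cartopy was imported at the top, the same map can also be drawn without MONET's helper. A minimal sketch using cartopy directly:

# build the basemap by hand with cartopy
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([-130, -60, 20, 50])
ax.add_feature(cfeature.STATES)  # state borders
ax.coastlines()
ax.scatter(isop.longitude, isop.latitude, transform=ccrs.PlateCarree())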

What if we wanted to do a linear regression between two variables, say OZONE and ISOPRENE? To do this we will use the statsmodels package, a robust library for statistical modeling and curve fitting. For specific information on this module, see https://www.statsmodels.org/stable/index.html

import statsmodels.api as sm  # load the statsmodels api
# first drop rows with NaN values in either column
fit_df = new[['ISOPRENE', 'OZONE']].dropna()
x = fit_df.ISOPRENE
y = fit_df.OZONE
result = sm.OLS(y, x).fit()  # no intercept is added, so the fit passes through the origin
print(result.summary())

fit_df.plot.scatter(x='ISOPRENE', y='OZONE')
plt.plot(x, result.predict(x), '--r')
                            OLS Regression Results
==============================================================================
Dep. Variable:                  OZONE   R-squared:                       0.248
Model:                            OLS   Adj. R-squared:                  0.248
Method:                 Least Squares   F-statistic:                 1.833e+05
Date:                Tue, 19 Jun 2018   Prob (F-statistic):               0.00
Time:                        09:31:12   Log-Likelihood:             1.2390e+06
No. Observations:              556733   AIC:                        -2.478e+06
Df Residuals:                  556732   BIC:                        -2.478e+06
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
ISOPRENE       0.0057   1.33e-05    428.134      0.000       0.006       0.006
==============================================================================
Omnibus:                   205153.426   Durbin-Watson:                   0.007
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2465877.353
Skew:                          -1.433   Prob(JB):                         0.00
Kurtosis:                      12.904   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[Figure: scatter of OZONE vs ISOPRENE with the fitted OLS line]
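
Note that sm.OLS does not add an intercept by default, which is why the fit above passes through the origin. To include an intercept, add a constant column first:

X = sm.add_constant(x)          # prepend a 'const' column of ones
result_c = sm.OLS(y, X).fit()   # now fits intercept + slope
print(result_c.params)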

Let's save this to a CSV file.

new.to_csv('aqs_pams_2004.csv')  # write the wide dataframe to disk
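
To reload it later with the time index restored (assuming the filename written above):

new2 = pd.read_csv('aqs_pams_2004.csv', parse_dates=['time'], index_col='time')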