library(tidyverse)
library(duckdb)
library(RAQSAPI)
Introduction
DuckDB is an in-process SQL database management system designed for fast and efficient data analysis. It’s particularly advantageous for data analysts due to its seamless integration with popular programming languages like Python and R, allowing easy querying of large datasets directly from within those environments. DuckDB’s columnar storage format and vectorized execution enable high-performance analytics on large datasets without the need for complex setups, making it an ideal choice for exploratory data analysis, interactive queries, and handling structured data efficiently on local machines.
Data
The RAQSAPI::
package provides a streamlined interface for accessing air quality data from the U.S. EPA’s Air Quality System (AQS) Data Mart API, handling tasks like formatting requests, parsing JSON responses, and managing API rate limits.1 It allows users to easily retrieve and analyze large-scale air quality datasets directly within R.
Setup
Libraries
Fetching data from the EPA’s Air Quality Service Data Mart API
If it’s your first time accessing the AQS API, you’ll need to use the RAQSAPI::aqs_sign_up()
function, and provide your email address as a string. (Note: when I first set up my credentials, it took about 48 hours to receive the email from their service).
# RAQSAPI::aqs_sign_up("abcd@efg.com")
Then, use the RAQSAPI::aqs_credentials()
function to provide your username and key.
The AQS service is rich with a variety of functions and arguments that help you pull exactly what you need. There is so much available through their API, and I won’t even begin to cover it here. For this demo, I’d like to look at ozone measurements over the last year in Colorado.
Some common arguments that you’ll find in RAQSAPI::
package:
parameter=
:the specific pollutant that you’re interested in. You can feed this a vector of multiple pollutants.2bdate=
: start date. If the grain of analysis is daily, the API limits each call to one year of data.edate=
: end date.stateFIPS
andcounty
: FIPS codes for your states or counties (if that’s your grain of analysis)
<- RAQSAPI::aqs_dailysummary_by_county(
df_ozone # '44201' is ozone
# '43201' is methane
parameter = '44201',
bdate = as.Date('20240101',
format = "%Y%m%d"),
edate = as.Date('20240810',
format = "%Y%m%d"),
# '08' is Colorado
stateFIPS = '08',
# '059' is Jefferson County
county = '059'
)
Now, I’ll dplyr::glimpse()
to see the structure of the data.
%>%
df_ozone glimpse()
Rows: 1,440
Columns: 32
$ state_code <chr> "08", "08", "08", "08", "08", "08", "08", "08", "…
$ county_code <chr> "059", "059", "059", "059", "059", "059", "059", …
$ site_number <chr> "0006", "0006", "0006", "0006", "0006", "0006", "…
$ parameter_code <chr> "44201", "44201", "44201", "44201", "44201", "442…
$ poc <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ latitude <dbl> 39.9128, 39.9128, 39.9128, 39.9128, 39.9128, 39.9…
$ longitude <dbl> -105.1886, -105.1886, -105.1886, -105.1886, -105.…
$ datum <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS84", "WGS…
$ parameter <chr> "Ozone", "Ozone", "Ozone", "Ozone", "Ozone", "Ozo…
$ sample_duration_code <chr> "1", "W", "W", "W", "1", "W", "W", "W", "1", "W",…
$ sample_duration <chr> "1 HOUR", "8-HR RUN AVG BEGIN HOUR", "8-HR RUN AV…
$ pollutant_standard <chr> "Ozone 1-hour 1979", "Ozone 8-Hour 1997", "Ozone …
$ date_local <chr> "2024-01-01", "2024-01-01", "2024-01-01", "2024-0…
$ units_of_measure <chr> "Parts per million", "Parts per million", "Parts …
$ event_type <chr> "No Events", "No Events", "No Events", "No Events…
$ observation_count <int> 24, 24, 24, 17, 23, 24, 24, 17, 24, 24, 24, 17, 2…
$ observation_percent <dbl> 100, 100, 100, 100, 96, 100, 100, 100, 100, 100, …
$ validity_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ arithmetic_mean <dbl> 0.041333, 0.041833, 0.041833, 0.042294, 0.033870,…
$ first_max_value <dbl> 0.048, 0.046, 0.046, 0.046, 0.046, 0.037, 0.037, …
$ first_max_hour <int> 14, 8, 8, 8, 23, 22, 22, 22, 15, 10, 10, 10, 23, …
$ aqi <int> NA, 43, 43, 43, NA, 34, 34, 34, NA, 36, 36, 36, N…
$ method_code <chr> "087", "087", "087", "087", "087", "087", "087", …
$ method <chr> "INSTRUMENTAL - ULTRA VIOLET ABSORPTION", "INSTRU…
$ local_site_name <chr> "ROCKY FLATS-N", "ROCKY FLATS-N", "ROCKY FLATS-N"…
$ site_address <chr> "16600 W COLO #128", "16600 W COLO #128", "16600 …
$ state <chr> "Colorado", "Colorado", "Colorado", "Colorado", "…
$ county <chr> "Jefferson", "Jefferson", "Jefferson", "Jefferson…
$ city <chr> "Not in a city", "Not in a city", "Not in a city"…
$ cbsa_code <chr> "19740", "19740", "19740", "19740", "19740", "197…
$ cbsa <chr> "Denver-Aurora-Lakewood, CO", "Denver-Aurora-Lake…
$ date_of_last_change <chr> "2024-05-24", "2024-05-24", "2024-05-24", "2024-0…
Setting up and querying DuckDB in R
(This section is not meant to be analysis! I’m just demonstrating the setup and querying of the database.)
<- DBI::dbConnect(duckdb(), path = ":memory:")
con
<- copy_to(con, df_ozone %>% filter(sample_duration_code == '1'), overwrite = TRUE) ozone
# write our data to duckdb table
<- "ozone"
table_name ::dbWriteTable(con, table_name, df_ozone) duckdb
dbGetQuery(con,
"SELECT *
FROM ozone
limit 10")
state_code county_code site_number parameter_code poc latitude longitude
1 08 059 0006 44201 1 39.9128 -105.1886
2 08 059 0006 44201 1 39.9128 -105.1886
3 08 059 0006 44201 1 39.9128 -105.1886
4 08 059 0006 44201 1 39.9128 -105.1886
5 08 059 0006 44201 1 39.9128 -105.1886
6 08 059 0006 44201 1 39.9128 -105.1886
7 08 059 0006 44201 1 39.9128 -105.1886
8 08 059 0006 44201 1 39.9128 -105.1886
9 08 059 0006 44201 1 39.9128 -105.1886
10 08 059 0006 44201 1 39.9128 -105.1886
datum parameter sample_duration_code sample_duration
1 WGS84 Ozone 1 1 HOUR
2 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
3 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
4 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
5 WGS84 Ozone 1 1 HOUR
6 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
7 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
8 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
9 WGS84 Ozone 1 1 HOUR
10 WGS84 Ozone W 8-HR RUN AVG BEGIN HOUR
pollutant_standard date_local units_of_measure event_type observation_count
1 Ozone 1-hour 1979 2024-01-01 Parts per million No Events 24
2 Ozone 8-Hour 1997 2024-01-01 Parts per million No Events 24
3 Ozone 8-Hour 2008 2024-01-01 Parts per million No Events 24
4 Ozone 8-hour 2015 2024-01-01 Parts per million No Events 17
5 Ozone 1-hour 1979 2024-01-02 Parts per million No Events 23
6 Ozone 8-Hour 1997 2024-01-02 Parts per million No Events 24
7 Ozone 8-Hour 2008 2024-01-02 Parts per million No Events 24
8 Ozone 8-hour 2015 2024-01-02 Parts per million No Events 17
9 Ozone 1-hour 1979 2024-01-03 Parts per million No Events 24
10 Ozone 8-Hour 1997 2024-01-03 Parts per million No Events 24
observation_percent validity_indicator arithmetic_mean first_max_value
1 100 Y 0.041333 0.048
2 100 Y 0.041833 0.046
3 100 Y 0.041833 0.046
4 100 Y 0.042294 0.046
5 96 Y 0.033870 0.046
6 100 Y 0.033500 0.037
7 100 Y 0.033500 0.037
8 100 Y 0.032471 0.037
9 100 Y 0.037000 0.045
10 100 Y 0.036083 0.039
first_max_hour aqi method_code method
1 14 NA 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 8 43 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 8 43 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 8 43 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 23 NA 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 22 34 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
7 22 34 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
8 22 34 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
9 15 NA 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
10 10 36 087 INSTRUMENTAL - ULTRA VIOLET ABSORPTION
local_site_name site_address state county city cbsa_code
1 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
2 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
3 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
4 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
5 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
6 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
7 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
8 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
9 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
10 ROCKY FLATS-N 16600 W COLO #128 Colorado Jefferson Not in a city 19740
cbsa date_of_last_change
1 Denver-Aurora-Lakewood, CO 2024-05-24
2 Denver-Aurora-Lakewood, CO 2024-05-24
3 Denver-Aurora-Lakewood, CO 2024-05-24
4 Denver-Aurora-Lakewood, CO 2024-05-24
5 Denver-Aurora-Lakewood, CO 2024-05-24
6 Denver-Aurora-Lakewood, CO 2024-05-24
7 Denver-Aurora-Lakewood, CO 2024-05-24
8 Denver-Aurora-Lakewood, CO 2024-05-24
9 Denver-Aurora-Lakewood, CO 2024-05-24
10 Denver-Aurora-Lakewood, CO 2024-05-24
dbGetQuery(con,
"SELECT
sample_duration,
local_site_name,
AVG(aqi) as avg_aqi,
AVG(arithmetic_mean) as avg_arithmetic_mean,
AVG(first_max_value) as avg_first_max_value
FROM ozone
GROUP BY sample_duration, local_site_name")
sample_duration local_site_name avg_aqi
1 1 HOUR ROCKY FLATS-N NA
2 8-HR RUN AVG BEGIN HOUR ROCKY FLATS-N 47.24793
3 1 HOUR NATIONAL RENEWABLE ENERGY LABS - NREL NA
4 8-HR RUN AVG BEGIN HOUR NATIONAL RENEWABLE ENERGY LABS - NREL 44.07649
5 1 HOUR Evergreen NA
6 8-HR RUN AVG BEGIN HOUR Evergreen 47.28375
avg_arithmetic_mean avg_first_max_value
1 0.04345224 0.05146281
2 0.04325047 0.04856749
3 0.04052387 0.04900000
4 0.04019869 0.04589235
5 0.04131112 0.05216529
6 0.04122206 0.04876309
Conclusion
::dbDisconnect(conn = con) DBI
This post shows how to use the RAQSAPI::
package to access air quality data from the Environmental Protection Agency, and how to use the duckdb::
package to set up an in-memory DuckDB database in R.
In future posts, I’ll begin analyzing and visualizing Colorado’s air pollution data from the EPA’s Air Quality Service.