marCasadoAir = FileAttachment("./data/marCasadoAir@5.csv").csv({typed: true}) // Date issue
Introduction
Today we’ll be exploring multivariate meteorological data from Mar Casado Beach, São Paulo (Brazil). By the end of the lesson, participants will be able to:

- Create a database in a notebook using DuckDBClient.of()
- Query the database in SQL cells
- Visualize relationships between variables
- Import and reuse existing content using import with
- Perform and visualize results of principal component analysis

In this session, we will work with data from da Silva et al. (2019) describing monthly atmospheric and ocean data (e.g. air pressure, wind speed, tides, sea surface temperature).
Data source: Marcos Gonçalves da Silva, Juliana Nascimento Silva, Helena Rodrigues Fragoso, Natalia Pirani Ghilardi-Lopes (2019). Temporal series analysis of abiotic data for a subtropical Brazilian rocky shore. Data in Brief, Volume 24. ISSN 2352-3409, https://doi.org/10.1016/j.dib.2019.103873.
Step 1: Combine the tables into a database

The data are already attached here in two parts: marCasadoAir and marCasadoSea:
marCasadoSea = FileAttachment("./data/marCasadoSea@4.csv").csv({typed: true}) // Date issue
We can create a database containing both tables using DuckDBClient.of():
// Write code to create a database called marCasadoDB, with tables 'air' and 'sea':
marCasadoDB = DuckDBClient.of({air: marCasadoAir, sea: marCasadoSea})
Now, open the Database pane in the right margin of your notebook. There, you can explore the schema and variables of your newly created database.
Step 2: Wrangle & analyze data in a SQL cell

We want to combine some data from both tables in the database. The column we’re able to join by is month. We will also keep the following columns, and add a new column for season:
From air:
- month
- meanPressure (mmHg)
- windSpeed (kilometers per hour)
- PAR (photosynthetically active radiation, in E/m²·s)
- meanHumidity (percent)
- windDirection (degrees)
From sea:
- maxTide (meters)
- minTide (meters)
- salinity (practical salinity units)
- seaSurfaceTemp (degrees Celsius)
From da Silva et al (2019): “…two distinct seasons: (I) a hot and moist season from October to March (encompassing Spring and Summer) and (II) a cold and dry season from April to September (encompassing Autumn and Winter); an expected result for subtropical zones.”
We’ll also add a new column, season, containing “hot moist” for October through March, otherwise “cool dry.”
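Before writing this in SQL, the season rule can be sketched in plain JavaScript (a hypothetical helper, `seasonFor`, mirroring the split described in da Silva et al.):

```javascript
// Assign a season label from a 1-12 month number: October-March is the
// hot/moist season, April-September the cool/dry season.
function seasonFor(monthNumber) {
  return [10, 11, 12, 1, 2, 3].includes(monthNumber) ? "hot moist" : "cool dry";
}
```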
(Each table holds 60 monthly rows: the air table has 15 columns and the sea table 5, both including a month date column.)
select cast(strftime('%m', a.month) as integer) as month,
a.meanPressure,
a.windSpeed,
a.PAR,
a.meanHumidity,
a.windDirection,
s.maxTide,
s.minTide,
s.salinity,
s.seaSurfaceTemp,
       CASE WHEN cast(strftime('%m', a.month) as integer) IN (10, 11, 12, 1, 2, 3)
            THEN 'hot moist' ELSE 'cool dry' END AS season
from air as a
left join sea as s
on a.month = s.month
month | meanPressure | windSpeed | PAR | meanHumidity | windDirection | maxTide | minTide | salinity | seaSurfaceTemp | season |
---|---|---|---|---|---|---|---|---|---|---|
11 | 1012.10 | 5.47 | 50.84 | 79.86 | 51.80 | 1.25 | 0.28 | 35.50 | 26.27 | hot moist |
12 | 1011.89 | 3.83 | 51.48 | 77.34 | 71.58 | 1.27 | 0.26 | 35.13 | 26.27 | hot moist |
1 | 1013.48 | 6.22 | 36.19 | 79.75 | 54.58 | 1.30 | 0.27 | 35.62 | 26.57 | hot moist |
2 | 1015.67 | 5.37 | 35.78 | 79.25 | 70.64 | 1.28 | 0.29 | 35.67 | 24.82 | hot moist |
3 | 1016.32 | 4.88 | 27.58 | 76.40 | 66.01 | 1.25 | 0.26 | 35.77 | 24.05 | hot moist |
4 | 1017.40 | 3.73 | 21.69 | 81.65 | 71.94 | 1.25 | 0.25 | 35.41 | 22.37 | cool dry |
5 | 1020.16 | 5.27 | 25.69 | 79.34 | 79.75 | 1.25 | 0.29 | 34.45 | 20.73 | cool dry |
6 | 1018.84 | 4.38 | 31.20 | 75.15 | 90.27 | 1.28 | 0.29 | 32.74 | 18.85 | cool dry |
7 | 1017.23 | 5.48 | 33.23 | 76.70 | 74.55 | 1.32 | 0.29 | 32.32 | 21.05 | cool dry |
8 | 1015.74 | 5.91 | 38.72 | 77.88 | 56.02 | 1.30 | 0.31 | 33.13 | 21.79 | cool dry |
If you want, you can also work in SQL to wrangle & analyze an array (it doesn’t have to be a database):
SELECT cast(strftime('%m', month) as integer) as month,
avg(seaSurfaceTemp) as meanSST
FROM sea
GROUP BY month
month | meanSST |
---|---|
11 | 26.27 |
12 | 26.27 |
1 | 26.57 |
2 | 24.82 |
3 | 24.05 |
4 | 22.37 |
5 | 20.73 |
6 | 18.85 |
7 | 21.05 |
8 | 21.79 |
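The same group-and-average can be sketched in plain JavaScript, which helps show what the SQL cell is doing. This is a minimal sketch with a hypothetical `meanSSTByMonth` helper, assuming rows shaped like the sea table (a Date `month` and a numeric `seaSurfaceTemp`):

```javascript
// Group rows by calendar month (1-12, like strftime('%m')) and average
// seaSurfaceTemp, mirroring the SQL GROUP BY above.
function meanSSTByMonth(rows) {
  const sums = new Map();
  for (const { month, seaSurfaceTemp } of rows) {
    const m = month.getUTCMonth() + 1; // getUTCMonth() is 0-based
    const entry = sums.get(m) ?? { total: 0, n: 0 };
    entry.total += seaSurfaceTemp;
    entry.n += 1;
    sums.set(m, entry);
  }
  return new Map([...sums].map(([m, { total, n }]) => [m, total / n]));
}
```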
Step 3: Visualize data in a heatmap

The combined data, including the new season column, are attached as marCasado:

marCasado = FileAttachment("./data/marCasado.csv").csv({typed: true})
// Write Plot code to create a heatmap of sea surface temperature (SST) by year and month, starting from the 'cell' snippet:
Plot.plot({
  marks: [
    Plot.cell(marCasado, {
      y: d => d.month.getUTCFullYear(),
      x: d => d.month.getUTCMonth(),
      fill: "seaSurfaceTemp",
      tip: true
    }),
  ],
  width: 500,
  height: 250,
  y: {tickFormat: "d", padding: 0},
  x: {padding: 0, tickFormat: Plot.formatMonth()}
})
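The x and y accessors above place each observation at (month-of-year, year). As a small sketch of that coordinate extraction (hypothetical `cellCoords` helper, not part of the notebook):

```javascript
// Split a Date into the heatmap's cell coordinates:
// x is the 0-based month (0 = January), y is the four-digit year.
function cellCoords(date) {
  return { x: date.getUTCMonth(), y: date.getUTCFullYear() };
}
```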
Step 4: Interactive data visualization

Use import with to import content from another notebook, and replace what it expects with something new:
import {PlotMatrix} with {marCasado as data} from "@observablehq/autoplot-matrix"
// Use the PlotMatrix function (expecting marCasado) to create a pair plot:
PlotMatrix(marCasado)
Step 5: Principal component analysis

Imports & libraries for PCA

Use require to access methods from the ml.js package (which includes the PCA function we’ll use):
ML = require("https://www.lactame.com/lib/ml/6.0.0/ml.min.js")
Import the scale and asMatrix functions from Christoph Pahmeyer’s hierarchical clustering notebook. The scale function will scale values of a property to have a mean of 0 and standard deviation of 1. We use asMatrix to get a 2D array of predictions to project our points in the PC space.
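To make the scaling step concrete, here is a minimal sketch of per-column standardization. This is an assumption about the imported helper's behavior (it may differ in details, e.g. sample vs. population standard deviation — the sketch uses the sample formula):

```javascript
// Standardize every numeric property across rows: subtract the column
// mean, divide by the column's sample standard deviation.
function scaleColumns(rows) {
  const keys = Object.keys(rows[0]);
  const stats = {};
  for (const k of keys) {
    const values = rows.map((d) => d[k]);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const sd = Math.sqrt(
      values.reduce((a, v) => a + (v - mean) ** 2, 0) / (values.length - 1)
    );
    stats[k] = { mean, sd };
  }
  return rows.map((d) =>
    Object.fromEntries(keys.map((k) => [k, (d[k] - stats[k].mean) / stats[k].sd]))
  );
}
```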
import {scale, asMatrix} from "@chrispahm/hierarchical-clustering"
Scaling and PCA

Create a scaled version of the data, only including the numeric variables (excluding season and month) that will be included in principal component analysis:
// Create a scaled version of the numeric variables
marCasadoScaled = scale(marCasado.map(({ season, month, ...rest }) => rest))
Then convert the array of objects to an array of arrays (which is what the PCA method from ml.js is expecting):
// Convert to an array of arrays, just containing the values (no keys):
marCasadoArray = marCasadoScaled.map(Object.values)
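As a tiny sketch of what this conversion does (hypothetical two-row data): Object.values drops the keys and keeps each row's numbers in property order, yielding the matrix shape PCA expects.

```javascript
// Two objects with the same keys become a 2x2 numeric matrix.
const sketchRows = [{ a: 1, b: 2 }, { a: 3, b: 4 }];
const sketchMatrix = sketchRows.map(Object.values); // [[1, 2], [3, 4]]
```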
Now, we’re ready to perform PCA using the PCA function in ml.js (nicknamed ML when we used require above):
// Perform principal component analysis:
marCasadoPCA = new ML.PCA(marCasadoArray) // Already scaled above - otherwise can add {scale: true} here!
Explore PCA results

Use getExplainedVariance() to see variance explained by each PC:
// Get variance explained by each PC:
variancePC = marCasadoPCA.getExplainedVariance()
Use getCumulativeVariance() to see cumulative variance explained:
// Get cumulative variance explained:
cumulativeVariance = marCasadoPCA.getCumulativeVariance()
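Conceptually, these two methods report each PC's eigenvalue as a fraction of the total variance, and the running sum of those fractions. A minimal sketch of that arithmetic (a hypothetical `varianceExplained` helper, not the ml.js implementation):

```javascript
// Given PC eigenvalues, return each PC's share of the total variance
// and the cumulative share up to and including that PC.
function varianceExplained(eigenvalues) {
  const total = eigenvalues.reduce((a, b) => a + b, 0);
  const explained = eigenvalues.map((v) => v / total);
  let running = 0;
  const cumulative = explained.map((v) => (running += v));
  return { explained, cumulative };
}
```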
Step 6: Visualize PCA results

import with to reuse community contributions

We’ll again use import with to access and reuse materials, this time from Christoph Pahmeyer’s notebook on principal component analysis.
import {loadings} from "@chrispahm/principal-component-analysis"
// Import viewof loadings from the notebook, with marCasadoScaled as food_scaled:
import {viewof loadings} with {marCasadoScaled as food_scaled} from "@chrispahm/principal-component-analysis"
// Look at viewof loadings:
loadings_df = viewof loadings
import {scores} from "@chrispahm/principal-component-analysis"
// import viewof scores from the notebook, with marCasadoScaled as food_scaled and marCasado as food:
import {viewof scores} with {marCasadoScaled as food_scaled, marCasado as food} from "@chrispahm/principal-component-analysis"
// Look at viewof scores:
scores_df = viewof scores
// Do some wrangling to get the month and season alongside scores:
scoresCombined = scores_df.map((d, i) => ({ ...d, Name: marCasado[i].month, season: marCasado[i].season }))
scalingFactor = 5
// Create a PCA biplot with the scores and loadings
Plot.plot({
  marks: [
    Plot.dot(scoresCombined, { x: "PC1", y: "PC2", fill: "season", r: 5 }),
    Plot.arrow(loadings_df, {
      x1: 0, y1: 0,
      x2: d => d.PC1 * scalingFactor,
      y2: d => d.PC2 * scalingFactor
    }),
    Plot.text(loadings_df, {
      x: d => d.PC1 * scalingFactor,
      y: d => d.PC2 * scalingFactor,
      text: "Variable",
      dy: -5,
      dx: 30,
      fill: "black",
      stroke: "white",
      fontSize: 14
    }),
  ],
  color: { legend: true },
  inset: 20
})