This notebook ingests the UHC - NDVI dataset into the CCUH infrastructure.
# 1. Freeze data

Let's freeze the external .csv into a parquet for long-term storage and column type casting. We do a few small transformations:

- rename GEOID10 to tract10
- cast tract10 as character, left-padded with zeros to length 11
- rename ct10_ndvi to ndvi

The query looks like:
```{r}
#| code-fold: false
# Create SQL to read CSV and write to Parquet
freeze_query <- glue_sql("
  COPY (
    SELECT
      LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
      year,
      season,
      ct10_ndvi AS ndvi
    FROM read_csv_auto({raw_file})
  ) TO {final} (FORMAT PARQUET)
  ",
  raw_file = context$raw_file,
  final = context$final,
  .con = con
)
freeze_query
```
```
<SQL> COPY (
  SELECT
    LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
    year,
    season,
    ct10_ndvi AS ndvi
  FROM read_csv_auto('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/raw/CT10NDVI_84_19.csv')
)
TO '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet' (FORMAT PARQUET)
```
Now let’s execute
```{r}
DBI::dbExecute(con, freeze_query)
```
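To make the padding step concrete: numeric storage drops leading zeros from GEOIDs, which `LPAD(CAST(GEOID10 AS VARCHAR), 11, '0')` restores. A minimal Python sketch of the same logic (the `pad_tract10` helper is hypothetical, just mirroring DuckDB's `LPAD`):

```python
def pad_tract10(geoid) -> str:
    """Zero-pad a 2010 tract GEOID to the canonical 11-character form,
    mirroring LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') in the freeze query."""
    return str(geoid).rjust(11, "0")

# An Alabama tract GEOID stored as an integer loses its leading zero:
print(pad_tract10(1001020100))  # → 01001020100
```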
# 2. Validation
A quick EDA on the dataset for QC and to build some familiarity before ingesting into the CCUH infrastructure. Let's first import the data.
There are some missing values, which are noted in the original documentation. NDVI is not available when:

- There is cloud cover
- There is snow cover
- Over bodies of water (these should be excluded)

Additionally, there's a specific known data gap: Summer 2003 has a known missing data issue.
## Completeness

We have 36 years (1984-2019) and 4 seasons per year, minus Summer 2003, which has a known missing data issue and is thus excluded. So we expect 36 * 4 - 1 = 143 data points per census tract.
```{r}
#| code-fold: false
df_valid = df_data %>%
  count(tract10, name = 'n_data_points') %>%
  verify(n_data_points == 143)
```
Indeed that is the case.
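The expected count can also be sanity-checked by enumerating the observation grid directly; a small Python sketch under the assumptions above (years 1984-2019, four seasons, Summer 2003 excluded):

```python
# Build the expected (year, season) grid and remove the known Summer 2003 gap.
seasons = ["Winter", "Spring", "Summer", "Fall"]
expected = {(y, s) for y in range(1984, 2020) for s in seasons}  # 36 * 4 = 144 cells
expected.discard((2003, "Summer"))

print(len(expected))  # → 143
```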
## Visual Inspection

Okay, no structural issues; let's do a quick visual inspection.
### NDVI over time
```{r}
#| code-summary: "Mean NDVI over time"
#| code-fold: true
df_data %>%
  group_by(year, season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = mean_ndvi, color = season)) +
  geom_line() +
  geom_vline(xintercept = c(1999, 2013), linetype = "dashed", alpha = 0.5) +
  annotate(
    "text",
    x = c(1991, 2006, 2016),
    y = max(df_data$ndvi, na.rm = TRUE),
    label = c("Landsat 5", "Landsat 7", "Landsat 8")
  ) +
  labs(
    title = "Mean NDVI by Season Over Time",
    subtitle = "Vertical lines indicate Landsat satellite transitions",
    x = "Year", y = "Mean NDVI",
    color = "Season"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```
### Seasonal Distribution

```{r}
df_data %>%
  ggplot(aes(x = ndvi, fill = season, color = season)) +
  geom_density(alpha = 0.3) + # Density instead of histogram for better overlap visualization
  labs(
    title = "NDVI Distribution by Season",
    subtitle = "Overlapping density plots show seasonal patterns",
    x = "NDVI Value", y = "Density"
  ) +
  theme_minimal() +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
```
```{r}
df_metadata <- tribble(
  ~var_name, ~var_def, ~var_type,
  "tract10", "11 character ID for census tract (2010)", "character",
  "year", "Year of observation, ranges from 1984-2019", "integer",
  "season", "Season of observation (Winter, Spring, Summer, Fall)", "character",
  "ndvi", "Normalized Difference Vegetation Index value, ranging from -1 to +1", "double"
)

df_metadata %>%
  reactable(
    searchable = TRUE,
    defaultPageSize = 99
  )
```
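The codebook's stated range for `ndvi` (-1 to +1) is easy to assert mechanically; a minimal sketch, with made-up sample values standing in for the parquet column and `None` standing in for the documented missing values (cloud, snow, water):

```python
def ndvi_in_range(v) -> bool:
    """True if v is a documented missing value (None) or within [-1, 1]."""
    return v is None or -1.0 <= v <= 1.0

# Toy values for illustration only; the real check would run over the parquet column.
ndvi_values = [0.42, -0.05, 0.78, None]
print(all(ndvi_in_range(v) for v in ndvi_values))  # → True
```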
Source Code
---
title: "ndvi_landsat_mesa_nbhd_v1"
description: "Ingestion of the UHC - NDVI dataset"
author:
  - name: Ran Li (Maintainer)
    orcid: 0000-0002-4699-4755
date: last-modified
format:
  html:
    toc: true
    self-contained: true
    code-fold: true
    df-print: kable
    code-tools: true
comments:
  hypothesis:
    theme: clean
editor: source
execute:
  warning: false
  message: false
  eval: true
editor_options:
  chunk_output_type: console
---

```{r}
#| code-summary: 'Setup'
#| code-fold: true

## Dependencies
pacman::p_load(
  tidyverse, assertr, arrow, duckdb, reactable, glue
)

## Local Context
context = lst(
  model = 'ndvi_landsat_mesa_nbhd_v1',
  server = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1',
  raw_folder = file.path(server, 'raw'),
  raw_file = file.path(raw_folder, 'CT10NDVI_84_19.csv'),
  final = file.path(server, glue("{model}.parquet"))
)

## Create database connection
con = dbConnect(duckdb())
```

This notebook ingests the UHC - NDVI dataset into the CCUH infrastructure.

# 1. Freeze data

Let's freeze the external .csv into a parquet for long-term storage and column type casting. We do a few small transformations:

- rename GEOID10 to tract10
- cast tract10 as character, left-padded with zeros to length 11
- rename ct10_ndvi to ndvi

The query looks like:

```{r}
#| code-fold: false
# Create SQL to read CSV and write to Parquet
freeze_query <- glue_sql("
  COPY (
    SELECT
      LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
      year,
      season,
      ct10_ndvi AS ndvi
    FROM read_csv_auto({raw_file})
  ) TO {final} (FORMAT PARQUET)
  ",
  raw_file = context$raw_file,
  final = context$final,
  .con = con
)
freeze_query
```

Now let's execute

```{r}
DBI::dbExecute(con, freeze_query)
```

# 2. Validation

A quick EDA on the dataset for QC and to build some familiarity before ingesting into the CCUH infrastructure. Let's first import the data.

```{r}
#| code-summary: "Import and QC"
## Import
df_data = context$final %>%
  open_dataset() %>%
  collect() %>%
  as_tibble()
```

## Valid tract10 geoid

Let's validate against the tract10 geoids in the CCUH universe, starting by pulling in the CCUH tract10 crosswalk.

```{r}
xwalk_tract10 = "//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/xwalk_tract10_v1/xwalk_tract10_v1.parquet" |>
  open_dataset() %>%
  rename(tract10 = tract10.census.geoid) %>%
  collect()
```

So we have 74,081 total tract10 geographic units.

```{r}
df_data %>%
  count(tract10) %>%
  verify(tract10 %in% xwalk_tract10$tract10) %>%
  nrow()
```

There are 72,539 census tracts in the NDVI dataset, all of which verified as valid in the CCUH tract10 universe. Let's confirm which ones are missing.

```{r}
df_qc = xwalk_tract10 %>%
  select(tract10, state.abb, state.contiguous, state.type) %>%
  left_join(
    df_data %>%
      select(tract10) %>%
      distinct() %>%
      mutate(has_ndvi = '1')
  )

df_qc %>%
  filter(is.na(has_ndvi)) %>%
  count(state.type, state.abb)
```

Makes sense: the census tracts that are not in the NDVI dataset are either in non-contiguous states (AK, HI) or in a territory (e.g. Guam).

Looks good to me.

## NA values

```{r}
df_data %>%
  mutate(ndvi_na = is.na(ndvi)) %>%
  count(ndvi_na)
```

There are some missing values, which are noted in the original documentation. NDVI is not available when:

- There is cloud cover
- There is snow cover
- Over bodies of water (these should be excluded)

Additionally, there's a specific known data gap: Summer 2003 has a known missing data issue.

## Completeness

We have 36 years (1984-2019) and 4 seasons per year, minus Summer 2003, which has a known missing data issue and is thus excluded. So we expect 36 * 4 - 1 = 143 data points per census tract.

```{r}
#| code-fold: false
df_valid = df_data %>%
  count(tract10, name = 'n_data_points') %>%
  verify(n_data_points == 143)
```

Indeed that is the case.

## Visual Inspection

Okay, no structural issues; let's do a quick visual inspection.

### NDVI over time

```{r}
#| code-summary: "Mean NDVI over time"
#| code-fold: true
df_data %>%
  group_by(year, season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = mean_ndvi, color = season)) +
  geom_line() +
  geom_vline(xintercept = c(1999, 2013), linetype = "dashed", alpha = 0.5) +
  annotate(
    "text",
    x = c(1991, 2006, 2016),
    y = max(df_data$ndvi, na.rm = TRUE),
    label = c("Landsat 5", "Landsat 7", "Landsat 8")
  ) +
  labs(
    title = "Mean NDVI by Season Over Time",
    subtitle = "Vertical lines indicate Landsat satellite transitions",
    x = "Year", y = "Mean NDVI",
    color = "Season"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```

### Seasonal Distribution

```{r}
df_data %>%
  ggplot(aes(x = ndvi, fill = season, color = season)) +
  geom_density(alpha = 0.3) + # Density instead of histogram for better overlap visualization
  labs(
    title = "NDVI Distribution by Season",
    subtitle = "Overlapping density plots show seasonal patterns",
    x = "NDVI Value", y = "Density"
  ) +
  theme_minimal() +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
```

```{r}
summary_stats <- df_data %>%
  group_by(season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    sd_ndvi = sd(ndvi, na.rm = TRUE),
    median_ndvi = median(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    n_tracts = n_distinct(tract10)
  )
summary_stats
```

Looks good to me.

# 3. Access

## Data

::: {.panel-tabset}

## R

```{r}
library(arrow)
library(tidyverse)

ds = context$final |> arrow::open_dataset()

df_ndvi_2018 = ds |>
  filter(year == 2018) %>%
  collect() %>%
  pivot_wider(
    names_from = season,
    names_prefix = 'ndvi_',
    values_from = ndvi
  )
```

## Python

```{python}
import polars as pl

ds_ndvi = pl.scan_parquet("//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet")

df = (
    ds_ndvi
    .filter(pl.col("year") == 2018)
    .collect()
)
```

:::

## Metadata

Minimalistic codebook can be found below; for details please refer to this [dataset documentation](https://www.notion.so/drexel-ccuh/Tract10-Seasonal-NDVI-1984-2019-17d57008e8858078b492ebb9f703955d) or the [source dataset documentation](https://www.notion.so/drexel-ccuh/UHC-NDVI-LandSat_MESANeighborhoods-Source-Dataset-17d57008e88580fdb75bd296754072f6).

```{r}
#| code-summary: 'NDVI measures metadata'
df_metadata <- tribble(
  ~var_name, ~var_def, ~var_type,
  "tract10", "11 character ID for census tract (2010)", "character",
  "year", "Year of observation, ranges from 1984-2019", "integer",
  "season", "Season of observation (Winter, Spring, Summer, Fall)", "character",
  "ndvi", "Normalized Difference Vegetation Index value, ranging from -1 to +1", "double"
)

df_metadata %>%
  reactable(
    searchable = TRUE,
    defaultPageSize = 99
  )
```