ndvi_landsat_mesa_nbhd_v1

Ingestion of the UHC - NDVI dataset
Author

Ran Li (Maintainer)

Published

January 17, 2025

Setup
## Dependencies
pacman::p_load(
  tidyverse, assertr, arrow, duckdb, reactable, glue
)

## Local Context
context = lst(
  model = 'ndvi_landsat_mesa_nbhd_v1',
  server = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1',
  raw_folder = file.path(server, 'raw'),
  raw_file = file.path(raw_folder, 'CT10NDVI_84_19.csv'),
  final = file.path(server, glue("{model}.parquet"))
)

## Create database connection
con = dbConnect(duckdb())

This notebook will ingest the UHC - NDVI dataset into the CCUH infrastructure.

1. Freeze data

Let's freeze the external .csv into a Parquet file for long-term storage and column type casting. We do a few small transformations:

  • rename GEOID10 to tract10
  • cast tract10 to character, left-padded with zeros to a length of 11
  • rename ct10_ndvi to ndvi

The query looks like:

# Create SQL to read CSV and write to Parquet
freeze_query <- glue_sql("
  COPY (
    SELECT 
      LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
      year, 
      season,
      ct10_ndvi AS ndvi
    FROM read_csv_auto({raw_file})
  )
  TO {final} (FORMAT PARQUET)
  ", 
  raw_file = context$raw_file,
  final = context$final,
  .con = con
)

freeze_query
<SQL> COPY (
  SELECT 
    LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
    year, 
    season,
    ct10_ndvi AS ndvi
  FROM read_csv_auto('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/raw/CT10NDVI_84_19.csv')
)
TO '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet' (FORMAT PARQUET)

Now let's execute it:

```{r}
DBI::dbExecute(con, freeze_query)
```
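
As a quick sanity check (a sketch, assuming the DuckDB connection con is still open), we can count the rows written to the Parquet file:

```{r}
# Count rows in the frozen Parquet file; should match the raw CSV row count
dbGetQuery(con, glue_sql(
  "SELECT COUNT(*) AS n_rows FROM read_parquet({final})",
  final = context$final,
  .con = con
))
```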

2. Validation

A quick EDA on the dataset to do some QC and get some familiarity before ingesting it into the CCUH infrastructure. Let's first import the data.

Import and QC
## Import
df_data = context$final %>% 
  open_dataset() %>% 
  collect() %>% 
  as_tibble()

Valid tract10 geoid

Let's validate against the tract10 GEOIDs in the CCUH universe. First, pull in the CCUH tract10 crosswalk:

Code
xwalk_tract10 = "//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/xwalk_tract10_v1/xwalk_tract10_v1.parquet" %>% 
  open_dataset() %>% 
  rename(tract10 = tract10.census.geoid) %>% 
  collect()

So we have 74,081 total tract10 geographic units.
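
For reference, that count is a one-liner against the crosswalk loaded above:

Code
nrow(xwalk_tract10)   # 74,081 tract10 units in the CCUH universe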

Code
df_data %>% 
  count(tract10) %>% 
  verify(tract10 %in% xwalk_tract10$tract10) %>% 
  nrow() 
[1] 72539

There are 72,539 census tracts in the NDVI dataset, all of which are valid in the CCUH tract10 universe. Let's confirm which ones are missing.

Code
df_qc = xwalk_tract10 %>% 
  select(tract10, state.abb, state.contiguous, state.type) %>% 
  left_join(
    df_data %>% 
      select(tract10) %>% 
      distinct() %>% 
      mutate(has_ndvi = '1'),
    by = 'tract10'
  )

df_qc %>% 
  filter(is.na(has_ndvi)) %>% 
  count(state.type, state.abb)
state.type   state.abb      n
state        AK           167
state        HI           351
territory    NA          1034

That makes sense: the census tracts absent from the NDVI dataset are either in non-contiguous states (AK, HI) or in a territory (e.g. Guam).

Looks good to me.

NA values

Code
df_data %>% 
  mutate(ndvi_na = is.na(ndvi)) %>% 
  count(ndvi_na)
ndvi_na          n
FALSE     10154142
TRUE        218935

There are some missing values, which are noted in the original documentation. NDVI is not available when:

  • There is cloud cover
  • There is snow cover
  • It is over a body of water (these should be excluded)

Additionally, there is one known data gap: Summer 2003 (during the Landsat 7 era) has a known missing data issue and is absent from the dataset.
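
As a quick check (a sketch, assuming df_data from the import step above), we can confirm that Summer 2003 is the only missing year-season combination:

Code
df_data %>% 
  distinct(year, season) %>% 
  mutate(present = TRUE) %>% 
  complete(year, season) %>%   # expand to the full year x season grid
  filter(is.na(present))       # should return only 2003 / summer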

Completeness

We have 36 years (1984-2019) and 4 seasons per year, minus Summer 2003, which is excluded for the missing data issue noted above. So we expect 36 * 4 - 1 = 143 data points per census tract.

df_valid = df_data %>% 
  count(tract10, name = 'n_data_points') %>% 
  verify(n_data_points == 143)

Indeed that is the case.

Visual Inspection

Okay, no structural issues; let's do a quick visual inspection.

NDVI over time

Mean NDVI over time
df_data %>%
  group_by(year, season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = mean_ndvi, color = season)) +
  geom_line() +
  geom_vline(xintercept = c(1999, 2013), linetype = "dashed", alpha = 0.5) +
  annotate("text", x = c(1991, 2006, 2016), y = max(df_data$ndvi, na.rm = TRUE), 
           label = c("Landsat 5", "Landsat 7", "Landsat 8")) +
  labs(title = "Mean NDVI by Season Over Time",
       subtitle = "Vertical lines indicate Landsat satellite transitions",
       x = "Year", y = "Mean NDVI",
       color = "Season") +
  theme_minimal() +
  scale_color_viridis_d()

Seasonal Distribution

Code
df_data %>%
  ggplot(aes(x = ndvi, fill = season, color = season)) +
  geom_density(alpha = 0.3) +  # Using density instead of histogram for better overlap visualization
  labs(title = "NDVI Distribution by Season",
       subtitle = "Overlapping density plots show seasonal patterns",
       x = "NDVI Value", 
       y = "Density") +
  theme_minimal() +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )

Code
summary_stats <- df_data %>%
  group_by(season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    sd_ndvi = sd(ndvi, na.rm = TRUE),
    median_ndvi = median(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    n_tracts = n_distinct(tract10)
  )

summary_stats
season   mean_ndvi   sd_ndvi     median_ndvi   missing_pct   n_tracts
fall     0.1978961   0.0716868   0.2035830     0.9730015     72539
spring   0.1823133   0.0742116   0.1812400     1.6866406     72539
summer   0.2562190   0.0996391   0.2646248     0.7206085     72539
winter   0.1058239   0.0602066   0.1072993     5.1465123     72539

Looks good to me.

3. Access

Data

```{r}
library(arrow)
library(tidyverse)

ds = "//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet" |> 
  arrow::open_dataset() 

df_ndvi_2018 = ds |>
  filter(year == 2018) |>
  collect() |>
  pivot_wider(
    names_from = season,
    names_prefix = 'ndvi_',
    values_from = ndvi
  )

```
```{python}
import polars as pl

ds_ndvi = pl.scan_parquet("//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet")

df = (ds_ndvi
    .filter(
        (pl.col("year") == 2018)
    )
    .collect()
)

```
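
Alternatively (a sketch, reusing the DuckDB setup from the freeze step), the same Parquet file can be queried directly with SQL:

```{r}
library(duckdb)

# Query the frozen Parquet file directly with DuckDB SQL
con <- dbConnect(duckdb())
df_ndvi_2018 <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet')
  WHERE year = 2018
")
dbDisconnect(con, shutdown = TRUE)
```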

Metadata

A minimal codebook can be found below; for details, please refer to this dataset's documentation or the source dataset documentation.

NDVI measures metadata
df_metadata <- tribble(
  ~var_name,    ~var_def,                                                                   ~var_type,
  "tract10",    "11 character ID for census tract (2010)",                                  "character",
  "year",       "Year of observation, ranges from 1984-2019",                               "integer",
  "season",     "Season of observation (Winter, Spring, Summer, Fall)",                     "character",
  "ndvi",       "Normalized Difference Vegetation Index value, ranging from -1 to +1",      "double"
)
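
As a quick consistency check (a sketch, assuming df_data is still in memory), the codebook's variable names should match the dataset columns:

Code
setdiff(df_metadata$var_name, names(df_data))   # character(0) if the codebook matches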

df_metadata %>% 
  reactable(
    searchable = TRUE,
    defaultPageSize = 99
  )