This notebook ingests the UHC - NDVI dataset into the CCUH infrastructure.
# 1. Freeze data

Let's freeze the external .csv into a parquet for long-term storage and column type casting. We do a few small transformations:

- rename GEOID10 to tract10
- cast tract10 as character, left-padded with zeros to length 11
- rename ct10_ndvi to ndvi

The query looks like:
```{r}
#| code-fold: false
# Create SQL to read CSV and write to Parquet
freeze_query <- glue_sql("
  COPY (
    SELECT
      LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
      year,
      season,
      ct10_ndvi AS ndvi
    FROM read_csv_auto({raw_file})
  ) TO {final} (FORMAT PARQUET)
  ",
  raw_file = context$raw_file,
  final = context$final,
  .con = con
)
freeze_query
```
```
<SQL> COPY (
  SELECT
    LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
    year,
    season,
    ct10_ndvi AS ndvi
  FROM read_csv_auto('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/raw/CT10NDVI_84_19.csv')
)
TO '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet' (FORMAT PARQUET)
```
Now let’s execute
```{r}
DBI::dbExecute(con, freeze_query)
```
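To make the padding step concrete: numeric storage drops leading zeros from GEOIDs, which `LPAD(CAST(GEOID10 AS VARCHAR), 11, '0')` restores. A minimal Python sketch of the same logic (the `pad_tract10` helper is hypothetical, just mirroring DuckDB's `LPAD`):

```python
def pad_tract10(geoid) -> str:
    """Zero-pad a 2010 tract GEOID to the canonical 11-character form,
    mirroring LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') in the freeze query."""
    return str(geoid).rjust(11, "0")

# An Alabama tract GEOID stored as an integer loses its leading zero:
print(pad_tract10(1001020100))  # → 01001020100
```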
# 2. Validation
A quick EDA on the dataset for QC and to build some familiarity before ingesting into the CCUH infrastructure. Let's first import the data.
There are some missing values, which are noted in the original documentation. NDVI is not available when:

- There is cloud cover
- There is snow cover
- Over bodies of water (these should be excluded)

Additionally, there's a specific known data gap: Summer 2003 has a known missing data issue.
## Completeness

We have 36 years (1984-2019) and 4 seasons per year, minus Summer 2003, which has a known missing data issue and is thus excluded. So we expect 36 * 4 - 1 = 143 data points per census tract.
```{r}
#| code-fold: false
df_valid = df_data %>%
  count(tract10, name = 'n_data_points') %>%
  verify(n_data_points == 143)
```
Indeed that is the case.
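The expected count can also be sanity-checked by enumerating the observation grid directly; a small Python sketch under the assumptions above (years 1984-2019, four seasons, Summer 2003 excluded):

```python
# Build the expected (year, season) grid and remove the known Summer 2003 gap.
seasons = ["Winter", "Spring", "Summer", "Fall"]
expected = {(y, s) for y in range(1984, 2020) for s in seasons}  # 36 * 4 = 144 cells
expected.discard((2003, "Summer"))

print(len(expected))  # → 143
```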
## Visual Inspection

Okay, no structural issues; let's do a quick visual inspection.
### NDVI over time
```{r}
#| code-summary: "Mean NDVI over time"
#| code-fold: true
df_data %>%
  group_by(year, season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = mean_ndvi, color = season)) +
  geom_line() +
  geom_vline(xintercept = c(1999, 2013), linetype = "dashed", alpha = 0.5) +
  annotate(
    "text",
    x = c(1991, 2006, 2016),
    y = max(df_data$ndvi, na.rm = TRUE),
    label = c("Landsat 5", "Landsat 7", "Landsat 8")
  ) +
  labs(
    title = "Mean NDVI by Season Over Time",
    subtitle = "Vertical lines indicate Landsat satellite transitions",
    x = "Year", y = "Mean NDVI",
    color = "Season"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```
### Seasonal Distribution

```{r}
df_data %>%
  ggplot(aes(x = ndvi, fill = season, color = season)) +
  geom_density(alpha = 0.3) + # Density instead of histogram for better overlap visualization
  labs(
    title = "NDVI Distribution by Season",
    subtitle = "Overlapping density plots show seasonal patterns",
    x = "NDVI Value", y = "Density"
  ) +
  theme_minimal() +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
```
```{r}
df_metadata <- tribble(
  ~var_name, ~var_def, ~var_type,
  "tract10", "11 character ID for census tract (2010)", "character",
  "year", "Year of observation, ranges from 1984-2019", "integer",
  "season", "Season of observation (Winter, Spring, Summer, Fall)", "character",
  "ndvi", "Normalized Difference Vegetation Index value, ranging from -1 to +1", "double"
)

df_metadata %>%
  reactable(
    searchable = TRUE,
    defaultPageSize = 99
  )
```
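The codebook's stated range for `ndvi` (-1 to +1) is easy to assert mechanically; a minimal sketch, with made-up sample values standing in for the parquet column and `None` standing in for the documented missing values (cloud, snow, water):

```python
def ndvi_in_range(v) -> bool:
    """True if v is a documented missing value (None) or within [-1, 1]."""
    return v is None or -1.0 <= v <= 1.0

# Toy values for illustration only; the real check would run over the parquet column.
ndvi_values = [0.42, -0.05, 0.78, None]
print(all(ndvi_in_range(v) for v in ndvi_values))  # → True
```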
Source Code
---
title: "ndvi_landsat_mesa_nbhd_v1"
description: "Ingestion of the UHC - NDVI dataset"
author:
  - name: Ran Li (Maintainer)
    orcid: 0000-0002-4699-4755
date: last-modified
format:
  html:
    toc: true
    self-contained: true
    code-fold: true
    df-print: kable
    code-tools: true
comments:
  hypothesis:
    theme: clean
editor: source
execute:
  warning: false
  message: false
  eval: true
editor_options:
  chunk_output_type: console
---

```{r}
#| code-summary: 'Setup'
#| code-fold: true

## Dependencies
pacman::p_load(
  tidyverse, assertr, arrow, duckdb, reactable, glue
)

## Local Context
context = lst(
  model = 'ndvi_landsat_mesa_nbhd_v1',
  server = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1',
  raw_folder = file.path(server, 'raw'),
  raw_file = file.path(raw_folder, 'CT10NDVI_84_19.csv'),
  final = file.path(server, glue("{model}.parquet"))
)

## Create database connection
con = dbConnect(duckdb())
```

This notebook ingests the UHC - NDVI dataset into the CCUH infrastructure.

# 1. Freeze data

Let's freeze the external .csv into a parquet for long-term storage and column type casting. We do a few small transformations:

- rename GEOID10 to tract10
- cast tract10 as character, left-padded with zeros to length 11
- rename ct10_ndvi to ndvi

The query looks like:

```{r}
#| code-fold: false
# Create SQL to read CSV and write to Parquet
freeze_query <- glue_sql("
  COPY (
    SELECT
      LPAD(CAST(GEOID10 AS VARCHAR), 11, '0') AS tract10,
      year,
      season,
      ct10_ndvi AS ndvi
    FROM read_csv_auto({raw_file})
  ) TO {final} (FORMAT PARQUET)
  ",
  raw_file = context$raw_file,
  final = context$final,
  .con = con
)
freeze_query
```

Now let's execute

```{r}
DBI::dbExecute(con, freeze_query)
```

# 2. Validation

A quick EDA on the dataset for QC and to build some familiarity before ingesting into the CCUH infrastructure. Let's first import the data.

```{r}
#| code-summary: "Import and QC"
## Import
df_data = context$final %>%
  open_dataset() %>%
  collect() %>%
  as_tibble()
```

## Valid tract10 geoid

Let's validate against the tract10 geoids in the CCUH universe, starting by pulling in the CCUH tract10 crosswalk.

```{r}
xwalk_tract10 = "//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/xwalk_tract10_v1/xwalk_tract10_v1.parquet" |>
  open_dataset() %>%
  rename(tract10 = tract10.census.geoid) %>%
  collect()
```

So we have 74,081 total tract10 geographic units.

```{r}
df_data %>%
  count(tract10) %>%
  verify(tract10 %in% xwalk_tract10$tract10) %>%
  nrow()
```

There are 72,539 census tracts in the NDVI dataset, all of which verified as valid in the CCUH tract10 universe. Let's confirm which ones are missing.

```{r}
df_qc = xwalk_tract10 %>%
  select(tract10, state.abb, state.contiguous, state.type) %>%
  left_join(
    df_data %>%
      select(tract10) %>%
      distinct() %>%
      mutate(has_ndvi = '1')
  )

df_qc %>%
  filter(is.na(has_ndvi)) %>%
  count(state.type, state.abb)
```

Makes sense: the census tracts that are not in the NDVI dataset are either in non-contiguous states (AK, HI) or in a territory (e.g. Guam).

Looks good to me.

## NA values

```{r}
df_data %>%
  mutate(ndvi_na = is.na(ndvi)) %>%
  count(ndvi_na)
```

There are some missing values, which are noted in the original documentation. NDVI is not available when:

- There is cloud cover
- There is snow cover
- Over bodies of water (these should be excluded)

Additionally, there's a specific known data gap: Summer 2003 has a known missing data issue.

## Completeness

We have 36 years (1984-2019) and 4 seasons per year, minus Summer 2003, which has a known missing data issue and is thus excluded. So we expect 36 * 4 - 1 = 143 data points per census tract.

```{r}
#| code-fold: false
df_valid = df_data %>%
  count(tract10, name = 'n_data_points') %>%
  verify(n_data_points == 143)
```

Indeed that is the case.

## Visual Inspection

Okay, no structural issues; let's do a quick visual inspection.

### NDVI over time

```{r}
#| code-summary: "Mean NDVI over time"
#| code-fold: true
df_data %>%
  group_by(year, season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = year, y = mean_ndvi, color = season)) +
  geom_line() +
  geom_vline(xintercept = c(1999, 2013), linetype = "dashed", alpha = 0.5) +
  annotate(
    "text",
    x = c(1991, 2006, 2016),
    y = max(df_data$ndvi, na.rm = TRUE),
    label = c("Landsat 5", "Landsat 7", "Landsat 8")
  ) +
  labs(
    title = "Mean NDVI by Season Over Time",
    subtitle = "Vertical lines indicate Landsat satellite transitions",
    x = "Year", y = "Mean NDVI",
    color = "Season"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```

### Seasonal Distribution

```{r}
df_data %>%
  ggplot(aes(x = ndvi, fill = season, color = season)) +
  geom_density(alpha = 0.3) + # Density instead of histogram for better overlap visualization
  labs(
    title = "NDVI Distribution by Season",
    subtitle = "Overlapping density plots show seasonal patterns",
    x = "NDVI Value", y = "Density"
  ) +
  theme_minimal() +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    plot.title = element_text(size = 14),
    plot.subtitle = element_text(size = 12)
  )
```

```{r}
summary_stats <- df_data %>%
  group_by(season) %>%
  summarise(
    mean_ndvi = mean(ndvi, na.rm = TRUE),
    sd_ndvi = sd(ndvi, na.rm = TRUE),
    median_ndvi = median(ndvi, na.rm = TRUE),
    missing_pct = mean(is.na(ndvi)) * 100,
    n_tracts = n_distinct(tract10)
  )
summary_stats
```

Looks good to me.

# 3. Access

## Data

::: {.panel-tabset}

## R

```{r}
library(arrow)
library(tidyverse)

ds = context$final |> arrow::open_dataset()

df_ndvi_2018 = ds |>
  filter(year == 2018) %>%
  collect() %>%
  pivot_wider(
    names_from = season,
    names_prefix = 'ndvi_',
    values_from = ndvi
  )
```

## Python

```{python}
import polars as pl

ds_ndvi = pl.scan_parquet("//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/ndvi_landsat_mesa_nbhd_v1/ndvi_landsat_mesa_nbhd_v1.parquet")

df = (
    ds_ndvi
    .filter(pl.col("year") == 2018)
    .collect()
)
```

:::

## Metadata

Minimalistic codebook can be found below; for details please refer to this [dataset documentation](https://www.notion.so/drexel-ccuh/Tract10-Seasonal-NDVI-1984-2019-17d57008e8858078b492ebb9f703955d) or the [source dataset documentation](https://www.notion.so/drexel-ccuh/UHC-NDVI-LandSat_MESANeighborhoods-Source-Dataset-17d57008e88580fdb75bd296754072f6).

```{r}
#| code-summary: 'NDVI measures metadata'
df_metadata <- tribble(
  ~var_name, ~var_def, ~var_type,
  "tract10", "11 character ID for census tract (2010)", "character",
  "year", "Year of observation, ranges from 1984-2019", "integer",
  "season", "Season of observation (Winter, Spring, Summer, Fall)", "character",
  "ndvi", "Normalized Difference Vegetation Index value, ranging from -1 to +1", "double"
)

df_metadata %>%
  reactable(
    searchable = TRUE,
    defaultPageSize = 99
  )
```