DBT__harmonized__salurbal_din_counts_l2__v1

Scratch Pad for developing SQL Models

Authors

Ran Li

Jessica Uruchima

Published

February 12, 2025

Setup
library(tidyverse)
library(arrow)
library(duckdb)
Note

This is a DBT model scratch notebook. It is not version controlled and sits at the DBT dev folder in //files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/_datawarehouse/dev/_scr.qmd. We use this to develop our SQL data models within the datawarehouse. It is deployed as a page into the CCUH quarto-pub site and documented in the Notion Notebooks Database in case we need to reference it in the future.

So we want to start from the harmonized__salurbal_din__v1 data model and we just want to count by L2 and day.

Develop

R

Lets just build this out in R first.

api = 'harmonized__salurbal_din__v1.parquet' |> 
    arrow::open_dataset()

df1 = api |> 
    filter(as.numeric(MONTH) < 90, as.numeric(DAY) < 90) |> 
    mutate(DOD = as.Date(paste(YEAR, MONTH, DAY, sep = "-"))) |> 
    count(iso2, SALID2, DOD, name = 'n_deaths_r') |> 
    collect() 

head(df1)
iso2 SALID2 DOD n_deaths_r
BR 10216315 2000-11-01 3
BR 10216610 2000-12-21 7
BR 10224810 2000-04-01 199
BR 10225211 2000-12-25 3
BR 10219211 2000-11-05 1
BR 10224832 2000-09-11 3

There is a interesting thing where we have some month and day with value of 99:

  • 1 month - we can drop
  • But for PA - almost half of death records are missing DAY (coded as 98 or 99)
api |> 
  mutate(invalid_DAY = as.integer(DAY)>31) |> 
  count(iso2, invalid_DAY) |> 
  collect() |>
  arrange(desc(n))
iso2 invalid_DAY n
BR FALSE 14143215
GT FALSE 265438
PA TRUE 99445
PA FALSE 92006
BR TRUE 2

This is a big issue … for now we will just filter out these records without DAY. a severe limitation for Panama death counts.

SQL

# Create a connection to DuckDB
con <- dbConnect(duckdb())

# Read the parquet file into DuckDB
duckdb::duckdb_register_arrow(
  con,
  "harmonized__salurbal_din__v1",
  arrow::open_dataset('harmonized__salurbal_din__v1.parquet')
)

# Execute the SQL query
df2 <- dbGetQuery(con, "
  SELECT 
    iso2,
    SALID2,
    MAKE_DATE(
        CAST(YEAR AS INTEGER),
        CAST(MONTH AS INTEGER),
        CAST(DAY AS INTEGER)
    ) as DOD,
    COUNT(*) as n_deaths_sql
  FROM harmonized__salurbal_din__v1
  WHERE CAST(DAY AS INTEGER) < 90 
    AND CAST(MONTH AS INTEGER) < 90
  GROUP BY
    iso2,
    SALID2,
    MAKE_DATE(
        CAST(YEAR AS INTEGER),
        CAST(MONTH AS INTEGER),
        CAST(DAY AS INTEGER)
    )
")

Compare

dfa = df1 |> 
    left_join(df2 )


cor(dfa$n_deaths_r, dfa$n_deaths_sql)
[1] 1

Validate DBT Result

## Get DBT Result
df_dbt = 'harmonized__salurbal_din_counts_l2__v1.parquet' |> 
    open_dataset() |> 
    collect()

## MErge
dfb = df1 |> 
    left_join(df_dbt )

## Correlation
cor(dfb$n_deaths_r, dfb$n_deaths)
[1] 1