Setup
library(tidyverse)
library(arrow)
library(duckdb)
Scratch Pad for developing SQL Models
library(tidyverse)
library(arrow)
library(duckdb)
This is a DBT model scratch notebook. It is not version controlled and sits at the DBT dev folder in //files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/ccuh-server/freeze/_datawarehouse/dev/_scr.qmd
. We use this to develop our SQL data models within the datawarehouse. It is deployed as a page into the CCUH quarto-pub site and documented in the Notion Notebooks Database in case we need to reference it in the future.
So we want to start from the harmonized__salurbal_din__v1
data model and we just want to count by L2 and day.
Lets just build this out in R first.
= 'harmonized__salurbal_din__v1.parquet' |>
api ::open_dataset()
arrow
= api |>
df1 filter(as.numeric(MONTH) < 90, as.numeric(DAY) < 90) |>
mutate(DOD = as.Date(paste(YEAR, MONTH, DAY, sep = "-"))) |>
count(iso2, SALID2, DOD, name = 'n_deaths_r') |>
collect()
head(df1)
iso2 | SALID2 | DOD | n_deaths_r |
---|---|---|---|
BR | 10216315 | 2000-11-01 | 3 |
BR | 10216610 | 2000-12-21 | 7 |
BR | 10224810 | 2000-04-01 | 199 |
BR | 10225211 | 2000-12-25 | 3 |
BR | 10219211 | 2000-11-05 | 1 |
BR | 10224832 | 2000-09-11 | 3 |
There is a interesting thing where we have some month and day with value of 99
:
98
or 99
)|>
api mutate(invalid_DAY = as.integer(DAY)>31) |>
count(iso2, invalid_DAY) |>
collect() |>
arrange(desc(n))
iso2 | invalid_DAY | n |
---|---|---|
BR | FALSE | 14143215 |
GT | FALSE | 265438 |
PA | TRUE | 99445 |
PA | FALSE | 92006 |
BR | TRUE | 2 |
This is a big issue … for now we will just filter out these records without DAY. a severe limitation for Panama death counts.
# Create a connection to DuckDB
<- dbConnect(duckdb())
con
# Read the parquet file into DuckDB
::duckdb_register_arrow(
duckdb
con,"harmonized__salurbal_din__v1",
::open_dataset('harmonized__salurbal_din__v1.parquet')
arrow
)
# Execute the SQL query
<- dbGetQuery(con, "
df2 SELECT
iso2,
SALID2,
MAKE_DATE(
CAST(YEAR AS INTEGER),
CAST(MONTH AS INTEGER),
CAST(DAY AS INTEGER)
) as DOD,
COUNT(*) as n_deaths_sql
FROM harmonized__salurbal_din__v1
WHERE CAST(DAY AS INTEGER) < 90
AND CAST(MONTH AS INTEGER) < 90
GROUP BY
iso2,
SALID2,
MAKE_DATE(
CAST(YEAR AS INTEGER),
CAST(MONTH AS INTEGER),
CAST(DAY AS INTEGER)
)
")
= df1 |>
dfa left_join(df2 )
cor(dfa$n_deaths_r, dfa$n_deaths_sql)
[1] 1
## Get DBT Result
= 'harmonized__salurbal_din_counts_l2__v1.parquet' |>
df_dbt open_dataset() |>
collect()
## MErge
= df1 |>
dfb left_join(df_dbt )
## Correlation
cor(dfb$n_deaths_r, dfb$n_deaths)
[1] 1