Introduction to danstat

Introduction

This vignette introduces the danstat package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.

library(danstat)
library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows

Exploring a subject

In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects() returns the following data frame:

get_subjects()
#>    id            description active hasSubjects subjects
#> 1   1                 People   TRUE        TRUE     NULL
#> 2   2      Labour and income   TRUE        TRUE     NULL
#> 3   3                Economy   TRUE        TRUE     NULL
#> 4   4      Social conditions   TRUE        TRUE     NULL
#> 5   5 Education and research   TRUE        TRUE     NULL
#> 6   6               Business   TRUE        TRUE     NULL
#> 7   7              Transport   TRUE        TRUE     NULL
#> 8   8    Culture and leisure   TRUE        TRUE     NULL
#> 9   9 Environment and energy   TRUE        TRUE     NULL
#> 10 19                  Other   TRUE        TRUE     NULL

The subjects of interest for us are “Business” and “Transport”. We look into more details into these to subjects:

subj <- get_subjects(subjects = c("6","7"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#>      id                            description active hasSubjects subjects
#> 1  3456                  Structure of business   TRUE        TRUE     NULL
#> 2  3457                       Business economy   TRUE        TRUE     NULL
#> 3  3458              International enterprises   TRUE        TRUE     NULL
#> 4  3460 Agriculture, horticulture and forestry   TRUE        TRUE     NULL
#> 5  3461                Fishery and aquaculture   TRUE        TRUE     NULL
#> 6  3462               Manufacturing industries   TRUE        TRUE     NULL
#> 7  3463                           Construction   TRUE        TRUE     NULL
#> 8  3465                                  Trade   TRUE        TRUE     NULL
#> 9  3466               Accommodation and travel   TRUE        TRUE     NULL
#> 10 3450                       Financial sector   TRUE        TRUE     NULL
#> 11 3467                         Service sector   TRUE        TRUE     NULL
#> 12 3459              Business tendency surveys   TRUE        TRUE     NULL
#> 13 3440                     Means of transport   TRUE        TRUE     NULL
#> 14 3469             Traffic and infrastructure   TRUE        TRUE     NULL
#> 15 3115                    Passenger transport   TRUE        TRUE     NULL
#> 16 3464                        Goods transport   TRUE        TRUE     NULL
#> 17 3413                      Traffic accidents   TRUE        TRUE     NULL

Getting table information and variables

Now we would like to see what data tables are available for subjects “Distributive trades” and “Traffic accidents”:

tables <- get_tables(subjects = c("3465", "3413")) 
tables %>% 
  select(id, text, variables) %>% 
  kable()

id	text	variables
DETA211	Retail Trade Index	industry (DB07), time
DETA212	Retail Trade Index	commodity group, index type , time
LAG8	Stocks of wholesale	industry (DB07), price unit , time
LAG8S	Stocks wholesale	industry (DB07) , seasonal adjustment, time
ALKO6	Sales of alcohol and tobacco, subject to excises duties	type, time
ALKO5	Sales of alcohol and tobac	type, time
ALKO2	Sales of alcohol and tobacco, subject to excises duties, by pop.	type, time
UHELD3	Road traffic accidents	type of accident , accident situation, urban area , speed limit , time
UHELD4	Road traffic accidents	type of accident , means of transport, hour , day of the week , month , time
UHELD5	Road traffic accidents	type of accident , accident situation , means of transport , other units involved, time
UHELD6	Road traffic accidents	type of accident , accident situation , means of transport , number of transport units involved, time
UHELDK7	Road traffic accidents	type of accident , region , urban area , accident situation, time
UHELD8	Injured and killed in road traffic accidents	type of accident , casualty , means of transport, sex , age , type of injury , time
UHELD9	Injured and killed in road traffic accidents	type of accident , casualty , means of transport, type of person , urban area , type of road , time
UHELD10	Injured and killed in road traffic accidents	type of accident , accident situation , casualty , means of transport , other units involved, time
UHELD11	Injured and killed in road traffic accidents	type of accident , casualty , means of transport , position , type of person , influenced by alcohol, time
UHELDK1	Injured and killed in road traffic accidents	region , casualty , motor vehicles involved, age , sex , time
UHELDK2	Injured and killed in alcohol accidents	region , casualty, age , sex , time
UHELD12	Drivers and pedestrians	means of transport , sex , age , age of driving license , level of alcohol in blood, time
MOERKE	Injured in road traffic accidents reported by the police and casualty wards	reporter , accident situation, means of transport, sex , age , type of injury , time
MOERKE1	Injured in road traffic accidents reported by casualty wards	reporter , accident situation , means of transport , sex , age , casualty (classified by diagnosis), time
LIGEHI10	Gender equality indicator of injured and killed in road traffic accidents	indicator , casualty , motor vehicles involved, age , time
BANE91	Fatalities and injuries in railway traffic accidents	railway system , category of person, casualty , time
BANE92	Fatalities and injuries in railway traffic accidents	railway system , type of accident, casualty , time
SKIB95	Accidents at sea involving danish vessels	type of accident, type of vessel , scope , time
SKIB96	Accidents at sea in Danish sea territory	type of accident , type of vessel , scope , land of registration, time

If we would like to examine number of traffic accidents by hour of day (which is an interesting question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.

vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)

vars_acc %>% 
  select(id, text)
#>         id               text
#> 1   UHELDA   type of accident
#> 2 TRANSMID means of transport
#> 3     KLOK               hour
#> 4      UGE    day of the week
#> 5      MND              month
#> 6      Tid               time

vars_alco %>% 
  select(id, text)
#>     id text
#> 1 TYPE type
#> 2  Tid time

Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.

vars_acc$values[1] %>% 
  kable()

id	text
0000	All accidents
1000	Accidents involving persons influenced by alcohol
2000	Others accidents


vars_alco$values[1] %>% 
  kable()

id	text
09	Sales of lager equivalents, total excl. tax free beer (index 2000=100) (2000-)
010	Sales of wine, total (indeks 2000=100)
011	Sales of spirits, total (indeks 2000=100)
0111	Sales of alcopops, total (indeks 2006=100)(2006-)
055	Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-
060	Consumption of wine, total (index 2000=100) (2000-)
065	Consumption of spirits, total (index 2000=100) (2000-)
070	Consumption of alcopops, total (index 2006=100) (2006-)

To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000) from the “UHELDA” variable and values c("09", "055") (sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA.

Pulling data from the API

We can now pull the needed data from the API - first the accident numbers:

variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
accidents <- get_data("uheld4", variables = variable_input)
#> Rows: 1350 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): UHELDA, KLOK
#> dbl (2): TID, INDHOLD
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(accidents) %>% kable()

UHELDA	KLOK	TID	INDHOLD
Accidents involving persons influenced by alcohol	00-01	1997	94
Accidents involving persons influenced by alcohol	00-01	1998	85
Accidents involving persons influenced by alcohol	00-01	1999	81
Accidents involving persons influenced by alcohol	00-01	2000	87
Accidents involving persons influenced by alcohol	00-01	2001	77
Accidents involving persons influenced by alcohol	00-01	2002	91

and then the alcohol sales and consumption values:

variable_codes <- vars_alco$id 
variable_values <- list(c("055", "09"), NA) # All values for Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
alcohol <- get_data("alko3", variables = variable_input)
#> Rows: 132 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): TYPE, INDHOLD
#> dbl (1): TID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  head() %>% 
  kable()

TYPE	TID	INDHOLD
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2000	100
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2001	97
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2002	96
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2003	98
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2004	93
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-	2005	93

Analysis 1: Duirnal patterns of traffic accidents

An interesting analysis would be to see at which time of day most accidents occur:

accidents_by_hour <- accidents %>%
    filter(KLOK != "Not stated") %>%
    group_by(UHELDA, KLOK) %>%
    summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))
#> `summarise()` has grouped output by 'UHELDA'. You can override using the
#> `.groups` argument.

accidents_by_hour %>%
    ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
    geom_line() +
    geom_point() +
    theme_bw() + 
  theme(legend.position="top") +
  labs(x = "Time of day", y = "Average annual accidents")

A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50 occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.

Analysis 2: Alcohol sales and alcohol-related accidents

First, we summarize the accidents dataset and check the length of the series in both datasets

accidents_by_year <- accidents %>% 
  group_by(UHELDA, TID) %>% 
  summarize(INDHOLD = sum(INDHOLD)) %>% 
  ungroup()
#> `summarise()` has grouped output by 'UHELDA'. You can override using the
#> `.groups` argument.

accidents_by_year %>%
  group_by(UHELDA) %>% 
  summarise(min(TID), max(TID))
#> # A tibble: 2 × 3
#>   UHELDA                                            `min(TID)` `max(TID)`
#>   <chr>                                                  <dbl>      <dbl>
#> 1 Accidents involving persons influenced by alcohol       1997       2023
#> 2 Others accidents                                        1997       2023

alcohol_by_year <- alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  mutate(INDHOLD = as.numeric(INDHOLD))

alcohol_by_year %>%
  group_by(TYPE) %>% 
  summarise(min(TID), max(TID))
#> # A tibble: 2 × 3
#>   TYPE                                                     `min(TID)` `max(TID)`
#>   <chr>                                                         <dbl>      <dbl>
#> 1 Consumption of lager equivalents, total, excl. tax free…       2000       2010
#> 2 Sales of lager equivalents, total excl. tax free beer (…       1955       2020

Since alcohol consumption is only available up to 2010, we will use alcohol sales as a proxy. Also, as the alcohol consumption is indexed with year 2000 as index 100, we transoform the accidents values in the same way.


alcohol_data <- alcohol_by_year %>% 
  filter(between(TID, 1997, 2021),
         grepl("sales", TYPE, ignore.case = TRUE)) %>% 
  select(year = TID,
         alcohol_sales = INDHOLD)

accidents_data <- accidents_by_year %>% 
  filter(grepl("alcohol", UHELDA, ignore.case = TRUE)) %>% 
  select(year = TID,
         alcohol_accidents = INDHOLD)

accidents_2000 <- accidents_data %>% 
  filter(year == 2000) %>% 
  pull(alcohol_accidents)
  
accidents_data$alcohol_accidents <- round((accidents_data$alcohol_accidents/accidents_2000)*100)

data <- inner_join(alcohol_data, accidents_data, by = "year") %>% 
  mutate(risk_index = round(alcohol_accidents/alcohol_sales*100))

data %>% 
  ggplot(aes(x=year, y=risk_index)) +
  geom_col()

As this vignette does not by any means aspire to the standards of rigourous academic research, we can only say here that there seems to be an indication that the risk of alcohol-related accidents has dropped substantially since 2000. Sales of alcohol might be a poor proxy for consumption, patterns of cross-border purchase of alcohol may have changed over the past 20 years, cars have definitely become safer over time, etc. Disentangling these effects is well beyond the scope of this article.