This vignette introduces the danstat
package by means of
a simple use case to illustrate the data discovery process. The package
is designed to give easy and intuitive access to the Danmarks Statistik
(Statistics Denmark) Statistikbank API which allows access to all
published data in StatBank. As a public institution, Statistics Denmark
provides open access to the data, so the API doesn’t require
authentication and has no rate limits. The data is free to re-use and
reproduce even for commercial use, however, source accreditation to
Statistics Denmark should be given. See terms for the Use and reuse of data.
library(danstat)
library(purrr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#>
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#>
#> group_rows
In this example we would like to examine alcohol related incidents
and the consumption of alcohol in Denmark. An inital call to
get_subjects()
returns the following data frame:
get_subjects()
#> id description active hasSubjects subjects
#> 1 1 People TRUE TRUE NULL
#> 2 2 Labour and income TRUE TRUE NULL
#> 3 3 Economy TRUE TRUE NULL
#> 4 4 Social conditions TRUE TRUE NULL
#> 5 5 Education and research TRUE TRUE NULL
#> 6 6 Business TRUE TRUE NULL
#> 7 7 Transport TRUE TRUE NULL
#> 8 8 Culture and leisure TRUE TRUE NULL
#> 9 9 Environment and energy TRUE TRUE NULL
#> 10 19 Other TRUE TRUE NULL
The subjects of interest for us are “Business” and “Transport”. We look into more details into these to subjects:
subj <- get_subjects(subjects = c("6","7"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#> id description active hasSubjects subjects
#> 1 3456 Structure of business TRUE TRUE NULL
#> 2 3457 Business economy TRUE TRUE NULL
#> 3 3458 International enterprises TRUE TRUE NULL
#> 4 3460 Agriculture, horticulture and forestry TRUE TRUE NULL
#> 5 3461 Fishery and aquaculture TRUE TRUE NULL
#> 6 3462 Manufacturing industries TRUE TRUE NULL
#> 7 3463 Construction TRUE TRUE NULL
#> 8 3465 Trade TRUE TRUE NULL
#> 9 3466 Accommodation and travel TRUE TRUE NULL
#> 10 3450 Financial sector TRUE TRUE NULL
#> 11 3467 Service sector TRUE TRUE NULL
#> 12 3459 Business tendency surveys TRUE TRUE NULL
#> 13 3440 Means of transport TRUE TRUE NULL
#> 14 3469 Traffic and infrastructure TRUE TRUE NULL
#> 15 3115 Passenger transport TRUE TRUE NULL
#> 16 3464 Goods transport TRUE TRUE NULL
#> 17 3413 Traffic accidents TRUE TRUE NULL
Now we would like to see what data tables are available for subjects “Distributive trades” and “Traffic accidents”:
tables <- get_tables(subjects = c("3465", "3413"))
tables %>%
select(id, text, variables) %>%
kable()
id | text | variables |
---|---|---|
DETA211 | Retail Trade Index | industry (DB07), time |
DETA212 | Retail Trade Index | commodity group, index type , time |
LAG8 | Stocks of wholesale | industry (DB07), price unit , time |
LAG8S | Stocks wholesale | industry (DB07) , seasonal adjustment, time |
ALKO6 | Sales of alcohol and tobacco, subject to excises duties | type, time |
ALKO5 | Sales of alcohol and tobac | type, time |
ALKO2 | Sales of alcohol and tobacco, subject to excises duties, by pop. | type, time |
UHELD3 | Road traffic accidents | type of accident , accident situation, urban area , speed limit , time |
UHELD4 | Road traffic accidents | type of accident , means of transport, hour , day of the week , month , time |
UHELD5 | Road traffic accidents | type of accident , accident situation , means of transport , other units involved, time |
UHELD6 | Road traffic accidents | type of accident , accident situation , means of transport , number of transport units involved, time |
UHELDK7 | Road traffic accidents | type of accident , region , urban area , accident situation, time |
UHELD8 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport, sex , age , type of injury , time |
UHELD9 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport, type of person , urban area , type of road , time |
UHELD10 | Injured and killed in road traffic accidents | type of accident , accident situation , casualty , means of transport , other units involved, time |
UHELD11 | Injured and killed in road traffic accidents | type of accident , casualty , means of transport , position , type of person , influenced by alcohol, time |
UHELDK1 | Injured and killed in road traffic accidents | region , casualty , motor vehicles involved, age , sex , time |
UHELDK2 | Injured and killed in alcohol accidents | region , casualty, age , sex , time |
UHELD12 | Drivers and pedestrians | means of transport , sex , age , age of driving license , level of alcohol in blood, time |
MOERKE | Injured in road traffic accidents reported by the police and casualty wards | reporter , accident situation, means of transport, sex , age , type of injury , time |
MOERKE1 | Injured in road traffic accidents reported by casualty wards | reporter , accident situation , means of transport , sex , age , casualty (classified by diagnosis), time |
LIGEHI10 | Gender equality indicator of injured and killed in road traffic accidents | indicator , casualty , motor vehicles involved, age , time |
BANE91 | Fatalities and injuries in railway traffic accidents | railway system , category of person, casualty , time |
BANE92 | Fatalities and injuries in railway traffic accidents | railway system , type of accident, casualty , time |
SKIB95 | Accidents at sea involving danish vessels | type of accident, type of vessel , scope , time |
SKIB96 | Accidents at sea in Danish sea territory | type of accident , type of vessel , scope , land of registration, time |
If we would like to examine number of traffic accidents by hour of day (which is an interesting question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.
vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)
vars_acc %>%
select(id, text)
#> id text
#> 1 UHELDA type of accident
#> 2 TRANSMID means of transport
#> 3 KLOK hour
#> 4 UGE day of the week
#> 5 MND month
#> 6 Tid time
vars_alco %>%
select(id, text)
#> id text
#> 1 TYPE type
#> 2 Tid time
Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.
|
|
To compare the diurnal patterns of alcohol-related accidents
vs. other accidents we will select values c(1000, 2000)
from the “UHELDA” variable and values c("09", "055")
(sales
and consumption of lager equivalents) from the “TYPE” variable. As we
would like data for all time periods (“Tid”) we will leave the values
for that variable as NA
.
We can now pull the needed data from the API - first the accident numbers:
variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid
# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
# Get data
accidents <- get_data("uheld4", variables = variable_input)
#> Rows: 1350 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): UHELDA, KLOK
#> dbl (2): TID, INDHOLD
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(accidents) %>% kable()
UHELDA | KLOK | TID | INDHOLD |
---|---|---|---|
Accidents involving persons influenced by alcohol | 00-01 | 1997 | 94 |
Accidents involving persons influenced by alcohol | 00-01 | 1998 | 85 |
Accidents involving persons influenced by alcohol | 00-01 | 1999 | 81 |
Accidents involving persons influenced by alcohol | 00-01 | 2000 | 87 |
Accidents involving persons influenced by alcohol | 00-01 | 2001 | 77 |
Accidents involving persons influenced by alcohol | 00-01 | 2002 | 91 |
and then the alcohol sales and consumption values:
variable_codes <- vars_alco$id
variable_values <- list(c("055", "09"), NA) # All values for Tid
# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))
# Get data
alcohol <- get_data("alko3", variables = variable_input)
#> Rows: 132 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): TYPE, INDHOLD
#> dbl (1): TID
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
alcohol %>%
filter(INDHOLD != "..") %>% # the API returns ".." as missing values
head() %>%
kable()
TYPE | TID | INDHOLD |
---|---|---|
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2000 | 100 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2001 | 97 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2002 | 96 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2003 | 98 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2004 | 93 |
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- | 2005 | 93 |
An interesting analysis would be to see at which time of day most accidents occur:
accidents_by_hour <- accidents %>%
filter(KLOK != "Not stated") %>%
group_by(UHELDA, KLOK) %>%
summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))
#> `summarise()` has grouped output by 'UHELDA'. You can override using the
#> `.groups` argument.
accidents_by_hour %>%
ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
geom_line() +
geom_point() +
theme_bw() +
theme(legend.position="top") +
labs(x = "Time of day", y = "Average annual accidents")
A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50 occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.