Skip to contents

The idea

Considering that we have data on the frequency of benzodiazepine use for the three waves in these subjects (n = 781), we can try to create clusters referring to the pattern of use in these three moments through an unsupervised machine learning algorithm.

data("all_waves", package = "BenzoCovid")

Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset. The goal is to find clusters such that the observations within each cluster are similar to each other, while observations in different clusters are different from each other. So, we are not predicting a value based on a response/target variable, but instead, we are grouping the observations.

Feature transformation

The first step in this process is the transformation of the categorical variables of the frequency of use of benzodiazepines to numerical variables. This procedure is necessary because the k-means algorithm only accepts numerical features as input data.

We also took this step to select only the variables referring to the use of benzodiazepines in the subset of data.

benzo_frequencies <- all_waves |> 
    dplyr::mutate(dplyr::across(dplyr::matches("^benzofreq_w[1-3]$"), as.numeric)) |> 
    dplyr::select(dplyr::matches("^benzofreq_w[1-3]$"))

3D scatterplot

Below is a 3D scatterplot showing the association between benzodiazepine use in the three assessment waves. In general, there is no great variation in the use frequency pattern in the sample.

plotly::plot_ly(
    benzo_frequencies,
    x = ~ benzofreq_w1,
    y = ~ benzofreq_w2,
    z = ~ benzofreq_w3,
    text = ~ paste(
        "Wave 1:",
        benzofreq_w1,
        "<br>Wave 2:",
        benzofreq_w2,
        "<br>Wave 3:",
        benzofreq_w3
    ),
    hoverinfo = "text"
) |>
    plotly::add_markers() |>
    plotly::layout(title = "Benzodiazepine Use Frequency in Each Wave",
                   scene = list(
                       xaxis = list(title = "Wave 1"),
                       yaxis = list(title = "Wave 2"),
                       zaxis = list(title = "Wave 3")
                   ))

K-means clustering

We start fitting three clustering models, from 1 to 5 clusters in order to find out the best number of clusters in our dataset.

set.seed(1)

benzo_kmeans <- 
  tibble::tibble(k = 1:5) |> 
  dplyr::mutate(
    kclust = purrr::map(k, ~kmeans(benzo_frequencies, .x)),
    tidied = purrr::map(kclust, broom::tidy),
    glanced = purrr::map(kclust, broom::glance),
    augmented = purrr::map(kclust, broom::augment, benzo_frequencies)
  )

Here, we can check which number of clusters behaves the best. We need to see the least amount of “distance” between the observations in the same cluster, and try to minimize the number of clusters used in order to keep a parsimonious model.

benzo_kmeans |> 
  tidyr::unnest(cols = c(glanced)) |> 
    ggplot2::ggplot(ggplot2::aes(x = k, y = tot.withinss)) +
    ggplot2::geom_line() +
    ggplot2::geom_point(size = 4, color = "darkslateblue") +
    ggplot2::theme_minimal(base_size = 20) +
    ggplot2::labs(title = "What's the optimal number of clusters?",
                  x = "# of clusters",
                  y = "Within sum of squares function") +
    gghighlight::gghighlight(k == 2)

As we can see, this represents the variance within the clusters. It decreases as k increases, but notice a bend (or “elbow”) around k = 2. This bend indicates that additional clusters beyond the second have little value for us.

Now, we can see the mean values of each variable for each of the clusters. I have given myself the right to name the clusters as frequent use and infrequent use of benzodiazepines.

The frequent use cluster contains 86 subjects, and the infrequent use cluster contains 695 subjects.

two_clusters <- benzo_kmeans |> 
    dplyr::pull(kclust) |> 
    purrr::pluck(2)
    
two_clusters |> 
    purrr::pluck("centers") |> 
    base::as.data.frame() |> 
    tibble::rowid_to_column("Cluster") |> 
    dplyr::mutate(
        Cluster = dplyr::if_else(
            Cluster == 1, "Infrequent use",
            "Frequent use"
        )
    ) |> 
    dplyr::rename(
        "Wave 1" = benzofreq_w1,
        "Wave 2" = benzofreq_w2,
        "Wave 3" = benzofreq_w3
        ) |> 
    knitr::kable(digits = 3, row.names = FALSE)
Cluster Wave 1 Wave 2 Wave 3
Infrequent use 1.104 1.121 1.134
Frequent use 4.465 4.372 3.779

Join original data with cluster information

clustered_data <- two_clusters |> 
    purrr::pluck("cluster") |> 
    tibble::as_tibble_col("cluster") |> 
    dplyr::mutate(cluster = dplyr::if_else(
        cluster == 1, "Infrequent", "Frequent"
    )) |> 
    dplyr::bind_cols(all_waves)

utils::head(clustered_data)
#> # A tibble: 6 × 46
#>   cluster   age sex   color educa…¹ state region sexua…² heter…³ house…⁴ conta…⁵
#>   <chr>   <dbl> <fct> <fct> <fct>   <fct> <fct>  <fct>   <fct>   <fct>   <fct>  
#> 1 Infreq…    47 Fema… Bran… Pós- g… São … Sudes… Hetero… sim     D       Não    
#> 2 Freque…    34 Fema… Bran… Ensino… Rio … Sul    Bissex… não     D       Não    
#> 3 Infreq…    57 Fema… Bran… Ensino… São … Sudes… Hetero… sim     B       Não    
#> 4 Infreq…    52 Fema… Bran… Pós- g… Rio … Sul    Hetero… sim     A       Não    
#> 5 Infreq…    32 Male  Bran… Ensino… Goiás Centr… Hetero… sim     D       Não    
#> 6 Infreq…    44 Fema… Bran… Ensino… Rio … Sul    Hetero… sim     E       Não    
#> # … with 35 more variables: risk_group_covid19 <fct>,
#> #   number_people_house <dbl>, benzofreq_w1 <fct>, benzofreq_w2 <fct>,
#> #   benzofreq_w3 <fct>, family_relationship_w1 <fct>,
#> #   family_relationship_w2 <fct>, family_relationship_w3 <fct>,
#> #   friend_relationship_w1 <fct>, friend_relationship_w2 <fct>,
#> #   friend_relationship_w3 <fct>, loving_relationship_w1 <fct>,
#> #   loving_relationship_w2 <fct>, loving_relationship_w3 <fct>, …

Is the pattern of suicidal ideation different between clusters?

BenzoCovid::grouped_count(clustered_data, "cluster", "suicidal_ideation_w1")
#> # A tibble: 4 × 3
#> # Groups:   suicidal_ideation_w1 [2]
#>   suicidal_ideation_w1 cluster        n
#>   <fct>                <chr>      <int>
#> 1 Yes                  Frequent      24
#> 2 Yes                  Infrequent    92
#> 3 No                   Frequent      62
#> 4 No                   Infrequent   603