Memo: aggregating in R so that unrecorded (zero) records are also counted

【Purpose】

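When a value to be aggregated is 0, it is very often not recorded at all rather than stored as 0. Sometimes you still want the mean to include those zero cases. This is a memo of one way to handle that: preprocess in R and create the zero records yourself.
There may be an easier way, so if you know one, please tell me. (Right after publishing I was taught a trick, which I have added at the end.)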

【Content】

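The data I have in mind: there are users, and their spending on clothing, food, and housing is recorded.
First, the case where zero records are stored as 0. Load the library and create a mock dataset.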

# I have been putting off updating R and the tidyverse, so the output below comes from older versions
library(tidyverse)

df_clothes <- data.frame(
  breakdown = "clothes",
  user = c("A", "B", "C", "D"),
  expenditure = c(14, 3, 7, 0)
)

df_food <- data.frame(
  breakdown = "food",
  user = c("A", "B", "C", "D"),
  expenditure = c(2, 8, 6, 12)
)

df_rent <- data.frame(
  breakdown = "rent",
  user = c("A", "B", "C", "D"),
  expenditure = c(7, 12, 9, 8)
)

df_expenditure <- rbind.data.frame(df_clothes,
                                   df_food,
                                   df_rent)
df_expenditure

##    breakdown user expenditure
## 1    clothes    A          14
## 2    clothes    B           3
## 3    clothes    C           7
## 4    clothes    D           0
## 5       food    A           2
## 6       food    B           8
## 7       food    C           6
## 8       food    D          12
## 9       rent    A           7
## 10      rent    B          12
## 11      rent    C           9
## 12      rent    D           8

The aggregation I actually want looks like this: the mean for each expense category.

df_expenditure %>% group_by(breakdown) %>% 
  summarise(N = n(),
            ave_expenditure = mean(expenditure))

## # A tibble: 3 x 3
##   breakdown     N ave_expenditure
##   <fct>     <int>           <dbl>
## 1 clothes       4               6
## 2 food          4               7
## 3 rent          4               9

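However, I want to consider a dataset in which zeros are not recorded, so I create one. Running the same aggregation on it gives a different result.
Specifically, clothes is aggregated over only 3 people (24 / 3 = 8 instead of 24 / 4 = 6), so the result is the mean among the people who actually spent money.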

df_expenditure_nonzero <- df_expenditure %>% filter(expenditure!=0)

df_expenditure_nonzero %>% group_by(breakdown) %>% 
  summarise(N = n(),
            ave_expenditure = mean(expenditure))

## # A tibble: 3 x 3
##   breakdown     N ave_expenditure
##   <fct>     <int>           <dbl>
## 1 clothes       3               8
## 2 food          4               7
## 3 rent          4               9
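Before fixing the aggregation, it can help to see exactly which user/category pairs are missing. The following is a minimal sketch, not part of the original workflow: it rebuilds the same toy data by hand and assumes `tidyr::expand()` and `dplyr::anti_join()` are available. The names `all_pairs` and `missing_pairs` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, with the zero row (clothes, D) not recorded
df_expenditure_nonzero <- data.frame(
  breakdown = rep(c("clothes", "food", "rent"), times = c(3, 4, 4)),
  user = c("A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"),
  expenditure = c(14, 3, 7, 2, 8, 6, 12, 7, 12, 9, 8),
  stringsAsFactors = FALSE
)

# Every user x breakdown pair that could exist in the data ...
all_pairs <- tidyr::expand(df_expenditure_nonzero, user, breakdown)

# ... minus the pairs that are actually recorded
missing_pairs <- anti_join(all_pairs, df_expenditure_nonzero,
                           by = c("user", "breakdown"))
missing_pairs
```

Here the only missing pair should be user D with clothes, which is the row that was filtered out.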

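To avoid this, convert to wide format once, which creates NAs for the missing combinations, and then convert back to long format so that the unrecorded combinations get their own rows.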

df_expenditure_nonzero %>% spread(key = breakdown, value = expenditure) %>% 
  gather(key = breakdown, value = expenditure, -user)

##    user breakdown expenditure
## 1     A   clothes          14
## 2     B   clothes           3
## 3     C   clothes           7
## 4     D   clothes          NA
## 5     A      food           2
## 6     B      food           8
## 7     C      food           6
## 8     D      food          12
## 9     A      rent           7
## 10    B      rent          12
## 11    C      rent           9
## 12    D      rent           8

# The same wide/long conversion written with pivot_* instead of gather & spread (for practice and as a note)
#df_expenditure_nonzero %>% pivot_wider(names_from = breakdown, values_from = expenditure) %>% 
#  pivot_longer(-user, names_to = "breakdown", values_to = "expenditure")
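The NA in the (D, clothes) row matters for the mean. A small sketch on a plain vector, using only base R, shows the three possible treatments; the variable names are made up for illustration.

```r
# clothes expenditure for users A-D after the wide/long round trip
clothes <- c(14, 3, 7, NA)

m_raw  <- mean(clothes)                              # NA: one NA poisons the mean
m_drop <- mean(clothes, na.rm = TRUE)                # 8: back to the 3-person mean
m_zero <- mean(replace(clothes, is.na(clothes), 0))  # 6: the mean over all 4 users

c(raw = m_raw, drop = m_drop, zero = m_zero)
```

So `na.rm = TRUE` is not what is wanted here; the NA has to become 0 before averaging.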

Aggregating after the long-to-wide-and-back round trip gives the result I wanted.

# With NAs left in, the mean comes out as NA
#df_expenditure_nonzero %>% spread(key = breakdown, value = expenditure) %>% 
#  gather(key = breakdown, value = expenditure, clothes,food,rent) %>% 
#  group_by(breakdown) %>% 
#  summarise(N = n(),
#            ave_expenditure = mean(expenditure))

# Replace NA with 0, then aggregate
df_expenditure_nonzero %>% spread(key = breakdown, value = expenditure) %>% 
  gather(key = breakdown, value = expenditure, clothes,food,rent) %>% 
  group_by(breakdown) %>% 
  replace_na(list(expenditure=0)) %>% 
  summarise(N = n(),
            ave_expenditure = mean(expenditure))

## # A tibble: 3 x 3
##   breakdown     N ave_expenditure
##   <chr>     <int>           <dbl>
## 1 clothes       4               6
## 2 food          4               7
## 3 rent          4               9
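If the wide/long round trip is kept, the NA replacement can probably be folded into the widening step. This is a sketch assuming tidyr 1.0.0 or later, where `pivot_wider()` has a `values_fill` argument; the named-list form is used because it should work across those versions. The data is rebuilt by hand so the block runs on its own, and `result_pivot` is a name made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, with the zero row (clothes, D) not recorded
df_expenditure_nonzero <- data.frame(
  breakdown = rep(c("clothes", "food", "rent"), times = c(3, 4, 4)),
  user = c("A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"),
  expenditure = c(14, 3, 7, 2, 8, 6, 12, 7, 12, 9, 8),
  stringsAsFactors = FALSE
)

result_pivot <- df_expenditure_nonzero %>%
  # Fill cells that have no record with 0 instead of NA while widening
  pivot_wider(names_from = breakdown, values_from = expenditure,
              values_fill = list(expenditure = 0)) %>%
  pivot_longer(-user, names_to = "breakdown", values_to = "expenditure") %>%
  group_by(breakdown) %>%
  summarise(N = n(),
            ave_expenditure = mean(expenditure))
result_pivot
```

This should give N = 4 for every category and means of 6, 7, and 9, the same as the original data with the zero recorded.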

I wonder whether this can be done more easily without the wide/long round trip; I plan to add it here if I find out.

【Addendum】

I was taught the answer on Twitter.

You can do it by creating the combinations with tidyr::complete(fill = list(var = 0)) https://t.co/8YZE3JIBZi pic.twitter.com/TR2TdVGCyx
— Uryu Shinya (@u_ribo) June 4, 2020

df_expenditure_nonzero %>%
  complete(user, breakdown,
           fill = list(expenditure=0)) %>% 
  group_by(breakdown) %>% 
  summarise(N = n(),
            ave_expenditure = mean(expenditure))

## # A tibble: 3 x 3
##   breakdown     N ave_expenditure
##   <fct>     <int>           <dbl>
## 1 clothes       4               6
## 2 food          4               7
## 3 rent          4               9

# Note: the same code with summarise's .groups argument
#df_expenditure_nonzero %>%
#  complete(user, breakdown,
#           fill = list(expenditure=0)) %>% 
#  group_by(breakdown) %>% 
#  summarise(N = n(),
#            ave_expenditure = mean(expenditure),
#            .groups = "drop") # .groups is available from {dplyr} v1.0.0 and apparently makes an extra ungroup() unnecessary

No more needless long/wide conversion and no more NA replacement. Thank you very much!!
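One caveat worth noting: `complete(user, breakdown)` only builds combinations from values that appear somewhere in the data, so a user with no records in any category would still be left out. The sketch below assumes a hypothetical roster `all_users` containing a user "E" who is not in the original data, and passes it to `complete()` explicitly; `result_roster` is also a made-up name.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, with the zero row (clothes, D) not recorded
df_expenditure_nonzero <- data.frame(
  breakdown = rep(c("clothes", "food", "rent"), times = c(3, 4, 4)),
  user = c("A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"),
  expenditure = c(14, 3, 7, 2, 8, 6, 12, 7, 12, 9, 8),
  stringsAsFactors = FALSE
)

# Hypothetical full roster: "E" spent nothing and so has no rows at all
all_users <- c("A", "B", "C", "D", "E")

result_roster <- df_expenditure_nonzero %>%
  # Supplying the values explicitly adds users that never appear in the data
  complete(user = all_users, breakdown,
           fill = list(expenditure = 0)) %>%
  group_by(breakdown) %>%
  summarise(N = n(),
            ave_expenditure = mean(expenditure))
result_roster
```

With five users in the roster, every category should have N = 5, and the means drop to 4.8, 5.6, and 7.2.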

【Keywords for future searching】

missing zeros, aggregation, mean, NA replacement, long format, wide format, tidyr, complete

霞と側杖を食らう

