课题组每周研讨会
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
forcats 这个包是用来处理因子的,是tidyverse包的核心,提供了处理分类变量的工具。
先创建一个变量包含月份信息
> x1 <- c("Dec", "Apr", "Jan", "Mar")
> x2 <- c("Dec", "Apr", "Jam", "Mar")
> sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
我们可以看到X2当你拼写错误时,并没有什么提醒,并且sort后的结果并没有按照月份的顺序来,那么如果我们想要按照月份的顺序该怎么做?我们可以用因子解决这些问题。首先需要我们创建月份的因子水平列表。
> month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> y1 <- factor(x1, levels = month_levels)
> y1
[1] Dec Apr Jan Mar
12 Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct ... Dec
> sort(y1)
[1] Jan Mar Apr Dec
12 Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct ... Dec
> y2 <- factor(x2, levels = month_levels)
> y2
[1] Dec Apr <NA> Mar
12 Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct ... Dec
创建因子后可以看到排序是按照月份进行的,并且x2里拼写错误被NA替代。
如果不设置levels,直接用factors对向量进行处理,那么会按照字母表的顺序对因子进行排序。
> factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar
如果想要levels的顺序与数据出现的顺序一致,那么可以在levels里面设置unique(x)或者用fct_inorder
> f1 <- factor(x1, levels = unique(x1))
> f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
> f2 <- x1 %>% factor() %>% fct_inorder()
> f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
下面详细介绍一下这个包的用途:
The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include:
fct_reorder()
: Reordering a factor by another variable.fct_infreq()
: Reordering a factor by the frequency of values.fct_relevel()
: Changing the order of a factor by hand.fct_lump()
: Collapsing the least/most frequent values of a factor into “other”.### installation
# The easiest way to get forcats is to install the whole tidyverse:
install.packages("tidyverse")
library("tidyverse")
# Alternatively, install just forcats:
install.packages("forcats")
library("forcats")
> starwars %>%
+ filter(!is.na(species)) %>%
+ count(species, sort = TRUE)
# A tibble: 37 x 2
species n
<chr> <int>
1 Human 35
2 Droid 6
3 Gungan 3
4 Kaminoan 2
5 Mirialan 2
6 Twi'lek 2
7 Wookiee 2
8 Zabrak 2
9 Aleena 1
10 Besalisk 1
# ... with 27 more rows
> starwars %>%
+ filter(!is.na(species)) %>%
+ mutate(species = fct_lump(species, n = 3)) %>%
+ count(species)
# A tibble: 4 x 2
species n
<fct> <int>
1 Droid 6
2 Gungan 3
3 Human 35
4 Other 39
我们画图的时候经常会出现不按照值的大小进行排序的情况,会使得画出来的图很丑,那么就需要我们按照值的大小进行排序。
> ggplot(starwars, aes(x = eye_color)) +
+ geom_bar() +
+ coord_flip()
> starwars %>%
mutate(eye_color = fct_infreq(eye_color)) %>%
ggplot(aes(x = eye_color)) +
geom_bar() +
coord_flip()
fct_reorder:通过其他的值对因子顺序进行修改
> relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
#> `summarise()` ungrouping output (override with `.groups` argument)
> ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig
using fct_reorder()
. fct_reorder()
takes three arguments:
f
, the factor whose levels you want to modify.x
, a numeric vector that you want to use to reorder the levels.fun
, a function that’s used if there are multiple values of x
for each value of f
. The default value is median
.ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
> gss_cat %>% count(partyid)
# A tibble: 10 x 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind,near rep 1791
7 Independent 4119
8 Ind,near dem 2499
9 Not str democrat 3690
10 Strong democrat 3490
> gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
# A tibble: 10 x 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Republican, strong 2314
5 Republican, weak 3032
6 Independent, near rep 1791
7 Independent 4119
8 Independent, near dem 2499
9 Democrat, weak 3690
10 Democrat, strong 3490
还可以将多个原来的水平修改为一个新的水平。
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
> gss_cat %>%
+ mutate(partyid = fct_collapse(partyid,
+ other = c("No answer", "Don't know", "Other party"),
+ rep = c("Strong republican", "Not str republican"),
+ ind = c("Ind,near rep", "Independent", "Ind,near dem"),
+ dem = c("Not str democrat", "Strong democrat")
+ )) %>%
+ count(partyid)
# A tibble: 4 x 2
partyid n
<fct> <int>
1 other 548
2 rep 5346
3 ind 8409
4 dem 7180
> f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
> fct_relevel(f)
[1] a b c d
Levels: b c d a
> fct_relevel(f, "a")###a作为第一个
[1] a b c d
Levels: a b c d
> fct_relevel(f, "b", "a")### b,a作为前两个,其他按照原来的顺序
[1] a b c d
Levels: b a c d
> # Move to the third position
> fct_relevel(f, "a", after = 2)
[1] a b c d
Levels: b c a d
> #Relevel to the end
> fct_relevel(f, "a", after = Inf)
[1] a b c d
Levels: b c d a
> fct_relevel(f, "a", after = 3)
[1] a b c d
Levels: b c d a
> # Relevel with a function
> fct_relevel(f, sort)
[1] a b c d
Levels: a b c d
> fct_relevel(f, sample)
[1] a b c d
Levels: c d b a
> df <- forcats::gss_cat[, c("rincome", "denom")]
> View(df)
> lapply(df, levels)
$rincome
[1] "No answer" "Don't know" "Refused" "$25000 or more"
[5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
[9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
[13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
$denom
[1] "No answer" "Don't know" "No denomination"
[4] "Other" "Episcopal" "Presbyterian-dk wh"
[7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
[10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
[13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
[16] "Luth ch in america" "Am lutheran"
> df2 <- lapply(df, fct_relevel, "Don't know", after = Inf)
> lapply(df2, levels)
$rincome
[1] "No answer" "Refused" "$25000 or more" "$20000 - 24999"
[5] "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" "$7000 to 7999"
[9] "$6000 to 6999" "$5000 to 5999" "$4000 to 4999" "$3000 to 3999"
[13] "$1000 to 2999" "Lt $1000" "Not applicable" "Don't know"
$denom
[1] "No answer" "No denomination" "Other"
[4] "Episcopal" "Presbyterian-dk wh" "Presbyterian, merged"
[7] "Other presbyterian" "United pres ch in us" "Presbyterian c in us"
...............
[28] "Am baptist asso" "Not applicable" "Don't know"
> gss_cat$partyid %>% fct_count()
# A tibble: 10 x 2
f n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind,near rep 1791
7 Independent 4119
8 Ind,near dem 2499
9 Not str democrat 3690
10 Strong democrat 3490
> gss_cat$partyid %>% fct_relabel(~ gsub(",", "...", .x)) %>% fct_count()
# A tibble: 10 x 2
f n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind...near rep 1791
7 Independent 4119
8 Ind...near dem 2499
> gss_cat$relig %>% fct_count()
# A tibble: 16 x 2
f n
<fct> <int>
1 No answer 93
2 Don't know 15
3 Inter-nondenominational 109
4 Native american 23
.....
> gss_cat$relig %>% fct_anon() %>% fct_count()
# A tibble: 16 x 2
f n
<fct> <int>
1 01 71
2 02 93
3 03 109
4 04 5124
5 05 224
6 06 689
.....
> gss_cat$relig %>% fct_anon("X") %>% fct_count()
# A tibble: 16 x 2
f n
<fct> <int>
1 X01 104
2 X02 5124
3 X03 147
4 X04 689
5 X05 0
6 X06 93
7 X07 3523
......
> f <- factor(sample(letters[1:3], 20, replace = TRUE))
> f
[1] c c c b c a c c c c b c b b a a b a a a
Levels: a b c
> fct_expand(f, "d", "e", "f")
[1] c c c b c a c c c c b c b b a a b a a a
Levels: a b c d e f
> fct_expand(f, letters[1:6])
[1] c c c b c a c c c c b c b b a a b a a a
Levels: a b c d e f
> f <- factor(c("a", "b"), levels = c("a", "b", "c"))
> f
[1] a b
Levels: a b c
> fct_drop(f)
[1] a b
Levels: a b