Analysis Techniques Workshop


Weekly research group seminar

View the Project on GitHub XSLiuLab/Workshop

data.table

The basic framework of data.table

[Figure: the basic framework of data.table]

Image source: https://rstudio.com/

Creating a data.table
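
A minimal sketch of creating a data.table, assuming the standard constructors `data.table()`, `as.data.table()`, and `setDT()`. The table contents and the variable names are hypothetical and chosen only for illustration.

```r
library(data.table)

# Build a table directly from column vectors (hypothetical values).
dt <- data.table(name   = c("apple", "apple", "banana"),
                 number = c(1L, 3L, 6L))

# Convert an existing data.frame.
df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
dt_copy <- as.data.table(df)  # returns a converted copy, df is unchanged
setDT(df)                     # converts df itself, by reference
```

`setDT()` avoids copying, which matters mostly for large tables; `as.data.table()` is the safer choice when the original data.frame is still needed.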
Operating on rows with i
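
A minimal sketch of row operations in `i`, assuming a hypothetical table with only `name` and `number` columns. Inside `[ ]`, column names can be used directly without the `dt$` prefix.

```r
library(data.table)

# Hypothetical table; values are illustrative only.
dt <- data.table(name   = c("apple", "apple", "banana", "banana", "orange", "orange"),
                 number = c(1L, 3L, 1L, 6L, 3L, 6L))

rows_by_position  <- dt[2:3]                           # rows 2 and 3
rows_by_condition <- dt[name == "apple"]               # rows where name is apple
rows_combined     <- dt[number > 2 & name != "orange"] # several conditions at once
rows_sorted       <- dt[order(-number)]                # sorted by number, descending
```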
Operating on columns with j
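
A minimal sketch of column operations in `j`, again on a hypothetical `name`/`number` table. The column name `twice` is made up for this example.

```r
library(data.table)

# Hypothetical table; values are illustrative only.
dt <- data.table(name   = c("apple", "apple", "banana", "banana", "orange", "orange"),
                 number = c(1L, 3L, 1L, 6L, 3L, 6L))

total_number <- dt[, sum(number)]      # an expression in j returns a plain vector
two_columns  <- dt[, .(name, number)]  # .() returns a data.table
dt[, twice := number * 2L]             # := adds a column to dt by reference
```

Because `:=` modifies `dt` in place, no reassignment such as `dt <- ...` is needed.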
Grouping with by
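
A minimal sketch of grouping with `by`, on a hypothetical `name`/`number` table. The result column names `total` and `rows` are made up for this example; `.N` is the built-in row count per group.

```r
library(data.table)

# Hypothetical table; values are illustrative only.
dt <- data.table(name   = c("apple", "apple", "banana", "banana", "orange", "orange"),
                 number = c(1L, 3L, 1L, 6L, 3L, 6L))

# One output row per distinct name, in order of first appearance.
summary_by_name <- dt[, .(total = sum(number), rows = .N), by = name]
```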
Common data.table functions
> unique(dt, by = c("name"))
            d e c   name money number
1:  0.1127583 0 5 banana     4      1
2:  0.7079005 0 5  apple     1      1
3: -0.1899854 0 7 orange     5      3
> uniqueN(dt, by = c("name"))
[1] 3
> haskey(dt)
[1] TRUE
> key(dt)
[1] "number" "name"

A key can be used to simplify computations

Example 1: sum the number values of the rows whose name is apple

> setkey(dt, name)
> dt["apple", sum(number)]
[1] 4
> dt
            d  e c   name money number
1:  0.7079005  0 5  apple     1      1
2: -2.6720810 -2 7  apple     2      3
3:  0.1127583  0 5 banana     4      1
4:  2.3955292  2 9 banana     3      6
5: -0.1899854  0 7 orange     5      3
6:  1.5170863  1 9 orange     6      6

Example 2: sum number grouped by name (this also works without a key)

With a key

> setkey(dt, name)
> dt[c("apple","banana","orange"), sum(number), by = .EACHI]
     name V1
1:  apple  4
2: banana  7
3: orange  9
> dt[c("apple","banana","orange"), sum(number)]
[1] 20

Without a key

> dt[, sum(number), by = name]
     name V1
1:  apple  4
2: banana  7
3: orange  9
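
The two examples above can be reproduced in a self-contained way. This sketch assumes a table that contains only the `name` and `number` columns read off the printed output; the other columns of `dt` are omitted, and the initial row order is arbitrary.

```r
library(data.table)

# Assumption: only name and number are rebuilt from the printed dt.
dt <- data.table(name   = c("banana", "apple", "orange", "apple", "banana", "orange"),
                 number = c(1L, 1L, 3L, 3L, 6L, 6L))

setkey(dt, name)  # sorts dt by name and marks name as the key

apple_total    <- dt["apple", sum(number)]   # key lookup, then sum
per_name_keyed <- dt[c("apple", "banana", "orange"), sum(number), by = .EACHI]
per_name_plain <- dt[, sum(number), by = name]
```

With `by = .EACHI`, `j` is evaluated once for each lookup value in `i`; without it, `j` is evaluated once over all matched rows, which is why the second call in the original example returns the single total 20.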
Joining data.tables
> dt_a <- data.table(a = 1:3, 
+                    b = c("c","a","b"))
> dt_a
   a b
1: 1 c
2: 2 a
3: 3 b
> dt_b <- data.table(x = rev(1:3), 
+                    y = c("b","c","b"))
> dt_b
   x y
1: 3 b
2: 2 c
3: 1 b
> dt_a[dt_b, on = .(b = y)]
   a b x
1: 3 b 3
2: 1 c 2
3: 3 b 1
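
In `dt_a[dt_b, on = ...]` the table inside the brackets drives the lookup: each row of `dt_b` is searched for in `dt_a`. The sketch below reuses the two tables above to show what changes when the roles are swapped; the result variable names are made up for this example.

```r
library(data.table)

dt_a <- data.table(a = 1:3, b = c("c", "a", "b"))
dt_b <- data.table(x = rev(1:3), y = c("b", "c", "b"))

# One result row per row of dt_b; every y value has a match in dt_a$b.
a_by_b <- dt_a[dt_b, on = .(b = y)]

# Roles swapped: each row of dt_a is looked up in dt_b.
# b == "a" has no match, so x is NA there; b == "b" matches two rows.
b_by_a <- dt_b[dt_a, on = .(y = b)]

# nomatch = NULL drops the unmatched row, which gives an inner join.
b_by_a_inner <- dt_b[dt_a, on = .(y = b), nomatch = NULL]
```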

Conditional (non-equi) joins


> dt_a[dt_b, on = .(b = y)]
   a b c x z
1: 3 b 6 3 4
2: 1 c 7 2 5
3: 2 a 5 1 8
> dt_a[dt_b, on = .(b = y, c > z)]
    a b c x
1:  3 b 4 3
2:  1 c 5 2
3: NA a 8 1
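
The definitions of `dt_a` and `dt_b` for this example are not shown in the text. The sketch below uses inputs reconstructed from the printed results, so the exact definitions are an assumption. It also illustrates a detail that is easy to miss: in the non-equi result, the column named `c` holds the values of `z` from `dt_b`, not the original values of `c` from `dt_a`.

```r
library(data.table)

# Assumed inputs, reconstructed from the printed join results.
dt_a <- data.table(a = 1:3, b = c("c", "a", "b"), c = c(7, 5, 6))
dt_b <- data.table(x = 3:1, y = c("b", "c", "a"), z = c(4, 5, 8))

# Match rows where b equals y AND dt_a's c is greater than dt_b's z.
res <- dt_a[dt_b, on = .(b = y, c > z)]

# Name the columns explicitly in j: the x. prefix refers to dt_a,
# the i. prefix refers to dt_b (output names here are hypothetical).
res_explicit <- dt_a[dt_b, on = .(b = y, c > z),
                     .(a, c_in_dt_a = x.c, z_in_dt_b = i.z)]
```

The third row fails the condition (5 is not greater than 8), so the `dt_a` columns are `NA` for it.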


Reading and writing files
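
A minimal sketch of a write and read round trip, assuming data.table's `fwrite()` and `fread()`. The table and the temporary file path are hypothetical.

```r
library(data.table)

# Hypothetical table and temporary output path.
dt   <- data.table(name = c("apple", "banana"), number = c(1L, 6L))
path <- tempfile(fileext = ".csv")

fwrite(dt, path)        # write dt as a CSV file
dt_back <- fread(path)  # read it back; column types are detected automatically
```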
foverlaps()
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
    by.y = key(y), maxgap = 0L, minoverlap = 1L,
    type = c("any", "within", "start", "end", "equal"),
    mult = c("all", "first", "last"),
    nomatch = getOption("datatable.nomatch", NA),
    which = FALSE, verbose = getOption("datatable.verbose"))

foverlaps() checks whether the intervals in two tables overlap: y is keyed, and each interval in x is looked up against the intervals in y

> x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
+                start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
> x
    chr start end
1: Chr1     5  11
2: Chr1    10  20
3: Chr2     1   4
4: Chr2    25  52
5: Chr2    50  60
> y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
+                end=c(4, 18, 55), geneid=letters[1:3])
> y
    chr start end geneid
1: Chr1     1   4      a
2: Chr1    15  18      b
3: Chr2     1  55      c
> setkey(y, chr, start, end)
> foverlaps(x, y, type="any")
    chr start end geneid i.start i.end
1: Chr1    NA  NA   <NA>       5    11
2: Chr1    15  18      b      10    20
3: Chr2     1  55      c       1     4
4: Chr2     1  55      c      25    52
5: Chr2     1  55      c      50    60
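  1. type

type = "within" matches only when the interval in x lies entirely within the interval in y (identical intervals also count as within)

type = "any" matches whenever the intervals in x and y overlap at all

type = "start" matches when the two intervals have the same start

type = "end" matches when the two intervals have the same end

type = "equal" matches when both the start and the end are the same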

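  2. Other arguments

nomatch = NULL returns only the rows that have a match

setkey() sets the key used for matching (y must be keyed)

which = TRUE returns only the row numbers of the matching rows in the two tables

mult = "first" returns, for each row of x, only the first matching row of y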

foverlaps(x, y, type="any", mult="first")

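⚠️ Note: if the key columns of x and y have different names, add the following argument inside foverlaps()

by.x = c("", "", "") the columns of x that correspond, in order, to the key columns of y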
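
The arguments described above can be tried on the same x and y. This is a sketch, not a complete tour of `foverlaps()`; the renamed copy `x_renamed` and its column names are hypothetical and exist only to show `by.x`.

```r
library(data.table)

x <- data.table(chr = c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
                start = c(5, 10, 1, 25, 50), end = c(11, 20, 4, 52, 60))
y <- data.table(chr = c("Chr1", "Chr1", "Chr2"), start = c(1, 15, 1),
                end = c(4, 18, 55), geneid = letters[1:3])
setkey(y, chr, start, end)

# Keep only x intervals that lie entirely inside a y interval;
# nomatch = NULL drops the x rows without a match.
within_only <- foverlaps(x, y, type = "within", nomatch = NULL)

# Row numbers only: xid indexes x, yid indexes y (NA when nothing overlaps).
row_ids <- foverlaps(x, y, type = "any", which = TRUE)

# Hypothetical copy of x whose columns are named differently from y's key.
x_renamed <- copy(x)
setnames(x_renamed, c("chr", "start", "end"), c("chrom", "from", "to"))
renamed_hits <- foverlaps(x_renamed, y, by.x = c("chrom", "from", "to"),
                          type = "any")
```

In this data only the Chr2 intervals 1 to 4 and 25 to 52 lie inside a y interval (1 to 55), so `within_only` has two rows.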

Reshaping data: melt and dcast
> reshape_dt <- data.table(kinds = c(rep("peach", 2), rep("grape", each = 2)),
+                          price = c("3","8","4","6"),
+                          price2 = c("4","9","5","7"),
+                          level = c("h","l","h","l"))
> reshape_dt
   kinds price price2 level
1: peach     3      4     h
2: peach     8      9     l
3: grape     4      5     h
4: grape     6      7     l
> reshape_dt_new <- melt(reshape_dt, id.vars = c("kinds", "level"),
+                        measure.vars = c("price", "price2"),
+                        variable.name = "2price",
+                        value.name = "money")
> reshape_dt_new
   kinds level 2price money
1: peach     h  price     3
2: peach     l  price     8
3: grape     h  price     4
4: grape     l  price     6
5: peach     h price2     4
6: peach     l price2     9
7: grape     h price2     5
8: grape     l price2     7
> dcast(reshape_dt_new, kinds + level ~ `2price`, value.var = "money")
   kinds level price price2
1: grape     h     4      5
2: grape     l     6      7
3: peach     h     3      4
4: peach     l     8      9
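
In the example above the prices are stored as character strings, so they can be reshaped but not summarized. The sketch below is a variation that assumes numeric prices and a syntactic column name (`price_type`, hypothetical), and uses `fun.aggregate` in `dcast()` to average over `level` while casting back to wide form.

```r
library(data.table)

# Same layout as reshape_dt, but with numeric prices (an assumption).
wide <- data.table(kinds  = c("peach", "peach", "grape", "grape"),
                   price  = c(3, 8, 4, 6),
                   price2 = c(4, 9, 5, 7),
                   level  = c("h", "l", "h", "l"))

# Wide to long: one row per (kinds, level, price_type).
long <- melt(wide, id.vars = c("kinds", "level"),
             measure.vars = c("price", "price2"),
             variable.name = "price_type", value.name = "money")

# Long to wide, averaging the two levels within each kind.
mean_by_kind <- dcast(long, kinds ~ price_type,
                      value.var = "money", fun.aggregate = mean)
```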