Introduction

http://zzz.bwh.harvard.edu/plink/index.shtml

Plink1.9 : https://www.cog-genomics.org/plink/1.9/
Plink2 : https://www.cog-genomics.org/plink/2.0/

Generate a simulation file

https://zzz.bwh.harvard.edu/plink/simulate.shtml - 어떤 dataset을 생성할 것인지 정해준다.

$ emacs wgas.sim
4 nullA 0 1 1 1
4 nullA 0 1 1 1
4 disea 0 1 2 mult

or

echo -e "1000\tSNP\t0.00\t1.00\t1.00\t1.00" > wgas.sim

wgas.sim 파일의 각 필드(행) 값은 다음을 나타낸다.

1 Number of SNPs in this set
2 Label of this set of SNPs
3 Lower allele frequency range
4 Upper allele frequency range
5 Odds ratio for disease, heterozygote
6 Odds ratio for disease, homozyygote (or "mult")

데이터를 생성하는 명령어.

$ plink --simulate wgas.sim --make-bed --out test

생성된 SNP simulation 파일.

$ cat test.bim
1       nullA_0 0       1       D       d
1       nullA_1 0       2       D       d
1       nullA_2 0       3       d       D
1       nullA_3 0       4       D       d
1       nullB_0 0       5       d       D
1       nullB_1 0       6       d       D
1       nullB_2 0       7       D       d
1       nullB_3 0       8       D       d
1       disea_0 0       9       D       d
1       disea_1 0       10      d       D
1       disea_2 0       11      D       d
1       disea_3 0       12      d       D

얼만큼의 빈도를 가지고 nullA_#가 나타나는지 확인.

$ cat test-temporary.simfreq
1 nullA_0       0.370886 0.370886       1       1
1 nullA_1       0.229202 0.229202       1       1
1 nullA_2       0.743708 0.743708       1       1
1 nullA_3       0.0544006 0.0544006     1       1
1 nullB_0       0.813045 0.813045       1       1
1 nullB_1       0.639096 0.639096       1       1
1 nullB_2       0.441359 0.441359       1       1
1 nullB_3       0.192305 0.192305       1       1
1 disea_0       0.266559 0.266559       2       4
1 disea_1       0.732512 0.732512       2       4
1 disea_2       0.344471 0.344471       2       4
1 disea_3       0.800109 0.800109       2       4

생성된 데이터에 대한 정보는 다음과 같다.

$ emacs info_DS.py
from pandas_plink import read_plink1_bin

f_nm="test"

DS = read_plink1_bin(f_nm+".bed", f_nm+".bim", f_nm+".fam", verbose=False)

print(DS)
print("")
print(DS.values)
print(DS.shape)
print(len(DS))

$ python info_DS.py
<xarray.DataArray 'genotype' (sample: 2000, variant: 12)>
dask.array<shape=(2000, 12), dtype=float64, chunksize=(1024, 12)>
Coordinates:
  * sample   (sample) object 'per0' 'per1' 'per2' ... 'per1998' 'per1999'
  * variant  (variant) object '1_nullA_0' '1_nullA_1' ... '1_disea_3'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    fid      (sample) <U7 'per0' 'per1' 'per2' ... 'per1997' 'per1998' 'per1999'
    gender   (sample) <U1 '2' '2' '2' '2' '2' '2' ... '2' '2' '2' '2' '2' '2'
    iid      (sample) <U7 'per0' 'per1' 'per2' ... 'per1997' 'per1998' 'per1999'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) float64 2.0 2.0 2.0 2.0 2.0 2.0 ... 1.0 1.0 1.0 1.0 1.0
    a0       (variant) <U1 'd' 'd' 'D' 'D' 'D' 'D' 'd' 'd' 'D' 'D' 'D' 'd'
    a1       (variant) <U1 'D' 'D' 'd' 'd' 'd' 'd' 'D' 'D' 'd' 'd' 'd' 'D'
    chrom    (variant) <U1 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int64 1 2 3 4 5 6 7 8 9 10 11 12
    snp      (variant) <U7 'nullA_0' 'nullA_1' 'nullA_2' ... 'disea_2' 'disea_3'

[[1. 1. 2. ... 2. 2. 2.]
 [1. 2. 0. ... 2. 2. 2.]
 [1. 2. 1. ... 1. 2. 2.]
 ...
 [1. 2. 1. ... 1. 2. 2.]
 [2. 2. 1. ... 2. 2. 2.]
 [2. 1. 1. ... 2. 2. 2.]]
(2000, 12)
2000

Default 로 1000명의 case와 1000명의 control 이 생겼다. 100명씩 생성해 보자.

$ plink --simulate wgas2.sim --make-bed -out test --simulate-ncases 100 --simulate-ncontrols 100 --simulate-prevalence 0.05

Introduction

Generate a simulation file

Read .map, .ped files

Read `.map`, `.ped` files