Description of genome variant meta-data
This document describes the meta-data of the 11,841,785 (about 11.8 million) genome variants in the UK Biobank (UKBB) that we analyzed.
Because the meta-data differ across the 4 races we analyzed, we store them in 4 separate TSV files named "staticprependp.tsv", where {p} denotes one of the 4 races:
- afr: African Blacks,
- asi: Asians,
- bri: British Whites,
- iri: Irish.
All four TSV files have exactly 11,841,786 rows (header included) and 9 columns:
- c19: chromosome number by Hg19/GRCh37, provided by UKBB;
- p19: basepair position by Hg19/GRCh37, provided by UKBB;
- c38: chromosome number by Hg38/GRCh38, uplifted from c19 via NCBI;
- p38: basepair position by Hg38/GRCh38, uplifted from p19 via NCBI;
- ref: reference allele;
- alt: alternative allele;
- rsn: the RS number according to dbSNP;
- nfo: UKBB imputation information in [0, 1], higher means more reliable;
- hwe: p-value of the Hardy-Weinberg equilibrium (HWE) test.
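The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' own tooling. It assumes that the header row spells the column names exactly as listed above, that the files are tab-separated, and that chromosome values are best kept as text. The function name read_variants and the toy rows are hypothetical and invented for illustration only.

```python
import csv
import io

# Column names as listed in this document; assumed to match the TSV header.
COLUMNS = ["c19", "p19", "c38", "p38", "ref", "alt", "rsn", "nfo", "hwe"]


def read_variants(handle):
    """Yield one dict per variant from an open, tab-separated text handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        # Positions are integers; nfo and hwe are floating-point values.
        # Chromosomes, alleles, and RS numbers are left as strings.
        row["p19"] = int(row["p19"])
        row["p38"] = int(row["p38"])
        row["nfo"] = float(row["nfo"])
        row["hwe"] = float(row["hwe"])
        yield row


# Toy rows invented for illustration; they are NOT real UKBB records.
toy = io.StringIO(
    "c19\tp19\tc38\tp38\tref\talt\trsn\tnfo\thwe\n"
    "1\t1000\t1\t1100\tA\tG\trs0000001\t0.98\t0.51\n"
    "2\t2000\t0\t0\tI\tD\trs0000002\t0.85\t0.02\n"
)
variants = list(read_variants(toy))
```

For a real file, the same function can be given a handle from open(path, newline=""), where path points to one of the four TSV files.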
The coordinates, alleles, and imputation information are identical across all 4 races, but the HWE p-values differ for each race; this is why we need 4 separate meta-data files.
Besides "A", "T", "C", and "G", the reference and alternative alleles (ref/alt) also use "I" and "D" to denote insertion and deletion (i.e., indels).
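The allele coding described above can be checked with a small helper. This is only a sketch under the assumption that each allele is stored as a single letter; the function name allele_kind and the "other" fallback category are hypothetical and are not part of the data description.

```python
NUCLEOTIDES = set("ACGT")
INDEL_CODES = set("ID")


def allele_kind(ref, alt):
    """Classify a ref/alt pair as 'snv', 'indel', or 'other'."""
    alleles = {ref, alt}
    if alleles & INDEL_CODES:
        # At least one allele uses the I/D code, so the variant is an indel.
        return "indel"
    if alleles <= NUCLEOTIDES:
        # Both alleles are single nucleotides.
        return "snv"
    # Anything else is unexpected under the coding described in this document.
    return "other"
```

A non-zero count of "other" when this is applied to every row would suggest that the single-letter assumption does not hold for the file at hand.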
The UK Biobank genotypes were aligned to Hg19/GRCh37 coordinates, so columns c19 and p19 are complete. The Hg38/GRCh38 coordinates, however, were uplifted from Hg19/GRCh37 via NCBI's online tool, and 9,884 of the 11,841,785 variants failed the uplifting. These 9,884 variants were given chromosome 0 and position 0 in columns c38 and p38, respectively.
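Analyses that rely on Hg38/GRCh38 coordinates will usually want to set the failed variants aside. The sketch below separates them using the chromosome 0 / position 0 convention described above. It assumes each variant is a dict keyed by the column names; the function names failed_uplift and split_by_uplift and the toy rows are hypothetical.

```python
def failed_uplift(row):
    """Return True if the variant carries the 0/0 placeholder in c38/p38."""
    # str()/int() normalisation tolerates values read as text or as numbers.
    return str(row["c38"]) == "0" and int(row["p38"]) == 0


def split_by_uplift(rows):
    """Split variants into (mapped, failed) lists by Hg38 uplift status."""
    mapped, failed = [], []
    for row in rows:
        (failed if failed_uplift(row) else mapped).append(row)
    return mapped, failed


# Toy rows invented for illustration; they are NOT real UKBB records.
toy_rows = [
    {"c19": "1", "p19": 1000, "c38": "1", "p38": 1100},
    {"c19": "2", "p19": 2000, "c38": "0", "p38": 0},
]
mapped, failed = split_by_uplift(toy_rows)
```

Run over a full file, the length of the failed list would be expected to equal the 9,884 reported above, which gives a quick consistency check.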