About VLA summary statistics

This document describes the results of variance locus analysis (VLA) on the phenotypes of UK-Biobank (UKBB).

Because VLA was run separately for 4 races, the results are split into 4 directories:

Currently, we run VLA on 3,273 phenotypes: 1,580 health-related original UKBB phenotypes, and 1,693 PHECODE phenotypes derived from hospital diagnosis records coded in ICD. The results are saved in 3,237 sub-directories in each of the 4 race directories, each named after the field ID of the phenotype (see the second column in fld.tsv).

Depending on the number of variables in a phenotype, one or more compressed TSVs (*.tsv.gz) appear inside the phenotype sub-directory. In most cases, a phenotype is coded by one variable, so there is only one TSV. Multinomial and array phenotypes have multiple TSVs: one per nominal variable for a multinomial phenotype, or one per element variable for an array phenotype.
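The layout described above can be walked with a few lines of Python. This is only a minimal sketch: the function name, the root path and the race directory name are hypothetical placeholders, since the actual directory names are not given here.

```python
from pathlib import Path


def list_phenotype_tsvs(root, group, field_id):
    """Return the sorted *.tsv.gz files of one phenotype.

    root     -- hypothetical top-level directory holding the race directories
    group    -- name of one race directory (placeholder, not a real name)
    field_id -- UKBB field ID, which is also the sub-directory name
    """
    pheno_dir = Path(root) / group / str(field_id)
    # One file for a single-variable phenotype; several files for
    # multinomial or array phenotypes.
    return sorted(pheno_dir.glob("*.tsv.gz"))
```

A phenotype coded by a single variable should yield a list of length one; a longer list suggests a multinomial or array phenotype.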

Each TSV summarizes the statistics for 11,841,786 (about 11.8 million) genome variants, one variant per row, and it has the following 5 columns:

The maf is distinct for each of the 4 races, but identical across phenotype variables. The working sample size (ssz) is distinct for each variant, phenotype variable, and race.

The p-values (gwa, vla, and vqa) can be NaN if the working sample size (ssz) is too small, which is more common for non-British groups.
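Because NaN p-values are expected, it may help to count them before any downstream filtering. The sketch below streams one compressed TSV with the standard library only. It assumes that each file starts with a header row carrying the column names gwa, vla and vqa used in this document; that assumption, and the function names, are not confirmed by the files themselves.

```python
import csv
import gzip
import math

P_COLS = ("gwa", "vla", "vqa")  # p-value columns named in this document


def to_float(text):
    """Parse a field as float; treat anything unparsable as NaN."""
    try:
        return float(text)  # float("NaN") already yields nan
    except (TypeError, ValueError):
        return math.nan


def count_nan_pvalues(path):
    """Return (number of rows, NaN count per p-value column)."""
    counts = dict.fromkeys(P_COLS, 0)
    n_rows = 0
    # Stream row by row: a full file has about 11.8 million rows.
    with gzip.open(path, "rt", newline="") as fh:
        for rec in csv.DictReader(fh, delimiter="\t"):
            n_rows += 1
            for col in P_COLS:
                if math.isnan(to_float(rec[col])):
                    counts[col] += 1
    return n_rows, counts
```

Comparing the NaN counts across the 4 race directories for the same phenotype is one way to see how much a small working sample size affects each group.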

The complete summary statistics are meant to be prepended with the static genome meta-data, located in

where {p} denotes one of the 4 races. See "about_static_prepend" for details.
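As a rough illustration of what "prepending" could look like, the sketch below pastes the meta-data columns to the left of the statistics columns, row by row. It rests on assumptions that should be checked against "about_static_prepend": that the static meta-data is also a gzip-compressed TSV, and that both files list the same variants in the same order. The function name and all paths are hypothetical.

```python
import gzip
from itertools import zip_longest


def prepend_static(static_path, stats_path, out_path):
    """Write meta-data columns followed by statistics columns.

    Assumes both inputs are gzip-compressed TSVs with identical row
    order and row count (header included). Returns the rows written.
    """
    n_rows = 0
    with gzip.open(static_path, "rt") as meta, \
         gzip.open(stats_path, "rt") as stats, \
         gzip.open(out_path, "wt") as out:
        for m, s in zip_longest(meta, stats):
            if m is None or s is None:
                # One file ended early, so the rows cannot be aligned.
                raise ValueError("row counts differ between the two files")
            out.write(m.rstrip("\n") + "\t" + s.rstrip("\n") + "\n")
            n_rows += 1
    return n_rows
```

The explicit row-count check is deliberate: a silent mismatch would attach statistics to the wrong variants.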

Author: Xiaoran Tong

Created: 2021-10-06 Wed 11:59
