About phenotype meta-data
This document describes the phenotype meta-data file, fld.tsv.
Currently, 3,273 phenotypes have gone through the Variance Locus Analysis (VLA) to create a database similar to the EBI GWAS Catalog.
The meta-data file fld.tsv uses one row to describe each phenotype. Here we detail the columns of that file.
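As a minimal sketch of reading the one-row-per-phenotype layout, the snippet below parses a tab-separated table into dictionaries keyed by column name. It assumes that fld.tsv has a header row naming the eight columns described here and that empty cells are stored as empty strings; neither is confirmed by this document. The two sample rows are made up for illustration (in particular the tsz values) and stand in for the real file.

```python
import csv
import io

# Hypothetical excerpt standing in for fld.tsv. The header names follow the
# columns described in this document; the rows (including the tsz values)
# are invented purely for illustration.
SAMPLE = (
    "dat\tfld\ttsz\tcvt\tarr\tdsc\tnlv\tlvl\n"
    "org\t50\t1000\tcon\t\tStanding height\t\t\n"
    "phc\t250.2\t2000\tbin\t\tType 2 diabetes\t\t\n"
)


def read_meta(handle):
    """Return one dict per phenotype, keyed by column name."""
    # QUOTE_NONE treats quote characters as ordinary text, which is an
    # assumption about how the file was written.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_meta(io.StringIO(SAMPLE))
for row in rows:
    print(row["dat"], row["fld"], row["dsc"])
```

To read the actual file, one would pass `open("fld.tsv", newline="")` in place of the in-memory sample.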
1 dat: data source
Although every phenotype traces back to the UK Biobank (UKBB), they come in two main forms:
- org: original UKBB phenotypes covering a wide variety of health aspects, such as body height, blood pressure, behavior, cognition, and blood assays.
- phc: PHECODE traits derived from UKBB-acquired hospital diagnoses, coded in ICD-9/10. PHECODE traits exclusively describe binary disease status, such as type 2 diabetes and asthma.
2 fld: field ID
- for a UKBB original (i.e., dat = "org"), the field ID is the same one given by the UKBB.
Thus, one can turn the ID into the URL to UKBB's online description with
- for a derived PHECODE phenotype (i.e., dat = "phc"), the ID is the number taken from the PHECODE mapping table. One can use this ID to look up the phenotype on the official PHECODE site at:
3 tsz: total sample size
The number of non-missing values prior to any quality control and filtering.
- includes all races, undivided;
- includes all samples, regardless of their genotyping;
- includes all samples, regardless of their kinship.
4 cvt: converted value type
The manually converted value type of the phenotype, which can be one of the following:
- int: integer
- con: continuous
- seq: sequence (ordinal)
- bin: binary
- cat: categorical (multinomial)
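The five codes above can be kept in a small lookup table, which also helps catch unexpected values when scanning the cvt column. The following is only a sketch; the names `CVT_TYPES` and `describe_cvt` are hypothetical and not part of any existing tool.

```python
# Codes and their meanings as listed in this section.
CVT_TYPES = {
    "int": "integer",
    "con": "continuous",
    "seq": "sequence (ordinal)",
    "bin": "binary",
    "cat": "categorical (multinomial)",
}


def describe_cvt(code):
    """Return the long name of a cvt code, or raise ValueError if unknown."""
    try:
        return CVT_TYPES[code]
    except KeyError:
        raise ValueError(f"unknown cvt code: {code!r}") from None
```

For example, `describe_cvt("seq")` returns the long name of the ordinal type, while an unlisted code raises an error instead of passing silently.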
5 arr: array size
Some UKBB fields are repeated measurements; for example, systolic blood pressure was read twice to mitigate "doctor's bias".
- when arr > 0, the gzipped tarball will contain multiple TSV files, at least one for each repeated measurement.
- the vast majority of phenotypes are non-repeated, and arr is empty.
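Because arr is either empty or a positive count, code that reads it has to handle both cases. A minimal sketch, assuming the cell is stored as a plain integer string or an empty string (the helper names are hypothetical):

```python
def parse_arr(cell):
    """Return the array size as an int, or None when the cell is empty."""
    cell = cell.strip()
    return int(cell) if cell else None


def is_repeated(cell):
    """True when the phenotype is a repeated measurement (arr > 0)."""
    size = parse_arr(cell)
    return size is not None and size > 0
```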
6 dsc: short description
A short description of the data field taken from UKBB's page. For PHECODE phenotypes, this is the name of the disease.
7 nlv: nominal levels
The number of non-reference levels for a phenotype, which is only relevant to categorical/multinomial phenotypes (i.e., cvt = "cat").
- when nlv > 0, the gzipped tarball will contain multiple TSV files, at least one for each non-reference level.
- the reference is the one level coded as "0"; some phenotypes have no clear reference, and the level coding starts from "1".
- nlv is empty for non-multinomial phenotypes, including binary ones such as the PHECODE (cvt = "bin").
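The rule that nlv is filled only for multinomial phenotypes can be turned into a simple consistency check on a row. This is a sketch under the assumption that each row is available as a dictionary with string values for "cvt" and "nlv"; the function name `check_nlv` is hypothetical.

```python
def check_nlv(row):
    """Return a list of problems found in one meta-data row (empty if none)."""
    problems = []
    cvt = row.get("cvt", "").strip()
    nlv = row.get("nlv", "").strip()
    if not nlv:
        # An empty nlv is expected for non-multinomial phenotypes.
        return problems
    if cvt != "cat":
        problems.append(f"nlv is set ({nlv}) but cvt is {cvt!r}, not 'cat'")
    try:
        int(nlv)
    except ValueError:
        problems.append(f"nlv is not an integer: {nlv!r}")
    return problems
```

A row with cvt = "cat" and nlv = "3" yields no problems, while a binary row with a non-empty nlv is flagged.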
8 lvl: level labels
The string labels of each level.
- it is mostly relevant to multinomial and sequential phenotypes, but can occur for numerical phenotypes as well.
- lvl is empty for binary phenotypes, since the levels are simply 0=No and 1=Yes.
- the special level "I=" denotes infinity or extreme, typically paired with labels like "the entire life", "all the time", "never", etc.