About phenotype meta-data

This document describes the phenotype meta-data file, fld.tsv.

Currently, 3,273 phenotypes have gone through the Variance Locus Analysis (VLA) to create a database similar to the EBI GWAS Catalog.

The meta-data file fld.tsv uses one row to describe each phenotype. The columns of the file are detailed here.
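As a minimal sketch, the file can be read with Python's standard csv module. This assumes fld.tsv is tab-separated, has a header row naming the eight columns described in this document (dat, fld, tsz, cvt, arr, dsc, nlv, lvl), and leaves empty cells as empty strings. The sample rows and the helper name read_meta are made up for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical sample in the assumed layout; the values are illustrative only.
SAMPLE = (
    "dat\tfld\ttsz\tcvt\tarr\tdsc\tnlv\tlvl\n"
    "org\tX1\t1000\tcon\t\texample continuous trait\t\t\n"
    "phc\tX2\t900\tbin\t\texample disease\t\t\n"
)

# Columns treated as integers here (an assumption based on their descriptions).
INT_COLUMNS = ("tsz", "arr", "nlv")


def read_meta(handle):
    """Read phenotype meta-data rows; empty cells become None."""
    rows = []
    # QUOTE_NONE treats quote characters literally (an assumption about the file).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rec = {key: (val if val != "" else None) for key, val in row.items()}
        for col in INT_COLUMNS:
            if rec[col] is not None:
                rec[col] = int(rec[col])
        rows.append(rec)
    return rows


meta = read_meta(io.StringIO(SAMPLE))
print(len(meta), meta[0]["dat"], meta[0]["tsz"])
```

For the real file, the same function could be called on an opened file, for example `with open("fld.tsv", newline="") as fh: meta = read_meta(fh)`.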

1 dat: data source

Although every phenotype leads back to the UK Biobank (UKBB), the phenotypes come in two main forms:

  • org: UKBB original phenotypes covering a wide variety of aspects of health, such as body height, blood pressure, behavior, cognition, and blood assays.
  • phc: PHECODE traits derived from UKBB-acquired hospital diagnoses, coded in ICD-9/10. PHECODE traits exclusively describe binary disease status, such as type 2 diabetes and asthma.

2 fld: field ID

3 tsz: total sample size

The number of non-missing values prior to any quality control and filtering.

  • includes all races, undivided;
  • includes all samples, regardless of their genotyping;
  • includes all samples, regardless of their kinship.

4 cvt: converted value type

The manually converted value type of the phenotype, which is one of the following:

  • int: integer
  • con: continuous
  • seq: sequence (ordinal)
  • bin: binary
  • cat: categorical (multinomial)
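The five codes above can be kept in a small lookup table so that unexpected values are caught early. This is only a sketch; the names CVT_LABELS and describe_cvt are hypothetical and not part of the data release.

```python
# Codes and labels copied from the list above.
CVT_LABELS = {
    "int": "integer",
    "con": "continuous",
    "seq": "sequence (ordinal)",
    "bin": "binary",
    "cat": "categorical (multinomial)",
}


def describe_cvt(code):
    """Return the label for a cvt code, or raise ValueError if it is unknown."""
    try:
        return CVT_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown cvt code: {code!r}") from None


print(describe_cvt("seq"))
```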

5 arr: array size

Some UKBB fields are repeated measurements; for example, systolic blood pressure was read twice to mitigate "doctor's bias".

  • when arr > 0, the gzipped tarball will contain multiple TSV files, at least one for each repeated measurement.
  • the vast majority of phenotypes are non-repeated, and their arr is empty.

6 dsc: short description

A short description of the data field, taken from the UKBB's page. For PHECODE phenotypes, this is the name of the disease.

7 nlv: nominal levels

The number of non-reference levels for a phenotype, which is only relevant to categorical/multinomial phenotypes (i.e., cvt = "cat").

  • when nlv > 0, the gzipped tarball will contain multiple TSV files, at least one for each non-reference level.
  • the reference is the level coded as "0"; some phenotypes have no clear reference, and their level coding starts from "1".
  • nlv is empty for non-multinomial phenotypes, including binary ones such as the PHECODE traits (cvt = "bin").
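Both arr (section 5) and nlv indicate that a phenotype's tarball holds more than one TSV. A small check along these lines can flag such phenotypes before unpacking. The function names are hypothetical, and treating an empty cell as zero is an assumption of this sketch; the exact number of files is not computed, since the document only guarantees "at least one" per measurement or level.

```python
def to_count(cell):
    """Interpret an empty or missing cell as 0, otherwise as an integer."""
    if cell is None or str(cell).strip() == "":
        return 0
    return int(cell)


def has_multiple_tsv(arr, nlv):
    """True when arr > 0 or nlv > 0, i.e. the tarball holds several TSV files."""
    return to_count(arr) > 0 or to_count(nlv) > 0


print(has_multiple_tsv("", ""))   # non-repeated, non-multinomial phenotype
print(has_multiple_tsv("2", ""))  # repeated measurement
print(has_multiple_tsv("", "3"))  # multinomial phenotype
```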

8 lvl: level labels

The string labels of each level.

  • it is mostly relevant to multinomial and sequential phenotypes, but can occur in numerical phenotypes as well.
  • lvl is empty for binary phenotypes, since the coding is simply 0=No and 1=Yes.
  • the special level "I=" denotes infinity or an extreme, typically paired with labels like "the entire life", "all the time", "never", etc.
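The exact layout of an lvl cell is not specified here. The sketch below assumes, purely for illustration, that a cell holds code=label pairs joined by a semicolon, in the spirit of the "0=No and 1=Yes" and "I=" notation above. The separator, the sample string, and the function name parse_levels are all assumptions; adjust sep to match the real file.

```python
def parse_levels(lvl, sep=";"):
    """Split an lvl string into {code: label}; `sep` is an assumed delimiter."""
    levels = {}
    if not lvl:
        return levels  # empty cell, e.g. a binary phenotype
    for item in lvl.split(sep):
        # Split on the first "=" only, so labels may themselves contain "=".
        code, _, label = item.partition("=")
        levels[code.strip()] = label.strip()
    return levels


# Hypothetical lvl cell; "I" is the special infinity/extreme level.
example = parse_levels("1=rarely;2=sometimes;I=all the time")
print(example.get("I"))
```

Labels that contain the separator itself would be split incorrectly by this sketch, so it should be checked against a few real rows first.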

Author: Xiaoran Tong

Created: 2021-10-06 Wed 12:21
