mirVAFC is a web server for analyzing WGS/WES sequence variants from a single individual or a family affected by a genetic disorder, with the aim of identifying pathogenic variants according to the ACMG variant interpretation guidelines. mirVAFC requires three inputs: one sequence variant file (VCF) covering a single individual or multiple family members, one pedigree file describing family relationships and affected status, and the disease information. The sequence variants are annotated with data from various databases and predictions from bioinformatics tools, filtered using custom criteria, and classified into categories by estimated pathogenicity. Pathogenic variants are then prioritized based on this classification and on mutation effects.
Users should first select and upload their data files and define appropriate parameters on the data input web page (as shown below) to start the analysis. Note: example input data for testing can be downloaded by clicking the button in the upper right corner.
The variant file should be uploaded in standard VCF format (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41). For family data analysis, a single variant file containing multiple samples should be used. This input is required.
The pedigree file, which indicates family relationships, sex (male or female), and affected status (yes or no), should follow the format described at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Affected status is coded as 0 (unaffected), 1 (affected), 2 (proband), and -9 (missing). Note that individual IDs in the pedigree file must match those in the variant file exactly. This input is required.
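The ID-matching requirement above can be checked locally before upload. The following is a minimal sketch in Python, assuming a whitespace-separated pedigree file whose first six columns are family ID, individual ID, father ID, mother ID, sex, and affected status, and a VCF whose sample names follow the nine fixed columns of the `#CHROM` header line. The function names and the sample data are hypothetical and are not part of mirVAFC.

```python
import io

# Affected-status codes as described in the text above.
STATUS = {"0": "unaffected", "1": "affected", "2": "proband", "-9": "missing"}


def read_pedigree(handle):
    """Return {individual_id: status} from a six-column pedigree file."""
    members = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        _family, individual, _father, _mother, _sex, status = fields[:6]
        if status not in STATUS:
            raise ValueError(f"unknown affected status {status!r} for {individual}")
        members[individual] = STATUS[status]
    return members


def read_vcf_samples(handle):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            # Columns 0-8 are fixed (CHROM ... FORMAT); samples follow.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def id_mismatches(pedigree_ids, vcf_samples):
    """Return (IDs only in the pedigree, IDs only in the VCF)."""
    ped, vcf = set(pedigree_ids), set(vcf_samples)
    return sorted(ped - vcf), sorted(vcf - ped)


# Hypothetical example data for illustration only.
PED_TEXT = "FAM1 child father mother 1 2\nFAM1 father 0 0 1 0\nFAM1 mother 0 0 2 0\n"
VCF_TEXT = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tfather\tMother\n"

pedigree = read_pedigree(io.StringIO(PED_TEXT))
samples = read_vcf_samples(io.StringIO(VCF_TEXT))
only_in_pedigree, only_in_vcf = id_mismatches(pedigree, samples)
print("only in pedigree:", only_in_pedigree)
print("only in VCF:", only_in_vcf)
```

The comparison is case-sensitive on purpose, since the IDs must match exactly; in the made-up data, `mother` and `Mother` are reported as a mismatch.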
The candidate genes file lists genes that are reported to be pathogenic for the input disease or that are of special interest to the user. It contains one gene symbol per line. This input is optional.
This input lists genes that should be reported as incidental findings when known pathogenic variants (according to ClinVar database records) or damaging sequence variants (null variants, or missense variants predicted as damaging by > 5 software tools) are found in them.
This file contains known pathogenicity evidence for sequence variants. The evidence codes should follow those described in Table 3 of the ACMG guideline manuscript (doi:10.1038/gim.2015.30).
chr1 891511 891511 C A PVS1
chr1 1021346 1021346 A G PS4
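The example lines above can be validated with a short parser. This is a minimal sketch in Python; the column meanings (chromosome, start, end, reference allele, alternate allele, evidence code) are inferred from the two example lines and are an assumption, as are the names `Evidence` and `parse_evidence_line`. The accepted codes are the evidence codes of the ACMG guideline (PVS1, PS1-PS4, PM1-PM6, PP1-PP5, BA1, BS1-BS4, BP1-BP7).

```python
import re
from typing import NamedTuple

# Evidence codes defined in the ACMG guideline (doi:10.1038/gim.2015.30).
ACMG_CODE = re.compile(r"^(PVS1|PS[1-4]|PM[1-6]|PP[1-5]|BA1|BS[1-4]|BP[1-7])$")


class Evidence(NamedTuple):
    # Field names are assumptions inferred from the example lines.
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    code: str


def parse_evidence_line(line):
    """Parse one whitespace-separated evidence line into an Evidence record."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, start, end, ref, alt, code = fields
    if not ACMG_CODE.match(code):
        raise ValueError(f"unrecognized evidence code: {code!r}")
    return Evidence(chrom, int(start), int(end), ref, alt, code)


records = [
    parse_evidence_line("chr1 891511 891511 C A PVS1"),
    parse_evidence_line("chr1 1021346 1021346 A G PS4"),
]
for record in records:
    print(record)
```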
Disease: The disease name, or an ID from the OMIM or HPO disease databases, should be entered in the box.
Disease incidence: The prevalence of the disease in the population.
Family history: Whether the disease/phenotype has been observed in other family members.
Parental identity: Users can indicate whether the parental relationship is confirmed, to exclude non-biological relationships such as adoption or sperm/egg donation.
Genome reference: Version of the human genome reference sequence used to produce the input sequence variants.
1000G/ESP/ExAC: Allele frequencies for specific populations from the 1000G (http://www.1000genomes.org/), ESP (http://evs.gs.washington.edu/EVS), and ExAC (http://exac.broadinstitute.org/) population variation databases will be used.
After the analysis job is submitted, the user is directed to a waiting status page (as shown below), where the job ID is supplied. The waiting page refreshes every 30 seconds and redirects to the analysis results page when the job finishes. Users can also supply an email address to be notified when the job finishes; the notification email includes the job ID.
Users can retrieve the analysis results later by entering the job ID on the result page:
Note that results are kept for only 2 days. Users can download the complete analysis results by clicking the 'Export result' button in the upper right corner of the results page.
The outputs of mirVAFC include:
The summary section describes the distribution of variants across several aspects (plotted as shown below): sequencing depth, mapping quality (Phred-scaled), and genotyping quality (Phred-scaled) distributions; variant types (SNPs, INDELs, transitions, transversions, heterozygous, homozygous); mutation origins (paternally inherited, maternally inherited, or de novo); mutation consequences (missense, stop gain or loss, etc.); and clinical classifications. The Ti/Tv ratio and the Het/Hom ratio are also supplied and can be used to assess sequencing or processing bias.
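For readers who want to reproduce the two summary ratios on their own data, the following Python sketch shows one common way to compute them. It is not mirVAFC's implementation; the function names are hypothetical, INDELs and non-ACGT alleles are skipped for Ti/Tv, and genotypes are assumed to be VCF-style GT strings such as `0/1` or `1|1`.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(snvs):
    """Ti/Tv ratio over (ref, alt) pairs; non-SNV entries are ignored."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip INDELs, symbolic alleles, and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to homozygous non-reference genotype calls."""
    het = hom = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or all(a == "0" for a in alleles):
            continue  # skip missing and homozygous-reference calls
        if len(set(alleles)) > 1:
            het += 1
        else:
            hom += 1
    return het / hom if hom else float("nan")


# Made-up example input.
example_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
example_gts = ["0/1", "1/1", "0|1", "0/0", "./."]
print(ti_tv_ratio(example_snvs), het_hom_ratio(example_gts))
```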
On the main results page, sequence variants are ordered by their prioritization results, with the priority rank in the first column. Basic information for each variant, including the position, affected gene, and mutation consequence, is shown.
Clicking 'show' in the Detail column of a variant directs the user to the page for that sequence variant, where more information is shown, including:
(1), Basic information, including the genome coordinates, the mutation impact on protein products, and the affected protein domains.
(2), Population variation database records, including the NCBI dbSNP ID and the allele frequency in three population variation databases.
(3), Phenotypes or diseases related to the affected gene or variant from several disease databases. The ID and a text description of each phenotype are shown, with links to external resources.
(4), Computationally predicted mutation effects for each variant from different methods, which are based on evolutionary conservation, protein structure, amino acid biochemical properties, etc. The prediction results were obtained from the dbNSFP database.
(5), Evolutionary conservation and selection constraint scores for each variant, predicted using different methods.
(6), Biological function annotations for each affected gene from multiple databases, including GO terms, biological pathways, protein interactions, co-expression, and protein complexes. For protein interactions downloaded from BioGRID, only physical interactions are used. For co-expression relationships obtained from COXPRESdb, only those with a correlation coefficient larger than 0.7 are used.
(7), Sequencing and mapping information, including the total number of sequencing reads mapped to the variant position (DP), the number of mapped reads carrying the alternate allele (DP(Alt)), the Phred-scaled genotyping quality (GQ), and the Phred-scaled mapping quality (MQ).
(8), Human tissue expression profiles obtained from the Illumina Human BodyMap 2.0 project (http://www.ensembl.info/blog/2011/05/24/human-bodymap-2-0-data-from-illumina/), in which expression levels for 16 human tissues were measured using RNA-Seq; the values are in RPKM.
(9), Genotype information for all individuals. The inheritance mode is inferred from the genotypes of the child and parents. The 'non-cosegregation' label marks variants that are present in unaffected family members.
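The co-segregation idea behind the label in item (9) can be sketched in a few lines of Python. This is an illustration rather than mirVAFC's own logic: it assumes VCF-style GT strings, treats the proband as affected, and ignores individuals whose genotype or affected status is missing. All names are hypothetical.

```python
def carries_variant(genotype):
    """Return True/False for a called genotype, or None if it is missing."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return any(allele != "0" for allele in alleles)


def cosegregates(genotypes, status):
    """True if the variant is present in every affected individual and
    absent from every unaffected individual of the family.

    genotypes: {individual_id: GT string}
    status:    {individual_id: 'affected' | 'proband' | 'unaffected' | 'missing'}
    """
    for individual, state in status.items():
        carrier = carries_variant(genotypes.get(individual, "./."))
        if carrier is None or state == "missing":
            continue  # assumption: uninformative individuals are ignored
        if state in ("affected", "proband") and not carrier:
            return False
        if state == "unaffected" and carrier:
            return False
    return True


# Made-up trio: affected child and father, unaffected mother.
family = {"child": "proband", "father": "affected", "mother": "unaffected"}
print(cosegregates({"child": "0/1", "father": "0/1", "mother": "0/0"}, family))
print(cosegregates({"child": "0/1", "father": "0/1", "mother": "0/1"}, family))
```

In the second call the unaffected mother carries the variant, which is the situation the 'non-cosegregation' label describes.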
Variants in the different categories (Pathogenic, Likely pathogenic, Benign, Likely benign, and Uncertain) defined by the ACMG guidelines for clinical interpretation of sequence variants are presented in separate tabs on the results pages (Class_P, Class_LP, Class_B, Class_LB, and Class_U, respectively), with Class_I indicating incidental findings. The evidence codes are also shown on the detail annotation page.
For variants with a possibly damaging effect (LOF or predicted to be deleterious by multiple computational methods) on genes that ACMG recommends reporting as incidental findings (Incidental), the associated disorders are also shown.
Sequence variants can be filtered flexibly using multiple criteria to remove possible sequencing or mapping errors, or variants that are unlikely to be pathogenic.
The criteria used for filtering include:
the types of mutations to select according to their impact based on the refGene model, including nonsynonymous SNVs, frameshift or in-frame insertions/deletions, splicing changes, stop gain or loss, and synonymous SNVs;
threshold on the total sequencing depth encoded in the variant file, usually 10;
Phred-scaled genotype quality, which indicates genotyping errors; a threshold of 30 is usually used, corresponding to an error probability of 0.001 or less;
Phred-scaled average read mapping quality for each position, which indicates mapping errors; a threshold of 30 is usually used, corresponding to an error probability of 0.001 or less;
the inheritance mode, or uncertain if unknown, in the case of family data analysis; the specific inheritance mode definitions are as below:
for family data analysis, whether the variants co-segregate with the disease phenotype; co-segregation means that a variant is present in all affected individuals and absent from all unaffected individuals of the same family;
whether variants recorded in dbSNP should be reported;
threshold on the minimum variant allele frequency based on whole-population or subpopulation variation data from the 1000G, ESP, or ExAC databases;
variants within the genomic intervals selected for analysis; intervals in BED format, such as 'chr1 0 1', are accepted;
variants within the specified genes will be reported;
mutations that affect the specified transcripts, with the effect reported;
threshold on the minimum number of computational tools (SIFT, Polyphen2_HDIV, Polyphen2_HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, RadialSVM, and LR, obtained from dbNSFP) that predict the variant as deleterious;
threshold for PhastCons scores (570 as default);
threshold for Grantham scores (100 as default);
threshold for Grantham scores (3 as default);
variants annotated with disease or phenotype terms based on the ClinVar, OMIM, and MGI databases;
variants annotated with functions based on GO terms, KEGG pathways, or BioCarta pathways;
when genes are selected, variants from genes that share a functional module (GO terms, pathways, or physical or regulatory interactions) with them are reported;
variants from similar sequence regions (simple repeats, segmental duplications, RepeatMasker regions, and pseudogenes), which may be caused by mapping errors, will be removed;
variants within the specified genes will be removed.
After the above parameters are defined, the user presses the 'Refresh' button at the bottom to obtain the updated results.
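The Phred-scaled thresholds among the criteria above follow from the standard definition Q = -10 log10(P), so Q = 30 corresponds to an error probability of 0.001. The Python sketch below shows the conversion and a simple quality filter using the usual values quoted above (depth 10, GQ 30, MQ 30). The function names and the default arguments are illustrative assumptions, not mirVAFC parameters.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality Q to an error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P to a Phred-scaled quality Q."""
    return -10 * math.log10(p)


def passes_quality(depth, gq, mq, min_depth=10, min_gq=30, min_mq=30):
    """True if a variant call meets all three minimum thresholds.

    Defaults mirror the 'usual' values quoted in the criteria list;
    they are illustrative, not values read from the server.
    """
    return depth >= min_depth and gq >= min_gq and mq >= min_mq


print(phred_to_error_probability(30))
print(passes_quality(depth=12, gq=45, mq=60))
print(passes_quality(depth=8, gq=45, mq=60))
```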