VariantKey  5.4.1
Numerical Encoding for Human Genetic Variants
genoref.h File Reference

Functions to retrieve genome reference sequences from a binary FASTA file.

#include <stdio.h>
#include <string.h>
#include "binsearch.h"
#include "variantkey.h"

Macros

#define ALLELE_MAXSIZE   256
 Maximum allele length.
 
#define NORM_WRONGPOS   (-2)
 Normalization: Invalid position.
 
#define NORM_INVALID   (-1)
 Normalization: Invalid reference.
 
#define NORM_OK   (0)
 Normalization: The reference allele perfectly matches the genome reference.
 
#define NORM_VALID   (1)
 Normalization: The reference allele is inconsistent with the genome reference (i.e. when it contains nucleotide letters other than A, C, G and T).
 
#define NORM_SWAP   (1 << 1)
 Normalization: The alleles have been swapped.
 
#define NORM_FLIP   (1 << 2)
 Normalization: The allele nucleotides have been flipped (each nucleotide has been replaced with its complement).
 
#define NORM_LEXT   (1 << 3)
 Normalization: Alleles have been left extended.
 
#define NORM_RTRIM   (1 << 4)
 Normalization: Alleles have been right trimmed.
 
#define NORM_LTRIM   (1 << 5)
 Normalization: Alleles have been left trimmed.
 

Functions

static void mmap_genoref_file (const char *file, mmfile_t *mf)
 
static int aztoupper (int c)
 
static void prepend_char (const uint8_t pre, char *string, size_t *size)
 
static char get_genoref_seq (mmfile_t mf, uint8_t chrom, uint32_t pos)
 
static int check_reference (mmfile_t mf, uint8_t chrom, uint32_t pos, const char *ref, size_t sizeref)
 
static void flip_allele (char *allele, size_t size)
 
static void swap_sizes (size_t *first, size_t *second)
 
static void swap_alleles (char *first, size_t *sizefirst, char *second, size_t *sizesecond)
 
static int normalize_variant (mmfile_t mf, uint8_t chrom, uint32_t *pos, char *ref, size_t *sizeref, char *alt, size_t *sizealt)
 
static uint64_t normalized_variantkey (mmfile_t mf, const char *chrom, size_t sizechrom, uint32_t *pos, uint8_t posindex, char *ref, size_t *sizeref, char *alt, size_t *sizealt, int *ret)
 Returns a normalized 64 bit variant key based on CHROM, POS, REF, ALT.
 

Detailed Description

The functions provided here retrieve genome reference sequences from a binary version of a genome reference FASTA file.

The input reference binary files can be generated from a FASTA file using the resources/tools/fastabin.sh script.

Macro Definition Documentation

#define ALLELE_MAXSIZE   256
#define NORM_FLIP   (1 << 2)
#define NORM_INVALID   (-1)
#define NORM_LEXT   (1 << 3)
#define NORM_LTRIM   (1 << 5)
#define NORM_OK   (0)
#define NORM_RTRIM   (1 << 4)
#define NORM_SWAP   (1 << 1)
#define NORM_VALID   (1)
#define NORM_WRONGPOS   (-2)

Function Documentation

static int aztoupper ( int  c)
inline static

Returns the uppercase version of the input character. Note that this is only safe to use with a-z characters: any other character above 'a' is changed as well.

Parameters
c  Character to uppercase.
Returns
Uppercased character
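
The caveat above can be illustrated with a minimal sketch. It assumes the conversion simply flips the ASCII case bit for any code at or above 'a', with no upper bound check; the name toy_aztoupper and this exact logic are assumptions made for illustration, not the library's verified source.

```c
/* Hypothetical stand-in for aztoupper(): assumes every code >= 'a'
 * has its ASCII case bit (0x20) flipped, with no check against 'z'. */
static int toy_aztoupper(int c)
{
    if (c >= 'a')
    {
        return (c ^ ('a' - 'A')); /* 'a' - 'A' is 0x20 in ASCII */
    }
    return c;
}
```

Under this assumption toy_aztoupper('g') gives 'G' and 'G' is left unchanged, but '{' (the code right after 'z') becomes '[', which is why the input should be restricted to letters.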
static int check_reference ( mmfile_t  mf,
uint8_t  chrom,
uint32_t  pos,
const char *  ref,
size_t  sizeref 
)
inline static

Check if the reference allele matches the reference genome data.

Parameters
mf       Structure containing the memory mapped file.
chrom    Encoded chromosome number (see encode_chrom).
pos      Position. The reference position, with the first base having position 0.
ref      Reference allele. String containing a sequence of nucleotide letters.
sizeref  Length of the ref string, excluding the terminating null byte.
Returns
Zero or a positive number in case of success, a negative number in case of error:
  • 0 the reference allele matches the reference genome;
  • 1 the reference allele is inconsistent with the genome reference (i.e. when it contains nucleotide letters other than A, C, G and T);
  • -1 the reference allele does not match the reference genome;
  • -2 the reference allele is longer than the genome reference sequence.
  • -2 the reference allele is longer than the genome reference sequence.
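
The four return codes can be exercised with a small sketch that compares against a plain in-memory string instead of an mmfile_t. The function toy_check_reference and its genome/sizegenome parameters are hypothetical, and the handling of letters other than A, C, G and T is deliberately simplified: any such letter is treated as compatible with the genome, which may be more permissive than the library's actual comparison.

```c
#include <stddef.h>

/* Values copied from the macro list of this file so the sketch
 * compiles on its own. */
#define NORM_WRONGPOS (-2)
#define NORM_INVALID  (-1)
#define NORM_OK       (0)
#define NORM_VALID    (1)

/* Hypothetical stand-in for check_reference(): `genome` is an uppercase
 * in-memory sequence of `sizegenome` bases, `pos` is 0-based. */
static int toy_check_reference(const char *genome, size_t sizegenome,
                               size_t pos, const char *ref, size_t sizeref)
{
    /* The allele must fit inside the reference sequence. */
    if ((pos > sizegenome) || (sizeref > (sizegenome - pos)))
    {
        return NORM_WRONGPOS;
    }
    int ret = NORM_OK;
    for (size_t i = 0; i < sizeref; i++)
    {
        char r = ref[i];
        if ((r >= 'a') && (r <= 'z'))
        {
            r = (char)(r - 'a' + 'A'); /* compare case-insensitively */
        }
        if (r == genome[pos + i])
        {
            continue;
        }
        if ((r == 'A') || (r == 'C') || (r == 'G') || (r == 'T'))
        {
            return NORM_INVALID; /* plain nucleotide that disagrees */
        }
        /* Simplifying assumption: any other letter is accepted but flagged. */
        ret = NORM_VALID;
    }
    return ret;
}
```

With the toy genome "ACGTACGT", the allele "GT" at position 2 yields 0, "GA" yields -1, "GN" yields 1, and "GTA" at position 6 yields -2 because it runs past the end of the sequence.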
static void flip_allele ( char *  allele,
size_t  size 
)
inline static

Flip the allele nucleotides (replaces each letter with its complement). The resulting string is always in uppercase. Supports extended nucleotide letters.

Parameters
allele  Allele. String containing a sequence of nucleotide letters.
size    Length of the allele string.
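
As an illustration of the flip described above, the following sketch complements each letter in place using the standard IUPAC nucleotide codes (A/T, C/G, R/Y, K/M, B/V, D/H, with S, W and N mapping to themselves). The name toy_flip_allele and the lookup-table approach are assumptions; the library may implement the mapping differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for flip_allele(): replaces each letter with its
 * IUPAC complement and uppercases the result. The order of the letters
 * is preserved (this is a complement, not a reverse complement). */
static void toy_flip_allele(char *allele, size_t size)
{
    static const char from[] = "ACGTRYKMBVDHSWN";
    static const char to[]   = "TGCAYRMKVBHDSWN";
    for (size_t i = 0; i < size; i++)
    {
        char c = allele[i];
        if ((c >= 'a') && (c <= 'z'))
        {
            c = (char)(c - 'a' + 'A');
        }
        /* Guard against '\0', which strchr would match at the terminator. */
        const char *p = (c != '\0') ? strchr(from, c) : NULL;
        allele[i] = (p != NULL) ? to[p - from] : c; /* unknown letters pass through */
    }
}
```

Applied to "acgt" this produces "TGCA", and "ArN" becomes "TYN".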
static char get_genoref_seq ( mmfile_t  mf,
uint8_t  chrom,
uint32_t  pos 
)
inline static

Returns the genome reference nucleotide at the specified chromosome and position.

Parameters
mf     Structure containing the memory mapped file.
chrom  Encoded chromosome number (see encode_chrom).
pos    Position. The reference position, with the first base having position 0.
Returns
The nucleotide letter or 0 in case of invalid position.
static void mmap_genoref_file ( const char *  file,
mmfile_t *  mf 
)
inline static

Memory map the genoref binary file.

Parameters
file  Path to the file to map.
mf    Structure containing the memory mapped file.
Returns
Nothing is returned directly; the memory-mapped file descriptors are stored in mf.
static int normalize_variant ( mmfile_t  mf,
uint8_t  chrom,
uint32_t *  pos,
char *  ref,
size_t *  sizeref,
char *  alt,
size_t *  sizealt 
)
inline static

Normalize a variant. Flip alleles if required and apply the normalization algorithm described at: https://genome.sph.umich.edu/wiki/Variant_Normalization

Parameters
mf       Structure containing the memory mapped file.
chrom    Encoded chromosome number.
pos      Position. The reference position, with the first base having position 0.
ref      Reference allele. String containing a sequence of nucleotide letters.
sizeref  Length of the ref string, excluding the terminating null byte.
alt      Alternate non-reference allele string.
sizealt  Length of the alt string, excluding the terminating null byte.
Returns
Zero or a positive bitmask in case of success, a negative number in case of error. When positive, each set bit has a distinct meaning, as defined by the NORM_* macros:
  • bit 0 (NORM_VALID) : The reference allele is inconsistent with the genome reference (i.e. when it contains nucleotide letters other than A, C, G and T).
  • bit 1 (NORM_SWAP) : The alleles have been swapped.
  • bit 2 (NORM_FLIP) : The allele nucleotides have been flipped (each nucleotide has been replaced with its complement).
  • bit 3 (NORM_LEXT) : Alleles have been left extended.
  • bit 4 (NORM_RTRIM) : Alleles have been right trimmed.
  • bit 5 (NORM_LTRIM) : Alleles have been left trimmed.
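
A caller typically needs to separate the negative error codes from the bitmask before testing individual bits. The sketch below shows one way to do that; the helpers norm_has_flag and print_norm_status are hypothetical and not part of the library, and the macro values are copied from the list above so the block compiles without genoref.h.

```c
#include <stdio.h>

#define NORM_WRONGPOS (-2)
#define NORM_INVALID  (-1)
#define NORM_OK       (0)
#define NORM_VALID    (1)
#define NORM_SWAP     (1 << 1)
#define NORM_FLIP     (1 << 2)
#define NORM_LEXT     (1 << 3)
#define NORM_RTRIM    (1 << 4)
#define NORM_LTRIM    (1 << 5)

/* Hypothetical helper: returns 1 if `flag` is set in a successful status.
 * Negative error codes are rejected first, because -1 and -2 have almost
 * every bit set in two's complement and would match any flag. */
static int norm_has_flag(int ret, int flag)
{
    return (ret >= 0) && ((ret & flag) != 0);
}

/* Hypothetical helper: prints a human-readable summary of a status. */
static void print_norm_status(int ret)
{
    if (ret < 0)
    {
        printf("error: %s\n", (ret == NORM_WRONGPOS) ? "invalid position" : "invalid reference");
        return;
    }
    if (ret == NORM_OK)
    {
        printf("reference matches, nothing changed\n");
        return;
    }
    if (norm_has_flag(ret, NORM_VALID)) { printf("reference inconsistent but accepted\n"); }
    if (norm_has_flag(ret, NORM_SWAP))  { printf("alleles swapped\n"); }
    if (norm_has_flag(ret, NORM_FLIP))  { printf("alleles flipped\n"); }
    if (norm_has_flag(ret, NORM_LEXT))  { printf("alleles left extended\n"); }
    if (norm_has_flag(ret, NORM_RTRIM)) { printf("alleles right trimmed\n"); }
    if (norm_has_flag(ret, NORM_LTRIM)) { printf("alleles left trimmed\n"); }
}
```

For example, print_norm_status(NORM_SWAP | NORM_LTRIM) reports a swap and a left trim, while print_norm_status(NORM_INVALID) reports an error without inspecting any bit.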
static uint64_t normalized_variantkey ( mmfile_t  mf,
const char *  chrom,
size_t  sizechrom,
uint32_t *  pos,
uint8_t  posindex,
char *  ref,
size_t *  sizeref,
char *  alt,
size_t *  sizealt,
int *  ret 
)
inline static

Returns a normalized 64 bit variant key based on CHROM, POS, REF, ALT.

Parameters
mf         Structure containing the memory mapped binary FASTA file.
chrom      Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
sizechrom  Length of the chrom string, excluding the terminating null byte.
pos        Position. The reference position.
posindex   Position index: 0 for 0-based, 1 for 1-based.
ref        Reference allele. String containing a sequence of nucleotide letters. The value in the pos field refers to the position of the first nucleotide in the string. Characters must be A-Z, a-z or *.
sizeref    Length of the ref string, excluding the terminating null byte.
alt        Alternate non-reference allele string. Characters must be A-Z, a-z or *.
sizealt    Length of the alt string, excluding the terminating null byte.
ret        Normalization return value (see: normalize_variant).
Returns
Normalized VariantKey 64 bit code.
static void prepend_char ( const uint8_t  pre,
char *  string,
size_t *  size 
)
inline static

Prepend a character to a string.

Parameters
pre     Character to prepend.
string  String to modify.
size    Input string length.
Returns
void
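
A minimal sketch of such an in-place prepend follows. It assumes the string is null-terminated, that the size is incremented through the pointer, and that the caller has reserved room for one extra byte; the name toy_prepend_char and the use of memmove are assumptions for illustration rather than the library's confirmed implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for prepend_char(). The buffer behind `string`
 * must have capacity for at least *size + 2 bytes (new char + NUL). */
static void toy_prepend_char(const uint8_t pre, char *string, size_t *size)
{
    /* Shift the existing characters and the terminating NUL one slot right;
     * memmove is required because source and destination overlap. */
    memmove(string + 1, string, *size + 1);
    string[0] = (char)pre;
    (*size)++;
}
```

Starting from a buffer holding "CGT" with size 3, prepending 'A' leaves "ACGT" and size 4.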
static void swap_alleles ( char *  first,
size_t *  sizefirst,
char *  second,
size_t *  sizesecond 
)
inline static
static void swap_sizes ( size_t *  first,
size_t *  second 
)
inline static