.. _reference: ============== What is Meryl? ============== Meryl is a tool for creating and working with DNA sequence k-mers. K-mers are typically created by **counting** how many times each k-mer sequence occurs in some collection of sequences. Meryl refers to this as the **value** of the k-mer. Each k-mer can also be annotated with a **label** of up to 64-bits of arbitrary binary data. The label can be interpreted as a collection of yes/no flags or binary-coded data, for example, 7 bits could be used to indicate which of 7 samples the k-mer is present in, or a set of 10 bits could be used to `unary encode `_ the `melting temperature `_ of the k-mer. Databases are filtered and combined using **actions**. An action tells how to **assign** the output value and label of a k-mer, and how to **select** desired k-mers for output to an output **database**. .. note:: The original meryl used order ACTG for a reason that turned out to be incorrect. It was believed that complementing a binary sequence would be easier in that order, but it is just as easy in the normal order. The revised order does have the appealing property that GC content can be computed by counting the number of low-order bits set in each base, where the more standard ACGT order requires additional operations. In the revised (ACTG) order, flipping the first bit of each two-bit encoded base will complement the base. This can be done with an exclusive or against b101010. .. code-block:: none A C T G 00 01 10 11 v| v| v| v| -- flip first bit to complement 10 11 00 01 T G A C compl = kmer ^ 0b101010 NumGC = popcount(kmer & b010101) In the usual (ACGT) order, complementation can be accomplished by flipping every bit; an exclusive-or against b111111. GC content can be computed by counting bits also, but we need to count the number of two-bit encoded bases where the first and second bits differ. This requires shifting the encoded k-mer one place, comparing the -- now overlapping -- adjacent bits with XOR, and finally counting the number of set bits. .. code-block:: none A C G T 00 01 10 11 vv vv vv vv -- flip every bit to complement 11 10 01 00 T G C A compl = kmer ^ 0b111111 NumGC = popcount( (kmer>>1 ^ kmer) & 0b010101 ) It is yet to be decided if meryl2 will also use the same order (maintaining compatibility with meryl1) or if it will use the more typical order (maintaining compatibility with the rest of the world).. .. note:: A k-mer is a short sequence of nucleotide bases. The **forward k-mer** is from the supplied sequence orientation, while the **reverse-k-mer** is from the reverse-complemented sequence. The **canonical k-mer** is the lexicographically smaller of the forward-mer and reverse-mer. For example, the sequence GATCTCA has five forward 3-mers: GAT, ATC, TCT, CTC and TCA. The canonical k-mers are found by reverse-complementing each of those and picking the lexicographically smaller: .. code-block:: none :caption: Forward, reverse and canonical 3-mers. :linenos: G A (GAT, ATC) -> ATC T (ATC, GAT) -> ATC C (TCT, AGA) -> AGA T (CTC, GAG) -> CTC C (TCA, TGA) -> TCA A In meryl, k-mers can be up to 64 bases long and are canonical by default. Given at least one sequence file, meryl will find the list of k-mers present in it and count how many times each one occurs. The count becomes the ``value`` of the k-mer. These are stored in a meryl database. The example in the sidebar would store: .. code-block:: none ATC 2 AGA 1 CTC 1 TCA 1 While values are typically interpreted as the frequency of the k-mer in some set of sequences, they are simply unsigned 32-bit integers (a maximum value of 4,294,967,295) and cane be used to store any arithmetic data. The (optional) label of a k-mer can contain up to 64 bits worth of non-arithmetic data. The label can, for example, be used to assign a ``type`` or ``class`` to certain k-mers, or to mark k-mers as coming from a specific source, etc. Labels are operated on by the binary logic operations (AND, OR, XOR, NOT) and can also be shifted to the left or right. The primary purpose of meryl is to combine multiple k-mer databases into a single output database by computing new values and labels and filtering k-mers based on their value, label, base composition and presence or absence from specific databases. It does this by passing each k-mer through a tree of ``actions``. A leaf node of the tree reads k-mers from input databases (or by counting k-mers in input sequence files), filtering k-mers via an action, and emitting k-mers to other nodes or output databases. ========= Databases ========= .. sidebar:: Database Implementation Details Each k-mer is stored by breaking the binary represntation into three pieces: a file prefix, a block prefix, and a suffix. A k-mer needs 2*k bits to represent it. The file prefix is 6 bits wide, representing one of the 64 possible files in a database. Inside a file, k-mers are stored in blocks, each k-mer in a block will have the same block prefix. The suffix data for a block is stored using Elias-Fano encoding (CITE) where each suffix is split into two pieces. The first piece is encoded as a unary offset from the last piece, and the second piece is a N-log2(N)+1 binary value. At present, values are stored as plain 32-bit integers, stored immediately after the k-mer data. A set of k-mers, each k-mer with a value and a label, is stored in a **database**. The database is a directory with 129 binary files in it -- 64 data files, 64 index files and one master index. This division lets meryl easily process each of these files independently, making effective use of up to 64 compute threads. Databases also store the k-mer size (**-k** option), label size (**-l** option), and any simple sequence reductions (**--compress** and **--ssr** options) applied. It is not possible to combine databases with different parameters. Each k-mer is stored at most once per database - thus a k-mer cannot have multiple values of labels associated with it (though we did envision doing this at one time). =============== Counting K-mers =============== The **count** action reads sequence from any number of input files and counts the number of times each (canonical) k-mer occurs. Actions **count-forward** and **count-reverse** will instead count k-mers as they are oriented in the input sequence or the reverse-complement sequence, respectively. Input sequences can be in either FASTA, FASTQ, raw bases, or if compiled with Canu support, in a Canu seqStore database. Sequence files can be gzip, bzip2 or xz compressed. An output database must be supplied to all count actions. K-mers are both written to the output database and provided as input to destination actions. Count actions, unless accompanied by an action that reads input from an existing database, MUST specify the k-mer size on the command line with the **-k** option. Count actions can include a value or label assignment, but cannot include any selectors. A value assignment could be used to assign each k-mer a constant value instead of the count; a label assignment could be used to assign each k-mer a constant representing the input file. Counting is resource intense. Meryl will use memory and threads up to a limit supplied by: the operating system (usually physical memory and the number of CPUs), a grid manager (such as Slurm, PBS or SGE) or a command line option (**-m** and **-t**). Two algorithms are used for counting k-mers. The algorithm that is expected to use the least memory is used. The choice depends on the size of the input sequences and the k-mer size. Counting Small k-mers (k < 17) ------------------------------ .. warning:: does count really only use one thread here? .. warning:: is this method used even for small amount of input sequence? .. sidebar:: Small k-mer Counting Implementation Details Each integer counter is initially a 16-bit value. Once any count exceeds 2\ :sup:`16` = 65,535 another bit is added to all value, resulting in 17-bit values for every k-mer. Once any count then exceeds 2\ :sup:`17` = 131,072, another bit is added, and so on. Thus, memory usage is 512 MB * log\ :sub:`2` maximum_count_value For k at most 16, meryl counts k-mers directly, that is, by associating an integer count with each possible k-mer. This has the benefit of being simple and uses a constant amount of memory regardless of the size of the input, but quickly exhausts memory for even moderate k-mer sizes. There are 4\ :sup:`k` k-mers of size k; for k=16, there are 4,294,967,296 possible k-mers. Counting 16-mers with this method will use at least 8 GB of memory, independent of input size: counting 16-mers in an E.coli genome will use 8 GB of memory, despite there being only 5 million or so k-mers. Further, memory usage can increase depending on the maximum count value. This method uses only a single thread to read the input sequence and increment counters in the array, but multiple threads can be used to generate the output database. Counting Large k-mers (k > 15) ------------------------------ .. sidebar:: Large k-mer Counting Implementation Details Each k-mer is split into a prefix and a suffix. The prefix is used to select a list to which the suffix is added. When the (approximate) size of all lists exceeds a user-supplied threshold, each list is sorted, the suffixes are counted, and this subset of counted k-mers is output to an intermediate database. After all k-mers are processed, the intermediate databases are merged into one. For k larger than 15, or for small amounts of input sequence, meryl counts k-mers by first converting the sequence to a list of k-mers, duplicates included, then sorts the list, then scans the list to count the number of times each k-mer is present. If all k-mers in an input sequence do not fit in memory, a partial result is written to disk. After all input sequences have been processed, the partial results are combined into a single output database. In practice, this method requires several additional gigabytes of memory to minimize the overhead of writing and merging partial results. This method can use multiple threads for every stage. ======= Actions ======= Meryl processing is built around **actions**. An action loads a k-mer from one or multiple databases (or, for counting actions, computes the k-mer from a sequence file) computes value and label for it (based on the input kmer values and labels), decides if it should be output or discarded (e.g., "if the new value is more than 100, output the k-mer"), and saves it to an output database or text file or displays it on the screen or passes it to another action for further processing. An action is specified as an alias (listed below) or by explicitly stating all parameters. The parameters describe: * how to compute the value of the k-mer * how to compute the label of the k-mer * conditions when a k-mer should be output or discarded: - based on which input databases it came from - based on the input and/or output values of the k-mer - based on the input and/or output labels of the k-mer - based on the sequence of the k-mer * what to do with output k-mers - output them to a new database - print them to ASCII output files - display them on the terminal - pass them to other actions .. note:: K-mers are read "in order" from the inputs. If an input does not contain the "next" k-mer, it does not participate in the action processing. For example, suppose we have three input databases with the following 4-mers and their counts: .. code-block:: none :caption: Sample databases. :linenos: input-1 input-2 input-3 AAAA/1 AAAA/2 AAAA/3 AAAC/1 CAAT/2 CCCC/3 CAAT/1 GGGG/3 GGGG/1 A ``union-sum`` action with these three databases as input will output: .. code-block:: none :caption: Sample output from union-sum action. :linenos: AAAA/6 (using the k-mer from input-1, input-2 and input-3) AAAC/1 (... from input-1) CAAT/3 (... from input-1 and input-2) CCCC/3 (... from input-3) GGGG/4 (... from input-1 and input-3) An **assignment** computes the output value (label) for each k-mer from among the input values (labels). At most one assignment can be supplied for the value and one for the label. A **selector** decides if the k-mer should be output or discarded. Selectors can use input values (labels), the computed output value (label), the base composition of the k-mer and which specific inputs the k-mer was present in. Any number of selectors can be supplied, linked with **and**, **or** and **not** operators. See SELECTORS. Though it is possible to specify all those choices explicitly, **aliases** are provided for most common operations. Aliases exist to support common operations. An alias sets the ``value``, ``label`` and ``input`` options and so these are not allowed to be used with aliases. Examples of aliases and their explicit configuration: .. table:: Action Aliases :widths: 25 30 45 +--------------------+--------------------+----------------------------------------------+ | Alias | Output k-mer if... | Sets value to the... | +====================+====================+==============================================+ | union | ...k-mer is in any | ...number of databases the k-mer is in. | +--------------------+ input database. +----------------------------------------------+ | union-min | | ...smallest input value. | +--------------------+ +----------------------------------------------+ | union-max | | ...largest input value. | +--------------------+ +----------------------------------------------+ | union-sum | | ...sum of the input values. | +--------------------+--------------------+----------------------------------------------+ +--------------------+--------------------+----------------------------------------------+ | intersect | ...k-mer is in all | ...value of the k-mer in the first database. | +--------------------+ input databases. +----------------------------------------------+ | intersect-min | | ...smallest input value. | +--------------------+ +----------------------------------------------+ | intersect-max | | ...largest input value. | +--------------------+ +----------------------------------------------+ | intersect-sum | | ...sum of the input values. | +--------------------+--------------------+----------------------------------------------+ +--------------------+--------------------+----------------------------------------------+ | subtract | ...k-mer is in the | ...value of the k-mer in the first database | | | first database. | minus all other values. | +--------------------+ +----------------------------------------------+ | difference | | ...value of the k-mer in the first database. | +--------------------+--------------------+----------------------------------------------+ +--------------------+--------------------+----------------------------------------------+ | less-than X | ...k-mer is in the | ...value of the k-mer. | +--------------------+ first and only | | | greater-than X | database and the | | +--------------------+ value meets the | | | at-least X | speficied | | +--------------------+ condition. | | | at-most X | | | +--------------------+ | | | equal-to X | | | +--------------------+ | | | not-equal-to X | | | +--------------------+--------------------+----------------------------------------------+ +--------------------+--------------------+----------------------------------------------+ | increase X | ...k-mer is in the | ...value of the k-mer modified by | +--------------------+ first and only | the specified operation. | | decrease X | database. | | +--------------------+ | (divide-round rounds 0 up to 1) | | multiple X | | | +--------------------+ | | | divide X | | | +--------------------+ | | | divide-round X | | | +--------------------+ | | | modulo X | | | +--------------------+--------------------+----------------------------------------------+ .. warning:: This table has not been verified! .. table:: Action Aliases :widths: 19 19 19 16 14 13 +----------------+---------------------------------------------------------------------------------+ | | Action | | Alias +------------------------------------+--------------------------------------------+ | + Assignment | Selector | +----------------+-------------------+----------------+--------------+--------------+--------------+ | union | value=sum | label=or | input:any | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | union-min | value=min | label=min-value| input:any | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | union-max | value=max | label=max-value| input:any | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | union-sum | value=sum | label=or | input:any | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ +----------------+-------------------+----------------+--------------+--------------+--------------+ | intersect | value=first | label=and | input:all | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | intersect-min | value=min | label=min-value| input:all | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | intersect-max | value=max | label=max-value| input:all | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | intersext-sum | value=sum | label=and | input:all | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ +----------------+-------------------+----------------+--------------+--------------+--------------+ | subtract | value=sub | label=first | input:first | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | difference | value=sub | label=first | input:first | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ +----------------+-------------------+----------------+--------------+--------------+--------------+ | less-than X | value=first | label=first | input:only | value:X | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | at-least X | value=first | label=first | input:only | value:>=X | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | at-most X | value=first | label=first | input:only | value:<=X | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | equal-to X | value=first | label=first | input:only | value:==X | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | not-equal-to X | value=first | label=first | input:only | value:!=X | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ +----------------+-------------------+----------------+--------------+--------------+--------------+ | increase X | value=\@1+X | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | decrease X | value=\@1-X | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | multiply X | value=\@1*X | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | divide X | value=\@1/X | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | divide-round X | value=\@1/X [#a]_ | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ | modulo X | value=\@1%X | label=first | input:only | value: | label: | +----------------+-------------------+----------------+--------------+--------------+--------------+ .. [#a] The ``divide-round`` alias rounds values of 0 up to 1. A full action is: .. code-block:: none :caption: Fully general action template. :linenos: [ action-name output:database= # to a binary database output:list= # to ascii files output:show # to stdout output:pipe=