Meryl

Quick Start

To give a gentle introduction to meryl, we’ll use three random Escherichia coli assemblies pulled from GenBank. They’re not quite random; they’re flagged as complete and have the highest number of scaffolds - which, we hope, means they have a complete nuclear genome and a few handfuls of plasmids.

We’ll use EC931, SCEC020022 and MSB1_4I-sc-2280412, but give them shorter names for convenience.

mkdir data

curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/522/225/GCA_014522225.1_ASM1452222v1/GCA_014522225.1_ASM1452222v1_genomic.fna.gz
curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/165/095/GCA_002165095.2_ASM216509v2/GCA_002165095.2_ASM216509v2_genomic.fna.gz
curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/071/835/GCA_905071835.1_MSB1_4I/GCA_905071835.1_MSB1_4I_genomic.fna.gz

mv -i GCA_014522225.1_ASM1452222v1_genomic.fna.gz data/ec.fna.gz
mv -i GCA_002165095.2_ASM216509v2_genomic.fna.gz  data/sc.fna.gz
mv -i GCA_905071835.1_MSB1_4I_genomic.fna.gz      data/ms.fna.gz

First, lets count the kmers in each and save the results in meryl databases.

% meryl count k=42 data/ec.fna.gz output ec.meryl

This command does just as it reads (left to right): count 42-mers in a FASTA input and output the results to a meryl database.

We can show the k-mers in the database with:

% meryl threads=1 print ec.meryl | head

AAAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCAC      1
AAAAAAAAAACTTCCACCAATATGATGGGTGGCGTACAGTAA      1
AAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCACC      1
AAAAAAAAAATAGACAAAAAATACTTTATCAAAACATACATA      1
AAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCTG      1
AAAAAAAAACCCGATTTTATCAGGCGTCAACCTCTGAATTGT      1
AAAAAAAAACCCGCTGATTAAGCGGGTTTTGAATTCTTGCTG      1
AAAAAAAAACCTGAAAAAACGGCCTGACGTGAATCAAGCAAT      1
AAAAAAAAACCTGCCATCGCTGGCAGGTTTTTTATGACTAAA      1
AAAAAAAAACCGACCAAGGGTCGGGGCAAGAATCAGAGTCTG      1

(Detail: Option ‘threads=1’ is supplied to prevent meryl from processing its 64 data chunks in parallel; for the print operation this results in the output kmers being unsorted.)

Let’s now count the other two databases and combine the kmers from all three genomes into one database, all in one command.

% meryl \
    union-sum \
    output union.meryl \
    [count data/sc.fna.gz output sc.meryl] \
    [count output ms.meryl data/ms.fna.gz] \
    ec.meryl

There’s a lot going on here. There are three ‘actions’ and three ‘output’ files. Each action starts a new ‘nesting-level’. A nesting-level has exactly one action, at most one output, and any number of inputs.

The union-sum action is at nesting-level 1, and writes output to ‘union.meryl’. Continuing to read the command line left to right, we find a new action, the counting of kmers in sc.fna.gz. The output of this action is fed as input to the union-sum action. Since this is a new action, it makes a new nesting-level (2). The square brackets serve to cordon off a nesting level. When the right square bracket is enountered, the nesting-level is decremented back to 1. Then another count action is encountered, which also feeds it’s output to the union-sum action. Finally, a meryl database is enountered. Since we’re back to nesting-level 1, this, too, becomes input to the union-sum command.

The end result of this is to count kmers in sc.fna.gz and ms.fna.gz, writing the kmers from each to s.meryl and ms.meryl, respectively. Finally, the kmers from all three genomes are combined into one database, summing the counts for kmers that appear in multiple databases.

Notice also that the ‘k=42’ option is not present. Meryl will determine the kmer size from the ‘ec.meryl’ input database and use that for all operations. If a kmer size is supplied, it must match the database, and all databases must be of the same size.

Before meryl starts processing, it will output a summary of the processing it will perform. Note, however, that counting operations occur before this summary is reported.

PROCESSING TREE #1 using 16 threads.
  opUnionSum
    opPassThrough
      sc.meryl
    opPassThrough
      mc.meryl
    ec.meryl
    output to union.meryl

Detail: The ‘count’ actions are converted to ‘pass-through’ actions once the kmers are counted.

A meryl database also stores the histogram of kmer values. This can be displayed:

% meryl histogram ec.meryl

1       4911809
2       37336
3       7632
4       1217
5       2705
6       2232
7       4544
8       384
9       862
11      3
12      4
15      967
16      230
18      1
19      1
21      81
22      3
27      39
29      4
30      21
31      19
32      5
37      39
48      3
49      42
50      38
51      15
52      493
53      13

Which hints there is a 52 copy repeat of around 500 bases in Escherichia coli EC931. Histograms from the other two genomes show either no high copy repeat (Escherichia coli SCEC020022, ‘sc.meryl’) or a potential 64 copy repeat (Escherichia coli MSB1_4I-sc-2280412, ‘ms.meryl’). Lets now extract those kmers and see where they are on the genomes.

% meryl print at-least 48 ec.meryl output ec-repeats.meryl > ec-repeats.dump

This command does two things: it creates a new meryl database of just the repeat kmers, and creates a text file of those kmers.

The meryl-lookup tool compares FAST/FASTQ sequences against a meryl database (or several databases). We’ll use it to generate a bed file of the bases covered by kmers in a database.

% meryl-lookup -sequence data/ec.fna -mers ec-repeats.meryl -bed-runs > ec-repeats.bed

From this we see that there is not a single 52-copy repeat, but several shorter repeats. There are several instances of a 447 base repeat with single base differences:

CP049118.1      3782667 3782794
CP049118.1      3782794 3783114

CP049118.1      3762937 3763257
CP049118.1      3763257 3763384

Not really part of meryl, the high-count kmers can be passed to a greedy assembler with nice results (the greedy assembler is included in the meryl source code, but isn’t installed in the binary directory).

% perl $MERYL/scripts/greedy-assemble-kmers.pl < ec-repeats.dump

>1
AGCCTGTCATACGCGTAAAACAGCCAGCGCTGGCGCGATTTAGCCCCGACATAGCCCCACTGTTCGTCCATTTCCGCGCAGACGATGACGTCACTGCCCG
GCTGTATGCGCGAGGTTACCGACTGCGGCCTGAGTTTTTTAAGTGACGTAAAATCGTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCATCCA
ACGCCATTCATGGCCATATCAATGATTTTCTGGTGCGTACCGGGTTGAGAAGCGGTGTAAGTGAACTGCAGTTGCCATGTTTTACGGCAGTGAGAGCAGA
GATAGCGCTGATGTCCGGC
>2
ATGGCGACGCTGGGGCGTCTTATGAGCCTGCTGTCACCCTTTGACGTGGTGATATGGATGACGGATGGCTGGCCGCTGTATGAATCCCGCCTGAAGGGAA
AGCTGCACGTAATCAGCAAGCGATATACGCAGCGAATTGAGCGGCATAACCTGAATCTGAGGCAGCACCTGGCACGGCTGGGACGGAAGTCGCTGTCGTT
CTCAAAATCGGTGGAGCTGCATGACAAAGTCATCGGGCATTATCTGAACATAAAACACTATCAATAAGTTGGAGTCATTACC
>3
GTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGAACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGCACCTCAAAAACACCATCATACAC
TAAATCAGTAAGTTGGCAGCATCACC

Dropping the kmer threshold to 10 and assembling those kmers finds 11 repeat sequences, one of length 770 bp and one of length 1195 bp.

Let’s now find the high-count kmers common to EC931 and MSB1_4I-sc-2280412 and assemble those.

% meryl \
    print \
    intersect \
      [at-least 48 ec.meryl] \
      [at-least 55 ms.meryl] \
  | \
  perl $MERYL/scripts/greedy-assemble-kmers.pl

>1
GCCGGACATCAGCGCTATCTCTGCTCTCACTGCCGTAAAACATGGCAACTGCAGTTCACTTACACCGCTTCTCAACCCGGTACGCACCAGAAAATCATTG
ATATGGCCATGAATGGCGTTGGATGCCGGGCAACAGCCCGCATTATGGGCGTTGGCCTCAACACGATTTTACGTCACTTAAAAAACTCAGGCCGCAGTCG
GTAACCTCGCGCATACAGCCGGGCAGTGACGTCATCGTCTGCGCGGAAATGGACGAACAGTGGGGCTATGTCGGGGCTAAATCGCGCCAGCGCTGGCTGT
TTTACGCGTATGACAGGCT
>2
GGTGATGCTGCCAACTTACTGATTTAGTGTATGATGGTGTTTTTGAGGTGCTCCAGTGGCTTCTGTTTCTATCAGCTGTCCCTCCTGTTCAGCTACTGAC
GGGGTGGTGCGTAACGGCAAAAGCAC
>3
ATGGCGACGCTGGGGCGTCTTATGAGCCTGCTGTCACCCTTTGACGTGGTGATATGGATGACGGATGGCTGGCCGCTGTATGAATCCCGCCTGAAGGGAA
AGCTGCACGTAATCAGCAAGCGATATACGCAGCGAATTGAGCGGCATAACCTGAATCTGAGGCAGCACCTGGCACGGCTGGGACGGAAGTCGCTGTCGTT
CTCAAAATCGGTGGAGCTGCATGACAAAGTCATCGGGCATTATCTGAACATAAAACACTATCAATAAGTTGGAGTCATTACC

Aside from a strand difference caused by the assembler, they’re the same!

We’ll stop the quick start here, before we barrel out of control into repeats.

Frequently Asked Questions

Tutorial and Examples

Meryl is a tool for counting kmers in a DNA sequence file. The ‘count’ of a kmer is the number of times that particular sequence occurs in the input sequences - in meryl terminology, this is the ‘value’ of the kmer. Each kmer can additionally have a ‘label’, up to 64-bits of binary data, associated with it. This information is stored in a ‘meryl database’.

Meryl processing is built around ‘actions’. An action can count the kmers in a sequence file creating a meryl database for future use, combine multiple databases into one, filter kmers in a single database by value or label, or modify the labels of kmers in a single database. Actions can be chained together in a processing tree and evaluated in a single invocation of meryl.

Some examples of what this can accomplish:

  1. Count the 22-mers in two sequence files, discard the unique kmers and output the rest to a database.

    meryl \
    output kmers.meryl \
    greater-than 1 \
    count \
      k=22 \
      sequences.fasta.gz
    
  2. Output kmers with a value at least 10 that exists in 2 out of 4 input databases. Notice that the size of the k-mer (‘k=22’) does not need to be supplied as it is implicit in the input databases.

    meryl \
    present-in:2 \
      [filter value:>=10 a.meryl] \
      [filter value:>=10 b.meryl] \
      [filter value:>=10 c.meryl] \
      [filter value:>=10 d.meryl]
    
  3. You can also count the k-mers in the same command. Since the ‘count’ action doesn’t know what k-mer size to use, we must again supply it.

    meryl \
    k=22 \
    present-in:2 \
      [filter value:>=10 count a.meryl] \
      [filter value:>=10 count b.meryl] \
      [filter value:>=10 count c.meryl] \
      [filter value:>=10 count d.meryl]
    
  1. Output a k-mer if it exists in at least three databases with count greater than 100, but output the minimum count the kmer has in any input database.

    meryl \
    present-in:2 value:min \        #  action-1
      [present-in:any value:min \   #  action-2
        a.meryl \
        b.meryl \
        c.meryl \
        d.meryl \
        e.meryl \
      ] \
      [present-in:3 \               # action-3
        [filter value:>100 a.meryl] \
        [filter value:>100 b.meryl] \
        [filter value:>100 c.meryl] \
        [filter value:>100 d.meryl] \
        [filter value:>100 e.meryl] \
      ]
    

    As the comments hint, there are three actions.

    action-1 has two inputs, ‘action-2’ and ‘action-3’. It requires that kmers be present in exactly two inputs and sets the value to the minimum value in any input.

    action-2 has five inputs, all from meryl databases. It outputs a kmer if it exists in any input and sets the value to the minimum value in any input.

    action-3 has five inputs, each from another action. It outputs a kmer if it exists in exactly three inputs, the value assigned to the output kmer isn’t specified and will default to [….]. Each of the five inputs is an action that outputs only kmers whose value greater than 100.

Actions

A single meryl action is specified as:

[action
   output=<database.meryl>
   print=<file.mers | files.##.mers | stdout>
   value:value_option
   label:label_option
   input1
   input2
   ... ]

There are four primary actions:

count

create a meryl database from FASTA or FASTQ inputs

filter

select kmers based on their value of label from a single input

modify

change kmer values or labels in a single input

present-in

combine multiple inputs into one output

These four actions are described in detail later. For now, just rembmer that an action processes multiple input databases (or sequence inputs, for ‘count’) into exactly one output database.

Output Modifier

If present, the ‘output’ modifier will write the kmers generated by this action to the specified meryl database. The database will be created if it doesn’t exist, or erased if it does exist. The kmers will also be supplied to any actions using this action as an input.

‘count’ actions currently require an output database.

Inputs

For counting actions, inputs must be FASTA or FASTQ files, either uncompressed or gzip, bzip2, xz compressed (with suffix ‘.gz’, ‘.bz2’ or ‘.xz’). Any number of inputs may be supplied. All kmers generated will be in the same output database with no tracking of where they came from.

For all other actions, inputs can be a mix of other actions or meryl databases. Some actions require exactly one input (for example, ‘filter’) while other actions can take any number of inputs (for example, ‘union’).

Modifiers

The ‘value’, ‘label’ and ‘kmer’ modifiers serve slightly different purposes depending on the action.

For present-in and modify actions, these modifiers describe how to combine the input values and labels into a single output value or label.

For filter actions, these modifiers to describe how to select input kmers for output.

When value:#c, value:first, value:min or value:max are used, the label operation acts ONLY on the kmers matching the value selection. For example, if value:min finds value=5 is the minimu, label=or would combine the labels of all kmers with value=5. Contrast this with value:add (which would set the output value to the sum of the kmer values in all databases) and label:and (which would set each bit in the output label to true if the corresponding bit is true in all inputs).

Likewise for label:#c, label:first, label:minweight and label:maxweight. For example, when label:#c is used, value:add would sum the values of all labels that are the same as constant c.

Constants begin with a ‘#’ symbol and end with a single letter denoting the base: #011100b (binary), #34o (octal), #28d (decimal), #28 (decimal), #1ch (hexadecimal) are all the same constant. Note that a constant with no base indicated is interpreted as being decimal.

Constants are optional.

Note that ‘value:first’ and ‘label:first’ are the value/label of the kmer in the first file it is present in, which is not necessarily the first input file.

Value Modifier

For present-in and modify actions, the value of the kmer is set to:

value:#c

constant c

value:@f

the value of the kmer in the f’th input file

value:first

that of the first input that the kmer is present in

value:selected

that of the kmer selected by a label modifier (e.g., minweight, maxweight)

value:min#c

the minimum of all values and constant c

value:max#c

the maximum of all values and constant c

value:add#c

the sum of all values and constant c

value:sum#c

same as add

value:sub#c

that of the first kmer (see value:first) minus all other values and constant c, thresholding to 0 if negative

value:dif#c

same as sub

value:mul#c

the multiplication of all values and c

value:div#c

that of the first kmer (see value:first) divided by all other values and constant c. the result of the division is truncated; value:mod can be used to get the remainder

value:mod#c

that of the remainder of the corresponding value:div modifier

value:rem#c

same as mod

value:modzero#c

the goofy one from merqury?

For filter actions, the kmer is output if the value is:

value:=#c

is equal to c

value:==#c

is equal to c

value:!=#c

is not equal to c

value:<>#c

is not equal to c

value:<#c

is less than c

value:>#c

is greater than c

value:<=#c

is at most c

value:>=#c

is at least c

Label Modifier

For present-in and modify actions, the label of the kmer is set to:

label:#c

constant c

label:first

that of the first input that the kmer is present in

label:selected

that of the kmdf selected by a value modifier (e.g., min, max)

label:and#c

the bitwise AND of all labels and constant c

label:or#c

the bitwise OR of all labels and constant c

label:xor#c

the bitwise XOR of all labels and constant c

label:sub#c

that of the first input after subtracting (turning off) bits set in any other label or constant c

label:inv

that of the bitwise invert (equivalent to label:xor#ffffffh)

label:minweight#c

to the label (or constant) with the least bits set

label:maxweight#c

to the label (or constant) with the most bits set

label:shl#c

shift label bits left by c

label:shr#c

shift label bits right by c

label:rol#c

rotate label bits left by c

label:ror#c

rotate label bits right by c

For filter actions, the kmer is output if the label is:

label:=#c

is equal to c

label:==#c

is equal to c

label:!=#c

is not equal to c

label:and#c

has at least one bit set that is also set in c

label:or#c

nonsense

label:xor#c

complicated

label:weight=#c

true if the number of bits set in >, =, < c

Kmer Modifier

This modifier is invalid for present-in and modify actions.

For filter actions, the kmer is output if:

kmer:AT==X

output a kmer if the AT content is X

kmer:AT!=X

kmer:AT>X

kmer:AT>=X

kmer:AT<X

kmer:AT<=X

kmer:GC==X

output the kmer if the GC content is X

kmer:GC!=X

kmer:GC>X

kmer:GC>=X

kmer:GC<X

kmer:GC<=X

kmer:A==X

output if kmer has exactly, more than, less than X A’s

kmer:C==X

kmer:G==X

kmer:T==X

kmer:[ACGTn]

output if kmer matches / does not match some pattern

Advanced Label Filtering

To generalize the label tests, it is sufficient (is it?) to test if the label is (is not) equal to C, or if the label is (is not) contained in or is (is not) contained by the constant C.

A label is contained in a constant if the only bits set in the label are also set in the constant. Label 00 is contained in all constants, and constant 11 contains all labels.

case

expression

interpretation

interpretation

interpretation

0

l == c |

L is equal to C

l != c

L is not equal to C

1

l&c == l

l|c == c

L is contained in C

no bit set in L that is not set in C

all of the bits set in L are set in C

2

l&c != l

l|c != c

L is not contained in C

a bit set in L that is not set in C

3

l&c == c

l|c == l

L contains C

no bit not set in L that is set in C

all of the bits set in C are set in L

4

l&c != c

l|c != c

L does not contain C

a bit not set in L that is set in C

5

l&c == 0

L and C have no set bits in common

6

l&c != 0

L and C have set bits in common

A general label filter can be specified by picking one of four functions for each bit, then testing if: all functions are true, some function is true, or no functions are true. (this cannot capture the weight tests)

The four functions to modify a label before it is tested are:

'0' - zero(bit) =  0    - set testable bit to 0
'1' - one (bit) =  1    - set testable bit to 1
'+' - pass(bit) =  bit  - set testable bit to the true     label bit
'-' - flip(bit) = !bit  - set testable bit to the inverted label bit

Once a label is modified by the functions, we can test if:

'all-1' -- all  bits are 1, no bits are 0 -- this roughly corresponds to an AND  operation
'any-1' -- some bit  is  1                -- this roughly corresponds to an OR   operation
'any-0' -- some bit  is  0                -- this roughly corresponds to an NAND operation
'all-0' -- all  bits are 0, no bits are 1 -- this roughly corresponds to an NOR  operation

Examples:

label:-++--:all - true if the label is 01100

label:00:all    - always false
label:11:all    - always true
label:--:all    - label must be 00
label:++:all    - label must be 11

label:00:any    - always false
label:11:any    - always true
label:--:any    - label cannot be 11
label:++:any    - label cannot be 00

label:00:none   - always true
label:11:none   - always false
label:--:none   - label must be 11
label:++:none   - label must be 00

label:0++:any   - label cannot be 000 or 100.

The tests in the table at the start of this section can be implemented as follows.

Case 0

For testing equality, convert the constant c into a string of +’s (for 1 bits) and -‘s (for 0 bits) then check that all bits are set.

For testing non-equality, invert the conversion (1 -> - and 0 -> +) then check that any bit is set. Proof: if the label is the same as the constant, 1 bits in the label will be inverted (to 0) and 0 bits in the label will be output true (so also 0) resulting in a modified label of all 0’s. If a 1 bit in the label corresponds to a 0 bit in the constant, it will be passed true (a 1 in the modified label), similarly, a if a 0 bit in the label corresponds to a 1 bit in the constant it will be passed inverted (a 1 in the modified label), either of which will make the modified label not equal to zero.

Case 1

Both of these test that every bit set in L is also set in C, but C can have extra bits set.

Convert the constant c into a string of 0’s (for 1 bits) and +’s (for 0 bits), then check that no bit is set.

Where C is a 1, we don’t care what L is; ‘l&c == l’ is true for both l=0 and l=1. By forcing these bits to 0 in the modified string, they will never result in the check failing.

Where C is a 0, howeveer, L must be 0. Hence, passing the L bit true will result in a 1 bit output when L=1, which will cause the check to properly fail. When L=0, a 0 is outpout, and the check passes.

Case 2

The same as case 1, but change the check from “no bit is set” to “any bit is set”.

Cases 3 and 4

Both are similar to cases 1 and 2. Constant c is converted to +’s for 1’s and 1’s for 0’s, then the result is tested to check that all bits are set. Thus, wherever c is set, l must also be set, and wherever c is not set, we don’t care what l is.

Case 4 is a bit tricky, since we need to check that some bit is zero – and there is no defined operation for that (other than “not all bits are set”).

Cases 5 and 6

Case 5 is checking that L and C have no bits set in common, and case 6 is checking that they do.

For case 5, C is converted to +’s for 1’s and 0’s for 0’s, then check that no bit is set. For case 6, check that some bit is set.

Action Ordering

Meryl actions are greedy. As the command line is processed left to right, any inputs (either from databases or another action) will be assigned to the last encountered action. Square brackets (‘[’ and ‘]’) can be used to assign inputs to specific actions, by isolating actions and inputs.

Though not strictly required, square brackets can be used to disambiguate actions and inputs. Each nesting level can have at most one action and any number of inputs. The nesting level increases when either a new action or a left bracket (‘[’) is encountered, but decreases only when a right bracket (‘]’) is encountered.

Some examples will clarify. With no brackets, the command

meryl
  union
    intersect
      a.meryl
      b.meryl
    intersect
      c.meryl
      d.meryl

expands to

meryl
  [              #  Start new nesting level 0
  union          #  Level 0 action
    [            #  Level 0 input, start new nesting level 1
    intersect    #  Level 1 action
      a.meryl    #  Level 1 input
      a.meryl    #  Level 1 input
      [          #  Level 1 input, start new nesting level 2
      intersect  #  Level 2 action
        b.meryl  #  Level 2 input
        c.meryl  #  Level 2 input
      ]          #  Level 2 terminator
    ]            #  Level 1 terminator
  ]              #  Level 0 terminator

Note the ‘union’ action has only one input; the first ‘intersect’ action has three inputs, and the last ‘intersect’ has two inputs.

As written, the intent is clear but square brackets must be used to communicate this to meryl.

meryl
  union
    [
    intersect
       a.meryl
       b.meryl
    ]
    [
    intersect
      c.meryl
      d.meryl
    ]

This can be written on one line as:

meryl union [intersect a.meryl b.meryl] [intersect c.meryl d.meryl]

in particular, the square brackets will recognized even if they are not separated from the action or input with white-space.

Modify Action

The ‘modify’ action changes the value or label. This action operates on exactly one input; if more than one input is supplied, the action is rejected and meryl will return an error before any processing is performed.

Descriptions are largely the same as for present-in above, except there is no other value, so operations occur between the current (single) value and the constant.

Filter Action

The filter action accepts input from one source, and passes or discards kmers according to the ‘value’ and ‘label’ rules supplied.

If more than one input is supplied, the action is rejected and meryl will return an error.

Tests can be combined in a single filter action using a sum-of-products format. In a sum-of-products expression, AND takes precedence over OR. In a programming language, we’d write expressions such as \(((\mathbf{A}\ and\ \mathbf{B})\ or\ (\mathbf{C}\ and\ \mathbf{D}))\). In meryl, the \(\mathbf{A}\), \(\mathbf{B}\), et cetera, symbols are repaced with a value or label modifier, and the parentheses are omitted.

A simple expression, output rare or common kmers:

filter
   value:<#5 or value:>#1000

To output kmers with value between 3 and 9, or between 13 and 20, we’d write:

filter
  value:>#3 and value:<#9 or value:>#13 and value:<#20

The contrasting style, not supported in meryl, is called a product-of-sums expression. In this, OR takes precedence over AND. In a programming language, we’d write expressions such as \(((\mathbf{A})\ and\ (\mathbf{B}\ or\ \mathbf{C})\ and\ (\mathbf{D}))\). Since this form is a bit awkward to use, and since De Morgan’s laws allow conversion between the two, this form is, again, not supported in meryl.

Count Action

Described elsewhere??

Present-In Action

The ‘present-in’ action will emit a kmer based on the number of input files it is present in, its “input-count” value. If the kmer is not present in the specified number of input files, it is discarded. ‘present-in’ takes a comma-delimited list of integer ranges.

Assuming 9 input files, some examples are:

present-in:4,5,6  - output if present in 4 or 5 or 6 input files
present-in:4-6,8  - output if present in 4 or 5 or 6 or 8 input files

Because meryl gets its kmers from the inputs, as opposed to iterating over all possible kmers, every kmer that meryl processes will exist in at least one input; therefore, ‘present-in:0’ is never true; further ‘present-in:0-6’ and ‘present-in:1-6’ are exactly the same.

One additional flag is useful:

'first' - the kmer is present in the first input file

The ‘first’ flag is necessary to perform set difference operations:

present-in:1:first     - output if kmer exists only in the first input
                         (the kmer exists in one input AND
                          the kmer exists in the first input)

present-in:first:2-9   - output if kmer exists in the first file and
                         some other file
                         (and, we'll see later, subtract the counts
                          in the other files from the count in the
                          first file)

Several synonyms exist:

present-in:n       - the number of input files (9 in the example above)
present-in:all     - the number of input files
present-in:any     - equivalent to 'present-in:1-all'
present-in:only    - equivalent to 'in:1:first'

Aliases exist to support common operations. An alias sets the ‘present-in’, ‘value’ and ‘label’ options and so these are not allowed to be used with aliases.

union          - present-in:any     value=sum label=or
intersect      - present-in:all     value=min label=and
subtract       - present-in:first   value=sub label=sub

difference     - present-in:first   value=sub label=sub

union-min      - present-in:any     value=min label=first  <- label?
union-max      - present-in:any     value=max label=first  <- label?
union-sum      - present-in:any     value=sum label=or

intersect-min  - present-in:all    value=min label=first   <- label?
intersect-max  - present-in:all    value=max label=first   <- label?
intersect-sum  - present-in:all    value=sum label=or

disjoint       - present-in:1
  (former symmetric-difference)

EXAMPLES

Generate Counts

Generate Counts in Parallel

Extract Counts

Merqury

Assign Labels to Sources

Assign labels to kmers in an assembly based on the source database they are present in. Each kmer will have a label:

00b  if the kmer appears only in the assembly
01b  if the kmer appears in db1
10b  if the kmer appears in db2
11b  if the kmer appears in both db1 and db1
meryl \
  output final.meryl \
  modify label:and#011b \
  filter label:and#100b \
  union \
    [modify label:#100b asm.meryl] \
    [modify label:#001b db1.meryl] \
    [modify label:#010b db2.meryl]

Note that the ‘modify label:and#011b’ and ‘filter label:and#100b’ actions are completely different, even though they look very similar. The ‘modify’ action will change a label L to ‘L AND 011’ (that is, retain only the rightmost two bits in the label), while the ‘filter’ action will pass only kmers that have a non-zero value for ‘L AND 100’ (that is, that have the third bit set in the label).

An alternate method could intersect the two databases with the assembly kmers first, then assign label. This method does not, however, report kmers that exist only in the assembly. Note that the ‘union’ action will bitwise ‘OR’ the labels together by default.

meryl \
  output final.meryl \
  union \
    [modify label:#01b intersect asm.meryl db1.meryl] \
    [modify label:#10b intersect asm.meryl db2.meryl]

Parameter Reference

Databases

A meryl database stores kmers, values and labels in a compressed binary format.

The value of a kmer is typically the number of times – its count – it occurs in some input sequence.

The label of a kmer is a 64-bit binary string assigned to each kmer by the user. Labels are envisioned to be most useful as a collection of yes/no or true/false flags, for example, “kmer occurred in file 1”. Labels are not fully implemented yet and will appear in a future version.

The database is stored as 64 independent files, each file storing some subset of kmers. Each file can be processed independently, allowing meryl to use up to 64 threads.

Details: Each kmer is stored by breaking the binary represntation into three pieces: a file prefix, a block prefix, and a suffix. A k-mer needs 2*k bits to represent it. The file prefix is 6 bits wide, representing one of the 64 possible files in a database. Inside a file, kmers are stored in blocks, each kmer in a block will have the same block prefix. The suffix data for a block is stored using Elias-Fano encoding (CITE) where each suffix is split into two pieces. The first piece is encoded as a unary offset from the last piece, and the second piece is a N-log2(N)+1 binary value. At present, values are stored as plain 32-bit integers, stored immediately after the kmer data.

Actions

Meryl processing is built around actions. An action reads kmers (or sequence) from one or many inputs and writes kmers to a single output. There are four basic actions:

count

create a meryl database from FASTA or FASTQ inputs

filter

select kmers based on their value of label from a single input

modify

change kmer values or labels in a single input

present-in

combine multiple inputs into one output

Numerous aliases exist for common operations.

Count

The count action reads sequence from input files and converts to a list of kmers and the number of times each kmer occurs in the input sequence.

Input sequences can be in either FASTA, FASTQ, raw bases, or if compiled with Canu support, in a Canu seqStore database. Sequence files can be gzip, bzip2 or xz compressed.

An output database must be supplied to all count actions. Kmers are written to both the output database and provided as input to other actions.

By default, count will count and store kmers in their canonical form. Actions count-forward and count-reverse can be used to instead process kmers as they occur in the sequence (count-forward) or as the occur in the reverse-complement sequence (count-reverse).

Unless derived from another input database, count actions must have a kmer size supplied on the command line with the -k option (or via the deprecated inline option k=<kmer-size>).

Counting is resource intense. By default, meryl will use all available resources on the host machine, or the limits as supplied by a grid manager such as Slurm, PBS or SGE. Lower limits can be specified with options -m and -t (or with deprecated inline options memory=<max-memory-to-use-in-gigabytes> and threads=<max-number-of-cpus-to-use>).

Homopolymer-compressed kmers can be counted by supplying the compress flag to the count action.

% meryl count output kmers.meryl compress sequence.fasta

Counting Algorithms

Two algorithms are used for counting kmers. The algorithm that is expected to use the least memory is used. The choice depends mostly on kmer size, but a little bit on the size of the input sequences.

Small Kmers

For small kmers (at most 16) meryl counts kmers directly by associating an integer count with each possible kmer. This has the benefit of being simple and uses a constant amount of memory regardless of the size of the input, but quickly exhausts memory for even moderate kmer sizes.

There are 4k kmers of size k; for k=16, there are 4,294,967,296 possible kmers. At a minimum, counting 16-mers with this method will use 8 GB of memory, independent of input size. As counts increase, memory is added in chunks of 512 MB.

Details: Each integer counter is initially a 16-bit value. Once any count exceeds 216 = 65,535 another bit is added to all value, resulting in 17-bit values for every kmer. Once any count then exceeds 217 = 131,072, another bit is added, and so on.

This method uses only a single thread to read the input sequence and increment counters in the array, but multiple threads can be used to generate the output database.

Large Kmers

For large kmers (generally, larger than 16), meryl converts an input sequence to a list of all kmers, duplicates included, in it. The list of kmers is sorted. It is now trivial to scan the list, counting how many times each kmer is present, and immediately writing the kmer and its value to the output database.

If all kmers in an input sequence do not fit in memory, a partial result is written to disk. After all input sequences have been processed, the partial results are combined into a single output database. In practice, this method requires several gigabytes of memory to minimize the overhead of writing and merging partial results.

This algorithm can use multiple threads for every stage.

Details: Each kmer is split into a prefix and a suffix. The prefix is used to select a list to which the suffix is added. A trade off is made between a small prefix (resulting in few lists that store large suffixes) and a large prefix (resulting in many lists where the overhead of each list could use more space than the lists themselves). When the (approximate) size of all lists exceeds a user-supplied threshold, each list is sorted, the suffixes are counted, and output to an intermediate database. After all kmers are processed, the intermediate databases are merged into one.

Filter

Modify

Present-In

COMMAND LINE OPTIONS

Global Options

Global options apply to every action.

Option

Description

-h

show command line usage

-help

--help

help

-V

increase verbosity of algorithms

-Q

run quietly, don’t report any algorithm chatter

-P

show progress of algorithms (mostly not implemented)

-C

only configure the computation, don’t do any processing

-k <bases>

set the kmer size, necessary for counting actions

-l <bits>

set the label size

-m

set maximum memory usage in gigabytes

-memory

--memory

-t

set the number of CPUs to use

-threads

--threads

Obsolete forms of some of these are still allowed. The kmer size could be set with k=<kmer-size>, memory limits with memory=<memory-in-gigabytes>, thread usage with threads=<thread-count>.

Obsolete option -E has been removed. This used to estimate the size of an imput that could be counted in some given memory size.

Deprecated Options

n=

d= distinct= f= word-frequency= t= threshold=

segment=

DEBUG ACTIONS

dumpIndex <meryl-database>

Report the parameters used for storing data in this meryl database.

dumpFile <meryl-database-file>

Dump the raw data from a merylData file in a meryl database.

Software Background

Meryl was first imagined in late-2000/early-2001/mid-2001 while BPW as at Celera Genomics.

A 3 June 2001 journal entry is titled “merConting the genome”, but makes no mention of motivation for doing so. The same journal has an 18 June entry outlining the algorithm, and this entry also strongly hints the application was to build a fast 20-mer to assembly coordinate lookup table. The “Meryl” name is mentioned on 16 August 2002, desiring to increment/decrement the counts. There is even source code dating back to 12 February 2001 deep in the dusty archives; as that’s the oldest archive found, meryl is doubtless a little bit older yet. It was originally called “merMaid” or “merCounter”, the “Meryl” name first appears on 11 June 2001, and the code looks reasonably mature.

Meryl was definitely used twice in the assembly of Anopheles gambiae [Holt, et al. 2002], once to find k-mers that occur frequently and exclude them from seeding overlaps (which is how it is still used in Canu [Koren and Walenz 2017]) and once to estimate repeat content of the assembly. It is mentioned, anonymously, the paper:

By counting the number of times each 20-nucleotide oligomer in the Anopheles and Drosophila assemblies appeared in its corre- sponding whole-genome shotgun data, we con- firmed that simple repeats are not expanded in Anopheles

The supplement elaborates a bit:

METHODS FOR ESTIMATING REPEAT CONTENT:

The consensus sequence of scaffolded contigs was analyzed for repeat content as follows. For each 20mer of the consensus, we counted the number of times that that 20mer appeared in the set of approximately 4.5 million sequence reads. Out of 262.8 M consensus 20mers, 26.3M (10.0%) occurred more than 50 times (5 times what is expected for unique sequence, given 10- fold sequence coverage). For comparison, of 133.5M consensus 20mers from a recent reassembly of the D. melanogaster genome that was done using the same version of the assembly software that was used for mosquito , 13.0M (9.7%) were observed more than 60 times (5 times more than expected, given 12-fold sequence coverage of the Drosophila genome). Using a count cutoff of 1.5 times expected or 50 times expected gives a similar picture: the repetative fractions of the A. gambiae and D. melanogaster genomes are not strikingly different.

Unfortunately, the earliest versions of the software are not online, likely because they were never stored in a version control system. The earliest online version is found in the initial commit of the kmer subversion repository on 2 January 2003. The main function in meryl.C shows it supporting the “binary and mathematical” operations, including “increment/decrement” from the 16 August 2002 journal entry.

The Celera Assembler subversion repository also has a relatively early version. CA was publicly released to SourceForge on 14 April 2004 (revision 4 has the good stuff). Use of meryl is mentioned in wga.pl. The meryl code and a CA-specific function to access sequence data are nicely organized, the bulk of meryl in one file (AS_MER_meryl.cc).

The offline BPW archives contain several early versions, including one just nine days after the first journal entry mentioned above:

-rw-r--r--  1 bri  bri  5964 Jun 10  2001 ne-20010611-1812/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri   180 Jul  3  2001 ne-20010703-1747/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  1322 Jul  6  2001 ne-20010706-1813/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4132 Jul  9  2001 ne-20010709-1647/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4294 Jul 10  2001 ne-20010710-1842/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4675 Jul 11  2001 ne-20010712-1907/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4775 Jul 18  2001 ne-20010718-1745/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4706 Jul 23  2001 ne-20010723-1800/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4850 Aug 15  2001 ne-20010817-1800/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4988 Feb 27  2005 ne-20010831-1745/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  4988 Aug 30  2001 ne-20010912-1829/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  5130 Feb 27  2005 ne-20010928-1826/near-identity/meryl/meryl.C
-rw-r--r--  1 bri  bri  5707 Feb 27  2005 ne-20011109-1837/near-identity/meryl/meryl.C

(I’m not sure where the 2005 time stamps came from; I vaguely remember replacing duplicate files with symlinks before some of these directories were tar’d and compressed.)

Peeking into that first version, we find:

d:.../ne-20010611-1812/near-identity/meryl> ls -l
total 88
-rw-r--r--  1 bri  bri   2187 Jun 10  2001 1.C
-rw-r--r--  1 bri  bri   1232 Jun 10  2001 2.C
-rw-r--r--  1 bri  bri   1582 Jun 10  2001 3.C
-rw-r--r--  1 bri  bri   2496 Jun 10  2001 4.C
-rw-r--r--  1 bri  bri    538 Jun 11  2001 Makefile
-rwxr-xr-x  1 bri  bri  50432 Jun 10  2001 meryl
-rw-r--r--  1 bri  bri   5964 Jun 10  2001 meryl.C
-rw-r--r--  1 bri  bri   1725 Jun 10  2001 meryl.H

Is that a binary?

d:.../ne-20010611-1812/near-identity/meryl> file meryl
meryl: ELF 32-bit LSB executable, \
       Intel 80386, \
       version 1 (FreeBSD), \
       dynamically linked, \
       interpreter /usr/libexec/ld-elf.so.1, \
       for FreeBSD 4.1, \
       not stripped

The algorithm is quite simple, but the details are complicated: it simply builds a list of all the kmers in the input sequences, sorts, then counts how many times each kmer is in the list and write that to the output file.

The details are to split the kmer into a prefix and a suffix. The prefix points to a bucket into which all the suffixes are listed. Once all kmers are stored, each bucket is sorted then scanned to count the kmers. Bit-packed integers are used throughout to minimize memory usage.

In all it’s embarassing glory, here is the first ever version of meryl (omitting the functions that do the actual work).

#include "meryl.H"

//  theSeq is a compressed sequence.  Three bits per character, the first bit
//  telling us if the next two bits are valid sequence.
//
u64bit  *theSeq       = 0L;
u64bit   theSeqLen    = 0;
u64bit   numberOfMers = 0;

//  theCounts is a packed-word array of the number of mers
//  that all have the same TABLEBITS bits on the left side of
//  the mer.
//
//  theMers is a list of the other bits in the mers.
//
u64bit  *theCounts       = 0L;
u64bit   theCountsWords  = 0;
u64bit  *theMers         = 0L;
u64bit   theNumberOfMers = 0;

u64bit  approxMers = 1073741824;
u64bit  merSize    = 20;
u64bit  merMask    = u64bitMASK(merSize);

u64bit  tblBits = 20;
u64bit  tblMask = u64bitMASK(tblBits);

u64bit  chkBits = 0;
u64bit  chkMask = 0;

u64bit  bitsPerIndex = 0;

bool    beVerbose           = true;
bool    doReverseComplement = false;

char   *seqFileName = 0L;
FILE   *seqFile     = 0L;

void
usage(char *name) {
  fprintf(stderr, "usage: %s [-m merSize] [-v] [-r] [-f seqFile] [-s seqFileSize] [-h]\n", name);
  fprintf(stderr, "       -m   Sets the size of mer to count.  Must be less than 32.\n");
  fprintf(stderr, "       -v   Be noisy about it.\n");
  fprintf(stderr, "       -r   Also counts the reverse complement.\n");
  fprintf(stderr, "       -f   File to count mers in.  MultiFastA\n");
  fprintf(stderr, "       -s   Upper limit on the number of mers.\n");
  fprintf(stderr, "       -h   This.\n");
}


int
main(int argc, char **argv) {

  int   arg = 1;
  while (arg < argc) {
    switch (argv[arg][1]) {
      case 'v':
        beVerbose = true;
        break;
      case 'r':
        doReverseComplement = true;
        break;
      case 'f':
        arg++;
        seqFileName = argv[arg];
        break;
      case 'm':
        arg++;
        merSize = atol(argv[arg]);
        break;
      case 's':
        arg++;
        approxMers = atol(argv[arg]);
        break;
      case 'h':
        usage(argv[0]);
        exit(1);
        break;
      default:
        usage(argv[0]);
        fprintf(stderr, "Unknown option '%s'\n", argv[arg]);
        exit(1);
        break;
    }
    arg++;
  }

  if ((seqFileName == 0L) ||
      ((seqFileName[0] == '-') && (seqFileName[1] == 0)))
    seqFile = stdin;
  else
    seqFile = fopen(seqFileName, "r");
  if (seqFile == 0L) {
    fprintf(stderr, "Couldn't open the sequence file '%s'.\n", seqFileName);
    exit(1);
  }


  ////////////////////////////////////////////////////////////
  //
  //  Based on the approximate number of mers given, set the
  //  size of the table and number of bits per each table entry.
  //

  //  Not the greatest way to find the number of bits needed
  //  to encode approxMers, but easy
  //
  bitsPerIndex = 1;
  while (approxMers > u64bitMASK(bitsPerIndex))
    bitsPerIndex++;

  //  Determine the size of the table to reduce the memory footprint.
  //  This is computed backwards, so that we pick the best smallest
  //  table size.
  //
  u64bit   bestMem = ~u64bitZERO;
  u64bit   bestSiz =  u64bitZERO;
  u64bit   mem;

  for (u64bit siz=32; siz>18; siz-=2) {
    mem   = bitsPerIndex * (u64bitONE << siz) + approxMers * (2*merSize - siz);
    mem >>= 23;

    if (mem < bestMem) {
      bestMem = mem;
      bestSiz = siz;
    }
  }

  tblBits = bestSiz;
  tblMask = u64bitMASK(tblBits);

  chkBits = 2 * merSize - tblBits;
  chkMask = u64bitMASK(chkBits);


  fprintf(stderr, "You told me there will be no more than %lu mers.\n", approxMers);
  fprintf(stderr, "Using %2lu bits to encode positions.\n", bitsPerIndex);
  fprintf(stderr, "Using %2lu bits of the mer for the count.\n", tblBits);
  fprintf(stderr, "Using %2lu bits of the mer for the check.\n", chkBits);





  ////////////////////////////////////////////////////////////
  //
  //  Allocate and clear the counts
  //
  theCountsWords = bitsPerIndex * (u64bitONE << tblBits) / 64 + 2;

  if (beVerbose) {
    fprintf(stderr, "Allocating and clearing the counting table.\n");
    fprintf(stderr, "  %lu entries at %lu bits == %lu MB.\n",
            u64bitONE << tblBits,
            bitsPerIndex,
            theCountsWords * 8 >> 20);
  }

  theCounts      = new u64bit [ theCountsWords ];
  for (u64bit i=theCountsWords; i--; )
    theCounts[i] = u64bitZERO;


  //////////////////////////////////////////////////////////////
  //
  //
  theSeqLen    = 0;
  numberOfMers = 0;
  theSeq       = new u64bit [ 3 * (u64bitONE << bitsPerIndex) / 64 + 1 ];

  if (beVerbose)
    fprintf(stderr, "Computing the bucket sizes (pass one through the sequence).\n");
  readSequenceAndComputeBucketSize();

  if (theNumberOfMers > approxMers) {
    fprintf(stderr, "Sorry.  You need to increase your estimate of the number of mers!\n");
    fprintf(stderr, "I found %lu mers.\n", theNumberOfMers);
    exit(1);
  }


  //////////////////////////////////////////////////////////////
  //
  //
  if (beVerbose)
    fprintf(stderr, "Converting counts to offsets (%lu bits or %lu Mbuckets).\n",
            tblBits,
            (u64bitONE << tblBits) >> 20);
  convertCounts();


  //////////////////////////////////////////////////////////////
  //
  //  Allocate theMers, but don't clear them.
  //
  if (beVerbose) {
    fprintf(stderr, "Allocating space for the mers.\n");
    fprintf(stderr, "  %lu mers at %lu bits each == %lu MB.\n",
            theNumberOfMers,
            chkBits,
            chkBits * theNumberOfMers >> 23);
  }
  theMers = new u64bit [ ((chkBits * theNumberOfMers) >> 6) + 1 ];


  //////////////////////////////////////////////////////////////
  //
  //
  if (beVerbose)
    fprintf(stderr, "Filling the buckets with mers (pass two through the sequence).\n");
  fill();


  //////////////////////////////////////////////////////////////
  //
  //  Sort each bucket
  //
  if (beVerbose)
    fprintf(stderr, "Sorting buckets (%lu total).\n", u64bitONE << tblBits);
  sort();


  //
  //  Yeah, I didn't close the stupid input file.  So what?
  //
}

The first code block reads input sequence from disk, converts each base into a 3-bit code, the counts the size of each bucket.

#include "meryl.H"

void
readSequenceAndComputeBucketSize(void) {
  double  startTime = getTime() - 0.1;
  u64bit  count     = 0;

  theNumberOfMers = u64bitZERO;

  s32bit  timeUntilValid      = 0;
  u64bit  substring           = u64bitZERO;
  u64bit  partialEncoding     = u64bitZERO;
  u32bit  partialEncodingSize = u32bitZERO;

  for (unsigned char ch=nextCharacter(seqFile); ch != 255; ch=nextCharacter(seqFile)) {
    if (ch > 127) {
      timeUntilValid = merSize;
      ch = 0x07;
    } else {
      ch = compressSymbol[ch];

      substring <<= 2;
      substring  |= ch;
      timeUntilValid--;

      if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) {
        fprintf(stderr, " %4ld Mbases -- %8.5f Mb per second\r",
                count >> 20,
                count / (getTime() - startTime) / (1000000.0));
        fflush(stderr);
      }

      if (timeUntilValid <= 0) {
        timeUntilValid = 0;

        theNumberOfMers++;

        u64bit I = ((substring >> chkBits) & tblMask) * bitsPerIndex;

        preIncrementDecodedValue(theCounts,
                                 I >> 6,
                                 I & 0x000000000000003f,
                                 bitsPerIndex);
      }
    }

    //  Place the character into the encoded sequence.  If the character was
    //  not a valid base, ch will have the "I'm a crappy character" flag set,
    //  otherwise, ch is a valid base encoding.
    //
    partialEncoding <<= 3;
    partialEncoding  |= ch;

    partialEncodingSize++;

    //  We can fit 21 bases (at 3 bits per base, 63 bits) per 64 bit word
    //  without going to any trouble.
    //
    if (partialEncodingSize == 21) {
      theSeq[theSeqLen++] = partialEncoding;
      theSeq[theSeqLen]   = ~u64bitZERO;

      partialEncoding     = u64bitZERO;
      partialEncodingSize = u32bitZERO;
    }
  }

  //  Push on the last partialEncoding, if one exists.
  //
  if (partialEncodingSize > 0) {
    while (partialEncodingSize != 21) {
      partialEncoding <<= 3;
      partialEncoding  |= 0x07;
      partialEncodingSize++;
    }

    theSeq[theSeqLen++] = partialEncoding;
  }

  if (beVerbose)
    fprintf(stderr, "\n");
}

The second block of code converts the bucket sizes into an offset to the start of each bucket.

#include "meryl.H"

void
convertCounts(void) {
  double  startTime = getTime() - 0.1;
  u32bit  count     = 0;

  //  Convert the counts into offsets into theMers
  //
  //  We use the trick pointed out by Liliana of setting the offset to the LAST
  //  entry in the bucket, then decrementing when filling.  This also makes
  //  the code below easier -- we don't need to store the size of the bucket
  //  so we can increment the sum after we set the bucket offset.
  //

  u64bit   i = 0;
  u64bit   m = (u64bitONE << tblBits) + 1;
  u64bit   s = 0;
  while (i < m) {

    if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) {
      fprintf(stderr, " %4ld Mbuckets -- %8.5f Mbuckets per second\r",
              count >> 20,
              count / (getTime() - startTime) / (1000000.0));
      fflush(stderr);
    }

    u64bit I = i * bitsPerIndex;

    s = sumDecodedValue(theCounts,
                        I >> 6,
                        I & 0x000000000000003f,
                        bitsPerIndex,
                        s);

    i++;
  }

  if (s != theNumberOfMers) {
    fprintf(stderr, "internal error in stage 2: s != theNumberOfMers.\n");
    exit(1);
  }

  if (beVerbose)
    fprintf(stderr, "\n");
}

The third block of code makes a second pass through all the kmers in the sequence, adding suffixes to the bucket indicated by the prefix.

#include "meryl.H"

void
fill(void) {
  double  startTime = getTime() - 0.1;
  u32bit  count     = 0;

  s32bit  timeUntilValid      = 0;
  u64bit  substring           = u64bitZERO;

  //  Stream again, filling theMers
  //

  for (u64bit i=0; i<theSeqLen; i++) {
    u64bit  t = theSeq[i];
    u64bit  I;
    u64bit  J;
    u64bit  ch;

    for (u32bit j=0; j<21; j++) {
      ch   = t;
      ch >>= 60;
      ch  &= 0x07;

#if 0
      ch   = t & 0x7000000000000000;
      ch >>= 60;
#endif

      t <<= 3;

      if (ch & 0x04) {
        timeUntilValid = merSize;
      } else {
        substring <<= 2;
        substring  |= ch;
        timeUntilValid--;

        if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) {
          fprintf(stderr, " %4ld Mbases -- %8.5f Mb per second\r",
                  count >> 20,
                  count / (getTime() - startTime) / (1000000.0));
          fflush(stderr);
        }

        if (timeUntilValid <= 0) {
          timeUntilValid = 0;

          I = ((substring >> chkBits) & tblMask) * bitsPerIndex;

          J = chkBits * preDecrementDecodedValue(theCounts,
                                                 I >> 6,
                                                 I & 0x000000000000003f,
                                                 bitsPerIndex);

          setDecodedValue(theMers,
                          J >> 6,
                          J & 0x3f,
                          chkBits,
                          substring & chkMask);
        }
      }
    }
  }
  if (beVerbose)
    fprintf(stderr, "\n");
}

The fourth block of code sorts the kmers and writes output. It is using its own implementation of heap sort <https://dl.acm.org/doi/10.1145/512274.512284>_, likely beacuse the C sort() implementation is not in place. Output is written to stdout.

#include "meryl.H"

typedef u64bit heapbit;

void
adjustHeap(heapbit *M, s64bit i, s64bit n) {
  heapbit     m = M[i];

  s64bit     j = (i << 1) + 1;  //  let j be the left child

  while (j < n) {

    if (j<n-1 && M[j] < M[j+1])
      j++;       //  j is the larger child

    if (m >= M[j])   //  a position for M[i] has been found
      break;

    //  Move larger child up a level
    //
    M[(j-1)/2]      = M[j];

    j = (j << 1) + 1;
  }

  M[(j-1)/2]      = m;
}


void
sortBucket(u64bit b) {
  u64bit  st = getCount(b);
  u64bit  ed = getCount(b+1);
  u64bit  sz = ed - st;

  if (sz == 0)
    return;

  heapbit *a = new heapbit [sz + 1];

  for (u64bit i=st; i<ed; i++) {
    u64bit J = i * chkBits;

    a[i-st] = getDecodedValue(theMers,
                              J >> 6,
                              J & 0x3f,
                              chkBits);
  }

  u64bit  substring;
  char    mer[merSize+1];
  u64bit  count     = u64bitONE;
  u32bit  i;
  u32bit  m;

  //  Sort if there is more than one item
  //
  if (sz > 1) {
    s64bit  i;

    //  Create the heap of lines.
    //
    for (i=(sz-2)/2; i>=0; i--)
      adjustHeap(a, i, sz);

    //  Interchange the new maximum with the element at the end of the tree
    //
    for (i=sz-1; i>0; i--) {
      heapbit m  = a[i];
      a[i]       = a[0];
      a[0]       = m;

      adjustHeap(a, 0, i);
    }
  }

  for (i=1; i<sz; i++) {
    if (a[i] == a[i-1]) {
      count++;
    } else {
      if (count > 100) {
        substring = b << chkBits | a[i-1];
        for (m=0; m<merSize; m++)
          mer[merSize-m-1] = decompressSymbol[(substring >> (2*m)) & 0x03];
        mer[m] = 0;
        printf("%s %lu\n", mer, count);
      }
      count = 1;
    }
  }

  //  Output the last one
  //
  if (count > 100) {
    substring = b << chkBits | a[i-1];
    for (m=0; m<merSize; m++)
      mer[merSize-m-1] = decompressSymbol[(substring >> (2*m)) & 0x03];
    mer[m] = 0;
    printf("%s %lu\n", mer, count);
  }

  delete [] a;
}


void
sort(void) {
  double  startTime = getTime() - 0.1;
  u32bit  count     = 0;
  u64bit  m         = u64bitONE << tblBits;

  for (u64bit i=0; i<m; i++) {
    if (beVerbose && ((++count & (u64bit)0x3fff) == u64bitZERO)) {
      fprintf(stderr, "             %4lu buckets -- %8.5f buckets per second\r",
              count,
              count / (getTime() - startTime));
      fflush(stderr);
    }

    sortBucket(i);
  }
  if (beVerbose)
    fprintf(stderr, "\n");
}

Meryl is a tool for working with sets of kmers. A set of kmers, when annotated with an integer value (typically the number of times the kmer occurs in a set of sequences) and optionally a label, is termed a database.

Meryl comprises three (modules), one for generating kmer databases, one for filtering and combining databases, and one for searching databases. A simple, but yet to be documented, C++ API to access kmer databases exists.

Publications

Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).

Install

The easiest way to get started is to download a release. If you encounter any issues, please report them using the github issues page.

Alternatively, you can also build the latest unreleased from github:

git clone https://github.com/marbl/meryl.git
cd meryl/src
make -j <number of threads>

Learn