Quick Start

To give a quick introduction to meryl2, we’ll use three random Escherichia coli assemblies pulled from GenBank. They’re not quite random; they’re flagged as complete and have the highest number of scaffolds - which, we hope, means they have a complete nuclear genome and a few handfuls of plasmids.

We’ll use EC931, SCEC020022 and MSB1_4I-sc-2280412, but give them shorter names for convenience.

Downloading data.
1curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/522/225/GCA_014522225.1_ASM1452222v1/GCA_014522225.1_ASM1452222v1_genomic.fna.gz
2curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/165/095/GCA_002165095.2_ASM216509v2/GCA_002165095.2_ASM216509v2_genomic.fna.gz
3curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/071/835/GCA_905071835.1_MSB1_4I/GCA_905071835.1_MSB1_4I_genomic.fna.gz
4
5mkdir data
6
7mv -i GCA_014522225.1_ASM1452222v1_genomic.fna.gz data/ec.fna.gz
8mv -i GCA_002165095.2_ASM216509v2_genomic.fna.gz  data/sc.fna.gz
9mv -i GCA_905071835.1_MSB1_4I_genomic.fna.gz      data/ms.fna.gz

First, lets count the kmers in each and save the results in meryl databases.

Counting k-mers in a single file.
 1% meryl2 -k 42 count data/ec.fna.gz output=ec.meryl
 2
 3Found 1 command tree.
 4
 5|- TREE 0: action #1 count kmers
 6   |> database 'ec.meryl'
 7   |- SET value to that of the kmer in the first input
 8   |- SET label to first -- constant 0
 9   ^- INPUT @1: sequence file 'data/ec.fna.gz'
10|- TREE 0 ends.
11
12Counting 4475 (estimated) thousand canonical 42-mers from 1 input file:
13    sequence-file: data/ec.fna.gz
14
15[...]

We can show the k-mers in the database by printing the output to the screen (or to a file with print=42-mers.txt).

Displaying k-mers.
 1% meryl2 -t 1 -Q print ec.meryl | head
 2
 3AAAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCAC      1
 4AAAAAAAAAACTTCCACCAATATGATGGGTGGCGTACAGTAA      1
 5AAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCACC      1
 6AAAAAAAAAATAGACAAAAAATACTTTATCAAAACATACATA      1
 7AAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCTG      1
 8AAAAAAAAACCCGATTTTATCAGGCGTCAACCTCTGAATTGT      1
 9AAAAAAAAACCCGCTGATTAAGCGGGTTTTGAATTCTTGCTG      1
10AAAAAAAAACCTGAAAAAACGGCCTGACGTGAATCAAGCAAT      1
11AAAAAAAAACCTGCCATCGCTGGCAGGTTTTTTATGACTAAA      1
12AAAAAAAAACCGACCAAGGGTCGGGGCAAGAATCAGAGTCTG      1

(Messy detail: Option -Q turns off the report of the command tree. Option -t 1 is supplied to prevent meryl from processing its 64 data chunks in parallel; for the print operation this results in the output kmers being unsorted. See {INSERT LINK TO THREADS AND PRINT HERE}.)

Let’s now count the other two databases and combine the kmers from all three genomes into one database, all in one command.

Counting two FASTA files and merging with a third database.
 1% meryl2 \
 2    union-sum \
 3    output=union.meryl \
 4    [count data/sc.fna.gz output=sc.meryl] \
 5    [count output=ms.meryl \
 6           data/ms.fna.gz] \
 7    ec.meryl
 8
 9Found 1 command tree.
10
11|- TREE 0: action #1 filter kmers
12   |> database 'union.meryl'
13   |- SET value to the sum of all kmers and constant 0
14   |- SET label to or -- constant 0
15   |- FILTER 1
16      |- EMIT if <index filter not described>
17   ^- INPUT @1: action #2 count kmers
18      |> database 'sc.meryl'
19      |- SET value to that of the kmer in the first input
20      |- SET label to first -- constant 0
21      ^- INPUT @1: sequence file 'data/sc.fna.gz'
22   ^- INPUT @2: action #3 count kmers
23      |> database 'ms.meryl'
24      |- SET value to that of the kmer in the first input
25      |- SET label to first -- constant 0
26      ^- INPUT @1: sequence file 'data/ms.fna.gz'
27   ^- INPUT @3: meryl database 'ec.meryl'
28|- TREE 0 ends.
29
30[...]

There’s a lot going on here. There are three actions (lines 11, 17 and 22), three output files (lines 12, 18 and 23), and three inputs (lines 17, 22, and 27). Each action decribes how to combine the k-mers in its inputs into an output k-mer, then passes those k-mers to its destination actions. Here, action #1 takes input k-mers from INPUT @1 (which itself is action #2), INPUT @2 (action #3) and INPUT @3 (which is a pre-computed database of k-mers).

The union-sum action in the command line is action #1 in the tree. Lines 13 and 14 describe how this action is computing the output value and label of each k-mer. Line 16 will {EVENTUALLY} describe the conditions that must be met for a k-mer to be output. For union, there is only one condition: the k-mer must be in at least one input.

The end result of this is to independently count kmers in sc.fna.gz and ms.fna.gz, writing the kmers from each to outpout databases {SEE COUNTING} sc.meryl and ms.meryl, respectively. When the counting operations are done, those two new databases and the third pre-computed database are sent as input to the first action which will combine all k-mers, summing their values, into one output.

Notice that the -k 42 option is not present. Meryl will determine the k-mer size from the ec.meryl input database and use that for all operations. However, if a k-mer size is supplied, it must match the size of ALL input databases, and ALL input databases must have k-mers of the same size.

A meryl database also stores the histogram of kmer values. This can be displayed:

A k-mer value histogram.
 1% meryl2 -Q histogram ec.meryl
 2
 31       4911809
 42       37336
 53       7632
 64       1217
 75       2705
 86       2232
 97       4544
108       384
119       862
1211      3
1312      4
1415      967
1516      230
1618      1
1719      1
1821      81
1922      3
2027      39
2129      4
2230      21
2331      19
2432      5
2537      39
2648      3
2749      42
2850      38
2951      15
3052      493
3153      13

Which hints there is a 52 copy repeat of around 500 bases in Escherichia coli EC931. Histograms from the other two genomes show either no high copy repeat (Escherichia coli SCEC020022, sc.meryl) or a potential 64 copy repeat (Escherichia coli MSB1_4I-sc-2280412, ms.meryl). Let’s now extract those kmers and see where they are on the genomes.

Extracting high-value k-mers.
 1% meryl2 print=ec-repeats.dump at-least 48 ec.meryl output=ec-repeats.meryl
 2
 3Found 1 command tree.
 4
 5|- TREE 0: action #1 filter kmers
 6   |> database 'ec-repeats.meryl'
 7   |> text file 'ec-repeats.dump'
 8   |- SET value to that of the kmer in the first input
 9   |- SET label to first -- constant 0
10   |- FILTER 1
11      |- EMIT if output kmer value     is-more-or-equal constant value 48
12   ^- INPUT @1: meryl database 'ec.meryl'
13|- TREE 0 ends.
14
15[...]
16
17% wc -l ec-repeats.dump
18     604 ec-repeats.dump

The meryl-lookup tool compares FASTA/FASTQ sequences against a meryl database (or several databases) and generates various reports about how the k-mers in the database(s) “paint” onto the input sequences. We’ll use it to generate a bed file of the bases covered by kmers in our E.coli database.

Finding runs of high-value k-mers in a genome.
 1% meryl2-lookup -sequence data/ec.fna.gz -mers ec-repeats.meryl -bed-runs > ec-repeats.bed
 2--
 3-- Estimating memory usage for 'ec-repeats.meryl'.
 4--
 5
 6 p       prefixes             bits gigabytes (allowed: 63 GB)
 7-- -------------- ---------------- ---------
 8 2              4            53408     0.000
 9 3              8            53060     0.000
10 4             16            52968     0.000
11 5             32            53388     0.000
12 6             64            54832     0.000 (smallest)
13 7            128            58324     0.000
14 8            256            65912     0.000
15 9            512            81692     0.000 (faster)
1610           1024           113856     0.000
1711           2048           178788     0.000
1812           4096           309256     0.000
1913           8192           570796     0.000
20-- -------------- ---------------- ---------
21              604 total kmers
22
23--
24-- Minimal memory needed: 0.000 GB
25-- Optimal memory needed: 0.000 GB  enabled
26-- Memory limit           63.907 GB
27--
28--
29-- Loading kmers from 'ec-repeats.meryl' into lookup table.
30--
31
32For 604 distinct 42-mers (with 9 bits used for indexing and 75 bits for tags):
33    0.000 GB memory for kmer indices -          512 elements 64 bits wide)
34    0.000 GB memory for kmer tags    -          604 elements 75 bits wide)
35    0.000 GB memory for kmer values  -          604 elements  6 bits wide)
36    0.000 GB memory
37
38Will load 604 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
39Allocating space for 16732 suffixes of 75 bits each -> 1254900 bits (0.000 GB) in blocks of 32.000 MB
40                     16732 values   of 6 bits each -> 100392 bits (0.000 GB) in blocks of 32.000 MB
41Loaded 604 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
42-- Opening input sequences 'data/ec.fna.gz'.
43-- Opening output file '-'.
44Bye!
45
46% head ec-repeats.bed
47CP049118.1      46087   46370
48CP049118.1      46409   46729
49CP049118.1      46729   46856
50CP049118.1      140430  140592
51CP049118.1      140592  140713
52CP049118.1      140752  141072
53CP049118.1      141072  141199
54CP049118.1      270969  271131
55CP049118.1      271131  271252
56CP049118.1      271291  271611

From this we see that there is not a single 52-copy repeat, but several shorter repeats. Curiously, there are several instances of a 447 base repeat with single base differences:

Curiously similar repeats.
1CP049118.1      3782667 3782794
2CP049118.1      3782794 3783114
3
4CP049118.1      3762937 3763257
5CP049118.1      3763257 3763384

Though this isn’t really part of meryl, the high-count kmers can be passed to a greedy assembler with nice results (the greedy assembler is included in the meryl source code, but isn’t installed in the binary directory).

Greedily assembling high-value k-mers.
 1% perl $MERYL/scripts/greedy-assemble-kmers.pl < ec-repeats.dump
 2>1
 3AGCCTGTCATACGCGTAAAACAGCCAGCGCTGGCGCGATTTAGCCCCGACATAGCCCCACTGTTCGTCCATTTCCGC
 4GCAGACGATGACGTCACTGCCCGGCTGTATGCGCGAGGTTACCGACTGCGGCCTGAGTTTTTTAAGTGACGTAAAAT
 5CGTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCATCCAACGCCATTCATGGCCATATCAATGATTTTCT
 6GGTGCGTACCGGGTTGAGAAGCGGTGTAAGTGAACTGCAGTTGCCATGTTTTACGGCAGTGAGAGCAGAGATAGCGC
 7TGATGTCCGGC
 8>2
 9ATGGCGACGCTGGGGCGTCTTATGAGCCTGCTGTCACCCTTTGACGTGGTGATATGGATGACGGATGGCTGGCCGCT
10GTATGAATCCCGCCTGAAGGGAAAGCTGCACGTAATCAGCAAGCGATATACGCAGCGAATTGAGCGGCATAACCTGA
11ATCTGAGGCAGCACCTGGCACGGCTGGGACGGAAGTCGCTGTCGTTCTCAAAATCGGTGGAGCTGCATGACAAAGTC
12ATCGGGCATTATCTGAACATAAAACACTATCAATAAGTTGGAGTCATTACC
13>3
14GTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGAACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGC
15ACCTCAAAAACACCATCATACACTAAATCAGTAAGTTGGCAGCATCACC

Dropping the kmer threshold to 10 (at-least 10) and assembling those repeat k-mers finds 11 repeat sequences, one of length 770 bp and one of length 1195 bp.

For our final example, let’s find the high-value k-mers common to EC931 and MSB1_4I-sc-2280412 and assemble those.

Greedily assembling medium-and-high-value k-mers.
 1% meryl2 -Q \
 2    print \
 3    intersect \
 4      [at-least 48 ec.meryl] \
 5      [at-least 55 ms.meryl] \
 6  | \
 7  perl $MERYL/scripts/greedy-assemble-kmers.pl
 8>1
 9GGTAATGACTCCAACTTATTGATAGTGTTTTATGTTCAGATAATGCCCGATGACTTTGTCATGCAGCTCCACCGATT
10TTGAGAACGACAGCGACTTCCGTCCCAGCCGTGCCAGGTGCTGCCTCAGATTCAGGTTATGCCGCTCAATTCGCTGC
11GTATATCGCTTGCTGATTACGTGCAGCTTTCCCTTCAGGCGGGATTCATACAGCGGCCAGCCATCCGTCATCCATAT
12CACCACGTCAAAGGGTGACAGCAGGCTCATAAGACGCCCCAGCGTCGCCAT
13>2
14AGCCTGTCATACGCGTAAAACAGCCAGCGCTGGCGCGATTTAGCCCCGACATAGCCCCACTGTTCGTCCATTTCCGC
15GCAGACGATGACGTCACTGCCCGGCTGTATGCGCGAGGTTACCGACTGCGGCCTGAGTTTTTTAAGTGACGTAAAAT
16CGTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCATCCAACGCCATTCATGGCCATATCAATGATTTTCT
17GGTGCGTACCGGGTTGAGAAGCGGTGTAAGTGAACTGCAGTTGCCATGTTTTACGGCAGTGAGAGCAGAGATAGCGC
18TGATGTCCGGC
19>3
20GTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGAACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGC
21ACCTCAAAAACACCATCATACACTAAATCAGTAAGTTGGCAGCATCACC

Aside from a strand and sequence order difference caused by the assemlber, they’re the same!

The same!
1% minimap2 --eqx -Y -c ec.fasta ec+ms.fasta
21 282 0 282 - 2 282 0 282 282 282 60 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:49 s1:i:274 s2:i:0 de:f:0 rl:i:0 cg:Z:282=
32 319 0 319 + 1 319 0 319 319 319 60 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:58 s1:i:311 s2:i:0 de:f:0 rl:i:0 cg:Z:319=
43 126 0 126 + 3 126 0 126 126 126 60 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:16 s1:i:115 s2:i:0 de:f:0 rl:i:0 cg:Z:126=
But rearranged.
1% minimap2 -x sr --eqx -Y -c data/ec.fna.gz ec.fasta
21 319 0 319 + CP049118.1 4767973 2542965 2543284 319 319 0 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:44 s1:i:304 s2:i:304 de:f:0 rl:i:0 cg:Z:319=
32 282 0 282 - CP049118.1 4767973 3430238 3430520 282 282 0 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:39 s1:i:266 s2:i:266 de:f:0 rl:i:0 cg:Z:282=
43 126 0 126 - CP049118.1 4767973 4539117 4539243 126 126 0 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:19 s1:i:123 s2:i:123 de:f:0 rl:i:0 cg:Z:126=
5
6% minimap2 -x sr --eqx -Y -c data/ms.fna.gz ec.fasta
71 319 0 319 - LR898874.1 5278133   11725   12044 319 319 0 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:44 s1:i:304 s2:i:304 de:f:0 rl:i:0 cg:Z:319=
82 282 0 282 - LR898874.1 5278133 1418500 1418782 282 282 0 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:39 s1:i:266 s2:i:266 de:f:0 rl:i:0 cg:Z:282=
93 126 0 126 - LR898874.1 5278133 1510654 1510780 126 126 0 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:19 s1:i:123 s2:i:123 de:f:0 rl:i:0 cg:Z:126=

We’ll stop the quick-start here, before we barrel out of control into repeats.