Quick Start¶

To give a quick introduction to meryl2, we’ll use three random Escherichia coli assemblies pulled from GenBank. They’re not quite random; they’re flagged as complete and have the highest number of scaffolds - which, we hope, means they have a complete nuclear genome and a few handfuls of plasmids.

We’ll use EC931, SCEC020022 and MSB1_4I-sc-2280412, but give them shorter names for convenience.

Downloading data.¶

curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/522/225/GCA_014522225.1_ASM1452222v1/GCA_014522225.1_ASM1452222v1_genomic.fna.gz
curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/165/095/GCA_002165095.2_ASM216509v2/GCA_002165095.2_ASM216509v2_genomic.fna.gz
curl -LRO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/071/835/GCA_905071835.1_MSB1_4I/GCA_905071835.1_MSB1_4I_genomic.fna.gz

mkdir data

mv -i GCA_014522225.1_ASM1452222v1_genomic.fna.gz data/ec.fna.gz
mv -i GCA_002165095.2_ASM216509v2_genomic.fna.gz  data/sc.fna.gz
mv -i GCA_905071835.1_MSB1_4I_genomic.fna.gz      data/ms.fna.gz

First, lets count the kmers in each and save the results in meryl databases.

Counting k-mers in a single file.¶

% meryl2 -k 42 count data/ec.fna.gz output=ec.meryl

Found 1 command tree.

|- TREE 0: action #1 count kmers
   |> database 'ec.meryl'
   |- SET value to that of the kmer in the first input
   |- SET label to first -- constant 0
   ^- INPUT @1: sequence file 'data/ec.fna.gz'
|- TREE 0 ends.

Counting 4475 (estimated) thousand canonical 42-mers from 1 input file:
    sequence-file: data/ec.fna.gz

[...]

We can show the k-mers in the database by printing the output to the screen (or to a file with print=42-mers.txt).

Displaying k-mers.¶

% meryl2 -t 1 -Q print ec.meryl | head

AAAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCAC      1
AAAAAAAAAACTTCCACCAATATGATGGGTGGCGTACAGTAA      1
AAAAAAAAAACGGATCTGCATCACAACGAGAGTGATCCCACC      1
AAAAAAAAAATAGACAAAAAATACTTTATCAAAACATACATA      1
AAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCTG      1
AAAAAAAAACCCGATTTTATCAGGCGTCAACCTCTGAATTGT      1
AAAAAAAAACCCGCTGATTAAGCGGGTTTTGAATTCTTGCTG      1
AAAAAAAAACCTGAAAAAACGGCCTGACGTGAATCAAGCAAT      1
AAAAAAAAACCTGCCATCGCTGGCAGGTTTTTTATGACTAAA      1
AAAAAAAAACCGACCAAGGGTCGGGGCAAGAATCAGAGTCTG      1

(Messy detail: Option -Q turns off the report of the command tree. Option -t 1 is supplied to prevent meryl from processing its 64 data chunks in parallel; for the print operation this results in the output kmers being unsorted. See {INSERT LINK TO THREADS AND PRINT HERE}.)

Let’s now count the other two databases and combine the kmers from all three genomes into one database, all in one command.

Counting two FASTA files and merging with a third database.¶

% meryl2 \
    union-sum \
    output=union.meryl \
    [count data/sc.fna.gz output=sc.meryl] \
    [count output=ms.meryl \
           data/ms.fna.gz] \
    ec.meryl

Found 1 command tree.

|- TREE 0: action #1 filter kmers
   |> database 'union.meryl'
   |- SET value to the sum of all kmers and constant 0
   |- SET label to or -- constant 0
   |- FILTER 1
      |- EMIT if <index filter not described>
   ^- INPUT @1: action #2 count kmers
      |> database 'sc.meryl'
      |- SET value to that of the kmer in the first input
      |- SET label to first -- constant 0
      ^- INPUT @1: sequence file 'data/sc.fna.gz'
   ^- INPUT @2: action #3 count kmers
      |> database 'ms.meryl'
      |- SET value to that of the kmer in the first input
      |- SET label to first -- constant 0
      ^- INPUT @1: sequence file 'data/ms.fna.gz'
   ^- INPUT @3: meryl database 'ec.meryl'
|- TREE 0 ends.

[...]

There’s a lot going on here. There are three actions (lines 11, 17 and 22), three output files (lines 12, 18 and 23), and three inputs (lines 17, 22, and 27). Each action decribes how to combine the k-mers in its inputs into an output k-mer, then passes those k-mers to its destination actions. Here, action #1 takes input k-mers from INPUT @1 (which itself is action #2), INPUT @2 (action #3) and INPUT @3 (which is a pre-computed database of k-mers).

The union-sum action in the command line is action #1 in the tree. Lines 13 and 14 describe how this action is computing the output value and label of each k-mer. Line 16 will {EVENTUALLY} describe the conditions that must be met for a k-mer to be output. For union, there is only one condition: the k-mer must be in at least one input.

The end result of this is to independently count kmers in sc.fna.gz and ms.fna.gz, writing the kmers from each to outpout databases {SEE COUNTING} sc.meryl and ms.meryl, respectively. When the counting operations are done, those two new databases and the third pre-computed database are sent as input to the first action which will combine all k-mers, summing their values, into one output.

Notice that the -k 42 option is not present. Meryl will determine the k-mer size from the ec.meryl input database and use that for all operations. However, if a k-mer size is supplied, it must match the size of ALL input databases, and ALL input databases must have k-mers of the same size.

A meryl database also stores the histogram of kmer values. This can be displayed:

A k-mer value histogram.¶

% meryl2 -Q histogram ec.meryl

     4911809
     37336
     7632
     1217
     2705
     2232
     4544
     384
     862
    3
    4
    967
    230
    1
    1
    81
    3
    39
    4
    21
    19
    5
    39
    3
    42
    38
    15
    493
    13

Which hints there is a 52 copy repeat of around 500 bases in Escherichia coli EC931. Histograms from the other two genomes show either no high copy repeat (Escherichia coli SCEC020022, sc.meryl) or a potential 64 copy repeat (Escherichia coli MSB1_4I-sc-2280412, ms.meryl). Let’s now extract those kmers and see where they are on the genomes.

Extracting high-value k-mers.¶

% meryl2 print=ec-repeats.dump at-least 48 ec.meryl output=ec-repeats.meryl

Found 1 command tree.

|- TREE 0: action #1 filter kmers
   |> database 'ec-repeats.meryl'
   |> text file 'ec-repeats.dump'
   |- SET value to that of the kmer in the first input
   |- SET label to first -- constant 0
   |- FILTER 1
      |- EMIT if output kmer value     is-more-or-equal constant value 48
   ^- INPUT @1: meryl database 'ec.meryl'
|- TREE 0 ends.

[...]

% wc -l ec-repeats.dump
     604 ec-repeats.dump

The meryl-lookup tool compares FASTA/FASTQ sequences against a meryl database (or several databases) and generates various reports about how the k-mers in the database(s) “paint” onto the input sequences. We’ll use it to generate a bed file of the bases covered by kmers in our E.coli database.

Finding runs of high-value k-mers in a genome.¶

% meryl2-lookup -sequence data/ec.fna.gz -mers ec-repeats.meryl -bed-runs > ec-repeats.bed
--
-- Estimating memory usage for 'ec-repeats.meryl'.
--

 p       prefixes             bits gigabytes (allowed: 63 GB)
-- -------------- ---------------- ---------
 2              4            53408     0.000
 3              8            53060     0.000
 4             16            52968     0.000
 5             32            53388     0.000
 6             64            54832     0.000 (smallest)
 7            128            58324     0.000
 8            256            65912     0.000
 9            512            81692     0.000 (faster)
10           1024           113856     0.000
11           2048           178788     0.000
12           4096           309256     0.000
13           8192           570796     0.000
-- -------------- ---------------- ---------
              604 total kmers

--
-- Minimal memory needed: 0.000 GB
-- Optimal memory needed: 0.000 GB  enabled
-- Memory limit           63.907 GB
--
--
-- Loading kmers from 'ec-repeats.meryl' into lookup table.
--

For 604 distinct 42-mers (with 9 bits used for indexing and 75 bits for tags):
    0.000 GB memory for kmer indices -          512 elements 64 bits wide)
    0.000 GB memory for kmer tags    -          604 elements 75 bits wide)
    0.000 GB memory for kmer values  -          604 elements  6 bits wide)
    0.000 GB memory

Will load 604 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 16732 suffixes of 75 bits each -> 1254900 bits (0.000 GB) in blocks of 32.000 MB
                     16732 values   of 6 bits each -> 100392 bits (0.000 GB) in blocks of 32.000 MB
Loaded 604 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
-- Opening input sequences 'data/ec.fna.gz'.
-- Opening output file '-'.
Bye!

% head ec-repeats.bed
CP049118.1      46087   46370
CP049118.1      46409   46729
CP049118.1      46729   46856
CP049118.1      140430  140592
CP049118.1      140592  140713
CP049118.1      140752  141072
CP049118.1      141072  141199
CP049118.1      270969  271131
CP049118.1      271131  271252
CP049118.1      271291  271611

From this we see that there is not a single 52-copy repeat, but several shorter repeats. Curiously, there are several instances of a 447 base repeat with single base differences:

Curiously similar repeats.¶

CP049118.1      3782667 3782794
CP049118.1      3782794 3783114

CP049118.1      3762937 3763257
CP049118.1      3763257 3763384

Though this isn’t really part of meryl, the high-count kmers can be passed to a greedy assembler with nice results (the greedy assembler is included in the meryl source code, but isn’t installed in the binary directory).

Greedily assembling high-value k-mers.¶

% perl $MERYL/scripts/greedy-assemble-kmers.pl < ec-repeats.dump
>1
AGCCTGTCATACGCGTAAAACAGCCAGCGCTGGCGCGATTTAGCCCCGACATAGCCCCACTGTTCGTCCATTTCCGC
GCAGACGATGACGTCACTGCCCGGCTGTATGCGCGAGGTTACCGACTGCGGCCTGAGTTTTTTAAGTGACGTAAAAT
CGTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCATCCAACGCCATTCATGGCCATATCAATGATTTTCT
GGTGCGTACCGGGTTGAGAAGCGGTGTAAGTGAACTGCAGTTGCCATGTTTTACGGCAGTGAGAGCAGAGATAGCGC
TGATGTCCGGC
>2
ATGGCGACGCTGGGGCGTCTTATGAGCCTGCTGTCACCCTTTGACGTGGTGATATGGATGACGGATGGCTGGCCGCT
GTATGAATCCCGCCTGAAGGGAAAGCTGCACGTAATCAGCAAGCGATATACGCAGCGAATTGAGCGGCATAACCTGA
ATCTGAGGCAGCACCTGGCACGGCTGGGACGGAAGTCGCTGTCGTTCTCAAAATCGGTGGAGCTGCATGACAAAGTC
ATCGGGCATTATCTGAACATAAAACACTATCAATAAGTTGGAGTCATTACC
>3
GTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGAACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGC
ACCTCAAAAACACCATCATACACTAAATCAGTAAGTTGGCAGCATCACC

Dropping the kmer threshold to 10 (at-least 10) and assembling those repeat k-mers finds 11 repeat sequences, one of length 770 bp and one of length 1195 bp.

For our final example, let’s find the high-value k-mers common to EC931 and MSB1_4I-sc-2280412 and assemble those.

Greedily assembling medium-and-high-value k-mers.¶

% meryl2 -Q \
    print \
    intersect \
      [at-least 48 ec.meryl] \
      [at-least 55 ms.meryl] \
  | \
  perl $MERYL/scripts/greedy-assemble-kmers.pl
>1
GGTAATGACTCCAACTTATTGATAGTGTTTTATGTTCAGATAATGCCCGATGACTTTGTCATGCAGCTCCACCGATT
TTGAGAACGACAGCGACTTCCGTCCCAGCCGTGCCAGGTGCTGCCTCAGATTCAGGTTATGCCGCTCAATTCGCTGC
GTATATCGCTTGCTGATTACGTGCAGCTTTCCCTTCAGGCGGGATTCATACAGCGGCCAGCCATCCGTCATCCATAT
CACCACGTCAAAGGGTGACAGCAGGCTCATAAGACGCCCCAGCGTCGCCAT
>2
AGCCTGTCATACGCGTAAAACAGCCAGCGCTGGCGCGATTTAGCCCCGACATAGCCCCACTGTTCGTCCATTTCCGC
GCAGACGATGACGTCACTGCCCGGCTGTATGCGCGAGGTTACCGACTGCGGCCTGAGTTTTTTAAGTGACGTAAAAT
CGTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCATCCAACGCCATTCATGGCCATATCAATGATTTTCT
GGTGCGTACCGGGTTGAGAAGCGGTGTAAGTGAACTGCAGTTGCCATGTTTTACGGCAGTGAGAGCAGAGATAGCGC
TGATGTCCGGC
>3
GTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGAACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGC
ACCTCAAAAACACCATCATACACTAAATCAGTAAGTTGGCAGCATCACC

Aside from a strand and sequence order difference caused by the assemlber, they’re the same!

The same!¶

% minimap2 --eqx -Y -c ec.fasta ec+ms.fasta
282 0 282 - 2 282 0 282 282 282 60 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:49 s1:i:274 s2:i:0 de:f:0 rl:i:0 cg:Z:282=
319 0 319 + 1 319 0 319 319 319 60 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:58 s1:i:311 s2:i:0 de:f:0 rl:i:0 cg:Z:319=
126 0 126 + 3 126 0 126 126 126 60 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:16 s1:i:115 s2:i:0 de:f:0 rl:i:0 cg:Z:126=

But rearranged.¶

% minimap2 -x sr --eqx -Y -c data/ec.fna.gz ec.fasta
319 0 319 + CP049118.1 4767973 2542965 2543284 319 319 0 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:44 s1:i:304 s2:i:304 de:f:0 rl:i:0 cg:Z:319=
282 0 282 - CP049118.1 4767973 3430238 3430520 282 282 0 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:39 s1:i:266 s2:i:266 de:f:0 rl:i:0 cg:Z:282=
126 0 126 - CP049118.1 4767973 4539117 4539243 126 126 0 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:19 s1:i:123 s2:i:123 de:f:0 rl:i:0 cg:Z:126=

% minimap2 -x sr --eqx -Y -c data/ms.fna.gz ec.fasta
319 0 319 - LR898874.1 5278133   11725   12044 319 319 0 NM:i:0 ms:i:638 AS:i:638 nn:i:0 tp:A:P cm:i:44 s1:i:304 s2:i:304 de:f:0 rl:i:0 cg:Z:319=
282 0 282 - LR898874.1 5278133 1418500 1418782 282 282 0 NM:i:0 ms:i:564 AS:i:564 nn:i:0 tp:A:P cm:i:39 s1:i:266 s2:i:266 de:f:0 rl:i:0 cg:Z:282=
126 0 126 - LR898874.1 5278133 1510654 1510780 126 126 0 NM:i:0 ms:i:252 AS:i:252 nn:i:0 tp:A:P cm:i:19 s1:i:123 s2:i:123 de:f:0 rl:i:0 cg:Z:126=

We’ll stop the quick-start here, before we barrel out of control into repeats.

Quick Start¶

Previous topic

Next topic

This Page