.. _history: History ======= Meryl was first imagined in late-2000/early-2001/mid-2001 while BPW as at `Celera Genomics `_. A 3 June 2001 journal entry is titled "merConting the genome", but makes no mention of motivation for doing so. The same journal has an 18 June entry outlining the algorithm, and this entry also strongly hints the application was to build a fast 20-mer to assembly coordinate lookup table. The "Meryl" name is mentioned on 16 August 2002, desiring to increment/decrement the counts. There is even *source code* dating back to 12 February 2001 deep in the dusty archives; as that's the oldest archive found, meryl is doubtless a little bit older yet. It was originally called "merMaid" or "merCounter", the "Meryl" name first appears on 11 June 2001, and the code looks reasonably mature. Meryl was definitely used twice in the assembly of Anopheles gambiae [`Holt, et al. 2002 `_], once to find k-mers that occur frequently and exclude them from seeding overlaps (which is how it is still used in Canu [`Koren and Walenz 2017 `_]) and once to estimate repeat content of the assembly. It is mentioned, anonymously, the paper: By counting the number of times each 20-nucleotide oligomer in the Anopheles and Drosophila assemblies appeared in its corre- sponding whole-genome shotgun data, we con- firmed that simple repeats are not expanded in Anopheles The supplement elaborates a bit: **METHODS FOR ESTIMATING REPEAT CONTENT:** The consensus sequence of scaffolded contigs was analyzed for repeat content as follows. For each 20mer of the consensus, we counted the number of times that that 20mer appeared in the set of approximately 4.5 million sequence reads. Out of 262.8 M consensus 20mers, 26.3M (10.0%) occurred more than 50 times (5 times what is expected for unique sequence, given 10- fold sequence coverage). For comparison, of 133.5M consensus 20mers from a recent reassembly of the D. melanogaster genome that was done using the same version of the assembly software that was used for mosquito , 13.0M (9.7%) were observed more than 60 times (5 times more than expected, given 12-fold sequence coverage of the Drosophila genome). Using a count cutoff of 1.5 times expected or 50 times expected gives a similar picture: the repetative fractions of the A. gambiae and D. melanogaster genomes are not strikingly different. Unfortunately, the earliest versions of the software are not online, likely because they were never stored in a version control system. The earliest online version is found in the `initial commit `_ of the `kmer subversion repository `_ on 2 January 2003. The `main function in meryl.C `_ shows it supporting the "binary and mathematical" operations, including "increment/decrement" from the 16 August 2002 journal entry. The `Celera Assembler `_ subversion repository also has a relatively early version. CA was publicly released to SourceForge on 14 April 2004 (`revision 4 has the good stuff `_). Use of meryl is mentioned in `wga.pl `_. The meryl code and a CA-specific function to access sequence data are `nicely organized `_, the bulk of meryl in one file (`AS_MER_meryl.cc `_). The offline BPW archives contain several early versions, including one just nine days after the first journal entry mentioned above: .. code-block:: none -rw-r--r-- 1 bri bri 5964 Jun 10 2001 ne-20010611-1812/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 180 Jul 3 2001 ne-20010703-1747/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 1322 Jul 6 2001 ne-20010706-1813/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4132 Jul 9 2001 ne-20010709-1647/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4294 Jul 10 2001 ne-20010710-1842/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4675 Jul 11 2001 ne-20010712-1907/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4775 Jul 18 2001 ne-20010718-1745/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4706 Jul 23 2001 ne-20010723-1800/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4850 Aug 15 2001 ne-20010817-1800/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4988 Feb 27 2005 ne-20010831-1745/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 4988 Aug 30 2001 ne-20010912-1829/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 5130 Feb 27 2005 ne-20010928-1826/near-identity/meryl/meryl.C -rw-r--r-- 1 bri bri 5707 Feb 27 2005 ne-20011109-1837/near-identity/meryl/meryl.C (I'm not sure where the 2005 time stamps came from; I vaguely remember replacing duplicate files with symlinks before some of these directories were tar'd and compressed.) Peeking into that first version, we find: .. code-block:: none d:.../ne-20010611-1812/near-identity/meryl> ls -l total 88 -rw-r--r-- 1 bri bri 2187 Jun 10 2001 1.C -rw-r--r-- 1 bri bri 1232 Jun 10 2001 2.C -rw-r--r-- 1 bri bri 1582 Jun 10 2001 3.C -rw-r--r-- 1 bri bri 2496 Jun 10 2001 4.C -rw-r--r-- 1 bri bri 538 Jun 11 2001 Makefile -rwxr-xr-x 1 bri bri 50432 Jun 10 2001 meryl -rw-r--r-- 1 bri bri 5964 Jun 10 2001 meryl.C -rw-r--r-- 1 bri bri 1725 Jun 10 2001 meryl.H Is that a binary? .. code-block:: none d:.../ne-20010611-1812/near-identity/meryl> file meryl meryl: ELF 32-bit LSB executable, \ Intel 80386, \ version 1 (FreeBSD), \ dynamically linked, \ interpreter /usr/libexec/ld-elf.so.1, \ for FreeBSD 4.1, \ not stripped The algorithm is quite simple, but the details are complicated: it simply builds a list of all the kmers in the input sequences, sorts, then counts how many times each kmer is in the list and write that to the output file. The details are to split the kmer into a prefix and a suffix. The prefix points to a bucket into which all the suffixes are listed. Once all kmers are stored, each bucket is sorted then scanned to count the kmers. Bit-packed integers are used throughout to minimize memory usage. In all it's embarassing glory, here is the first ever version of meryl (omitting the functions that do the actual work). .. code-block:: c++ #include "meryl.H" // theSeq is a compressed sequence. Three bits per character, the first bit // telling us if the next two bits are valid sequence. // u64bit *theSeq = 0L; u64bit theSeqLen = 0; u64bit numberOfMers = 0; // theCounts is a packed-word array of the number of mers // that all have the same TABLEBITS bits on the left side of // the mer. // // theMers is a list of the other bits in the mers. // u64bit *theCounts = 0L; u64bit theCountsWords = 0; u64bit *theMers = 0L; u64bit theNumberOfMers = 0; u64bit approxMers = 1073741824; u64bit merSize = 20; u64bit merMask = u64bitMASK(merSize); u64bit tblBits = 20; u64bit tblMask = u64bitMASK(tblBits); u64bit chkBits = 0; u64bit chkMask = 0; u64bit bitsPerIndex = 0; bool beVerbose = true; bool doReverseComplement = false; char *seqFileName = 0L; FILE *seqFile = 0L; void usage(char *name) { fprintf(stderr, "usage: %s [-m merSize] [-v] [-r] [-f seqFile] [-s seqFileSize] [-h]\n", name); fprintf(stderr, " -m Sets the size of mer to count. Must be less than 32.\n"); fprintf(stderr, " -v Be noisy about it.\n"); fprintf(stderr, " -r Also counts the reverse complement.\n"); fprintf(stderr, " -f File to count mers in. MultiFastA\n"); fprintf(stderr, " -s Upper limit on the number of mers.\n"); fprintf(stderr, " -h This.\n"); } int main(int argc, char **argv) { int arg = 1; while (arg < argc) { switch (argv[arg][1]) { case 'v': beVerbose = true; break; case 'r': doReverseComplement = true; break; case 'f': arg++; seqFileName = argv[arg]; break; case 'm': arg++; merSize = atol(argv[arg]); break; case 's': arg++; approxMers = atol(argv[arg]); break; case 'h': usage(argv[0]); exit(1); break; default: usage(argv[0]); fprintf(stderr, "Unknown option '%s'\n", argv[arg]); exit(1); break; } arg++; } if ((seqFileName == 0L) || ((seqFileName[0] == '-') && (seqFileName[1] == 0))) seqFile = stdin; else seqFile = fopen(seqFileName, "r"); if (seqFile == 0L) { fprintf(stderr, "Couldn't open the sequence file '%s'.\n", seqFileName); exit(1); } //////////////////////////////////////////////////////////// // // Based on the approximate number of mers given, set the // size of the table and number of bits per each table entry. // // Not the greatest way to find the number of bits needed // to encode approxMers, but easy // bitsPerIndex = 1; while (approxMers > u64bitMASK(bitsPerIndex)) bitsPerIndex++; // Determine the size of the table to reduce the memory footprint. // This is computed backwards, so that we pick the best smallest // table size. // u64bit bestMem = ~u64bitZERO; u64bit bestSiz = u64bitZERO; u64bit mem; for (u64bit siz=32; siz>18; siz-=2) { mem = bitsPerIndex * (u64bitONE << siz) + approxMers * (2*merSize - siz); mem >>= 23; if (mem < bestMem) { bestMem = mem; bestSiz = siz; } } tblBits = bestSiz; tblMask = u64bitMASK(tblBits); chkBits = 2 * merSize - tblBits; chkMask = u64bitMASK(chkBits); fprintf(stderr, "You told me there will be no more than %lu mers.\n", approxMers); fprintf(stderr, "Using %2lu bits to encode positions.\n", bitsPerIndex); fprintf(stderr, "Using %2lu bits of the mer for the count.\n", tblBits); fprintf(stderr, "Using %2lu bits of the mer for the check.\n", chkBits); //////////////////////////////////////////////////////////// // // Allocate and clear the counts // theCountsWords = bitsPerIndex * (u64bitONE << tblBits) / 64 + 2; if (beVerbose) { fprintf(stderr, "Allocating and clearing the counting table.\n"); fprintf(stderr, " %lu entries at %lu bits == %lu MB.\n", u64bitONE << tblBits, bitsPerIndex, theCountsWords * 8 >> 20); } theCounts = new u64bit [ theCountsWords ]; for (u64bit i=theCountsWords; i--; ) theCounts[i] = u64bitZERO; ////////////////////////////////////////////////////////////// // // theSeqLen = 0; numberOfMers = 0; theSeq = new u64bit [ 3 * (u64bitONE << bitsPerIndex) / 64 + 1 ]; if (beVerbose) fprintf(stderr, "Computing the bucket sizes (pass one through the sequence).\n"); readSequenceAndComputeBucketSize(); if (theNumberOfMers > approxMers) { fprintf(stderr, "Sorry. You need to increase your estimate of the number of mers!\n"); fprintf(stderr, "I found %lu mers.\n", theNumberOfMers); exit(1); } ////////////////////////////////////////////////////////////// // // if (beVerbose) fprintf(stderr, "Converting counts to offsets (%lu bits or %lu Mbuckets).\n", tblBits, (u64bitONE << tblBits) >> 20); convertCounts(); ////////////////////////////////////////////////////////////// // // Allocate theMers, but don't clear them. // if (beVerbose) { fprintf(stderr, "Allocating space for the mers.\n"); fprintf(stderr, " %lu mers at %lu bits each == %lu MB.\n", theNumberOfMers, chkBits, chkBits * theNumberOfMers >> 23); } theMers = new u64bit [ ((chkBits * theNumberOfMers) >> 6) + 1 ]; ////////////////////////////////////////////////////////////// // // if (beVerbose) fprintf(stderr, "Filling the buckets with mers (pass two through the sequence).\n"); fill(); ////////////////////////////////////////////////////////////// // // Sort each bucket // if (beVerbose) fprintf(stderr, "Sorting buckets (%lu total).\n", u64bitONE << tblBits); sort(); // // Yeah, I didn't close the stupid input file. So what? // } The first code block reads input sequence from disk, converts each base into a 3-bit code, the counts the size of each bucket. .. code-block:: c++ #include "meryl.H" void readSequenceAndComputeBucketSize(void) { double startTime = getTime() - 0.1; u64bit count = 0; theNumberOfMers = u64bitZERO; s32bit timeUntilValid = 0; u64bit substring = u64bitZERO; u64bit partialEncoding = u64bitZERO; u32bit partialEncodingSize = u32bitZERO; for (unsigned char ch=nextCharacter(seqFile); ch != 255; ch=nextCharacter(seqFile)) { if (ch > 127) { timeUntilValid = merSize; ch = 0x07; } else { ch = compressSymbol[ch]; substring <<= 2; substring |= ch; timeUntilValid--; if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) { fprintf(stderr, " %4ld Mbases -- %8.5f Mb per second\r", count >> 20, count / (getTime() - startTime) / (1000000.0)); fflush(stderr); } if (timeUntilValid <= 0) { timeUntilValid = 0; theNumberOfMers++; u64bit I = ((substring >> chkBits) & tblMask) * bitsPerIndex; preIncrementDecodedValue(theCounts, I >> 6, I & 0x000000000000003f, bitsPerIndex); } } // Place the character into the encoded sequence. If the character was // not a valid base, ch will have the "I'm a crappy character" flag set, // otherwise, ch is a valid base encoding. // partialEncoding <<= 3; partialEncoding |= ch; partialEncodingSize++; // We can fit 21 bases (at 3 bits per base, 63 bits) per 64 bit word // without going to any trouble. // if (partialEncodingSize == 21) { theSeq[theSeqLen++] = partialEncoding; theSeq[theSeqLen] = ~u64bitZERO; partialEncoding = u64bitZERO; partialEncodingSize = u32bitZERO; } } // Push on the last partialEncoding, if one exists. // if (partialEncodingSize > 0) { while (partialEncodingSize != 21) { partialEncoding <<= 3; partialEncoding |= 0x07; partialEncodingSize++; } theSeq[theSeqLen++] = partialEncoding; } if (beVerbose) fprintf(stderr, "\n"); } The second block of code converts the bucket sizes into an offset to the start of each bucket. .. code-block:: none #include "meryl.H" void convertCounts(void) { double startTime = getTime() - 0.1; u32bit count = 0; // Convert the counts into offsets into theMers // // We use the trick pointed out by Liliana of setting the offset to the LAST // entry in the bucket, then decrementing when filling. This also makes // the code below easier -- we don't need to store the size of the bucket // so we can increment the sum after we set the bucket offset. // u64bit i = 0; u64bit m = (u64bitONE << tblBits) + 1; u64bit s = 0; while (i < m) { if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) { fprintf(stderr, " %4ld Mbuckets -- %8.5f Mbuckets per second\r", count >> 20, count / (getTime() - startTime) / (1000000.0)); fflush(stderr); } u64bit I = i * bitsPerIndex; s = sumDecodedValue(theCounts, I >> 6, I & 0x000000000000003f, bitsPerIndex, s); i++; } if (s != theNumberOfMers) { fprintf(stderr, "internal error in stage 2: s != theNumberOfMers.\n"); exit(1); } if (beVerbose) fprintf(stderr, "\n"); } The third block of code makes a second pass through all the kmers in the sequence, adding suffixes to the bucket indicated by the prefix. .. code-block:: none #include "meryl.H" void fill(void) { double startTime = getTime() - 0.1; u32bit count = 0; s32bit timeUntilValid = 0; u64bit substring = u64bitZERO; // Stream again, filling theMers // for (u64bit i=0; i>= 60; ch &= 0x07; #if 0 ch = t & 0x7000000000000000; ch >>= 60; #endif t <<= 3; if (ch & 0x04) { timeUntilValid = merSize; } else { substring <<= 2; substring |= ch; timeUntilValid--; if (beVerbose && ((++count & (u64bit)0x3fffff) == u64bitZERO)) { fprintf(stderr, " %4ld Mbases -- %8.5f Mb per second\r", count >> 20, count / (getTime() - startTime) / (1000000.0)); fflush(stderr); } if (timeUntilValid <= 0) { timeUntilValid = 0; I = ((substring >> chkBits) & tblMask) * bitsPerIndex; J = chkBits * preDecrementDecodedValue(theCounts, I >> 6, I & 0x000000000000003f, bitsPerIndex); setDecodedValue(theMers, J >> 6, J & 0x3f, chkBits, substring & chkMask); } } } } if (beVerbose) fprintf(stderr, "\n"); } The fourth block of code sorts the kmers and writes output. It is using its own implementation of `heap sort _`, likely beacuse the C sort() implementation is not in place. Output is written to ``stdout``. .. code-block:: none #include "meryl.H" typedef u64bit heapbit; void adjustHeap(heapbit *M, s64bit i, s64bit n) { heapbit m = M[i]; s64bit j = (i << 1) + 1; // let j be the left child while (j < n) { if (j= M[j]) // a position for M[i] has been found break; // Move larger child up a level // M[(j-1)/2] = M[j]; j = (j << 1) + 1; } M[(j-1)/2] = m; } void sortBucket(u64bit b) { u64bit st = getCount(b); u64bit ed = getCount(b+1); u64bit sz = ed - st; if (sz == 0) return; heapbit *a = new heapbit [sz + 1]; for (u64bit i=st; i> 6, J & 0x3f, chkBits); } u64bit substring; char mer[merSize+1]; u64bit count = u64bitONE; u32bit i; u32bit m; // Sort if there is more than one item // if (sz > 1) { s64bit i; // Create the heap of lines. // for (i=(sz-2)/2; i>=0; i--) adjustHeap(a, i, sz); // Interchange the new maximum with the element at the end of the tree // for (i=sz-1; i>0; i--) { heapbit m = a[i]; a[i] = a[0]; a[0] = m; adjustHeap(a, 0, i); } } for (i=1; i 100) { substring = b << chkBits | a[i-1]; for (m=0; m> (2*m)) & 0x03]; mer[m] = 0; printf("%s %lu\n", mer, count); } count = 1; } } // Output the last one // if (count > 100) { substring = b << chkBits | a[i-1]; for (m=0; m> (2*m)) & 0x03]; mer[m] = 0; printf("%s %lu\n", mer, count); } delete [] a; } void sort(void) { double startTime = getTime() - 0.1; u32bit count = 0; u64bit m = u64bitONE << tblBits; for (u64bit i=0; i