And another short discourse on rling.
"rling, your memory, and your files. A guide."
I thought I would take some time to explain rling's memory usage with some real examples, give performance numbers, and show which mode you would use for different applications. I'm going to refer to several files here:

- 1bs.txt: a 1,000,000,000-line, ~10-gigabyte file containing the numbers 1 through 1,000,000,000 in lexical order.
- xa[a-j]: 10 files of 100,000,000 lines each, in lexical order (generated with split -l 100000000 1bs.txt).
- hashes.org-2012-2019.txt: a ~13-gigabyte file from hashes.org containing 1,270,725,636 unique lines (plus 228 duplicates).
- hs.txt: the previous file, sorted in lexical order with the duplicates removed.
- 10M.txt: a 10,000,000-line file, about 80 megabytes in size.

Note well that the hashes.org file is very much not a "clean" file, in that many lines have NULs or other reserved characters embedded.
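If you want to reproduce a smaller version of this corpus yourself, a sketch along these lines works (my own illustration, not part of rling - and at the full 1,000,000,000 lines it would need a great deal of RAM and patience, so shrink N to experiment):
Code:
# Generate an analogue of 1bs.txt: the numbers 1..N, one per line,
# written in lexical (string) order, matching the description above.
N = 10_000_000  # use 1_000_000_000 for the real thing, if you dare
with open("1bs.txt", "w") as f:
    for s in sorted(str(i) for i in range(1, N + 1)):
        f.write(s + "\n")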
First, let's use the 1bs.txt and xa? files to do a simple remove operation - this should end up with a zero-length output file:
Code:
rling 1bs.txt out xa?
Reading "1bs.txt"...9888888890 bytes total in 2.0060 seconds
Counting lines...Found 1000000000 lines in 2.8918 seconds
Optimal HashPrime is 1610612741
Estimated memory required: 38,878,648,450 (36.21Gbytes)
Processing input list... 1000000000 unique (0 duplicate lines) in 17.8454 seconds
Occupancy is 744971109/1610612741 46.2539%, Maxdepth=9
Removing from "xaa"... 100000000 removed
Removing from "xab"... 100000000 removed
Removing from "xac"... 100000000 removed
Removing from "xad"... 100000000 removed
Removing from "xae"... 100000000 removed
Removing from "xaf"... 100000000 removed
Removing from "xag"... 100000000 removed
Removing from "xah"... 100000000 removed
Removing from "xai"... 100000000 removed
Removing from "xaj"... 100000000 removed
1,000,000,000 total lines removed in 48.6426 seconds
Writing to "out"
Wrote 0 lines in 0.2297 seconds
Total runtime 71.6157 seconds
And we do. We read in 1,000,000,000 lines, removed 10 sets of 100,000,000 lines, and ended up with 0 lines of output. This, the default mode of rling, uses a hash table to quickly look up lines from the "remove" files in the input list. It computed the size of the hash table automatically from the input data: 1,610,612,741 entries (more than the number of input lines). Now, it's true that you can work with a smaller hash table - rling will still work fine, but it will be slower. I'll explain more about this later.
How does this, the default mode, compare to the other modes? Why use other modes? What are the modes?
The hashed access is, generally speaking, the fastest way to use rling. In this mode, rling reads each input line and uses xxHash to get a 64-bit value. This value is then used as an index into the hash table. This way, fewer compare operations need to be done to see if lines are the same when processing the remove lists. The problem is that we are computing a 64-bit value from a (potentially) much longer string, so multiple strings can hash to the same value. rling tells you, in the "Occupancy" line, the worst-case collision count - in this example, 9. That means at most 9 input lines hashed to the same value (well, not exactly the same 64-bit value, because the hash is "wrapped" with a modulo operation to the size of the table, but the effect is the same). For any of those 9 input lines, 1 to 9 string compare operations will need to be done to figure out whether a remove line is _really_ the same as an input line, or not. Still, that is a lot better than having to compare against all 1,000,000,000 lines! But this hashed access takes memory - as you can see from the above, rling estimates that it will need 38,878,648,450 bytes of storage to do this. That's a problem if you don't have a 48G or 64G machine available.
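To make the mechanics concrete, here is a rough Python sketch of the idea - not rling's actual C code - using the third-party xxhash package (pip install xxhash):
Code:
import xxhash

TABLE_SIZE = 1610612741  # rling picks a prime larger than the line count

def build_table(input_lines):
    # Each slot chains the lines that landed in the same bucket; the
    # longest chain is what rling reports as Maxdepth.
    table = {}
    for line in input_lines:
        b = xxhash.xxh64(line).intdigest() % TABLE_SIZE
        table.setdefault(b, []).append(line)
    return table

def present(table, line):
    # At most Maxdepth string compares decide membership, instead of
    # comparing against every one of the input lines.
    b = xxhash.xxh64(line).intdigest() % TABLE_SIZE
    return line in table.get(b, [])

table = build_table([b"apple", b"banana", b"cherry"])
print(present(table, b"banana"), present(table, b"durian"))  # True False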
Because of this, rling supports slower modes of operation, but ones that use less memory. Let's look at the table, and then analyze each one.
| Arguments | Run time | Memory Usage |
|---|---|---|
| rling 1bs.txt out xa? | 71.61 seconds | 37,985,536k |
| rling -b 1bs.txt out xa? | 80.55 seconds | 17,593,728k |
| rling -2 1bs.txt out xa? | 110.63 seconds | 50,496k |
| rling -f 10M.txt out xa? | 395.65 seconds | 132,160k |
The -b mode does not use a hash table; instead it internally sorts the input file into lexical order (and "unsorts" it back to input-file order when complete, unless you ask for sorted output using the -s option). Once sorted, a binary search can be used - not quite as fast as the hash method, but on the other hand there are no collisions, and duplicates can be removed from the input substantially faster, since they end up adjacent. So, while it is 10-15% slower than the default mode, it uses less than half the memory! A substantial savings.
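Here is the same idea as a rough Python sketch (rling does this in C, and in place): sort the lines while remembering their original positions, binary-search each remove line, then write the survivors back out in the original order:
Code:
from bisect import bisect_left

def remove_with_sort(input_lines, remove_lines):
    # Sort positions by line content, so we can "unsort" at the end.
    order = sorted(range(len(input_lines)), key=lambda i: input_lines[i])
    keys = [input_lines[i] for i in order]
    dead = [False] * len(input_lines)
    for r in remove_lines:
        j = bisect_left(keys, r)         # binary search, no collisions
        while j < len(keys) and keys[j] == r:
            dead[order[j]] = True        # adjacent equals catch duplicates
            j += 1
    # Survivors come back in original input order (the "unsort" step).
    return [l for i, l in enumerate(input_lines) if not dead[i]]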
The -2 mode doesn't need much memory at all. This mode (named after the rli2 utility in hashcat-utils) reads the input and remove files at the same time, and creates the output file "on the fly". Unlike rli2, any number of remove files can be used. It uses very little memory (only enough for basic I/O buffering, which you can limit or increase with the -M option), but it does depend on the files being in lexical sorted order, and takes about twice as long as the hashed mode. Also unlike rli2, rling can de-duplicate the input, and it actually *checks* the input to ensure it is in sorted order. If any line is out of order, you will get an error message, and rling will tell you which file and which line(s) are the problem.
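The core of this mode is a classic sorted-merge. A simplified Python sketch follows (rling also checks the remove files for sorted order, which this omits, and buffers I/O much more carefully):
Code:
import heapq, sys

def merge_remove(input_path, output_path, remove_paths):
    # Merge all remove files into one sorted stream of lines.
    removes = heapq.merge(*(open(p, "rb") for p in remove_paths))
    cur = next(removes, None)
    prev = None
    with open(input_path, "rb") as inp, open(output_path, "wb") as out:
        for line in inp:
            if prev is not None and line < prev:
                sys.exit("input file is not in sorted order")
            # Advance the remove stream up to the current input line.
            while cur is not None and cur < line:
                cur = next(removes, None)
            if line != cur and line != prev:  # drop removes, de-dupe
                out.write(line)
            prev = line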
The -f mode actually uses an on-disk database. This is for those who don't have enough memory *and* don't want to sort their files. It is massively slower than any of the other modes - something like 100 to 200 times slower (note that the -f row in the table above was run against the 10,000,000-line 10M.txt, not the billion-line file, and still took the longest by far). You really, really should not use this mode unless you have absolutely no choice.
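Just to show the flavor of the approach (a stand-in using sqlite3 - rling's actual on-disk format is its own), the reason for the slowdown is that every lookup becomes a disk-backed B-tree probe instead of a memory access:
Code:
import sqlite3

def build_db(input_path, db_path):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE lines (line BLOB PRIMARY KEY)")
    with open(input_path, "rb") as f:
        db.executemany("INSERT OR IGNORE INTO lines VALUES (?)",
                       ((line,) for line in f))
    db.commit()
    return db

def present(db, line):
    # Every probe touches the on-disk index - hence the huge slowdown.
    return db.execute("SELECT 1 FROM lines WHERE line = ?",
                      (line,)).fetchone() is not None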
Yeah, yeah - but what about *REAL* files?
Test files are nice, but what mode is best for working with real files? That depends on your system, and your files. If at all possible, try to fit whatever you are doing into memory. That might mean cutting the input in half and doing two (or more) runs, then joining the results - see the sketch after the table below. I assure you that this will be faster than the alternatives. Let's look at real-world data again (I wish I knew how to size these tables better). Hopefully, this gives you some idea of what you need - handling a 1.2-billion-line, 13-gigabyte file can be done very quickly with only 32 gigabytes of RAM (using the -b option). Not many files need to be this large, though.
| Arguments | Run time | Memory Usage |
|---|---|---|
| rling hashes.org-2012-2019.txt out xa? | 88.88 seconds | 45,871,680k |
| rling -b hashes.org-2012-2019.txt out xa? | 100.86 seconds | 23,368,256k |
| rling -2 hs.txt out xa? | 146.70 seconds | 54,720k |
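And if your input really won't fit in memory, here is the sort of "cut it in half" workflow I mean, sketched in Python (the file names are made up for illustration, and rling is assumed to be on your PATH). It works because each input line's fate depends only on the remove lists, so the two halves are independent:
Code:
import shutil, subprocess

def rling_in_halves(input_path, output_path, remove_paths):
    # Count lines first, so the input can be streamed into two halves
    # without ever holding it in memory.
    with open(input_path, "rb") as f:
        total = sum(1 for _ in f)
    parts = ["part0", "part1"]
    with open(input_path, "rb") as f, \
         open(parts[0], "wb") as p0, open(parts[1], "wb") as p1:
        for i, line in enumerate(f):
            (p0 if i < total // 2 else p1).write(line)
    # One rling run per half, against the same remove lists. Caveat:
    # duplicates spanning the two halves won't be de-duped across them.
    for part in parts:
        subprocess.run(["rling", part, "out-" + part, *remove_paths],
                       check=True)
    # Stitch the two outputs back together.
    with open(output_path, "wb") as dst:
        for part in parts:
            with open("out-" + part, "rb") as src:
                shutil.copyfileobj(src, dst)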