And another short discourse on rling.
"rling, your memory, and your files. A guide."
I thought I would take some time to explain rling's memory usage with some real examples, give performance numbers, and show which mode you would use for different applications. I'm going to refer to several files here:

- 1bs.txt: a 1,000,000,000-line, ~10-gigabyte file containing the numbers 1 through 1,000,000,000 in lexical order.
- xa[a-j]: 10 files of 100,000,000 lines each, in lexical order (generated with split -l 100000000 1bs.txt).
- hashes.org-2012-2019.txt: a ~13-gigabyte file from hashes.org containing 1,270,725,636 unique lines (plus 228 duplicates).
- hs.txt: the previous file, sorted in lexical order with the duplicates removed.
- 10M.txt: a 10,000,000-line file, about 80 megabytes in size.

Note well that the hashes.org file is very much not a "clean" file, in that many lines have NULs or other reserved characters embedded.
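If you want to reproduce a smaller version of this corpus yourself, a sketch along these lines works (my own illustration, not part of rling - and at the full 1,000,000,000 lines it would need a great deal of RAM and patience, so shrink N to experiment):
Code:
# Generate an analogue of 1bs.txt: the numbers 1..N, one per line,
# written in lexical (string) order, matching the description above.
N = 10_000_000  # use 1_000_000_000 for the real thing, if you dare
with open("1bs.txt", "w") as f:
    for s in sorted(str(i) for i in range(1, N + 1)):
        f.write(s + "\n")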
First, let's use the 1bs.txt and xa? files to do a simple remove operation - this should end up with a zero-length output file:
Code:
rling 1bs.txt out xa?
Reading "1bs.txt"...9888888890 bytes total in 2.0060 seconds
Counting lines...Found 1000000000 lines in 2.8918 seconds
Optimal HashPrime is 1610612741
Estimated memory required: 38,878,648,450 (36.21Gbytes)
Processing input list... 1000000000 unique (0 duplicate lines) in 17.8454 seconds
Occupancy is 744971109/1610612741 46.2539%, Maxdepth=9
Removing from "xaa"... 100000000 removed
Removing from "xab"... 100000000 removed
Removing from "xac"... 100000000 removed
Removing from "xad"... 100000000 removed
Removing from "xae"... 100000000 removed
Removing from "xaf"... 100000000 removed
Removing from "xag"... 100000000 removed
Removing from "xah"... 100000000 removed
Removing from "xai"... 100000000 removed
Removing from "xaj"... 100000000 removed
1,000,000,000 total lines removed in 48.6426 seconds
Writing to "out"
Wrote 0 lines in 0.2297 seconds
Total runtime 71.6157 seconds
And we do. We read in 1,000,000,000 lines, removed 10 sets of 100,000,000 lines, and ended up with 0 lines of output. This, the default mode of rling, uses a hash table to quickly look up lines from the "remove" files in the input list. It computed the size of the hash table automatically from the input data: 1,610,612,741 entries (more than the number of input lines). Now, it's true that you can work with a smaller hash table - rling will still work fine, but it will be slower. I'll explain more about this later.
How does this, the default mode, compare to the other modes? Why use other modes? What are the modes?
The hashed access is, generally speaking, the fastest way to use rling. In this mode, rling reads each input line and uses xxHash to get a 64-bit value. This value is then used as an index into the hash table. This way, fewer compare operations need to be done to see if lines are the same when processing the remove lists. The problem is that we are computing a 64-bit value from a (potentially) much longer string, so multiple strings can hash to the same value. rling tells you, in the "Occupancy" line, the worst-case collision count - in this example, 9. That means at most 9 input lines hashed to the same value (well, not exactly the same 64-bit value, because the hash is "wrapped" with a modulo operation to the size of the table, but the effect is the same). For any of those 9 input lines, 1 to 9 string compare operations will need to be done to figure out whether a remove line is _really_ the same as an input line, or not. Still, that is a lot better than having to compare against all 1,000,000,000 lines! But this hashed access takes memory - as you can see from the above, rling estimates that it will need 38,878,648,450 bytes of storage to do this. That's a problem if you don't have a 48G or 64G machine available.
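To make the mechanics concrete, here is a rough Python sketch of the idea - not rling's actual C code - using the third-party xxhash package (pip install xxhash):
Code:
import xxhash

TABLE_SIZE = 1610612741  # rling picks a prime larger than the line count

def build_table(input_lines):
    # Each slot chains the lines that landed in the same bucket; the
    # longest chain is what rling reports as Maxdepth.
    table = {}
    for line in input_lines:
        b = xxhash.xxh64(line).intdigest() % TABLE_SIZE
        table.setdefault(b, []).append(line)
    return table

def present(table, line):
    # At most Maxdepth string compares decide membership, instead of
    # comparing against every one of the input lines.
    b = xxhash.xxh64(line).intdigest() % TABLE_SIZE
    return line in table.get(b, [])

table = build_table([b"apple", b"banana", b"cherry"])
print(present(table, b"banana"), present(table, b"durian"))  # True False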
Because of this, rling supports slower modes of operation, but ones that use less memory. Let's look at the table, and then analyze each one.
| Arguments | Run time | Memory Usage |
|---|---|---|
| rling 1bs.txt out xa? | 71.61 seconds | 37,985,536k |
| rling -b 1bs.txt out xa? | 80.55 seconds | 17,593,728k |
| rling -2 1bs.txt out xa? | 110.63 seconds | 50,496k |
| rling -f 10M.txt out xa? | 395.65 seconds | 132,160k |
The -b mode does not use a hash table; instead it internally sorts the input file into lexical order (and "unsorts" it back to input-file order when complete, unless you ask for sorted output using the -s option). Once sorted, a binary search can be used - not quite as fast as the hash method, but on the other hand there are no collisions, and duplicates can be removed from the input substantially faster, since they end up adjacent. So, while it is 10-15% slower than the default mode, it uses less than half the memory! A substantial savings.
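Here is the same idea as a rough Python sketch (rling does this in C, and in place): sort the lines while remembering their original positions, binary-search each remove line, then write the survivors back out in the original order:
Code:
from bisect import bisect_left

def remove_with_sort(input_lines, remove_lines):
    # Sort positions by line content, so we can "unsort" at the end.
    order = sorted(range(len(input_lines)), key=lambda i: input_lines[i])
    keys = [input_lines[i] for i in order]
    dead = [False] * len(input_lines)
    for r in remove_lines:
        j = bisect_left(keys, r)         # binary search, no collisions
        while j < len(keys) and keys[j] == r:
            dead[order[j]] = True        # adjacent equals catch duplicates
            j += 1
    # Survivors come back in original input order (the "unsort" step).
    return [l for i, l in enumerate(input_lines) if not dead[i]]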
The -2 mode doesn't need much memory at all. This mode (named after the rli2 utility in hashcat-utils) reads the input and remove files at the same time, and creates the output file "on the fly". Unlike rli2, any number of remove files can be used. It uses very little memory (only enough for basic I/O buffering, which you can limit or increase with the -M option), but it does depend on the files being in lexical sorted order, and takes about twice as long as the hashed mode. Also unlike rli2, rling can de-duplicate the input, and it actually *checks* the input to ensure it is in sorted order. If any line is out of order, you will get an error message, and rling will tell you which file and which line(s) are the problem.
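The core of this mode is a classic sorted-merge. A simplified Python sketch follows (rling also checks the remove files for sorted order, which this omits, and buffers I/O much more carefully):
Code:
import heapq, sys

def merge_remove(input_path, output_path, remove_paths):
    # Merge all remove files into one sorted stream of lines.
    removes = heapq.merge(*(open(p, "rb") for p in remove_paths))
    cur = next(removes, None)
    prev = None
    with open(input_path, "rb") as inp, open(output_path, "wb") as out:
        for line in inp:
            if prev is not None and line < prev:
                sys.exit("input file is not in sorted order")
            # Advance the remove stream up to the current input line.
            while cur is not None and cur < line:
                cur = next(removes, None)
            if line != cur and line != prev:  # drop removes, de-dupe
                out.write(line)
            prev = line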
The -f mode actually uses an on-disk database. This is for those who don't have enough memory *and* don't want to sort their files. It is massively slower than any of the other modes - something like 100 to 200 times slower (note that the -f row in the table above was run against the 10,000,000-line 10M.txt, not the billion-line file, and still took the longest by far). You really, really should not use this mode unless you have absolutely no choice.
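Just to show the flavor of the approach (a stand-in using sqlite3 - rling's actual on-disk format is its own), the reason for the slowdown is that every lookup becomes a disk-backed B-tree probe instead of a memory access:
Code:
import sqlite3

def build_db(input_path, db_path):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE lines (line BLOB PRIMARY KEY)")
    with open(input_path, "rb") as f:
        db.executemany("INSERT OR IGNORE INTO lines VALUES (?)",
                       ((line,) for line in f))
    db.commit()
    return db

def present(db, line):
    # Every probe touches the on-disk index - hence the huge slowdown.
    return db.execute("SELECT 1 FROM lines WHERE line = ?",
                      (line,)).fetchone() is not None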
Yeah, yeah - but what about *REAL* files?
Test files are nice, but what mode is best for working with real files? That depends on your system, and your files. If at all possible, try to fit whatever you are doing into memory. That might mean cutting the input in half and doing two (or more) runs, then joining the results - see the sketch after the table below. I assure you that this will be faster than the alternatives. Let's look at real-world data again (I wish I knew how to size these tables better). Hopefully, this gives you some idea of what you need - handling a 1.2-billion-line, 13-gigabyte file can be done very quickly with only 32 gigabytes of RAM (using the -b option). Not many files need to be this large, though.
| Arguments | Run time | Memory Usage |
|---|---|---|
| rling hashes.org-2012-2019.txt out xa? | 88.88 seconds | 45,871,680k |
| rling -b hashes.org-2012-2019.txt out xa? | 100.86 seconds | 23,368,256k |
| rling -2 hs.txt out xa? | 146.70 seconds | 54,720k |
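And if your input really won't fit in memory, here is the sort of "cut it in half" workflow I mean, sketched in Python (the file names are made up for illustration, and rling is assumed to be on your PATH). It works because each input line's fate depends only on the remove lists, so the two halves are independent:
Code:
import shutil, subprocess

def rling_in_halves(input_path, output_path, remove_paths):
    # Count lines first, so the input can be streamed into two halves
    # without ever holding it in memory.
    with open(input_path, "rb") as f:
        total = sum(1 for _ in f)
    parts = ["part0", "part1"]
    with open(input_path, "rb") as f, \
         open(parts[0], "wb") as p0, open(parts[1], "wb") as p1:
        for i, line in enumerate(f):
            (p0 if i < total // 2 else p1).write(line)
    # One rling run per half, against the same remove lists. Caveat:
    # duplicates spanning the two halves won't be de-duped across them.
    for part in parts:
        subprocess.run(["rling", part, "out-" + part, *remove_paths],
                       check=True)
    # Stitch the two outputs back together.
    with open(output_path, "wb") as dst:
        for part in parts:
            with open("out-" + part, "rb") as src:
                shutil.copyfileobj(src, dst)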