rling (rli replacement) testing required

Waffle

Plaintext considered Harmful

This introduces the "rehex" utility, and an old problem I ran into, many years ago, while I was developing mdxfind. It's a good time to go over it, as I just had a user complain about this very issue. Back in the day, I was discovering many "created" hashes that included interesting "unprintable" characters, along with the traditional UTF-8, UTF-16, and high-bit-set ASCII. Many programs at the time (hashcat, for example) would correctly crack a hash with an embedded NUL character, but would not be able to output it, meaning that attempting to re-use the "cracked" words just would not work. I created the $HEX[] format and petitioned for it to be added to hashcat. It was originally rejected, but fortunately was added about a year later. As an aside, there are other "special" keywords that mdxfind recognizes, such as $TESTVEC, which allows for patterned test vectors to be loaded, up to 1 megabyte in size. Anyway, back to the problem.

The user reported that when using rling to remove his already known passwords from a new list, some were not being removed! One of the examples was "qwerty". When I got the list he was using, I did a quick grep:
Code:
 grep ^qwerty$ kaonashi.txt
qwerty
qwertypython
qwerty
Interesting. The pattern appears on 3 lines - and one of them can't possibly be correct, given the pattern I gave grep. This, then, is the reason for rehex: taking things that are not obvious, and making them visible:
Code:
grep ^qwerty$ kaonashi.txt | rehex
qwerty
$HEX[71776572747900707974686f6e]
$HEX[71776572747900]
As can be seen, the second and third "matches" are not the same at all - and contain embedded NUL characters. Looking at something, even with a text editor, may not reveal exactly what the line contains.

rehex will take an input file (or a group of files, or stdin), and properly encode each line, wrapping it with $HEX[] as required. This ensures that things like embedded NUL, \r, \n and other "special" characters all make it through processing correctly. This is a great thing to do to prepare your lists for long-term use:
Code:
rehex input >output.final
Rehex has a number of features that allow you to control how decisions are made. The -S switch lets you "set" which characters in a line will trigger $HEX[] encoding, and -U lets you "unset" characters. You can list the actual characters you want, separated by commas, use ranges with a - between them, or give decimal or hex values (or ranges) to finely control exactly what you want. For example, if you want to encode NULs *only*, you can say rehex -U 0-255 -S 0x00, or, to include CR and LF in the list, rehex -U 0-255 -S 0x00,10,13.
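As a concrete sketch of those two examples (the input file name and the output redirections are just placeholders):
Code:
rehex -U 0-255 -S 0x00 input >nul-only-encoded.txt         # $HEX[] only lines containing NUL
rehex -U 0-255 -S 0x00,10,13 input >nul-cr-lf-encoded.txt  # also trigger on CR and LF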

But rehex has another use as well - you can "unhex" data with rehex -u. This removes the $HEX[] encoding and returns the file to regular, unsafe, harmful plaintext.
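A quick round trip to illustrate the pair (a sketch; the file names are placeholders):
Code:
rehex wordlist.txt >wordlist.safe      # encode anything risky as $HEX[...]
rehex -u wordlist.safe >wordlist.raw   # decode back to plaintext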
 

Waffle

Oh yeah - there is a new rling out. I've ripped out the Berkeley database system I was using, and replaced it with a "virtual memory" system. This allows people even with very small memory sizes to process large files using the -f switch, and it is a *lot* faster. Here are results using the same data as before. Yes, it's not going to be as fast as having everything in memory, but it saves you from having to split files up. Enjoy!

Arguments                    Run time          Memory Usage
rling 1bs.txt out xa?        71.61 seconds     37,985,536k
rling -b 1bs.txt out xa?     80.55 seconds     17,593,728k
rling -2 1bs.txt out xa?     110.63 seconds    50,496k
rling -f 1bs.txt out xa?     636.82 seconds    100M (approximate)
 

Waffle

Organizing your lists

As you are working through lists, particularly those that have more than one hash type in them, how you organize your data becomes more important. Here's how I've dealt with lists for the last 10 years or so, and it has seemed to work for me. When I first get a list of new hashes it is, inevitably, "dirty" in some way. Maybe it has text at the beginning describing it. Maybe it was exported from an Excel spreadsheet, or it has extra fields in it other than the hash. So, I always store the original hash list as name.orig, like isw2012-raw-md5.orig. I then do whatever processing is required, and create the "working" list, which I would name isw2012-raw-md5.txt. This is what will be worked on going forward - and becomes the "left" list.

Hash lists can, and often do, contain more than one kind of hash. Sometimes, I'm told that you can just look at the hash, and know what kind it is. Rarely, that's true. But in the case of a contest-organized list, you can trust them, right? Right? I mean, isw2012-raw-md5 has raw MD5 *right in the name*! But of course, there's more than one kind of hash in this list. Let's look at some examples, and then how to deal with it.

Sure there are MD5's:
97bf34d31a8710e6b1649fd33357f783:pa55w0rd
b154356eddac45e2b8af33c5ed24028c:pacific

But there are also truncated SHA1:
a207f2750870f0cb9a0a3028f7040a7e:amsterdam

And even chopped up SHA256!:
928c7fd9ac252ccfe68407e5e817c6be:iorn cross
9f4750b346b24148f3920ac9372e40d4:iorn cross

In fact, in this list there are millions of "other than MD5" hashes. I've found (so far) over 800 different hash types (and variations). And no, I've not found them all. In fact, I have 5 million "unsolved" hashes in this list after 8 years (but I have managed to solve 134 million of them :-)

Just dumping all of the "solved" hashes into a bucket may work for some people. I mean, if the purpose is to "get the password" for the contest, does it matter what kind of hash it is? And for those people, just a hashcat.potfile is probably good enough. But I want more than that. I want to be able to _verify_ the hashes, which means that I need to know the algorithm used - and potentially the encoding of the word - in order to do that. So, I came up with a system to mark the type of hash, and a tool, mdsplit, to split the "solved" hashes from the unsolved hash list.

The way I do this is to indicate the hash type as all-uppercase, then add an "iteration" count to the end. This is the number of times the hash function is "looped". For example, most of the hashes in this list are just "normal" MD5, so they are grouped into a solved list with an extension .MD5x01. "double" md5 (like hashcat mode -m 2600), are stored in .MD5x02, and so on. In this list, there are at least 120 different iteration levels for just MD5, from MD5x01 to MD5x999999:

edf788209a1cdc6eecb84e713d15a7b5:1

So a list of the isw2012-raw-md5 files looks like this:

isw2012-raw-md5.GOST-CRYPTOx01
isw2012-raw-md5.GOSTx01
isw2012-raw-md5.GROESTL256x01
isw2012-raw-md5.HAV128_4MD5x01
isw2012-raw-md5.HAV128_4x01
isw2012-raw-md5.HAV128_5x01
isw2012-raw-md5.HAV128MD5x01
...
isw2012-raw-md5.txt
isw2012-raw-md5.WRLx01

The mdsplit utility (found in the mdxfind distribution, available on hashes.org) does this splitting for you automatically (at least when using mdxfind to solve the hashes). It still works great with hashcat output too, but in that case you have to tell it the type of hash. You don't have to follow my practice of naming the files by hash type. You could, for example, choose to name them:

isw2012-raw-md5.m0
isw2012-raw-md5.m2600

and so on (though I think this is less descriptive).

So if you were using hashcat to solve this list for md5(md5($pass)) hashes, you would say: hashcat -m 2600 -a 0 isw2012-raw-md5.txt -o /tmp/2600.out /dict/mydict.list
Then, when it was done, you could say: mdsplit -t MD5x02 -f /tmp/2600.out isw2012-raw-md5.txt
which would then take the solved hashes from 2600.out, and remove them from the isw2012-raw-md5.txt list.

The same command, in mdx land, would be: mdxfind -f isw2012-raw-md5.txt -i 2 /dict/mydict.list | mdsplit isw2012-raw-md5.txt
(but mdxfind would solve for both MD5x01 and MD5x02 at the same time, since it has to calculate the first hash in order to calculate the second one - saving time)
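Putting the hashcat route together in one place (the same two commands as above; the paths are only examples):
Code:
hashcat -m 2600 -a 0 isw2012-raw-md5.txt -o /tmp/2600.out /dict/mydict.list
mdsplit -t MD5x02 -f /tmp/2600.out isw2012-raw-md5.txt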

But what does this have to do with rling? getpass, and how to effectively re-use your solved passwords to build your own lists.

getpass is the utility that comes with rling that extracts passwords from solved hash lists. It "knows" the standard way I organize my lists, so it can use shortcuts, if you choose to organize your lists the same way.

To get a list of all of the solved MD5 passwords from this list, I use: getpass isw2012-raw-md5.MD5x01 >/tmp/isw.pass
This will extract just the solved passwords from the MD5x01 list. But if I want *all* the passwords, I can say: getpass isw2012-raw-md5.txt >/tmp/isw.pass
getpass "knows" that the .txt file is the "root", and will search for all other files with the base name (isw2012-raw-md5.\*), skipping the .orig file and others automatically. These can be customized with the -x switch, or you can disable this with the -t switch if you want to specify each file individually.

If you have been cracking hashes for a while, it is also instructive to go over these "solved" hashes. You may find, like I did while writing this post, that previously hashcat-solved hashes don't work with the supplied passwords. This is likely due to NULs and UTF-16 expansion. I ended up with hundreds of these. Some, I solved already. Others, I'm still working on.

Solved: 08ea71e8eda54475beb2e7da049acc77:!Sect<F6>r/G became 08ea71e8eda54475beb2e7da049acc77:$HEX[2153656374c3b6722f47]
Unsolved: e58f4f55ef27c675182c02a7196e07b2:<D2><C5><C4><C9> <D4><D5> <ED><C1><DE> <D4><D2><C9>
(both are NTLM)
 

WreckTangle

Oh yeah - there is a new rling out. I've ripped out the Berkeley database system I was using, and replaced it with a "virtual memory" system. This allows people even with very small memory sizes to process large files using the -f switch, and it is a *lot* faster. Here are results using the same data as before. Yes, it's not going to be as fast as having everything in memory, but it saves you from having to split files up. Enjoy!
Thank you Waffle, this is very much appreciated. Speaking as a user with only 8GB RAM, I was finding it hard to enjoy the benefits of using RLING.

Just as a quick, non-professional test, I got the following results.

Win7 64bit 8GB RAM
Working on 4.82GB Word list
Sort and de-duplicate

RLING Time Taken = 1 Hour 19 Minutes 16 Seconds

sort64lm Time Taken = 17 minutes 6 Seconds

Is there a way to see what sort64lm is doing which makes it so different?

Thank you again for all you do.
 

Waffle

Sort64lm was designed with different goals in mind. It has limitations which make it unusable for large lists (it crashed on multiple files while I was preparing this post). If it works for you - great. The reason rling exists is because it has very few limitations. Unlimited (really) line length on input. Unlimited file size. Use with named pipes. Mostly-unlimited line length on remove-list files (about 25,000,000 characters, but it would be easy to remove this limit too). Very, very high speed.

But sort64lm can't do what rling can. Let's look at some examples:

sort64lm -u -i 1bs.txt -o out takes 65 seconds on my machine, but fails (it looks like it read only 2.5G of the file, 128M lines)
sort64lm -u -i 100M.txt -o out takes 26 seconds
rling 100M.txt out takes 18 seconds

So, while it may be faster in corner cases, it has built-in limitations which make it unsuitable for use on larger lists. rling was designed from the ground up to be unlimited in size and high speed, for dealing with large-scale list processing. If sort64lm works for your lists - again, great - that's what you should use. It did not work for me (on anything but my smallest lists), and ULM does not have the scale for my processing. It also does not have the rest of the features that I needed.

On the other hand, an 8G memory upgrade for your computer is $30-60 on Amazon :-) In order to process large lists in memory, the memory actually has to be there. rling doesn't waste memory - it only asks for what it really needs, but it really uses all that it asks for.

Or you could split your list in two, and enjoy all of the high speed features of rling.
 

Waffle

Yeah - one more thing. The 4.82G file you are using? How many lines does it have (use wc -l if you have it, or the line count from rling)? As best as I can tell (still waiting to hear from Blazer for confirmation), sort64lm is limited to around 130 million lines. Check the output file from your 4.82G file to make sure it really has the right number of lines...
 

Waffle

Ok, I fixed Blazer's code, so now it works correctly on large files. There were a few issues on Unix, which caused it simply not to work for large files.

On my system (ancient x86_64 box with system load of 16):

sort64lm -u -i 1bs.txt -o out takes 741 seconds
rling 1bs.txt out takes 111 seconds.

but that's hardly fair, since rling does no sorting in that case (just the de-dupe function).

rling -bs 1bs.txt out takes 207 seconds

So, either way you look at it, rling is still going to be faster at basic sorting - which is the least of its uses - given adequate memory. Again - rling is not a sort program. I can build a sort program similar to Blazer's which uses a small amount of memory, and offers adequate performance.

The *primary* purpose of rling is processing a single input file against a list of other files, to give you new lists which contain the difference (or, using -c, the matching lines) against a large number of other lines. Without having the source line in memory, you cannot do a compare operation against those other files (well, yes, you can do an approximate match with a hash, but then you still have to go back to the original file to do the final compare). That's rling's primary purpose, not just sorting.

So, either the source file has to be in memory (the default hash mode keeps each line, its hash, and a pointer to that line in memory), or you sort the input list in memory and keep just the line and a pointer to it (which is what the -b option does), or you pre-sort the files on disk - and in that case you can handle terabyte-size files with only 50M of RAM in use (which is how the -2 option works).

If you don't want to have all of the file in memory with a hash table, and don't want to have all of the file in memory and do a binary search, and don't want to sort your files on disk, then the *last* option is the -f option, which effectively does the same as the -b option, but mmaps the files into memory. This allows the operating system to page in and out as required. Yeah, not ideal, but if you don't want to sort your files, this will still allow you to use all of the features of rling, albeit at a slower rate. Not impossibly slower, but slower.
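To summarize the trade-offs as commands (a sketch; the flags are the ones discussed above, while big.txt and the remove lists are placeholders):
Code:
rling big.txt out remove1.txt remove2.txt       # default: line, hash, and pointer all in RAM - fastest
rling -b big.txt out remove1.txt remove2.txt    # binary search over in-memory sorted lines - less memory
rling -2 big.txt out remove1.txt remove2.txt    # files pre-sorted on disk - roughly 50M of RAM even for huge files
rling -f big.txt out remove1.txt remove2.txt    # mmap-backed "virtual memory" - slowest, smallest footprint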

But if all you want to do is sort - use a dedicated sort program. It will do the job better than rling if you don't have enough memory on your system. If you *do* have enough memory, rling will smoke even a great dedicated sort program, even doing a basic sort. And Blazer's sort64lm is great (now that it works :-)
 

WreckTangle

Hi Waffle

Thank you for your reply, explanation and testing.

I hope you did not take my question as criticism; it certainly was not meant that way.

I appreciate your point that users of RLING who only have 8GB RAM can go out and buy more RAM to work on lists larger than 4GB. My point is: what about users with significant RAM, but just never enough for the ever-growing size of password lists?

I do understand your point that users can continually split lists until they are small enough to use.

sort64lm.exe does not work for me in the way I think you mean - well, it does, but it has bugs - which is why I was pleased to see you making a word list tool such as RLING, and why I have made complimentary comments throughout your thread.


The 4.82GB file I was using was picked at random from my word list collection. It contains 518469125 lines.

I tested again with the original sort64lm.exe with the same 4.82GB list to check for the number of lines outputted:

sort64lm.exe Time Taken = 16 minutes 57 Seconds

518469125 Lines Input
518469125 Lines Output

As you have fixed sort64lm.exe, are you allowed to share it or is it copyright to Blazer? Perhaps Blazer could host your fixed version?

I think one of the most significant problems I had with sort64lm.exe was handling files with certain names. I noticed that names containing several underscores caused issues.

Yes, I understand RLING's primary purpose, and why you made it work the way it does to achieve its performance. I think you get me wrong - I am a huge fan of RLING. I guess my comments and requests come out of frustration at not having a computer like yours, as much as anything.

I can see you are talented and so few coders seem interested in word list tools which is why I have somewhat hogged your thread :)

I can build a sort program similar to Blazer's which uses a small amount of memory, and offers adequate performance.
Just tell me how low I would need to grovel before you would take enough pity on me to create something like that? :)

It would be VERY COOL if I could interest you in keeping RLING as the high-performance sports version, but perhaps also considering more day-to-day tools for normal, less professional users who need to work on huge lists but lack the money to buy a computer like yours - or those who work on lists larger than the RAM their machine can hold.

The temptation to bombard you with feature requests and tool requests is huge and likely to make you dread my posting back LOL. Please keep up the good work and I sincerely hope you understand the difference between my frustration and enthusiasm for your software.
 

Waffle

One of the prime reasons to use rling is to *prevent* the explosion of password lists. There is (simply put) no reason to keep monolithic lists.

1) There is no reason (ever) to have a monolithic list. Make your lists smaller. 518M lines in a list could be split up into 100M line chunks, more or less.

2) As @tychotithonus pointed out, keep smaller lists, use rules. As an aside, did you know that while using MD5 with mdxfind, for example, you can run 4 rules in the same time as you can run 1 plaintext? That's certainly not true with every algorithm, but it is still an argument for smaller lists.

3) When you get a new list, keep only the new words. That's a great use for rling.

A hashkiller list I downloaded earlier (sometime in January) had 266M lines in it. Let's see what happens when I process that against the top 10k passwords and the rockyou list (just as two examples):
Code:
rling -bs /big/hashkiller-dict.txt out /big/words/10kpass.txt /big/words/rockyou.txt 
Reading "hashkiller-dict.txt"...2961969523 bytes total in 1.3167 seconds
Counting lines...Found 266518036 lines in 4.4251 seconds
Estimated memory required: 5,198,971,443 (4.84Gbytes)
Sorting... took 30.0079 seconds
De-duplicating: 266518036 unique (0 duplicate lines) in 2.3639 seconds
Removing from "/big/words/10kpass.txt"... 9898 removed
Removing from "/big/words/rockyou.txt"... 4675936 removed

4,685,834 total lines removed in 4.5239 seconds
Writing to "out"

Wrote 261,832,202 lines in 56.9410 seconds
Just like that, almost 5M duplicate passwords gone from that list - and if I were to apply even one more existing list, like the hashes.org 2012-2019 list, the total would be 200M passwords removed, leaving only 66M "new" passwords.

I've been busy cleaning up my own lists, since starting rling in July. I was staggered when I realized just how much time I had been wasting, trying the same password multiple times throughout my lists. When I got a new list, I was just tossing it into my "words" directory, thinking I would get around to cleaning them at some point.

Split your lists up into 100M line chunks. Call them "mine-01.dict", "mine-02.dict", and so on. As you get a new list, split *it* into 100M line chunks, then run those chunks against all of your lists, combine what's left, and append that to your new "mine" lists. As you crack new passwords with rules, if you want to keep those (for whatever reason), build a new set, using getpass, and call them "rules-01.dict", "rules-02.dict", or "cracked-01", "cracked-02" and so on. Build lists that work *for* you, not against you.
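A sketch of that workflow with standard tools (the chunk size, names, and paths here are just examples):
Code:
split -l 100000000 newlist.txt newchunk-          # break the new list into 100M-line pieces
for f in newchunk-*; do
    rling -bs "$f" "$f.clean" /dict/mine-*.dict   # keep only lines not already in your lists
done
cat newchunk-*.clean >>/dict/mine-new.dict        # append the genuinely new words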

Here's a neat trick that @tychotithonus got me to add a few versions ago. Let's say you have your dictionaries in /dict. You can say:
Code:
rling -bs /dict/mine-47.dict /dict/mine-47.newdict /dict/*
Now, normally, that would also match mine-47.dict, and thus remove all of the lines from your file, because they would all (of course) match. He asked me to change this to skip the remove file, if it was the same name as the input file. This makes it easy to clean a list against all of your other lists. Yes, it's still possible to shoot yourself in the foot (by manipulating the directory path, or linking files, etc), but for the most part, it will do the right thing.

Anyway - if all you have is a limited amount of memory, get creative. If I can get by on my Raspberry Pis, you can get by with 8G. Just stop trying to do more than your system is capable of, and work _with_ your system, instead of _against_ your system. Split your files.
 

Waffle

Don't say it again

Sometimes, you need to try things. Sometimes you need to generate some passwords to try. While this is critical during contests, it happens in the real world too. You notice a pattern - say, words followed by numbers. Or word number word. Or adding different endings to words (ed, ing, ly, and so on). A combinator attack is often the best approach for this, sometimes offering even better performance than hashcat masks and brute force.

combinator is a tool in the hashcat utilities that allows you to combine all words in one list with all words in another list, to create a new list. But this can generate a very large number of words to try, very quickly. What's worse, depending on the lists, you might end up generating the same word multiple times. This can also happen with other generation techniques, and programs that generate words generally have no ability to detect these duplicates. So, how do we deal with them?

One way is to use rling -bs stdin testfile. This does the job, but if you just want to *try* something out, it means you have to wait for all of the words to be generated, then sorted, and then the output written. If it's a throwaway list (one that you are just using to see if it might be worth a shot), wouldn't it be better to have a way to "de-duplicate" on the fly?

Enter dedupe, part of the rling utilities. This program accepts words from files, or stdin (or named pipes), and writes *only* unique words to stdout. Duplicate words are automatically discarded, and the "insertion time" of dedupe in the path is quite small (a typical latency of only a few microseconds). Internally, dedupe uses a Judy array to store and check the unique words, and it also automatically $HEX[]-processes the output if required. This does mean that it has to store one copy of each line in memory - thus it uses about 3 gigabytes of memory for every 100,000,000 lines, which is more than rling uses, but it offers "real time" output of each line, instead of a delay. You can reduce this, at the expense of an "approximate" dedupe, using the -x switch, which stores only the hash of each line, rather than the whole line.

To use it try:
Code:
combinator words suffixes | dedupe | mdxfind -f myhash.txt stdin >myhash.res
Of course, you can redirect the output to a file - but in that case, it may be better to use rling.

Anyway, another tool for your password cracking convenience.
 

WreckTangle

I accept I am going to have to start splitting my lists up to use RLING.

You kindly made splitlen.exe which is VERY useful.

Could I interest you in making splitchar.exe? This tool would allow users to split their lists in the same way splitlen.exe does but based on characters.

Ideally with an option to be case sensitive or not. This tool would allow users to split their lists up in different ways by length or by character to better organise themselves for different scenarios.

Splitting into understandable groups will allow for easy review of the content to look for low quality passwords to be removed.

Also the user might be able to make fewer total lists (but still small enough to use RLING) than splitting by length.


Sorting in RLING
When sorting, would it be possible to please allow the user to choose case sensitive or not?


Frequency analysis

Could you please make it so the user can choose the minimum occurrences of a password before it is included in the results file?

The simplest way to describe what I mean: imagine the user has a 1 GB text file with unique content. The results output file is the same size, as RLING shows the user that each password has an occurrence of 1.

If the user is able to set (for example) the minimum occurrence level to 2, the output would be nothing. However, if there were a single duplication, then the output file would contain only 1 result, showing 2 occurrences.

Imagine the same example, but with the threshold set to 100 on a large original leaked list. The analysis list would be more practically useful, as it would only output significantly popular passwords with a popularity of 100 or more.
 

Waffle

A better choice, if you want to split by character, remains grep. grep ^[Aa], for example, would gather all of your words starting with A or a together. It has the case-insensitive flag, -i, as well.

I challenge you, however, on "splitting into understandable groups will allow for easy review". I see no way in which a human could review 500M passwords. Or even 50M. Or even 50,000.

I use automation for this, and pattern analysis. For example, I will often strip numbers from the beginning or end of passwords, ending up with basewords. I then apply rules and/or other techniques to re-extend the passwords.

The only sensible way to split large lists is by line count. Use split -l for this. Break them into groups of 100M, or 50M, or whatever makes sense for you.

Separation by language may also be useful.

With regard to sorting, no, lexical order sorting is required to make the other functions of rling sensible (like -2). By default (unless you specify -s), rling will not sort the output, and will preserve input line order, so you can feel free to store your lists in whichever order makes sense for you. I keep all mine in lexical order, with the exception of my probe lists, which I keep in frequency order. These help me to identify new hash methods.

With regard to -q, no, I won't be adding a lower limit. This is trivial to do in post-processing (using a perl or awk script - see the sketch after the example below), or just using head (since the report is sorted in frequency order anyway, the top words are always going to be in the first thousand lines). Because you can output the report to stdout, this becomes trivial to do in a pipeline, and it would not make rling use any less time or memory (since all of the input file must be read to determine whether there is more than one of any particular word).

Code:
getpass current_file.txt | rling -q clsw stdin stdout | head -1000 
...
   Count   Length  Percent   Cumul Line
    2533        6    6.53%   6.53% 123456
    1642        9    4.24%  10.77% 123456789
     720        7    1.86%  12.63% 1234567
     488        8    1.26%  13.88% 12345678
     346       10    0.89%  14.78% 1234567890
     337        6    0.87%  15.65% 111111
     313        6    0.81%  16.45% 123123
     270        6    0.70%  17.15% 121212
     231        9    0.60%  17.75% kurdistan
     189        6    0.49%  18.23% 000000
     187        6    0.48%  18.72% 112233
     169        6    0.44%  19.15% 123321
     128        9    0.33%  19.48% 987654321
     106        8    0.27%  19.75% 11223344
      99        6    0.26%  20.01% 654321
      98        9    0.25%  20.26% 123123123
      92        8    0.24%  20.50% kurdstan
      81        7    0.21%  20.71% 1111111
      81        8    0.21%  20.92% 11111111
      81       10    0.21%  21.13% 1234512345
      79       10    0.20%  21.33% 1122334455
 ...
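If you specifically want a minimum-occurrence cutoff rather than a fixed number of lines, a short awk filter over the same report does the job (a sketch; 100 is the example threshold, and the count is the first column of the report shown above):
Code:
getpass current_file.txt | rling -q clsw stdin stdout | awk '$1+0 >= 100'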
That's just one example. Remember - -q is *only* to be used on solved hashes, and is *best* used on solved hash lists that are either:
1) not uniqued prior to solving, so that the same hash appears many times in the output file
2) on salted or username-seeded hashes, so that password re-use *can* be seen.

It can be used on many lists, but the results will be misleading if you have artificially "helped" the process by removing duplicate hashes. Yes, hashcat isn't particularly fast at removing duplicates (mdxfind is very fast at this), and it is tempting to remove them. This will, naturally, distort and render invalid any attempt at frequency analysis.

It's also totally pointless, other than from a histogram sense (-q h), to analyze password lists, since they should already have been uniqued.
 

Waffle

Oh yeah - another great use for rling. To generate lists in frequency order, use getpass (or another technique) to get your solved hash passwords, then use -q w to output them in frequency order. Doing this over a large number of solved lists will give you a very, very useful list.

Code:
getpass *.txt | rling -q w stdin mypass.pass
 

WreckTangle

I challenge you, however, on "splitting into understandable groups will allow for easy review". I see no way in which a human could review 500M passwords. Or even 50M. Or even 50,000.
I split and sort in different ways to scroll through lists looking for salts or patterns to help me clean my password lists.

If there is a huge list and I sort alphabetically as you say, scrolling through takes a long time and it is difficult to notice patterns.

Outputting to separate lists sorted by length using splitlen.exe is very useful; it allows a human to see the lists presented differently, and may help (it certainly does for me) to find patterns of low quality words, which I make regexes to remove later.

However, if splitchar.exe could output by character (including digits and specials), the lists would be smaller and, when sorted, easier for a human to scroll through looking for passwords starting with a common salt.

Just changing the method of sorting and splitting presents a different view and simply helps me find low quality passwords with a common theme.

Cleaning lists is more important to me than getting new lists or making hashcat rules now. This is why I tried to interest you in making a Waffle RegEx tool because having a fast RegEx tool will do more for my cracking speed than anything (except for duplication removing).

Testing duplicate passwords is a huge waste of time, but now that removing duplicates is easy, the next stage for me is removing junk text from password lists.

I know I cannot interest you now but if you ever get bored and feel like a new coding challenge then please consider a Waffle replacement for ReLP.exe. grep is just too slow for the number of regex I use.
 

Waffle

grep will do all that you want (and more), and can also help you find embedded salts. But there is no way that a human can scroll through hundreds of millions of passwords, let alone scroll through and remove low quality ones.

Grep already runs at a fine speed, and will generate output far faster than I can read it :-) I might be able to write something faster, but why? The hashes.org 2012-2019 list contains 1,270,725,636 lines. grep -a ^[Aa] takes 60 seconds to run through it on my ancient x86_64 system, producing 67,956,038 lines. That's a million lines per second of output. Pretty damn fast.
Code:
for val in a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9
do
grep -ai ^$val input.file >$val.output
done
grep -avi ^[a-z0-9] input.file >others.output
This will split your file up into words starting with a-z and 0-9, then create one file for the "rest" called others.output. It will run through 1.2 billion lines in about 30 minutes, and leave you ready to post-process them exactly how you like. You can start working on the first one after 60 seconds - no need to wait for the last one to complete...

Really, there's nothing more that I need to write. You have all of the tools you need to do what you are suggesting - and then some. All you need to do is spend a bit of time learning how best to use them.

For example, I just tried something. I took an existing solved hash list, unhexed the plaintexts (using rehex -h), and cut each line in half. I put the first half in one file, and the rest of the line in another file. Then I used combinator to try all the combinations of endings. Found another 400 MD5 solutions, just like that.
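Roughly what that experiment looks like (a sketch; it assumes the solved plaintexts are already unhexed into plain.txt and are plain ASCII, and the combinator/dedupe/mdxfind/mdsplit usage follows the earlier posts):
Code:
awk '{ h = int(length($0) / 2);
       print substr($0, 1, h)  > "front.txt";    # first half of each solved password
       print substr($0, h + 1) > "back.txt" }' plain.txt
combinator front.txt back.txt | dedupe | mdxfind -f myhash.txt stdin | mdsplit myhash.txt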
 

WreckTangle

I'm sorry I am not very good at explaining myself which is always a problem for me.

But there is no way that a human can scroll through hundreds of millions of passwords, let alone scroll through and remove low quality ones.
I do have a method which is at least making an attempt.

When changing how I view the lists I can sometimes see a pattern. Sorting alphabetically then by length gives different views.

If I scroll down a huge list at speed, every now and then the text seems to stay fixed. At that point there will be thousands or more items starting (or ending) with the same group of characters. So no, I don't read every word, but I can see when there are many with a similar generated pattern.

This is particularly useful for salted passwords like...

the river runs through it password1
buzzmachines password1
...etc

When I whizz through a list and see the text "the river runs through it" suddenly appear fixed at the start of many lines, I know I have found another password salt; I stop scrolling and add it to my RegEx lists.

It is possible to find these salts manually, although it is very slow. An automated way would be cool, but there isn't one available.



I understand grep could work on one RegEx at a time at a speed useful for your needs. I have hundreds of RegEx to run through on many, many GB's to clean my lists.

I have been waiting for someone who is clever enough to make these tools which is why I have jumped on your thread. I understand your explanations of how to use workarounds to achieve what I need with the current tools available.

I was just hoping to interest you in basically fixing / upgrading the suite of tools by Blazer.

For example, the cross-reference tool by Blazer seems to me (I have no idea how it works) to only load the reference list into RAM, meaning a user can cross-reference larger lists than with RLING. I might be wrong about how it works, but I can certainly work on much larger lists with ccr64.exe than I can with RLING, and at a very good speed.



The reason I was trying to get you interested in making a new RegEx tool was not just to split A-Z; it was to run hundreds of RegExes, stored in text files, against many GBs of password lists. I just wondered if you could make something faster than ReLP.exe, as some of the runs can take over 30 hours.

I am simply trying to make cleaner, therefore smaller, password lists.

I can see I am somewhat hijacking your thread, and it would seem my needs are more extreme than what everyone else appears to want.

With this in mind I will thank you very much for making splitlen.exe for me and unless you find a need for a fast RegEx tool yourself I will stop pestering you and filling up your thread with requests.

Thank you very much Waffle for your work on RLING and especially splitlen.exe.
 

Waffle

So, what are bad passwords, who put them there, and how can I remove them?
(this is gonna be a long one, bear with me)

The topic has been brought up recently, not just by @WreckTangle, and I thought I would share my thoughts and how I've been dealing with them over the years. This is a complex topic, and is not covered in the depth I think it should have been. I've posted on Hashkiller previously about it, and it has been a topic of disagreement even in cracking contests.

In my view, there are four classes of "bad" passwords. There is a fifth category that I am specifically not addressing here, which is "bad parsing". Troy Hunt's lists are particular examples of this.

1. Passwords which are unlikely to be re-used, because they are random or machine-generated. It may have been an achievement to crack a 30 character password +0^AtOUrjubwD/Hu$ojmwuUMk'NoK@, but the likelihood of it being re-used is quite small. Does it belong in a general purpose list? Well, if you are cracking a list for a contest, and it's one of the critical high-value entries, then you better have it in your list. But otherwise, no. It does not belong in a generic list. A subset of these are "known" values, such as BAF3A9C3DBFA8454937DB77F2B8852B1.

2. Passwords which contain salts and/or peppers. A good example of this is the river runs through itunbroidery, xCg532%@%gdvf^5DGaa6&*rFTfg^FD4$OIFThrR_gh(ugf*/es@live[redacted], or somepasswordBSF75663. Each of these used a static salt, prepended to the actual password. Some have even been more complex, embedding separators, and the like.

3. Passwords which contain usernames, or userids. A good sample of these are 1351-saralucia, Admin1234, and pqonieneo4N36tbfzC. These, too, can be quite complex, involving padding, embedded NUL characters, and so on.

4. Passwords which, while hashing to the correct value, are not in fact the correct password for the original hash. These break into two sub-categories: hash algorithm collisions, and incorrect hash types. In the realm of password collisions, some hashing algorithms (well, in fact, all hashing algorithms, but some are more likely to have this problem) generate the same hash for multiple inputs. Incorrect hash types are a broader category, but include hash solutions like 5f4dcc3b5aa765d61d8327deb882cf99G%Y, 55ae17202f23e50f30883ee4bb581001 and a60259fe312f320dfcf2c003041b9218$yHlU/^\.R3',Dvt7,%VY~(P~s$S.{.

Let's talk about each of these, why they are there, how they make their way into lists, and how to get rid of them (if you think it is appropriate).

Random password

Random passwords, or machine generated passwords, can be very effective. Properly random passwords are very difficult to crack, and if not re-used, are a good security measure (though typically unworkable for humans, since they inevitably put them on sticky notes on the side of their monitors). A user recently asked me "how do I get rid of the random passwords in my lists" (and was more specific, indicating that he knew there were fixed-length random strings). But, regretfully, there is no automated way that I'm aware of that can analyze a password, and declare it random. "sotesifaa" may look random, but it's actually the first letters of the first paragraph of page 124 of a book on my shelf ("the 8087 primer" if anyone is interested :-). Many of the lists do have lots of 5 character random passwords in them, as well as 10 character. Turns out that many of these were added as a result of a class 4 (see above) cracking attempt. Should they be included? This is a deeper issue.

Many hash lists found contain *lots* of seeded, or random hashes. Sometimes, these are used to identify where the lists leaked from. Sometimes, they are added to add bulk to lists, to give them the appearance of being larger than they are, or to hide the origin of the list. Sometimes, users are added *by the site owners* themselves, to artificially inflate the user count. When these are cracked (often by long term brute force attacks), these random passwords get lumped into the "found" pile, and end up on password lists.

A "special case" of these are the "known value" hashes. BAF3A9C3DBFA8454937DB77F2B8852B1 for example, is the root key hash for a version of the Nokia phones - known to many that were doing phone hacking back in the day. There are many other "known value" passwords that are used, that appear to be other things.

Should they be there? Yes, they should remain on your general purpose lists. They are not useful for working on complex or expensive hashes, which is why you need to generate *your own* "top 10,000 passwords" list. These are the passwords which have been seen *most frequently* on the types of lists you have to work with. Your top 10k list is not going to be the same as mine. If you do pentesting for "Joe's Bakery and Shell Reloading", the passwords used are going to be different than pentesting for a "Dragon Farm" website.

Extra salt

Passwords which contain embedded salts and/or peppers should be stripped. These are, actually, a subset of class 4 - using the wrong hash, and getting the right value. Because a given application may be adding a static salt internally, that salt may not be available in the hash lists, or (in some cases) not even known. Some of these salts and peppers have been found through pure luck (but far more often by careful examination of the source code for a given application). Now, because the salts/peppers do not appear in the hash lists, some people think that the only "correct" solution is to include the salt/pepper as part of the cracked plaintext (because, in many cases, the algorithm cannot be solved with Hashcat without doing this, since Hashcat does not support that particular hash-with-salt-and-pepper method). This is wrong: even if you have to post-process the results, or use a different program (like mdxfind), only the actual password should be stored.

The problem arises when you are called to crack an unknown list. If that list contains salted and/or peppered variants such as the examples given, then you won't find them. This increases the time-to-ID for new hashes, unless you also pre-process your lists with known founds. As usual, this is a trade-off, and one that you will have to weigh for yourself.

How to remove them? This is a two-phase problem, involving both your "solved" passwords, and also the original file. I think the easiest way to do this is to work through your solved passwords first, then clean up the original list that contained them. When I'm doing this, I take all of the "suspect" solutions, and place them in a working directory. I then spend time (sometimes minutes, often hours, occasionally weeks) working through them, and solving them with the correct algorithms.

I then take all of the "bad" password solutions, and find the common elements. For example, let's look at "somepassword" as a salt:
Code:
grep ^somepassword infile >work
# edit work to confirm it contains only the passwords I want to get rid of
sed 's/somepassword//' work >>infile     # add the de-salted passwords back to the list
rling -bs infile infile.new work         # remove the salted originals
cat work >>junk
mv infile.new infile
"junk" passwords, a term I followed from the hashes.org "junk" password lists, can then be merged into the junk lists. These are *still* useful for the future, but not part of the "everyday" passwords that I will use.

Users and padding

When usernames are included (or usernames and padding), this is handled in much the same way as salts. What becomes different is how you re-solve the hashes (again, copy them to a work directory, work through them to identify the proper algorithm, then apply that to the original hash lists). Once you have identified this as a username-included solution, you may be able to find many more passwords in the original password list that have a common username included. When the usernames approach being unique per user, this is another good reason not to have them in the original list. Admin is an example of a username which *could* appear as part of a password, but if you see GenMahem47-monkey, you might reasonably assume that the username is GenMahem47, and the password is monkey. Again, this is one of the cases where judgement, and the use of a "junk" password list, may be very useful.

Password removal techniques follow the general case of the salted passwords, shown above, but significant work should go into the vetting of the removal lists. Pay attention to patterns that reveal themselves (are usernames really user numbers? Are they sequential? Are you seeing the general case of [0-9][0-9][0-9]-password, for example, throughout your password lists?). This can help you in analysis.
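The mechanics mirror the salt example above. For instance, for the [0-9][0-9][0-9]-password pattern, something like this (a sketch; the same caveat about hand-checking the work file first applies):
Code:
grep -E '^[0-9]{3}-' infile >work        # candidate user-number-prefixed entries
# review work by hand before going any further
sed -E 's/^[0-9]{3}-//' work >>infile    # add the bare passwords back
rling -bs infile infile.new work         # drop the prefixed originals
cat work >>junk
mv infile.new infile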

Padding is another issue. Sometimes, you can see "Joeblogs password", or null-padded passwords. These may show that there is a fixed-length-maximum username field, followed by the password. Pay attention to details - they matter.

Invalid hashes

This was one of the core reasons that I made mdxfind. Hashes within hashes. When I started to see "passwords" of the form 5f4dcc3b5aa765d61d8327deb882cf99G%j, I knew I had to do something. But they didn't stop there. I saw b78b3c0674aed5a05ecf4931d20d1c3d, and f881b9b4af89da8a203598b410e1d846 and more.

Today, I regularly look through my MD5x01, SHA1x01, and many other "solved" hashes for strings of hex digits. Virtually all of them are the results of incorrect hashes. If you see in your "solved" list a hash like 1ed735adc33427448d0c7264e479be40:5f4dcc3b5aa765d61d8327deb882cf99G%j, you should know that, while this is a "valid" solution (meaning that if you take the MD5 of the part on the right of the :, you will get the result on the left), it is in no way a "password" solution.
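A quick way to surface those suspects in a solved list (a sketch; the pattern just looks for a run of 32 hex characters at the start of the plaintext side of the hash:plaintext lines):
Code:
grep -E ':[0-9a-fA-F]{32}' isw2012-raw-md5.MD5x01 >suspect-hex-solutions.txt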

These require work. There is no shortcut.

Much of my time is spent cleaning up lists like this. In general, I will treat *any* solution with a length over 16 characters as suspect (since, according to the histograms of my "true solved" passwords, lengths over 16 characters represent less than 1% of the solutions). I get *very* interested in lengths >= 32 characters. So, how do I go about it?

Code:
awk 'length > 64 {print;}' *.MD5x01 | grep -v HEX >work/level1.txt   # suspiciously long solutions, not $HEX[] encoded
cd work
cut -c 34-65 level1.txt >level2.txt     # the 32 characters right after "hash:"
cut -c 66- level1.txt >level3.txt       # whatever follows those 32 characters
mdxfind -f level2.txt /dictall/* | mdsplit level2.txt
mdxfind -f level3.txt /dictall/* | mdsplit level3.txt
cut -c 34- level2.MD5x01 | mdxfind -f level1.txt -h ^md5salt$ -s level3.txt stdin | mdsplit level1.txt   # try the recovered inner words against level1, with level3 as salts
...
Then I repeat, looking for other hashed and salted variants. Then, instead of only looking at 32 characters in the "level2" hashes, I look at 40, and 64 characters. Then I try different combinations. Rinse and repeat, as required.

Once I find sets of invalid hashes in the password files, I again extract them to the "junk" list, and remove them from the standard working dictionaries that I use. The work on this is ongoing, and is not a short-term issue.


How to prevent this?

Know your passwords. Look at the solutions. The "most important" thing you can do is not to just blindly look for the "latest great password list", but to understand how people think, how passwords are generated, and what kinds of things those passwords are protecting. Build your *own* lists from passwords you have solved, then add rules as @tychotithonus suggests. Start with the most common passwords - this is particularly important on salted or high-iteration hash algorithms. Use brute force as a last resort, not as a crutch.

I get more done with 1M passwords and rules than many people with billions of "Real Passwords Solutions - only $19.95".

But *most* important - don't give up. Try new things. Share your knowledge.
 

Waffle

New version of rling posted - this should offer better read speed on files. Probably close to the same on pipes, but line counting should be faster - I spread the load between the cores better in this version.
 

WreckTangle

Hi Waffle

I am having a problem using rling.exe

System win7 64bit
rling version 1.73
Command used rling.exe -b -s input.txt output.txt
Text file is 2.06 GB
System RAM 8GB
Since rling.exe -s requires over 8GB to sort this 2.06GB text file, "-b" is used.

The problem is that the output file is empty after the sort and de-dupe process completes.

Thanks.
 