rling (rli replacement) testing required

Waffle

I have some new code I've been working on for the last couple of weeks, with the usual CSP folks. This is a replacement for rli, but substantially faster (10x to 20x).

rling accepts an input file with any number of lines. It will de-dupe the input (unless you ask it not to), then remove any lines that also appear in a list of "remove" files.
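A minimal sketch of the intended invocation (the file names here are just placeholders; fuller usage examples follow later in the thread):

rling new-words.txt cleaned.txt old-list1.txt old-list2.txt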

It's quite fast - small files (100K lines or so) are effectively instant. Large files (1,000,000,000 lines, 10 gigabytes or so in size) are processed in about 25 seconds on a moderately fast machine.

I need some testers, please. I have pre-compiled binaries for AIX, PowerPC (BE and LE), Intel, Windows, and the like, and source is available.

I need you to try to break this, please. We've been testing it for the last couple of weeks, and I would like to expand that.
 

Waffle

So, it looks like the code is pretty much done now. It's a lot faster than rli/rli2, and you can grab it from https://github.com/Cynosureprime/rling. There are binaries for both Windows and Linux (x86-64, 64 bit), plus source code. There are also additional utilities that really help with processing wordlists on a massive scale.

To save everyone some time - if you don't create wordlists, but simply consume them, you probably won't need these tools. If you process any amount of data, generate your own wordlists, or do more extensive work with large lists - have a look. I'll give some usage examples, so you'll have an idea of what these tools are intended for, and how best to bend them to your will.

In basic use, you can use rling (ARE-ling - yes, it's supposed to be RLI next gen, but ARE-ling sounds better) to simply de-duplicate your lists, while retaining their existing order.

rling infile outfile

will read all of infile into memory, de-duplicate it, and write the output to outfile. Because it reads everything into memory, it's very fast, with typical speeds of millions of lines per second. On one of my test systems, it takes about 25 seconds to read, de-duplicate, and write a 10 gigabyte, 1,000,000,000 line file. In general, though, you are going to need about 3x the file size in RAM. Because of this, rling has several modes.

rling -b infile outfile
rling -f infile outfile
rling -2 infile outfile /dev/null

By default, rling uses a hash table to speed up lookups. This takes extra memory, so the -b option forgoes the hash in favor of a binary search, using about 2/3rds of the memory of the default. That can still be considerable if you have large files, so the -f option creates an on-disk database, allowing you to handle files of unlimited size (well, limited by your disk space) while using very little RAM. -2 uses even less RAM, but the input files _must_ be in sorted order.
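To put rough numbers on that (a rule-of-thumb estimate, not a guarantee): a 9 gigabyte file with around 700 million lines needs an estimated 31G or so in the default hashed mode, roughly 14G with -b, and -2 gets by with little more than I/O buffers.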

Most people will be able to get by with just the defaults. But rling can do a lot more than just "dedupe wordlists". Let's say you get a new wordlist, and want to run it against some hashes. But you don't want to try words you've already tried.

rling new-words.txt outfile /dict/old/* /some/other/files*

rling will rip through your old wordlists (at millions of lines per second), comparing them against the new-words.txt file. Any words that already exist in /dict/old/* or /some/other/files* will be removed from new-words.txt, and the result written to "outfile". Handy.

rling can also be used in a pipeline, which is great if you have a bunch of wordlists, and want to put them together in specific ways. For example, getting ready for a contest?

zcat /archive/names/firstname.[a-n]* | rling stdin stdout | gzip -9 >names.a-n.gz
grep -h '^[a-fA-F]' /archive/names/lastname* | rling stdin stdout | gzip -9 >lastname.a-f.gz

This will read your existing files (or grep a bunch of files, in the second case) de-dupe them, and pipe them to gzip to be compressed.

Let's say you have a list of candidate words you want to extract from a bunch of files. rling can help with that too:

cat /archive/names/* | rling -c stdin namesfound.txt /tmp/list-of-names

Here we look through all the files in the /archive/names/ directory, and keep only the "common" (-c) lines, those that also appear in the /tmp/list-of-names file, writing them to namesfound.txt. Kinda like the inverse of removing them.

Want more? How about this:

getpass /myfounds/*.MD5x01 | rling -q w stdin myMD5.pass

This will extract passwords from your solved hashes, pipe them to rling, and use the -q option to sort the words, arrange them _by usage frequency_, and write to myMD5.pass

You can do this with your hashcat .pot files too - getpass is reasonably smart, and can extract just the passwords from these files too.

getpass *.pot | rling -q a stdin allpot.pass

In this case it will display not only the solved passwords, arranged by frequency, but also counts, word length, and a histogram of the selected passwords.

There's so much more. Check out rehex (which will fix up existing passwords that should be $HEX[]'ed but aren't), and all of the other options.
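For example, a quick sketch (the file names are placeholders; rehex writes its cleaned-up output to stdout, the same way it is used in the script later in this thread):

rehex messy-wordlist.txt | rling stdin cleaned.txt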

You can even use it to sort your wordlists (a lot faster than /bin/sort :-)

rling -bs old-hashes.pass sorted-hashes.pass

There's lots more to try. Enjoy!
 

pasnger57

Interesting... I have a few 9GB personal lists that may benefit from this tool (as of now I know them to be about 99.5% unique and non-numeric). I'm fairly sure of that, as I have refined them a few times with awk, sed, grep, and the appmearg tool.
I'll give it a test.
It would be nice to have a way to compare a list against other lists and add the new unique lines.
 

pasnger57

humm ...

9082867749 bytes total in 148.1625 seconds
Counting lines...Found 702910832 lines in 471.7371 seconds
Optimal HashPrime is 1610612741
Estimated memory required: 33,319,200,621 (31.03Gbytes)
Processing input list... 0%Killed
Looks as if I need more RAM...
 

Waffle

Yup. Or you can use -b, which uses less memory; or -2, which is very fast (but requires the list in sorted order, and uses almost no memory); or -f, which uses disk. Slower, yes, but it can handle any size.
 

pasnger57

Waffle said: "Yup. Or you can use -b, which uses less memory; or -2, which is very fast (but requires the list in sorted order, and uses almost no memory); or -f, which uses disk. Slower, yes, but it can handle any size."
My list is in sorted order, but it's sorted by length, smallest to largest: 8, then 9, 10, 11, 12, and so on up to 36 characters. I'm not sure how the a-Z sets are ordered within that, or whether uppercase came before lowercase; I don't recall offhand how I set it up.
Would that have any effect on the use of the -2 flag?
 

Waffle

If it is not in standard sort order, rling will let you know :-)
 

pasnger57

and Results.....

9082867749 bytes total in 109.3928 seconds
Counting lines...Found 702910832 lines in 440.8629 seconds
Estimated memory required: 14,811,012,037 (13.79Gbytes)
Sorting... took 3274.1507 seconds
De-duplicating: 702910819 unique (13 duplicate lines) in 709.3183 seconds

0 total lines removed in 709.3200 seconds
Final sort in 355.1142 seconds
 

Waffle

Perfect. So you had 13 duplicate lines. If you like your files in sorted order, you can use -bs instead of just -b, and it will skip the final sort.

Next, try analyzing some of your founds. Pick your favorite list that you have already solved (ideal would be a salted list), and use:

getpass -n solved.list* | rling -q a stdin solved.pass

and have a look at the solved.pass file. You will find your solved words in frequency order, with a nice histogram of the sizes at the end.
 

WreckTangle

Thank you for your work and for sharing.

I had an error on win7

"The program cannot start because msys-2.0.dll is missing."


Request

Could you add sort by length please? Short to long and perhaps long to short?

Allow user to change:

aaaa
aaa
aa
a
aaaaaa
aaaaa

into

a
aa
aaa
aaaa
aaaaa
aaaaaa

Thank you
 

WreckTangle

Problem for me.

System = win7 64bit with 8GB RAM

I tried to sort and de-duplicate a 4.82 GB text file.

Since my RAM is not more than double the input file size, I tried...

rling.exe -v -b -s test.txt test2.txt
rling.exe -v -f -s test.txt test2.txt

Using either option above the time taken was in excess of 40 minutes so I aborted each test.

I can sort-unique the same test.txt file using Blazer's old sort64lm.exe software in 8 minutes.


Requests

Sort by length short to long as already mentioned above.

Could there be some feedback to the user when rling is "Reading" such as a percentage done?

Option for the user to write rling's output messages to a log file, such as work done, time taken, etc.

Please consider users with low RAM. Even users with good amounts of RAM will likely want to work on much larger lists than their RAM can handle.

Option to split output files by either ALPHA/NUMERIC or Line Length. Example: user opts to output by line length and rling sorts and places all items in separate text files such as...

originallistname_Length_1.txt
originallistname_Length_2.txt
originallistname_Length_3.txt
...
originallistname_Length_63.txt

Thank you.
 

Waffle

There in fact used to be feedback while reading, but that turned out to take more time than the actual read. Processing in lower-memory situations is a legitimate concern. We were more worried about speed while building it; going from 12 minutes to 22 seconds was a major accomplishment (while using less memory than rli). Those considerations really focused on speed, though, and much less on memory.
Sorting by length isn't an issue. Output is, however, since this is designed to be used in a pipeline.

Let me think on the best ways to implement your suggestions - particularly on how best to handle limited memory.
 

Waffle

I've added a few things that will help. First, the lowest memory usage I can get to would be the -2 option. Grab the latest code, please.

rling -2 infile out.file

will use only about 50M of space (just buffers). You still have most of the features available in this mode (-c for common lines, any number of remove files, etc.), and it is quite fast. To process a billion lines on my ancient x86_64 machine (a very, very busy system) takes about 45 seconds.

Second, you now have a "splitlen" program. So you can do things like:

splitlen -o split#.txt bigdict.txt
or
rling -2 bigdict.txt stdout | splitlen -o bigdict#.txt

The "#" will be replaced by the line length, and as many files created as required. There is no limit on line length, and no limit on the number of files that can be created.

If your files are not in lexical sort order, you can do:

sort -S 1g bigdict.txt | rling -2 stdin newdict.txt

This gets you closer to your ideal. -2 mode does not have any limit on line length or file size. Any number of lines can be used, and it will use only as much memory as you give it with -M.

It will be quite hard to reduce the memory usage of the default (hashed) mode, or the -b (binary search) mode. Yes, you can reduce the hash table size (with -p), but that will seriously affect performance. With -b, I'm only storing a single pointer per line (8 bytes), so without tricks, it's going to be hard to reduce it further.
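As a rough check of that math against the -b run posted earlier in the thread:

702,910,832 lines x 8 bytes per pointer is about 5.6G of pointers
plus the 9,082,867,749 bytes of file data itself
comes to roughly 14.7 gigabytes, right in line with the 14,811,012,037-byte estimate rling printed for that run.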

Memory is cheap again, so maxing out your system is the best way to have the highest performance, but I get that not everyone can afford it, nor do all systems allow for huge memory. That's why these other modes are present...

If you have an ssd, using -f with -T to store the database at the indicated path (like rling -f -T /media/ssd/ ...) will certainly help too, and give you "ok" performance....

I'll continue to think on this.
 

WreckTangle

Thank you Waffle I am very grateful for your help.

The low RAM problem might not just be an issue for people like me with 8GB RAM. I am sure with the HUGE password lists being shared today that users with 16GB+ RAM will soon start running into low RAM problems.

As you are friends with Blazer I assume you have looked at his word list tools available here.

Blazer's set of tools is very useful and works on any file size, but each tool has its own bugs in some form or another. Are you and Blazer perhaps able to work together on fixing the bugs in the tools already available?

I am just a hobby hash cracker and I perhaps do not understand the requirements of professionals such as yourself. I am obviously just guessing what other users need but speaking generally I think average users are simply looking for the following.


All for HUGE lists often larger than the entire available RAM.

A fast merge tool.
(Blandy has made a good tool for this but it is unable to sort and there is a small bug when one list is empty or contains few candidates)

A fast de-duplicator.
(No point testing the same password twice. Blandy's tool does this very quickly but is not able to sort)

A fast sorter A-Z or optional by line length short to long.
(It is helpful if the user can read them in alphabetical order. Also, lists sorted by length are useful for WPA testing or known-length tests. I also understand hashcat used to, or perhaps still does, work faster when the passwords are of equal length. Blazer's sort64lm.exe does sort but has issues with file names, leaving old temp files behind, and sometimes just crashes/hangs)

A fast regex tool to output both matching and unmatched items to separate lists.
(This is very useful for users to clean their lists as testing very low quality candidates is as time wasting as testing duplicates. Being able to see what matched and what didn't is useful to ensure you do not throw anything valuable away)

Cross Reference HUGE lists
(Remove items contained in one or more lists from a new list but again this needs to be for HUGE lists)

I guess most average users would like to write simple scripts and be able to use them each time they receive a new password list, regardless of size. Just knowing the task will be carried out bug-free would be wonderful, rather than just having the ability to sort small password lists very quickly.

Thank you once again Waffle for your help. I may have misunderstood the purpose of rling.exe as I am not an expert and do not know what experts require from word list tools. Just as a humble hobby user having a bug free set of word list tools would be fantastic even if they are not the fastest but simply just work.
 

Waffle

Part of the issue is that you may be missing some of the basic tools. Grab an environment that brings the basic GNU tools to you. Keep your passwords in lexical order. If you really think password lengths make a difference, then use splitlen on them to group them.

Use sort -m to merge sorted lists. It's very fast.
Use sort -S 2g (or whatever RAM you can spare) to sort your lists. Lexical order.
Use rling to remove your existing lists from a new list. Since the remove lists are not read into memory, you can use the ultra-fast default option for this. Split your new lists into 1-gigabyte chunks if required, then join the (unsorted) chunks together with sort -S 2g chunk1 chunk2 chunk3 >newlist.txt, or use the built-in -s option to sort each chunk, then sort -m to join them (a quick sketch of that follows).
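That second pattern looks something like this (the chunk names are placeholders; rling -s sorts and de-dupes each chunk, and sort -m merges the already-sorted results):

rling -s chunk1 chunk1.sorted
rling -s chunk2 chunk2.sorted
sort -m -S 2g chunk1.sorted chunk2.sorted >newlist.txt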

Here's an example (written in sh or bash, modify as required for windows):

#!/bin/sh
split -l 100000000 "$1"
for arg in x??
do
    rehex $arg | rling -s stdin $arg.proc /mymain/dict/* /myother/dict/*
done
sort -S 1g -m x??.proc >"$1.processed"
rm x?? x??.proc

This will split your new list into 100,000,000-line chunks (about 1 gigabyte for typical password lengths; make it 50M or some other length if you like). It will then process each chunk (split uses the default output naming of x??, starting at xaa and counting up) by feeding it through rehex to get rid of any UTF-16, nulls in the files, and so on. That output is piped into rling, which uses the default hashed mode. Running these chunks against all of your existing dictionaries should run pretty much as fast as your system can read them, and their size won't matter (because the remove lists are not read into memory). sort -S 1g -m will then merge the processed chunks back together, and build the final output file.

Using 100M lines will take about 4 gigabytes of memory with rling, so should fit fine in 8G. Adjust as required.
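That 4 gigabyte figure follows from the rule of thumb earlier in the thread: roughly 1 gigabyte of chunk data times the 3x-or-so memory factor for the default hashed mode, plus working buffers.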

Other interesting ways you can process your lists: Let's say you want to know *which* of your lists have been the most effective cracking a set (or all) of your files?

To do this requires some understanding (and organization) of your lists. I keep my lists split up (using mdsplit of course, even with other cracking programs), so that the uncracked hashes are stored with .txt as the extension, and the solved hashes have an extension that indicates the type of hash. You don't have to do this, but you do have to have some way to recognize solved vs unsolved. For example, the ISW 2012 crack challenge had dozens of different hashes, all mixed together, so my files have isw2012-raw-md5.txt as the unsolved, and:

isw2012-raw-md5.GOST-CRYPTOx01 isw2012-raw-md5.GOSTx01 isw2012-raw-md5.GROESTL256x01 isw2012-raw-md5.HAV128_4MD5x01 isw2012-raw-md5.HAV128_4x01
isw2012-raw-md5.HAV128_5x01 isw2012-raw-md5.HAV128MD5x01 isw2012-raw-md5.HAV128x01 isw2012-raw-md5.HAV160_3x01 isw2012-raw-md5.HAV160_4x01
isw2012-raw-md5.HAV160_5x01 isw2012-raw-md5.HAV192_3x01 isw2012-raw-md5.HAV192_4x01 isw2012-raw-md5.HAV192_5x01 isw2012-raw-md5.HAV224_3x01
isw2012-raw-md5.HAV224_4x01 isw2012-raw-md5.HAV224_5MD5x01 isw2012-raw-md5.HAV224_5x01 isw2012-raw-md5.HAV256_4x01 isw2012-raw-md5.HAV256_5x01
isw2012-raw-md5.HAV256MD5x01 isw2012-raw-md5.HAV256x01 isw2012-raw-md5.HAVx01 isw2012-raw-md5.hexsalt isw2012-raw-md5.HMAC-MD5 isw2012-raw-md5.HMAC-SHA1chop32
isw2012-raw-md5.LMx01 isw2012-raw-md5.LOTUS5 isw2012-raw-md5.MANGOS isw2012-raw-md5.MD2MD5x01 isw2012-raw-md5.MD2x01 isw2012-raw-md5.MD2x02
isw2012-raw-md5.MD2x03 isw2012-raw-md5.MD4 isw2012-raw-md5.MD4.MD4x01 isw2012-raw-md5.MD4MD5PASSx01 isw2012-raw-md5.MD4MD5x01
...
isw2012-raw-md5.SHA384x01 isw2012-raw-md5.SHA512MD5x01 isw2012-raw-md5.SHA512x01 isw2012-raw-md5.SHA512x02 isw2012-raw-md5.SHA512x04
isw2012-raw-md5.SMF isw2012-raw-md5.SMFchop32 isw2012-raw-md5.SNE128MD5x01 isw2012-raw-md5.SNE128x01 isw2012-raw-md5.SNE256x01
isw2012-raw-md5.SQL5x01 isw2012-raw-md5.TIGERx01

Which of my lists were most effective cracking these? This is why getpass has special processing for .txt files by default (you can disable it with -t). By default, if you give getpass a .txt file, it will automatically replace the .txt with a *, and look up all of the files associated with the original. getpass will then extract the passwords from each of those files. A *lot* more handy than trying to name each file separately. You don't have to follow this practice, but you do have to have some way of keeping the "solved" and "unsolved" files separate. Careful use of the exclude extension (-x option to getpass) will help you do this.

So, back to the question: which files helped "the most" in solving these hashes?

getpass isw2012.txt | rling -q cslwh stdin isw2012.pass /mymain/dict/* /myother/dict/* 2>isw2012.dictstat

isw2012.pass will be filled with the sorted (by frequency) passwords, and isw2012.dictstat will have each of the wordlist stats:
```
Count Percent Cumul Length Line
59263 0.04% 0.04% 6 123456
22428 0.02% 0.06% 0
14001 0.01% 0.07% 10 YAgjecc826
12436 0.01% 0.08% 9 123456789
10112 0.01% 0.09% 8 password
8757 0.01% 0.09% 6 123123
7901 0.01% 0.10% 6 qwerty
6403 0.00% 0.11% 5 12345
6153 0.00% 0.11% 8 12345678
5740 0.00% 0.11% 6 111111
4682 0.00% 0.12% 4 1234
4359 0.00% 0.12% 8 j38ifUbn
...

Histogram of lengths
Count Length Percent Cumulative
18355832 8 14.52% 14.52%
16471204 10 13.03% 27.55%
14414887 9 11.40% 38.95%
12196255 7 9.65% 48.60%
12092543 11 9.57% 58.16%
10253875 12 8.11% 66.28%
7631331 13 6.04% 72.31%
5616111 14 4.44% 76.75%
4676250 6 3.70% 80.45%
```
 

WreckTangle

Thank you Waffle for splitlen.exe, it works really well.

You have enabled me to split 2GB text files into sorted lengths in 105 seconds :)

I can now use your splitlen.exe tool to reduce some lists to a size small enough so I can use rling.exe to sort them.

Thank you for explaining how to split and sort lists with my limited RAM. I do work on MASSIVE lists but I always think splitting things up means I miss duplicates somehow.

I have used basic GNU tools before but as soon as I found Blazers tools, linked to above, I found it hard to go back to anything else despite the bugs!

I have been dreaming of tools like Blazers, but without the bugs, for some time now.

Your rling.exe is what I have been waiting for except for the issues with list sizes and my available RAM.

If you could make rling.exe work on large files at the same speed or even slightly slower than Blazers sort64lm.exe it would be awesome.

It seems to me that a lot of time is taken writing the db file when I use the rling.exe -b or -f options, and I wonder if it is possible to ask Blazer how he got around the large file problem?

Anyway, thank you Waffle; people like you providing tools such as rling.exe make life much more interesting for those of us who are not smart enough to write them ourselves!
 

Waffle

There is no database in use with -b, only with -f. -b should be your best bet for speed if the lists are not sorted (or, if you want rling to sort them as well, use -bs).

I did talk to Blazer - he knows about the bugs, and is trying to get back on here so he can get more info from you.
 

WreckTangle

I tried -bs with a 4.82GB list and I had to stop it after about an hour.

Thanks for talking with Blazer. Why don't you two code something together instead of working on separate projects?

Actually, as you have improved the rli utility from the hashcat utilities, can I interest you in improving maskprocessor by adding an option to produce random (pre-defined) characters?

example = mp64.exe -p -1 ?l?u?d -2 ?d ?1?1?1?1?1?1?1?2

p could be used to stand for pseudo-random.

The command above would not work sequentially, but would generate random output using lower, upper, and numbers; the last position would be just random digits, as an example.

This would be fun to have and also could be used on those hashes in everyone's "unbreakable" hash lists LOL

Obviously it is not an optimised attack but it is useful for learning and fun to play with.

Thanks again for your work.
 

WreckTangle

1. I tried the cross-reference feature of rling.exe and noticed that every time it started a new list it checked for duplicate lines.

In order to speed it up a little I pre-sorted the lists and used the -n option so rling.exe would not bother checking for duplicates.

However when I was watching the output screen I noticed rling.exe kept checking for duplicates. I am not sure if this is just a standard screen output or if rling.exe was actually checking each time.

2. In the screen output of the cross-reference it says "Removing from (input list)", which might mislead the user into believing rling.exe is taking lines out of the reference list instead of the source list.

3. I worked on a 4.82GB source list using an 80MB reference list.

After 50 minutes rling.exe was still reading the source file. I had to close the command window as my computer became unresponsive. I assume this is due to me having low RAM of only 8GB?
 

Waffle

Blazer *is* working on this code too. This codebase is designed for more advanced purposes than the old code (which I believe is 32-bit, and has many other limitations). The point of this is to remove the limitations of the old code, and better support high-speed modern processors with many cores. It still supports older processors (of course), and limited memory. It uses far fewer resources than rli (from the hashcat utilities), which is where it originated. But ULM will never support unlimited line lengths, and other features that we think are important for current uses.

Now, on to your items:
1. The duplicate check also does the indexing of the lengths. It's a two-in-one feature.

2. I'm open to better wording. What would you suggest? Because it doesn't *just* do removal (you can use the -c switch to do common lines between multiple files), this seemed the best compromise.

3. The README suggests that rling wants 3 times the file size in memory. At 4.82G, you should have at least 13-14G of *free* RAM to process it. This really can't be cut down much (I'm already using about 2/3rds as much RAM as RLI, and doing a lot more with it), so you are swapping (or paging, depending on how you look at it). With 8G of RAM (of which only 6G is likely free), you should consider 1-2G as the maximum input file size to process. This really can vary, because it depends on the number of lines in the file (and the length of each line): a 4G file with 100,000 lines uses a lot less memory than a 2G file with 2,000,000,000 lines.

Split your files up, and do multiple passes - it will be a *lot* faster than any other method.

There's a new tool, released today, called dedupe. That's all it does: reads any number of lines of any length, and gets rid of duplicates. You can choose between keeping a copy of each line in memory (the default, which uses a lot of memory) or just keeping a hash of each line (currently -h, but that's likely to change to -x soon). Of course, just keeping a hash means that a false positive is possible (two lines which are not the same hashing to the same value).
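If dedupe follows the same infile/outfile convention as rling (an assumption on my part; check its usage output for the real syntax), that would look something like:

dedupe big.txt big.uniq.txt
dedupe -h big.txt big.uniq.txt    # hash-only mode: much less memory, small chance of false positives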

So again - the best bet, with limited memory, is to split your files up. That's what I do on the Raspberry Pi and ODROID systems I have. They only have 4G of RAM, so it's even more important there.
You can use splitlen for this, or just the UNIX utility split. 100,000 lines is a good choice, but try a few different ways. The goal, as always, is to limit the total number of duplicate candidates to as close to zero as practical.
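One way to do that on a low-memory box might look like the sketch below (file names are placeholders): sort each chunk, merge the sorted chunks, and let a final rling -2 pass remove all duplicates from the merged stream.

split -l 100000000 big.txt
for f in x??; do sort -S 1g "$f" >"$f.sorted"; done
sort -m x??.sorted | rling -2 stdin big.uniq.txt
rm x?? x??.sorted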
 