Removing duplicates from BIG text file - python

I have a rather big text file, around 30GB on average. I want to remove duplicate lines from this file. What is a good, efficient algorithm to do this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine. (I am thinking of using C, or is it not language-dependent but rather the algorithm that matters more?) Thanks

If you can't just fire up an instance on amazon with enough memory to hold everything in RAM, this is the strategy I would use:
Step 1 - go through and generate a checksum/hash value for each line. I'd probably use SipHash. Output these to a file.
Step 2 - sort the file of SipHash values, and throw away any that have only one entry. Output the result as a set of hash values and their match counts.
Step 3 - read through the file again, regenerating the hash value for each line. If it's a line that has a match, hold onto it in memory. If there's another line already in memory with the same hash value, compare them to see if the lines themselves match. Output "match" if true. If you've already seen all N lines that share a hash value and they didn't match, go ahead and dispose of the record.
This strategy depends on the number of duplicates being only a small fraction of the number of total lines. If that's not the case, then I would use some other strategy, like a divide and conquer.
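A minimal single-machine sketch of this idea, assuming the compact per-line hashes (unlike the lines themselves) fit in memory, so the external sort of step 2 collapses into a dictionary of counts; blake2b with an 8-byte digest stands in for SipHash here:

```python
import hashlib

def line_hash(line):
    # 8-byte digest, a stand-in for SipHash
    return hashlib.blake2b(line, digest_size=8).digest()

def dedupe(in_path, out_path):
    # Pass 1 (steps 1-2): count occurrences of each line's hash.
    counts = {}
    with open(in_path, "rb") as f:
        for line in f:
            h = line_hash(line)
            counts[h] = counts.get(h, 0) + 1
    dupes = {h for h, n in counts.items() if n > 1}
    del counts
    # Pass 2 (step 3): write each line once; only lines whose hash is a
    # known duplicate need their full text held in memory.
    seen = set()
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for line in f:
            if line_hash(line) in dupes:
                if line in seen:
                    continue
                seen.add(line)
            out.write(line)
```

As in the answer, memory use in the second pass is proportional to the duplicate fraction, not the file size.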


More efficient data structure/algorithm to find similar imagehashes in database

I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
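As an illustration, a minimal in-memory sketch of the per-position variant (all names here are illustrative, not from the imagehash library):

```python
from collections import defaultdict

PART_LEN = 4  # a 64-bit hash as 16 hex chars -> 4 parts of 4 chars

def parts(h):
    return [(i, h[i * PART_LEN:(i + 1) * PART_LEN])
            for i in range(len(h) // PART_LEN)]

def hamming(a, b):
    # bitwise hamming distance between two hex-encoded hashes
    return bin(int(a, 16) ^ int(b, 16)).count("1")

class HashIndex:
    def __init__(self):
        # one bucket per (position, part) pair: the "dictionary per position" variant
        self.buckets = defaultdict(list)

    def add(self, h, filename):
        for key in parts(h):
            self.buckets[key].append((h, filename))

def find_similar(index, h, max_dist=3):
    # any hash within distance 3 must share at least one part in place
    candidates = set()
    for key in parts(h):
        candidates.update(index.buckets[key])
    return [(c, f) for (c, f) in candidates if hamming(c, h) <= max_dist]
```

Only the small candidate set, not the whole database, gets the full hamming comparison.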

Pyarrow Write/Append Columns Arrow File

I have a calculator that iterates over a couple of hundred objects and produces Nx1 arrays for each of those objects, N here being 1-10m depending on the configuration. Right now I am summing over these by using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to a file, so I can do other computations (compute quantiles, partial sums etc., pandas-style). Preferably I would like to use pa.memory_map on a single file (in order to have dataframes not loaded into memory), but I can not see how I can produce such a file without generating the entire result first. (Monte Carlo results on 200-500 x 10m floats.)
If I understand correctly, RecordBatchStreamWriter needs a part of the entire table, and I can not produce only a part of it. The parts the calculator produces are the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or by creating an empty Arrow file which can be filled? (The schema is known.)
As I see it, my alternative is to write several files and use "dataset"/tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory, as I can't see in the docs that dataset() works with memory_map. The result set is too big to fit in memory. (At least on the server it is running on.)
I'm on day 2 of digging through the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in datasets() must have the same schema, so I can not split "columns" into separate files either, can I?
EDIT
After wrestling with this library, I now produce single-column files which I later combine into a single file. However, when following the suggested solution, visible memory consumption (Task Manager) skyrockets in the step of combining the files. I would expect peaks for every "rowgroup" or combined record batch, but instead it steadily increases until all memory is used. A snip of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
    os.path.join(self.filepath, self.outfile_name + ".arrow"),
    schema=combined_schema,
) as writer:
    for group in zip(*readers):
        combined_batch = pa.RecordBatch.from_arrays(
            [g.column(0) for g in group], names=combined_schema.names
        )
        writer.write_batch(combined_batch)
From this link I would expect running memory consumption to be roughly that of combined_batch, and then some.
You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.

Finding least commonly occurring fuzzy string

I've got a 10K+ line log file that I'm using to debug an issue. I'm looking for 'aberrant' log lines that occur infrequently relative to the other lines in the file to hopefully extract interesting happenings that may be occurring. These log lines are highly variable and diverse.
My initial approach was to do a fuzzy comparison of each line against the remaining lines in the file, get an average of those ratios and assign it to each line, sort those ratios and return the smallest N items in that set.
However, this takes a very, very long time on my machine when using Python (I'm using fuzzywuzzy).
Any alternative suggestions?
Instead of that comparison, make one pass of the file to categorize the lines by their distinctive features. Store a reference to each line in a dict, keyed by category.
Then make a pass over the dict, eliminating any keys with too many references (i.e. boring categories). The remaining categories are the interesting ones.
This is an O(N) process, rather than the O(N^2) process with which you started.
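A minimal sketch of this one-pass categorization, assuming the "distinctive feature" is the line with its variable parts (numbers, hex ids) masked out:

```python
import re
from collections import defaultdict

def category(line):
    # Mask the variable parts so lines that differ only in such
    # details collapse into the same category.
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def rare_lines(lines, max_count=3):
    buckets = defaultdict(list)
    for ln in lines:
        buckets[category(ln)].append(ln)
    # keep only the lines in infrequent (interesting) categories
    return [ln for ls in buckets.values() if len(ls) <= max_count for ln in ls]
```

What counts as a "distinctive feature" depends on the log format; timestamps and thread ids are other common candidates for masking.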

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually 2-30 characters). Also, they're single words (or a combination of words without spaces, like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500 mil list which start with 'hi' (e.g. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown; if he types "hey", all strings starting with "hey" will show, etc.
I've tried with a trie, but the memory footprint to store 300 mil strings is just huge. It would require 100GB+ of RAM. And I'm pretty sure the list will grow to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes @greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
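A sketch of this scheme (sorted list, bigram index, binary search, then a linear scan over the matches):

```python
import bisect

def build_bigram_index(strings):
    # strings must be sorted; map each leading bigram to (first, last) index
    index = {}
    for i, s in enumerate(strings):
        bg = s[:2]
        first, _ = index.get(bg, (i, i))
        index[bg] = (first, i)
    return index

def prefix_matches(strings, index, prefix):
    if len(prefix) >= 2:
        span = index.get(prefix[:2])
        if span is None:
            return []
        lo, hi = span
    else:
        lo, hi = 0, len(strings) - 1   # too short for the bigram index
    # binary search for the first string >= prefix inside the slice,
    # then scan linearly while the prefix still matches
    start = bisect.bisect_left(strings, prefix, lo, hi + 1)
    out = []
    for i in range(start, hi + 1):
        if not strings[i].startswith(prefix):
            break
        out.append(strings[i])
    return out
```

The bigram lookup is O(1) and narrows the binary search to roughly 1/26^2 of the list.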
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (they are not too many if it is case-insensitive; otherwise you can limit it to three), and the values are lists of positions of all strings that begin with that combination.
After the user has typed the four (or three) letters, you have all the possible strings at once. From this point on you just loop over this subset.
On average this subset is small enough, i.e. 500M divided by 26^4, just as an example. It is actually bigger, because probably not every set of 4 letters can be a prefix of your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to its key in the map.
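A minimal in-memory sketch of this prefix map, including the update-on-insert step (at this scale a disk-backed key-value store would replace the dict; the class name is illustrative):

```python
from collections import defaultdict

class PrefixMap:
    PREFIX_LEN = 4

    def __init__(self):
        self.strings = []
        self.by_prefix = defaultdict(list)  # 4-letter prefix -> positions

    def add(self, s):
        # update the position list for the new string's key, as described above
        idx = len(self.strings)
        self.strings.append(s)
        if len(s) >= self.PREFIX_LEN:
            self.by_prefix[s[:self.PREFIX_LEN]].append(idx)

    def search(self, query):
        # query must be at least PREFIX_LEN characters long
        candidates = self.by_prefix.get(query[:self.PREFIX_LEN], [])
        return [self.strings[i] for i in candidates
                if self.strings[i].startswith(query)]
```

The final startswith loop runs only over the small per-prefix subset.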
If you don't want to use a database, you should recreate some of the data-handling routines that exist in all database engines:
Don't try to load all the data into memory.
Use a fixed length for all strings. It increases storage consumption but significantly decreases seek time (the i-th string can be found at byte position L*i in the file, where L is the fixed length). Create an additional mechanism to handle extremely long strings: store them in a different place and use special pointers.
Sort all the strings. You can use merge sort to do it without loading all the strings into memory at once.
Create indexes (the address of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc. Indexes can be kept in memory to increase search speed.
Use advanced strategies to avoid full index regeneration on data updates: split the data into a number of files by first letter and update only the affected indexes; create empty spaces in the data to reduce the impact of read-modify-write procedures; create a cache for new lines before they are added to the main storage, and search in this cache as well.
Use a query cache for fast processing of popular requests.
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addressess allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
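An in-memory sketch of the core idea, with a dict standing in for the 26 fixed-length spaces and a small postfix block that is split into per-character child slots once it outgrows a threshold (the constant and names are mine):

```python
BLOCK_LIMIT = 4  # split a postfix block into child slots past this size

class Node:
    def __init__(self):
        self.slots = {}   # char -> Node; stands in for the 26 fixed spaces
        self.block = []   # pre-allocated postfix block, kept whole while small
        self.split = False

    def add(self, s):
        if self.split:
            if s == "":
                self.block.append("")   # terminal marker: a string ends here
            else:
                self.slots.setdefault(s[0], Node()).add(s[1:])
        else:
            self.block.append(s)
            if len(self.block) > BLOCK_LIMIT:
                old, self.block = self.block, []
                self.split = True
                for p in old:           # redistribute the block into slots
                    self.add(p)

    def matches(self, prefix, acc=""):
        if not self.split:
            return [acc + p for p in self.block if p.startswith(prefix)]
        if prefix == "":
            out = [acc] * len(self.block)   # strings that end exactly here
            for c, child in sorted(self.slots.items()):
                out += child.matches("", acc + c)
            return out
        child = self.slots.get(prefix[0])
        return child.matches(prefix[1:], acc + prefix[0]) if child else []
```

On disk, each slots dict would be the pre-allocated array of 26 fixed-length spaces holding relative addresses, and each block a pre-allocated region, as described above.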
First of all, I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way to handle wildcard queries. There is a Python version (PyLucene) if you are comfortable with Python. But to use Lucene to solve your problem, you should first learn about indexing your data (the part where your data is compressed and saved in a more efficient manner).
Also, looking at the indexing and wildcard query sections of an IR book will give you a better picture.

How to jump to the same line in two huge text files?

I am trying to use Python to do some manipulations on huge text files, and by huge I mean over 100GB. Specifically, I'd like to take samples from the lines of the files. For example, let's say I have a file with ~300 million lines; I want to take just a million, write them to a new file, and analyze them later to get some statistics. The problem is, I can't start from the first line, since the first fraction of the file does not represent the rest of it well enough. Therefore, I have to get about 20% into the file and then start extracting lines. If I do it the naive way, it takes very long (20-30 minutes on my machine) to get to the 20% line. For example, let's assume again that my file has 300 million lines, and I want to start sampling from the 60,000,000th (20%) line. I could do something like:
start_in_line = 60000000
sample_size = 1000000
with open(huge_file,'r') as f, open(out_file,'w') as fo:
    for x in range(start_in_line):
        f.readline()
    for y in range(sample_size):
        print(f.readline(), file=fo)
But as I said, this is very slow. I tried using some less naive ways, for example the itertools functions, but the improvement in running time was rather slight.
Therefore, I went for another approach - random seeks into the file. What I do is get the size of the file in bytes, calculate 20% of it, and then seek to that byte. For example:
import os

huge_file_size = os.stat(huge_file).st_size
offset_percent = 20
sample_size = 1000000
start_point_byte = int(huge_file_size*offset_percent/100)
with open(huge_file) as f, open(out_file,'w') as fo:
    f.seek(start_point_byte)
    f.readline()  # get to the start of the next line
    for y in range(sample_size):
        print(f.readline(), file=fo)
This approach works very nicely, BUT!
I always work with pairs of files. Let's call them R1 and R2. R1 and R2 will always have the same number of lines, and I run my sampling script on each one of them. It is crucial for my downstream analyses that the samples taken from R1 and R2 coordinate regarding the lines sampled. For example, if I end up starting sampling from line 60,111,123 of R1, I must start sampling from the very same line in R2. Even if I miss by one line, my analyses are doomed. If R1 and R2 are of exactly the same size (which is sometimes the case), then I have no problem, because my f.seek() will get me to the same place in both files. However, if the line lengths differ between the files, i.e. the total sizes of R1 and R2 are different, then I have a problem.
So, do you have any idea for a workaround, without having to resort to the naive iteration solution? Maybe there is a way to tell which line I am at, after performing the seek? (couldn't find one...) I am really out of ideas at this point, so any help/hint would be appreciated.
Thanks!
If the lines in each file can have different lengths, there is really no other way than to scan them first (unless there is some form of unique identifier on each line which is the same in both files).
Even if both files have the same length, there could still be lines with different lengths inside.
Now, if you're doing those statistics more than once on different parts of the same files, you could do the following:
do a one-time scan of both files and store the file positions of each line in a third file (preferably in binary form (2 x 64-bit values), or at least at a fixed width, so you can directly jump to the position pair of the line you want, whose location you can then calculate).
then just use those filepositions to access the lines in both files (you could even calculate the size of the block you need from the different filepositions in your third file).
When scanning both files at the same time, make sure you use some buffering to avoid a lot of harddisk seeks.
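A sketch of that third-file index in binary form, using fixed-width 2 x 64-bit records so the pair for any line number can be seeked to directly (function names are illustrative):

```python
import struct

RECORD = struct.Struct("<QQ")  # (offset in R1, offset in R2), 2 x 64-bit

def write_index(positions_r1, positions_r2, index_path):
    # one fixed-width record per line: its byte offset in each file
    with open(index_path, "wb") as idx:
        for pair in zip(positions_r1, positions_r2):
            idx.write(RECORD.pack(*pair))

def read_pair(index_path, line_no):
    # fixed-width records let us seek straight to any line's pair
    with open(index_path, "rb") as idx:
        idx.seek(line_no * RECORD.size)
        return RECORD.unpack(idx.read(RECORD.size))
```

With the pair in hand, a plain f.seek() on each data file lands both samples on the same line number.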
edit:
I don't know Python (I'm a C++ programmer), but I did a quick search and it seems Python also supports memory-mapped files (mmap).
Using mmap you could speed things up dramatically (no need to do a readline each time just to know the positions of the lines): just map a view on part(s) of your file and scan through that mapped memory for the newline character (\n, or 0x0A in hexadecimal). This should take no longer than the time it takes to read the file.
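A sketch of that mmap scan, collecting the byte offset of every line start so later runs can seek straight to any line number:

```python
import mmap
import os

def line_offsets(path):
    # Byte offset of the start of every line, found by scanning the
    # memory-mapped file for newlines instead of calling readline().
    offsets = [0]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(b"\n")
            while pos != -1:
                offsets.append(pos + 1)
                pos = mm.find(b"\n", pos + 1)
    # drop the trailing entry if the file ends with a newline
    if offsets[-1] == os.path.getsize(path):
        offsets.pop()
    return offsets
```

Running this once over R1 and R2 gives two offset lists; seeking both files to offsets[i] starts both samples at line i.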
Unix files are just streams of characters, so there is no way to seek to a given line, or find the line number corresponding to a given character, or anything else of that form.
You can use standard utilities to find the character position of a line. For example,
head -n 60000000 /path/to/file | wc -c
will print the number of characters in the first 60,000,000 lines of /path/to/file.
While that may well be faster than using python, it is not going to be fast; it is limited by the speed of reading from disk. If you need to read 20GB, it's going to take minutes. But it would be worth trying at least once to calibrate your python programs.
If your files don't change, you could create indexes mapping line numbers to character position. Once you build the index, it would be extremely fast to seek to the desired line number. If it takes half an hour to read 20% of the file, it would take about five hours to construct two indexes, but if you only needed to do it once, you could leave it running overnight.
OK, so thanks for the interesting answers, but this is what I actually ended up doing:
First, I estimate the number of lines in the file without actually counting them. Since my files are ASCII, I know that each character takes 1 byte, so I get the number of characters in, say, the first 100 lines, then get the size of the file and use these numbers to get a (quite rough) estimate of the number of lines. I should say here that although my lines might be of different lengths, they are within a limited range, so this estimation is reasonable.
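The estimation step can be sketched as follows (assuming ASCII, i.e. one byte per character, as above; the function name is mine):

```python
import os

def estimate_line_count(path, sample_lines=100):
    # Average byte length of the first sample_lines lines; with ASCII
    # text this extrapolates to a rough total line count.
    total = 0
    with open(path, "rb") as f:
        for _ in range(sample_lines):
            line = f.readline()
            if not line:
                break
            total += len(line)
    avg = total / sample_lines
    return int(os.stat(path).st_size / avg)
```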
Once I have that, I use a system call to the Linux sed command to extract a range of lines. So let's say that my file really has 300 million lines, and I estimated it to have 250 million (I get much better estimates, but it doesn't really matter in my case). I use an offset of 20%, so I'd like to start sampling from line 50,000,000 and take 1,000,000 lines. I do:
os.system("sed -n '50000000,51000000p;51000000q' in_file > out_file")
Note the 51000000q - without it, sed would keep running through the whole file.
This solution is not as fast as using random seeks, but it's good enough for me. It also includes some inaccuracy, but it doesn't bother me in this specific case.
I'd be glad to hear your opinion on this solution.
