Finding least commonly occurring fuzzy string - python

I've got a 10K+ line log file that I'm using to debug an issue. I'm looking for 'aberrant' log lines that occur infrequently relative to the other lines in the file to hopefully extract interesting happenings that may be occurring. These log lines are highly variable and diverse.
My initial approach was to do a fuzzy comparison of each line against the remaining lines in the file, get an average of those ratios and assign it to each line, sort those ratios and return the smallest N items in that set.
However, this takes a very, very long time on my machine when using Python (I'm using fuzzywuzzy).
Any alternative suggestions?

Instead of that comparison, make one pass of the file to categorize the lines by their distinctive features. Store a reference to each line in a dict, keyed by category.
Then make a pass over the dict, eliminating any keys with too many references (i.e. boring categories). The remaining categories are the interesting ones.
This is an O(N) process, rather than the O(N^2) process with which you started.
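For example, a rough sketch of that one-pass categorization, where a line's "distinctive feature" is the line itself with its variable parts (numbers, hex IDs) masked out; the masking rules, threshold, and file name are placeholders to tune for your log format:
import re
from collections import defaultdict

def signature(line):
    # Mask the highly variable parts so similar lines fall into the same category.
    line = re.sub(r'0x[0-9a-fA-F]+', '<HEX>', line)
    line = re.sub(r'\d+', '<NUM>', line)
    return line.strip()

categories = defaultdict(list)              # signature -> line numbers
with open('app.log') as f:                  # placeholder path
    for lineno, line in enumerate(f, 1):
        categories[signature(line)].append(lineno)

THRESHOLD = 5                               # "too many references" cutoff
for sig, linenos in categories.items():
    if len(linenos) <= THRESHOLD:           # keep only the rare categories
        print(len(linenos), sig, linenos[:3])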

Related

Optimizing a counter that loops through documents to check how many words from the document appear in a list

I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
    pos_words = 0
    for word in review:
        if word in pos_word_list:
            pos_words += 1
    reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold 2M documents in both list and array form at once.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable name implies that your lexicon is a list. A list has inherently linear lookup. Change this to a data structure amenable to faster search, such as a set (hash-based) or some appropriate search tree. Even a sorted list with an interpolation search would speed up your work.
Do consider using popular modules (such as collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long.
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic (i.e. that something does provide a solution, especially the lexicon set).
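For example, a minimal sketch of the lexicon-as-set idea, reusing the variable names from the question (whether this alone is fast enough is the thing to measure):
# pos_word_list and reviews_upper are the objects from the question's code.
pos_word_set = set(pos_word_list)   # hash lookups: O(1) average vs. O(n) for a list

reviews_pos_wc = []
for review in reviews_upper:
    pos_words = sum(1 for word in review if word in pos_word_set)
    # Equivalent idea with collections.Counter (count each distinct word once,
    # then sum the counts of the positive ones):
    #   from collections import Counter
    #   pos_words = sum(n for w, n in Counter(review).items() if w in pos_word_set)
    reviews_pos_wc.append(pos_words)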

splitting long string at indices in parallel in python

I have many files that each have several million rows; each row is a dumped data entry and is several hundred characters long. The rows come in groups and the first two characters tell me the type of row it is, and I use that to parse it. This structure prohibits me from loading the rows to a dataframe, for example, or anything else that does not go through the rows one at a time.
For each row, I currently create a dictionary vals = {}, and then sequentially run through about fifty keys along the lines of
vals['name'] = row[2:24]
vals['state'] = row[24:26]
Instead of doing fifty assignments sequentially, can I do this simultaneously or in parallel in some simple manner?
Is
vals['name'], vals['state'] = row[2:24], row[24:26]
faster if I do this simultaneous assignment for many entries? I could also reformulate this as a list comprehension. Would that be faster than running through sequentially?
To answer your question, no, doing multiple assignment will not speed up your program. This is because the multiple assignment syntax is just a different way of writing multiple assignments on different lines.
For example
vals['name'], vals['state'] = row[2:24], row[24:26]
is equivalent to
vals['name'] = row[2:24]
vals['state'] = row[24:26]
If you want to optimize your code, you should start by profiling it to determine the parts that are taking the largest amount of time. I would also check to ensure that you are not doing multiple reads from the same file, as these are very slow compared to reading from memory. If possible, you should read the entire file into memory first, and then process it.
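As a side note, if the goal is maintainability rather than speed, the fifty slices can be driven from one field-specification table; the two fields shown follow the slices in the question, and this still performs the same number of slice operations as the sequential version:
# Layout table: field name -> (start, end) offsets within the row.
FIELDS = {
    'name':  (2, 24),
    'state': (24, 26),
    # ... roughly fifty entries in the real layout
}

def parse_row(row):
    """Build the vals dict for one row with a single dict comprehension."""
    return {key: row[start:end] for key, (start, end) in FIELDS.items()}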

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500 mil list which start with 'hi' (e.g. 'hithere', 'hihowareyou', etc.). This needs to be fast because there will be a new query each time the user types something: if he types "hi", all strings starting with "hi" from the 500 mil list will be shown; if he types "hey", all strings starting with "hey" will show, etc.
I've tried a trie, but the memory footprint to store 300 mil strings is just huge: it would require 100GB+ of RAM. And I'm pretty sure the list will grow to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes @greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
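For illustration, a sketch of the sorted-list lookup using the bisect module; it uses a second bisect with a sentinel upper bound in place of the linear scan described above, and the bigram dictionary would simply supply tighter lo/hi bounds:
import bisect

def prefix_range(sorted_strings, prefix, lo=0, hi=None):
    """Return all strings in the sorted list that start with prefix."""
    if hi is None:
        hi = len(sorted_strings)
    left = bisect.bisect_left(sorted_strings, prefix, lo, hi)
    # '\xff' sorts after every ASCII character, so prefix + '\xff' bounds the block.
    right = bisect.bisect_right(sorted_strings, prefix + '\xff', lo, hi)
    return sorted_strings[left:right]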
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (there are not too many if it is case insensitive, otherwise you can limit it to three), and the values are lists of positions of all strings that begin with that combination.
After the user has typed the three (or four) letters, you immediately have all the possible strings. From this point on you just loop over this subset.
On average this subset is small enough, i.e. roughly 500M divided by 26^4, just as an example. In practice each subset will be somewhat bigger, because probably not every 4-letter combination is a prefix of one of your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to its key in the map.
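An in-memory sketch of that prefix map (in practice the lists of positions would live on disk or in a key-value store; big_list and the other names are illustrative):
from collections import defaultdict

big_list = ['hello', 'hithere', 'hihowareyou', 'heythere']   # stand-in for the 500M strings
PREFIX_LEN = 4

prefix_map = defaultdict(list)                # e.g. 'hith' -> positions in big_list
for pos, s in enumerate(big_list):
    prefix_map[s[:PREFIX_LEN]].append(pos)

def lookup(query):
    # Only called once the user has typed at least PREFIX_LEN characters.
    candidates = prefix_map.get(query[:PREFIX_LEN], [])
    return [big_list[i] for i in candidates if big_list[i].startswith(query)]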
If you don't want to use a database, you will end up recreating data-handling routines that already exist in every database engine:
Don't try to load all the data into memory.
Use a fixed length for all strings. It increases storage consumption but significantly decreases seek time (the i-th string can be found at byte position L*i in the file, where L is the fixed length; see the sketch after this list). Create an additional mechanism for extremely long strings: store them in a different place and use special pointers.
Sort all of the strings. You can use a merge sort to do this without loading all strings into memory at once.
Create indexes (the address of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc. Indexes can be kept in memory to increase search speed.
Use more advanced strategies to avoid full index regeneration on every update: split the data into a number of files by first letter and update only the affected indexes, leave empty space in the data to reduce the impact of read-modify-write procedures, and keep a cache of new lines (searched alongside the main storage) before they are merged in.
Use a query cache to speed up processing of popular requests.
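A minimal sketch of the fixed-length-record lookup from the list above, assuming a record length of 32 bytes and a placeholder file name:
L = 32                                        # fixed record length, padded with b'\x00'

def read_record(f, i):
    """Return the i-th string from an open binary file of fixed-length records."""
    f.seek(i * L)
    return f.read(L).rstrip(b'\x00').decode('ascii')

# With the records sorted, the file can be binary-searched without loading it.
with open('strings.dat', 'rb') as f:          # placeholder path
    n_records = f.seek(0, 2) // L             # seeking to the end returns the file size
    print(read_record(f, n_records // 2))     # e.g. the middle record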
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero, if there is no string in which, e.g., 't' follows 'hello', or contains the number of data-storage addresses by which to advance to find the list of characters that can follow 'hellot' at location 6. (You could also use absolute data-storage addresses, but only relative addresses allow the algorithm I propose to support an arbitrarily large number of strings of arbitrary length without any modification to allow for larger pointers as the list grows.)
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
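To make the fixed-slot layout concrete, here is a rough sketch of reading one node; the 26 slots, 8 bytes per slot, and signed relative offsets are assumptions drawn from the description above, not a reference implementation:
import struct

FANOUT = 26                  # one slot per letter 'a'..'z'
SLOT_SIZE = 8                # each slot holds a 64-bit relative offset; 0 means "no string"

def child_position(f, node_pos, ch):
    """Return the file position of the child node for character ch, or None."""
    slot = ord(ch) - ord('a')
    f.seek(node_pos + slot * SLOT_SIZE)
    (rel,) = struct.unpack('<q', f.read(SLOT_SIZE))
    return node_pos + rel if rel else None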
First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with Python. But to use Apache Lucene to solve your problem, you should first learn about indexing your data (the step in which your data is compressed and saved in a more efficient manner).
Also, looking at the indexing and wildcard query sections of an IR book will give you a better picture.

How to jump to the same line in two huge text files?

I am trying to use python to do some manipulations on huge text files, and by huge I mean over 100GB. Specifically, I'd like to take samples from the lines of the files. For example, let's say I have a file with ~300 million lines; I want to take just a million, write them to a new file and analyze them later to get some statistics. The problem is, I can't start from the first line, since the first fraction of the file does not represent the rest of it well enough. Therefore, I have to get about 20% into the file, and then start extracting lines. If I do it the naive way, it takes very long (20-30 minutes on my machine) to get to the 20% line. For example, let's assume again that my file has 300 million lines, and I want to start sampling from the 60,000,000th line (20%). I could do something like:
start_in_line = 60000000
sample_size = 1000000
with open(huge_file,'r') as f, open(out_file,'w') as fo:
    for x in range(start_in_line):
        f.readline()
    for y in range(sample_size):
        print(f.readline(),file=fo)
But as I said, this is very slow. I tried using some less naive ways, for example the itertools functions, but the improvement in running time was rather slight.
Therefore, I went for another approach - random seeks into the file. What I do is get the size of the file in bytes, calculate 20% of it and then seek to that byte. For example:
import os
huge_file_size = os.stat(huge_file).st_size
offset_percent = 20
sample_size = 1000000
start_point_byte = int(huge_file_size*offset_percent/100)
with open(huge_file) as f, open(out_file,'w') as fo:
    f.seek(start_point_byte)
    f.readline() # get to the start of the next line
    for y in range(sample_size):
        print(f.readline(),file=fo)
This approach works very nicely, BUT!
I always work with pairs of files. Let's call them R1 and R2. R1 and R2 will always have the same number of lines, and I run my sampling script on each one of them. It is crucial for my downstream analyses that the samples taken from R1 and R2 coordinate, regarding the lines sampled. For example, if I ended up starting sampling from line 60,111,123 of R1, I must start sampling from the very same line in R2. Even if I miss by one line, my analyses are doomed. If R1 and R2 are of exactly the same size (which is sometimes the case), then I have no problem, because my f.seek() will get me to the same place in both files. However, if the line lengths are different between the files, i.e. the total sizes of R1 and R2 are different, then I have a problem.
So, do you have any idea for a workaround, without having to resort to the naive iteration solution? Maybe there is a way to tell which line I am at, after performing the seek? (couldn't find one...) I am really out of ideas at this point, so any help/hint would be appreciated.
Thanks!
If the lines in each file can have different lengths, there is really no other way than to scan them first (unless there is some form of unique identifier on each line which is the same in both files).
Even if both files have the same length, there could still be lines with different lengths inside.
Now, if you're doing those statistics more than once on different parts of the same files, you could do the following:
do a one-time scan of both files and store the file positions of each line in a third file (preferably in binary form, 2 x 64-bit values, or at least a fixed width, so you can jump directly to the position pair of the line you want, which you can then calculate).
then just use those file positions to access the lines in both files (you could even calculate the size of the block you need from the difference between consecutive file positions in your third file).
When scanning both files at the same time, make sure you use some buffering to avoid a lot of hard disk seeks.
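A rough sketch of that one-time index in Python (file names are placeholders; each entry is a pair of 64-bit byte offsets packed with struct, and both data files should be opened in binary mode so the offsets are exact):
import struct

def build_index(r1_path, r2_path, index_path):
    """One-time scan: record the byte offset of every line of R1 and R2, pairwise."""
    with open(r1_path, 'rb') as r1, open(r2_path, 'rb') as r2, \
         open(index_path, 'wb') as idx:
        while True:
            pos1, pos2 = r1.tell(), r2.tell()
            if not r1.readline() or not r2.readline():
                break
            idx.write(struct.pack('<QQ', pos1, pos2))

def seek_both_to_line(index_path, lineno, r1, r2):
    """Jump both open ('rb') files to the start of the same 0-based line."""
    with open(index_path, 'rb') as idx:
        idx.seek(lineno * 16)                 # 2 x 8 bytes per line
        pos1, pos2 = struct.unpack('<QQ', idx.read(16))
    r1.seek(pos1)
    r2.seek(pos2)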
edit:
I don't know Python (I'm a C++ programmer), but I did a quick search and it seems Python also supports memory mapped files (mmap).
Using mmap you could speed things up dramatically (no need to do a readline each time just to know the positions of the lines): just map a view on part(s) of your file and scan through that mapped memory for the newline characters (\n, or 0x0a in hexadecimal). This should take no longer than the time it takes to read the file.
Unix files are just streams of characters, so there is no way to seek to a given line, or find the line number corresponding to a given character, or anything else of that form.
You can use standard utilities to find the character position of a line. For example,
head -n 60000000 /path/to/file | wc -c
will print the number of characters in the first 60,000,000 lines of /path/to/file.
While that may well be faster than using python, it is not going to be fast; it is limited by the speed of reading from disk. If you need to read 20GB, it's going to take minutes. But it would be worth trying at least once to calibrate your python programs.
If your files don't change, you could create indexes mapping line numbers to character position. Once you build the index, it would be extremely fast to seek to the desired line number. If it takes half an hour to read 20% of the file, it would take about five hours to construct two indexes, but if you only needed to do it once, you could leave it running overnight.
OK, so thanks for the interesting answers, but this is what I actually ended up doing:
First, I estimate the number of lines in the file, without actually counting them. Since my files are ASCII, I know that each character takes 1 byte, so I get the number of characters in, say, the first 100 lines, then get the size of the file and use these numbers to get a (quite rough) estimate of the number of lines. I should say here that although my lines might be of different lengths, they are within a limited range, so this estimate is reasonable.
Once I have that, I use a system call to the Linux sed command to extract a range of lines. So let's say that my file really has 300 million lines, and I estimated it to have 250 million lines (I get much better estimates, but it doesn't really matter in my case). I use an offset of 20%, so I'd like to start sampling from line 50,000,000 and take 1,000,000 lines. I do:
os.system("sed -n '50000000,51000000p;51000000q' in_file > out_file")
Note the 51000000q - without this, you'll end up running on the whole file.
This solution is not as fast as using random seeks, but it's good enough for me. It also includes some inaccuracy, but it doesn't bother me in this specific case.
I'd be glad to hear your opinion on this solution.
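For what it's worth, the estimation step described above might look roughly like this (the sample size and names are arbitrary):
import os
from itertools import islice

def estimate_line_count(path, sample_lines=100):
    """Estimate the total line count from the average length of the first lines."""
    with open(path, 'rb') as f:
        sample = list(islice(f, sample_lines))
    avg_len = sum(len(line) for line in sample) / len(sample)
    return int(os.stat(path).st_size / avg_len)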

Removing duplicates from BIG text file

I have a rather big text file, on average 30GB. I want to remove duplicate lines from this file. What is a good, efficient algorithm to do this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine (I am thinking of using C, or is it not language dependent, but rather the algorithm that is more important?). Thanks
If you can't just fire up an instance on amazon with enough memory to hold everything in RAM, this is the strategy I would use:
Step 1 - go through and generate a checksum/hash value for each line. I'd probably use SipHash. Output these to a file.
Step 2 - sort the file of SipHash values, and throw away any that only have one entry. Output the result as a set of hash values & number of matches.
Step 3 - read through the file, regenerating the hash value for each line. If it's a line whose hash has a match, hold onto it in memory. If there's another line already in memory with the same hash value, compare to see if the lines themselves match. Output "match" if true. If you've already seen all N lines that have the same hash value and they didn't match, go ahead and dispose of the record.
This strategy depends on the number of duplicates being only a small fraction of the number of total lines. If that's not the case, then I would use some other strategy, like a divide and conquer.
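A compressed sketch of steps 1-3, using blake2b from hashlib as a stand-in for SipHash (which is not in the standard library); step 2 is left to an external sort:
import hashlib

# Step 1: write one short, fixed-width hash per line.
with open('big.txt', 'rb') as src, open('hashes.txt', 'w') as out:   # placeholder names
    for line in src:
        out.write(hashlib.blake2b(line, digest_size=8).hexdigest() + '\n')

# Step 2: external sort, keeping only hashes that occur more than once, e.g.
#   sort hashes.txt | uniq -d > duplicate_hashes.txt
# Step 3: load duplicate_hashes.txt into a set, re-read big.txt, and only lines
# whose hash is in that set need to be held in memory and compared byte-for-byte
# against other lines with the same hash.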
