counting (large number of) strings within (very large) text - python

I've seen a couple of variations of the "efficiently search for strings within file(s)" question on Stack Overflow, but not quite like my situation.
I've got one text file which contains a relatively large number (>300K) of strings. The vast majority of these strings are multiple words (for ex., "Plessy v. Ferguson", "John Smith", etc.).
From there, I need to search through a very large set of text files (a set of legal docs totaling >10GB) and tally the instances of those strings.
Because of the number of search strings, the fact that the strings are multiple words, and the size of the search target, a lot of the "standard" solutions seem to fall by the wayside.
Some things simplify the problem a little -
I don't need sophisticated tokenizing / stemming / etc. (e.g. the only instances I care about are "Plessy v. Ferguson", don't need to worry about "Plessy", "Plessy et. al." etc.)
there will be some duplicates (for ex., multiple people named "John Smith"); however, this isn't a very statistically significant issue for this dataset, so if multiple John Smiths get conflated into a single tally, that's ok for now.
I only need to count these specific instances; I don't need to return search results
10 instances in 1 file count the same as 1 instance in each of 10 files
Any suggestions for quick / dirty ways to solve this problem?
I've investigated NLTK, Lucene & others but they appear to be overkill for the problem I'm trying to solve. Should I suck it up and import everything into a DB? bruteforce grep it 300k times? ;)
My preferred dev tool is Python.
The docs to be searched are primarily legal docs like this - http://www.lawnix.com/cases/plessy-ferguson.html
The intended results are tallies of how often each case is referenced across those docs -
"Plessy v. Ferguson: 15"

An easy way to solve this is to build a trie from your queries (simply a prefix tree: a list of nodes, each holding a single character), and when you search through your 10 GB of files you walk down the tree recursively for as long as the text matches.
This way you prune a lot of options really early on in your search for each character position in the big file, while still searching your whole solution space.
Time performance will be very good (as good as a lot of other, more complicated solutions) and you'll only need enough space to store the tree (a lot less than the whole array of strings) and a small buffer into the large file. Definitely a lot better than grepping a db 300k times...
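For illustration, here is a minimal sketch of that trie idea in pure Python (the function names and sample phrases are mine, not from the answer; a real run over 10 GB would read the files in chunks rather than as one big string):
from collections import Counter

def build_trie(phrases):
    # each node maps a character to a child node; a finished phrase is
    # stored under the special key None
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[None] = phrase
    return root

def count_matches(text, root):
    # at every position, walk the trie only as far as the text matches,
    # tallying any phrases that end along the way
    counts = Counter()
    for start in range(len(text)):
        node, i = root, start
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                counts[node[None]] += 1
    return counts

trie = build_trie(["Plessy v. Ferguson", "John Smith"])
print(count_matches("In Plessy v. Ferguson the Court held ...", trie))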

You have several constraints you must deal with, which makes this a complex problem.
Hard drive IO
Memory space
Processing time
I would suggest writing a multithreaded/multiprocess Python app. The libraries for spawning subprocesses are painless. Have each process read in a file and walk the trie suggested by Blindy. When it finishes, it returns the results to the parent, which writes them to a file.
This will use up as many resources as you can throw at it, while allowing for expansion. If you stick it on a beowulf cluster, it will transparently share the processes across your cpus for you.
The only sticking point is the hard drive IO. Break it into chunks on different hard drives, and as each process finishes, start a new one and load a file. If you're on linux, all of the files can coexist in the same filesystem namespace, and your program won't know the difference.
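As a rough sketch of that layout (the directory layout and phrase list below are assumptions, and the per-file counting uses a plain str.count stand-in so the example stays self-contained; swap in the trie or an automaton for the real workload):
import glob
from collections import Counter
from multiprocessing import Pool

PHRASES = ["Plessy v. Ferguson", "John Smith"]    # stand-in for the 300K strings

def count_in_file(path):
    # each worker reads one file and returns its own tally to the parent
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return Counter({p: text.count(p) for p in PHRASES})

if __name__ == "__main__":
    files = glob.glob("legal_docs/*.txt")         # assumed location of the corpus
    totals = Counter()
    with Pool() as pool:                          # one worker per CPU core by default
        for partial in pool.imap_unordered(count_in_file, files):
            totals.update(partial)
    for phrase, n in totals.most_common():
        print(f"{phrase}: {n}")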

The ugly brute-force solution won't work.
Time one grep through your documents and extrapolate the time it takes for 300K greps (and possibly try parallelizing it if you have many machines available); is it feasible? My guess is that 300K searches won't be. E.g., grepping for one search string through ~50 MB of files took me about ~5 s, so for 10 GB you'd expect ~1000 s, and repeating that 300K times means you'd be done in about 10 years on one computer. You can parallelize to get some improvement (limited by disk IO on one computer), but you'll still be quite limited. I assume you want the task finished in hours rather than months, so this isn't likely a solution.
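For reference, the extrapolation above works out roughly like this (same back-of-the-envelope numbers as in the answer):
secs_per_grep = 5 * (10_000 / 50)        # ~5 s per 50 MB, scaled to 10 GB ≈ 1000 s
total_secs = secs_per_grep * 300_000     # one grep per search string
print(total_secs / (3600 * 24 * 365))    # ≈ 9.5 years on a single machine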
So you are going to need to index the documents somehow. Lucene (say through pythonsolr) or Xapian should be suitable for your purpose. Index the documents, then search the indexed documents.

You should use a group pattern matching algorithm that reuses evaluation across patterns, i.e. Aho–Corasick. Implementations:
http://code.google.com/p/graph-expression/wiki/RegexpOptimization
http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html
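Both links above are Java; in Python, one option (my suggestion, not mentioned in the answer) is the pyahocorasick package, which builds the automaton once and then scans each document in a single pass:
# pip install pyahocorasick
from collections import Counter
import ahocorasick

phrases = ["Plessy v. Ferguson", "John Smith"]    # stand-in for the 300K search strings

automaton = ahocorasick.Automaton()
for phrase in phrases:
    automaton.add_word(phrase, phrase)            # value to return on a match
automaton.make_automaton()

counts = Counter()
with open("some_legal_doc.txt", encoding="utf-8", errors="ignore") as f:
    for _end_index, phrase in automaton.iter(f.read()):
        counts[phrase] += 1
print(counts)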

I don't know if this idea is extremely stupid or not, please let me know...
Divide the files to be searched into reasonably sized groups of 10/100/1000 files... and for each "chunk" use whatever indexing software is available. Here I'm thinking about ctags, GNU GLOBAL, or perhaps the ptx utility, or the technique described in this SO post.
Using this technique, you "only" need to search through the index files for the target strings.

Related

Optimizing a counter that loops through documents to check how many words from the document appear in a list

I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
    pos_words = 0
    for word in review:
        if word in pos_word_list:
            pos_words += 1
    reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold the names of 2M documents in both list and array form.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable name implies that your lexicon is a list, which has inherently linear lookup. Change this to a data structure amenable to faster search, such as a set (hash-based) or an appropriate search tree. Even a sorted list with an interpolation search would speed up your work.
Do consider using popular modules (such as collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long.
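A quick prototype along those lines might look like this (the sample data is a stand-in, the variable names follow the question, and the set conversion is the key change):
pos_word_list = ["GOOD", "GREAT", "EXCELLENT"]            # stand-in lexicon
reviews_upper = [["GOOD", "FOOD", "GREAT", "SERVICE"],
                 ["BAD", "EXPERIENCE"]]                   # stand-in corpus

pos_word_set = set(pos_word_list)     # O(1) membership test instead of an O(n) list scan
reviews_pos_wc = [sum(1 for word in review if word in pos_word_set)
                  for review in reviews_upper]
print(reviews_pos_wc)                 # -> [2, 0]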
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic (i.e. that something does provide a solution, especially the lexicon set).

Fast multiple search and replace in Python

For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements can easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.
Naive solution:
for search, replace in replacements.iteritems():
    text = text.replace(search, replace)
The regex method using re.sub is 10x slower:
for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but as generally implemented, these methods only work for searching the string, not also replacing it.
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
There's probably a better way than this:
re.sub('|'.join(replacements), lambda match: replacements[match.group()], text)
This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
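A slightly expanded version of that one-liner, with re.escape and longest-first ordering added on the assumption that phrases may contain regex metacharacters or share prefixes (neither detail comes from the answer):
import re

replacements = {"Plessy v. Ferguson": "Plessy", "John Smith": "J. Smith"}   # toy data

# escape the phrases and put longer ones first so overlapping alternatives
# don't shadow each other
pattern = re.compile("|".join(re.escape(s) for s in
                              sorted(replacements, key=len, reverse=True)))
text = "In Plessy v. Ferguson, John Smith argued ..."
print(pattern.sub(lambda m: replacements[m.group(0)], text))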
Outside of python, sed is usually used for this sort of thing.
For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:
sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt
You haven't posted any profiling of your code; you should try some timings before you do any premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
ALTERNATIVE
Ask: should I be doing this at all? -
You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells as it doesn't sound like great design. Unless there's an obvious reason not to you should be modifying whatever code you use to present and/or load that to do the search and replace as a subset of the data is being loaded/viewed. It's unlikely you'll be doing many operations on the entire 4GB of data so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
On a final point, you note that:
the speedup has to be algorithmic not chaining millions of sed calls
But you indicated that the data you're working with is a "single large text (~4GB)", so there shouldn't be any chaining involved if I understand you correctly.
UPDATE:
Below you indicate that the operation on a ~4KB file (I'm assuming) takes 90s, which seems very strange to me - sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then that works out to about 24 hours to evaluate (not ideal, but probably acceptable?).
I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.
You can get it with:
pip install fsed
"as generally implemented, these methods only work for searching the string, not also replacing it"
Perfect, that's exactly what you need. Searching with an inefficient algorithm in a 4 GB text is bad enough, but doing repeated replacements is probably even worse... you potentially have to move gigabytes of text to make space for the expansion/shrinking caused by the size difference of the source and target text.
Just find the locations, then join the pieces with the replacement parts.
So a dumb analogy would be "_".join("a b c".split(" ")), but of course you don't want to create copies the way split does.
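A minimal sketch of that find-then-join idea (the match locations here come from plain str.find rather than Boyer-Moore or Aho-Corasick, just to show the join step without copying the text over and over):
def replace_all(text, replacements):
    # collect (start, end, replacement) for every non-overlapping match
    spans = []
    for search, replace in replacements.items():
        start = text.find(search)
        while start != -1:
            spans.append((start, start + len(search), replace))
            start = text.find(search, start + len(search))
    spans.sort()
    # stitch the untouched pieces and the replacements back together once
    pieces, prev_end = [], 0
    for start, end, replace in spans:
        if start < prev_end:          # skip overlaps between different phrases
            continue
        pieces.append(text[prev_end:start])
        pieces.append(replace)
        prev_end = end
    pieces.append(text[prev_end:])
    return "".join(pieces)

print(replace_all("a b c a", {"a": "X", "c": "Y"}))   # -> "X b Y X"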
Note: any reason to do this in python?

Reportlab Wrapper

I'm looking for a Reportlab wrapper which does the heavy lifting for me.
I found this one, which looks promising.
It looks cumbersome to me to deal with the low-level API of Reportlab (especially the positioning of elements, etc.), and a library should facilitate at least this part.
My code for creating .pdfs is currently a maintenance hell consisting of positioning elements, taking care of which things should stick together, and logic to deal with varying lengths of input strings.
For example while creating pdf invoices, I have to give the user the ability to adjust the distance between two paragraphs. Currently I grab this info from the UI and then re-calculate the position of paragraph A and B based upon the input.
Besides looking for a wrapper to help me with this, it would be great if someone could point me to / provide a best-practice example of how to deal with positioning of elements, varying lengths of input strings, etc.
For future reference:
Having tested the lib PDFDocument, I can only recommend it. It takes away a lot of complexity, provides a lot of helper functions, and helps to keep your code clean. I found this resource really helpful to get started.

Increasing efficiency of Python copying large datasets

I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bear in mind, I'm well aware that Python is not intended for highly efficient number crunching. The choice was based more on wanting to get a deeper understanding of, and additional experience in, Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can make some performance improvement suggestions to my implementation. Running it through the profiler, it's obvious that the most time is being spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I'll split on columns with the highest information gain. The split consists of creating two new datasets with only the rows matching some criteria. The new datasets are generated by initializing two empty lists and appending the appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to preallocate abundant list space but then update the list size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in python?
Without seeing your code, it is really hard to give any specific suggestions, since optimisation is a code-dependent process that varies case by case. However, there are still some general things:
review your algorithm and try to reduce the number of loops. It seems you have a lot of loops, and some of them are deeply nested in other loops (I guess).
if possible, use higher-performance utility modules such as itertools instead of naive code written by yourself.
if you are interested, try PyPy (http://pypy.org/), a performance-oriented implementation of Python.
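As one concrete example of cutting an append loop (the column index and threshold below are placeholders, not taken from the question), the dataset split can be written as two list comprehensions:
dataset = [[2.7, 1.0, 0], [1.3, 3.1, 1], [3.6, 0.4, 0]]    # list-of-lists matrix
col, threshold = 0, 2.0

left = [row for row in dataset if row[col] < threshold]    # rows below the threshold
right = [row for row in dataset if row[col] >= threshold]  # rows at or above it
print(len(left), len(right))                               # -> 1 2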

Loop through list from specific point?

I was wondering if there was a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple dimensions, but if it's easier then I'll for sure do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion of views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.
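A hedged sketch of the numpy.memmap suggestion (the file name, dtype, and chunk size are assumptions about the data):
import numpy as np

# map the on-disk array without loading it into RAM
numbers = np.memmap("huge_numbers.dat", dtype=np.uint8, mode="r")

# process the list "from a specific point" by slicing it in chunks
start, chunk = 1_000_000, 10_000_000
for offset in range(start, numbers.shape[0], chunk):
    block = numbers[offset:offset + chunk]    # only this slice is actually read from disk
    # ... process block ...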
