For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements can easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.
Naive solution:
for search, replace in replacements.iteritems():
    text = text.replace(search, replace)
The regex method using re.sub is 10x slower:
for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but as these methods are generally implemented, they only work for searching the string and not also replacing it.
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
There's probably a better way than this:
re.sub('|'.join(replacements), lambda match: replacements[match.group()], text)
This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
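One hedged refinement of that one-liner (assuming the search phrases are literal strings and replacements is the dict from the question): escape each key so it is matched literally, and try longer phrases first so overlapping keys behave.

import re

# Compile one alternation over all search phrases; re.escape treats them as
# literals, and sorting longest-first lets longer phrases win over their prefixes.
pattern = re.compile("|".join(
    re.escape(key) for key in sorted(replacements, key=len, reverse=True)))

# Single pass over the text; the callback looks up each match's replacement.
text = pattern.sub(lambda m: replacements[m.group(0)], text)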
Outside of python, sed is usually used for this sort of thing.
For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:
sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt
You haven't posted any profiling of your code; you should do some timings before engaging in premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
ALTERNATIVE
Ask yourself: should I be doing this at all?
You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that the search and replace is done on the subset of data being loaded or viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
On a final point, you note that:
the speedup has to be algorithmic not chaining millions of sed calls
But you indicated that the data you're working with is a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand what you mean by that correctly.
UPDATE:
Below you indicate that the operation on a ~4KB file (I'm assuming) takes 90s; this seems very strange to me, since sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then it should take about 24 hours to evaluate (not ideal, but probably acceptable?)
I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.
You can get it with:
pip install fsed
"these methods as they are generally implemented only work for searching the string and not also replacing it"
Perfect, that's exactly what you need. Searching with an inefficient algorithm over a 4GB text is bad enough, but doing several replacements is probably even worse... you potentially have to move gigabytes of text to make room for the expansion/shrinking caused by the size difference between the source and target text.
Just find the locations, then join the pieces with the replacement parts.
So a dumb analogy would be "_".join( "a b c".split(" ") ), but of course you don't want to create copies the way split does.
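A rough sketch of that approach (assuming the phrases are literal strings and a single alternation regex is acceptable as the "find the locations" step): collect the match boundaries in one pass, then join the untouched slices with the replacement parts.

import re

def replace_all(text, replacements):
    # One alternation over all phrases, escaped as literals and tried
    # longest-first so overlapping keys match greedily.
    pattern = re.compile("|".join(
        re.escape(key) for key in sorted(replacements, key=len, reverse=True)))
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(text[last:match.start()])      # untouched slice
        pieces.append(replacements[match.group(0)])  # replacement part
        last = match.end()
    pieces.append(text[last:])
    return "".join(pieces)  # one join instead of moving gigabytes repeatedly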
Note: any reason to do this in python?
Related
I have the following text
text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"
And I would like to extract this list of substrings
['C1234567', 'CM123456', 'F1234567']
This is what I came up with
new_string = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')
new_string.findall(text)
However, I was wondering if there's a way to do this faster since I'm interested in performing this operation tens of thousands of times.
I thought I could use ^ to match the beginning of string, but the regex expression I came up with
new_string = re.compile(r'\b(^C[M0-9]\d{6}|^[FM]\d{7})\b')
Doesn't return anything anymore. I know this is a very basic question, but I'm not sure how to use the ^ properly.
Good news and bad news. The bad news: your regex looks pretty good, so it's going to be hard to improve. The good news: I have some ideas :) I would try a little outside-the-box thinking if you are looking for performance. I do Extract Transform Load work, and a lot of it with Python.
You are already doing the re.compile (big help)
The regex engine works left to right, so short-circuit where you can. That doesn't seem to apply here.
If you have a big chunk of data that you are going to loop over multiple times, clean it up front ONCE of stuff you KNOW won't match. Think of an HTML page: if you only care about what's in the HEAD, extract the HEAD once and run your many regex loops over just that section, not the whole page. Seems obvious, but it isn't always :)
Use some metrics; give cProfile a try. Maybe there is some logic around where you are regexing that you can speed up. At the very least you'll find your bottleneck; maybe the regex isn't the problem at all. (A quick sketch of this is below.)
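A quick sketch of that last point (the sample text and repetition count are made up for illustration): profile the extraction loop once and see where the time actually goes before tuning the regex further.

import cProfile
import re

pattern = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')  # compiled once, reused

def extract_ids(texts):
    # The hot loop: one findall per input string.
    return [pattern.findall(t) for t in texts]

# Hypothetical workload standing in for the tens of thousands of real calls.
sample = ["This is a string with C1234567 and CM123456, CM123, F1234567"] * 10000
cProfile.run("extract_ids(sample)")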
I'm working on getting Twitter trends using tweepy in Python, and I'm able to find the world's top 50 trends, so as a sample I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore non English words)
So here I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the styles below -
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
So keeping all of these in mind, I thought that if I can split the string below, then all of the above cases will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub, but it is not giving me the desired results.
Regex is the wrong tool for the job. You need a clearly defined pattern in order to write a good regex, and in this case you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, "THATSand" and differentiate between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same inputs? Now you'd need to get a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.
Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of ipython Notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: Read the file, tokenize on whitespaces and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"]
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):
        ...  # Get node's keyword, identifier etc., and line number -- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator comprehension (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
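A minimal sketch of that traversal (the function name and the "first line where each node type appears" indexing scheme are my own assumptions, not part of the answer above):

import ast

def first_occurrences(source_path):
    # Map each AST node type (e.g. 'For', 'ListComp', 'FunctionDef') to the
    # first line on which it appears in the given source file.
    with open(source_path) as fp:
        tree = ast.parse(fp.read(), filename=source_path)
    first_seen = {}
    for node in ast.walk(tree):
        lineno = getattr(node, "lineno", None)  # Module, Load, etc. have no lineno
        if lineno is None:
            continue
        name = type(node).__name__
        # ast.walk is breadth-first, not in line order, so keep the minimum.
        first_seen[name] = min(first_seen.get(name, lineno), lineno)
    return first_seen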
I've seen a couple variations of the "efficiently search for strings within file(s)" question on Stackoverflow but not quite like my situation.
I've got one text file which contains a relatively large number (>300K) of strings. The vast majority of these strings are multiple words (for ex., "Plessy v. Ferguson", "John Smith", etc.).
From there, I need to search through a very large set of text files (a set of legal docs totaling >10GB) and tally the instances of those strings.
Because of the number of search strings, the fact that the strings are multiple words, and the size of the search target, a lot of the "standard" solutions seem to fall by the wayside.
Some things simplify the problem a little -
I don't need sophisticated tokenizing / stemming / etc. (e.g. the only instances I care about are "Plessy v. Ferguson", don't need to worry about "Plessy", "Plessy et. al." etc.)
there will be some duplicates (for ex., multiple people named "John Smith"); however, this isn't a very statistically significant issue for this dataset, so... if multiple John Smiths get conflated into a single tally, that's ok for now.
I only need to count these specific instances; I don't need to return search results
10 instances in 1 file count the same as 1 instance in each of 10 files
Any suggestions for quick / dirty ways to solve this problem?
I've investigated NLTK, Lucene & others but they appear to be overkill for the problem I'm trying to solve. Should I suck it up and import everything into a DB? bruteforce grep it 300k times? ;)
My preferred dev tool is Python.
The docs to be searched are primarily legal docs like this - http://www.lawnix.com/cases/plessy-ferguson.html
The intended results are tallies of how often each case is referenced across those docs -
"Plessy v. Ferguson: 15"
An easy way to solve this is to build a trie with your queries (simply a prefix tree, list of nodes with a single character inside), and when you search through your 10gb file you go through your tree recursively as the text matches.
This way you prune a lot of options really early on in your search for each character position in the big file, while still searching your whole solution space.
Time performance will be very good (as good as a lot of other, more complicated solutions) and you'll only need enough space to store the tree (a lot less than the whole array of strings) and a small buffer into the large file. Definitely a lot better than grepping a db 300k times...
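A minimal sketch of that idea in Python, using plain nested dicts and no Aho-Corasick failure links, so it restarts the walk at every text position (helper names are mine):

from collections import defaultdict

def build_trie(phrases):
    # Minimal prefix tree: each node maps the next character to a child node;
    # the sentinel key None marks the end of a complete phrase.
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[None] = phrase  # store the full phrase at its terminal node
    return root

def count_matches(text, root):
    # From every position in the text, walk the trie as far as characters keep
    # matching and tally every phrase that ends along the way.
    counts = defaultdict(int)
    n = len(text)
    for i in range(n):
        node = root
        j = i
        while j < n and text[j] in node:
            node = node[text[j]]
            if None in node:
                counts[node[None]] += 1
            j += 1
    return counts

Building the trie for 300K phrases is cheap and the scan dominates; adding proper failure links (or using a ready-made package such as pyahocorasick) would avoid re-walking from every position.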
You have several constraints you must deal with, which makes this a complex problem.
Hard drive IO
Memory space
Processing time
I would suggest writing a multithreaded/multiprocess Python app; the libraries for spawning subprocesses are painless. Have each process read in a file and search it with the prefix tree suggested by Blindy; when it finishes, it returns the results to the parent, which writes them to a file. (A rough sketch follows below.)
This will use up as many resources as you can throw at it, while allowing for expansion. If you stick it on a beowulf cluster, it will transparently share the processes across your cpus for you.
The only sticking point is the hard drive IO. Break it into chunks on different hard drives, and as each process finishes, start a new one and load a file. If you're on linux, all of the files can coexist in the same filesystem namespace, and your program won't know the difference.
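A hedged driver along those lines (the docs/*.txt layout, queries.txt, and the trie_match module are assumptions of mine; it reuses the build_trie/count_matches helpers sketched in the trie answer above):

import glob
from functools import partial
from multiprocessing import Pool

# Hypothetical local module holding the trie helpers sketched earlier.
from trie_match import build_trie, count_matches

def tally_file(path, trie):
    # Worker: read one document and count matches against the shared trie.
    with open(path, encoding="utf-8", errors="ignore") as fp:
        return count_matches(fp.read(), trie)

if __name__ == "__main__":
    with open("queries.txt") as fp:
        trie = build_trie(line.strip() for line in fp if line.strip())
    totals = {}
    with Pool() as pool:
        # Each worker handles one document; the parent merges the tallies.
        for counts in pool.imap_unordered(partial(tally_file, trie=trie),
                                          glob.glob("docs/*.txt")):
            for phrase, n in counts.items():
                totals[phrase] = totals.get(phrase, 0) + n
    for phrase, n in sorted(totals.items()):
        print("%s: %d" % (phrase, n))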
The ugly brute-force solution won't work.
Time one grep through your documents and extrapolate the time it would take for 300k greps (possibly parallelizing it if you have many machines available). Is that feasible? My guess is that 300k searches won't be: e.g., grepping one search through ~50 MB of files took me about ~5s, so 10 GB would take ~1000s, and repeating that 300k times means you'd be done in about 10 years on one computer. You can parallelize to get some improvement (limited by disk IO on one computer), but it will still be quite limited. I assume you want the task finished in hours rather than months, so this isn't likely a solution.
So you are going to need to index the documents somehow. Lucene (say through pythonsolr) or Xapian should be suitable for your purpose. Index the documents, then search the indexed documents.
You should use a multi-pattern matching algorithm that reuses evaluation across patterns, i.e. Aho-Corasick. Implementations:
http://code.google.com/p/graph-expression/wiki/RegexpOptimization
http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html
I don't know if this idea is extremely stupid or not, please let me know...
Divide the files to be searched into reasonably sized groups of 10/100/1000... and for each "chunk" use one of the indexing tools available. Here I'm thinking about ctags, GNU GLOBAL, perhaps the ptx utility, or the technique described in this SO post.
Using this technique, you "only" need to search through the index files for the target strings.
I've got the entire contents of a text file (at least a few KB) in string myStr.
Will the following code create a copy of the string (less the first character) in memory?
myStr = myStr[1:]
I'm hoping it just refers to a different location in the same internal buffer. If not, is there a more efficient way to do this?
Thanks!
Note: I'm using Python 2.5.
At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd since, with immutable strings, it should be a relatively easy thing to do.
Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
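If you did want to try the buffer route anyway (Python 2 only, matching your 2.5 setup; a sketch, and note that buffer objects don't support the usual string methods):

view = buffer(myStr, 1)   # zero-copy, read-only view of myStr from offset 1
print len(view)           # length of the view; no data copied
print view[:20]           # slicing a buffer copies out just that slice as a str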
As with most garbage-collected languages, strings are created as often as needed, which is very often. This is because tracking substrings as described would make garbage collection more difficult.
What is the actual algorithm you are trying to implement? It might be possible to give you advice on ways to get better results if we knew a bit more about it.
As for an alternative, what is it you really need to do? Could you use a different way of looking at the issue, such as just keeping an integer index into the string? Could you use an array.array('u')?
One (albeit slightly hacky) solution would be something like this:
f = open("test.c")
f.read(1)         # read and discard the first character
myStr = f.read()  # read the rest of the file
print myStr
It will skip the first character, and then read the data into your string variable.
Depending on what you are doing, itertools.islice may be a suitable memory-efficient solution (should one become necessary).
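For example (a sketch; process() is a hypothetical stand-in for whatever per-character work you need to do):

from itertools import islice

# Iterate over everything after the first character without building a copy.
for ch in islice(myStr, 1, None):
    process(ch)  # hypothetical per-character handler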