I've got the entire contents of a text file (at least a few KB) in string myStr.
Will the following code create a copy of the string (less the first character) in memory?
myStr = myStr[1:]
I'm hoping it just refers to a different location in the same internal buffer. If not, is there a more efficient way to do this?
Thanks!
Note: I'm using Python 2.5.
At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd, since with immutable strings it should be a relatively easy thing to do.
Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
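If you do want to experiment with that route, here is a minimal sketch using the Python 2 built-in buffer type (the string literal is a stand-in for your real data):

myStr = "contents of the file"       # stand-in for the real data
view = buffer(myStr, 1)              # zero-copy view starting at offset 1
print len(view)                      # len(myStr) - 1
# Converting back with str(view) copies the data again, so this only helps
# if whatever consumes the data accepts buffer objects.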
As with most garbage-collected languages, strings are created as often as needed, which is very often. The reason is that tracking substrings as described would make garbage collection more difficult.
What is the actual algorithm you are trying to implement? It might be possible to give you advice on ways to get better results if we knew a bit more about it.
As for an alternative, what is it you really need to do? Could you use a different way of looking at the issue, such as just keeping an integer index into the string? Could you use an array.array('u')?
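For instance, a rough sketch of the integer-index idea (the processing step is just a placeholder):

myStr = "contents of the file"   # stand-in for the real data
pos = 1                          # logically skip the first character
while pos < len(myStr):
    ch = myStr[pos]              # work one character at a time
    pos += 1                     # advance the index instead of re-slicing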
One (albeit slightly hacky) solution would be something like this:
f = open("test.c")
f.read(1)
myStr = f.read()
print myStr
It will skip the first character, and then read the data into your string variable.
Depending on what you are doing, itertools.islice may be a suitable memory-efficient solution (should one become necessary).
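For example, a minimal sketch (the string literal is a stand-in for your real data):

from itertools import islice

myStr = "contents of the file"        # stand-in for the real data
for ch in islice(myStr, 1, None):     # iterates from index 1 onward, without a sliced copy
    pass                              # process each character here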
I am required to access the leftmost element of a Python list, while popping elements from the left. One example of this usage could be the merge operation of a merge sort.
There are two good ways of doing it.
collections.deque
Using indexes, keep an increasing integer as I pop.
Both of these methods seem efficient in complexity terms (though I should mention that converting the list to a deque at the beginning of the program costs an extra O(n)).
So in terms of style and speed, which is best for Python usage? I did not see an official recommendation and could not find a similar question, which is why I am asking.
Thanks in advance.
Definitely collections.deque. First, it's already written, so you don't have to write it. Second, it's written in C, so it is probably much faster than a Python re-implementation. Third, using an index leaves the head of the list allocated but unused, whereas a deque frees what you pop; appending to a standard list while only advancing a starting index on each left pop is very inefficient, because the list has to be reallocated more often.
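For illustration, a minimal sketch of the deque approach (the values are just an example):

from collections import deque

d = deque([1, 2, 3, 4])      # one-time O(n) conversion from the list
while d:
    x = d.popleft()          # O(1), unlike list.pop(0) which is O(n)
    print x                  # process the leftmost element here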
For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.
Naive solution:
for search, replace in replacements.iteritems():
    text = text.replace(search, replace)
The regex method using re.sub is about 10x slower:
for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but these methods, as they are generally implemented, only work for searching the string and not also replacing it.
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
There's probably a better way than this:
re.sub('|'.join(replacements), lambda match: replacements[match.group()], text)
This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
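A slightly fuller sketch of that idea, with the phrases escaped so they match literally (the dict contents here are purely illustrative):

import re

replacements = {'ugly': 'beautiful', 'colour': 'color'}   # illustrative
text = "the ugly colour"

# Escape the phrases and try longer ones first so overlapping phrases
# prefer the longer match.
pattern = re.compile('|'.join(re.escape(s)
                              for s in sorted(replacements, key=len, reverse=True)))
text = pattern.sub(lambda m: replacements[m.group()], text)
print text   # the beautiful color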
Outside of Python, sed is usually used for this sort of thing.
For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:
sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt
You haven't posted any profiling of your code; you should do some timing before any premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.
ALTERNATIVE
Ask yourself: should I be doing this at all?
You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, because it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that the search and replace happens on the subset of data being loaded or viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.
On a final point, you note that:
the speedup has to be algorithmic not chaining millions of sed calls
But you indicated that the data you're working with is a "single large text (~4GB)", so there shouldn't be any chaining involved, if I understand what you mean by that correctly.
UPDATE:
Below you indicate that doing the operation on a ~4KB file (I'm assuming) takes 90 seconds, which seems very strange to me; sed operations don't normally take anywhere near that long. If the file is actually 4MB (I'm hoping), then it should take around 24 hours to evaluate (not ideal, but probably acceptable?).
I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.
You can get it with:
pip install fsed
"these methods as they are generally implemented only work for searching the string and not also replacing it"
Perfect, that's exactly what you need. Searching with an ineffective algorithm in a 4GB text is bad enough, but doing several replacements in place is probably even worse: you potentially have to move gigabytes of text to make room for the expansion or shrinking caused by the size difference of the source and target text.
Just find the locations, then join the pieces with the replacements parts.
So a dumb analogy would be "_".join( "a b c".split(" ") ), but of course you don't want to create copies the way split does.
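A minimal sketch of that find-then-join idea; here a single compiled regex stands in for the Aho-Corasick matcher, and the data is illustrative:

import re

replacements = {'old': 'ancient', 'slow': 'sluggish'}   # illustrative
text = "an old and slow example"

pattern = re.compile('|'.join(re.escape(k) for k in replacements))
pieces, last = [], 0
for m in pattern.finditer(text):
    pieces.append(text[last:m.start()])       # untouched chunk
    pieces.append(replacements[m.group()])    # replacement part
    last = m.end()
pieces.append(text[last:])
text = ''.join(pieces)                        # one join instead of repeated shifting
print text   # an ancient and sluggish example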
Note: any reason to do this in python?
If I instantiate/update a few lists very, very few times (in most cases only once), but I check for the existence of an object in that list a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or, in other words, is it worth it for me to convert lists into dictionaries to achieve possibly faster object-existence checks?
Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.
You should be using a set, since from your description I am guessing you wouldn't have a value to associate. See Python: List vs Dict for look up table for more info.
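For example, a minimal sketch of the set-based check (the names and values are illustrative):

my_list = ['red', 'green', 'blue']   # built once, rarely updated
allowed = set(my_list)               # one-time O(n) conversion

if 'green' in allowed:               # average O(1), vs O(n) for the list
    print 'found'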
Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However, consider that PyPy might do the linear search 100 times faster than CPython, in which case "a few" times might really be "dozens". In other words, sometimes the constant factor of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find it becoming a bottleneck as the system scales than the other way around.
If you really need to micro-tune everything, keep in mind that the implementation, CPU cache, etc. can affect the results, so you may need to re-tune differently for different platforms. And if you need performance that badly, Python was probably a bad choice, although maybe you can pull the hotspots out into C. :)
Random access (lookup) in a dictionary is faster, but building the hash table consumes more memory.
More performance = more memory usage.
It depends on how many items are in your list.
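If in doubt, it's easy to measure for your own sizes; a rough sketch using timeit (the sizes are illustrative):

import timeit

setup = "items = range(10000); s = set(items)"
print timeit.timeit('9999 in items', setup=setup, number=10000)   # linear scan of the list
print timeit.timeit('9999 in s', setup=setup, number=10000)       # hash lookup in the set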
I am developing an AI to perform MDP: I am getting states (just integers in this case) and assigning each a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for deletion) and has very fast get/update operations. Is there something faster than the regular dictionary? I am open to anything, really: native Python, open source, I just need fast getting.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None for i in xrange(totalitems)]
Then, when you need to "add" an item, just set the corresponding value.
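For instance, continuing the snippet above (state and value are purely illustrative):

state = 42               # an integer key from the problem domain
mylist[state] = 3.14     # "adding" an item is just an index assignment
value = mylist[state]    # O(1) lookup; None means "not set yet"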
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
If you want a rapid prototype, use Python, and don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra), write it in C or C++ (maybe only to call from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.
I was wondering if there is a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have as many as almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to use multiple dimensions, but if it's easier then I'll for sure do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
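For example, a hedged sketch of the memmap route (the file name and dtype are assumptions):

import numpy as np

# Map a large binary file of 64-bit integers without loading it into RAM.
data = np.memmap('numbers.bin', dtype='int64', mode='r')

chunk = data[1000000:1001000]   # process the data from a specific point
print chunk.sum()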
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion of views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.