I was working on a problem recently that required me to go through a very large folder (~600,000 files) and return the list of filenames that matched a certain criterion. The original version was a normal list comprehension stored in a variable. This isn't the actual code but it gives the gist.
def filter_files(file_path):
    filtered = [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
    return filtered
When monitoring this version, it would start out fast and then get progressively slower and slower. I presume that's because it's trying to store so much data in the variable.
I then rewrote it to be:
def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
And called it like:
def test(file_path):
    filtered = filter_files(file_path)
This one never slows down. It maintains the same speed the entire time.
My question is: what, under the hood, causes this difference? The data is still being stored in a variable, and it's still being built by a list comprehension. What about writing the comprehension directly in the return statement avoids the issues of the first version? Thanks!
There is no difference between those two pieces of code. None at all. Both of them are creating a list, and then managing a reference to that list.
The likely cause of your issue is caching. On the first run, the file system has to keep going out to the disk over and over to fetch more directory entries. Once that run finished, the directory was sitting in the OS file cache and could be read back immediately. Reboot and try again, and you'll see the second version take just as long as the first one did.
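If you want to convince yourself the two versions really are equivalent, you can compare their bytecode with the dis module. A minimal sketch, using simplified stand-ins for the two functions:

import dis
import os

def filter_files_v1(file_path):
    # first version: bind the list to a name, then return that name
    filtered = [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
    return filtered

def filter_files_v2(file_path):
    # second version: return the list comprehension directly
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]

# Apart from one extra STORE_FAST/LOAD_FAST pair in v1, the bytecode is the same:
# both build the complete list in memory before returning it.
dis.dis(filter_files_v1)
dis.dis(filter_files_v2)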
I am processing some text files to search for patterns and count them. As the files are very large, processing time is an important issue. I have a Python script that updates the counters and stores them in MongoDB. In order to make it work faster, I am trying to reduce the number of DB operations.
The original version incremented the counter on every single occurrence:
mlcol.find_one_and_update(
    {"connip": conip},
    {"$inc": {ts: 1}},
    upsert=True
)
As this took too long, I instead kept the counters in memory, in dictionaries, and periodically went through this data to store it:
for conip in conCounter.keys():
    d = conCounter[conip]
    for ts in d.keys():
        mlcol.find_one_and_update(
            {"connip": conip},
            {"$inc": {ts: d[ts]}},
            upsert=True
        )
This way the process is much faster, but I see that it still takes a very long time to update every single counter individually.
Is there a way to launch multiple updates in a single command?
Any other idea to make this go faster?
As explained by Alex Blex, creating an index and using bulk execution solved the issue:
mlcol.create_index("connip")

bulk = mlcol.initialize_unordered_bulk_op()
for conip in conCounter.keys():
    d = conCounter[conip]
    for ts in d.keys():
        bulk.find({"connip": conip}).upsert().update({"$inc": {ts: d[ts]}})
res = bulk.execute()
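Note that initialize_unordered_bulk_op() has since been removed from PyMongo. On a recent driver, the equivalent approach (a sketch, assuming the same conCounter structure as above) is bulk_write() with UpdateOne requests:

from pymongo import UpdateOne

mlcol.create_index("connip")

requests = []
for conip, d in conCounter.items():
    for ts, count in d.items():
        requests.append(UpdateOne({"connip": conip}, {"$inc": {ts: count}}, upsert=True))

if requests:
    res = mlcol.bulk_write(requests, ordered=False)  # unordered, like the old bulk op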
I am writing a piece of code that generates new parameter values over a double FOR loop and stores these values to a file. The iteration count can go as high as 10,000 * 100,000. I store the values in a string, which gets appended with newer values on every iteration. Finally, at the end of the loops, I write the complete string to a txt file.
op = open("output file path", "w+")
totresult = ""
for n seconds:                      # this user input parameter can be up to 100,000
    result = ""
    for car in (cars running):      # number of cars can be 10,000
        # Code to check if given car is in range of another car
        ...
        # if car is in range of another car
        if distance < 1000:
            result = getDetailsofOtherCar()
            totresult = totresult + carName + result
# end of loops
op.write(totresult)
op.close()
My question here is: is there a better, more Pythonic way to perform this kind of logging? I am guessing the string gets very bulky in the later iterations and may be causing delays in execution. Is a string the best possible option for storing the values, or should I consider other Python data structures like a list or array? I came across the logging module but would like to get an opinion before switching to it.
I tried looking up similar issues but found nothing matching my current doubt.
Open to any suggestions
Thank you
Edit: code added
You can write to the file as you go, e.g.:
with open("output.txt", "w") as log:
    for i in range(10):
        for j in range(10):
            log.write(str((i, j)))
Update: whether or not directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine this seems to kick in around 350MB.
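If you do want to limit the number of write calls, a middle ground is to buffer records in a list and flush it periodically with writelines() instead of concatenating one giant string. A minimal, self-contained sketch (the loop sizes and record format are made up for illustration):

BATCH_SIZE = 1000

with open("output.txt", "w") as op:
    batch = []
    for i in range(1000):                        # outer loop, e.g. seconds
        for j in range(100):                     # inner loop, e.g. cars
            batch.append("car%d;%d\n" % (j, i))  # hypothetical record format
            if len(batch) >= BATCH_SIZE:
                op.writelines(batch)             # flush periodically, not only at the very end
                batch = []
    op.writelines(batch)                         # flush whatever is left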
I have experienced this in other languages, and now I have the same problem in Python. I have a dictionary that undergoes a lot of CRUD actions. One would assume that deleting elements from a dictionary would decrease its memory footprint, but that's not the case. Once a dictionary grows in size (usually by doubling), it never(?) releases the allocated memory back. I have run this experiment:
import random
import sys
import uuid

a = {}
for i in range(0, 100000):
    a[uuid.uuid4()] = uuid.uuid4()
    if i % 1000 == 0:
        print sys.getsizeof(a)

for i in range(0, 100000):
    e = random.choice(a.keys())
    del a[e]
    if i % 1000 == 0:
        print sys.getsizeof(a)

print len(a)
The last size printed by the first loop is 6291736. The last size printed by the second loop is 6291736 as well, yet the length of the dictionary is 0.
So how to tackle this issue? Is there a way to force release of memory?
PS: don't really need to do random - I played with the range of the second loop.
The way to do this "rehashing" so it uses less memory is to create a new dictionary and copy the content over.
The Python dictionary implementation is explained really well in this video:
https://youtu.be/C4Kc8xzcA68
There is an attendee asking this same question (https://youtu.be/C4Kc8xzcA68?t=1593), and the answer given by the speaker is:
Resizes are only calculated upon insertion; as a dictionary shrinks it just gains a lot of dummy entries and as you refill it will just start reusing those to store keys. [...] you have to copy the keys and values out to a new dictionary
Actually a dictionary can shrink upon resize, but the resize only happens upon a key insert not removal. Here's a comment from the CPython source for dictresize:
Restructure the table by allocating a new table and reinserting all
items again. When entries have been deleted, the new table may
actually be smaller than the old one.
By the way, since the other answer quotes Brandon Rhodes' talk on the dictionary at PyCon 2010, and the quote seems to be at odds with the above (which has been there for years), I thought I would include the full quote, with the previously missing part included.
Resizes are only calculated upon insertion. As a dictionary shrinks,
it just gains a lot of dummy entries and as you refill it, it will
just start re-using those to store keys. It will not resize until you
manage to make it two-thirds full again at its larger size. So it
does not resize as you delete keys. You have to do an insert to get
it to figure out it needs to shrink.
So he does say the resizing operation can "figure out [the dictionary] needs to shrink". But that only happens on insert. Apparently when copying over all the keys during resize, the dummy keys can get removed, reducing the size of the backing array.
It isn't clear, however, how to get this to happen, which is why Rhodes says to just copy over everything to a new dictionary.
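In practice, then, the way to reclaim the space is the copy the other answer describes. A minimal sketch (the exact sizes printed vary by Python version and platform):

import sys
import uuid

a = {uuid.uuid4(): uuid.uuid4() for _ in range(100000)}
for key in list(a.keys()):
    del a[key]

print(sys.getsizeof(a))  # still the grown size, even though the dict is now empty
a = dict(a)              # copying into a fresh dict drops the oversized table
print(sys.getsizeof(a))  # back to the size of a small, empty dict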
It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.
It's a simple multi-threaded binary data downloader and unpacker.
def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()
There are 15 threads running. Each request gets about 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?
I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient mode.
Edit: Here is the output of "cat /proc/meminfo"
MemTotal: 7975216 kB
MemFree: 732368 kB
Buffers: 38032 kB
Cached: 4365664 kB
SwapCached: 14016 kB
Active: 2182264 kB
Inactive: 4836612 kB
Like others have said, you need at least the following two changes:
Do not create a huge list of integers with range
# use xrange
for i in xrange(0, count):
    # UNPACK FIXED LENGTH OF BINARY DATA HERE
    yield (field1, field2, field3)
Do not create a huge string as the full file body to be written at once
# use writelines
f = open(filename, 'w')
f.writelines((datum + os.linesep) for datum in data)
f.close()
Even better, you could write the file as:
items = GetData(url)
f = open(filename, 'w')
for item in items:
    f.write(';'.join(item) + os.linesep)
f.close()
The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up about 200MB of your memory; with 15 threads each holding such a list, that's 3GB.
But also, don't read the whole 15MB response into data; read bit by bit from the response. Sticking those 15MB into a variable will use 15MB more memory than reading bit by bit from the response.
You might want to consider simply extracting data until you run out of input, and comparing the count of records you extracted with what the first bytes said it should be. Then you need neither range() nor xrange(). Seems more pythonic to me. :)
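A sketch of that idea, assuming each record is a fixed-length block following the 4-byte count header (the record format here is made up for illustration):

import struct

RECORD_FORMAT = "!IIf"                        # hypothetical fixed-length record layout
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def parse_records(data):
    expected = struct.unpack_from("!I", data, 0)[0]
    offset, parsed = 4, 0
    while offset + RECORD_SIZE <= len(data):  # extract until we run out of input
        yield struct.unpack_from(RECORD_FORMAT, data, offset)
        offset += RECORD_SIZE
        parsed += 1
    if parsed != expected:                    # cross-check against the header count
        raise ValueError("expected %d records, got %d" % (expected, parsed))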
Consider using xrange() instead of range(): xrange produces its values lazily, whereas range() builds the whole list in memory.
I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.
Currently you keep both in memory at the same time, and that is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.
Also, the line
f.write(os.linesep.join(data))
may actually mean you've temporarily got a third copy in memory (a big string containing the entire output file).
So I'd say you're doing it in quite an inefficient way, keeping the entire input file, the entire output file, and a fair amount of intermediate data in memory at once.
Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.
Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.
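A sketch of chunked reading (urllib2 response objects support read(n); the record layout is again a made-up fixed-length format, not the one from the question):

import struct
import urllib2

RECORD_FORMAT = "!IIf"                          # hypothetical fixed-length record layout
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def GetDataStreaming(url):
    response = urllib2.urlopen(urllib2.Request(url))
    try:
        count = struct.unpack("!I", response.read(4))[0]
        for _ in xrange(count):
            chunk = response.read(RECORD_SIZE)  # read one fixed-length record at a time
            yield struct.unpack(RECORD_FORMAT, chunk)
    finally:
        response.close()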
The last line should surely be f.close()? Those trailing parens are kinda important.
You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each line as it is read. This will make the remote servers wait for you, of course, but that's okay.
Python is just not very memory efficient. It wasn't built for that.
You could do more of your work in compiled C code if you convert this to a list comprehension:
data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))
to:
data = [';'.join(items) for items in GetData(url)]
Note that because GetData contains a yield, it returns a generator that produces one 3-tuple per record, so iterating over it hands you one triplet at a time. The list comprehension is therefore behaviorally equivalent to your original loop: each element of data ends up as the three fields of one record joined by semicolons (';'.join(("A", "B", "C")) gives back "A;B;C"). The difference is that the comprehension builds the list in a single pass of compiled code rather than through repeated data.append() calls and method lookups.
This may also help somewhat with your original memory problem, as you create fewer intermediate Python objects along the way. Remember that a value in Python has much more overhead than one in a language like C: every value is itself an object, and every name reference to that object adds more overhead, so the theoretical storage requirement can easily expand several-fold. In your case, 15MB x 15 threads = 225MB of raw data, plus the per-object overhead of every joined string held in your data list, can quickly grow toward the 2GB you observed.
There are two obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - these two will take about 500MB), and there are probably other places in the skipped code. Both are easy to make memory efficient: use response.read(4) instead of reading the whole response at once, and do the same in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Then change data.append(...) in MyThread.run() to:
if not first:
    f.write(os.linesep)
f.write(';'.join(item))
These changes will save you a lot of memory.
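Put together, MyThread.run() could look roughly like this (a sketch; filename and url are placeholders carried over from the original code):

def run(self):
    # GET URL FOR EACH REQUEST (placeholder from the original code)
    f = open(filename, 'w')
    first = True
    for item in GetData(url):
        if not first:
            f.write(os.linesep)
        f.write(';'.join(item))
        first = False
    f.close()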
Make sure you delete the threads after they are stopped (using del).