It's said that Python automatically manages memory, so I'm confused: I have a Python program that consistently uses more than 2GB of memory.
It's a simple multi-threaded binary data downloader and unpacker.
def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)
class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()
There are 15 threads running. Each request gets about 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?
I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient way.
Edit: Here is the output of "cat /proc/meminfo"
MemTotal: 7975216 kB
MemFree: 732368 kB
Buffers: 38032 kB
Cached: 4365664 kB
SwapCached: 14016 kB
Active: 2182264 kB
Inactive: 4836612 kB
Like others have said, you need at least the following two changes:
Do not create a huge list of integers with range
# use xrange
for i in xrange(0, count):
    # UNPACK FIXED LENGTH OF BINARY DATA HERE
    yield (field1, field2, field3)
Do not create a huge string as the full file body to be written at once
# use writelines
f = open(filename, 'w')
f.writelines((datum + os.linesep) for datum in data)
f.close()
Even better, you could write the file as:
items = GetData(url)
f = open(filename, 'w')
for item in items:
    f.write(';'.join(item) + os.linesep)
f.close()
The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, which will eat up about 200 MB of memory, and with 15 threads doing this at once, that's around 3GB.
But also, don't read the whole 15MB response into data; read from the response bit by bit instead. Sticking those 15MB into a variable uses 15MB more memory than reading from the response incrementally.
You might want to consider simply extracting data until you run out of input, and comparing the count of records you extracted with what the first four bytes said it should be. Then you need neither range() nor xrange(). Seems more Pythonic to me. :)
Consider using xrange() instead of range(): xrange produces its values lazily, whereas range() builds the whole list up front.
I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.
Currently you keep both in memory at the same time, and that is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.
Also the final line
f.write(os.linesep.join(data))
may actually mean you've temporarily got a third copy in memory (a big string containing the entire output file).
So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.
Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.
Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.
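For example, since the records are fixed length, GetData() could read them from the response one at a time instead of slurping the whole 15MB up front. A minimal sketch, assuming a hypothetical 12-byte record of three unsigned ints (RECORD_FMT and RECORD_SIZE stand in for the real layout):

import struct
import urllib2

RECORD_FMT = "!III"                       # assumed record layout; adjust to the real format
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def GetData(url):
    response = urllib2.urlopen(urllib2.Request(url))
    count = struct.unpack("!I", response.read(4))[0]
    for _ in xrange(count):
        record = response.read(RECORD_SIZE)       # only one record in memory at a time
        yield struct.unpack(RECORD_FMT, record)   # convert fields to str before ';'.join
    response.close()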
The last line should surely be f.close()? Those trailing parens are kinda important.
You can make this program more memory efficient by not reading all 15MB from the TCP connection at once, but instead processing each record as it is read. This will make the remote servers wait for you, of course, but that's okay.
Python is just not very memory efficient. It wasn't built for that.
You could do more of your work in compiled C code if you convert this to a list comprehension:
data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))
to:
data = [';'.join(items) for items in GetData(url)]
This is actually slightly different from your original code. In your version, GetData returns a 3-tuple, which comes back in items. You then iterate over this triplet, and append ';'.join(item) for each item in it. This means that you get 3 entries added to data for every triplet read from GetData, each one ';'.join'ed. If the items are just strings, then ';'.join will give you back a string with every other character a ';' - that is ';'.join("ABC") will give back "A;B;C". I think what you actually wanted was to have each triplet saved back to the data list as the 3 values of the triplet, separated by semicolons. That is what my version generates.
This may also help somewhat with your original memory problem, as you are no longer creating as many Python values. Remember that a variable in Python has much more overhead than one in a language like C: each value is itself an object, and each name reference to that object adds its own overhead, so the theoretical storage requirement can easily grow several-fold. In your case, reading 15MB x 15 = 225MB, plus the overhead of each item of each triple being stored as a string entry in your data list, could quickly grow toward the 2GB size you observed. At minimum, my version of your data list will have only 1/3 the entries in it, the separate item references are skipped, and the iteration is done in compiled code.
There are two obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run(); together these will take about 500MB), and there are probably other places in the skipped code. Both are easy to make memory efficient: use response.read(4) instead of reading the whole response at once, and do the same in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Then change data.append(...) in MyThread.run() to:
if not first:
    f.write(os.linesep)
f.write(';'.join(item))
These changes will save you a lot of memory.
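Put together, a run() along those lines might look like this (a sketch; url and filename come from the code elided in the question, and first just tracks whether anything has been written yet):

def run(self):
    # GET URL FOR EACH REQUEST (elided in the question)
    f = open(filename, 'w')
    first = True
    for item in GetData(url):
        if not first:
            f.write(os.linesep)
        f.write(';'.join(item))
        first = False
    f.close()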
Make sure you delete the threads after they are stopped (using del).
Related
I have a Python script that reads a flat file and writes the records to a JSON file. Would it be faster to write everything at once:
dict_array = []
for record in records:
    dict_array.append(record)

# writes one big array to file
out_file.write(json.dumps(dict_array))
Or write to the file as the iterator yields each record?
for record in records:
    out_file.write(json.dumps(record) + '\n')
The number of records in records is around 81,000.
Also, the format of JSON can be one big array of objects (case 1) or line-separated objects (case 2).
Your two solutions aren't doing the same thing. The first writes a valid JSON file. The second writes a probably-valid (but you have to be careful) JSON Lines file (the format is also known as NDJSON or LDJSON). So the way you process the data later is going to be very different, and that's the most important question here: do you want a JSON file, or a JSON Lines file?
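To make the difference concrete, here is roughly how each output would be read back (a sketch; out.json and out.jsonl are placeholder filenames):

import json

# Case 1: one JSON array; the whole file has to be parsed (and held) at once
with open('out.json') as f:
    records = json.load(f)

# Case 2: JSON Lines; records can be streamed one at a time
with open('out.jsonl') as f:
    for line in f:
        record = json.loads(line)
        # process record here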
But since you asked about performance: It depends.
Python files are buffered by default, so doing a whole bunch of small writes is only a tiny bit slower than doing one big write. But it is a tiny bit slower, not zero.
On the other hand, building a huge list in memory means allocation, and copies, that are otherwise unneeded. This is almost certainly going to be more significant, unless your values are really tiny and your list is also really short.
Without seeing your data, I'd give about 10:1 odds that the iterative solution will turn out faster, but why rely on that barely-educated guess? If it matters, measure it with your actual data using timeit.
On the third hand, 81,000 typical JSON records is basically nothing, so unless you're doing this zillions of times, it's probably not even worth measuring. If you spend an hour figuring out how to measure it, running the tests, and interpreting the results (not to mention the time you spent on SO) to save 23 milliseconds per day for about a week and then nothing ever again… well, to a programmer, that's always attractive, but still it's not always a good idea.
import json
import time

records = range(10**5)
out_file = open('out.json', 'w')  # output file; the name here is arbitrary

# Build a list, then write it all at once
dict_array = []
start = time.time()
for record in records:
    dict_array.append(record)
out_file.write(json.dumps(dict_array))
end = time.time()
print(end - start)
# 0.07105851173400879

# Write one record per line
start = time.time()
for record in records:
    out_file.write(json.dumps(record) + '\n')
end = time.time()
print(end - start)
# 1.1138122081756592

# Build the list with a list comprehension, then write it all at once
start = time.time()
out_file.write(json.dumps([record for record in records]))
end = time.time()
print(end - start)
# 0.051038265228271484
I don't know what records is, but based on these tests, the list comprehension is fastest, followed by constructing a list and writing it all at once, followed by writing one record at a time. Depending on what records is, just doing out_file.write(json.dumps(records)) may be even faster.
I need help to turn the code below, especially the list, into a generator so that I can save memory on my computer.
I converted doclist into an iterable object, and deleted slist and seuslist, which previously held a large list of names.
https://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/
seuslist1 = open('/Users/AJ/Desktop/Dropbox/DOS_Python/docs/US/socialentrepreneurship_US_list.txt', mode='r+')
seuslist = seuslist1.read()
slist = seuslist.split('\n')
slist = slist[:len(slist)-1]  # I have to take out the last entry because of a weird space. Also explore using os.walk later.

# I switched to just using a list of docs because it's easier to deal with than a dictionary
doclist = []
for i, doc in enumerate(slist):
    string = 'docs/US/', doc
    string = ''.join(string)
    doclist.append(open(string, mode='r+').read())

# Clear these variables to free up memory. Turn doclist into a generator object to save memory.
doclist = iter(doclist)
del seuslist
del slist
seuslist1.close()
Your basic problem, as you've noted, is that you're keeping all the contents of all those files in a single enormous list. Luckily, turning that list into a generator is quite simple. To keep things readable and Pythonic, we'll rename doclist to simply docs, since it's no longer a list.
# Use a generator expression to quickly create a generator.
# This will iterate over every entry in slist.
# For each entry: build the path, open the file, read it, and yield the contents.
docs = (open(path).read() for path in ('docs/US/' + entry for entry in slist))

for doc in docs:
    print(len(doc))  # Do something useful here.
A couple of things to bear in mind when using generators like this.
First, it will help you with your memory problems, because you only ever have one file's contents in memory at once (unless you store it elsewhere, but that's probably a bad idea, because of the aforementioned memory issues).
Second, each file is loaded only when the iteration (for doc in docs) progresses to the next step. This means that if your process takes a long time on each iteration (or even if it doesn't), you could modify files while the process is running, for better or for worse.
Third, the generator expression here isn't the most robust thing ever, since you've got those bare open calls, any one of which could throw an Exception and kill the remainder of your processing. To make it sturdier, you'd want to write an actual generator function like in Calpratt's answer, so you can use context managers, wrap up Exceptions on a per-file basis, and so on.
Finally, remember that a generator may only be used once as-is! Once you exhaust it, it's done. This is usually fine, but you need to make sure you extract all the information you'll need the first time through (besides, you don't want to be re-reading all those files over and over anyway!).
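For instance, a sturdier version written as a real generator function might look something like this (a sketch; the path prefix is taken from the question, and skipping unreadable files rather than aborting is my own choice):

def read_docs(slist):
    """Yield the contents of each file listed in slist, one at a time."""
    for entry in slist:
        path = 'docs/US/' + entry
        try:
            with open(path) as f:        # the context manager closes each file promptly
                yield f.read()
        except IOError as exc:           # handle failures per file instead of dying
            print('skipping %s: %s' % (path, exc))

docs = read_docs(slist)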
Try something like:
main_file = '/Users/AJ/Desktop/Dropbox/DOS_Python/docs/US/socialentrepreneurship_US_list.txt'
def data_from_file_generator():
    with open(main_file, mode='r+') as path_file:
        for my_path in path_file:
            # strip the trailing newline before building the path
            with open("docs/US/" + my_path.strip(), mode='r+') as data_file:
                yield data_file.read()
Python noob, strong C background.
I'm writing a simple server that reads blocks of size 1024 bytes from a socket. I need to concatenate the blocks into one large file (it's a video). For starters I have tried something like this
movie = bytearray()
while numblocksrxd != numBlocks:
    data = conn.recv(1024)
    numblocksrxd += 1
    movie = movie + data
I've quickly realized that this code creates a new instance of movie each time I assign to it, which results in increasingly large memory copies as it grows (I think). If I were doing this in C, I'd simply malloc the space I needed and fill it in as it came. How would I handle this in Python?
movie += data
Augmented assignments are generally done without creating a new object if their target is mutable. In this case, bytearray supports in-place +=, so this code won't create a new object. When the bytearray's internal buffer runs out of room, it'll allocate a new one, but the allocation takes amortized constant time and the movie object managing the buffer won't be replaced.
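A quick way to convince yourself of this (a minimal sketch):

buf = bytearray()
buf_id = id(buf)
buf += b'some chunk of data'   # extends the existing bytearray in place
assert id(buf) == buf_id       # still the same object; no new bytearray was created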
I doubt you really need more efficiency than just appending onto a bytearray like this. Yes, it will occasionally reallocate the underlying buffer and copy it, but it uses the usual exponential over-allocation trick, so the amortized time is still constant.
If you really need more efficiency (and I doubt you do, but…), you can just store a list of byte strings instead of one big one. If you can use that as-is, great. If not, you can concatenate them at the end with join. That's almost always the fastest way to build up a big string in Python. So:
movie = []
while numblocksrxd != numBlocks:
    data = conn.recv(1024)
    numblocksrxd += 1
    movie.append(data)
movie = b''.join(movie)
Or you can, of course, also use the same trick you'd use in C:
movie = bytearray(numBlocks * 1024)
while numblocksrxd != numBlocks:
    data = conn.recv(1024)
    # write the block into its slot before incrementing the counter
    movie[numblocksrxd*1024:(numblocksrxd+1)*1024] = data
    numblocksrxd += 1
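If you want to go a step further and skip the intermediate 1024-byte chunks entirely, sockets also provide recv_into, which reads straight into a preallocated buffer (a sketch reusing the same names; note that recv and recv_into may return fewer bytes than requested, which this loop accounts for):

movie = bytearray(numBlocks * 1024)
view = memoryview(movie)            # lets us fill slices of the buffer without copying
offset = 0
while offset < len(movie):
    nread = conn.recv_into(view[offset:])
    if nread == 0:
        raise IOError("connection closed before all blocks arrived")
    offset += nread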
I'm working on writing a small log scraping program in Python, which processes a rolling logfile and stores the offsets in the file for lines of interest.
My original solution was running rather quickly on large files but I didn't have a method for clearing the storage which means that if the program were to continue running, the memory usage would steadily increase until the program consumed all of the memory available to it. My solution was to use collections.deque with maxlen set so that the list would operate as a circular buffer, discarding the oldest loglines as more came in.
While this fixes the memory issue, I'm faced with a major performance loss in calling items from the deque by index. As an example, this code runs much slower than the old equivalent, where self.loglines was not a deque. Is there a way to improve its speed, or to make a circular buffer where random access is a constant-time operation (instead of, I assume, O(n))?
def get_logline(self, lineno):
    """Gets the logline at the given line number.

    Arguments:
        lineno - The index of the logline to retrieve

    Returns: A string containing the logline at the given line number
    """
    start = self.loglines[lineno].start
    end = self.loglines[lineno + 1].start
    size = end - start
    if self._logfile.closed:
        self._logfile = open(self.logpath, "r")
    self._logfile.seek(start)
    logline = self._logfile.read(size)
    return logline
As with all doubly linked lists, random access into a collections.deque is O(n). Consider using a list of bounded lists instead, so that clearing out old entries (del outer[0]) can still proceed in a timely manner even with hundreds of thousands of entries.
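A rough sketch of that idea (the class name and block size are mine; indexing stays cheap because it is just two list lookups, and trimming only ever drops one small inner list at a time):

class BlockBuffer(object):
    """Bounded buffer built from an outer list of small fixed-size inner lists."""

    def __init__(self, maxlen, blocksize=4096):
        self.maxlen = maxlen
        self.blocksize = blocksize
        self.blocks = [[]]       # outer list of bounded inner lists
        self.offset = 0          # entries already discarded from the first block
        self.length = 0

    def append(self, item):
        if len(self.blocks[-1]) >= self.blocksize:
            self.blocks.append([])
        self.blocks[-1].append(item)
        self.length += 1
        if self.length > self.maxlen:       # discard the oldest entry
            self.offset += 1
            self.length -= 1
            if self.offset >= len(self.blocks[0]):
                del self.blocks[0]          # cheap: drops one small inner list
                self.offset = 0

    def __getitem__(self, i):
        block, pos = divmod(self.offset + i, self.blocksize)
        return self.blocks[block][pos]

    def __len__(self):
        return self.length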
I am using Python to parse incoming comma-separated strings. I want to do some calculation on the data afterwards.
Each string is 800 characters long with 120 comma-separated fields.
There are about 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))

# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
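(For reference, a get_fields built on operator.itemgetter as described might look roughly like the following; the column positions are made up for illustration.)

from operator import itemgetter

# hypothetical: the ~20 column positions wanted out of the 120 fields
_picker = itemgetter(0, 3, 7, 12, 18, 25, 31, 40, 47, 52,
                     60, 66, 73, 81, 88, 95, 101, 108, 114, 119)

def get_fields(fields):
    """Return just the selected columns from one split record, as a list."""
    return list(_picker(fields))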
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In the later part of the program I insert these lines into sqlite memory table for further use.
But overall, 4-5 minutes just for parsing and building a list is too slow for my project.
I run this processing in around 6-8 threads.
Would switching to C/C++ help?
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# datafile.next()
l = sum((get_fields(v.split(',')) for v in datafile), [])
This avoids creating any overall data structures, and only accumulates the desired values as returned by get_fields.
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
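For what it's worth, a csv-based version might look something like this (a sketch; the filename is a placeholder and get_fields is the same function described above):

import csv

l = []
with open("file_with_1point2million_records.dat") as datafile:
    for row in csv.reader(datafile):   # handles the splitting (and any quoting) per record
        l.extend(get_fields(row))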