EDIT:
I need help turning the code below, especially the list, into a generator so that I can save memory on my computer.
I converted doclist into an iterable object, and deleted slist and seuslist, which were previously large lists of names.
https://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/
seuslist1 = open('/Users/AJ/Desktop/Dropbox/DOS_Python/docs/US/socialentrepreneurship_US_list.txt', mode= 'r+')
seuslist = seuslist1.read()
slist = seuslist.split('\n')
slist = slist[:len(slist)-1] #I have to take out the last entry because of a weird space. Also explore using OSwalk later.
#I switched to just using a list of docs because it's easier to deal with than a dictionary
doclist = []
for i, doc in enumerate(slist):
    string = 'docs/US/', doc
    string = ''.join(string)
    doclist.append(open(string, mode='r+').read())
#clear these variables to free up memory. Turn doclist into a generator object to save memory.
doclist = iter(doclist)
del seuslist
del slist
seuslist1.close()
Your basic problem, as you've noted, is that you're keeping all the contents of all those files in a single enormous list. Luckily, turning that list into a generator is quite simple. To keep things readable and Pythonic, we'll rename doclist to simply docs, since it's no longer a list.
# Use a generator expression to quickly create a generator.
# This will iterate over every entry in slist.
# For each entry: build the path, open the file, read it, and yield the contents.
docs = (open(path).read() for path in ('docs/US/' + entry for entry in slist))
for doc in docs:
    print(len(doc))  # Do something useful here.
A few things to bear in mind when using generators like this.
First, it will help you with your memory problems, because you only ever have one file's contents in memory at once (unless you store it elsewhere, but that's probably a bad idea, because of the aforementioned memory issues).
Second, each file is loaded only when the iteration (for doc in docs) progresses to the next step. This means that if your process takes a long time on each iteration (or even if it doesn't), you could modify files while the process is running, for better or for worse.
Third, the generator expression here isn't the most robust thing ever, since you've got those bare open calls, any one of which could throw an Exception and kill the remainder of your processing. To make it sturdier, you'd want to write an actual generator function like in Calpratt's answer, so you can use context managers, wrap up Exceptions on a per-file basis, and so on.
Finally, remember that a generator may only be used once as-is! Once you exhaust it, it's done. This is usually fine, but you need to make sure you extract all the information you'll need the first time through (besides, you don't want to be re-reading all those files over and over anyway!).
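For instance, a quick hedged illustration of that one-pass behaviour, reusing slist from the question (the sums are just stand-ins for real work):
docs = (open('docs/US/' + entry).read() for entry in slist)
total = sum(len(doc) for doc in docs)        # consumes the generator
total_again = sum(len(doc) for doc in docs)  # always 0: the generator is already exhausted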
Try something like:
main_file = '/Users/AJ/Desktop/Dropbox/DOS_Python/docs/US/socialentrepreneurship_US_list.txt'
def data_from_file_generator():
    with open(main_file, mode='r+') as path_file:
        for my_path in path_file:
            # strip the trailing newline before building the path
            with open("docs/US/" + my_path.strip(), mode='r+') as data_file:
                yield data_file.read()
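A usage sketch (the print is just a placeholder for real per-document work):
for doc in data_from_file_generator():
    print(len(doc))  # only one document's contents is held in memory at a time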
Related
I was working on a problem recently that required me to go through a very large folder (~600,000 files) and return the list of filenames that matched a certain criterion. The original version was a normal list comprehension stored in a variable. This isn't the actual code but it gives the gist.
def filter_files(file_path):
    filtered = [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
    return filtered
When monitoring this one, it would start out fast and then progressively get slower and slower. I presume that's because it's trying to store so much data in the variable.
I then rewrote it to be:
def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
And called it like:
def test(file_path):
    filtered = filter_files(file_path)
This one never slows down. It maintains the same speed the entire time.
My question is: what under the hood causes this difference? The data is still being stored in a variable, and it's still being produced by a list comprehension. What is it about writing the comprehension directly in the return statement that avoids the issues of the first version? Thanks!
There is no difference between those two pieces of code. None at all. Both of them are creating a list, and then managing a reference to that list.
The likely cause of your issue is caching. In the first case, the file system has to keep going out to the disk over and over and over to fetch more entries. After you finished that, the directory was in the file cache and could be read immediately. Reboot and try again, and you'll see the second one takes the same amount of time.
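If you want to see the cache effect for yourself, here is a rough sketch (the folder path is a placeholder) that times the same call twice; after a reboot the first run hits the disk, while the second run is served from the cache:
import os
import time

def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]

start = time.perf_counter()
filter_files('/path/to/large/folder')   # cold cache: reads directory entries from disk
print('first run :', time.perf_counter() - start)

start = time.perf_counter()
filter_files('/path/to/large/folder')   # warm cache: entries already in memory
print('second run:', time.perf_counter() - start)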
Any suggestions to make this script run faster? I usually have two to ten million lines or more to process with this script.
while True:
    line=f.readline()
    if not line:break
    linker='CTGTAGGCACCATCAAT'
    line_1=line[:-1]
    for i in range(len(linker)):
        if line_1[-i:]==linker[:i]:
            new_line='>\n'+line_1[:-i]+'\n'
            seq_1.append(new_line_1)
            if seq_2.count(line)==0:
                seq_2.append(line)
            else:pass
        else:pass
First of all, it seems that you are creating lots of string objects in the inner loop. You could try to build a list of prefixes first:
linker = 'CTGTAGGCACCATCAAT'
prefixes = []
for i in range(len(linker)):
    prefixes.append(linker[:i])
Additionally, you could use the string method endswith instead of creating a new object in the inner loop's condition:
while True:
    line=f.readline()
    if not line:
        break
    line_1=line[:-1]
    for prefix in prefixes:
        if line_1.endswith(prefix):
            new_line='>\n%s\n' % line_1[:-len(prefix)]
            seq_1.append(new_line_1)
            if seq_2.count(line)==0:
                seq_2.append(line)
I am not sure about the indexes there (like len(prefix)), and I don't know how much faster it would be.
I am not sure what your code is meant to do, but the general approach is:
Avoid unnecessary operations, conditions etc.
Move everything you can out of the loop.
Try to do as few levels of loop as possible.
Use common Python practices where possible (they are generally more efficient).
But most important: try to simplify and optimize the algorithm, not necessarily the code implementing it.
Judging from the code and applying some of the above rules, the code may look like this:
seq_2 = set() # seq_2 is a set now (maybe seq_1 should be also?)
linker = 'CTGTAGGCACCATCAAT' # assuming same linker for every line
linker_range = range(len(linker)) # result of the above assumption
for line in f:
    line_1=line[:-1]
    for i in linker_range:
        if line_1[-i:] == linker[:i]:
            # deleted new_line_1
            seq_1.append('>\n' + line_1[:-i] + '\n') # do you need this line?
            seq_2.add(line) # will ignore if already in set
Probably a large part of the problem is the seq_2.count(line) == 0 test for whether line is in seq_2. This will walk over each element of seq_2 and test whether it's equal to line -- which will take longer and longer as seq_2 grows. You should use a set instead, which will give you constant-time tests for whether it's present through hashing. This will throw away the order of seq_2 -- if you need to keep the order, you could use both a set and a list (test if it's in the set and if not, add to both).
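A minimal sketch of that set-plus-list idea, with made-up sample data:
lines = ['ACGT\n', 'TTGA\n', 'ACGT\n', 'CCAT\n']

seen = set()     # constant-time membership tests
ordered = []     # preserves first-seen order
for line in lines:
    if line not in seen:
        seen.add(line)
        ordered.append(line)

print(ordered)   # ['ACGT\n', 'TTGA\n', 'CCAT\n']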
This probably doesn't affect the speed, but it's much nicer to loop for line in f instead of your while True loop with line = f.readline() and the test for when to break. Also, the else: pass statements are completely unnecessary and could be removed.
The definition of linker should be moved outside the loop, since it doesn't get changed. #uhz's suggestion about pre-building prefixes and using endswith are also probably good.
About twice as fast as all these variants (at least on Python 2.7.2):
seq_2 = set()

# Here I use a generator, so I avoid the .append lookup and list resizing.
def F(f):
    # local memory
    local_seq_2 = set()
    # lookup escaping
    local_seq_2_add = local_seq_2.add
    # static variables
    linker ='CTGTAGGCACCATCAAT'
    linker_range = range(len(linker))
    for line in f:
        line_1=line[:-1]
        for i in linker_range:
            if line_1[-i:] == linker[:i]:
                local_seq_2_add(line)
                yield '>\n' + line_1[:-i] + '\n'
    # push local memory to the global
    global seq_2
    seq_2 = local_seq_2

# here we consume all the data
seq_1 = tuple(F(f))
Yes, it's ugly and non-pythonic, but it is the fastest way to do the job.
You can also extend this code by putting with open('file.name') as f: inside the generator, or by adding other logic.
Note:
The expression '>\n' + line_1[:-i] + '\n' is doubtful. On some machines concatenation is the fastest way to build the string; on others the fastest way is '>\n%s\n' % line_1[:-i] or ''.join(('>\n', line_1[:-i], '\n')) (in the version without the lookup, of course). I don't know which will be best for you.
It is strange, but the new formatter '{}'.format(...) shows the slowest result on my computer.
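If you want to check which style wins on your own machine, a rough timeit sketch (the sample line and slice index are made up) looks like this; results will differ between Python versions and machines:
import timeit

line_1 = 'CTGTAGGCACCATCAATACGTACGT'
i = 5

print(timeit.timeit(lambda: '>\n' + line_1[:-i] + '\n'))
print(timeit.timeit(lambda: '>\n%s\n' % line_1[:-i]))
print(timeit.timeit(lambda: ''.join(('>\n', line_1[:-i], '\n'))))
print(timeit.timeit(lambda: '>\n{}\n'.format(line_1[:-i])))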
I have a file with about 25000 lines in S19 format.
Each line looks like: S214 780010 00802000000010000000000A508CC78C 7A
There are no spaces in the actual file. The first part, 780010, is the address of this line, and I want it to be a dict key; I want the data part, 00802000000010000000000A508CC78C, to be the value for that key. I wrote my code like this:
def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}
    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)
    infile.close()
get_address_of_line() and get_data_of_line() are both simply string-slicing functions, and get_line_number() iterates over self.all_lines and returns an int.
The problem is that the init process takes over 1 minute. Is the way I construct the dict wrong, or does Python just need that long to do this?
By the way, I'm new to Python :) Maybe the code looks more C/C++-like; any advice on how to write it more Pythonically is appreciated :)
How about something like this? (I made a test file with just a line S21478001000802000000010000000000A508CC78C7A so you might have to adjust the slicing.)
>>> with open('test.test') as f:
...     dict_by_address = {line[4:10]: line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
This code should be tremendously faster than what you have now. EDIT: As #sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value
Some comments:
Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.
Best practice is to use open() rather than file(); I don't think Python 3.x even has file().
You can use the open file object as an iterator, and when you iterate it you get one line from the input. This is better than calling the .readlines() method, which slurps all the data into a list; then you use the data one time and delete the list. Since the input file is large, that means you are probably causing swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.
Then, having created a giant list of input lines, you use range() to make a big list of integers. Again it wastes time and memory to build a list, use it once, then delete the list. You can avoid this overhead by using xrange() but even better is just to build the dictionary as you go, as part of the same loop that is reading lines from the file.
It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
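For example, a hedged sketch of that error handling, skipping malformed lines instead of letting one bad record abort the whole load (load_dict is just an illustrative name):
def load_dict(filename):
    dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            try:
                _, key, value, _ = line.split()
            except ValueError:
                continue   # blank line, comment, or otherwise malformed record
            dict_by_address[key] = value
    return dict_by_address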
EDIT: corrected version:
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
What is the best way to read a file and break out the lines by a delimiter?
Data returned should be a list of tuples.
Can this method be beaten? Can this be done faster/using less memory?
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))
BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.
lines_as_tuples = readfile(mydata, ',')
for linedata in lines_as_tuples:
    # do something
This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:
for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything because lines_as_tuples has been consumed!
BAM! This is a big difference between generators and lists. At this point in the code now, the generator has been completely consumed - but there is no special exception raised, the for loop simply does nothing and continues on, silently!
In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.
My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retaining the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
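A brief usage sketch of that division of labour ('data.csv' is a placeholder filename):
rows = readfile('data.csv', ',')            # generator: one pass, minimal memory
all_rows = list(readfile('data.csv', ','))  # the caller keeps a reusable list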
Memory use could be reduced by using a generator instead of a list and a list instead of a tuple, so you don't need to read the whole file into memory at once:
def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))
You'll have to rely on the garbage collector to close the file, though. As for returning tuples: don't do it if it's not necessary, since lists are a tiny fraction faster, constructing the tuple has a minute cost and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.
More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.
It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.
It's a simple multi-threaded binary data downloader and unpacker.
import os
import struct
import threading
import urllib2

def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()
There are 15 threads running. Each request gets about 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?
I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient way.
Edit: Here is the output of "cat /proc/meminfo"
MemTotal: 7975216 kB
MemFree: 732368 kB
Buffers: 38032 kB
Cached: 4365664 kB
SwapCached: 14016 kB
Active: 2182264 kB
Inactive: 4836612 kB
Like others have said, you need at least the following two changes:
Do not create a huge list of integers with range
# use xrange
for i in xrange(0, count):
    # UNPACK FIXED LENGTH OF BINARY DATA HERE
    yield (field1, field2, field3)
do not create a huge string as the full file body to be written at once
# use writelines
f = open(filename, 'w')
f.writelines((datum + os.linesep) for datum in data)
f.close()
Even better, you could write the file as:
items = GetData(url)
f = open(filename, 'w')
for item in items:
    f.write(';'.join(item) + os.linesep)
f.close()
The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up 200 MB of your memory, and with 15 threads, that's 3GB.
Also, don't read the whole 15MB response into data at once; read bit by bit from the response. Sticking those 15MB into a variable uses up 15MB more memory than reading bit by bit from the response.
You might want to consider simply extracting data until you run out of input, and comparing the count of records you extracted with what the first bytes said it should be. Then you need neither range() nor xrange(). Seems more pythonic to me. :)
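A hedged sketch of that "unpack until the data runs out" idea; the real record layout isn't shown in the question, so the struct format here is purely an assumption:
import struct

record = struct.Struct('!IIf')   # assumed layout: two unsigned ints and a float

def unpack_all(data):
    expected = struct.unpack('!I', data[:4])[0]
    offset, found = 4, 0
    while offset + record.size <= len(data):
        yield record.unpack(data[offset:offset + record.size])
        offset += record.size
        found += 1
    if found != expected:
        print('warning: header said %d records, got %d' % (expected, found))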
Consider using xrange() instead of range(); I believe that xrange is evaluated lazily, whereas range() builds the whole list up front.
I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.
Currently you keep both in memory at the same time, and this is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.
Also the final line
f.write(os.linesep.join(data))
May actually mean you've temporarily got a third copy in memory (a big string with the entire output file).
So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.
Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.
Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.
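A hedged sketch of that chunked reading, assuming some fixed record size (the 32 bytes here is an invented value):
def stream_records(response, record_size=32):
    # Read the response in 64 KiB chunks and yield one fixed-length record
    # at a time, so the full 15MB never sits in memory at once.
    buf = b''
    while True:
        chunk = response.read(64 * 1024)
        if not chunk:
            break
        buf += chunk
        while len(buf) >= record_size:
            yield buf[:record_size]
            buf = buf[record_size:]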
The last line should surely be f.close()? Those trailing parens are kinda important.
You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each line as it is read. This will make the remote servers wait for you, of course, but that's okay.
Python is just not very memory efficient. It wasn't built for that.
You could do more of your work in compiled C code if you convert this to a list comprehension:
data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))
to:
data = [';'.join(items) for items in GetData(url)]
This is actually slightly different from your original code. In your version, GetData returns a 3-tuple, which comes back in items. You then iterate over this triplet, and append ';'.join(item) for each item in it. This means that you get 3 entries added to data for every triplet read from GetData, each one ';'.join'ed. If the items are just strings, then ';'.join will give you back a string with every other character a ';' - that is ';'.join("ABC") will give back "A;B;C". I think what you actually wanted was to have each triplet saved back to the data list as the 3 values of the triplet, separated by semicolons. That is what my version generates.
This may also help somewhat with your original memory problem, as you are no longer creating as many Python values. Remember that a variable in Python has much more overhead than one in a language like C. Since each value is itself an object, and add the overhead of each name reference to that object, you can easily expand the theoretical storage requirement several-fold. In your case, reading 15Mb X 15 = 225Mb + the overhead of each item of each triple being stored as a string entry in your data list could quickly grow to your 2Gb observed size. At minimum, my version of your data list will have only 1/3 the entries in it, plus the separate item references are skipped, plus the iteration is done in compiled code.
There are 2 obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - these two will take about 500Mb), and probably there are other places in the skipped code. They are both easy to make memory efficient. Use response.read(4) instead of reading the whole response at once, and do it the same way in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Change data.append(...) in MyThread.run() to:
if not first:
    f.write(os.linesep)
f.write(';'.join(item))
These changes will save you a lot of memory.
Make sure you delete the threads after they are stopped (using del).
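Putting the streaming suggestions together, a hedged sketch of what run() might look like; url and filename come from the elided parts of the question:
def run(self):
    # GET URL FOR EACH REQUEST (elided in the question)
    first = True
    with open(filename, 'w') as f:
        for item in GetData(url):
            if not first:
                f.write(os.linesep)
            f.write(';'.join(item))
            first = False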