Consider a large list of named items (the first line) read from a large CSV file (80 MB) with possibly interrupted spacing:
name_line = ['a', '', 'b', '', 'c', ... , '', 'cb', 'cc']
I am reading in the remainder of the data line by line, and I only need to process data with a corresponding name. A data line might look like
data_line = ['10', '', '.5', '', '10289', ... , '', '16.7', '0']
I tried it two ways. One is popping the empty columns from each line of the read
blnk_cols = [1, 3, ... , 97]
while data:
    ...
    for index in blnk_cols: data_line.pop(index)
The other is compiling the indices of the items associated with a name in the first line:
good_cols = [0, 2, 4, ... , 98, 99]
while data:
    ...
    data_line = [data_line[index] for index in good_cols]
In the data I am using there will definitely be more good lines than bad ones, although it might be as high as half and half.
I used the cProfile and pstats modules to determine my weakest links in speed, which suggested the pop was the slowest item. I switched to the list comprehension and the time almost doubled.
I imagine one fast way would be to slice the array, retrieving only the good data, but this would be complicated for files with alternating blank and good data.
What I really need is to be able to do
data_line = data_line[good_cols]
effectively passing a list of indices into a list to get back those items.
Right now my program runs in about 2.3 seconds for a 10 MB file, and the pop accounts for about 0.3 seconds.
Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.
Additions:
name_line in file before read
a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,
name_line after read and split(",")
['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
Try a generator expression:
data_line = (data_line[i] for i in good_cols)
Also read here about
Generator Expressions vs. List Comprehension
as the top answer tells you: 'Basically, use a generator expression if all you're doing is iterating once'.
So you should benefit from this.
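If you need an actual sequence back rather than a one-shot iterator, operator.itemgetter gets close to the data_line[good_cols] indexing asked for above. A minimal sketch (the indices and the sample data line are invented for illustration):
from operator import itemgetter

good_cols = [0, 2, 4]                  # example indices of the named columns
get_good = itemgetter(*good_cols)      # build the getter once, outside the read loop

data_line = ['10', '', '.5', '', '10289']
print(get_good(data_line))             # -> ('10', '.5', '10289')
Building the getter once and reusing it for every line keeps most of the per-line work in CPython's C code, much like dereferencing a prepared array of indices.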
Related
I have a large flat file which I need to parse using a list which contains the variable name, the starting point, and the length of the variable along with the type. e.g.
columns = [['LOAD_CYCLE', 131, 6, 'int'],
['OPERATOR', 59, 8, 'Char (8)'],
['APP_DATE', 131, 8, 'Date'],
['UNIQUE_KEY', 245, 25, 'Char (25)']]
This list contains 1,600 items. The only really important columns are the starting point and the length of the variable. This is used to split each line in the flat file into a list of variables, which is used to create a new file to be inserted into a database. The data type is important, but I can always do that section later.
Currently, my method is to read the file in chunks (it is a very large file; over 6GB), and then process the chunk piece by piece:
line = data_file.read(chunk*1000)
for x in range(1000):
    offset = chunk*x
    for item in columns:
        piece = line[item[1]+offset:item[1]+item[2]+offset].replace('\n','')
        # Depending on the data type, a piece may undergo one or two checks before being
        # added to a list which is then written to an output file
The time consuming part is iterating through the columns. Is this the only way to do this? Or is there perhaps a more efficient way to split the string? Something involving maps?
This seems like a great case for the struct module. Assuming you're using CPython, this effectively moves the loop over the columns into C.
First, you need to build up the format string.
Since your columns appear to be specified in arbitrary order, rather than ordered by starting point, and may have gaps between them, this isn't quite trivial… but it shouldn't be too hard. Something like this:
import operator
import struct

sorted_columns = sorted(columns, key=operator.itemgetter(1))
formats = []
offset = 0
for name, start, length, vtype in sorted_columns:
    # add padding bytes for any gap before this column
    if start > offset:
        formats.append('{}x'.format(start - offset))
    formats.append('{}s'.format(length))
    offset = start + length  # advance past this column
format = struct.Struct('=' + ''.join(formats))
Then:
offset = chunk*x
values = format.unpack_from(line, offset)
And now you have a tuple of 1600 items.
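To make that concrete, here is a tiny self-contained demo with a two-column layout invented for illustration (not the question's real 1,600 columns):
import operator
import struct

columns = [['LOAD_CYCLE', 8, 6, 'int'], ['OPERATOR', 0, 8, 'Char (8)']]
sorted_columns = sorted(columns, key=operator.itemgetter(1))

formats, offset = [], 0
for name, start, length, vtype in sorted_columns:
    if start > offset:                        # pad any gap before this column
        formats.append('{}x'.format(start - offset))
    formats.append('{}s'.format(length))
    offset = start + length
fmt = struct.Struct('=' + ''.join(formats))   # here: '=8s6s', 14 bytes per record

line = b'JSMITH  000042' * 3                  # three fake 14-byte records
print(fmt.unpack_from(line, fmt.size * 1))    # -> (b'JSMITH  ', b'000042')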
Of course to do anything with that tuple, you may have to iterate over it anyway. But maybe you can do that in C as well. For example, if you're just inserting the values into a SQL database, then creating a giant SQL statement with 1600 parameters (in the same order as in sorted_columns) and passing the tuple as the arguments may take care of that for you:
cursor.execute(giant_insert_sql, values)
But if you need to do something more complicated to each value, then you'll need to do something like one of the following:
Use NumPy and/or Pandas to vectorize the loop; a sketch of this option follows the list of alternatives below. (Note that they can also be used to just load the whole file into memory, vectorizing the outer loop as well as the inner one, if you've got the RAM… but that shouldn't be nearly as much of a performance gain.)
Run your existing code in PyPy instead of CPython.
Use Numba to JIT the code within CPython.
Write a C extension to replace your inner loop—which, if you're lucky, may be as simple as just moving your Python code to a function in a separate file and compiling it with Cython.
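For the NumPy option, a structured dtype with explicit offsets lets you pull every column of a whole chunk out at once. A minimal sketch with a made-up 20-byte record layout (the field names echo the question, but the offsets and lengths are invented):
import numpy as np

record_size = 20
layout = [('OPERATOR', 0, 8), ('LOAD_CYCLE', 8, 6), ('UNIQUE_KEY', 14, 6)]

dtype = np.dtype({
    'names':    [name for name, start, length in layout],
    'formats':  ['S{}'.format(length) for name, start, length in layout],
    'offsets':  [start for name, start, length in layout],
    'itemsize': record_size,
})

buf = b'JSMITH  000042KEY001' * 1000           # 1000 fake records
records = np.frombuffer(buf, dtype=dtype)      # one vectorized "unpack" of all records
print(records['LOAD_CYCLE'][:3])               # -> [b'000042' b'000042' b'000042']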
The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files that run over 15 times (24 hours to 360 hours), and each file has an iteration of 50, hence the two loops. I then try to open the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ");
LEVEL = str(level);
MODEL = raw_input("Enter a model: ");
NX = 360;
NY = 181;
date = 201409060000;
DATE = str(date);
#############################################
FileList = [];
data = [];
for j in range(1,51,1):
    J = str(j);
    for i in range(24,384,24):
        I = str(i);
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat';
        FileList.append(fileName);
        fo = open(fileName,"rb");
        data.append(fo);
        fo.close();
print data[1][1];
print FileList;
EDITED TO ADD:
Below, find the CORRECT array that the Python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into, is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is how it should look (performed in MATLAB). The next three values the Python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, Python is placing the 11th row of values in the first position, the 22nd in the second, and so on. I don't have a clue as to why, or where in the code I'm specifying that it do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines, it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.
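If the csv.reader option is what you were after, here is a minimal self-contained sketch of the corrected loops. The file-naming scheme comes from the question; the MODEL/LEVEL/DATE placeholders and the assumption that the .dat files are comma-separated text are mine:
import csv

MODEL, LEVEL, DATE = 'ECMWF', '500', '201409060000'   # placeholders for the prompted values

FileList = []
data = []
for j in range(1, 51):
    for i in range(24, 384, 24):
        fileName = ('/Users/alexg/ECMWF_DATA/DAT_FILES/' + MODEL + '_' + LEVEL +
                    '_v_' + str(j) + '_FT0' + str(i) + '_' + DATE + '.dat')
        FileList.append(fileName)
        with open(fileName) as fo:               # text mode, so csv.reader gets strings
            data.append(list(csv.reader(fo)))    # every row of this file, split into columns

print(data[1][1])   # the columns of the second row of the second file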
Suppose you have a data file which includes several data sets separated by the string "--" in the following format:
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
...
How can you read the whole file into an array of arrays so that you can plot all data sets afterwards to the same picture with a for loop looping over the outer array?
genfromtxt('data.dat', delimiter=("--"))
gives lots of
Line #1550 (got 1 columns instead of 2)
I will update ...
I would first split the file into multiple files, which can reside in memory as objects or on the filesystem as new files.
You can locate the string -- with the module re.
Then you can use the link I posted above.
If you're 100% certain that you have no negative values in your file, you can try a quick:
np.genfromtxt(your_file, comments="-")
The comments="-" will force genfromtxt to ignore all the characters after -, which of course will give weird results if you have negative values. Moreover, the result will just be a lump of your dataset in a single array.
Otherwise, the safest route is to iterate over your file and store the lines that do not match -- in one list per block, something along these lines:
blocks = []
current = []
for line in your_file:
    if line.startswith("-"):
        blocks.append(np.array(current))
        current = []
    else:
        current += line.split()
You may have to get rid of the first block if empty.
You could also check a mmap based solution already posted.
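Putting those pieces together, a minimal sketch that also does the plotting the question asks about (it assumes two whitespace-separated columns per line, as in the sample format; the matplotlib calls are just for illustration):
import numpy as np
import matplotlib.pyplot as plt

blocks, current = [], []
with open('data.dat') as your_file:
    for line in your_file:
        if line.startswith("--"):
            if current:                           # skip the empty block before the first '--'
                blocks.append(np.array(current, dtype=float).reshape(-1, 2))
            current = []
        else:
            current += line.split()
    if current:                                   # don't forget the last block
        blocks.append(np.array(current, dtype=float).reshape(-1, 2))

for block in blocks:                              # one curve per data set, same picture
    plt.plot(block[:, 0], block[:, 1])
plt.show()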
What is the best way to read a file and break out the lines by a delimiter?
Data returned should be a list of tuples.
Can this method be beaten? Can this be done faster/using less memory?
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
Your posted code reads the entire file and builds a copy of the file in memory as a single list of all the file contents split into tuples, one tuple per line. Since you ask about how to use less memory, you may only need a generator function:
def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        for line in f:
            yield tuple(line.split(delim))
BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.
lines_as_tuples = readfile(mydata, ',')
for linedata in lines_as_tuples:
    # do something
This is okay so far, and a generator and a list look the same. But let's say your file was going to contain lots of floating point numbers, and your iteration through the file computed an overall average of those numbers. You could use the "# do something" code to calculate the overall sum and number of numbers, and then compute the average. But now let's say you wanted to iterate again, this time to find the differences from the average of each value. You'd think you'd just add another for loop:
for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything because lines_as_tuples has been consumed!
BAM! This is a big difference between generators and lists. At this point in the code now, the generator has been completely consumed - but there is no special exception raised, the for loop simply does nothing and continues on, silently!
In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.
My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retention of the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
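A minimal usage sketch of that division of labour (the filename, the delimiter, and the assumption that the first field is numeric are all invented for illustration):
rows = list(readfile('file.dat', ','))         # materialize once if you need multiple passes

avg = sum(float(r[0]) for r in rows) / len(rows)
diffs = [float(r[0]) - avg for r in rows]      # a second pass works because rows is a list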
Memory use could be reduced by using a generator instead of a list and a list instead of a tuple, so you don't need to read the whole file into memory at once:
def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))
You'll have to rely on the garbage collector to close the file, though. As for returning tuples: don't do it if it's not necessary, since lists are a tiny fraction faster, constructing the tuple has a minute cost and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.
More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.
It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.
It's a simple multi-thread binary data downloader and unpacker.
def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count, = struct.unpack("!I", data[:4])
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)
class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()
There are 15 threads running. Each request gets 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory-recycling jobs in this case? How can I see how much memory each object or function uses?
I would appreciate all your advices or tips on how to keep a python program running in a memory efficient mode.
Edit: Here is the output of "cat /proc/meminfo"
MemTotal: 7975216 kB
MemFree: 732368 kB
Buffers: 38032 kB
Cached: 4365664 kB
SwapCached: 14016 kB
Active: 2182264 kB
Inactive: 4836612 kB
Like others have said, you need at least the following two changes:
Do not create a huge list of integers with range
# use xrange
for i in xrange(0, count):
    # UNPACK FIXED LENGTH OF BINARY DATA HERE
    yield (field1, field2, field3)
do not create a huge string as the full file body to be written at once
# use writelines
f = open(filename, 'w')
f.writelines((datum + os.linesep) for datum in data)
f.close()
Even better, you could write the file as:
items = GetData(url)
f = open(filename, 'w')
for item in items:
    f.write(';'.join(item) + os.linesep)
f.close()
The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up 200 MB of your memory, and with 15 threads, that's 3GB.
But also don't read the whole 15MB file into data; read bit by bit from the response. Sticking those 15MB into a variable will use up 15MB of memory more than reading bit by bit from the response.
You might want to consider simply extracting data until you run out of input data, and comparing the count of data you extracted with what the first bytes said it should be. Then you need neither range() nor xrange(). Seems more Pythonic to me. :)
Consider using xrange() instead of range(); I believe that xrange() is a generator whereas range() expands the whole list.
I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.
Currently you keep both in memory at the same time; this is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.
Also the final line
f.write(os.linesep.join(data))
May actually mean you've temporarily got a third copy in memory (a big string with the entire output file).
So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.
Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.
Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.
The last line should surely be f.close()? Those trailing parens are kinda important.
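A minimal sketch of that record-at-a-time approach (RECORD_SIZE, the unpack format, and the field names are assumptions rather than facts about the original data; response and filename are as in the question's run method):
import os
import struct

RECORD_SIZE = 24                                    # hypothetical fixed record length
count, = struct.unpack("!I", response.read(4))      # record count from the 4-byte header

f = open(filename, 'w')
for _ in xrange(count):
    record = response.read(RECORD_SIZE)             # one record at a time (a robust version would re-read if short)
    field1, field2, field3 = struct.unpack("!8s8s8s", record)   # hypothetical layout
    f.write(';'.join((field1, field2, field3)) + os.linesep)    # write it out immediately
f.close()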
You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each line as it is read. This will make the remote servers wait for you, of course, but that's okay.
Python is just not very memory efficient. It wasn't built for that.
You could do more of your work in compiled C code if you convert this to a list comprehension:
data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))
to:
data = [';'.join(items) for items in GetData(url)]
This is actually slightly different from your original code. In your version, GetData returns a 3-tuple, which comes back in items. You then iterate over this triplet, and append ';'.join(item) for each item in it. This means that you get 3 entries added to data for every triplet read from GetData, each one ';'.join'ed. If the items are just strings, then ';'.join will give you back a string with every other character a ';' - that is ';'.join("ABC") will give back "A;B;C". I think what you actually wanted was to have each triplet saved back to the data list as the 3 values of the triplet, separated by semicolons. That is what my version generates.
This may also help somewhat with your original memory problem, as you are no longer creating as many Python values. Remember that a variable in Python has much more overhead than one in a language like C. Since each value is itself an object, and add the overhead of each name reference to that object, you can easily expand the theoretical storage requirement several-fold. In your case, reading 15Mb X 15 = 225Mb + the overhead of each item of each triple being stored as a string entry in your data list could quickly grow to your 2Gb observed size. At minimum, my version of your data list will have only 1/3 the entries in it, plus the separate item references are skipped, plus the iteration is done in compiled code.
There are two obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - these two will take about 500Mb), and there are probably other places in the skipped code. Both are easy to make memory efficient. Use response.read(4) instead of reading the whole response at once, and do it the same way in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Change data.append(...) in MyThread.run() to
if not first:
    f.write(os.linesep)
f.write(';'.join(item))
These changes will save you a lot of memory.
Make sure you delete the threads after they are stopped. (using del)