pandas.read_csv slow when reading file with variable length string

I have an issue which I think I have somewhat solved, but I would like to learn more about it or about better solutions.
The problem: I have tab-separated files with ~600k lines (and one comment line), of which one field (out of 8 fields) contains a string of variable length, anything between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif
df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This is suddenly a lot faster.
So my questions are:
Why is reading this file so slow? Does it have to do with allocating memory?
Why does the converter function help here? It has to execute an additional function for every row, but it is still a lot faster...
What else can be done?

You didn't mention what "slow" means to you, but if:
your file contains ca. 600k rows,
each row contains 1 to 2000 characters (let's say 1000 on average, so each line is ca. 1000 B),
then this file's size is: 600,000 * 1,000 B ≈ 570 MiB. That's a lot, especially if you don't have much RAM.
The converter helps because suddenly the average size of one line is not ~1,000 B but at most ~9 B (the new maximum of 8 characters plus the '~' marker). The parser still reads each string, but it only keeps the truncated version, so far less memory has to be allocated and retained. Sounds logical to me!
In such cases, when you have to load lots of data, it's good to use chunks.
for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)
The chunksize parameter says how many rows each chunk contains. It's worth checking whether it improves performance in your case!
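For what it's worth, the truncating converter and chunked reading can also be combined; a rough sketch reusing tgfile and list_of_names from the question (not code from the original answer):
import pandas as pd

def truncate_motif(motif):
    # keep at most 8 characters and mark the cut with '~'
    return motif[:8] + '~' if len(motif) > 8 else motif

chunks = pd.read_csv(tgfile,                 # tgfile / list_of_names as in the question
                     sep="\t",
                     comment='#',
                     header=None,
                     names=list_of_names,
                     converters={'motif': truncate_motif},
                     chunksize=10000)
df = pd.concat(chunks, ignore_index=True)    # or call process(chunk) per chunk instead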


reduce calculation precision to speed up execution

I have a data acquisition system which takes measurements for a few minutes and generates a csv file with 10 million rows and 10 columns. Then I import this csv file in Python (csv.reader), perform a bunch of operations on the acquired numeric data (but ‘only’ 10000 rows at a time, otherwise the computer memory would be overwhelmed). In the end, I export the results in another much smaller csv file (csv.writer).
The problem is that the runtime is very long and I want to speed it up. When I open the original csv file with Notepad I see that the numbers have up to 16 digits each, like 0.0015800159870059, 12.0257771094508 etc. I know that the accuracy of the DAQ is 0.1% at best and most of the trailing digits are noise. Is there an elegant way of forcing Python to operate globally with only 7-8 digits from start to finish, to speed up the calculations? I know about error propagation and I’m going to try different settings for the number of digits to see what the optimum is.
Please note that it is not enough for me to build a temporary csv file with 'truncated' data (e.g. containing 0.0015800, 12.0257771, etc.) and simply import that into Python. The calculations in Python should use reduced precision as well. I looked into the decimal module, with no success so far.
import csv

with open('datafile', newline='') as DAQfile:
    reader = csv.reader(DAQfile, delimiter=',')
    for row in reader:
        ...  # calculate stuff

with open('results.csv', 'w', newline='') as myfile:
    mywriter = csv.writer(myfile)
    ...  # write stuff
Adding some details, based on the comments so far:
The program calculates the peak of the moving average of the 'instantaneous power'. The data in the csv file can be described like this, where 'col' means column, V means voltage and I means current: col1=time, col2=V1, col3=I1, col4=V2, col5=I2 and so on until col11=V10, col12=I10. So each row represents a data sample taken by the DAQ.
The instantaneous power is Pi=V1*I1+V2*I2+...+V11*I11
To calculate the moving average over 10000 rows at a time, I built a buffer (initialized with Buffer=[0]*10000). This buffer stores the Pi's for 10000 consecutive rows and is updated every time csv.reader moves to the next row. The buffer works exactly like a shift register.
This way the memory usage is insignificant (verified). In summary, the calculations are multiplications, additions, min(a,b) function (to detect the peak of the moving average) and del/append for refreshing the buffer. The moving average itself is iterative too, something like newavg=oldavg+(newlast-oldfirst)/bufsize.
My thinking is that it does not make any sense to let Python work with all those decimals when I know that most of the trailing figures are garbage.
Forgot to mention that the size of the csv file coming from the DAQ is just under 1 GB.
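For reference, the buffer update described above can be sketched roughly like this (an illustration, not the asker's actual code; names such as BUF_SIZE are made up, and max() is used here for the peak):
from collections import deque

BUF_SIZE = 10000
buf = deque([0.0] * BUF_SIZE, maxlen=BUF_SIZE)  # shift register of instantaneous-power values
avg = 0.0
peak = float('-inf')

def push(p_new):
    # iterative moving-average update: newavg = oldavg + (newlast - oldfirst) / bufsize
    global avg, peak
    oldest = buf[0]
    buf.append(p_new)              # deque with maxlen drops the oldest value automatically
    avg += (p_new - oldest) / BUF_SIZE
    peak = max(peak, avg)          # track the peak of the moving average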
Yes, there is a way: use NumPy. First, it offers tons of vector/vector operations that can be performed with one command. For example,
a = b + c
efficiently sums two vectors.
Second, and this is the answer to your question, you can specify a 4-byte float type, which greatly reduces memory requirements and increases speed.
You can read your file directly using:
import numpy as np

data = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',')
...
data will be made up of standard 32-bit floats, with circa 7 digits of precision.
The CSV file can also be read in parts (chunks); the full signature is:
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None,
skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None,
usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_',
autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
If max_rows is set to, say, 10, only 10 rows will be read; the default is to read the whole file. You can also start anywhere in the middle of the file by skipping some initial records via the skip_header option.
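Putting the float32 and chunked-reading ideas together with the column layout described in the question, a rough sketch might look like this (illustrative only; the file name, chunk size, and exact column layout are assumptions):
import numpy as np

TOTAL_ROWS = 10_000_000   # from the question
CHUNK = 10_000            # rows per read; tune as needed

for start in range(0, TOTAL_ROWS, CHUNK):
    # note: genfromtxt re-scans the skipped lines on every call, so this is
    # simple but not the fastest possible way to chunk a ~1 GB file
    chunk = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',',
                          skip_header=start, max_rows=CHUNK)
    v = chunk[:, 1::2]        # voltage columns (layout assumed: time, V1, I1, V2, I2, ...)
    i = chunk[:, 2::2]        # current columns
    p = (v * i).sum(axis=1)   # instantaneous power per row, computed in float32
    # ... feed p into the moving-average / peak-detection code ...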
Use DyZ's comment: if there is a way to speed up the calculations (e.g. using << or >> instead of multiplication or division, respectively, when the multiplier or divisor is a power of 2 and the operands are integers), you should take it.
example:
>>> 22 * 16
352
>>> 22 << 4
352
In that scenario, I did the exact same operation with only a marginal decrease in time. However, when that adds up to something like 100 trillion calculations, the difference becomes much more notable.

Is it faster to use indices or names in pandas' "usecols"?

Say we have a (huge) csv file with a header "data".
I load this file with pandas (imported as pd):
pd_data = pd.read_csv(csv_file)
Now, say the file has many columns (headers) and many rows, so I would like to increase reading speed and reduce memory usage.
Is there a difference in performance between using:
(1) pd.read_csv(csv_file, usecols=['data'])
versus
(2) pd.read_csv(csv_file, usecols=[0])
(let us assume that 'data' is always column 0).
The reason for using (1) is the peace of mind that no matter where 'data' is actually stored in the file, we will load the right column.
Using (2) requires us to trust that column 0 is indeed 'data'.
The reason why I think (2) is faster is that (1) may have some overhead in looking up the column (unless it stops searching after the first match, of course).
Questions:
Which approach is faster among (1) and (2)?
Is there an even faster approach?
Thank you.
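One way to check this empirically is simply to time both calls on the file in question; a minimal sketch (the csv_file path and repetition count are placeholders):
import timeit

import pandas as pd

csv_file = 'huge.csv'   # hypothetical path

t_name = timeit.timeit(lambda: pd.read_csv(csv_file, usecols=['data']), number=3)
t_index = timeit.timeit(lambda: pd.read_csv(csv_file, usecols=[0]), number=3)
print(f"by name:  {t_name:.2f}s")
print(f"by index: {t_index:.2f}s")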

How to reshape a text file in Python so that the rows/columns swap position?

EDIT - SOLVED
I have a text file which contains 100 rows of data, each with 500 columns of values. I need to simply swap these so that my file contains 500 rows of data and 100 columns of values.
Everything in the first column [:,0] will be row #1, everything in the second column [:,1] will be row #2 - and so on until the end of the file.
I have searched for solutions and stumbled across np.reshape, but for the life of me I haven't been able to find an example of using it on an existing text file. I should also mention I'm not overly skilled with Python, so I haven't been able to figure this out on my own.
Alternatively, here is the code I used to create this text file in the first place; if it is simpler to just fix something here so the file is written already reshaped, then I am open to suggestions.
import numpy as np
import pylab as pl   # pl.loadtxt is presumably pylab in the original snippet

diffs = []
for number in range(1, 101):
    filea = pl.loadtxt('file' + str(number) + 'a')
    fileb = pl.loadtxt('file' + str(number) + 'b')
    diff = fileb[:, 1] - filea[:, 1]
    diffs.append(diff)
np.savetxt('diffs.txt', diffs)
Here, I have 100 a files and 100 b files. They each contain 500 rows and 2 columns. I am finding the difference between the values in the second column for each and looking to get them all in a file which maintains the 500 rows, but instead has 100 columns with the diff for 1b-1a 2b-2a etc through to 100b-100a.
Hopefully I have explained myself in a way which can be understood. Thanks in advance for any help.
SOLUTION:
reshape = np.loadtxt('diffs.txt')
diffs2 = np.transpose(reshape)
np.savetxt('diffs2.txt', (diffs2))
np.transpose is the normal tool for switching rows and columns in numpy. It can handle higher dimensions as well. Just be aware that it doesn't change anything when the array is 1-D.
If a more general solution is desired (and the data is stored in files, not in Python memory), I have recently written a command line utility that may be useful. Compiled binaries may become available soon; for now a Nim compiler is required to build it. I wrote this to learn some Nim, and it may not be optimally performant.
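As a side note, since the asker also offered to change the generating script: the transpose can be applied before saving, which makes the intermediate reload unnecessary (a one-line variation on the loop shown in the question):
# in the original script, save the transposed array directly:
np.savetxt('diffs.txt', np.transpose(diffs))   # 500 rows, 100 columns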

pandas: handling a DataFrame with a large number of strings

I want to read and handle a large CSV file (data_file) having the following two-column structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 GB on disk). (The actual strings are encoded in UTF-8; for simplicity, I give them in ASCII here.) Note that some of the values in the 2nd column are repeated.
I read this using pandas.read_csv():
import pandas

df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'], dtype={'id': 'u4', 'params': 'str'})
Once the file is read, the data frame df uses 1.2 Gb of RAM.
So far so good.
Now comes the processing part. I want to have the params string column with this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense that it gives the correct result. But:
More than 6.8 GB of RAM are used while performing the map operation.
After the computation has finished, 5.5 GB of RAM are used by df (after gc.collect()), although the strings computed for the params column are shorter than the ones that were read.
Can someone explain this and propose an alternative way of performing the above operation using pandas (I use Python 3.4, pandas 0.16.2, win64)?
Answering my own question.
It turns out that pandas.read_csv() is clever. When the file is read, identical strings are stored only once (they are made unique). But when these strings are processed and stored back in the column, they are no longer unique, hence the RAM usage increases. In order to avoid this, one has to maintain the uniqueness manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
With this solution, maximum RAM usage was only 2.8 GB, and it settled slightly below the initial RAM usage after reading the data (1.2 GB), as expected.
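Since the question notes that many params values repeat, another option (not part of the original answer) is pandas' categorical dtype, which likewise stores each distinct string only once; a one-line sketch building on the code above:
# clean_keywords and df as defined above; 'category' keeps one copy per distinct string
df['params'] = df['params'].map(clean_keywords).astype('category')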

Python fast string parsing, manipulation

I am using Python to parse incoming comma-separated strings. I want to do some calculations afterwards on the data.
Each string is about 800 characters long, with 120 comma-separated fields.
There are 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In the later part of the program I insert these lines into sqlite memory table for further use.
But overall, 4-5 minutes just for parsing and building a list is not good for my project.
I run this processing in around 6-8 threads.
Might switching to C/C++ help?
Are you loading a dict with your file records? It's probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)
l = sum((get_fields(v.split(',')) for v in datafile), [])
This avoids creating any overall data structures, and only accumulates the desired values as returned by get_fields.
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
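For reference, a minimal sketch of the csv-module approach mentioned in the PS, processing each record inside the loop instead of accumulating one huge list (the field indices and file name are made up for illustration):
import csv
from operator import itemgetter

# hypothetical: pick ~20 of the 120 fields by index
get_fields = itemgetter(0, 3, 7, 15, 42)

with open('records.csv', newline='') as f:
    for row in csv.reader(f):
        fields = get_fields(row)
        # ... process fields here instead of extending one big list ...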
