Dask Dataframe: Get row count? - python

Simple question: I have a dataframe in dask containing about 300 million records. I need to know the exact number of rows that the dataframe contains. Is there an easy way to do this?
When I try to run dataframe.x.count().compute() it looks like it tries to load the entire data into RAM, for which there is no space and it crashes.

import dask.dataframe as dd

# keep the block size small enough for each partition to fit in your memory
ddf = dd.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()
From the documentation:
blocksize : str, int or None, optional
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.

If you only need the number of rows, you can load just a subset of the columns, choosing those with the lowest memory usage (such as category or integer columns rather than string/object ones), and then run len(df.index).

Related

pandas.read_csv slow when reading file with variable length string

I have an issue which I think I somewhat solved but I would like to learn more about it or learn about better solutions.
The problem: I have tab separated files with ~600k lines (and one comment line), of which one field (out of 8 fields) contains a string of variable length, anything between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This suddenly is lots faster.
So my questions are:
Why is reading this file so slow? Does it have to do with allocating memory?
Why does the converter function help here? It has to execute an additional function for every row, but is still lots faster...
What else can be done?
You didn't say what "slow" means for you, but if:
your file contains ca. 600k rows,
each row contains 1-2000 characters (say 1000 on average, so each line is ca. 1000 B),
then the file's size is about 600,000 * 1000 B = 600 MB (roughly 570 MiB). That's a lot, especially if you don't have much RAM.
The converter helps because the average size of that field is suddenly not ca. 1000 B but at most 9 characters (8 kept, plus the '~' marker). The whole strings are not kept; each one is only checked for its length and cut if needed. Sounds logical to me!
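One way to check the memory side of this explanation is to compare the in-memory size of the column with and without the converter. The sketch below uses made-up data (random-length "ACGT" repeats) and made-up column names; it measures memory only, not parsing time.

```python
import io
import random

import pandas as pd


def truncate_motif(motif):
    # Same truncation rule as in the question
    return motif[:8] + '~' if len(motif) > 8 else motif


# Hypothetical tab-separated data with one variable-length string field.
random.seed(0)
rows = ["chr1\t{}\t{}".format(i, "ACGT" * random.randint(1, 500))
        for i in range(2000)]
text = "\n".join(rows) + "\n"
names = ["chrom", "pos", "motif"]

full = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=names)
short = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=names,
                    converters={"motif": truncate_motif})

# deep=True also counts the string payloads, not just the pointers
full_bytes = full["motif"].memory_usage(deep=True)
short_bytes = short["motif"].memory_usage(deep=True)
print(full_bytes, short_bytes)
```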
In such cases, when you have to load lots of data, it's good to use chunks.
for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)
The chunksize parameter sets how many rows each chunk contains. It's worth checking whether it improves performance in your case!
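As a small self-contained illustration of the chunked loop, the sketch below reads an in-memory table ten rows at a time and keeps only running aggregates. The data and the column names `a` and `b` are invented for the example.

```python
import io

import pandas as pd

# Hypothetical 25-row, two-column tab-separated table.
text = "".join("{}\t{}\n".format(i, i * 2) for i in range(25))

total_rows = 0
running_sum = 0
# With chunksize set, read_csv yields DataFrames of at most 10 rows each
for chunk in pd.read_csv(io.StringIO(text), sep="\t", header=None,
                         names=["a", "b"], chunksize=10):
    total_rows += len(chunk)
    running_sum += int(chunk["b"].sum())

print(total_rows, running_sum)
```

Only one chunk is held at a time, so peak memory depends on the chunk size rather than on the file size.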

reduce calculation precision to speed up execution

I have a data acquisition system which takes measurements for a few minutes and generates a csv file with 10 million rows and 10 columns. Then I import this csv file in Python (csv.reader), perform a bunch of operations on the acquired numeric data (but ‘only’ 10000 rows at a time, otherwise the computer memory would be overwhelmed). In the end, I export the results in another much smaller csv file (csv.writer).
The problem is that the runtime is very long and I want to speed it up. When I open the original csv file with Notepad I see that the numbers have up to 16 digits each, like 0.0015800159870059, 12.0257771094508 etc. I know that the accuracy of the DAQ is 0.1% at best and most of the trailing digits are noise. Is there an elegant way of forcing Python to operate globally with only 7-8 digits from start to finish, to speed up the calculations? I know about error propagation and I’m going to try different settings for the number of digits to see what the optimum is.
Please note that it is not enough for me to build a temporary csv file with ‘truncated’ data (e.g. containing 0.0015800, 12.0257771 etc) and simply import that into Python. The calculations in Python should use reduced precision as well. I looked into decimal module, with no success so far.
with open('datafile', newline='') as DAQfile:
    reader = csv.reader(DAQfile, delimiter=',')
    for row in reader:
        … calculate stuff…
with open('results.csv', 'w', newline='') as myfile:
    mywriter = csv.writer(myfile)
    …write stuff…
Adding some details, based on the comments so far:
The program calculates the peak of the moving average of the 'instantaneous power'. The data in the csv file can be described like this, where 'col' means column, V means voltage and I means current: col1=time, col2=V1, col3=I1, col4=V2, col5=I2 and so on until col11=V10, col12=I10. So each row represents a data sample taken by the DAQ.
The instantaneous power is Pi=V1*I1+V2*I2+...+V11*I11
To calculate moving average over 10000 rows at a time, I built a buffer (initialized with Buffer=[0]*10000). This buffer will store the Pi's for 10000 consecutive rows and will be updated every time when csv.reader moves to the next row. The buffer works exactly like a shift register.
This way the memory usage is insignificant (verified). In summary, the calculations are multiplications, additions, min(a,b) function (to detect the peak of the moving average) and del/append for refreshing the buffer. The moving average itself is iterative too, something like newavg=oldavg+(newlast-oldfirst)/bufsize.
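The buffer-plus-iterative-update scheme described here could look roughly like the sketch below. It is an illustration only: the function name `peak_moving_average` is invented, the buffer is a `collections.deque` instead of a list with del/append, and the peak is assumed to be the maximum of the moving average.

```python
from collections import deque


def peak_moving_average(samples, bufsize):
    """Return the peak of a moving average over the last `bufsize` samples."""
    # Buffer starts filled with zeros, like Buffer=[0]*10000 in the question
    buf = deque([0.0] * bufsize, maxlen=bufsize)
    avg = 0.0
    peak = float("-inf")
    for value in samples:
        oldest = buf[0]
        buf.append(value)  # maxlen makes the deque drop the oldest entry
        # newavg = oldavg + (newlast - oldfirst) / bufsize
        avg += (value - oldest) / bufsize
        peak = max(peak, avg)
    return peak


result = peak_moving_average([1, 2, 3, 4], bufsize=2)
print(result)
```

A deque with `maxlen` behaves like the shift register described above, and dropping the oldest entry costs constant time, whereas deleting the first element of a long list does not.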
My thinking is that it does not make any sense to let Python work with all those decimals when I know that most of the trailing figures are garbage.
Forgot to mention that the size of the csv file coming from the DAQ is just under 1Gb.
Yes, there is a way: use NumPy. First, there are tons of vector/vector operations that can be performed with one command:
a = b + c
will efficiently sum two vectors.
Second, and this is the answer to your question, you can specify a 4-byte float type, greatly reducing memory requirements and increasing speed.
You should read your file directly using
import numpy as np
data = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',')
...
data will be made up of standard 32-bit floats, with circa 7 digits of precision.
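To see what the 32-bit type does to values like the ones in the question, the sketch below reads the same small in-memory sample once as float64 and once as float32, then compares the size and the rounding error. The two-row sample is made up from the numbers quoted in the question; whether the calculations actually get faster depends on the workload and would need to be measured.

```python
import io

import numpy as np

# Two-row sample built from the values quoted in the question.
text = ("0.0015800159870059,12.0257771094508\n"
        "12.0257771094508,0.0015800159870059\n")

data64 = np.genfromtxt(io.StringIO(text), dtype=np.float64, delimiter=",")
data32 = np.genfromtxt(io.StringIO(text), dtype=np.float32, delimiter=",")

# Relative error introduced by storing the values as 32-bit floats
rel_err = np.abs(data32.astype(np.float64) - data64) / np.abs(data64)

print(data64.nbytes, data32.nbytes)
print(rel_err.max())
```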
The CSV file can also be read in parts:
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None,
skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None,
usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_',
autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
Here is the full list of parameters. If max_rows is set to, say, 10, only 10 rows will be read. The default is to read the whole file. You can read anything from the middle of the file by skipping some initial records via the skip_header option.
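Combining skip_header and max_rows, block-wise reading might be sketched like this. The helper `read_block` and the five-row demo file are hypothetical, and the total row count is assumed to be known in advance.

```python
import os
import tempfile

import numpy as np


def read_block(path, start_row, n_rows):
    """Read up to n_rows rows starting at start_row as float32."""
    block = np.genfromtxt(path, dtype=np.float32, delimiter=",",
                          skip_header=start_row, max_rows=n_rows)
    # A single remaining row comes back 1-D, so normalise to 2-D
    return np.atleast_2d(block)


# Throwaway demo file: 5 rows, 2 columns.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as handle:
    for i in range(5):
        handle.write("{},{}\n".format(i, i * 10))

total_rows = 5  # assumed known, e.g. from a previous line count
block_sums = []
for start in range(0, total_rows, 2):
    block = read_block(path, start, 2)
    block_sums.append(float(block[:, 1].sum()))

os.remove(path)
print(block_sums)
```

Note that each call scans the file from the beginning to skip the header rows, so for very large files this gets slower with every block; keeping the file open and passing the file object may be worth trying instead.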
Use DyZ's comment. If there is a way to speed up the calculations (e.g. using << or >> for integer multiplications or divisions, respectively, when the second operand or the divisor is a power of 2), you should take it.
Example:
>>> 22 * 16
352
>>> 22 << 4
352
In that scenario, I did the exact same operation with a marginal decrease in time. However, if that equates to 100 trillion calculations, the difference is much more noticeable.
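A quick check of where the shift trick applies: it is equivalent to multiplying or floor-dividing by a power of two for Python integers, but the shift operators are not defined for floats, which is what the DAQ data in the question consists of. The variable names below are made up for the example.

```python
# Shifts match multiplication and floor division by 2**4 for integers
shifted_mul = 22 << 4
shifted_div = 352 >> 4
same_as_arithmetic = (shifted_mul == 22 * 16) and (shifted_div == 352 // 16)

# Floats do not support shift operators
reading = 1.5
try:
    reading << 1
    float_shift_error = None
except TypeError as exc:
    float_shift_error = type(exc).__name__

print(shifted_mul, shifted_div, same_as_arithmetic, float_shift_error)
```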

(Py)Spark combineByKey mergeCombiners output type != mergeCombinerVal type

I'm trying to optimize a piece of software written in Python using pandas DataFrames (DF). The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT a very few clients have a really HUGE amount of data, so I need to save memory when creating their DF.
To do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DF.
However, this makes many copies of the data (RDD rows, list and pandas DF...) in a single task/node and crashes, so I would like to get rid of those copies.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking if y is a pandas DF already, but we can do it too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        # DataFrame.append returns a new frame, so return its result
        return pandasDF.append(y)
    else:
        # x is already a pandas DF here
        return x.append(y)
My question: the docs say nothing about this, but does anyone know whether it's safe to assume that this will work? (The return type of merging two combiners is not the same as the combiner type.) I can control the datatype in mergeCombinerVal too if the number of "bad" calls is marginal, but it would be very inefficient to append to a pandas DF row by row.
Any better idea for doing what I want to do?
Thanks!
PS: Right now I'm packing Spark rows. Would switching from Spark rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps with packing the rows, because my rows are small but there are very many of them), and it also lets me group them into a "real" Python list. groupByKey groups into some kind of iterable that pandas doesn't support, which forces me to create another copy of the structure, doubling memory usage and crashing. Having a real list helps me with memory management when packing the rows into pandas/C datatypes.
Now I can use those lists to build a dataframe directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have given slightly lower memory usage (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said, it's not guaranteed by the API/docs, so better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I'm processing most of my clients in a very parallel low-memory-per-executor job and the biggest ones in a not-so-parallel high-memory-per-executor job. I decide based on a pre-count I perform.
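The list-based variant described in this answer can be illustrated without a cluster. The sketch below is a plain-Python simulation, not Spark: `simulate_combine_by_key`, the client keys and the column names are invented, and the three combiner functions keep one type (list) throughout. In PySpark the same three functions would be passed to `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`.

```python
import pandas as pd


def create_combiner(val):
    return [val]


def merge_value(acc, val):
    acc.append(val)
    return acc


def merge_combiners(left, right):
    # Both sides are lists, so the combiner type never changes
    left.extend(right)
    return left


def simulate_combine_by_key(partitions):
    """Plain-Python stand-in for combineByKey, for illustration only."""
    merged = {}
    for part in partitions:
        # Step 1: combine values within one partition
        local = {}
        for key, val in part:
            if key in local:
                local[key] = merge_value(local[key], val)
            else:
                local[key] = create_combiner(val)
        # Step 2: merge this partition's combiners into the global result
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


# Hypothetical (client, row) pairs spread over two partitions.
partitions = [
    [("client_a", (1, 10.0)), ("client_b", (2, 20.0))],
    [("client_a", (3, 30.0))],
]
rows_by_client = simulate_combine_by_key(partitions)
frames = {client: pd.DataFrame(rows, columns=["id", "amount"])
          for client, rows in rows_by_client.items()}
print(frames["client_a"])
```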

Reassigning to Python pandas series: garbage collection

I am using a for loop that loads tsv files of about 1 GB each into a pandas series. They are always assigned to the same variable, and then I use Series.add() to add them to a series that contains the running total of the numbers.
Update: To clarify, all the tsv files have more or less the same index, so the length of the total series does not really change; only the values get added up.
I would expect the memory of the "old" series to get freed occasionally so that the memory usage stays within bounds. However, the memory usage grows until the 62 GB of memory on the machine are exhausted.
Does anyone have any ideas how to solve the problem? I tried deleting the variable explicitly within the loop and I tried calling gc.collect() in the loop. Neither helped. I am using Python 2.7.3.
More details:
In the tsv files, the first two columns are an index (chromosome and position) and the third column is an integer.
The code is:
total = pd.read_csv(coverage_file1, sep='\t', index_col=[0, 1], header=None, names=['depth'], squeeze=True)
for file in coverage_files:
    series = pd.read_csv(file, sep='\t', index_col=[0, 1], header=None, names=['depth'], squeeze=True)
    total = total.add(series, fill_value=0).astype(int)
    del series  # I tried with and without this and the next line
    gc.collect()
total.to_csv(args.out, sep='\t', header=None)
But you are still accumulating data in total while series is being garbage collected. Maybe optimize the algorithm? It looks to me like you just want to join files of the same format; if so, there is no need to use pandas for this.
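Since the question sums values per (chromosome, position) rather than concatenating files, a pandas-free version of that accumulation might look like the sketch below. It uses only the standard library; `sum_depths` and the two tiny demo files are hypothetical. A dict with one entry per position can itself be large, so whether this needs less memory than the pandas loop would have to be measured.

```python
import csv
import os
import tempfile
from collections import defaultdict


def sum_depths(paths):
    """Sum the third column per (chromosome, position) key across TSV files."""
    totals = defaultdict(int)
    for path in paths:
        with open(path, newline="") as handle:
            # Rows are streamed one at a time; only the totals are kept
            for chrom, pos, depth in csv.reader(handle, delimiter="\t"):
                totals[(chrom, int(pos))] += int(depth)
    return totals


# Two tiny throwaway files standing in for the 1 GB coverage files.
tmpdir = tempfile.mkdtemp()
paths = []
for name, rows in [("a.tsv", [("chr1", 1, 5), ("chr1", 2, 7)]),
                   ("b.tsv", [("chr1", 1, 3), ("chr2", 9, 4)])]:
    path = os.path.join(tmpdir, name)
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    paths.append(path)

totals = sum_depths(paths)
print(dict(totals))
```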

Speeding slicing of big numpy array

I have a big array (1000x500000x6) that is stored in a pyTables file. I am doing some calculations on it that are fairly optimized in terms of speed, but what is taking the most time is the slicing of the array.
At the beginning of the script, I need to get a subset of the rows: reduced_data = data[row_indices, :, :] and then, for this reduced dataset, I need to access:
columns one by one: reduced_data[:,clm_indice,:]
a subset of the columns: reduced_data[:,clm_indices,:]
Getting these arrays takes forever. Is there any way to speed that up? Storing the data differently, for example?
You can try choosing the chunkshape of your array wisely, see: http://pytables.github.com/usersguide/libref.html#tables.File.createCArray
This option controls in which order the data is physically stored in the file, so it might help to speed up access.
With some luck, for your data access pattern, something like chunkshape=(1000, 1, 6) might work.
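A rough way to reason about a candidate chunkshape is to count how many chunks a given slice touches, on the assumption that chunked storage reads every chunk a selection intersects. The helper `chunks_touched`, the column index 42 and the second chunkshape below are made up for the comparison; this is arithmetic only and does not call PyTables.

```python
from math import prod


def chunks_touched(selection, chunkshape):
    """Count chunks intersected by a selection.

    selection: one half-open (start, stop) range per axis.
    chunkshape: chunk extent per axis.
    """
    per_axis = []
    for (start, stop), size in zip(selection, chunkshape):
        first = start // size
        last = (stop - 1) // size
        per_axis.append(last - first + 1)
    return prod(per_axis)


# Array of shape (1000, 500000, 6); select all rows of one column: data[:, 42, :]
one_column = [(0, 1000), (42, 43), (0, 6)]

column_wise = chunks_touched(one_column, chunkshape=(1000, 1, 6))
row_wise = chunks_touched(one_column, chunkshape=(1, 500000, 6))
print(column_wise, row_wise)
```

With the column-oriented chunkshape suggested above, a single-column read stays within one chunk, while a row-oriented layout would touch one chunk per row for the same read.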
