correct pattern for dask compute minimum? - python

Is this the correct way to call compute()?
def call_minmax_duration(data):
    mmin = dd.DataFrame.min(data).compute()
    mmax = dd.DataFrame.max(data).compute()
    return mmin, mmax

Two things.
First, your data variable should be a dask.dataframe object, such as might be created by dd.from_pandas(...) or dd.read_csv(...).
Second, it's probably better to compute both results at once; that way shared intermediates only need to be computed once.
Example
import dask.dataframe as dd
df = dd.read_csv('2016-*-*.csv')
dd.compute(df.mycolumn.min(), df.mycolumn.max())
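Putting the two points together, a minimal sketch of the rewritten function (assuming data is already a dask dataframe) might look like this:
import dask.dataframe as dd

def call_minmax_duration(data):
    # a single dd.compute call lets dask share intermediate work between the two reductions
    mmin, mmax = dd.compute(data.min(), data.max())
    return mmin, mmax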

Related

How to save outputs (from xarray) from python dask delayed into a pandas dataframe

I am very new to trying to parallelize my python code. I am trying to perform some analysis on an xarray, then fill in a pandas dataframe with the results. The columns of the dataframe are independent, so I think it should be trivial to parallelize using dask delayed, but I can't work out how. My xarrays are quite big, so this loop takes a while and uses a lot of memory. It could also be chunked by time instead, if that's easier (this might help with memory)!
Here is the un-parallelized version:
from time import sleep

import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd

data1 = np.random.rand(4, 3, 3)
data2 = np.random.randint(4, size=(3, 3))
locs1 = ["IA", "IL", "IN"]
locs2 = ['a', 'b', 'c']
times = pd.date_range("2000-01-01", periods=4)
xarray1 = xr.DataArray(data1, coords=[times, locs1, locs2], dims=["time", "space1", "space2"])
xarray2 = xr.DataArray(data2, coords=[locs1, locs2], dims=["space1", "space2"])

def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

final_df = pd.DataFrame(columns=range(4), index=times)
for column in final_df:
    final_df[column] = delayed_where(xarray1, xarray2, column)
I would like to parallelize the for loop, but have tried:
final_df_delayed = pd.DataFrame(columns=range(4), index=times)
for column in final_df:
    final_df_delayed[column] = delayed(delayed_where)(xarray1, xarray2, column)
final_df.compute()
Or maybe something with dask delayed?
final_df_dd = dd.from_pandas(final_df, npartitions=2)
for column in final_df:
    final_df_dd[column] = delayed(delayed_where)(xarray1, xarray2, column)
final_df_dd.compute()
But none of these work. Can anyone help?
You're using delayed correctly, but it's not possible to construct a dask dataframe in the way you specified.
from dask import delayed
import dask

@delayed
def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

@delayed
def form_df(list_col_results):
    final_df = pd.DataFrame(columns=range(4), index=times)
    for n, column in enumerate(final_df):
        final_df[column] = list_col_results[n]
    return final_df

delayed_cols = [delayed_where(xarray1, xarray2, col) for col in final_df.columns]
delayed_df = form_df(delayed_cols)
delayed_df.compute()
Note that the enumeration is a clumsy way to get the correct order of the columns, but your actual problem might guide you to a better way of specifying this (e.g. by explicitly specifying each column as an individual argument).
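For example, one way to avoid relying on positional order (a sketch, not part of the original answer; it assumes the same xarray1, xarray2, times and decorated delayed_where as above, and the dict-based form_df_from_dict is hypothetical) is to key each delayed result by its column:
from dask import delayed
import pandas as pd

@delayed
def form_df_from_dict(col_results):
    # build the frame from a {column: result} mapping, so column order never matters
    final_df = pd.DataFrame(index=times)
    for col, res in col_results.items():
        final_df[col] = res
    return final_df

delayed_cols = {col: delayed_where(xarray1, xarray2, col) for col in range(4)}
delayed_df = form_df_from_dict(delayed_cols)
result = delayed_df.compute()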

Apply Function or Lambda to Pandas GROUPBY

I would like to apply a specific function (in this case a logit model) to a dataframe that can be grouped (by the variable "model"). I know the task can be performed through a loop; however, I believe this to be inefficient at best. Example code below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)), columns=list('abcdefghij'))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)), columns=list('abcdefghij'))
df1['model'] = 1
df1['target'] = np.random.randint(2, size=100)
df2['model'] = 2
df2['target'] = np.random.randint(2, size=100)
data = pd.concat([df1, df2])

### Clunky, but works...
for i in range(1, 2+1):
    lm = sm.Logit(data[data['model']==i]['target'],
                  sm.add_constant(data[data['model']==i].drop(['target'], axis=1))).fit(disp=0)
    print(lm.summary2())

### Can this work?
def elegant(self):
    lm = sm.Logit(data['target'],
                  sm.add_constant(data.drop(['target'], axis=1))).fit(disp=0)
better = data.groupby(['model']).apply(elegant)
If the above groupby can work, is this a more efficient way to perform than looping?
This could work:
def elegant(df):
lm = sm.Logit(df['target'],
sm.add_constant(df.drop(['target'],axis=1))).fit(disp=0)
return lm
better = data.groupby('model').apply(elegant)
Using .apply you pass the dataframe groups to the function elegant, so elegant has to take a dataframe as its first argument here. Also, your function needs to return the result of your calculation, lm.
For more complex functions the following structure can be used:
def some_func(df, kw_param=1):
    # some calculations on df using kw_param
    return df

better = data.groupby('model').apply(lambda group: some_func(group, kw_param=99))
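As a small follow-up (a sketch, assuming the elegant shown above), better is then a Series of fitted results indexed by model, so each fit can be inspected directly:
# better is indexed by the group key 'model'; each element is a fitted statsmodels result
print(better.loc[1].summary2())
print(better.loc[2].params)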

dask.DataFrame.apply and variable length data

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_varibale_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to always work, or am I just lucky here? Is dask expecting all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata, if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()
# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))
# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]
# calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)
# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this.
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html
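For contrast, a small sketch of the case where the output length is fixed (the fixed_length_series below is hypothetical, not from the question): when the columns are known ahead of time, you can describe them to dask via the meta argument of apply.
import numpy as np
import pandas as pd
import dask.dataframe as dd

def fixed_length_series(x):
    # hypothetical variant: always three values per row, so the columns are known up front
    return pd.Series(np.random.randn(3))

pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)

# meta tells dask which columns and dtypes to expect from the apply
meta = pd.DataFrame({i: pd.Series(dtype='f8') for i in range(3)})
result = ddf.apply(fixed_length_series, axis=1, meta=meta).compute()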

Separating pandas dataframe by offset string

Let's say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to take every, let's say, 6 hours of data and independently fit a curve to that data. Since pandas' resample function has a how keyword that is supposed to accept any numpy array function, I thought I could maybe use resample to do that with polyfit, but apparently there is no way (right?).
So the only alternative way I thought of doing that is separating df into a sequence of DataFrames, so I am trying to create a function that would work such as
l=splitDF(df, '6H')
and it would return a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I have nothing that works except something like the following manual method:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    cont = 0
    for date in data.index:
        ... check for date in res_index ...
        ... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    pdate = res_index[0]
    for date in res_index:
        out.append(data[pdate:date][:-1])
        pdate = date
    out.append(data[pdate:])
    return out
But it still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import numpy as np
import pandas as pd

# let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df

# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new

# to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'], as_index=False).apply(polyfit_x_y)
How about:
np.array_split(df, len(df)//6)
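Another option (a sketch, not from either answer above; it relies on pandas' Grouper, which accepts the same offset strings as resample) is to let groupby do the 6-hour splitting directly:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))

# group by 6-hour bins of the DatetimeIndex and collect each bin as its own DataFrame
chunks = [chunk for _, chunk in df.groupby(pd.Grouper(freq='6H'))]
print(len(chunks), [len(c) for c in chunks])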

What is the most efficient way of counting occurrences in pandas?

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time because they skip missing values (compare with size, which doesn't).
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
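For instance, a short sketch on the DataFrame from the question (the column names in the reshaped output are just for illustration):
# counts per word, as a Series sorted in descending order
counts = df['word'].value_counts()

# if a two-column DataFrame is preferred
Occurrences_of_Words = counts.rename_axis('word').reset_index(name='Occurrences')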
When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (default is True)
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
s = pd.Series(np.random.randint(2*n, size=n))
return s.astype(str) if dt=='obj' else s
def plot_perfplot(datatype):
perfplot.show(
setup = lambda n: creator(n, datatype),
kernels = [lambda s: s.value_counts(),
lambda s: pd.Series(Counter(s)),
lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
],
labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
n_range = [2 ** k for k in range(5, 25)],
equality_check = lambda *x: (d:= pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
xlabel = '~len(s)',
title = f'dtype {datatype}'
)
plot_perfplot('obj')
plot_perfplot('int')
