dask.DataFrame.apply and variable length data - python

I would like to apply a function to a dask.DataFrame, that returns a Series of variable length. An example to illustrate this:
def generate_varibale_length_series(x):
'''returns pd.Series with variable length'''
n_columns = np.random.randint(100)
return pd.Series(np.random.randn(n_columns))
#apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1,2,3,4,5,6]))
ddf = dd.from_pandas(pdf, npartitions = 3)
result = ddf.apply(generate_varibale_length_series, axis = 1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to work always or am I just lucky here? Is dask expecting, that all partitions have the same amount of columns?
In case the metadata inference fails, how can I provide metadata, if the number of columns is not known beforehand?
Background / usecase: In my dataframe each row represents a simulation trail. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trail in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here an approach that uses dask delayed to compute result:
#convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()
#delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis = 1))
#use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf.to_delayed()]
#calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)
#concatenate them
result = pd.concat(result)

Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html

Related

efficient way to find unique values within time windows in python?

I have a large pandas dataframe that countains data similar to the image attached.
I want to get a count of how many unique TN exist within each 2 second window of the data. I've done this with a simple loop, but it is incredibly slow. Is there a better technique I can use to get this?
My original code is:
uniqueTN = []
tmstart = 5400; tmstop = 86400
for tm in range(int(tmstart), int(tmstop), 2):
df = rundf[(rundf['time']>=(tm-2))&rundf['time']<tm)]
uniqueTN.append(df['TN'].unique())
This solution would be fine it the set of data was not so large.
Here is how you can implement groupby() method and nunique().
rundf['time'] = (rundf['time'] // 2) * 2
grouped = rundf.groupby('time')['TN'].nunique()
Another alternative is to use the resample() method of pandas and then the nunique() method.
grouped = rundf.resample('2S', on='time')['TN'].nunique()

Using pandas .shift on multiple columns with different shift lengths

I have created a function that parses through each column of a dataframe, shifts up the data in that respective column to the first observation (shifting past '-'), and stores that column in a dictionary. I then convert the dictionary back to a dataframe to have the appropriately shifted columns. The function is operational and takes about 10 seconds on a 12x3000 dataframe. However, when applying it to 12x25000 it is extremely extremely slow. I feel like there is a much better way to approach this to increase the speed - perhaps even an argument of the shift function that I am missing. Appreciate any help.
def create_seasoned_df(df_orig):
"""
Creates a seasoned dataframe with only the first 12 periods of a loan
"""
df_seasoned = df_orig.reset_index().copy()
temp_dic = {}
for col in cols:
to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
return df_seasoned
Try using this code with apply instead:
def create_seasoned_df(df_orig):
df_seasoned = df_orig.reset_index().apply(lambda x: x.shift(df_seasoned[col].eq('-').sum()), axis=0)
return df_seasoned

How to implement dask mappartitions on multiple dataframe lambda function?

I've implemented a fuzzy string matching algo between two dataframes just using pandas. My issue is how do I convert this to a dask operation using multiple cores? My program runs about 3-4 days on pure python, and I want to parallelize the operations to optimize time cost. I've already used the multiprocessing package to extract the number of cores using the code below:
numCores = multiprocessing.cpu_count()
fields = ['id','phase','new']
emb = pd.read_csv('my_csv.csv', skipinitialspace=True, usecols=fields)
Then I had to subdivide the dataframe emb into two dataframes (emb1, emb2) based on numeric values associated per string. As in I'm matching a dataframe with all elements having value to 3 to their corresponding value 2 in the other dataframe by matched string.The code for pure pandas operation is below.
emb1 = emb[emb.phase.isin([3.0])]
emb1.set_index('id',inplace=True)
emb2 = emb[emb.phase.isin([2.0,1.5])]
emb2.set_index('id',inplace=True)
def fuzzy_match(x, choices, scorer, cutoff):
return process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)
FuzzyWuzzyResults = pd.DataFrame(emb1.sort_index().loc[:,'strings'].apply(fuzzy_match, args = (emb2.loc[:,'strings'],fuzz.ratio,90)))
I sort of tried doing a dask implementation using this code:
emb1 = dd.from_pandas(emb1, npartitions=numCores)
emb2 = dd.from_pandas(emb2, npartitions=numCores)
But running the lambda function for two dataframes is confusing me. Any ideas?
So I just fixed my code to remove the manual partition of the dataframe and used groupby instead.
Here's the code:
for i in [2.0,1.5]:
FuzzyWuzzyResults = emb.map_partitions(lambda df: df.groupby('phase').get_group(3.0)['drugs'].apply(fuzzy_match, args=(df.groupby('phase').get_group(i)['drugs'],fuzz.ratio,90)), meta=('results')).compute()
Not sure whether it's accurate, but at least it's running, and that too on all CPU cores.

fastest way to generate column with random elements based on another column

I have a dataframe of ~20M lines
I have a column called A that gives me an id (there are ~10K ids in total).
The value of this id defines a random distribution's parameters.
Now I want to generate a column B, that is randomly drawn from the distribution that is defined by the value in the column A
What is the fastest way to do this? Doing something with iterrows or apply is extremely slow. Another possiblity is to group by A, and generate all my data for each value of A (so I only draw from one distribution). But then I don't end up with a Dataframe but with a "groupBy" object, and I don't know how to go back to having the initial dataframe, plus my new column.
I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.
import numpy as np
num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
loc_params = np.random.random(num_ids)
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)
for idx in ids:
A_idxs = A == idx
B[A_idxs] = np.random.normal(np.sum(A_idxs), loc_params[idx])
This question is pretty vague, but how would this work for you?
df['B'] = df.apply(lambda row: distribution(row.A), axis=1)
Editing from question edits (apply is too slow):
You could create a mapping dictionary for the 10k ids to their generated value, then do something like
df['B'] = df['A'].map(dictionary)
I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.DataFrame as dd
with pd.HDFStore(path) as store:
data = dd.from_hdf(store, 'sim')[col1]
shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could though if you raise an issue. In principle this is not so dissimilar from rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc..
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions then you can use the dask.dataframe.rolling.wrap_rolling function to turn your pandas style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.

Categories