dask groupby apply then merge back to dataframe - python

How would I go about creating a new column that is the result of a groupby and apply on another column, while keeping the order of the dataframe (or at least being able to sort it back)?
example:
I want to normalize a signal column by group
import dask
import numpy as np
import pandas as pd
from dask import dataframe
def normalize(x):
    return (x - x.mean()) / x.std()
data = np.vstack([np.arange(2000), np.random.random(2000), np.round(np.linspace(0, 10, 2000))]).T
df = dataframe.from_array(data, columns=['index', 'signal', 'id_group'], chunksize=100)
df = df.set_index('index')
normalized_signal = df.groupby('id_group').signal.apply(normalize, meta=pd.Series(name='normalized_signal_by_group'))
normalized_signal.compute()
I do get the right series, but the index is shuffled.
How do I get this series back into the dataframe?
I tried
df['normalized_signal'] = normalized_signal
df.compute()
but I get
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
I also tried a merge, but my final dataframe ends up shuffled, with no easy way to re-sort it along the index:
df2 = df.merge(normalized_signal.to_frame(), left_index=True, right_index=True, how='left')
df2.compute()
It works when I compute the series and then sort_index() in pandas, but that doesn't seem efficient:
df3 = df.merge(normalized_signal.to_frame().compute().sort_index(), left_index=True, right_index=True, how='left')
df3.compute()
The equivalent pandas way is:
df4 = df.compute()
df4['normalized_signal_by_group'] = df4.groupby('id_group').signal.transform(normalize)
df4

Unfortunately transform is not implemented in dask yet. My (ugly) workaround is:
import numpy as np
import pandas as pd
import dask.dataframe as dd
pd.options.mode.chained_assignment = None
def normalize(x):
    return (x - x.mean()) / x.std()

def dask_norm(gp):
    gp["norm_signal"] = normalize(gp["signal"].values)
    return gp.to_numpy()  # originally gp.as_matrix(), which was removed in newer pandas
data = np.vstack([np.arange(2000), np.random.random(2000), np.round(np.linspace(0, 10, 2000))]).T
df = dd.from_array(data, columns=['index', 'signal', 'id_group'], chunksize=100)
df1 = df.groupby("id_group").apply(dask_norm, meta=pd.Series(name="a"))
df2 = df1.to_frame().compute()
df3 = pd.concat([pd.DataFrame(a) for a in df2.a.values])
df3.columns = ["index", "signal", "id_group", "normalized_signal_by_group"]
df3.sort_values("index", inplace=True)
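Another option, not from the original answer but a sketch of a different route: the per-group statistics are cheap to compute with dask's supported groupby aggregations, so you can compute the group means and standard deviations first and then broadcast them into each partition with map_partitions, which leaves the index and row order untouched. The add_norm helper and the normalized_signal_by_group column name below are assumptions for illustration.

import numpy as np
import pandas as pd
import dask.dataframe as dd

data = np.vstack([np.arange(2000),
                  np.random.random(2000),
                  np.round(np.linspace(0, 10, 2000))]).T
df = dd.from_array(data, columns=['index', 'signal', 'id_group'], chunksize=100)
df = df.set_index('index')

# One row per group, so these are tiny; compute them down to pandas Series.
means = df.groupby('id_group').signal.mean().compute()
stds = df.groupby('id_group').signal.std().compute()

def add_norm(part, means, stds):
    # Map each row's id_group to its group's mean/std and normalize within the partition.
    part = part.copy()
    m = part['id_group'].map(means)
    s = part['id_group'].map(stds)
    part['normalized_signal_by_group'] = (part['signal'] - m) / s
    return part

result = df.map_partitions(add_norm, means, stds).compute()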

Related

Pandas - Operate on a column, filtered by another column in the dataset

I have a dataframe with several columns with dates - formatted as datetime.
I am trying to get the min/max value of a date, based on another date column being NaN
For now, I am doing this in two separate steps:
temp_df = df[(df['date1'] == np.nan)]
max_date = max(temp_df['date2'])
temp_df = None
I get the result I want, but I am using an unnecessary temporary dataframe.
How can I do this without it?
Is there any reference material to read on this?
Thanks
Here is an MCVE you can play with to obtain statistics from other columns where the value in one column isnull() (NaN or NaT). The actual query is a one-liner.
import pandas as pd
import numpy as np
print(pd.__version__)
# sample date columns
daterange1 = pd.date_range('2017-01-01', '2018-01-01', freq='MS')
daterange2 = pd.date_range('2017-04-01', '2017-07-01', freq='MS')
daterange3 = pd.date_range('2017-06-01', '2018-02-01', freq='MS')
df1 = pd.DataFrame(data={'date1': daterange1})
df2 = pd.DataFrame(data={'date2': daterange2})
df3 = pd.DataFrame(data={'date3': daterange3})
# jam them together, making NaT's in non-overlapping ranges
df = pd.concat([df1, df2, df3], axis=0, sort=False)
df.reset_index(inplace=True)
max_date = df[(df['date1'].isnull())]['date2'].max()
print(max_date)

Efficiently reconstruct DataFrame using oversampled index

I have two DataFrames: df1 and df2
both df1 and df2 are derived from the same original data set, which has a DatetimeIndex.
df2 still has a DatetimeIndex.
Whereas df1 has been oversampled and now has an int index, with the prior DatetimeIndex as a 'Date' column within it.
I need to reconstruct df2 so that it aligns with df1, i.e. I'll need to repeat the rows that were oversampled in df1, order them the same way, and set them onto the same int index that df1 has.
Currently, I'm using the two functions below, but they are painfully slow. Is there any way to speed this up? I haven't been able to find a built-in function that does this. Is there one?
def align_data(idx_col, data):
    new_data = pd.DataFrame(index=idx_col.index, columns=data.columns)
    for label, group in idx_col.groupby(idx_col):
        if len(group.index) > 1:
            slice = expanded(data.loc[label], len(group.index)).values
        else:
            slice = data.loc[label]
        new_data.loc[group.index] = slice
    return new_data

def expanded(row, l):
    return pd.DataFrame(data=[row for i in np.arange(l)], index=np.arange(l), columns=row.index)
A test can be generated using the code below:
import pandas as pd
import numpy as np
import datetime as dt
dt_idx = pd.date_range(start='1990-01-01', end='2018-07-02', freq='B')
df1 = pd.DataFrame(data=np.zeros((len(dt_idx),20)),index=dt_idx)
df1.index.name = 'Date'
df2 = df1.copy()
df1 = pd.concat([df1, df1.sample(len(dt_idx) // 2)], axis=0)
df1.reset_index(drop=False,inplace=True)
t = dt.datetime.now()
df2_aligned = align_data(df1['Date'],df2)
print(dt.datetime.now()-t)
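For what it's worth, a vectorized alternative (a sketch, not a verified answer, assuming every value in df1['Date'] is present in df2's index, as in the test above) is to use df1's duplicated 'Date' column as a label array into df2. A single .loc call repeats the oversampled rows and puts them in df1's order, and dropping the datetime index leaves the same integer index as df1:

# Select df2's rows in the order (and multiplicity) given by df1['Date'],
# then drop the datetime index so the result shares df1's integer index.
df2_aligned_fast = df2.loc[df1['Date'].values].reset_index(drop=True)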

Subtract one dataframe from another excluding the first column Pandas

I have two dataframes with the same columns. My task is to subtract df_tot from df_nap without touching the first column ('A').
What is the easiest solution for it?
Thank you!
import numpy as np
import pandas as pd
df_tot = pd.DataFrame(np.random.randint(10, size=(3,4)), columns=list('ABCD'))
df_nap = pd.DataFrame(np.random.randint(10, size=(3,4)), columns=list('ABCD'))
Simply subtract the entire DataFrames, then reassign the desired values to the identifier column (called 'Wavelength' here; column 'A' in your example).
result = df_tot - df_nap
result['Wavelength'] = df_tot['Wavelength']
For example,
import numpy as np
import pandas as pd
df_tot = pd.DataFrame(np.random.randint(10, size=(3,4)), columns=list('ABCD'))
df_nap = pd.DataFrame(np.random.randint(10, size=(3,4)), columns=list('ABCD'))
# df_tot['A'] = df_nap['A'] # using column A as the "Wavelength" column
result = df_tot - df_nap
result['A'] = df_tot['A']
Alternatively, or if the Wavelength column were not numeric, you could subtract everything except Wavelength, then reassign that column:
result = df_tot.drop('Wavelength', axis=1) - df_nap.drop('Wavelength', axis=1)
result['Wavelength'] = df_tot['Wavelength']
Set the common index for both dataframes before using pd.DataFrame.sub:
df_tot = df_tot.set_index('Wavelength')
df_nap = df_nap.set_index('Wavelength')
res = df_tot.sub(df_nap)
If you require 'Wavelength' as a series rather than an index, you can call reset_index on the result:
res = res.reset_index()
However, there are benefits to storing a unique row identifier as an index rather than as a series, for example more efficient lookups and merges.
You can also use join and iloc:
df_tot.iloc[:,:1].join(df_tot.iloc[:,1:]-df_nap.iloc[:,1:])
but this assumes the columns are in the same order, with 'Wavelength' as the first one.
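Another compact variant along the same lines (a sketch, assuming the rows are already positionally aligned as in the example) is to subtract only the non-'A' columns and copy 'A' across:

cols = df_tot.columns.drop('A')          # every column except the identifier
result = df_tot.copy()
result[cols] = df_tot[cols] - df_nap[cols]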

Pandas dataframe resample without aggregation

I have a dataframe defined as follows:
import datetime
import pandas as pd
import random
import numpy as np
todays_date = datetime.datetime.today().date()
index = pd.date_range(todays_date - datetime.timedelta(10), periods=10, freq='D')
index = index.append(index)
idname = ['A']*10 + ['B']*10
values = random.sample(range(100), 20)
data = np.vstack((idname, values)).T
tmp_df = pd.DataFrame(data, columns=['id', 'value'])
tmp_index = pd.DataFrame(index, columns=['date'])
tmp_df = pd.concat([tmp_index, tmp_df], axis=1)
tmp_df = tmp_df.set_index('date')
Note that there are 2 values for each date. I would like to resample the dataframe tmp_df on a weekly basis but keep the two separate values. I tried tmp_df.resample('W-FRI') but it doesn't seem to work.
The solution you're looking for is groupby, which lets you perform operations on dataframe slices (here 'A' and 'B') independently:
tmp_df.groupby('id').resample('W-FRI')
Note: your code produces an error (No numeric types to aggregate) because the 'value' column ends up as strings and is never converted to a numeric type. You need to convert it first:
tmp_df['value'] = pd.to_numeric(tmp_df['value'])
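Put together with the question's frame, a minimal sketch (the choice of .mean() as the weekly statistic is an assumption; pick whatever aggregation you need per id and week):

tmp_df['value'] = pd.to_numeric(tmp_df['value'])
weekly = tmp_df.groupby('id').resample('W-FRI').mean()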

Is it possible to add new columns to DataFrame in Pandas (python)?

Consider the following code:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df
Here we create an empty DataFrame using pandas and then fill it. However, is it possible to add columns dynamically in a similar manner, i.e., starting from columns = ['A', 'B', 'C'], it should be possible to add columns D, E, F, etc. up to a specified number?
I think the
pandas.DataFrame.append
method is what you are after.
e.g.
output_frame = input_frame.append(appended_frame)
There are additional examples in the pandas "Merge, join and concatenate" documentation.
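If the goal is literally to grow the set of columns D, E, F, ... at runtime, plain column assignment also works; here is a minimal sketch (the add_columns helper and the zero fill are assumptions for illustration, not part of the original answer):

import string

def add_columns(frame, total):
    # Add zero-filled columns ('D', 'E', ...) until the frame has `total` columns.
    for name in string.ascii_uppercase[len(frame.columns):total]:
        frame[name] = 0
    return frame

df = add_columns(df, 6)   # df now has columns A..F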
