Pandas: Dataframe itertuples boolean series groupby optimization - python

I'm new to Python.
I have a data frame (DF), for example:
id  type
1   A
1   B
2   C
2   B
I would like to add a column, for example A_flag, grouped by id.
In the end I want the data frame (DF) to look like this:
id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it's very slow for a big data frame.
Is there any way to optimize this case?
Thanks for the help.

Change your slow iterative code to fast vectorized code by replacing the first step with a boolean series generated by Pandas built-in functions, e.g.
df['type'].eq('A')
Then, you can attach it to the groupby statement for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
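Putting the two lines together, a minimal runnable sketch on the sample data from the question (assuming only that pandas is installed):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2], 'type': ['A', 'B', 'C', 'B']})

# vectorized flag: boolean series, grouped max per id, cast to 0/1
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
print(df)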

Related

Recover a standard, single-index data frame after using pandas groupby+apply

I want to apply a custom reduction function to each group in a Python dataframe. The function reduces the group to a single row by performing operations that combine several of the columns of the group.
I've implemented this like so:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "afac": np.random.random(size=1000),
    "bfac": np.random.random(size=1000),
    "class": np.random.randint(low=0, high=5, size=1000)
})

def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    return pd.DataFrame(data={'per_apop': [np.sum(per_area * per_pop)]})

aggdf = df.groupby('class').apply(f)
My input data frame df looks like:
>>> df
afac bfac class
0 0.689969 0.992403 0
1 0.688756 0.728763 1
2 0.086045 0.499061 1
3 0.078453 0.198435 2
4 0.621589 0.812233 4
But my code gives this multi-indexed data frame:
>>> aggdf
         per_apop
class
0     0  0.553292
1     0  0.503112
2     0  0.444281
3     0  0.517646
4     0  0.503290
I've tried various ways of getting back to a "normal" data frame, but none seem to work.
>>> aggdf.reset_index()
   class  level_1  per_apop
0      0        0  0.553292
1      1        0  0.503112
2      2        0  0.444281
3      3        0  0.517646
4      4        0  0.503290
>>> aggdf.unstack().reset_index()
   class  per_apop
                 0
0      0  0.553292
1      1  0.503112
2      2  0.444281
3      3  0.517646
4      4  0.503290
How can I perform this operation and get a normal data frame afterwards?
Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows. Perhaps using
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})
Using reset_index, you can select which levels to reset, as well as whether you want to retain the index as a column. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. reset_index allows you to reset the entire index (the default) or just the levels you want. In the following example, the last level (-1) is pulled out of the index. By also using drop=True, it is dropped rather than appended as a column in the data frame.
aggdf.reset_index(level=-1, drop=True)
per_apop
class
0 0.476184
1 0.476254
2 0.509735
3 0.502444
4 0.525287
EDIT:
To push the class level of the index back into the data frame, you can simply call .reset_index() again. Ugly, but it works.
aggdf.reset_index(level=-1, drop=True).reset_index()
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Alternatively, you could reset the index and then just drop the extra column.
aggdf.reset_index().drop('level_1', axis=1)
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Make your self-defined function return a Series:
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    return pd.Series(data={'per_apop': np.sum(per_area * per_pop)})
df.groupby('class').apply(f).reset_index()
class per_apop
0 0 0.508332
1 1 0.505593
2 2 0.488117
3 3 0.481572
4 4 0.500401
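If, as in the update, f should return multiple columns, the same idea extends: return a Series with one entry per desired column. A hedged sketch (sue is just the placeholder name from the update):
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    # each entry of the Series becomes a column in the result
    return pd.Series({'per_apop': np.sum(per_area * per_pop), 'sue': 1})

df.groupby('class').apply(f).reset_index()  # columns: class, per_apop, sue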
Although you have a good answer, a suggestion:
test func for df.groupby(...).apply( func ) on the first group, like this:
agroupby = df.groupby(...)
for key, groupdf in agroupby:  # an iterator -> (key, groupdf) ... pairs
    break  # get the first pair
print("\n-- first groupdf: len %d type %s \n%s" % (
    len(groupdf), type(groupdf), groupdf))  # DataFrame
test = myfunc(groupdf)
# groupdf .col  [col]  [[col ...]]  .set_index  .resample ... as usual
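Applied to the data above, a minimal sketch of that tip (using next(iter(...)) to grab the first group is just a convenience; f is the custom function from the question):
agroupby = df.groupby('class')
key, groupdf = next(iter(agroupby))  # first (key, group) pair
print(len(groupdf), type(groupdf))   # sanity-check what f will receive
test = f(groupdf)                    # run the custom function on one group first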

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but without collapsing track/type combos in the resulting df). Same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that performs some function. transform will take either a function or a string that it maps to a known function. nunique is definitely one of those strings.
As pointed out by #root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred to passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
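To get the new column in place, as the question asks, assign the transformed Series back (the same snippet as above, just written as an assignment):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')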

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)

A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result, such that wherever column X contains a, Result will be True.
You can map the groupby result back to the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
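For clarity, this works because A.groupby('X').apply(myfunc) returns a boolean Series indexed by the values of X, which map then looks up row by row. A small sketch of the intermediate object (values as implied by the result above):
lookup = A.groupby('X').apply(myfunc)
# X
# a     True
# b    False
# c     True
# dtype: bool
A['Result'] = A['X'].map(lookup)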
My solution, which uses a loop, may not be the best one, but I think it's pretty good.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp. Then add the result column (c in my example) to each sub-dataframe. Finally, concat all the sub-dataframes into one.
Here is an example:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})
gp = df.groupby('a')    # group
s = gp.apply(sum)['a']  # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
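For the simple sum example above, a loop-free alternative (not the answerer's method, just a hedged aside) is to use transform, which also keeps the original row order:
df['c'] = df.groupby('a')['a'].transform('sum')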

Map each value in pandas series/dataframe to n>1 dimensions

I have a pandas series, and a function that takes a value in the series and returns a dataframe. Is there a way to apply the function to the series and collate the results in a natural way?
What I am really trying to do is to use pandas series/multiindex to keep track of the results in each step of my data analysis pipeline, where the multiindex holds the parameters used to get the values. For example, the series (s below) is the result of step 0 in my data analysis pipeline. In step 1, I want to try x more dimensions (2 below, thus the dataframe) and collate the results into another series.
Can we do better than the approach below, where the stack() calls seem a bit excessive? Would the xarray library be a good fit for my use case?
In [112]: s
Out[112]:
a 0
b 1
c 2
dtype: int64
In [113]: d = s.apply(lambda x: pd.DataFrame([[x,x*2],[x*3,x*4]]).stack()).stack().stack()
In [114]: d
Out[114]:
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 3
1 0 2
1 4
c 0 0 2
1 6
1 0 4
1 8
dtype: int64
This should give you a Dataset of 2-D arrays, and align them for you. You may want to set the dimensions beforehand if you want them to be named a certain way or to be a certain size.
xr.Dataset({k: func(v) for k, v in series.items()})
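A minimal runnable sketch of that idea, assuming xarray is importable as xr and using a stand-in func that mirrors the 2x2 frames from the question:
import pandas as pd
import xarray as xr

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

def func(x):
    # stand-in for the per-value step-1 computation; returns a 2-D frame
    return pd.DataFrame([[x, x * 2], [x * 3, x * 4]])

# one 2-D data variable per key of the series; wrapping each frame in a
# DataArray lets xarray align the shared dimensions across variables
ds = xr.Dataset({k: xr.DataArray(func(v)) for k, v in s.items()})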

What's the best way to transform Array values in one column to columns of the original DataFrame?

I have a table where one of the columns is an array of binary features; a feature appears in the array when it is present.
I'd like to train a logistic model on these rows, but can't get the data into the required format, where each feature value is its own column with a 1 or 0 value.
Example:
id feature values
1 ['HasPaws', 'DoesBark', 'CanFetch']
2 ['HasPaws', 'CanClimb', 'DoesMeow']
I'd like to get it to the format of
id HasPaws DoesBark CanFetch CanClimb DoesMeow
1 1 1 1 0 0
2 1 0 0 1 0
It seems like there would be some functionality built in to accomplish this, but I can't think of what this transformation is called to do a better search on my own.
You can first convert the lists to columns and then use the get_dummies() method:
In [12]: df
Out[12]:
id feature_values
0 1 [HasPaws, DoesBark, CanFetch]
1 2 [HasPaws, CanClimb, DoesMeow]
In [13]: (pd.get_dummies(df.set_index('id').feature_values.apply(pd.Series),
...: prefix='', prefix_sep='')
...: .reset_index()
...: )
Out[13]:
id HasPaws CanClimb DoesBark CanFetch DoesMeow
0 1 1 0 1 1 0
1 2 1 1 0 0 1
Another option is to go through the feature values column and construct a Series from each cell, with the values in the list as the index. This way, pandas will expand the Series into a data frame with the index values as headers:
pd.concat([df['id'],
           (df['feature values'].apply(lambda lst: pd.Series([1] * len(lst), index=lst))
            .fillna(0))], axis=1)
method 1
pd.concat([df['id'], df['feature values'].apply(pd.value_counts)], axis=1).fillna(0)
method 2
df.set_index('id').squeeze().apply(pd.value_counts).reset_index().fillna(0)
method 3
pd.concat([pd.Series(1, f, name=i) for _, (i, f) in df.iterrows()],
          axis=1).T.fillna(0).rename_axis('id').reset_index()
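For reference, a small self-contained sketch that builds the example frame so the snippets above can be tried directly (note the column is spelled feature_values in the first answer and feature values in the others; adjust the name to match):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'feature values': [['HasPaws', 'DoesBark', 'CanFetch'],
                       ['HasPaws', 'CanClimb', 'DoesMeow']],
})

# e.g. the Series-per-row approach from above
out = pd.concat([df['id'],
                 (df['feature values'].apply(lambda lst: pd.Series([1] * len(lst), index=lst))
                  .fillna(0))], axis=1)
print(out)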
