applying several functions in transform in pandas - python

After a groupby, when using agg, if a dict of columns:functions is passed, each function is applied to its corresponding column. However, this syntax doesn't work with transform. Is there another way to apply several functions in transform?
Let's give an example:
import numpy as np
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()
def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns with agg. But if we want to transform the columns without aggregating them, agg can no longer be used, and passing a dict to transform fails:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500

You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
.reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if else to check column names:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
.reset_index()

As of pandas 0.20.2, I don't think transform accepts a dict mapping column names to functions the way agg does.
If the functions return a Series of the same length as the input:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the functions aggregate to a different length, you need a join:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
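The join can also be avoided, because transform accepts reducing functions and broadcasts each group's result back onto the original rows. The following is a minimal sketch on the question's data, using the built-in "mean" and "std" names in place of my_fct1 and my_fct2:

```python
import pandas as pd

df_test = pd.DataFrame(
    [[1, 2, 3], [1, 20, 30], [2, 30, 50], [1, 2, 33], [2, 4, 50]],
    columns=["a", "b", "c"],
)

grouped = df_test.groupby("a")

# Each transform returns a Series aligned with df_test's index,
# so the per-group statistic is repeated on every row of the group.
df2 = df_test[["a"]].assign(
    b=grouped["b"].transform("mean"),
    c=grouped["c"].transform("std"),
)
print(df2)
```

The result has one row per original row, with the same values as the joined frame above.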

With more recent versions of pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
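If the column-to-function mapping is already in a dict, as in the question, the same idea can be wrapped in a small helper. This is only a sketch: transform_by_column is a hypothetical name, not a pandas function, and it assumes every key in the dict is a column of the frame.

```python
import pandas as pd

df_test = pd.DataFrame(
    [[1, 2, 3], [1, 20, 30], [2, 30, 50], [1, 2, 33], [2, 4, 50]],
    columns=["a", "b", "c"],
)


def transform_by_column(df, by, funcs):
    """Hypothetical helper: apply a different groupby transform per column.

    funcs maps a column name to anything transform accepts
    (a function name such as "cumsum", or a callable).
    """
    grouped = df.groupby(by)
    new_cols = {col: grouped[col].transform(func) for col, func in funcs.items()}
    # assign replaces the listed columns and leaves the others untouched
    return df.assign(**new_cols)


result = transform_by_column(df_test, "a", {"b": "cumsum", "c": "cumprod"})
print(result)
```

Because each column is transformed on its own, only the requested function is computed for it, unlike the lambda hack that evaluates both cumsum and cumprod for every column.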

Related

Counting repeating words with numpy and pandas Python

I want to write code that outputs how many times each distinct value is repeated in a, and then build a pandas DataFrame to print the result. The sums code below does not work. How can I make it work and get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
Define a dataframe df based on the array a. Then, use .groupby() + .size() to get the size/count of unique values, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you also want the percentages of the counts, you can use:
(df.groupby('Value', as_index=False)
.agg(**{'Repetition Count': ('Value', 'size'),
'Percent': ('Value', lambda x: round(x.size/len(a) *100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
Or use .value_counts with normalize=True:
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
The easiest way is to make a pandas DataFrame from the np.array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
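Since the question starts from np.unique, it may help to see that NumPy can produce the counts directly. A minimal sketch, assuming the same array a; the column names are taken from the expected output in the question:

```python
import numpy as np
import pandas as pd

a = np.array([12, 12, 12, 3, 43, 43, 43, 22, 1, 3, 3, 43])

# return_counts=True makes np.unique return the sorted unique values
# together with how often each one occurs.
values, counts = np.unique(a, return_counts=True)

result = pd.DataFrame({"Value": values, "Repetition Count": counts})
print(result.to_string(index=False))
```

The values come back sorted, so no extra sort_index step is needed.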

return first column number that fulfills a condition in pandas

I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
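One caveat with idxmax on a boolean mask: when no column in a row satisfies the condition, every entry is False and idxmax still returns the first column label. The sketch below is one possible way to guard against that, masking such rows with where; the threshold of 40 is chosen here only to produce rows with no match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5).T).cumsum(axis=1)

# Threshold from the question: every row has a match.
first_gt_20 = df.gt(20).idxmax(axis=1)

# Higher threshold: rows 0-2 never exceed 40, so they should not get a label.
mask = df.gt(40)
first_gt_40 = mask.idxmax(axis=1).where(mask.any(axis=1))

print(first_gt_20.tolist())
print(first_gt_40)
```

Rows without a match come out as NaN, which also turns the result into a float Series.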

Is it possible to shuffle a dataframe while grouping by index in pandas or sklearn? [duplicate]

My dataframe looks like this
sampleID col1 col2
1 1 63
1 2 23
1 3 73
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
I need to shuffle the dataframe while keeping rows of the same sample together, and the order of col1 within each sample must stay the same as in the dataframe above.
So I need it like this
sampleID col1 col2
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
1 1 63
1 2 23
1 3 73
How can I do this? If my example is not clear please let me know.
Assuming you want to shuffle by sampleID: first split with df.groupby, then shuffle the groups (import random first), and finally call pd.concat:
import random
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)
pd.concat(groups).reset_index(drop=True)
sampleID col1 col2
0 2 1 20
1 2 2 94
2 2 3 99
3 1 1 63
4 1 2 23
5 1 3 73
6 3 1 73
7 3 2 56
8 3 3 34
Resetting the index with reset_index(drop=True) is an optional step.
I found this to be significantly faster than the accepted answer:
ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()
For some reason pd.concat was the bottleneck in my use case. Regardless, this way you avoid the concatenation.
Just to add one thing to @cs95's answer.
Suppose you want to shuffle by sampleID but still have the sampleIDs numbered in order from 1, so the original sampleID values don't need to be kept.
Here is a solution where you just iterate over the grouped dataframes and change the sampleID (the ID column is named doc_id in this example).
groups = [group for _, group in df.groupby('doc_id')]
random.shuffle(groups)
for i, group in enumerate(groups):
    # renumber the groups 1, 2, 3, ... in their shuffled order
    groups[i] = group.assign(doc_id=i + 1)
shuffled = pd.concat(groups).reset_index(drop=True)
doc_id sent_id word_id
0 1 1 20
1 1 2 94
2 1 3 99
3 2 1 63
4 2 2 23
5 2 3 73
6 3 1 73
7 3 2 56
8 3 3 34
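For a reproducible shuffle that needs neither a Python-level list of groups nor pd.concat, one option is to shuffle the unique IDs with a seed and then stable-sort the rows by the shuffled position. This is only a sketch: the helper column _order and the seed value are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})

# Shuffle the unique IDs; random_state makes the order repeatable.
shuffled_ids = df["sampleID"].drop_duplicates().sample(frac=1, random_state=0)

# Map each ID to its position in the shuffled order.
rank = pd.Series(range(len(shuffled_ids)), index=shuffled_ids.to_numpy())

# A stable sort keeps the original row order inside each sample.
shuffled = (
    df.assign(_order=df["sampleID"].map(rank))
      .sort_values("_order", kind="mergesort")
      .drop(columns="_order")
      .reset_index(drop=True)
)
print(shuffled)
```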

Speeding up duplicating of rows in Pandas groupby?

I have a very large data frame (hundreds of millions of rows). There are two group IDs, group_id_1 and group_id_2. The data frame looks like this:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 4 95 1
1 4 55 2
2 2 44 1
2 4 88 1
2 4 90 2
For each group_id_1 x group_id_2 combo, I need to duplicate the row with the latest time, and increment the time by one. In other words, my table should look like:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 2 49 3
1 4 95 1
1 4 55 2
1 4 55 3
2 2 44 1
2 2 44 2
2 4 88 1
2 4 90 2
2 4 90 3
Right now, I am doing:
for name, group in df.groupby(['group_id_1', 'group_id_2']):
    last, = group.sort_values(by='time').tail(1)['time'].values
    temp = group[group['time']==last]
    temp.loc[:, 'time'] = last + 1
    group = group.append(temp)
This is insanely inefficient. If I put the above code into a function, and use the .apply() method with the groupby object, it also takes an enormous amount of time.
How do I speed this process up?
You can use groupby with the last aggregation, increment time with add, and concat the result to the original:
df1 = df.sort_values(by='time').groupby(['group_id_1', 'group_id_2']).last().reset_index()
df1.time = df1.time.add(1)
print (df1)
group_id_1 group_id_2 value1 time
0 1 2 49 3
1 1 4 55 3
2 2 2 44 2
3 2 4 90 3
df = pd.concat([df,df1])
df = df.sort_values(['group_id_1','group_id_2']).reset_index(drop=True)
print (df)
group_id_1 group_id_2 value1 time
0 1 2 45 1
1 1 2 49 2
2 1 2 49 3
3 1 4 95 1
4 1 4 55 2
5 1 4 55 3
6 2 2 44 1
7 2 2 44 2
8 2 4 88 1
9 2 4 90 2
10 2 4 90 3
First, sort the dataframe by time (this should be more efficient than sorting each group by time):
df = df.sort_values('time')
Second, get the last row in each group (without sorting the groups, to improve performance), and reset the index so the group IDs become columns again:
last = df.groupby(['group_id_1', 'group_id_2'], sort=False).last().reset_index()
Third, increment the time:
last['time'] = last['time'] + 1
Fourth, concatenate:
df = pd.concat([df, last])
Fifth, sort back to the original order:
df = df.sort_values(['group_id_1', 'group_id_2'])
Explanation: concatenating and then sorting will be much faster than inserting rows one by one.
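The same result can be reached without sorting the whole frame first, by picking the row with the largest time in each group via idxmax. The sketch below runs on the question's sample data; it assumes the index is unique and that time has no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "group_id_1": [1, 1, 1, 1, 2, 2, 2],
    "group_id_2": [2, 2, 4, 4, 2, 4, 4],
    "value1": [45, 49, 95, 55, 44, 88, 90],
    "time": [1, 2, 1, 2, 1, 1, 2],
})
keys = ["group_id_1", "group_id_2"]

# idxmax gives, per group, the index label of the row with the latest time.
latest = df.loc[df.groupby(keys)["time"].idxmax()].copy()
latest["time"] += 1

# Append the duplicated rows once, then restore group/time order.
result = (
    pd.concat([df, latest])
      .sort_values(keys + ["time"])
      .reset_index(drop=True)
)
print(result)
```

Sorting on the keys plus time makes the final order explicit rather than relying on the order in which rows were concatenated.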

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the df, you should use transform. It produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
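The a_x / a_y suffixes appear because both frames have a column named a. If the merge route is preferred, renaming the aggregated column before merging gives the layout asked for in the question. A minimal sketch, with transform shown alongside for comparison; the name nc comes from the question's desired output:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 13, 15],
    "b": [4, 5, 6, 6, 6],
    "c": ["wish", "you", "were", "here", "here"],
})

# Aggregate, name the result "nc", and turn the group keys back into columns.
counts = df.groupby(["b", "c"])["a"].nunique().rename("nc").reset_index()

# A left merge keeps the rows of df in their original order.
merged = df.merge(counts, on=["b", "c"], how="left")

# The transform route yields the same values without a merge.
via_transform = df.groupby(["b", "c"])["a"].transform("nunique")

print(merged)
```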
