Applying operations on groups without aggregating - python

I want to apply an operation to multiple groups of a data frame and then fill all values of each group with the result. Let's take mean and np.cumsum as an example, with the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
   a  b
0  1  1
1  3  1
2  2  2
3  4  2
Now I want to group the dataframe by b, then take the mean of a in each group, then apply np.cumsum to the means, and then replace all values of a by the (group dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
   a
b
1  2
2  5
But what I want to get is
   a  b
0  2  1
1  2  1
2  5  2
3  5  2
Any ideas how this can be solved in a nice way?

You can map the cumulative means back onto the original column with Series.map:
df1 = df.groupby("b").mean().cumsum()
print (df1)
   a
b
1  2
2  5
df['a'] = df['b'].map(df1['a'])
print (df)
   a  b
0  2  1
1  2  1
2  5  2
3  5  2
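For reference, the whole thing can also be folded into a single expression; this is just the two steps above combined, nothing new assumed:
df['a'] = df['b'].map(df.groupby('b')['a'].mean().cumsum())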

Related

Sum cells with duplicate column headers in pandas during import - python

I am trying to do some basic dimensional reduction. I have a CSV file that looks something like this:
A  B  C  A  B  B  A  C
1  1  2  2  1  3  1  1
1  2  3  0  0  1  1  2
0  2  1  3  0  1  2  2
I want to import it as a pandas DataFrame, but without the headers being renamed to A.1, A.2, etc. Instead I want to sum the duplicates and keep the column names. Ideally my new DataFrame should look like this:
A  B  C
4  5  3
2  3  5
5  3  3
Is it possible to do this easily or would you recommend a different way? I can also use bash, R, or anything that can do the trick with a file that is 1 million lines and 1000 columns.
Thank you!
Try splitting the column names on '.' and grouping by the first part:
df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Output:
   A  B  C
0  4  5  3
1  2  3  5
2  5  3  3
Just load the dataframe normally, group by the first letter of each column name, and sum the values:
df.groupby(lambda colname: colname[0], axis=1).sum()
which gives
   A  B  C
0  4  5  3
1  2  3  5
2  5  3  3
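For completeness, a small end-to-end sketch, assuming the file is read with pandas' default duplicate-header mangling (A, A.1, A.2, ...). Note that groupby(..., axis=1) has been deprecated in newer pandas releases, so this version transposes instead and gives the same result:
import io
import pandas as pd

# Inline stand-in for the CSV file (illustrative only).
data = io.StringIO(
    "A,B,C,A,B,B,A,C\n"
    "1,1,2,2,1,3,1,1\n"
    "1,2,3,0,0,1,1,2\n"
    "0,2,1,3,0,1,2,2\n"
)
df = pd.read_csv(data)                   # duplicate headers become A.1, B.1, B.2, ...
labels = df.columns.str.split('.').str[0]
result = df.T.groupby(labels).sum().T    # transpose instead of groupby(axis=1)
print(result)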

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2
How can I get a DataFrame like
   a  b  idx
0  1  1    1
1  1  1    1
2  1  2    2
3  2  1    3
4  2  1    3
5  2  2    4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
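A side note, not from the original answer: ngroup numbers the groups in the order the groupby would iterate them, which is sorted key order by default. If you want numbering by order of first appearance instead, pass sort=False to the groupby:
df['idx'] = df.groupby(['a', 'b'], sort=False).ngroup()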
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a', 'b']
df.merge(df.drop_duplicates(group_vars).reset_index(), on=group_vars)
   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5
The identifier in this case goes 0, 2, 3, 5 (just a residue of the original index), but it could easily be changed to 0, 1, 2, 3 by adding a reset_index(drop=True) before the reset_index() call above.
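A sketch of that extra step, for completeness (the keys variable name is just illustrative):
group_vars = ['a', 'b']
keys = df.drop_duplicates(group_vars).reset_index(drop=True).reset_index()
df.merge(keys.rename(columns={'index': 'idx'}), on=group_vars)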
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
   a  b idx
0  1  1  11
1  1  1  11
2  1  2  12
3  2  1  21
4  2  1  21
5  2  2  22
But this is still a rather silly index (think about more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
import numpy as np
import pandas as pd

a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns, then checks whether each row is different from the previous row and, if so, increments by 1. See further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0    1
1    1
2    2
3    3
4    3
5    4
dtype: int64
So, breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  1.0
3  1.0 -1.0
4  0.0  0.0
5  0.0  1.0
A new group only needs a single column to differ, so that is what .ne(0).any(1) checks: not equal to 0 for any of the columns. Then a cumulative sum keeps track of the groups.
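As a usage sketch (assuming a unique index): assigning the result back to the frame relies on index alignment, so the sort inside the chain does not scramble anything:
df['idx'] = df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)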
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
Output of df1:
   a  b
0  a  a
1  a  a
4  a  a
3  b  a
2  b  b
5  c  c
6  c  d
8  c  d
7  d  d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0    1
1    1
4    1
3    2
2    3
5    4
6    5
8    5
7    6

Using groupby on already grouped data in Pandas

I would like to achieve the result below in Python using Pandas.
I tried groupby and sum on the id and Group columns using the below:
df.groupby(['id','Group'])['Total'].sum()
I got the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping)
id  Group  Time
1   a      2
1   a      2
1   a      1
1   b      1
1   b      1
1   c      1
2   e      2
2   a      4
2   e      1
2   a      5
3   c      1
3   e      4
3   a      3
3   e      4
3   a      2
3   h      4
Assuming df is your initial dataframe, please try this:
df_group = df.groupby(['id', 'Group'])[['Time']].sum().rename(columns={'Time': 'Total'})
df_group['All_total'] = df_group.groupby(['id'])['Total'].transform('sum')
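For reference, a minimal sketch of how this plays out against the sample data above (the frame construction is simply a transcription of that table):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'Group': ['a', 'a', 'a', 'b', 'b', 'c', 'e', 'a', 'e', 'a',
              'c', 'e', 'a', 'e', 'a', 'h'],
    'Time':  [2, 2, 1, 1, 1, 1, 2, 4, 1, 5, 1, 4, 3, 4, 2, 4],
})

df_group = df.groupby(['id', 'Group'])[['Time']].sum().rename(columns={'Time': 'Total'})
df_group['All_total'] = df_group.groupby(['id'])['Total'].transform('sum')
# Per-group totals: id 1 -> a=5, b=2, c=1; id 2 -> a=9, e=3; id 3 -> a=5, c=1, e=8, h=4
# All_total per id: 1 -> 8, 2 -> 12, 3 -> 18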

How to filter groupby object in pandas

I have a df like this:
A  B
1  1
1  2
1  3
2  2
2  1
3  2
3  3
3  4
I would like to extract the rows of the groups whose column B is not ascending, like
A  B
2  2
2  1
I tried
df.groupby("A").filter()...
but I got stuck on how to write the filter. If you have any solution, please let me know.
One way is to use pandas.Series.is_monotonic:
df[df.groupby('A')['B'].transform(lambda x:not x.is_monotonic)]
Output:
   A  B
3  2  2
4  2  1
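One caveat worth noting (not in the original answer): Series.is_monotonic was deprecated and later removed in favor of is_monotonic_increasing, so on a recent pandas the equivalent would be:
df[df.groupby('A')['B'].transform(lambda x: not x.is_monotonic_increasing)]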
Use GroupBy.transform with Series.diff, compare with Series.lt to check for at least one negative value with Series.any, and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
   A  B
3  2  2
4  2  1

Df.drop/delete duplicate rows

How can I drop rows that are exact duplicates of another row? Say I have a data frame that looks like this:
A  B  C
1  2  3
3  2  2
1  2  3
Now, my data frame is a lot larger than this, but is there a way to have Python look at every row and, if the values in a row are exactly the same as another row's, drop or delete it? I want to take the whole data frame into account; I don't want to specify the columns to get unique values for.
You can use the DataFrame.drop_duplicates() method:
In [23]: df
Out[23]:
   A  B  C
0  1  2  3
1  3  2  2
2  1  2  3

In [24]: df.drop_duplicates()
Out[24]:
   A  B  C
0  1  2  3
1  3  2  2
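As an aside (not part of the original answer), drop_duplicates takes a couple of optional knobs; a quick sketch:
# keep='last' retains the last occurrence instead of the first;
# ignore_index=True renumbers the result 0..n-1.
df.drop_duplicates(keep='last', ignore_index=True)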
You can get a de-duplicated dataframe with the inverse of .duplicated:
df[~df.duplicated(['A','B','C'])]
Returns:
>>> df[~df.duplicated(['A','B','C'])]
   A  B  C
0  1  2  3
1  3  2  2
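Relatedly (again an aside, not from the answer above), if you want to drop every copy of a duplicated row rather than keep one, keep=False marks them all:
df[~df.duplicated(keep=False)]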
