This is my table:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  2
Now I want to group all rows by columns A and B. Column C should be summed, and for column E, I want to keep the value from the row where C is at its maximum within each group.
I did the first part, grouping by A and B and summing C, with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to specify that column E should take the value from the row where C is max.
The end result should look like this:
   A  B  C  E
0  1  1  6  4
1  3  3  8  2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
     C  E
A B
1 1  6  4
3 3  8  2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
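For reference, here is a minimal, self-contained sketch of the same approach, with imports and the question's data added so it runs end to end:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [1, 1, 3],
                   'C': [5, 1, 8],
                   'E': [4, 1, 2]})

# Sort by C so the last row of each (A, B) group carries the max C,
# then aggregate: sum C, take E from that last (max-C) row
out = (df.sort_values('C')
         .groupby(['A', 'B'], sort=False)
         .agg({'C': 'sum', 'E': 'last'}))
print(out)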
Related
I have a dataset of 100 rows. I want to split it into groups of 4 rows and then perform an operation on each group, i.e., first perform the operation on the first four rows, then on the next four rows, and so on.
Note: Rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.
I will divide the df into chunks of 2 rows (a simple example) and make a list of DataFrames, dfs.
Example
import pandas as pd

df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
grouper for grouping
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
divide into a list
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
result(dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
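To adapt this to the question's chunks of 4 rows, here is a hedged sketch; the 100-row frame and the per-chunk sum are only stand-ins for your own data and operation:

import pandas as pd

# Hypothetical 100-row frame with one numeric column
df = pd.DataFrame({'x': range(100)})

# Same grouper idea as above, but with chunks of 4 rows
grouper = pd.Series(range(len(df))) // 4

# Apply an operation per chunk; a per-chunk sum stands in for your own logic
per_chunk = df.groupby(grouper)['x'].sum()
print(per_chunk.head())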
I have a pandas dataframe to which I want to add a column (col_new) whose values depend on a comparison of values in an existing column (col_exist).
The existing column (dtype=object) contains As and Bs.
The new column should hold a running count, starting at 1.
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist col_new
A 1
A 2
A 3
B 3
A 4
B 4
B 4
A 5
B 5
I am completely new to programming, so thank you in advance for any help.
Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
output:
col_exist col_new
0 A 1
1 A 2
2 A 3
3 B 3
4 A 4
5 B 4
6 B 4
7 A 5
8 B 5
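To see why this works, here is a small self-contained sketch (assuming the column only ever contains 'A' and 'B') showing the intermediate boolean mask:

import pandas as pd

df = pd.DataFrame({'col_exist': list('AAABABBAB')})

mask = df['col_exist'].eq('A')   # True where the row is an 'A'
df['col_new'] = mask.cumsum()    # running count of the 'A's seen so far
print(df)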
I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function scramble meant to do this but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't shuffle it continuously until there are no two sequential rows with the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
   .apply(lambda x: x.sample(frac=1))
   .reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
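If you really do need to re-shuffle until no two consecutive rows share a B value, as the question literally asks, a simple (if inefficient) sketch is a retry loop; shuffle_until_no_consecutive and max_tries are names introduced here, and the loop can run for a long time, or fail entirely if no valid ordering exists:

import pandas as pd

def shuffle_until_no_consecutive(df, col='B', max_tries=1000):
    # Reshuffle the whole frame until no two consecutive rows share `col`
    for _ in range(max_tries):
        shuffled = df.sample(frac=1).reset_index(drop=True)
        # are any two neighbouring rows equal in `col`?
        if not shuffled[col].eq(shuffled[col].shift()).any():
            return shuffled
    raise ValueError('no valid ordering found within max_tries')

df = pd.DataFrame({'A': list('abcde'), 'B': [1, 1, 2, 3, 3]})
print(shuffle_until_no_consecutive(df))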
I have a Dataframe that looks like this:
Symbols Count
A 3
A 1
A 2
A 4
B 1
B 3
B 9
C 2
C 1
C 3
What I want to do using pandas is identify duplicate values in the "Count" column and count how many times each pair of Symbols intersects on those duplicates.
By this I mean: if a Count value appears with two different Symbols, those two Symbols are counted as having one intersection between them, because they share the same Count value.
Something like this:
Symbol Symbol Number of Intersections
A B 2
B A 2
C A 3
.....
I'm sure there is a Pythonic pandas way of doing this, but it's not coming to mind.
Let's use merge to do a self-merge, then query and groupby:
df_selfmerge = df.merge(df, on='Count', how="inner").query('Symbols_x != Symbols_y')
(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .count()
             .reset_index()
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol',
                              'Count': 'Number of Intersections'}))
EDIT: Using size() is safer in case of NaN values, since count() excludes NaNs while size() counts every row:
(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .size()
             .reset_index(name='Number of Intersections')
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol'}))
Output:
Symbol Symbol Number of Intersections
0 A B 2
1 A C 3
2 B A 2
3 B C 2
4 C A 3
5 C B 2
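For completeness, here is a self-contained sketch of the same self-merge approach with the question's data; note that each pair appears twice, once in each direction, matching the desired output:

import pandas as pd

df = pd.DataFrame({'Symbols': list('AAAABBBCCC'),
                   'Count':   [3, 1, 2, 4, 1, 3, 9, 2, 1, 3]})

# Self-merge on Count, then drop rows pairing a symbol with itself
pairs = (df.merge(df, on='Count', how='inner')
           .query('Symbols_x != Symbols_y'))

# Count how many shared Count values each ordered pair of symbols has
out = (pairs.groupby(['Symbols_x', 'Symbols_y'])
            .size()
            .reset_index(name='Number of Intersections')
            .rename(columns={'Symbols_x': 'Symbol', 'Symbols_y': 'Symbol'}))
print(out)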
I have a dataframe that looks like the one below, and I have reordered it depending on the value of column B.
a = df.sort_values(['B', 'A'], ascending=[True, False])
#This is my df
A,B
a,2
b,3
c,4
d,5
d,6
d,7
d,9
Then I'd like to calculate the difference between consecutive elements in column B where column A is the same. But if a value of column A has only a single data point, the result should be zero.
So first I used groupby():
b = a['B'].groupby(a['A'])
Then I got stuck here. I know I can use lambda x: abs(x[i] - x[i+1]) or even the apply() function to finish the calculation, but I still fail to get it done.
Can anyone give me a tip or suggestion?
# What I want to see in the result
A,B
a,0
b,0
c,0
d,0 # 5 minus 5
d,1 # 6 minus 5
d,1 # 7 minus 6
d,2 # 9 minus 7
In both the 1-member and multi-member group cases, taking the diff will produce a NaN for the first value, which we can fillna with 0:
>>> df["B"] = df.groupby("A")["B"].diff().fillna(0)
>>> df
A B
0 a 0
1 b 0
2 c 0
3 d 0
4 d 1
5 d 1
6 d 2
This assumes there aren't NaNs already in B that you want to preserve. We could still make that work if we needed to; see the sketch below.
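As a hedged sketch of that last remark (assuming you want any pre-existing NaN in B to stay NaN rather than become 0), you could fill with 0 only the NaNs that diff() creates for rows whose original value was present:

import pandas as pd

df = pd.DataFrame({'A': list('abcdddd'),
                   'B': [2, 3, 4, 5, 6, 7, 9]})

diffed = df.groupby('A')['B'].diff()
# Keep diffed where it is already a number, keep NaN where B itself was NaN,
# and fill with 0 only where diff() produced NaN for a value that was present
df['B'] = diffed.where(diffed.notna() | df['B'].isna(), 0)
print(df)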
If A is your index, you can also do:
df.groupby(level="A").B.diff().fillna(0)
A
a 0
b 0
c 0
d 0
d 1
d 1
d 2