Pandas groupby data and do calculation - python

I have a dataframe that looks like the one below, and I have reordered it by the values in columns B and A:
a = df.sort_values(['B', 'A'], ascending=[True, False])
#This is my df
A,B
a,2
b,3
c,4
d,5
d,6
d,7
d,9
Then I'd like to calculate the difference between consecutive elements in column B within each group of identical values in column A. If a value in column A appears only once, the result should be zero.
So firstly I used groupby() to do so.
b = a['B'].groupby(a['A'])
Then I got stuck here. I know I could use something like lambda x: abs(x[i] - x[i+1]), or the apply() function, to finish the calculation, but I still can't get it done.
Can anyone give me a tip or suggestion?
# What I want to see in the result
A,B
a,0
b,0
c,0
d,0 # 5 minus 5
d,1 # 6 minus 5
d,1 # 7 minus 6
d,2 # 9 minus 7

In both the one-member and multi-member group cases, taking the diff will produce a NaN for the first value of each group, which we can fillna with 0:
>>> df["B"] = df.groupby("A")["B"].diff().fillna(0)
>>> df
A B
0 a 0
1 b 0
2 c 0
3 d 0
4 d 1
5 d 1
6 d 2
This assumes there aren't NaNs already in B that you want to preserve; we could still make that work if we needed to.
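A minimal sketch of one way to preserve them, assuming we just want the original NaNs back in place after the fill (the NaN injected into B here is hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdddd"),
                   "B": [2, 3, 4, 5, np.nan, 7, 9]})  # hypothetical NaN in B

orig_na = df["B"].isna()                         # remember pre-existing NaNs
df["B"] = df.groupby("A")["B"].diff().fillna(0)  # diff + fill as above
df.loc[orig_na, "B"] = np.nan                    # restore the original NaNs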

You can also do it this way, after setting column A as the index:
df.set_index("A").groupby(level="A").B.diff().fillna(0)
A
a 0
b 0
c 0
d 0
d 1
d 1
d 2

Related

split pandas data frame into multiples of 4 rows

I have a dataset of 100 rows. I want to split it into groups of 4 rows and then perform operations on each group, i.e., first perform the operation on the first four rows, then on the next four rows, and so on.
Note: the rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.
I will divide the df into chunks of 2 rows (a simple example) and make a list of dfs.
Example
import pandas as pd

df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Create a grouper by integer-dividing the row position by 2:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Divide into a list of sub-DataFrames:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
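For the original 100-row, chunks-of-4 case, the same idea looks like this (a sketch; the per-chunk sum is just a stand-in for whatever operation you actually need):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(100)})   # stand-in for your 100 rows

# Label each block of 4 consecutive rows with the same integer.
g = df.groupby(np.arange(len(df)) // 4)

# Perform an operation per chunk, e.g. a per-chunk sum...
per_chunk = g.sum()

# ...or collect the chunks into a list and loop over them.
dfs = [chunk for _, chunk in g]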

Shuffle rows of a DataFrame until all consecutive values in a column are different?

I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function scramble meant to do this, but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't get it to shuffle continuously until no two sequential rows have the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
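To see why this works, here is the intermediate cumcount (each row's occurrence number within its B group); sorting by it interleaves the groups, which is what pushes duplicates apart:
df.groupby('B').cumcount()
0    0
1    1
2    0
3    0
4    1
dtype: int64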
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
import numpy as np

np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
   .apply(lambda x: x.sample(frac=1))
   .reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
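If you really do need to keep reshuffling until no two consecutive B values match, here is a minimal brute-force sketch (the function name and the max_tries cap are my own; without the cap it could loop forever when no valid arrangement exists):
import pandas as pd

def shuffle_until_no_adjacent(df, col='B', max_tries=1000):
    # Reshuffle the whole frame until no two consecutive rows share `col`.
    for _ in range(max_tries):
        shuffled = df.sample(frac=1).reset_index(drop=True)
        if not shuffled[col].eq(shuffled[col].shift()).any():
            return shuffled
    raise ValueError('no valid arrangement found within max_tries')

df = shuffle_until_no_adjacent(df)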

Set value of pandas data frame on conditional

I can't find a similar question for this query. I have a pandas dataframe where I want to build a condition from two of the columns and, if it is true, replace the values in one of them.
For example, one of my columns is 'itemname' and the other is 'value'. An 'itemname' may be repeated many times. For each 'itemname', I want to check whether all rows with that name have value 0, and if so, replace those 'value' entries with 100.
I know this should be simple; however, I cannot get my head around it.
Just to make it clearer, here is the data frame:
itemname value
0 a 0
1 b 100
2 c 0
3 a 0
3 b 75
3 c 90
I would like my statement to change this data frame to
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Hope that makes sense. I checked whether someone else had asked something similar and couldn't find anything for this case.
Using transform with any:
df.loc[~df.groupby('itemname').value.transform('any'), 'value'] = 100
Using numpy.where:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=np.where(s, 100, df.value))
Using addition and multiplication:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=df.value + (100 * s))
All three produce the correct output; however, the np.where and arithmetic solutions don't modify the DataFrame in place:
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Explanation
~df.groupby('itemname').value.transform('any')
0 True
1 False
2 False
3 True
3 False
3 False
Name: value, dtype: bool
Since 0 is a falsy value, we can use any and negate the result to find groups where all values are equal to 0.
You can use GroupBy + transform to create a mask. Then assign via pd.DataFrame.loc and Boolean indexing:
mask = df.groupby('itemname')['value'].transform(lambda x: x.eq(0).all())
df.loc[mask.astype(bool), 'value'] = 100
print(df)
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
If all your values are positive or 0, you could use transform with sum and check whether the group sum is 0:
m = (df.groupby('itemname').transform('sum') == 0)['value']
df.loc[m, 'value'] = 100
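A quick illustration of why the sign caveat matters: with mixed signs, a group can sum to 0 without every value being 0, so the mask would fire incorrectly (the data here is hypothetical):
import pandas as pd

df_neg = pd.DataFrame({'itemname': ['a', 'a'], 'value': [5, -5]})
m = (df_neg.groupby('itemname').transform('sum') == 0)['value']
print(m)  # True for both rows, even though neither value is 0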

df.groupby() modification HELP needed

This is my table:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 2
Now, I want to group all rows by columns A and B. Column C should be summed, and for column E, I want to use the value from the row where C is at its maximum.
I did the first part of grouping A and B and summing C. I did this with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to tell that column E should take the value where C is max.
The end result should look like this:
A B C E
0 1 1 6 4
1 3 3 8 2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
C E
A B
1 1 6 4
3 3 8 2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
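As an alternative sketch (not from the original answer), you could avoid the pre-sort by using idxmax to look up E at the row holding each group's maximum C:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [1, 1, 3],
                   'C': [5, 1, 8], 'E': [4, 1, 2]})

g = df.groupby(['A', 'B'])
out = g['C'].sum().to_frame()                   # summed C per group
out['E'] = df.loc[g['C'].idxmax(), 'E'].values  # E where C is max
print(out)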

fill in dataframe with two for loops and if condition in python

I have two DataFrames, one looks something like this:
df1:
x y Counts
a b 1
a c 3
b c 2
c d 1
The other one has, as both index and columns, the list of unique values from the first two columns:
df2
a b c d
a
b
c
d
What I would like to do is fill in the second DataFrame with values from the first one, so that the cell at a given index/column intersection holds the Counts value of the corresponding row in the first DataFrame, e.g.:
a b c d
a 0 1 3 0
b 1 0 2 0
c 3 2 0 1
d 0 0 1 0
When I try to use two for loops with a double if-condition, the computer hangs (the real DataFrame contains more than 1000 rows).
This is the piece of code I am trying to implement (and which is apparently too heavy for the computer to handle):
for i in df2.index:
    for j in df2.columns:
        if (i == df1.x.any() and j == df1.y.any()):
            df2.loc[i, j] = df1.Counts
It is important to note that the list of unique values (i.e., the index and columns of the second DataFrame) can be longer than the number of rows in the first DataFrame; in my example they happen to coincide.
If it is of any relevance: the first dataframe basically represents combinations of words from its first and second columns, together with their occurrences in a text. The occurrence counts are essentially edge weights.
So I am trying to create a matrix in order to plot a graph via igraph. I chose to first create a DataFrame and then pass its values, taken as an array, to igraph, since as far as I can tell python-igraph cannot plot a graph from a dataframe, only from a numpy array.
I tried some of the solutions suggested for similar issues, but nothing has worked out so far.
Any suggestions to improve my question are warmly welcomed (it's my first question here).
You can do something like this (here df1 is the first DataFrame from the question):
import pandas as pd

df3 = df1.pivot(index='x', columns='y', values='Counts')
print(df3)

new = sorted(set(df3.columns.tolist() + df3.index.tolist()))
df3 = df3.reindex(new, columns=new).fillna(0).applymap(int)
print(df3)
output:
y b c d
x
a 1.0 3.0 NaN
b NaN 2.0 NaN
c NaN NaN 1.0
y a b c d
x
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0
stack df2 and fillna with df1:
import numpy as np
import pandas as pd

idx = pd.Index(np.unique(df1[['x', 'y']]))
df2 = pd.DataFrame(index=idx, columns=idx)
df2.stack(dropna=False).fillna(df1.set_index(['x', 'y']).Counts) \
    .unstack().fillna(0).astype(int)
a b c d
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0
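Note that the desired output in the question is symmetric (an undirected adjacency matrix), while both answers above fill only the (x, y) direction. A minimal sketch of one way to symmetrize, assuming counts should simply mirror across the diagonal:
import numpy as np
import pandas as pd

labels = pd.Index(np.unique(df1[['x', 'y']]))
mat = (df1.pivot(index='x', columns='y', values='Counts')
          .reindex(index=labels, columns=labels)
          .fillna(0))
mat = (mat + mat.T).astype(int)   # mirror counts across the diagonal

mat.to_numpy() then gives igraph the plain numpy array it expects.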
