Conditional shift in pandas - python

The following pandas DataFrame is an example that I need to deal with:
Group Amount
1 1 100
2 1 300
3 1 400
4 1 700
5 2 500
6 2 900
Here's the result that I want after calculation:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
I know that df["Difference"] = df["Amount"] - df["Amount"].shift(1) produces the difference between consecutive rows, but what can I do for a problem like this one, where the calculation needs to restart for each group?

groupby on 'Group' and call transform with pd.Series.diff on the 'Amount' column; additionally call fillna and pass the 'Amount' column to fill the NaN in each group's first row:
In [110]:
df['Difference'] = df.groupby('Group')['Amount'].transform(pd.Series.diff).fillna(df['Amount'])
df
Out[110]:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
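
Equivalently, a shorter spelling of the same idea (a sketch; groupby(...).diff() computes the within-group difference directly, leaving NaN on each group's first row for fillna to backfill):
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2],
                   'Amount': [100, 300, 400, 700, 500, 900]})
# diff() restarts inside each group; fillna replaces each group's leading NaN
# with the row's own Amount.
df['Difference'] = df.groupby('Group')['Amount'].diff().fillna(df['Amount'])
print(df)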


Create an incremental serial no for filtered rows in pandas dataframe

Can you please help me change my code so it produces the expected output instead of the current output? I am using the apply() function of the DataFrame. It would be great if this could also be done more efficiently with a vectorized operation.
Current output:
col1 col2 serialno
0 100 0 4
1 100 100 0
2 100 100 0
3 200 100 4
4 200 200 0
5 300 200 4
6 300 300 0
Expected output:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
My current code contains a static value (4). I need to increment it by one based on a condition (col1 != col2). In addition, I need to repeat the serial no for all rows that meet the condition (col1 == col2).
My code:
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
print(df)
start = 4
series = (df['col1']!=df['col2']).apply(lambda x: start if x==True else 0)
df['serialno'] = series
print(df)
You can try this (note that the walrus operator := requires Python 3.8+):
import itertools

start = 4
counter = itertools.count(start)  # incremental counter, starting at 4
# If col1 == col2 the `or` short-circuits and start keeps its value;
# otherwise start is rebound to the next counter value.
df["serialno"] = [start if (x["col1"] == x["col2"] or (start := next(counter))) else start
                  for _, x in df.iterrows()]
The condition works in two ways: if col1 and col2 have the same value, the `or` short-circuits, so start keeps its current value; if they differ, the walrus assignment advances the counter and rebinds start to the next value.
Here is how you can do it with the apply function:
ID = 3

def check_value(A, B):
    global ID
    if A != B:
        ID += 1
    return ID

df['id'] = df.apply(lambda row: check_value(row['col1'], row['col2']), axis=1)
You just need to start from 3 since the first row will increment it.
print(df) will give you this:
col1 col2 id
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Another way could be using itertools.accumulate as follows:
import pandas as pd
import numpy as np
from itertools import accumulate

# initial=4 supplies row 0's value; each later row adds 1 when col1 != col2.
df['serialno'] = list(accumulate(np.arange(1, len(df.index)),
    lambda x, y: x + 1 if df.iloc[y, 0] != df.iloc[y, 1] else x, initial=4))
df
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
First, I have reused the idea posted by @BehRouz above of creating a user-defined function that increments the serial no. Second, I transform the boolean series into a dynamic serial no, as shown below for reference.
STEP 1: Create a dataframe and initialize the incremental counter (serialno)
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
df['serialno'] = 0  # initialize new column
print(df)
col1 col2 serialno
0 100 0 0
1 100 100 0
2 100 100 0
3 200 100 0
4 200 200 0
5 300 200 0
6 300 300 0
STEP 2: Create a boolean series, then use the transform method with the user-defined function posted by @BehRouz. This will create a dynamic serial no each time new rows are added to the dataframe.
start = df['serialno'].max()

def getvalue(x):
    global start
    if x:
        start += 1
    return start

df['serialno'] = (df['col1'] != df['col2']).transform(func=lambda x: getvalue(x))
print(df)
Iteration 1:
col1 col2 serialno
0 100 0 1
1 100 100 1
2 100 100 1
3 200 100 2
4 200 200 2
5 300 200 3
6 300 300 3
Iteration 2:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Iteration 3:
col1 col2 serialno
0 100 0 7
1 100 100 7
2 100 100 7
3 200 100 8
4 200 200 8
5 300 200 9
6 300 300 9
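
For the vectorized approach the question asks for, a cumulative sum over the boolean series gives the same serial numbers without apply or iteration (a sketch, assuming the start value of 4 from the question):
import pandas as pd

columns = ['col1']
data = ['100', '100', '100', '200', '200', '300', '300']
df = pd.DataFrame(data=data, columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)

start = 4
# Each row where col1 != col2 starts a new serial number; cumsum counts
# those change points, so adding start - 1 yields 4, 4, 4, 5, 5, 6, 6.
df['serialno'] = (df['col1'] != df['col2']).cumsum() + start - 1
print(df)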

Python Pandas Dataframe Groupby Sum question

I'm new to Python and I need to combine two dataframes with 'id' as the primary key, summing up all the Charges from df1 and df2.
df1:
id Name Charge
1 A 100
1 A 100
2 B 200
2 B 200
5 C 300
6 D 400
df2:
id Name Charge
1 A 100
1 A 100
2 B 200
8 X 200
output:
id Name Charge (TOTAL from df1 & df2)
1 A 400
2 B 600
5 C 300
6 D 400
8 X 200
Try:
pd.concat([df1, df2]).groupby(['id', 'Name'], as_index=False)['Charge'].sum()
Output:
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200
ans = pd.concat([df1, df2], axis=0).groupby(["id", "Name"]).sum().reset_index()
print(ans)
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200
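
Another option that produces the same result (a sketch) is to aggregate each frame separately and add the two results, letting fill_value handle ids that appear in only one frame:
s1 = df1.groupby(['id', 'Name'])['Charge'].sum()
s2 = df2.groupby(['id', 'Name'])['Charge'].sum()
# fill_value=0 keeps ids that exist in only one frame (here 5, 6 and 8).
out = s1.add(s2, fill_value=0).reset_index()
print(out)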

How to remove rows of a data frame when specific values are not in specific columns?

I have two data frames with four and two columns. For example:
A B C D
0 4 2 320 700
1 5 7 400 800
2 2 6 300 550
3 4 6 100 300
4 5 2 250 360
and
A B
0 2 4
1 5 7
2 2 5
I need to compare the first data frame with the second and keep a row of the first data frame only if its column A and column B values both appear in columns A and B of a row of the second data frame. Order doesn't matter: in the first row of the first data frame A is 4 and B is 2, while the second data frame has a row with A as 2 and B as 4; both numbers are present, so the whole row is kept. Otherwise the row is removed, so the output will be:
A B C D
0 4 2 320 700
1 5 7 400 800
2 5 2 250 360
How can I get this output? (My actual data frames are huge and I can't iterate through them, so I need a fast, efficient way.)
I would do this by first sorting, then performing a LEFT OUTER JOIN using merge with an indicator to determine which rows to keep. Example:
u = df.loc[:, ['A', 'B']]
u.values.sort() # sort columns of `u`
df2.values.sort() # sort columns of `df2`
df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
More info on joins with indicator can be found in my post: Pandas Merging 101
If you don't care about the final result being sorted or not, you can simplify this to an inner join.
import numpy as np

df[['A', 'B']] = np.sort(df[['A', 'B']])  # sort each row's pair in place
df2[:] = np.sort(df2)
df.merge(df2, on=['A', 'B'])
A B C D
0 2 4 320 700
1 5 7 400 800
2 2 5 250 360
What I would do: use frozenset + isin (a frozenset of each row's A and B values makes the comparison order-insensitive):
yourdf = df[df[['A', 'B']].apply(frozenset, 1).isin(df2.apply(frozenset, 1))].copy()
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
Using np.equal.outer:
import numpy as np

# arr has shape (rows_df, cols_df, rows_df2, cols_df2) of elementwise equality
arr = np.equal.outer(df, df2)
df.loc[arr.any(1).all(-1).any(-1)]  # both values of some df2 pair found in the row
Outputs
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
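
If you want to keep the original (unsorted) rows of the first frame, one more option (a sketch, assuming df and df2 are the two frames from the question) is to sort each pair into a MultiIndex and test membership with isin:
import numpy as np
import pandas as pd

# Build (low, high) pairs so (4, 2) and (2, 4) compare equal.
sorted_pairs = np.sort(df[['A', 'B']].to_numpy())
sorted_targets = np.sort(df2.to_numpy())
pairs = pd.MultiIndex.from_arrays([sorted_pairs[:, 0], sorted_pairs[:, 1]])
targets = pd.MultiIndex.from_arrays([sorted_targets[:, 0], sorted_targets[:, 1]])
print(df[pairs.isin(targets)])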

Python pandas: groupby and divide by the first value of each group

I have a pandas dataframe like this.
>data
ID Distance Speed
1 100 40
1 200 20
1 200 10
2 400 20
2 500 30
2 100 40
2 600 20
2 700 90
3 800 80
3 700 10
3 400 20
I want to group the table by ID and create a new column Time by dividing each value in the Distance column by the first value of the Speed column in each ID group. So the result should look like this.
>data
ID Distance Speed Time
1 100 40 2.5
1 200 20 5
1 200 10 5
2 400 20 20
2 500 30 25
2 100 40 5
2 600 20 30
2 700 90 35
3 800 80 10
3 700 10 8.75
3 400 20 5
My attempt:
data['Time'] = data['Distance'] / data.loc[data.groupby('ID')['Speed'].head(1).index, 'Speed']
But the result is not what I want. How can I do this?
Use transform with 'first' to return a Series the same length as the original df:
data['Time'] = data['Distance'] /data.groupby('ID')['Speed'].transform('first')
Or use drop_duplicates with map:
s = data.drop_duplicates('ID').set_index('ID')['Speed']
data['Time'] = data['Distance'] / data['ID'].map(s)
print (data)
ID Distance Speed Time
0 1 100 40 2.50
1 1 200 20 5.00
2 1 200 10 5.00
3 2 400 20 20.00
4 2 500 30 25.00
5 2 100 40 5.00
6 2 600 20 30.00
7 2 700 90 35.00
8 3 800 80 10.00
9 3 700 10 8.75
10 3 400 20 5.00
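
The reason transform works here is that it returns a Series aligned to the original index, so the division stays row-wise. A minimal, self-contained check with the question's data:
import pandas as pd

data = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    'Distance': [100, 200, 200, 400, 500, 100, 600, 700, 800, 700, 400],
    'Speed': [40, 20, 10, 20, 30, 40, 20, 90, 80, 10, 20],
})
# transform('first') broadcasts each group's first Speed back to every row.
data['Time'] = data['Distance'] / data.groupby('ID')['Speed'].transform('first')
print(data)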

Pandas: Find values within multiple ranges defined by start- and stop-columns

I'm trying to use two columns start and stop to define multiple ranges of values in another dataframe's age column. Ranges are defined in a df called intervals:
start stop
1 3
5 7
Ages are defined in another df:
age some_random_value
1 100
2 200
3 300
4 400
5 500
6 600
7 700
8 800
9 900
10 1000
Desired output is values where age is between the ranges defined in intervals (1-3 and 5-7):
age some_random_value
1 100
2 200
3 300
5 500
6 600
7 700
I've tried using numpy.r_ but it doesn't work quite as I want it to:
df.age.loc[pd.np.r_[intervals.start, intervals.stop]]
Which yields:
age some_random_value
2 200
6 600
4 400
8 800
Any ideas are much appreciated!
I believe you need the parameter closed='both' in IntervalIndex.from_arrays:
intervals = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], closed='both')
And then select matching values:
df = df[intervals.get_indexer(df.age.values) != -1]
print (df)
age some_random_value
0 1 100
1 2 200
2 3 300
4 5 500
5 6 600
6 7 700
Detail:
print (intervals.get_indexer(df.age.values))
[ 0 0 0 -1 1 1 1 -1 -1 -1]
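
Note that get_indexer assumes the intervals do not overlap; if they might, a boolean mask built with between works as well (a sketch, where df2 is the intervals frame):
import numpy as np

# OR together one inclusive between-mask per (start, stop) row.
mask = np.logical_or.reduce([df['age'].between(lo, hi)
                             for lo, hi in zip(df2['start'], df2['stop'])])
print(df[mask])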
