Perform operations on a dataframe from groupings by ID - python

I have the following dataframe in Python:
ID  maths  value
0   add       12
1   sub       30
0   add       10
2   mult       3
0   sub       10
1   add       11
3   sub       40
2   add       21
My idea is to perform the following operations to get the result I want:
First step: Group the rows of the dataframe by ID. The groups should keep the order in which each ID first appears in the original dataframe.
ID  maths  value
0   add       12
0   add       10
0   sub       10
1   sub       30
1   add       11
2   mult       3
2   add       21
3   sub       40
Second step: For each group, create a new column 'result' by applying the operation named in the previous row's 'maths' column to the previous row's 'value' and the current row's 'value'. If there is no previous row in the group, this column gets the value NaN.
ID  maths  value  result
0   add       12     NaN
0   add       10      22
0   sub       10      20
1   sub       30     NaN
1   add       11      19
2   mult       3     NaN
2   add       21      63
3   sub       40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this with the pandas groupby method, but I have trouble iterating with conditions over each row of each group, and I don't know how to create the new column 'result' on a groupby object.

grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))

I don't know whether to use sort_values or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.

import pandas as pd

ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# shift Maths and Value down by one within each ID group; the first row of a
# group has no predecessor, so its operation falls back to "add"
df["Maths"] = df.groupby(["ID"]).pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby(["ID"]).pipe(lambda g: g.Value.shift(1))

# apply each operation to its own set of rows, then restore the row order
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
df["result"] = df.groupby(["Maths"]).pipe(lambda g: pd.concat([
    g.get_group("add")["Value1"] + g.get_group("add")["Value"],
    g.get_group("sub")["Value1"] - g.get_group("sub")["Value"],
    g.get_group("mult")["Value1"] * g.get_group("mult")["Value"],
])).sort_index()
Here is the Output:
df
Out[168]:
  ID Maths  Value  Value1  result
0  0   add     12     NaN     NaN
1  0   add     10    12.0    22.0
2  0   add     10    10.0    20.0
3  1   add     30     NaN     NaN
4  1   sub     11    30.0    19.0
5  2   add      3     NaN     NaN
6  2  mult     21     3.0    63.0
7  3   add     40     NaN     NaN
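For reference, here is a vectorized sketch of the same three steps that avoids the get_group chains entirely: groupby(sort=False) keeps the groups in order of first appearance, and shift plus numpy.select applies the previous row's operation. The dataframe name testing follows the question; this is one possible approach, not the only one:

import numpy as np
import pandas as pd

testing = pd.DataFrame({
    "ID":    [0, 1, 0, 2, 0, 1, 3, 2],
    "maths": ["add", "sub", "add", "mult", "sub", "add", "sub", "add"],
    "value": [12, 30, 10, 3, 10, 11, 40, 21],
})

# step 1: regroup the rows by ID, keeping groups in first-appearance order
grouped = pd.concat(g for _, g in testing.groupby("ID", sort=False))

# step 2: the operation and the left operand come from the previous row of each group
prev_op = grouped.groupby("ID", sort=False)["maths"].shift()
prev_val = grouped.groupby("ID", sort=False)["value"].shift()

grouped["result"] = np.select(
    [prev_op.eq("add"), prev_op.eq("sub"), prev_op.eq("mult")],
    [prev_val + grouped["value"],
     prev_val - grouped["value"],
     prev_val * grouped["value"]],
    default=np.nan,  # the first row of each group has no predecessor
)

# step 3: return the result (reset the index for a clean 0..n-1 index)
print(grouped.reset_index(drop=True))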

Related

Filtering rows that have unique value in a column using pandas

I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only rows that have 1 unique value in column value so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped id = 2 and id = 3 because they have more than one unique value in column value (15, 13 and 10, 20 respectively).
I read this answer.
But this simply removes duplicates whereas I want to check if a given column - in this case column value has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns NaN for every row, since I am trying to fit n group results onto n+m rows after grouping. I can write a function and apply it to every row, but I was wondering if there is a smart, quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
Let us do filter
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10
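A small design note on the two approaches above: groupby.filter drops whole groups and keeps the original index of the surviving rows, while the transform version builds a boolean mask; the mask is typically faster when there are many groups and composes easily with other conditions, for example (the extra value condition here is purely illustrative):

mask = df.groupby('id')['value'].transform('nunique').eq(1)
df[mask & df['value'].gt(5)]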

Python to check for a particular condition in a dataframe

I am trying to work on a requirement where I have to set values inside a dataframe to NaN if they go beyond a certain value.

import pandas as pd

s = {'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021':[11,12,34,42],
     'qty':[45,22,12,42],'price':[22,33,44,55]}
p = pd.DataFrame(data=s)
k = (p.qty + p.price)  # Not sure if this is the right way as per the requirement.

The condition is that if any of the columns 2018, 2019, 2020 or 2021 has a value greater than k, that value should be replaced with NaN.
Say if k=3, the fourth row in 2018 with value 4 will be NaN.
The k value will be different for each column, hence it needs to be calculated column-wise and the values set to NaN accordingly.
How would I be able to do this?
Once you figure out exactly what k = (p.qty + p.price) should be, you can update it. However, I think the way you want to solve this is with the gt() operator on a column-by-column basis. Here's my solution.
import pandas as pd
s={'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021': [11,12,34,42], 'qty':[1,2,3,4], 'price':[1,2,3,4]}
p=pd.DataFrame(data=s)
k = (p.qty * p.price)
needed = p[['qty', 'price']]
p = p.where(p.gt(k, axis=0), None)
p[['qty','price']] = needed
print(p)
This Outputs:
2018 2019 2020 2021 qty price
0 NaN 2.0 4.0 11 1 1
1 NaN NaN 6.0 12 2 2
2 NaN NaN NaN 34 3 3
3 NaN NaN NaN 42 4 4
A few notes. I save and re-add the final columns; if you do not need those, you can remove the lines with the word needed. The bulk of the work happens in p = p.where(p.gt(k, axis=0), None). In this example the comparisons are at the column level: '2019': 2,3,4,5 gets compared to k: 1,4,9,16, which gives True, False, False, False (2 > 1, but 3, 4, 5 are all less than 4, 9, 16). DataFrame.where(cond, other) replaces the positions where cond is False with other, here None, Python's standard null value.
It is actually very simple; all you need is the boolean indexing that pandas dataframes support. To solve your problem, you can try the code below:

import numpy as np
import pandas as pd

s = {'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021':[11,12,34,42],
     'qty':[45,22,12,42],'price':[22,33,44,55]}
p = pd.DataFrame(data=s)
k = 4
p[p < k] = np.nan  # every cell below the threshold becomes NaN
p
Output:

   2018  2019  2020  2021  qty  price
0   NaN   NaN     4    11   45     22
1   NaN   NaN     6    12   22     33
2   NaN     4     8    34   12     44
3     4     5     9    42   42     55
Note that k = (p.qty + p.price) returns a pandas Series (one value per row), not a scalar value.
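For what it's worth, if the literal requirement is to blank out the year values that exceed the row's threshold, here is a minimal sketch with DataFrame.where and a per-row k; reading k = qty + price as one threshold per row is an assumption about the intent:

import pandas as pd

s = {'2018': [1, 2, 3, 4], '2019': [2, 3, 4, 5], '2020': [4, 6, 8, 9],
     '2021': [11, 12, 34, 42], 'qty': [45, 22, 12, 42], 'price': [22, 33, 44, 55]}
p = pd.DataFrame(data=s)

k = p['qty'] + p['price']  # one threshold per row (a Series, not a scalar)
years = ['2018', '2019', '2020', '2021']

# keep values <= k, turn everything greater than k into NaN
p[years] = p[years].where(p[years].le(k, axis=0))
print(p)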

Pandas: Use selected amount of previous rows in apply function

Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row using the previous two rows, with an "apply" statement. Let's say, for example, I want to multiply the current row by the previous two rows where they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
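One note on this approach: rolling(3) requires three observations per window, which is what produces the two leading NaNs automatically (min_periods can relax that). If speed matters, passing raw=True hands plain numpy arrays to the function instead of Series, which is usually faster:

import numpy as np

df['result'] = df['value'].rolling(3).apply(np.prod, raw=True)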
Use the assign function:

df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))

I presume you have more than four rows. If so, please try grouping every four rows, taking the cumulative product, choosing the last two, and joining back to the original dataframe:

df['result'] = df.index.map(df.assign(result=df['value'].cumprod()).groupby(df.index // 4).result.tail(2).to_dict())

If you have just four rows, then this should do it. Let's try combining .cumprod() and .tail():

df['result'] = df['value'].cumprod().tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0

Perform a lookup for each column individually Python Pandas

For a dataframe I would like to perform a lookup for every column and place the results in the neighbouring column. id_df contains the IDs and looks as follows:
Col1 Col2 ... Col160 Col161
0 4328.0 4561.0 ... NaN 5828.0
1 3587.0 4328.0 ... NaN 20572.0
2 4454.0 1702.0 ... NaN 683.0
lookup_df also contains the ID and a value that I'm interested in. lookup_df looks as follows:
ID Value
0 3587 3.0650
1 4454 2.9000
2 5 2.8450
3 8 2.8750
4 11 3.1000
5 13 3.1600
6 16 2.4450
7 18 3.0700
8 20 2.7950
9 23 3.0500
10 25 3.2250
I would like to get the following Dataframe df3:
   Col1 ID  Col1 Value  ...  Col161 ID  Col161 Value
0 4328.0 2.4450 ... 5828.0 3.1600
1 3587.0 3.2250 ... 20572.0 3.0650
2 4454.0 3.0500 ... 683.0 3.1600
Because I'm an Excel user I thought of using the function 'merge', but I don't see how this can be done with multiple columns.
Thank you!
Use map:

m = lookup_df.set_index('ID')['Value']

result = pd.DataFrame()
for col in id_df.columns:
    result[col + '_ID'] = id_df[col]
    result[col + '_Value'] = id_df[col].map(m)
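A possible refinement, since growing a dataframe one column at a time can fragment it internally: collect the pairs in a list and concatenate once. A sketch reusing m and id_df from above:

pieces = []
for col in id_df.columns:
    pieces.append(id_df[col].rename(col + '_ID'))
    pieces.append(id_df[col].map(m).rename(col + '_Value'))
result = pd.concat(pieces, axis=1)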

Compare corresponding columns with each other and store the result in a new column

I have data which I pivoted using the pivot_table method, and now it looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as the index. Now I compare each column with the column next to it and store the result in a new column. The rules are:

- If the first column has a non-zero value and the second column has 0, then 100 should go into the new column.
- If the situation is the reverse, or if both columns have 0, then Null should go into the new column.
- If the last column has the value 0, Null should be used; for anything other than 0, 100 should be used.
- But if both columns have non-zero values (as in the last row of my data), then the comparison for columns a and b should be:

value_of_b / value_of_a * 50 + 50

and for columns b and c:

value_of_c / value_of_b * 25 + 25

and similarly, if there are more columns, the multiplier and addend would be 12.5, and so on.

I was able to achieve all of the above apart from the last part, the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df == 0, m], [np.nan, df], 100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe which stores my pivoted table data which I mentioned at the start. After using this code my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you guys can help me get the desired data , I would greatly appreciate it.
Edit:
This is how my data looks:
The problem is that the coefficient used to build a new compx column does not depend only on the column's position. In fact, in each row it is reset to its maximum of 50 after each 0 value and is half of the previous one after a non-zero value. Such resettable series are hard to vectorize in pandas, especially across rows. Here I would build a companion dataframe holding only those coefficients, and use the underlying numpy arrays directly to compute them as efficiently as possible. The code could be:
# transpose the dataframe to process columns instead of rows
coeff = df.T

# compute the coefficients
for name, s in coeff.items():
    top = 100                 # start at 100
    r = []
    for v in s:
        if v == 0:            # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2     # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r    # set the whole column in one operation

# transpose back to have a companion dataframe for df
coeff = coeff.T
# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]  # store the name of the first column

# Ok, enumerate all the columns (except the first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col  # keep the current column name for the next iteration

# special processing for the last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
Ok, I think you can iterate over your dataframe df and use some if-else logic to get the desired output:

for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:  # columns start from index 0,
        df.loc[i, 'colname'] = 'whatever you want'  # so rule_id is column 0
    elif ...:  # fill in the remaining conditions from the rules above
        ...
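If you do take the explicit-iteration route, a variant with iterrows may read more clearly than positional iloc indexing; the column names a and b are taken from the question's sample data, and this sketches only the first rule:

for idx, row in df.iterrows():
    if row['a'] != 0 and row['b'] == 0:
        df.loc[idx, 'comp1'] = 100
    # ...the remaining rules would follow the same pattern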
