I am trying to work on a requirement where I have to set values inside a dataframe to NaN if they go beyond a certain value.

import pandas as pd

s={'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021':[11,12,34,42], 'qty':[45,22,12,42],'price':[22,33,44,55]}
p=pd.DataFrame(data=s)
k=(p.qty+p.price) # Not sure if this is the right way as per the requirement.
The condition is that if column 2018, 2019, 2020 or 2021 has a value greater than k, that value should be replaced with NaN.
Say if k=3, the fourth row in 2018 with value 4 becomes NaN.
k is not a single number, so the comparison needs to be made column by column against it and the offending values set to NaN accordingly.
How would I be able to do this?
Once you figure out exactly what k = (p.qty+p.price) is, you can update it. However, I think the way you want to solve this is using the gt() operator on a column by column basis. Here's my solution.
import pandas as pd
s={'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021': [11,12,34,42], 'qty':[1,2,3,4], 'price':[1,2,3,4]}
p=pd.DataFrame(data=s)
k = (p.qty * p.price)
needed = p[['qty', 'price']]
p = p.mask(p.gt(k, axis=0))
p[['qty','price']] = needed
print(p)
This outputs:
   2018  2019  2020  2021  qty  price
0     1   NaN   NaN   NaN    1      1
1     2   3.0   NaN   NaN    2      2
2     3   4.0   8.0   NaN    3      3
3     4   5.0   9.0   NaN    4      4
A few notes. I save and re-add the final columns; if you do not need those, you can remove the lines with the word needed. The line doing the bulk of the work is p = p.mask(p.gt(k, axis=0)). The comparison happens on the column level: '2019' : 2,3,4,5 gets compared to k : 1,4,9,16, yielding True, False, False, False (only 2 > 1). DataFrame.mask(cond) replaces the values where cond is True with NaN, so every value greater than k becomes NaN, which is exactly the requirement.
It is actually very simple. What you need is to learn more about the logical statements in pandas dataframes. To solve your problem, you can try code below:
s={'2018':[1,2,3,4],'2019':[2,3,4,5],'2020':[4,6,8,9],'2021':[11,12,34,42], 'qty':[45,22,12,42],'price':[22,33,44,55]}
p=pd.DataFrame(data=s)
k = 4
p[p <= k]
Output
   2018  2019  2020  2021  qty  price
0     1   2.0   4.0   NaN  NaN    NaN
1     2   3.0   NaN   NaN  NaN    NaN
2     3   4.0   NaN   NaN  NaN    NaN
3     4   NaN   NaN   NaN  NaN    NaN
Keeping only the values less than or equal to k means everything greater than k shows up as NaN, as required.
Note that k = (p.qty+p.price) will return a pandas Series (one value per row), not a scalar value.
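Since k is a Series, one way to apply it as a per-row threshold while leaving qty and price untouched is to restrict the comparison to the year columns. This is a sketch, not either answerer's code; the smaller qty/price values from the first answer are reused here so the threshold actually removes something:

```python
import pandas as pd

s = {'2018': [1, 2, 3, 4], '2019': [2, 3, 4, 5], '2020': [4, 6, 8, 9],
     '2021': [11, 12, 34, 42], 'qty': [1, 2, 3, 4], 'price': [1, 2, 3, 4]}
p = pd.DataFrame(data=s)

k = p.qty * p.price                      # a Series: one threshold per row
years = ['2018', '2019', '2020', '2021']
# axis=0 aligns k with the rows of each year column; mask() turns every
# value greater than its row's k into NaN and leaves qty/price alone.
p[years] = p[years].mask(p[years].gt(k, axis=0))
print(p)
```

Because qty and price never enter the comparison, there is no need to save and restore them.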
I have the following dataframe in Python:
ID  maths  value
0   add    12
1   sub    30
0   add    10
2   mult   3
0   sub    10
1   add    11
3   sub    40
2   add    21
My idea is to perform the following operations to get the result I want:
First step: Group the rows of the dataframe by ID, keeping the groups in the order in which the IDs appear in the original dataframe.
ID  maths  value
0   add    12
0   add    10
0   sub    10
1   sub    30
1   add    11
2   mult   3
2   add    21
3   sub    40
Second step: For each group, create a new column 'result' that applies the mathematical operation indicated by the previous row's 'maths' value to the previous row's 'value' and the current row's 'value'. If there is no previous row in the group, 'result' is NaN.
ID  maths  value  result
0   add    12     NaN
0   add    10     22
0   sub    10     20
1   sub    30     NaN
1   add    11     19
2   mult   3      NaN
2   add    21     63
3   sub    40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this using the pandas groupby method, but I have trouble iterating with conditions for each row and each group, and I don't know how to create the new column 'result' on a groupby object.
grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))
I don't know whether to use orderby or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.
import pandas as pd

ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# Shift Maths and Value down by one within each ID group, so each row
# carries the previous row's operation and value.
df["Maths"] = df.groupby("ID").pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby("ID").pipe(lambda g: g.Value.shift(1))
# Series.append was removed in pandas 2.0; pd.concat is the replacement.
df["result"] = df.groupby("Maths").pipe(lambda g: pd.concat([
    g.get_group("add")["Value1"] + g.get_group("add")["Value"],
    g.get_group("sub")["Value1"] - g.get_group("sub")["Value"],
    g.get_group("mult")["Value1"] * g.get_group("mult")["Value"],
])).sort_index()
Here is the Output:
ID Maths Value Value1 result
0 0 add 12 NaN NaN
1 0 add 10 12.0 22.0
2 0 add 10 10.0 20.0
3 1 add 30 NaN NaN
4 1 sub 11 30.0 19.0
5 2 add 3 NaN NaN
6 2 mult 21 3.0 63.0
7 3 add 40 NaN NaN
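A flatter alternative (my own sketch, not the answerer's code): shift both columns within each ID group and pick the operation with np.select, which avoids get_group entirely and leaves the original maths column intact:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': list("00011223"),
    'maths': ["add", "add", "sub", "sub", "add", "mult", "add", "sub"],
    'value': [12, 10, 10, 30, 11, 3, 21, 40],
})

g = df.groupby('ID')
prev_val = g['value'].shift()   # previous value within each ID group
prev_op = g['maths'].shift()    # operation carried over from the previous row

# First row of each group has prev_op = NaN, so no condition matches
# and np.select falls through to the NaN default.
df['result'] = np.select(
    [prev_op.eq('add'), prev_op.eq('sub'), prev_op.eq('mult')],
    [prev_val + df['value'], prev_val - df['value'], prev_val * df['value']],
    default=np.nan,
)
print(df)
```

Adding a new operation is then just one more condition/choice pair.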
I have a dataset which looks like this:
import pandas as pd, numpy as np
df = pd.DataFrame([[1,0,3,0], [5,6,7,8], [9,10,11,12], [13,14,15,16], [0,0,19,0]], columns=['a','b','c','d'])
So what I want to do is:
in the last row, wherever the value is 0, replace it with the mean of the previous three rows of the same column
if the value is not 0, leave it as it is
Also, all other 0s elsewhere should remain 0.
So the end result should look something like this:
a b c d
1 0 3 0
5 6 7 8
9 10 11 12
13 14 15 16
9 10 19 12
Here, all three 0s are replaced with the previous three values' mean. And 19 remains as it is.
What I am trying to do is:
if (df.iloc[-1].any()==0):
    df.iloc[-1] = df[-4:-1].mean()
else:
    pass
This did not change the values and no error was returned as well. What am I doing wrong here?
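As an aside on why nothing happens: .any() collapses the row to one boolean, so the comparison with 0 is never element-wise. A small sketch reproducing the issue (using the same frame as the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 3, 0], [5, 6, 7, 8], [9, 10, 11, 12],
                   [13, 14, 15, 16], [0, 0, 19, 0]],
                  columns=['a', 'b', 'c', 'd'])

# .any() reduces the last row to a single boolean: True, because 19 is
# nonzero. `True == 0` is then False, so the if-branch never runs.
print(df.iloc[-1].any() == 0)       # False

# An element-wise test is needed instead, one boolean per column:
mask = df.iloc[-1].eq(0)
print(mask.tolist())                # [True, True, False, True]
```

The answers below build exactly such an element-wise condition.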
It'll be much easier if you just replace 0 with NaN, then use fillna with a rolling mean and shift:
>>> df.iloc[-1]=df.iloc[-1].replace(0, np.nan)
>>> df=df.fillna(df.rolling(3, min_periods=1).mean().shift())
OUTPUT:
a b c d
0 1.0 0.0 3 0.0
1 5.0 6.0 7 8.0
2 9.0 10.0 11 12.0
3 13.0 14.0 15 16.0
4 9.0 10.0 19 12.0
With np.where:
last_row = df.iloc[-1]
df.iloc[-1] = np.where(last_row.eq(0), df.iloc[-4:-1].mean(), last_row)
This will take values from three previous rows' mean where last row is equal to 0 and from the last row itself otherwise, i.e., nonzero values will stay as is.
pandas' where can be similarly used:
last_row = df.iloc[-1]
df.iloc[-1] = last_row.where(last_row.ne(0), df.iloc[-4:-1].mean())
Where the last row's values are not equal to 0, they are kept as is; the zeros are replaced with the mean of the previous three rows.
I have a dataframe with around 50 columns and around 3000 rows. Most cells are empty but not all of them. I am trying to add a new row at the end of the dataframe, with the mean value of each column and I need it to ignore the empty cells.
I am using df.mean(axis=0), which somehow turns all values of the dataframe into imaginary numbers. All values stay the same, but a +0j is added. I have no idea why.
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
I couldn't find a solution for this. Is it because of the empty cells?
Based on the documentation, df.mean() automatically skips NaN/null values, since its skipna parameter defaults to True. Example:
import numpy as np
import pandas as pd

df=pd.DataFrame({'value':[1,2,3,np.nan,5,6,np.nan]})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame({'value': [df['value'].mean()]})], ignore_index=True)
print(df)
Output:
value
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 3.4
But if there is a complex number in a cell, the whole column is upcast to complex, and the result of df.mean() is complex as well. Example:
df=pd.DataFrame({'value':[1,2,3,np.nan,5,6,np.nan, complex(1,0)]})
print(df)
print('\n')
df = pd.concat([df, pd.DataFrame({'value': [df['value'].mean()]})], ignore_index=True)
print(df)
Output with a complex value in a cell:
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
8 (3+0j)
Hope this can help you :)
It turned out that some cells had information about directions (north, west, ...) in them, which were interpreted as imaginary numbers.
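One way to guard against such stray text cells is to coerce everything to numeric before taking the mean. This is a sketch with made-up column names and values, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text cell ('north') hiding in otherwise
# numeric data, plus genuinely empty cells.
Turbine = pd.DataFrame({'speed': [1.0, 2.0, np.nan, 4.0],
                        'direction': [3.0, 'north', 5.0, np.nan]})

# Coerce each column to numeric; unparseable cells become NaN, which
# mean() then skips, instead of poisoning the column's dtype.
Turbine = Turbine.apply(pd.to_numeric, errors='coerce')
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
print(Turbine)
```

The coerced NaNs are then treated exactly like the genuinely empty cells.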
I cannot figure out how to compare two columns and, if one column is greater than or equal to the other, put a 1 in a new column. If the condition is not met, I would like python to do nothing.
The data set for testing is here:
data = [[12,10],[15,10],[8,5],[4,5],[15,'NA'],[5,'NA'],[10,10], [9,10]]
df = pd.DataFrame(data, columns = ['Score', 'Benchmark'])
Score Benchmark
0 12 10
1 15 10
2 8 5
3 4 5
4 15 NA
5 5 NA
6 10 10
7 9 10
The desired output is:
desired_output_data = [[12,10, 1],[15,10,1],[8,5,1],[4,5],[15,'NA'],[5,'NA'],[10,10,1], [9,10]]
desired_output_df = pd.DataFrame(desired_output_data, columns = ['Score', 'Benchmark', 'MetBench'])
Score Benchmark MetBench
0 12 10 1.0
1 15 10 1.0
2 8 5 1.0
3 4 5 NaN
4 15 NA NaN
5 5 NA NaN
6 10 10 1.0
7 9 10 NaN
I tried doing something like this:
if df['Score'] >= df['Benchmark']:
    df['MetBench'] = 1
I am new to programming in general so any guidance would be greatly appreciated.
Thank you!
You can use ge and map:
df.Score.ge(df.Benchmark).map({True: 1, False:np.nan})
or use the mapping from False to np.nan implicitly, since pandas uses the dict.get method to apply the mapping, and None is the default value (thanks @piRSquared):
df.Score.ge(df.Benchmark).map({True: 1})
Or simply series.where, which keeps the True values and nulls out the rest (note this leaves True rather than 1.0 in the kept rows):
df.Score.ge(df.Benchmark).where(lambda s: s)
The map versions output:
0 1.0
1 1.0
2 1.0
3 NaN
4 NaN
5 NaN
6 1.0
7 NaN
dtype: float64
Make sure to do
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
first, since you have 'NA' as a string, but you need the numeric value np.nan to be able to compare it with other numbers
I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more importantly, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
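The distinction the question is circling (my own explanation, not an authoritative one) is aggregation versus transformation: sum() returns one value per group, indexed by the group keys, while diff() returns one value per original row, computed within each group but kept on the original index:

```python
import pandas as pd

aa = pd.DataFrame({'user': list('FTTTBKJTJB'),
                   'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]})

# Aggregation: one value per group, indexed by the group keys.
totals = aa.groupby('user')['time'].sum()

# Transformation: same length as the input, original index preserved,
# so the result can be assigned straight back as a new column. Each
# group is diffed independently; the first row of every group is NaN
# because it has no predecessor within its group.
aa['diff'] = aa.groupby('user')['time'].diff()
print(aa)
```

Because the transformed result keeps the original index, pandas aligns it row for row when assigning the new column, which is why no "unique list of users" ever appears.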