Changing value of adjacent column based on value of another column - python

I have the following dataframe:
A1 A2 B1 B2
0 10 20 20 NA
1 20 40 30 No
2 50 No 50 10
3 40 NA 50 20
I want to change the value in column A1 to NaN whenever the corresponding value in column A2 is No or NA, and likewise change B1 based on B2.
Note: NA here is a string object, not NaN.
A1 A2 B1 B2
0 10 20 NaN NA
1 20 40 NaN No
2 NaN No 50 10
3 NaN NA 50 20
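For reference, the frame can be built like so (a sketch of the data above, with NA and No stored as strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [10, 20, 50, 40],
                   'A2': ['20', '40', 'No', 'NA'],
                   'B1': [20, 30, 50, 50],
                   'B2': ['NA', 'No', '10', '20']})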

If NA and No are strings, use Series.isin with DataFrame.loc:
df.loc[df.A2.isin(['NA','No']), 'A1'] = np.nan
Or Series.mask:
df['A1'] = df['A1'].mask(df.A2.isin(['NA','No']))
If NA is a missing value, test it with Series.isna:
df.loc[df.A2.isna() | df.A2.eq('No'), 'A1'] = np.nan
Or:
df['A1'] = df['A1'].mask(df.A2.isna() | df.A2.eq('No'))
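To apply the same rule to both column pairs at once, you can loop over the pairs; a minimal sketch using the frame built above:
# mask each value column wherever its companion column is 'NA' or 'No'
for val_col, flag_col in [('A1', 'A2'), ('B1', 'B2')]:
    df[val_col] = df[val_col].mask(df[flag_col].isin(['NA', 'No']))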

output NaN value when using apply function to a dataframe with index

I am trying to use the apply function to create 2 new columns. When the dataframe has a custom index it doesn't work: the new columns have values of NaN. If the dataframe has the default index, it works. Could you please help? Thanks
import pandas as pd

def calc_test(row):
    a = row['col1'] + row['col2']
    b = row['col1'] / row['col2']
    return (a, b)

df_test_dict = {'col1': [1, 2, 3, 4, 5], 'col2': [10, 20, 30, 40, 50]}
df_test = pd.DataFrame(df_test_dict)
df_test.index = ['a1', 'b1', 'c1', 'd1', 'e1']
df_test
col1 col2
a1 1 10
b1 2 20
c1 3 30
d1 4 40
e1 5 50
Now when I use the apply function, the newly created columns have values of NaN. Thanks for your help.
df_test[['a','b']] = pd.DataFrame(df_test.apply(lambda row:calc_test(row),axis=1).tolist())
df_test
col1 col2 a b
a1 1 10 NaN NaN
b1 2 20 NaN NaN
c1 3 30 NaN NaN
d1 4 40 NaN NaN
e1 5 50 NaN NaN
When using apply, you can use the result_type='expand' argument to expand the output of your function into columns of a pandas DataFrame:
df_test[['a','b']] = df_test.apply(lambda row: calc_test(row), axis=1, result_type='expand')
This returns:
col1 col2 a b
a1 1 10 11.0 0.1
b1 2 20 22.0 0.1
c1 3 30 33.0 0.1
d1 4 40 44.0 0.1
e1 5 50 55.0 0.1
You are wrapping the return of the apply in a DataFrame, which gets a default index of [0, 1, 2, 3, 4]; those labels don't exist in your original DataFrame's index, so the assignment aligns on nothing and produces NaN. You can see this by looking at the output of pd.DataFrame(df_test.apply(lambda row: calc_test(row), axis=1).tolist()).
Simply remove the pd.DataFrame() to fix this problem.
df_test[['a', 'b']] = df_test.apply(lambda row:calc_test(row),axis=1).tolist()
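To see the mismatch directly, you can inspect the intermediate frame's index; a quick sketch using the df_test built above:
intermediate = pd.DataFrame(df_test.apply(lambda row: calc_test(row), axis=1).tolist())
print(intermediate.index)  # RangeIndex(start=0, stop=5, step=1), not ['a1', ..., 'e1']
# .tolist() hands pandas plain values, so no index alignment takes place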

How can I sum two dataframe's totals in a new dataframe?

I got the following code:
df_A = pd.DataFrame({'a1': [2, 2, 3, 5, 6],
                     'a2': [8, 6, 3, 5, 2],
                     'a3': [7, 4, 3, 0, 6]})
df_B = pd.DataFrame({'b1': [9, 5, 3, 7, 6],
                     'b2': [0, 6, 4, 5, 3],
                     'b3': [7, 8, 8, 0, 10]})
This looks like:
a1 a2 a3
0 2 8 7
1 2 6 4
2 3 3 3
3 5 5 0
4 6 2 6
and:
b1 b2 b3
0 9 0 7
1 5 6 8
2 3 4 8
3 7 5 0
4 6 3 10
I want to have the sum of each column so I did:
total_A = df_A.sum()
total_B = df_B.sum()
The outcome for total_A was:
0
a1 18
a2 24
a3 20
for total_B:
0
b1 30
b2 18
b3 33
And then both totals need to be summed as well, but I am getting NaNs.
I'd prefer to get a df with columns named
total_1, total_2, total_3
holding the total values for each column:
total_1, total_2, total_3
48 42 53
So 48 is the sum of column a1 + column b1, 42 the sum of a2 + b2, and 53 the sum of a3 + b3.
Can someone help me please?
The indexes are not aligned, so pandas won't sum a1 with b1. You need to align the indexes, and there are many different ways to do it.
You can use the underlying numpy data of B to avoid index alignment:
df_A.sum()+df_B.sum().values
or rename B's columns to match those of A:
df_A.add(df_B.set_axis(df_A.columns, axis=1)).sum()
output:
a1 48
a2 42
a3 53
dtype: int64
or set a common index:
(df_A
 .rename(columns=lambda x: x.replace('a', 'total_'))
 .add(df_B.rename(columns=lambda x: x.replace('b', 'total_')))
 .sum()
)
output:
total_1 48
total_2 42
total_3 53
dtype: int64
as numpy array:
(df_A.to_numpy()+df_B.to_numpy()).sum(0)
output:
array([48, 42, 53])
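If you specifically want the one-row DataFrame with total_1, total_2, total_3 columns from the question, one way is to wrap the summed array; a sketch building on df_A and df_B above:
totals = (df_A.to_numpy() + df_B.to_numpy()).sum(axis=0)
result = pd.DataFrame([totals], columns=['total_1', 'total_2', 'total_3'])
print(result)
#    total_1  total_2  total_3
# 0       48       42       53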

How to sum values in dataframe until certain values in other column by group?

I have a dataframe:
id life_day value
a1 1 10
a1 2 20
a1 3 10
a1 4 5
a1 5 5
a1 6 1
b2 1 7
b2 3 11
b2 4 10
b2 5 20
I want to sum the values for each id up to life_day 4. So the desired result is:
id life_day value
a1 4 45
b2 4 28
How to do that? I tried df[df["life_day"] == 90].groupby("id").sum() but it brings wrong results.
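For reproducibility, the answers below assume the frame is constructed roughly like this (a sketch of the data above):
import pandas as pd

df = pd.DataFrame({'id': ['a1'] * 6 + ['b2'] * 4,
                   'life_day': [1, 2, 3, 4, 5, 6, 1, 3, 4, 5],
                   'value': [10, 20, 10, 5, 5, 1, 7, 11, 10, 20]})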
Your approach almost works, but I don't know why you wrote == 90 in df["life_day"] == 90, and it looks like you want the max of life_day, not the sum.
df[df['life_day'] <= 4].groupby('id').agg({'life_day': 'max', 'value': 'sum'})
life_day value
id
a1 4 45
b2 4 28
Use pandas where to mask the non-matching rows, then groupby and agg:
df.where(df['life_day'].le(4)).groupby('id').agg({'life_day':'last','value':'sum'}).reset_index()
id life_day value
0 a1 4.0 45.0
1 b2 4.0 28.0
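Note that where fills the masked rows with NaN, so life_day and value come back as floats (4.0, 45.0). If you want integers, you can cast after aggregating; a sketch:
out = (df.where(df['life_day'].le(4))
         .groupby('id')
         .agg({'life_day': 'last', 'value': 'sum'})
         .astype(int)
         .reset_index())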

Replace a subset of rows of dataframe A with rows of dataframe B in python pandas

I'm trying to replace the values in rows 500 to 750 of column col1 in dataframe df_A with the values of column col1 of dataframe df_B (which has altogether 250 rows) in Python Pandas.
I tried doing it like this
df_A.col1.iloc[500:750] = df_B.col1
But this yields the notorious
A value is trying to be set on a copy of a slice from a DataFrame
and the values in df_A.col1.iloc[500:750] get replaced by NaNs. So how can I do this kind of replacement of several rows with rows from another dataframe in Pandas without using a for-loop?
Try to use loc instead:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['a0', 'a1', 'a2'])
dg = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['b0', 'b1', 'b2'])
print('df=', df)
print('\ndg=', dg)
# replacement of [5, 8, 11] by [1, 4, 7]
df.loc[1:3, 'a2'] = dg.b1.values
print("\ndf (after replacement)\n", df)
df= a0 a1 a2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
dg= b0 b1 b2
0 0 1 2
1 3 4 5
2 6 7 8
df (after replacement)
a0 a1 a2
0 0 1 2
1 3 4 1
2 6 7 4
3 9 10 7
4 12 13 14
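Applied to the original question, and assuming df_A has a default RangeIndex and df_B has exactly 250 rows, the same idea would look like this sketch; .values strips df_B's index so no alignment happens:
# rows 500..749 (inclusive with loc) receive df_B's 250 col1 values
df_A.loc[500:749, 'col1'] = df_B['col1'].values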

Conditionally count values in a pandas groupby object

I have a pandas.core.groupby.DataFrameGroupBy object where I am trying to count the number of rows where a value for TOTAL_FLOOR_AREA is > 30. I can count the number of rows for each dataframe in the groupby object using:
import numpy as np
grouped = master_lsoa.groupby('lsoa11')
grouped.aggregate(np.count_nonzero).TOTAL_FLOOR_AREA
But how do I conditionally count rows where the value for TOTAL_FLOOR_AREA is greater than 30?
Sam
I think you need:
np.random.seed(6)
N = 15
master_lso = pd.DataFrame({'lsoa11': np.random.randint(4, size=N),
                           'TOTAL_FLOOR_AREA': np.random.choice([0, 30, 40, 50], size=N)})
master_lso['lsoa11'] = 'a' + master_lso['lsoa11'].astype(str)
print(master_lso)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
2 30 a3
3 0 a0
4 40 a2
5 0 a1
6 30 a3
7 0 a2
8 40 a0
9 0 a2
10 0 a1
11 50 a1
12 50 a3
13 40 a1
14 30 a1
First filter the rows by the condition with boolean indexing; doing this before grouping is faster because there are fewer rows.
df = master_lso[master_lso['TOTAL_FLOOR_AREA'] > 30]
print(df)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
4 40 a2
8 40 a0
11 50 a1
12 50 a3
13 40 a1
Then groupby and aggregate size:
df1 = df.groupby('lsoa11')['TOTAL_FLOOR_AREA'].size().reset_index(name='Count')
print(df1)
lsoa11 Count
0 a0 1
1 a1 3
2 a2 2
3 a3 1
You could also construct a new column indicating whether the condition is met and sum it up (borrowing @jezrael's dataframe):
master_lso.assign(Large_Enough=lambda x: x["TOTAL_FLOOR_AREA"] > 30)\
          .groupby('lsoa11')["Large_Enough"].sum().reset_index()
Note that True values are interpreted as 1, so the sum provides the corresponding count here.
The advantage over @jezrael's solution is that you can still sum up the total area per group.
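A third option, shown here as a compact sketch for comparison, is to group the boolean mask itself without creating a helper column:
counts = (master_lso['TOTAL_FLOOR_AREA'].gt(30)
          .groupby(master_lso['lsoa11'])
          .sum()
          .reset_index(name='Count'))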
