How can I sum two dataframes' totals in a new dataframe? - python

I got the following code:
import pandas as pd

df_A = pd.DataFrame({'a1': [2, 2, 3, 5, 6],
                     'a2': [8, 6, 3, 5, 2],
                     'a3': [7, 4, 3, 0, 6]})
df_B = pd.DataFrame({'b1': [9, 5, 3, 7, 6],
                     'b2': [0, 6, 4, 5, 3],
                     'b3': [7, 8, 8, 0, 10]})
This looks like:
a1 a2 a3
0 2 8 7
1 2 6 4
2 3 3 3
3 5 5 0
4 6 2 6
and:
b1 b2 b3
0 9 0 7
1 5 6 8
2 3 4 8
3 7 5 0
4 6 3 10
I want to have the sum of each column so I did:
total_A = df_A.sum()
total_B = df_B.sum()
The outcome for total_A was:
0
a1 18
a2 24
a3 20
for total_B:
0
b1 30
b2 18
b3 33
And then both totals need to be summed as well, but I am getting NaNs.
I would prefer to get a df with columns named
total_1, total_2, total_3
and the total values for each column as the single row:
total_1, total_2, total_3
48 42 53
So 48 is the sum of column a1 + column b1, 42 is the sum of column a2 + column b2, and 53 is the sum of column a3 + column b3.
Can someone help me please?

The indexes are not aligned, so pandas won't sum a1 with b1. You need to align the index, and there are many different ways to do that.
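For example, adding the two Series directly reproduces the NaNs mentioned in the question, because none of the index labels match:
total_A + total_B
a1   NaN
a2   NaN
a3   NaN
b1   NaN
b2   NaN
b3   NaN
dtype: float64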
You can use the underlying numpy data of B to avoid index alignment:
df_A.sum()+df_B.sum().values
or rename B's columns to match those of A:
df_A.add(df_B.set_axis(df_A.columns, axis=1)).sum()
output:
a1 48
a2 42
a3 53
dtype: int64
or rename both to a common set of column names:
(df_A
.rename(columns=lambda x: x.replace('a', 'total_'))
.add(df_B.rename(columns=lambda x: x.replace('b', 'total_')))
.sum()
)
output:
total_1 48
total_2 42
total_3 53
dtype: int64
or work directly with numpy arrays:
(df_A.to_numpy()+df_B.to_numpy()).sum(0)
output:
array([48, 42, 53])
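If you specifically want the one-row DataFrame with total_1, total_2, total_3 as columns, a small sketch building on the rename approach above is to transpose the summed Series:
totals = (df_A
          .rename(columns=lambda x: x.replace('a', 'total_'))
          .add(df_B.rename(columns=lambda x: x.replace('b', 'total_')))
          .sum())
totals.to_frame().T
output:
   total_1  total_2  total_3
0       48       42       53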

Related

How to sum values in dataframe until certain values in other column by group?

I have a dataframe:
id life_day value
a1 1 10
a1 2 20
a1 3 10
a1 4 5
a1 5 5
a1 6 1
b2 1 7
b2 3 11
b2 4 10
b2 5 20
I want to sum values for each id up to life_day 4, so the desired result is:
id life_day value
a1 4 45
b2 4 28
How can I do that? I tried df[df["life_day"] == 90].groupby("id").sum() but it brings wrong results.
Your approach almost works, but I don't know why you wrote == 90 in df["life_day"] == 90 when you want the rows up to life_day 4 (i.e. <= 4), and it looks like you want the max of life_day, not the sum.
df[df['life_day'] <= 4].groupby('id').agg({'life_day': 'max', 'value': 'sum'})
life_day value
id
a1 4 45
b2 4 28
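If you want id back as a regular column, as in the desired output, you can tack .reset_index() onto the end:
df[df['life_day'] <= 4].groupby('id').agg({'life_day': 'max', 'value': 'sum'}).reset_index()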
Use the pandas where condition to mask the non-matching rows, then groupby and agg:
df.where(df['life_day'].le(4)).groupby('id').agg({'life_day':'last','value':'sum'}).reset_index()
id life_day value
0 a1 4.0 45.0
1 b2 4.0 28.0

Replace value with the value of nearest neighbor in Pandas dataframe

I have a problem with getting the nearest values for some rows in a pandas dataframe and filling another column with values from those rows.
The data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v equals 100, I need to replace that 100 with the match_v from the row whose r_value is closest to the r_value of the originating row (the one where match_v equals 100), but only within the same group (grouped by id).
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried creating lead and lag columns with shift and then finding the differences, but it doesn't work well and it somehow messed up values that were already good.
I haven't tried anything else because I really don't have any other idea.
Any help or hint is welcome, and if you need any additional info, I'm here.
Thanks in advance.
This looks more like a job for merge_asof:
# keep only rows whose match_v is already valid (not the 100 placeholder)
s = df.loc[df.match_v != 100]
# for each row, find the valid row with the nearest r_value within the same id
s = pd.merge_asof(df.sort_values('r_value'), s.sort_values('r_value'),
                  on='r_value', by='id', direction='nearest')
df['match_v'] = df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Here is another way using numpy broadcasting, built to speed up the calculation:
l = []
for x, y in df.groupby('id'):
    s1 = y.r_value.values
    # pairwise absolute distances between the group's r_values
    s = abs((s1 - s1[:, None])).astype(float)
    # mask the diagonal and lower triangle so each element only considers earlier rows as candidates
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s = s.argmin(0)
    s2 = y.match_v.values
    # keep the neighbour's match_v only for the rows that need replacing
    l.append(s2[s][s2 == 100])
df.loc[df.match_v == 100, 'match_v'] = np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
    # for every row in the group that holds the 100 placeholder
    for i in x.index[x['match_v'] == 100]:
        # absolute distance of every r_value in the group to the row being fixed
        diff = (x['r_value'] - x['r_value'].loc[i]).abs()
        exclude = x.index.isin([i])
        closer_idx = diff[~exclude].idxmin()
        x.loc[i, 'match_v'] = x.loc[closer_idx, 'match_v']
    return x
ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value earlier in the group by the time a 100 is encountered, you can simply carry the last valid match_v forward per id:
m = dict()
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        # reuse the last valid match_v seen for this id
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]
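A roughly equivalent vectorised sketch of that "carry the last valid value forward per id" idea (assuming 100 is the only placeholder and the first row of each id is valid):
df['match_v'] = (df['match_v']
                 .mask(df['match_v'].eq(100))   # treat 100 as missing
                 .groupby(df['id'])
                 .ffill()                       # fill with the last valid value in the group
                 .astype(int))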

Conditionally count values in a pandas groupby object

I have a pandas.core.groupby.DataFrameGroupBy object where I am trying to count the number of rows where a value for TOTAL_FLOOR_AREA is > 30. I can count the number of rows for each dataframe in the groupby object using:
import numpy as np
grouped = master_lsoa.groupby('lsoa11')
grouped.aggregate(np.count_nonzero).TOTAL_FLOOR_AREA
But how do I conditionally count rows where the value for TOTAL_FLOOR_AREA is greater than 30?
Sam
I think you need:
np.random.seed(6)
N = 15
master_lso = pd.DataFrame({'lsoa11': np.random.randint(4, size=N),
                           'TOTAL_FLOOR_AREA': np.random.choice([0, 30, 40, 50], size=N)})
master_lso['lsoa11'] = 'a' + master_lso['lsoa11'].astype(str)
print (master_lso)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
2 30 a3
3 0 a0
4 40 a2
5 0 a1
6 30 a3
7 0 a2
8 40 a0
9 0 a2
10 0 a1
11 50 a1
12 50 a3
13 40 a1
14 30 a1
First filter the rows by the condition with boolean indexing - doing this before grouping is faster because there are fewer rows:
df = master_lso[master_lso['TOTAL_FLOOR_AREA'] > 30]
print (df)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
4 40 a2
8 40 a0
11 50 a1
12 50 a3
13 40 a1
Then groupby and aggregate size:
df1 = df.groupby('lsoa11')['TOTAL_FLOOR_AREA'].size().reset_index(name='Count')
print (df1)
lsoa11 Count
0 a0 1
1 a1 3
2 a2 2
3 a3 1
You could also construct a new column indicating where the condition is met and sum it up (reusing #jezrael's dataframe):
master_lso.assign(Large_Enough=lambda x: x["TOTAL_FLOOR_AREA"] > 30)\
    .groupby('lsoa11')["Large_Enough"].sum().reset_index()
Note that True values are interpreted as 1, so the sum provides the corresponding count here.
The advantage over #jezrael's solution is that you can still sum up the total area per group.
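For instance, a sketch that keeps both the count and the total area in one pass (Total_Area is just an illustrative column name):
master_lso.assign(Large_Enough=lambda x: x["TOTAL_FLOOR_AREA"] > 30)\
    .groupby('lsoa11')\
    .agg(Count=('Large_Enough', 'sum'), Total_Area=('TOTAL_FLOOR_AREA', 'sum'))\
    .reset_index()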

How to count the frequency of values occurring in a particular row across all columns

Suppose I have a data frame with four columns and part of this is like:
C1 C2 C3 C4
60 60 60 60
59 59 58 58
0 0 0 0
12 13 13 11
Now I want to create four columns, each giving the frequency of the corresponding value within that row across all four columns, so the new columns will look like:
F1 F2 F3 F4
4 4 4 4
2 2 2 2
2 2 2 2
1 2 2 1
In cell (1,1) the value is 4 because the value 60 appears in all four columns of that row.
In cell (4,1) the value is 1 because 12 appears in no other column of that row.
I need to calculate the features F1, F2, F3, F4 and add them to the pandas data frame.
Use apply with axis=1 to process by rows, with map applying the per-row frequencies computed by value_counts:
df = df.apply(lambda x: x.map(x.value_counts()),axis=1)
print (df)
C1 C2 C3 C4
0 4 4 4 4
1 2 2 2 2
2 4 4 4 4
3 1 2 2 1
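If the frequencies should land in new F1..F4 columns next to the originals, a sketch (assuming df still holds the original C1-C4 values):
freq = df.apply(lambda x: x.map(x.value_counts()), axis=1)
freq.columns = ['F1', 'F2', 'F3', 'F4']
df = df.join(freq)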

How to count the number of following rows in pandas (new)

Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle that data.
But I have run into some trouble and an exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to cut df at each row where B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count each df's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could this be done?
I would be happy if someone could tell me how to handle this kind of problem.
You can aggregate by a custom grouping Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
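A variant with named aggregation (a sketch, assuming pandas >= 0.25 and using plain column names a and number):
df1 = (df.groupby(df.B.str.startswith('a').cumsum())['B']
         .agg(a='first', number='size')
         .reset_index(drop=True))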
