Calculate the sum of values replacing NaN - python

I have a data frame with some NaNs in column B.
df = pd.DataFrame({
'A':[654,987,321,654,987,15,98,338],
'B':[987,np.nan,741,np.nan, 65,35,94,np.nan]})
df
A B
0 654 987.0
1 987 NaN
2 321 741.0
3 654 NaN
4 987 65.0
5 15 35.0
6 98 94.0
7 338 NaN
I replace the NaNs in B with the numbers from A:
df.B.fillna(df.A, inplace = True)
df
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0
What's the easiest way to calculate the sum of the values that have replaced the NaNs in B?

You can use Series.isna() with .loc[] to filter column A down to the rows where column B is null, and then sum:
df.loc[df['B'].isna(),'A'].sum()
Alternative:
df['B'].fillna(df['A']).sum() - df['B'].sum()
Note: you should do this before the inplace operation, or preferably create a copy saved under a different variable for later reference.
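Putting the first approach together, a minimal runnable sketch (data taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})

# Sum the A-values on the rows where B is NaN, i.e. exactly the values
# that a later fillna(df.A) would write into B
replaced_sum = df.loc[df['B'].isna(), 'A'].sum()
print(replaced_sum)  # 987 + 654 + 338 = 1979

# The alternative gives the same number
assert df['B'].fillna(df['A']).sum() - df['B'].sum() == replaced_sum
```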

Try the function math.isnan to check for NaN values.
import numpy as np
import pandas as pd
import math
df = pd.DataFrame({
'A':[654,987,321,654,987,15,98,338],
'B':[987,np.nan,741,np.nan, 65,35,94,np.nan]})
for i in range(len(df['B'])):
    if math.isnan(df['B'][i]):
        df.loc[i, 'B'] = df.loc[i, 'A']  # .loc avoids chained assignment
print(df)
Output :
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0
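The loop above works, but the same fill can be expressed without iterating; a sketch of a vectorized variant using a boolean mask (same data as the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})

mask = df['B'].isna()                  # rows where B is missing
df.loc[mask, 'B'] = df.loc[mask, 'A']  # one labelled assignment, no loop
print(df['B'].tolist())
```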

Related

How to shift location of columns based on a condition of cells of other columns in python

I need a bit of help with python. Here is what I want to achieve.
I have a dataset that looks like below:
import pandas as pd
# define data
data = {'A': [55, 'g', 35, 10, 'pj'],
        'B': [454, 27, 895, 3545, 34],
        'C': [4, 786, 7, 3, 896],
        'Phone Number': [123456789, 7, 3456789012, 4567890123, 1],
        'another_col': [None, 234567890, None, None, 215478565]}
pd.DataFrame(data)
A B C Phone Number another_col
0 55 454 4 123456789 None
1 g 27 786 7 234567890.0
2 35 895 7 3456789012 None
3 10 3545 3 4567890123 None
4 pj 34 896 1 215478565.0
I have extracted this data from a PDF, and unfortunately the extraction adds some random strings, as shown above in the dataframe. I want to check whether any cell in any column contains a string or other non-numeric value. If so, the string should be deleted and the entire row shifted to the left. The desired output is shown below:
A B C Phone Number another_col
0 55 454 4 1.234568e+08 None
1 27 786 7 2.345679e+08 None
2 35 895 7 3.456789e+09 None
3 10 3545 3 4.567890e+09 None
4 34 896 1 2.154786e+08 None
I would really appreciate your help.
One way is to use to_numeric to coerce each value to numeric (strings become NaN), then shift each row leftward by dropping the NaNs:
out = (df.apply(pd.to_numeric, errors='coerce')
         .apply(lambda x: pd.Series(x.dropna().tolist(),
                                    index=df.columns.drop('another_col')),
                axis=1))
Output:
A B C Phone Number
0 55.0 454.0 4.0 1.234568e+08
1 27.0 786.0 7.0 2.345679e+08
2 35.0 895.0 7.0 3.456789e+09
3 10.0 3545.0 3.0 4.567890e+09
4 34.0 896.0 1.0 2.154786e+08
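End to end, the to_numeric/dropna approach can be run as below (data from the question; non-numeric cells become NaN and each row is rebuilt from its remaining values):

```python
import pandas as pd

data = {'A': [55, 'g', 35, 10, 'pj'],
        'B': [454, 27, 895, 3545, 34],
        'C': [4, 786, 7, 3, 896],
        'Phone Number': [123456789, 7, 3456789012, 4567890123, 1],
        'another_col': [None, 234567890, None, None, 215478565]}
df = pd.DataFrame(data)

out = (df.apply(pd.to_numeric, errors='coerce')   # strings -> NaN
         .apply(lambda x: pd.Series(x.dropna().tolist(),
                                    index=df.columns.drop('another_col')),
                axis=1))                          # shift each row left
print(out['A'].tolist())  # [55.0, 27.0, 35.0, 10.0, 34.0]
```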
You can create a boolean mask, shift the masked rows one column to the left, and recombine with pd.concat:
m = pd.to_numeric(df['A'], errors='coerce').isna()
pd.concat([df.loc[~m], df.loc[m].shift(-1, axis=1)]).sort_index()
Output:
A B C Phone Number another_col
0 55 454 4 1.234568e+08 NaN
1 27 786 7 2.345679e+08 NaN
2 35 895 7 3.456789e+09 NaN
3 10 3545 3 4.567890e+09 NaN
4 34 896 1 2.154786e+08 NaN

Pandas append DataFrame2 ROW to DataFrame1 ROW

I want to append rows from the second DataFrame (df2) to the first DataFrame (df1) depending on whether the column "isValid" in df1 is [T]rue.
I know how to iterate over the df1 column and search for True values, but I don't know how to easily append rows from the second DataFrame. My real data has around 1000 rows and 40 columns, so I need the operation to run automatically.
import pandas
df1 = pandas.read_csv('df1.csv', sep=';')
df2 = pandas.read_csv('df2.csv', sep=';')
print(df1.to_string(), '\n')
print(df2.to_string(), '\n')
columnSeriesObj = df1.iloc[:, 2]
n = 0
k = 0
for i in columnSeriesObj:
    if i == "T":
        print("True in row number", k)
        # APPEND n ROW from df2 to k ROW from df1
        n += 1
    k += 1
print('\n', df1.to_string())
Here are some test values:
df1.csv
DataA;DataB;isValid
1568;1104;F
1224;1213;F
1676;1246;F
1279;1489;T
1437;1890;T
1705;1007;F
1075;1720;F
1361;1983;F
1966;1751;F
1938;1564;F
1894;1684;F
1189;1803;F
1275;1138;F
1085;1748;T
1337;1775;T
1719;1975;F
1045;1187;F
1426;1757;F
1410;1363;F
1405;1025;F
1699;1873;F
1777;1464;F
1925;1310;T
df2.csv
Nr;X;Y;Z;A ;B;C
1;195;319;18;qwe;hjk;wsx
2;268;284;23;rty;zxc;edc
3;285;277;36;uio;vbn;rfv
4;143;369;34;asd;mlp;tgb
5;290;247;16;fgh;qaz;yhn
I want df1 to look like this after appending (screenshot from Excel):
Thank you for any suggestions! :D
You can filter the index values in df1 where the column isValid equals T, set those values as the index of df2, and finally join it with df1:
m = df1['isValid'].eq('T')
idx = m[m].index[:len(df2)]
df1.join(df2.set_index(idx)).fillna('')
DataA DataB isValid Nr X Y Z A B C
0 1568 1104 F
1 1224 1213 F
2 1676 1246 F
3 1279 1489 T 1 195 319 18 qwe hjk wsx
4 1437 1890 T 2 268 284 23 rty zxc edc
5 1705 1007 F
6 1075 1720 F
7 1361 1983 F
8 1966 1751 F
9 1938 1564 F
10 1894 1684 F
11 1189 1803 F
12 1275 1138 F
13 1085 1748 T 3 285 277 36 uio vbn rfv
14 1337 1775 T 4 143 369 34 asd mlp tgb
15 1719 1975 F
16 1045 1187 F
17 1426 1757 F
18 1410 1363 F
19 1405 1025 F
20 1699 1873 F
21 1777 1464 F
22 1925 1310 T 5 290 247 16 fgh qaz yhn
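The same technique on a small inline sample (a shortened, hypothetical version of the CSVs above, so the sketch is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'DataA': [1568, 1224, 1279, 1437, 1705],
                    'DataB': [1104, 1213, 1489, 1890, 1007],
                    'isValid': ['F', 'F', 'T', 'T', 'F']})
df2 = pd.DataFrame({'Nr': [1, 2], 'X': [195, 268]})

m = df1['isValid'].eq('T')      # True on the rows that receive df2 data
idx = m[m].index[:len(df2)]     # index labels of those rows, one per df2 row
out = df1.join(df2.set_index(idx)).fillna('')
print(out['Nr'].tolist())       # ['', '', 1.0, 2.0, '']
```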
I suggest the following:
I created some dummy data, similar to yours:
import pandas as pd
import random
df = pd.DataFrame({"a": list(range(20)), "b": [random.choice(("T", "F")) for _ in range(20)]})
df2 = pd.DataFrame({"value1": list(range(5)), "nr": list(range(5))})
First you create a new column in the first dataframe that holds the incrementing ID ("Nr"). To do so, use the count generator from itertools.
from itertools import count
counter = count(start=1)
df["id"] = df.apply(lambda row: next(counter) if row["b"] == "T" else None, axis=1)
After that you can perform a join with the merge method.
df.merge(df2, left_on="id", right_on="nr", how="outer")
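A deterministic sketch of this counter-plus-merge idea (fixed b values instead of random.choice so the result is reproducible, and a left join so df keeps its row order; the dummy values are made up for illustration):

```python
import pandas as pd
from itertools import count

df = pd.DataFrame({'a': range(6), 'b': ['F', 'T', 'F', 'T', 'T', 'F']})
df2 = pd.DataFrame({'value1': [10, 20, 30], 'nr': [1, 2, 3]})

counter = count(start=1)           # incrementing ID for each "T" row
df['id'] = df.apply(lambda row: next(counter) if row['b'] == 'T' else None,
                    axis=1)
out = df.merge(df2, left_on='id', right_on='nr', how='left')
print(out['value1'].fillna(0).tolist())  # [0.0, 10.0, 0.0, 20.0, 30.0, 0.0]
```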
How about something like this:
(e.g. first find the overlapping index-values and then join the dataframes)
import pandas as pd
import numpy as np
df1 = pd.read_csv("df1.csv", sep=';')
df2 = pd.read_csv(r"df2.csv", sep=';')
# find intersecting indices
useidx = np.intersect1d(df2.index,
                        df1[df1.isValid == 'T'].index)
# join relevant values
df_joined = df1.join(df2.loc[useidx])
df_joined then looks like this:
>>> DataA DataB isValid Nr X Y Z A B C
>>> 0 1568 1104 F NaN NaN NaN NaN NaN NaN NaN
>>> 1 1224 1213 F NaN NaN NaN NaN NaN NaN NaN
>>> 2 1676 1246 F NaN NaN NaN NaN NaN NaN NaN
>>> 3 1279 1489 T 4.0 143.0 369.0 34.0 asd mlp tgb
>>> 4 1437 1890 T 5.0 290.0 247.0 16.0 fgh qaz yhn
>>> 5 1705 1007 F NaN NaN NaN NaN NaN NaN NaN
>>> 6 1075 1720 F NaN NaN NaN NaN NaN NaN NaN
>>> 7 1361 1983 F NaN NaN NaN NaN NaN NaN NaN

Python: how to merge and divide two dataframes?

I have a dataframe df containing the population p assigned to some buildings b
df
p b
0 150 3
1 345 7
2 177 4
3 267 2
and a dataframe df1 that associates some other buildings b1 to the buildings in df
df1
b1 b
0 17 3
1 9 7
2 13 7
I want to assign to the buildings that have an association in df1 the population divided by the number of buildings. In this way we generate df2, which assigns a population of 150/2=75 to buildings 3 and 17 and a population of 345/3=115 to buildings 7, 9 and 13.
df2
p b
0 75 3
1 75 17
2 115 7
3 115 9
4 115 13
5 177 4
6 267 2
IIUC, you can merge both dfs on b, then stack() with some cleansing, and finally group on p, transform with count, and divide p by that count to get the divided values of p:
m = (df.merge(df1, on='b', how='left').set_index('p').stack()
       .reset_index(name='b').drop_duplicates()
       .drop(columns='level_1').sort_values('p'))
m.p = m.p / m.groupby('p')['p'].transform('count')
print(m.sort_index())
p b
0 75.0 3.0
1 75.0 17.0
2 115.0 7.0
3 115.0 9.0
5 115.0 13.0
6 177.0 4.0
7 267.0 2.0
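A runnable version of this chain, with an explicit dropna() after stack() so NaN pairs are discarded regardless of pandas version, and drop(columns=...) instead of the deprecated positional axis:

```python
import pandas as pd

df = pd.DataFrame({'p': [150, 345, 177, 267], 'b': [3, 7, 4, 2]})
df1 = pd.DataFrame({'b1': [17, 9, 13], 'b': [3, 7, 7]})

m = (df.merge(df1, on='b', how='left')       # attach associated buildings
       .set_index('p').stack().dropna()      # one row per (p, building)
       .reset_index(name='b').drop_duplicates()
       .drop(columns='level_1').sort_values('p'))
m['p'] = m['p'] / m.groupby('p')['p'].transform('count')
print(sorted(m['p'].tolist()))  # [75.0, 75.0, 115.0, 115.0, 115.0, 177.0, 267.0]
```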
Another way is using pd.concat. After that, fillna b1 and p individually. Next, transform with mean and assign the filled b1 to the final dataframe:
df2 = pd.concat([df, df1], sort=True).sort_values('b')
df2['b1'] = df2.b1.fillna(df2.b)
df2['p'] = df2.p.fillna(0)
df2.groupby('b').p.transform('mean').to_frame().assign(b=df2.b1).reset_index(drop=True)
Out[159]:
p b
0 267.0 2.0
1 75.0 3.0
2 75.0 17.0
3 177.0 4.0
4 115.0 7.0
5 115.0 9.0
6 115.0 13.0

Calculations between different rows

I'm trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried the .iloc and shift functions but did not manage to get the result I need.
Here's a simple example to explain better what I want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column ('d') that holds the diff between the values in column 'a', but only where the diff between the values in column 'b' equals 1; otherwise the value should be null, like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try it like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
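Self-contained, the loc/diff() approach looks like this (data from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [101, 211, 351, 401, 631],
                   'b': [1, 2, 3, 5, 6],
                   'c': ['aaa', 'dcd', 'yyy', 'lol', 'zzz']})

# Keep the row-to-row diff of 'a' only where 'b' increased by exactly 1
df.loc[df['b'].diff() == 1, 'd'] = df['a'].diff()
print(df['d'].tolist())  # [nan, 110.0, 140.0, nan, 230.0]
```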
You can also create a group key and take the diff within each group:
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64

Problems with combining columns from dataframes in pandas

I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
I need to copy the values from df2 to df1 based on the equality of the code and scale columns.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be a lot of columns like R1 and R2, and I cannot check each one separately, so I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.
What do you want to happen if the two dataframes both have values for R1/R2? If you want to keep df1's values, you could do:
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2's values, just do the fillna the other way round. To combine them in some other way, please clarify the question!
Try this:
pd.concat([df,df1],axis=0).sort_values(['code','scale']).drop_duplicates(['code','scale'],keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN
This is a good situation for combine_first. It replaces the nulls in the calling dataframe with values from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
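A minimal sketch of the combine_first approach, trimmed to one value column so it is self-contained (an R2 column would behave identically):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'code': [121, 121, 313, 313],
                    'scale': [1, 2, 1, 2],
                    'R1': [80, np.nan, 60, np.nan]})
df2 = pd.DataFrame({'code': [121, 313], 'scale': [2, 2], 'R1': [30, 15]})

# Align both frames on (code, scale); nulls in df1 are filled from df2
out = (df1.set_index(['code', 'scale'])
          .combine_first(df2.set_index(['code', 'scale']))
          .reset_index())
print(out['R1'].tolist())  # [80.0, 30.0, 60.0, 15.0]
```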
Other solutions
with fillna
df.set_index(['code', 'scale']).fillna(df1.set_index(['code', 'scale'])).reset_index()
with add - a bit faster
df.set_index(['code', 'scale']).add(df1.set_index(['code', 'scale']), fill_value=0)
