I am trying to fill the empty columns of df1 with values from df2, which is created inside a while loop that requests data through an API. I want to keep all rows of df1 and their order.
df1:
df = pd.DataFrame({'A': ['c1', 'c1', 'c2','c2', 'c3', 'c3'], 'B': ['y1', 'y2', 'y1', 'y2', 'y1', 'y2'], 'C': ["","","","","",""], 'D': ["","","","","",""]})
A B C D
0 c1 y1
1 c1 y2
2 c2 y1
3 c2 y2
4 c3 y1
5 c3 y2
df2:
values_for_df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c3'], 'B': ['y1', 'y2', 'y1', 'y2'], 'C': [4, 5, 4, 6], 'D': [7, 8, 9,""]})
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c3 y2 6
Output:
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c3 y2 6
4 c3 y1
5 c3 y2
Wanted output:
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c2 y2
4 c3 y1
5 c3 y2 6
This process will be repeated thousands of times. Can someone help me with this, share ideas or alternative approaches, or explain why the actual output differs from my expected output?
Try:
df = df.set_index(['A','B'])
values_for_df = values_for_df.set_index(['A','B'])
df.update(values_for_df, filter_func=lambda x: x=='')
df.reset_index()
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c2 y2
4 c3 y1
5 c3 y2 6
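On why the actual output differed: DataFrame.update aligns on index labels, so with the default integer index, rows 0 to 3 of values_for_df overwrite rows 0 to 3 of df regardless of what A and B contain. As an alternative to update, a left merge on the key columns should also keep every row of df in its original order. This is only a minimal sketch, assuming each (A, B) pair appears at most once in values_for_df; the name merged is made up for illustration, and unmatched cells come back as NaN rather than empty strings.

```python
import pandas as pd

df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3'],
                   'B': ['y1', 'y2', 'y1', 'y2', 'y1', 'y2'],
                   'C': [""] * 6, 'D': [""] * 6})
values_for_df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c3'],
                              'B': ['y1', 'y2', 'y1', 'y2'],
                              'C': [4, 5, 4, 6], 'D': [7, 8, 9, ""]})

# Keep only the key columns of df, then pull C and D in from values_for_df.
# how='left' keeps every row of df, in df's order.
merged = df[['A', 'B']].merge(values_for_df, on=['A', 'B'], how='left')
print(merged)
```

If empty strings are preferred over NaN, calling fillna("") on the merged frame afterwards is one option.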
I have a data frame where some rows have one ID and one related ID. In the example below, a1 and a2 are related (say to the same person) while b and c don't have any related rows.
import pandas as pd
test = pd.DataFrame(
[['a1', 1, 'a2'],
['a1', 2, 'a2'],
['a1', 3, 'a2'],
['a2', 4, 'a1'],
['a2', 5, 'a1'],
['b', 6, ],
['c', 7, ]],
columns=['ID1', 'Value', 'ID2']
)
test
ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a2 4 a1
4 a2 5 a1
5 b 6 None
6 c 7 None
What I need to achieve is to add a column containing the sum of all values for related rows. In this case, the desired output should be like below. Is there a way to get this, please?
ID1  Value  ID2  Group by ID1 and ID2
a1   1      a2   15
a1   2      a2   15
a1   3      a2   15
a2   4      a1   15
a2   5      a1   15
b    6           6
c    7           7
Note that I learnt how to use groupby to get the sum for ID1 (from this question), but not for 'ID1' and 'ID2' together.
test['Group by ID1'] = test.groupby("ID1")["Value"].transform("sum")
test
ID1 Value ID2 Group by ID1
0 a1 1 a2 6
1 a1 2 a2 6
2 a1 3 a2 6
3 a2 4 a1 9
4 a2 5 a1 9
5 b 6 None 6
6 c 7 None 7
Update
I think I can still get this done with a for loop, as below, but I am wondering whether there is a non-loop way. Thanks.
bottle = pd.DataFrame().reindex_like(test)
bottle['ID1'] = test['ID1']
bottle['ID2'] = test['ID2']
for index, row in bottle.iterrows():
    bottle.loc[index, "Value"] = test[test['ID1'] == row['ID1']]['Value'].sum() + \
        test[test['ID1'] == row['ID2']]['Value'].sum()
print(bottle)
ID1 Value ID2
0 a1 15.0 a2
1 a1 15.0 a2
2 a1 15.0 a2
3 a2 15.0 a1
4 a2 15.0 a1
5 b 6.0 None
6 c 7.0 None
A possible solution would be to sort the pairs in ID1 and ID2, such that they always appear in the same order.
Swapping the IDs:
s = df['ID1'] > df['ID2']
df.loc[s, ['ID1', 'ID2']] = df.loc[s, ['ID2', 'ID1']].values
print(df)
>>> ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a1 4 a2
4 a1 5 a2
5 b 6 None
6 c 7 None
Then we can do a simple groupby:
df['RSUM'] = df.groupby(['ID1', 'ID2'], dropna=False)['Value'].transform("sum")
print(df)
>>> ID1 Value ID2 RSUM
0 a1 1 a2 15
1 a1 2 a2 15
2 a1 3 a2 15
3 a1 4 a2 15
4 a1 5 a2 15
5 b 6 None 6
6 c 7 None 7
Note the dropna=False to not discard IDs that have no pairing.
If you do not want to permanently swap the IDs, you can just create a temporary dataframe.
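One way to avoid touching ID1 and ID2 at all is to group by a temporary, order-independent key instead. The following is a minimal sketch rather than a definitive implementation: it assumes the IDs are strings that never contain the separator "|", it replaces missing IDs with an empty string, and the names pairs and key are made up for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    [['a1', 1, 'a2'], ['a1', 2, 'a2'], ['a1', 3, 'a2'],
     ['a2', 4, 'a1'], ['a2', 5, 'a1'], ['b', 6, None], ['c', 7, None]],
    columns=['ID1', 'Value', 'ID2'],
)

# Sort each (ID1, ID2) pair within its row so that (a1, a2) and (a2, a1)
# end up identical; missing IDs become '' first so they can be sorted.
pairs = np.sort(test[['ID1', 'ID2']].fillna('').to_numpy(dtype=str), axis=1)

# Build one string key per row (assumes '|' never occurs inside an ID).
key = pd.Series(['|'.join(p) for p in pairs], index=test.index)

# Group by the temporary key; ID1 and ID2 themselves stay untouched.
test['RSUM'] = test.groupby(key)['Value'].transform('sum')
print(test)
```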
Hi all, I have created a dummy DataFrame:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3','K0', 'K1', 'K2', 'K3'],
'C': [7, 11, 9, 13,9, 6, 10, 5],
'D': [1, 2, 1, 2,2, 1, 2,1]})
result = pd.merge(left, right, on='key')
output:
key A B C D
0 K0 A0 B0 7 1
1 K0 A0 B0 9 2
2 K1 A1 B1 11 2
3 K1 A1 B1 6 1
4 K2 A2 B2 9 1
5 K2 A2 B2 10 2
6 K3 A3 B3 13 2
7 K3 A3 B3 5 1
The problem I am trying to solve is that I want to group the entries by the key value and perform a mathematical operation on each group. You will notice that there are two entries per key.
For each key, if the top value in column D is less than the bottom value, subtract the bottom column C entry from the top one. In this case the math would only be applied to index = [4,5,6,7] and the calculations would be
9-10 = -1
13-5 = 8
Ideally these results would be stored in a list. I know the data structure is not ideal, but this is what I have been given to work with, and I have no idea how to approach it.
# perform checking on column D (and add "check" column)
result["check"] = result.groupby('key')['D'].diff(periods=-1)
result
key A B C D check
0 K0 A0 B0 7 1 -1.0
1 K0 A0 B0 9 2 NaN
2 K1 A1 B1 11 2 1.0
3 K1 A1 B1 6 1 NaN
4 K2 A2 B2 9 1 -1.0
5 K2 A2 B2 10 2 NaN
6 K3 A3 B3 13 2 1.0
7 K3 A3 B3 5 1 NaN
# perform difference between column C values (and add it as a column)
result["diff"] = result.groupby('key')['C'].diff(periods=-1)
result
key A B C D check diff
0 K0 A0 B0 7 1 -1.0 -2.0
1 K0 A0 B0 9 2 NaN NaN
2 K1 A1 B1 11 2 1.0 5.0
3 K1 A1 B1 6 1 NaN NaN
4 K2 A2 B2 9 1 -1.0 -1.0
5 K2 A2 B2 10 2 NaN NaN
6 K3 A3 B3 13 2 1.0 8.0
7 K3 A3 B3 5 1 NaN NaN
# filter dataframe to get only desired results
result[(result.check < 0)]['diff']
0 -2.0
4 -1.0
# results as a list
list(result[(result.check < 0)]['diff'])
[-2.0, -1.0]
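The same steps can probably be done without adding helper columns to result. The sketch below rebuilds the example data and keeps the two per-key differences as standalone Series; the names d_diff, c_diff and values are made up for illustration, and it assumes exactly two rows per key, as in the example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K0', 'K1', 'K2', 'K3'],
                      'C': [7, 11, 9, 13, 9, 6, 10, 5],
                      'D': [1, 2, 1, 2, 2, 1, 2, 1]})
result = pd.merge(left, right, on='key')

grouped = result.groupby('key')
# top minus bottom within each key; the bottom row of each key gets NaN
d_diff = grouped['D'].diff(periods=-1)
c_diff = grouped['C'].diff(periods=-1)

# keep the C differences only where the top D value is smaller than the bottom one
values = c_diff[d_diff < 0].tolist()
print(values)
```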
I'm trying to join two DataFrames by index that can contain columns in common and I only want to add one to the other if that specific value is NaN or doesn't exist. I'm using the pandas example, so I've got:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
as
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
and
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
'D': ['D2p', 'D3p', 'D6p', 'D7p'],
'F': ['F2p', 'F3p', 'F6p', 'F7p']},
index=[2, 3, 6, 7])
as
B D F
2 B2p D2p F2p
3 B3p D3p F3p
6 B6p D6p F6p
7 B7p D7p F7p
and the searched result is:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
This is a good use case for combine_first, where the row and column indices of the resulting dataframe will be the union of the two, i.e. in the absence of an index in one of the dataframes, the value from the other is used (same behaviour as if it contained a NaN):
df1.combine_first(df4)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
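Which frame takes priority depends on which one combine_first is called on: the caller's non-null values win, and the argument only fills the gaps. A small sketch of that, reusing the frames from the question (the names result and reversed_priority are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])

# df1 is the caller, so its values are kept wherever both frames have one
result = df1.combine_first(df4)

# swapping the roles makes df4's values win on the overlapping cells
reversed_priority = df4.combine_first(df1)

print(result.loc[2, 'B'], reversed_priority.loc[2, 'B'])
```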
I am new to Python and am writing some code.
I want to search for a word in a column and, if a match is found, insert an empty row below it.
My attempt so far (pseudocode):
If df.columnname=='total':
Df.insert
Could someone please help me?
Do give the following a try:
>>>df
id Label
0 1 A
1 2 B
2 3 B
3 4 B
4 5 A
5 6 B
6 7 A
7 8 A
8 9 C
9 10 C
10 11 C
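# Create a separate dataframe with the id of the rows to be duplicated
df1 = df.loc[df['Label']=='B', ['id']]
# Join it back, keep each copy right below its original row (stable sort), and reset the index
df = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)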
>>>df
id Label
0 1 A
1 2 B
2 2 NaN
3 3 B
4 3 NaN
5 4 B
6 4 NaN
7 5 A
8 6 B
9 6 NaN
10 7 A
11 8 A
12 9 C
13 10 C
14 11 C
Use the code below:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
count = 0  # number of blank rows inserted so far
for index, row in df1.iterrows():
    if row["Column1"] == 'total':
        # insert a blank row just below the match, shifted by the earlier insertions
        df1 = pd.DataFrame(np.insert(df1.values, index+1+count, values=[" "] * len(df1.columns), axis=0),
                           columns=df1.columns)
        count += 1
print(df1)
Input:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2 total B2 C2 D2
3 A3 B3 C3 D3
Output:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2
3 total B2 C2 D2
4
5 A3 B3 C3 D3
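For larger frames, a loop-free variant of the same idea may be worth trying: build the blank rows in one go, give them the index labels of the matching rows, and let a stable sort place each one right below its match. This is a sketch under the assumption that the index is unique and already sorted; the names mask, blanks and out are made up for illustration, and it uses empty strings rather than single spaces for the blank cells.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

# rows after which a blank row should appear
mask = df1['Column1'].eq('total')

# one blank row per match, carrying the same index label as the match
blanks = pd.DataFrame('', index=df1.index[mask], columns=df1.columns)

# a stable sort keeps each original row ahead of the blank row sharing its label
out = (pd.concat([df1, blanks])
         .sort_index(kind='mergesort')
         .reset_index(drop=True))
print(out)
```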
I want to update DataFrame X with values from DataFrame Y.
X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']})
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']})
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
And the result to be:
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Of course, my real DataFrame is much bigger.
1. Both DataFrames have the same index
This is the case you presented in the example given in your question.
You might want to use the update method:
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
It also works if lines are in a different order in X and Y:
>>> Y = pd.DataFrame({'A': ['A1', 'A0'],
'B': ['B1', 'B0'],
'C': ['C1xx', 'C0xx'],
'D': ['D1xx', 'D0xx']},
index=[1,0])
>>> Y
A B C D
1 A1 B1 C1xx D1xx
0 A0 B0 C0xx D0xx
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
2. Different indexes
If Y has a different index:
>>> Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']},
index=[2,1])
>>> Y
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
You can still use update if you can find another column usable as an index (identifying the lines so that they match the lines to be replaced). I take the example of the "A" column but a multiple index would work as well.
>>> X2, Y2 = X.set_index("A"), Y.set_index("A")
>>> X2.update(Y2)
>>> X2.reset_index(inplace=True)
>>> X2
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
I think you need combine_first with set_index if you need to fill in missing values matched on the A and B columns of both DataFrames:
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Unfortunately, update does not work well here:
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']}, index=[2,1])
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y)
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
X.update(Y)
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1xx D1xx
2 A0 B0 C0xx D0xx
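# update() modifies the temporary frame returned by set_index() in place, so X itself is not changed
X.set_index(['A','B']).update(Y.set_index(['A','B']))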
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
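If update is still preferred over combine_first, the indexed frame has to be kept in a variable, because update works in place and returns None. A minimal sketch of that, assuming the (A, B) pairs are unique in both frames; the name X2 is made up for illustration:

```python
import pandas as pd

X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                  'B': ['B0', 'B1', 'B2'],
                  'C': ['C0', 'C1', 'C2'],
                  'D': ['D0', 'D1', 'D2']})
Y = pd.DataFrame({'A': ['A0', 'A1'],
                  'B': ['B0', 'B1'],
                  'C': ['C0xx', 'C1xx'],
                  'D': ['D0xx', 'D1xx']}, index=[2, 1])

# keep a reference to the indexed copy so the in-place update is not lost
X2 = X.set_index(['A', 'B'])
X2.update(Y.set_index(['A', 'B']))

# move A and B back to ordinary columns
X = X2.reset_index()
print(X)
```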