Hi all, I have created a dummy DataFrame:
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K0', 'K1', 'K2', 'K3'],
                      'C': [7, 11, 9, 13, 9, 6, 10, 5],
                      'D': [1, 2, 1, 2, 2, 1, 2, 1]})
result = pd.merge(left, right, on='key')
Output:
key A B C D
0 K0 A0 B0 7 1
1 K0 A0 B0 9 2
2 K1 A1 B1 11 2
3 K1 A1 B1 6 1
4 K2 A2 B2 9 1
5 K2 A2 B2 10 2
6 K3 A3 B3 13 2
7 K3 A3 B3 5 1
The problem I am trying to solve: I want to group the entries by the key value and perform a mathematical operation on each group. You will notice that there are two entries per key.
For each key, if the top value in column D is less than the bottom value, subtract the bottom entry in column C from the top one. In this case the math would only be applied to index = [4, 5, 6, 7], and the calculations would be:
9 - 10 = -1
13 - 5 = 8
Ideally these results would be stored in a list. I know the data structure is not ideal, but this is what I have been given to work with, and I have no idea how to approach it.
# perform checking on column D (and add "check" column)
result["check"] = result.groupby('key')['D'].diff(periods=-1)
result
key A B C D check
0 K0 A0 B0 7 1 -1.0
1 K0 A0 B0 9 2 NaN
2 K1 A1 B1 11 2 1.0
3 K1 A1 B1 6 1 NaN
4 K2 A2 B2 9 1 -1.0
5 K2 A2 B2 10 2 NaN
6 K3 A3 B3 13 2 1.0
7 K3 A3 B3 5 1 NaN
# perform difference between column C values (and add it as a column)
result["diff"] = result.groupby('key')['C'].diff(periods=-1)
result
key A B C D check diff
0 K0 A0 B0 7 1 -1.0 -2.0
1 K0 A0 B0 9 2 NaN NaN
2 K1 A1 B1 11 2 1.0 5.0
3 K1 A1 B1 6 1 NaN NaN
4 K2 A2 B2 9 1 -1.0 -1.0
5 K2 A2 B2 10 2 NaN NaN
6 K3 A3 B3 13 2 1.0 8.0
7 K3 A3 B3 5 1 NaN NaN
# filter dataframe to get only desired results
result[(result.check < 0)]['diff']
0 -2.0
4 -1.0
# results as a list
list(result[(result.check < 0)]['diff'])
[-2.0, -1.0]
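For reference, the same logic can also be collapsed into a single expression. This is only a sketch, and it assumes exactly two rows per key, as in the sample result above:
# for each key: if the top D is smaller than the bottom D,
# emit top C minus bottom C; otherwise skip the group
diffs = (result.groupby('key')
               .apply(lambda g: g['C'].iloc[0] - g['C'].iloc[1]
                      if g['D'].iloc[0] < g['D'].iloc[1] else None)
               .dropna()
               .tolist())
# diffs == [-2, -1]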
I have a data frame where some rows have one ID and one related ID. In the example below, a1 and a2 are related (say to the same person) while b and c don't have any related rows.
import pandas as pd

test = pd.DataFrame(
    [['a1', 1, 'a2'],
     ['a1', 2, 'a2'],
     ['a1', 3, 'a2'],
     ['a2', 4, 'a1'],
     ['a2', 5, 'a1'],
     ['b', 6, None],
     ['c', 7, None]],
    columns=['ID1', 'Value', 'ID2']
)
test
ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a2 4 a1
4 a2 5 a1
5 b 6 None
6 c 7 None
What I need to achieve is to add a column containing the sum of all values for related rows. In this case, the desired output should be like below. Is there a way to get this, please?
ID1  Value  ID2   Group by ID1 and ID2
a1       1   a2                     15
a1       2   a2                     15
a1       3   a2                     15
a2       4   a1                     15
a2       5   a1                     15
b        6                           6
c        7                           7
Note that I learnt to use groupby to get the sum for ID1 (from this question), but not for ID1 and ID2 together.
test['Group by ID1'] = test.groupby("ID1")["Value"].transform("sum")
test
ID1 Value ID2 Group by ID1
0 a1 1 a2 6
1 a1 2 a2 6
2 a1 3 a2 6
3 a2 4 a1 9
4 a2 5 a1 9
5 b 6 None 6
6 c 7 None 7
Update
I think I can still use a for loop to get this done, as below, but I am wondering whether there is a non-loop way. Thanks.
bottle = pd.DataFrame().reindex_like(test)
bottle['ID1'] = test['ID1']
bottle['ID2'] = test['ID2']
for index, row in bottle.iterrows():
    bottle.loc[index, "Value"] = test[test['ID1'] == row['ID1']]['Value'].sum() + \
                                 test[test['ID1'] == row['ID2']]['Value'].sum()
print(bottle)
ID1 Value ID2
0 a1 15.0 a2
1 a1 15.0 a2
2 a1 15.0 a2
3 a2 15.0 a1
4 a2 15.0 a1
5 b 6.0 None
6 c 7.0 None
A possible solution would be to sort the pairs in ID1 and ID2, such that they always appear in the same order.
Swapping the IDs (working on a copy of the question's test frame):
df = test.copy()
# fill missing ID2 with ID1 before comparing; a raw string-vs-None comparison can raise a TypeError
s = df['ID1'] > df['ID2'].fillna(df['ID1'])
df.loc[s, ['ID1', 'ID2']] = df.loc[s, ['ID2', 'ID1']].values
print(df)
print(df)
>>> ID1 Value ID2
0 a1 1 a2
1 a1 2 a2
2 a1 3 a2
3 a1 4 a2
4 a1 5 a2
5 b 6 None
6 c 7 None
Then we can do a simple groupby:
df['RSUM'] = df.groupby(['ID1', 'ID2'], dropna=False)['Value'].transform("sum")
print(df)
>>> ID1 Value ID2 RSUM
0 a1 1 a2 15
1 a1 2 a2 15
2 a1 3 a2 15
3 a1 4 a2 15
4 a1 5 a2 15
5 b 6 None 6
6 c 7 None 7
Note the dropna=False to not discard IDs that have no pairing.
If you do not want to permanently swap the IDs, you can just create a temporary dataframe.
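For instance, here is a minimal sketch that uses an unordered pair as a throwaway group key instead of touching ID1/ID2 at all; the key helper series is hypothetical, not part of the answer above:
# build an order-insensitive key per row, so (a1, a2) and (a2, a1) land in the same group
key = test[['ID1', 'ID2']].apply(frozenset, axis=1)
test['RSUM'] = test.groupby(key)['Value'].transform('sum')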
I am trying to explode an existing dataframe based on a numeric value in a column. For example, if the column has a numeric value of 3, I want three copies of that row, and so on.
Assuming we start with this dataframe:
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
2 A3 cc 5
3 A4 dd 4
4 A5 ee 5
5 A6 ff 3
How do we get to this dataframe?
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
1 A2 bb 2
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
5 A6 ff 3
5 A6 ff 3
5 A6 ff 3
I have gotten this to work using the code below, but I was wondering whether there is an easier way to accomplish this without manually creating the comma-separated lists to feed into the explode method.
import pandas as pd
#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'], 'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'], 'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)
print(df1) #print original dataframe
#create my_comma_list column based on number values in calc column
df1.insert(3, 'my_comma_list', '')
df1.loc[df1['calc'] == 1, 'my_comma_list'] = '1'
df1.loc[df1['calc'] == 2, 'my_comma_list'] = '1, 2'
df1.loc[df1['calc'] == 3, 'my_comma_list'] = '1, 2, 3'
df1.loc[df1['calc'] == 4, 'my_comma_list'] = '1, 2, 3, 4'
df1.loc[df1['calc'] == 5, 'my_comma_list'] = '1, 2, 3, 4, 5'
print(df1) #print before row explosion
#explode the rows using the my_comma_list column to get desired number of rows
df1 = df1.assign(my_comma_list=df1['my_comma_list'].str.split(',')).explode('my_comma_list')
#drop the my_comma_list column since we no longer need it
del df1['my_comma_list']
print(df1) #print after row explosion
You can use Index.repeat and DataFrame.loc to repeat rows.
import pandas as pd

#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
     'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'],
     'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)
print(df1)
df1 = df1.loc[df1.index.repeat(df1['calc'])]
print(df1)
Output is:
Original DataFrame:
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
2 A3 cc 5
3 A4 dd 4
4 A5 ee 5
5 A6 ff 3
Updated DataFrame with repeated rows:
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
1 A2 bb 2
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
5 A6 ff 3
5 A6 ff 3
5 A6 ff 3
If you want to repeat rows based on a column value with a reference lookup, you can create a dictionary and identify how many times you want it to repeat, then use map to pass the value.
Let's say, you want to repeat based on the value in inventory_partner. Then you can do this:
import pandas as pd

inv_partner_dict = {'A1': 1, 'A2': 2, 'A3': 5, 'A4': 4, 'A5': 5, 'A6': 3}
#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
     'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'],
     'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)
print(df1)
# map inventory_partner (the dictionary's keys) to the repeat counts
df1 = df1.loc[df1.index.repeat(df1['inventory_partner'].map(inv_partner_dict))]
print(df1)
This will do the same thing.
The output of this will be:
Original DataFrame:
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
2 A3 cc 5
3 A4 dd 4
4 A5 ee 5
5 A6 ff 3
Updated DataFrame with repeated rows:
inventory_partner inventory_partner2 calc
0 A1 aa 1
1 A2 bb 2
1 A2 bb 2
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
2 A3 cc 5
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
3 A4 dd 4
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
4 A5 ee 5
5 A6 ff 3
5 A6 ff 3
5 A6 ff 3
Use pd.Series.repeat to get a repeated index and then reindex:
df1.reindex(df1.inventory_partner2.repeat(df1.calc).index)
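If a clean 0..n-1 index is wanted afterwards, a small follow-up sketch:
out = df1.reindex(df1.inventory_partner2.repeat(df1.calc).index)
out.reset_index(drop=True)  # the repeat keeps the duplicated labels (0, 1, 1, 2, 2, ...)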
I'm trying to join two DataFrames by index that can contain columns in common, and I only want to fill in a value from one if that specific value in the other is NaN or doesn't exist. I'm using the pandas example, so I've got:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
as
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
and
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])
as
B D F
2 B2p D2p F2p
3 B3p D3p F3p
6 B6p D6p F6p
7 B7p D7p F7p
and the desired result is:
     A    B    C    D    F
0   A0   B0   C0   D0  NaN
1   A1   B1   C1   D1  NaN
2   A2   B2   C2   D2  F2p
3   A3   B3   C3   D3  F3p
6  NaN  B6p  NaN  D6p  F6p
7  NaN  B7p  NaN  D7p  F7p
This is a good use case for combine_first, where the row and column indices of the resulting dataframe will be the union of the two, i.e. in the absence of an index in one of the dataframes, the value from the other is used (the same behaviour as if it contained a NaN):
df1.combine_first(df4)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
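Worth noting: combine_first is asymmetric, so the order of the two frames decides which values win where both are present. A quick sketch of the reversed call:
# df4's non-NaN values now take precedence in the overlap
df4.combine_first(df1)
# rows 2 and 3 would then show B2p/B3p and D2p/D3p instead of B2/B3 and D2/D3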
I am new to Python and developing some code.
I want to search for a word in a column and, if a match is found, insert an empty row below it.
My code is below:
If df.columnname=='total':
Df.insert
Could someone please help me?
Do give the following a try:
>>>df
id Label
0 1 A
1 2 B
2 3 B
3 4 B
4 5 A
5 6 B
6 7 A
7 8 A
8 9 C
9 10 C
10 11 C
# Create a separate series with the id of the rows to be duplicated
df1 = df.loc[df['Label'] == 'B', 'id']
# Join it back, interleave via a stable sort on the index, and reset the index
df = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)
>>>df
id Label
0 1 A
1 2 B
2 2 NaN
3 3 B
4 3 NaN
5 4 B
6 4 NaN
7 5 A
8 6 B
9 6 NaN
10 7 A
11 8 A
12 9 C
13 10 C
14 11 C
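A side note on this output: the inserted rows keep the original id and get NaN in Label. If you want them fully blank, one possible follow-up (a sketch, assuming empty-string placeholders are acceptable):
blank = df['Label'].isna()
df = df.astype({'id': object})       # allow a string placeholder in the integer id column
df.loc[blank, ['id', 'Label']] = ''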
Use the code below:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
count = 0
for index, row in df1.iterrows():
    if row["Column1"] == 'total':
        # insert a blank row right after the matching row
        df1 = pd.DataFrame(np.insert(df1.values, index + 1 + count,
                                     values=[" "] * len(df1.columns), axis=0),
                           columns=df1.columns)
        count += 1
print(df1)
Input:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2 total B2 C2 D2
3 A3 B3 C3 D3
Output:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2
3 total B2 C2 D2
4
5 A3 B3 C3 D3
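For what it's worth, a loop-free variant is also possible by borrowing the concat/sort_index idea from the previous answer; this is only a sketch (starting again from the original df1), not this answer's own approach:
# one blank row per 'total' row, sharing that row's index label
blanks = pd.DataFrame(' ', index=df1.index[df1['Column1'] == 'total'],
                      columns=df1.columns)
# a stable sort keeps each original row ahead of its blank copy
out = pd.concat([df1, blanks]).sort_index(kind='mergesort').reset_index(drop=True)
print(out)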
I have the following type of dataframe, with values grouped by 3 different categories A, B, C:
import pandas as pd
A = ['A1', 'A2', 'A3', 'A2', 'A1']
B = ['B3', 'B2', 'B2', 'B1', 'B3']
C = ['C2', 'C2', 'C3', 'C1', 'C3']
value = ['6','2','3','3','5']
df = pd.DataFrame({'categA': A,'categB': B, 'categC': C, 'value': value})
df
Which looks like:
categA categB categC value
0 A1 B3 C2 6
1 A2 B2 C2 2
2 A3 B2 C3 3
3 A2 B1 C1 3
4 A1 B3 C3 5
Now, when I want to unstack this df by the C category, .unstack() returns a multi-indexed dataframe with 'value' at the first level and my categories of interest C1, C2 & C3 at the second level:
df = df.set_index(['categA','categB','categC']).unstack('categC')
df
Output:
value
categC C1 C2 C3
categA categB
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Is there a quick and clean way to get rid of the multi-index by reducing it to the highest available level? This is what I'd like to have as output:
categA categB C1 C2 C3
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Many thanks in advance!
Edit:
print(df.reset_index())
gives:
categA categB value
categC C1 C2 C3
0 A1 B3 NaN 6 5
1 A2 B1 3 NaN NaN
2 A2 B2 NaN 2 NaN
3 A3 B2 NaN NaN 3
Add reset_index as well, and unstack value as a Series so that no extra 'value' column level is created:
df.set_index(['categA','categB','categC']).value.unstack('categC').reset_index()
Out[875]:
categC categA categB C1 C2 C3
0 A1 B3 None 6 5
1 A2 B1 3 None None
2 A2 B2 None 2 None
3 A3 B2 None None 3
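If the leftover categC label above the columns bothers you, it can be cleared too; a small follow-up sketch:
out = df.set_index(['categA', 'categB', 'categC']).value.unstack('categC').reset_index()
out.columns.name = None   # drop the stray 'categC' columns label
print(out)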