Calculations between different rows - python

I'm trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and shift() but did not manage to get the result I need.
Here's a simple example to better explain what I want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column 'd' that holds the difference between consecutive values in column 'a', but only where the difference between consecutive values in column 'b' equals 1; otherwise the value should be null, like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?

Try it like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
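For completeness, here is a minimal reproducible sketch with the dataframe rebuilt from the question's example:
import pandas as pd

df = pd.DataFrame({'a': [101, 211, 351, 401, 631],
                   'b': [1, 2, 3, 5, 6],
                   'c': ['aaa', 'dcd', 'yyy', 'lol', 'zzz']})
# diff() compares each row with the one above it, so the first row is always NaN
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
print(df)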

You can create a group key and take the difference within each group:
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
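To see why this works, inspect the intermediate key: diff().ne(1) flags the start of each run of consecutive b values, and cumsum() turns those flags into group labels, so a.diff() never crosses a break between runs:
key = df1.b.diff().ne(1).cumsum()
print(key.tolist())  # [1, 1, 1, 2, 2] -> rows 0-2 are one run, rows 3-4 another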

How to shift location of columns based on a condition of cells of other columns in python

I need a bit of help with Python. Here is what I want to achieve.
I have a dataset that looks like below:
import pandas as pd
# define data
data = {'A': [55, 'g', 35, 10, 'pj'],
        'B': [454, 27, 895, 3545, 34],
        'C': [4, 786, 7, 3, 896],
        'Phone Number': [123456789, 7, 3456789012, 4567890123, 1],
        'another_col': [None, 234567890, None, None, 215478565]}
pd.DataFrame(data)
A B C Phone Number another_col
0 55 454 4 123456789 None
1 g 27 786 7 234567890.0
2 35 895 7 3456789012 None
3 10 3545 3 4567890123 None
4 pj 34 896 1 215478565.0
I have extracted this data from a PDF, and unfortunately the extraction adds some random strings, as shown above in the dataframe. I want to check whether any cell in any column contains a string or other non-numeric value. If so, delete the string and shift the entire row to the left. Finally, the desired output is shown below:
A B C Phone Number another_col
0 55 454 4 1.234568e+08 None
1 27 786 7 2.345679e+08 None
2 35 895 7 3.456789e+09 None
3 10 3545 3 4.567890e+09 None
4 34 896 1 2.154786e+08 None
I would really appreciate your help.
One way is to use to_numeric to coerce each value to numeric, then shift each row leftward by dropping the NaNs:
out = (df.apply(pd.to_numeric, errors='coerce')
.apply(lambda x: pd.Series(x.dropna().tolist(), index=df.columns.drop('another_col')), axis=1))
Output:
A B C Phone Number
0 55.0 454.0 4.0 1.234568e+08
1 27.0 786.0 7.0 2.345679e+08
2 35.0 895.0 7.0 3.456789e+09
3 10.0 3545.0 3.0 4.567890e+09
4 34.0 896.0 1.0 2.154786e+08
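If it is not known in advance which column drops out, a slightly more general sketch (my variation, not part of the original answer) pads every row back to the original width instead of hardcoding 'another_col':
num = df.apply(pd.to_numeric, errors='coerce')
# collect each row's surviving values, left-aligned; shorter rows get NaN tails
out = num.apply(lambda r: pd.Series(r.dropna().values), axis=1)
out = out.reindex(columns=range(df.shape[1]))  # restore the original column count
out.columns = df.columns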
You can create a boolean mask, then shift the flagged rows and pd.concat:
m = pd.to_numeric(df['A'], errors='coerce').isna()
pd.concat([df.loc[~m], df.loc[m].shift(-1, axis=1)]).sort_index()
Output:
A B C Phone Number another_col
0 55 454 4 1.234568e+08 NaN
1 27 786 7 2.345679e+08 NaN
2 35 895 7 3.456789e+09 NaN
3 10 3545 3 4.567890e+09 NaN
4 34 896 1 2.154786e+08 NaN

Python Pandas - difference between 'loc' and 'where'?

Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So, essentially, I am curious why you would use 'where' over 'loc'/'iloc', and why it returns NaN values.
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses their integer positions. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
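The iloc equivalent of that selection is purely positional (note that iloc's slice end is exclusive, while loc's label slice is inclusive):
# columns sit at positions 0-3: ID, Run Distance, Goals, Gender
df.iloc[0:2, [3, 2]]  # same Gender/Goals selection as df.loc[0:1, ['Gender', 'Goals']]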
If you check the docs for DataFrame.where, it replaces the rows that fail the condition - with NaN by default, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is for filtering rows - it removes the rows that don't match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).
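A quick sketch of how the two relate; note that where introduces NaNs, which promotes numeric columns to float:
cond = df['Goals'] > 10
df.where(cond).dropna()  # same rows as df.loc[cond], but numeric columns become float
df.loc[cond]             # keeps the original dtypes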

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2, where I need to compute df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I'm having issues, is to map df3 to df1 and conduct a conditional check to see whether df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I'm doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series directly to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is casting the boolean mask to integer, which turns True/False into 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
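Stepping through the pipeline with the sample data makes the logic clear:
s = df2.groupby('c_id')['sum_column'].sum()
print(s)                   # c_id: 1 -> 14, 2 -> 11, 3 -> 11
print(df1['code'].map(s))  # with this sample, codes 2 and 3 both map to 11
print(df1['amount'] > df1['code'].map(s))  # the boolean mask behind 'seq'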
You forgot the quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)

Search and replace matching values across data frames in Python

I have the following Pandas Dataframes:
df1:
C D E F G
111 222 333 444 555
666 777
df2:
A B
111 3
222 4
333 3
444 3
555 4
100 3
666 4
200 3
777 3
I need to look up each value from df1 in df2.A, then replace that value in df1 with the paired value from df2.B.
So the required output would be:
C D E F G
3 4 3 3 4
4 3
I tried a left merge and thought about reshaping the values across, but figured there must be a simpler/cleaner direct search-and-replace method. Any help much appreciated.
First create a mapping series:
s = df2.set_index('A')['B']
Then apply this to each value:
df1 = df1.applymap(s.get)
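Note that DataFrame.applymap was deprecated in pandas 2.1 in favor of the elementwise DataFrame.map, so on recent versions the same idea reads:
df1 = df1.map(s.get)  # pandas >= 2.1; elementwise, same behavior as applymap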
Try this:
temp=df2.set_index('A')['B']
print(df1.replace(temp))
Output:
C D E F G
0 3 4 3.0 3.0 4.0
1 4 3 NaN NaN NaN

Problems with combining columns from dataframes in pandas

I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
Based on the equality of the columns code and scale, I need to copy the values from df2 to df1.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be many columns like R1 and R2, and I cannot check each one separately. I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.
What do you want to happen if the two dataframes both have values for R1/R2? If you want to keep df1's values, you could do:
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2's values, just do the fillna the other way round. To combine in some other way, please clarify the question!
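Concretely, the swapped call would be this sketch (note that fillna aligns on the calling frame's index, so the result keeps only df2's rows):
df2.set_index(['code', 'scale']).fillna(df1.set_index(['code', 'scale'])).reset_index()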
Try this:
pd.concat([df1, df2], axis=0).sort_values(['code', 'scale']).drop_duplicates(['code', 'scale'], keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN
This is a good situation for combine_first. It replaces the nulls in the calling dataframe with values from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
Other solutions
with fillna
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
with add - a bit faster
df1.set_index(['code', 'scale']).add(df2.set_index(['code', 'scale']), fill_value=0)
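One caveat on the add approach (my observation, not from the answers above): it matches combine_first here only because every overlapping cell is NaN in exactly one of the two frames. If both frames held a value for the same cell, add would sum them instead of preferring one:
import pandas as pd
a = pd.DataFrame({'R1': [80.0]})
b = pd.DataFrame({'R1': [30.0]})
print(a.add(b, fill_value=0))  # R1 = 110.0 -> values are summed
print(a.combine_first(b))      # R1 = 80.0  -> a's value wins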
