Pandas: how to merge df with condition - python

I have df
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
456 10 18 2
456 42 34 2
789 13 71 3
789 19 108 3
789 234 560 4
and second df
number A B
123 18 27
456 32 19
789 234 560
I need: if number, A, and B in a row of the first df match a row in the second df, that row should go into a new df, together with every other row whose C value equals the C of the matched row.
Desired output
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
789 234 560 4
How can I write this condition?

One way is to give df2 a dummy column:
In [11]: df2["in_df2"] = True
then you can do the merge:
In [12]: df1.merge(df2, how="left")
Out[12]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
3 456 10 18 2 NaN
4 456 42 34 2 NaN
5 789 13 71 3 NaN
6 789 19 108 3 NaN
7 789 234 560 4 True
Now, we only want those groups which contain a True:
In [13]: df1.merge(df2, how="left").groupby(["number", "C"]).filter(lambda x: x["in_df2"].any())
Out[13]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
7 789 234 560 4 True
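If you then want the desired output without the helper column, drop it afterwards; a small follow-up sketch (out is just a hypothetical name for the filtered result):
out = df1.merge(df2, how="left").groupby(["number", "C"]).filter(lambda x: x["in_df2"].any())
out = out.drop(columns="in_df2")  # leaves number, A, B, C only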

Related

Data frame segmentation and dropping

I have the following DataFrame in pandas:
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90],
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
I want to create a new column, and in that column I want values from column A based on a condition on column B. The condition: if there is no 'txt' between two consecutive 'BW', I keep those values in column C; but if there is 'txt' between two consecutive 'BW', I want to drop all those values. So the expected output should look like:
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90],
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
C = [1,10,23, BW, 24,24,55, BW, nan, nan, nan, nan, nan, nan, BW, 43,BW]
I have no clue how to do it. Any help is much appreciated.
EDIT:
Updated the answer, which was missing the values of BW in the final df.
import pandas as pd
import numpy as np

BW = 999
txt = -999
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
df = pd.DataFrame({'A': A, 'B': B})
# Number each run of rows between consecutive BW markers; the BW rows
# themselves fall out of the filter and end up with group = NaN.
df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
# Blank out the group that contains txt, otherwise keep the value of A.
df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
# Put the BW marker values back into C.
df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
df['C'] = df['C'].astype('Int64')
df = df.drop('group', axis=1)
In [435]: df
Out[435]:
A B C
0 1 24 1
1 10 23 10
2 23 29 23
3 45 999 999 <-- BW
4 24 49 24
5 24 59 24
6 55 72 55
7 67 999 999 <-- BW
8 73 9 <NA>
9 26 183 <NA>
10 13 17 <NA>
11 96 -999 <NA> <-- txt is in the middle of BW
12 53 2 <NA>
13 23 49 <NA>
14 24 999 999 <-- BW
15 43 479 43
16 90 999 999 <-- BW
You can achieve it like so. Assuming BW and txt are specific values, I filled them with arbitrary numbers just to differentiate them:
In [277]: BW = 999
In [278]: txt = -999
In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
In [300]: df = pd.DataFrame({'A': A, 'B': B})
In [301]: df
Out[301]:
A B
0 1 24
1 10 23
2 23 29
3 45 999
4 24 49
5 24 59
6 55 72
7 67 999
8 73 9
9 26 183
10 13 17
11 96 -999
12 53 2
13 23 49
14 24 999
15 43 479
16 90 999
First, let's split the values into groups, where each group contains the values of B that sit between one BW and the next BW.
In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
In [322]: df
Out[322]:
A B group
0 1 24 0.0
1 10 23 0.0
2 23 29 0.0
3 45 999 NaN
4 24 49 1.0
5 24 59 1.0
6 55 72 1.0
7 67 999 NaN
8 73 9 2.0
9 26 183 2.0
10 13 17 2.0
11 96 -999 2.0
12 53 2 2.0
13 23 49 2.0
14 24 999 NaN
15 43 479 3.0
16 90 999 NaN
Next with the use of np.where() we can replace the values depending on the condition that you set.
In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
In [432]: df
Out[432]:
A B group C
0 1 24 0.0 1.0
1 10 23 0.0 10.0
2 23 29 0.0 23.0
3 45 999 NaN 45.0
4 24 49 1.0 24.0
5 24 59 1.0 24.0
6 55 72 1.0 55.0
7 67 999 NaN 67.0
8 73 9 2.0 NaN
9 26 183 2.0 NaN
10 13 17 2.0 NaN
11 96 -999 2.0 NaN
12 53 2 2.0 NaN
13 23 49 2.0 NaN
14 24 999 NaN 24.0
15 43 479 3.0 43.0
16 90 999 NaN 90.0
Here we need to set C back to the values of B where B is equal to BW.
In [488]: df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
In [489]: df
Out[489]:
A B group C
0 1 24 0.0 1.0
1 10 23 0.0 10.0
2 23 29 0.0 23.0
3 45 999 NaN 999.0
4 24 49 1.0 24.0
5 24 59 1.0 24.0
6 55 72 1.0 55.0
7 67 999 NaN 999.0
8 73 9 2.0 NaN
9 26 183 2.0 NaN
10 13 17 2.0 NaN
11 96 -999 2.0 NaN
12 53 2 2.0 NaN
13 23 49 2.0 NaN
14 24 999 NaN 999.0
15 43 479 3.0 43.0
16 90 999 NaN 999.0
Lastly, convert the float column to the nullable Int64 dtype and drop the group column, which we no longer need. If you want to keep the missing values as np.nan, skip the conversion to Int64.
In [396]: df.C = df.C.astype('Int64')
In [397]: df
Out[397]:
A B group C
0 1 24 0.0 1
1 10 23 0.0 10
2 23 29 0.0 23
3 45 999 NaN 999
4 24 49 1.0 24
5 24 59 1.0 24
6 55 72 1.0 55
7 67 999 NaN 999
8 73 9 2.0 <NA>
9 26 183 2.0 <NA>
10 13 17 2.0 <NA>
11 96 -999 2.0 <NA>
12 53 2 2.0 <NA>
13 23 49 2.0 <NA>
14 24 999 NaN 999
15 43 479 3.0 43
16 90 999 NaN 999
In [398]: df = df.drop('group', axis=1)
In [435]: df
Out[435]:
A B C
0 1 24 1
1 10 23 10
2 23 29 23
3 45 999 999
4 24 49 24
5 24 59 24
6 55 72 55
7 67 999 999
8 73 9 <NA>
9 26 183 <NA>
10 13 17 <NA>
11 96 -999 <NA>
12 53 2 <NA>
13 23 49 <NA>
14 24 999 999
15 43 479 43
16 90 999 999
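As a side note, df['B'].between(BW, BW) is simply an equality test against BW; a sketch of an equivalent, slightly more direct way to build the same group column (same df and BW as above):
mask = df['B'].ne(BW)  # non-BW rows
df['group'] = (df.index.to_series()[mask].diff() > 1).cumsum()
# BW rows are excluded by the mask, so they stay NaN in group, exactly as before.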
I don't know if this is the most efficient way to do it, but you can create a new column called mask by mapping the values in column B as follows: 'BW' to True, 'txt' to False, and all other values to np.nan.
Then forward fill the NaN in mask, backward fill the NaN in mask, and combine the results logically (True whenever either the forward- or backward-filled column is False). This produces a column called final_mask in which all of the values between two consecutive BW that enclose a txt are marked True.
You can then use .apply to select the value of column A when final_mask is False and column B isn't 'BW', column B when final_mask is False and column B is 'BW', and np.nan otherwise.
import numpy as np
import pandas as pd
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, 'BW',49,59,72, 'BW',9,183,17, 'txt',2,49,'BW',479,'BW']
df = pd.DataFrame({'A':A,'B':B})
df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)
df["ffill"] = df["mask"].fillna(method="ffill")
df["bfill"] = df["mask"].fillna(method="bfill")
df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)
df["C"] = df.apply(lambda x: x['A'] if (
(x['final_mask'] == False) & (x['B'] != 'BW'))
else x['B'] if ((x['final_mask'] == False) & (x['B'] == 'BW'))
else np.nan, axis=1
)
>>> df
A B mask ffill bfill final_mask C
0 1 24 NaN NaN True False 1
1 10 23 NaN NaN True False 10
2 23 29 NaN NaN True False 23
3 45 BW True True True False BW
4 24 49 NaN True True False 24
5 24 59 NaN True True False 24
6 55 72 NaN True True False 55
7 67 BW True True True False BW
8 73 9 NaN True False True NaN
9 26 183 NaN True False True NaN
10 13 17 NaN True False True NaN
11 96 txt False False False True NaN
12 53 2 NaN False True True NaN
13 23 49 NaN False True True NaN
14 24 BW True True True False BW
15 43 479 NaN True True False 43
16 90 BW True True True False BW
Dropping the columns we created along the way:
df.drop(columns=['mask','ffill','bfill','final_mask'])
A B C
0 1 24 1
1 10 23 10
2 23 29 23
3 45 BW BW
4 24 49 24
5 24 59 24
6 55 72 55
7 67 BW BW
8 73 9 NaN
9 26 183 NaN
10 13 17 NaN
11 96 txt NaN
12 53 2 NaN
13 23 49 NaN
14 24 BW BW
15 43 479 43
16 90 BW BW
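As a usage note, the same mask column can be built without apply by mapping a small dict; anything not in the dict becomes NaN (a sketch with the same column names as above):
df["mask"] = df["B"].map({'BW': True, 'txt': False})  # all other values -> NaN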

Groupby & Sum from occurance of a particular value till the occurance of another particular value or the same value

I have a dataframe as below.
I want to groupby 'User' & 'eve' and sum 'Ses' up to each 100/200 marker, and from a 100 to a 200.
Also, return the value of column 'Name' where the 100/200 occurs.
If, after a 100/200, there is no further 100 or 200 (like the last row in group a & 123 or a & 456), ignore those rows.
User eve Ses ID Name
a 123 1 10 a
a 123 2 11 a
a 123 3 12 a
a 123 4 13 a
a 123 3 100 xyz
a 123 6 10 a
a 456 1 11 a
a 456 2 12 a
a 456 3 13 a
a 456 4 40 a
a 456 1 100 mno
a 456 14 10 a
a 456 7 20 a
a 456 8 30 a
a 456 12 200 pqr
a 456 10 10 a
b 123 1 20 a
b 123 2 30 a
b 123 3 40 a
b 123 4 50 a
b 123 1 70 a
b 123 6 100 abc
b 888 1 20 a
b 888 1 200 jkl
b 888 3 10 a
b 888 4 20 a
b 888 5 30 a
b 888 1 100 rrr
b 888 7 50 a
b 888 8 70 a
The expected output for the above input df is a df below.
User eve Ses Name
a 123 13 xyz
a 456 11 mno
a 456 41 pqr
b 123 17 abc
b 888 2 jkl
b 888 13 rrr
This is my approach:
# valid IDs
df['valids'] = df['ID'].isin([100, 200])
# mask the trailing non-hundred ids
heads = (df['ID'].where(df['valids'])
           .groupby([df['User'], df['eve']])
           .bfill()
           .notnull())
df = df[heads]
# groupby and output:
(df.groupby(['User', 'eve', df['valids'].shift(fill_value=0).cumsum()],
            as_index=False)
   .agg({'Ses': 'sum', 'Name': 'last'}))
Output:
User eve Ses Name
0 a 123 13 xyz
1 a 456 11 mno
2 a 456 41 pqr
3 b 123 17 abc
4 b 888 2 jkl
5 b 888 13 rrr
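The key step above is the shifted cumulative sum: shifting the 100/200 marker down one row before summing keeps each marker row inside the segment it closes. A minimal sketch of just that step, on hypothetical toy data:
import pandas as pd

s = pd.Series([False, False, True, False, True])   # True marks a 100/200 row
print(s.shift(fill_value=False).cumsum().tolist())
# [0, 0, 0, 1, 1] -- rows 0-2 (marker included) form segment 0, rows 3-4 segment 1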

Generating rows in a pandas dataframe to make up for missing values of a column (or multiple columns)

I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102, there are no values for hour = 3. This is due to the fact that the sensors do not generate a separate row of data if the hourly_count is equal to zero. This means that sensor 102 should have hourly_counts = 0 at hour = 3, but this is just the way the original data was collected.
I would ideally like code that fills in this gap: if there are 2 sensors, each sensor should have a record for every hour, and if one is missing, insert a row for that sensor and hour with hourly_count set to 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the index beyond the hours you have; in the following example, it extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
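If you would rather fill gaps only within the hours actually observed (no extension to hour 8), the product index can be built from the data itself; a sketch under that assumption:
new_ix = pd.MultiIndex.from_product([df['hour'].unique(), df['sensor_id'].unique()],
                                    names=['hour', 'sensor_id'])
df.set_index(['hour', 'sensor_id']).reindex(new_ix, fill_value=0).reset_index()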
Use pandas.DataFrame.pivot and then unstack with reset_index:
new_df = df.pivot(index='sensor_id', columns='hour', values='hourly_count').fillna(0).unstack().reset_index()
print(new_df)
Output:
hour sensor_id 0
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
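Note that this route leaves the count column literally named 0 and upcast to float (the intermediate NaN forces the upcast); if that matters, a small cleanup sketch:
new_df = new_df.rename(columns={0: 'hourly_count'})
new_df['hourly_count'] = new_df['hourly_count'].astype(int)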
Assume the missing rows are for sensor_id 102 only. One way is to create a new df with every combination of hour and sensor_id, left-merge the original df onto it to get hourly_count, and fillna:
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
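The merge/fillna route likewise upcasts hourly_count to float because the missing rows briefly hold NaN; if you need integers back, a one-line cast (a sketch, using the frames above):
out = df1.merge(df, on=['hour', 'sensor_id'], how='left').fillna(0)
out['hourly_count'] = out['hourly_count'].astype(int)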
Another way: use unstack with fill_value:
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0

How to perform conditional updation of column values in Pandas DataFrame?

I have the dataframe below. Is there any way to perform conditional addition of column values in pandas?
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 NaN 4 5 5 54 3 2
222 bbb pune 1 70 NaN 5 4 4 8 3 4
333 ccc mumbai 2 NaN NaN 9 3 4 8 4 3
444 ddd hyd 4 NaN NaN 3 8 6 4 2 7
What I want to achieve:
If City = pune, default_sal should be copied into total_sal; for example, for emp_id 111, total_sal should be 90.
If City != pune, total_sal should be computed from months_worked; for example, for emp_id 333, months_worked = 2, so the sum of the jan and feb values, 9+3=12, should be stored in total_sal.
Desired O/P
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 90 4 5 5 54 3 2
222 bbb pune 1 70 70 5 4 4 8 3 4
333 ccc mumbai 2 NaN 12 9 3 4 8 4 3
444 ddd hyd 4 NaN 21 3 8 6 4 2 7
Using np.where after creating a helper Series:
s1 = pd.Series([df.iloc[x, 6:y+6].sum() for x, y in enumerate(df.months_worked)], index=df.index)
np.where(df.City == 'pune', df.default_sal, s1)
Out[429]: array([90., 70., 12., 21.])
#df['total_sal'] = np.where(df.City == 'pune', df.default_sal, s1)
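For context, a self-contained version of the same idea, assuming the column order from the question (jan sits at position 6):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'emp_id': [111, 222, 333, 444],
    'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
    'City': ['pune', 'pune', 'mumbai', 'hyd'],
    'months_worked': [2, 1, 2, 4],
    'default_sal': [90, 70, np.nan, np.nan],
    'total_sal': np.nan,
    'jan': [4, 5, 9, 3], 'feb': [5, 4, 3, 8], 'mar': [5, 4, 4, 6],
    'apr': [54, 8, 8, 4], 'may': [3, 3, 4, 2], 'jun': [2, 4, 3, 7],
})
# Sum the first months_worked month columns of each row (jan is column 6).
s1 = pd.Series([df.iloc[i, 6:m + 6].sum() for i, m in enumerate(df.months_worked)],
               index=df.index)
df['total_sal'] = np.where(df.City == 'pune', df.default_sal, s1)
print(df['total_sal'].tolist())  # [90.0, 70.0, 12.0, 21.0]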

Problems while merging two pandas dataframes with different shapes?

This is quite simple, but I do not get why I can't merge two dataframes. I have the following dfs with different shapes (one is larger and wider than the other):
df1
A id
0 microsoft inc 1
1 apple computer. 2
2 Google Inc. 3
3 IBM 4
4 amazon, Inc. 5
df2
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
0 (409-6043-01) 234324 API Other 2
0 23423423 API NaN NaN 3
0 (001722-5e240-60) NaN NaN Other 4
1 (0012172-52411-60) 32423423. NaN Other 4
0 29849032-29482390 API Yes False 5
1 329482030-23490-1 API Yes False 5
I would like to merge df1 and df2 by the index column:
df3
A B C D E id
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 Google Inc. 23423423 API NaN NaN 3
3 IBM (001722-5e240-60) NaN NaN Other 4
4 IBM (0012172-52411-60) 32423423. NaN Other 4
5 amazon, Inc. 29849032-29482390 API Yes False 5
6 amazon, Inc. 329482030-23490-1 API Yes False 5
I know that this could be done by using merge(). Also, I read this excellent tutorial and tried:
In:
pd.merge(df1, df2, on=df1.id, how='outer')
Out:
IndexError: indices are out-of-bounds
Then I tried:
pd.merge(df2, df1, on='id', how='outer')
And apparently it repeats the merged rows several times, something like this:
A B C D E index
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 apple computer. (409-6043-01) 234324 API Other 2
3 apple computer. (409-6043-01) 234324 API Other 2
4 apple computer. (409-6043-01) 234324 API Other 2
5 apple computer. (409-6043-01) 234324 API Other 2
6 apple computer. (409-6043-01) 234324 API Other 2
7 apple computer. (409-6043-01) 234324 API Other 2
8 apple computer. (409-6043-01) 234324 API Other 2
...
I think this is related to the fact that, since the indices look weird, I created a temporary index df2['position'] = df2.index and then removed it. So, my question is how to get df3?
UPDATE
I fixed the index of df2 like this:
df2.reset_index(drop=True, inplace=True)
And now looks like this:
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
1 (409-6043-01) 234324 API Other 2
2 23423423 API NaN NaN 3
3 (001722-5e240-60) NaN NaN Other 4
4 (0012172-52411-60) 32423423. NaN Other 4
5 29849032-29482390 API Yes False 5
6 329482030-23490-1 API Yes False 5
I am still having the same issue. The merged rows are repeating several times.
>>>print(df2.dtypes)
B object
C object
D object
E object
id int64
dtype: object
>>>print(df1.dtypes)
A object
id int64
dtype: object
Update2
>>>print(df2['id'])
0 1
1 2
2 3
3 4
4 4
5 5
6 5
7 6
8 6
9 7
10 8
11 8
12 8
13 8
14 9
15 10
16 11
17 11
18 12
19 12
20 13
21 13
22 14
23 15
24 16
25 16
26 17
27 17
28 18
29 18
...
476 132
477 132
478 132
479 132
480 132
481 132
482 132
483 132
484 133
485 133
486 133
487 133
488 134
489 134
490 134
491 134
492 135
493 135
494 136
495 136
496 137
497 137
498 137
499 137
500 137
501 137
502 137
503 138
504 138
505 138
Name: id, dtype: int64
And
>>>print(df1)
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 8
12 12
13 6
14 7
15 8
16 6
17 11
18 13
19 14
20 15
21 11
22 2
23 16
24 17
25 18
26 9
27 19
28 11
29 20
..
108 57
109 43
110 22
111 2
112 58
113 49
114 22
115 59
116 2
117 6
118 22
119 2
120 37
121 2
122 9
123 60
124 61
125 62
126 63
127 42
128 64
129 4
130 29
131 11
132 2
133 25
134 4
135 65
136 66
137 4
Name: id, dtype: int64
You could try setting the index as id and then using join:
df1 = pd.DataFrame([('microsoft inc', 1),
                    ('apple computer.', 2),
                    ('Google Inc.', 3),
                    ('IBM', 4),
                    ('amazon, Inc.', 5)], columns=('A', 'id'))
df2 = pd.DataFrame([('(01780-500-01)', '237489', '- 342', 'API', 1),
                    ('(409-6043-01)', '234324', ' API', 'Other ', 2),
                    ('23423423', 'API', 'NaN', 'NaN', 3),
                    ('(001722-5e240-60)', 'NaN', 'NaN', 'Other', 4),
                    ('(0012172-52411-60)', '32423423', ' NaN', 'Other', 4),
                    ('29849032-29482390', 'API', ' Yes', ' False', 5),
                    ('329482030-23490-1', 'API', ' Yes', ' False', 5)],
                   columns=['B', 'C', 'D', 'E', 'id'])
df1 = df1.set_index('id')
df1.drop_duplicates(inplace=True)
df2 = df2.set_index('id')
df3 = df1.join(df2, how='outer')
Since you've set the index columns (aka join keys) for both dataframes, you wouldn't have to specify the on='id' param.
This is an alternate way to solve the problem. I don't see anything wrong with pd.merge(df1, df2, on='id', how='outer'). You might want to double-check the id column in both dataframes, as mentioned by @JohnE.
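The row blow-up seen in the question is exactly what an outer merge does when the join key is duplicated on both sides: every id=2 row of df1 pairs with every id=2 row of df2, a per-key Cartesian product. A quick diagnostic sketch, using the question's original df1/df2 (before set_index):
# Any id duplicated in BOTH frames multiplies rows in the merged result.
print(df1['id'].duplicated().sum(), df2['id'].duplicated().sum())
print(df1[df1['id'].duplicated(keep=False)].sort_values('id').head())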
