Merging DataFrames with partial string matches in pandas

I am trying to merge two dataframes of different sizes based on a partial match between the columns 'sub_name' and 'name' (and a full match of the values in column A):
sub_name val_1 A
0 AAA 2 208
1 AAB 4 208
2 AAC 8 208
3 BAA 7 210
4 CAA 4 213
5 CAC 6 213
6 CAD 2 213
7 CAE 3 213
8 EAA 8 222
9 FAA 3 223
name val_2 A
0 XAAA 1 208
1 AABYY 5 208
2 BAAZ 9 210
3 CAAY 5 213
4 YCABX 8 213
5 XXCAC 6 213
6 YCADZ 3 213
7 XDAA 6 215
8 EAAX 4 222
The code to build them:
import pandas as pd

df1 = pd.DataFrame({'sub_name': ['AAA','AAB','AAC','BAA','CAA','CAC','CAD','CAE','EAA','FAA'],
                    'val_1': [2,4,8,7,4,6,2,3,8,3],
                    'A': [208,208,208,210,213,213,213,213,222,223]})
df2 = pd.DataFrame({'name': ['XAAA','AABYY','BAAZ','CAAY','YCABX','XXCAC','YCADZ','XDAA','EAAX'],
                    'val_2': [1,5,9,5,8,6,3,6,4],
                    'A': [208,208,210,213,213,213,213,215,222]})
Edit: I want to do an outer merge of these two dataframes: if there is no match, keep the row; if there is a partial match (between sub_name and name) and the values in column A also match, merge the rows together; if there is a partial match between name and sub_name but the column A values don't match, keep both rows.
I am trying to obtain:
name val_1 val_2 A
0 AAA 2.0 1.0 208
1 AAB 4.0 5.0 208
2 AAC 8.0 NaN 208
3 BAA 7.0 9.0 210
4 CAA 4.0 5.0 213
5 YCABX NaN 8.0 213
6 CAC 6.0 6.0 213
7 CAD 2.0 3.0 213
8 CAE 3.0 NaN 213
9 XDAA NaN 6.0 215
10 EAA 8.0 4.0 222
11 FAA 3.0 NaN 223
or this (it doesn't matter if I keep the full name or just the sub_name where the rows match):
name val_1 val_2 A
0 XAAA 2.0 1.0 208
1 AABYY 4.0 5.0 208
2 AAC 8.0 NaN 208
3 BAAZ 7.0 9.0 210
4 CAAY 4.0 5.0 213
5 YCABX NaN 8.0 213
6 XXCAC 6.0 6.0 213
7 YCADZ 2.0 3.0 213
8 CAE 3.0 NaN 213
9 XDAA NaN 6.0 215
10 EAA 8.0 4.0 222
11 FAA 3.0 NaN 223
If I needed a full match I would use pd.merge(df1, df2, how='outer'), but since I am working with substrings I don't know how to approach this. Maybe str.contains() could be useful?
Note: The sub_name can be made of more than just three letters. This is just an example.

# Create a new column holding whichever of df1's sub_names appear in df2's name
df2['sub_name'] = df2['name'].str.findall('|'.join(df1['sub_name'].values.tolist())).str.join(',')
# Do an outer merge on the extracted sub_name plus A
df_new = df2.merge(df1, how='outer', on=['sub_name', 'A'])
# Fill missing names from sub_name and drop the helper column
df_new = df_new.assign(name=df_new['name'].combine_first(df_new['sub_name'])).drop(columns=['sub_name'])
Outcome:
name val_2 A val_1
0 XAAA 1.0 208 2.0
1 AABYY 5.0 208 4.0
2 BAAZ 9.0 210 7.0
3 CAAY 5.0 213 4.0
4 YCABX 8.0 213 NaN
5 XXCAC 6.0 213 6.0
6 YCADZ 3.0 213 2.0
7 XDAA 6.0 215 NaN
8 EAAX 4.0 222 8.0
9 AAC NaN 208 8.0
10 CAE NaN 213 3.0
11 FAA NaN 223 3.0
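One caveat, not in the original answer: str.findall treats each sub_name as a regular expression, so if the sub-names can ever contain metacharacters ('.', '+', parentheses, ...), escape them first. A minimal sketch of that variant:
import re

# hypothetical hardening of the step above: escape each sub_name so it is
# matched literally rather than as a regex pattern
pattern = '|'.join(map(re.escape, df1['sub_name']))
df2['sub_name'] = df2['name'].str.findall(pattern).str.join(',')
Note also that if one name happens to contain several sub_names, the joined string ('AAA,AAB') will no longer match any single sub_name in the merge.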

You can use a fuzzy match with a threshold, then merge:
from thefuzz import process

def best(x, thresh=80):
    # pass a list so extractOne returns plain (match, score) pairs
    match, score = process.extractOne(x, list(df2['name']))
    if score >= thresh:
        return match

df1.merge(df2, left_on=['A', df1['sub_name'].apply(best)],
          right_on=['A', 'name'],
          how='outer')
Output:
sub_name val_1 A name val_2
0 AAA 2.0 208 XAAA 1.0
1 AAB 4.0 208 AABYY 5.0
2 AAC 8.0 208 None NaN
3 BAA 7.0 210 BAAZ 9.0
4 CAA 4.0 213 CAAY 5.0
5 CAC 6.0 213 XXCAC 6.0
6 CAD 2.0 213 YCADZ 3.0
7 CAE 3.0 213 None NaN
8 EAA 8.0 222 EAAX 4.0
9 FAA 3.0 223 None NaN
10 NaN NaN 213 YCABX 8.0
11 NaN NaN 215 XDAA 6.0
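Note that thefuzz is a third-party package (pip install thefuzz). One possible refinement, not part of the answer above: restrict the candidates to names that share the same A value, so a sub_name can never be paired with a name from a different group. A sketch, assuming the same df1/df2:
# hypothetical variant: fuzzy-match only within the same value of A
def best_in_group(row, thresh=80):
    candidates = df2.loc[df2['A'] == row['A'], 'name']
    if candidates.empty:
        return None
    match, score = process.extractOne(row['sub_name'], list(candidates))
    return match if score >= thresh else None

df1.merge(df2, left_on=['A', df1.apply(best_in_group, axis=1)],
          right_on=['A', 'name'], how='outer')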

Iterative ffill with median values in a dataframe

Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to forward fill and replace the NaN values with the median of the preceding non-NaN values. For example, the median of 11, 12, 15 in 'col1' is 12, therefore the NaN values should be filled with 12 until the next non-NaN values in the column, continuing the same way down the frame. Appreciate any help on this. See below the expected df:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# give each run of consecutive NaN / non-NaN values its own group id
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()
# a NaN run's median is NaN, so ffill propagates the previous run's median into it
df["col1"] = df["col1"].fillna(df.groupby(m1)["col1"].transform("median").ffill())
df["col2"] = df["col2"].fillna(df.groupby(m2)["col2"].transform("median").ffill())
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
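To see what the grouping masks do, here is a small illustration (hypothetical mini-series, not from the question): each run of consecutive NaN or non-NaN values gets its own id, so the groupby medians are computed per run.
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 11, 12, 15, np.nan, np.nan])
runs = (s.isna() != s.isna().shift(1)).cumsum()
print(runs.tolist())  # [1, 2, 2, 2, 3, 3] -> one id per run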
IIUC, we'll get what you're looking for if we fill null values like so:
1. Fill with the median of the last 3 non-null items.
2. Fill with the median of the last 2 non-null items.
3. Forward fill the rest.
out = (df.combine_first(df.rolling(4, 3).median())
         .combine_first(df.rolling(3, 2).median())
         .ffill())
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
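A note on the window sizes (my reading, not stated in the answer): rolling(4, 3) spans the current row plus the three before it, and min_periods=3 means a median is produced only when at least three of those four values are non-null, i.e. at the first NaN of each gap. The second pass with rolling(3, 2) then handles positions where only two earlier values are within reach, and ffill covers the rest. You can inspect what the first pass alone would fill:
# intermediate result of the first pass only
print(df.rolling(4, 3).median())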

Get values from between two other values for each row in the dataframe

I want to extract the integer values for each Hole_ID between the From and To values (inclusive) and save them to a new data frame with the Hole IDs as the column headers.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['Hole_1',110,117],['Hole_2',220,225],['Hole_3',112,114],
                            ['Hole_4',248,252],['Hole_5',116,120],['Hole_6',39,45],
                            ['Hole_7',65,72],['Hole_8',79,83]]),
                  columns=['HOLE_ID','FROM','TO'])
Example starting data
HOLE_ID FROM TO
0 Hole_1 110 117
1 Hole_2 220 225
2 Hole_3 112 114
3 Hole_4 248 252
4 Hole_5 116 120
5 Hole_6 39 45
6 Hole_7 65 72
7 Hole_8 79 83
This is what I would like:
Out[5]:
Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110 220 112 248 116 39 65 79
1 111 221 113 249 117 40 66 80
2 112 222 114 250 118 41 67 81
3 113 223 NaN 251 119 42 68 82
4 114 224 NaN 252 120 43 69 83
5 115 225 NaN NaN NaN 44 70 NaN
6 116 NaN NaN NaN NaN 45 71 NaN
7 117 NaN NaN NaN NaN NaN 72 NaN
I have tried to use the range function, which works if I manually define the range:
for i in df['HOLE_ID']:
    df2[i] = range(int(1), int(10))
gives
Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5 5
5 6 6 6 6 6 6 6 6
6 7 7 7 7 7 7 7 7
7 8 8 8 8 8 8 8 8
8 9 9 9 9 9 9 9 9
but this won't take the df To and From values as inputs to the range.
df2 = pd.DataFrame()
for i in df['HOLE_ID']:
    df2[i] = range(df['To'], df['From'])
gives an error.
Apply a method that returns a series of the range between FROM and TO, then transpose the result, e.g.:
import numpy as np

# FROM/TO were built via np.array, so they are strings; make them numeric first
df[['FROM', 'TO']] = df[['FROM', 'TO']].astype(int)
df.set_index('HOLE_ID').apply(lambda v: pd.Series(np.arange(v['FROM'], v['TO'] + 1)), axis=1).T
Gives you:
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110.0 220.0 112.0 248.0 116.0 39.0 65.0 79.0
1 111.0 221.0 113.0 249.0 117.0 40.0 66.0 80.0
2 112.0 222.0 114.0 250.0 118.0 41.0 67.0 81.0
3 113.0 223.0 NaN 251.0 119.0 42.0 68.0 82.0
4 114.0 224.0 NaN 252.0 120.0 43.0 69.0 83.0
5 115.0 225.0 NaN NaN NaN 44.0 70.0 NaN
6 116.0 NaN NaN NaN NaN 45.0 71.0 NaN
7 117.0 NaN NaN NaN NaN NaN 72.0 NaN
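The NaN padding forces the result columns to float. If you would rather keep whole numbers, one option (assuming a reasonably recent pandas) is the nullable integer dtype:
# same result, but with <NA> instead of NaN and integer values preserved
out = df.set_index('HOLE_ID').apply(
    lambda v: pd.Series(np.arange(v['FROM'], v['TO'] + 1)), axis=1
).T.astype('Int64')
print(out)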
Let's try:
df[['FROM','TO']] = df[['FROM','TO']].apply(pd.to_numeric)
dfe = (df.set_index('HOLE_ID')
         .apply(lambda x: np.arange(x['FROM'], x['TO']+1), axis=1)
         .explode()
         .to_frame())
dfe.set_index(dfe.groupby(level=0).cumcount(), append=True)[0].unstack(0)
Output:
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110 220 112 248 116 39 65 79
1 111 221 113 249 117 40 66 80
2 112 222 114 250 118 41 67 81
3 113 223 NaN 251 119 42 68 82
4 114 224 NaN 252 120 43 69 83
5 115 225 NaN NaN NaN 44 70 NaN
6 116 NaN NaN NaN NaN 45 71 NaN
7 117 NaN NaN NaN NaN NaN 72 NaN
Here is another way that builds a range from the two columns and creates a df:
out = pd.DataFrame(df[['FROM','TO']].astype(int)
                     .agg(tuple, axis=1)
                     .map(lambda x: list(range(x[0], x[1]+1)))
                     .tolist(),
                   index=df['HOLE_ID']).T
out
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110.0 220.0 112.0 248.0 116.0 39.0 65.0 79.0
1 111.0 221.0 113.0 249.0 117.0 40.0 66.0 80.0
2 112.0 222.0 114.0 250.0 118.0 41.0 67.0 81.0
3 113.0 223.0 NaN 251.0 119.0 42.0 68.0 82.0
4 114.0 224.0 NaN 252.0 120.0 43.0 69.0 83.0
5 115.0 225.0 NaN NaN NaN 44.0 70.0 NaN
6 116.0 NaN NaN NaN NaN 45.0 71.0 NaN
7 117.0 NaN NaN NaN NaN NaN 72.0 NaN

Convert dataframe of floats to integers in pandas?

How do I convert every numeric element of my pandas dataframe to an integer? I have not seen any documentation online for how to do so, which is surprising given Pandas is so popular...
If your data frame contains only numeric columns, simply use astype directly:
df.astype(int)
If not, use select_dtypes first to select the numeric columns:
df.select_dtypes(np.number).astype(int)
>>> df = pd.DataFrame({'col1': [1.,2.,3.,4.], 'col2': [10.,20.,30.,40.]})
>>> df
col1 col2
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
>>> df.astype(int)
col1 col2
0 1 10
1 2 20
2 3 30
3 4 40
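One caveat worth adding: astype(int) raises if the frame contains NaN, since NaN has no integer representation. With missing values, the nullable integer dtype (recent pandas) is one way out:
import numpy as np
import pandas as pd

# hypothetical frame with a missing value
df_nan = pd.DataFrame({'col1': [1.0, np.nan, 3.0]})
print(df_nan.astype('Int64'))  # keeps <NA> instead of raising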
You can use apply for this purpose:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1.0, 20.0), 'B':np.arange(101.0, 120.0)})
print(df)
A B
0 1.0 101.0
1 2.0 102.0
2 3.0 103.0
3 4.0 104.0
4 5.0 105.0
5 6.0 106.0
6 7.0 107.0
7 8.0 108.0
8 9.0 109.0
9 10.0 110.0
10 11.0 111.0
11 12.0 112.0
12 13.0 113.0
13 14.0 114.0
14 15.0 115.0
15 16.0 116.0
16 17.0 117.0
17 18.0 118.0
18 19.0 119.0
df2 = df.apply(lambda a: [int(b) for b in a])
print(df2)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
A better approach is to change the type at the level of each series:
for col in df.columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype('int')
print(df)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
Try this:
column_types = dict(df.dtypes)
for column in df.columns:
    if column_types[column] == 'float64':
        df[column] = df[column].astype('int')
        # equivalently: df[column] = df[column].apply(lambda x: int(x))

Pandas reindex an index of a multi index in decreasing order of the series values

I have a pandas series with a multi index like:
A 385 0.463120
278 0.269023
190 0.244348
818 0.232505
64 0.199640
B 1889 0.381681
1568 0.284957
1543 0.259003
1950 0.241432
1396 0.197692
C 2485 0.859803
2980 0.823075
2588 0.774576
2748 0.613309
2055 0.607444
E 3081 0.815492
3523 0.666928
3638 0.628147
3623 0.554344
3400 0.506123
I'd like to reindex the second index like this with pandas:
A 1 0.463120
2 0.269023
3 0.244348
4 0.232505
5 0.199640
B 1 0.381681
2 0.284957
3 0.259003
4 0.241432
5 0.197692
C 1 0.859803
2 0.823075
3 0.774576
4 0.613309
5 0.607444
E 1 0.815492
2 0.666928
3 0.628147
4 0.554344
5 0.506123
I.e., within a single value of the first index, the second index should increase as the values of the series decrease.
Is there a way to do so just using pandas?
You can use pandas.core.groupby.GroupBy.cumcount:
# create example data
df = pd.DataFrame({'a': list(pd.util.testing.rands_array(1, 4, dtype='O')) * 5,
                   'b': np.random.rand(20) // .1,
                   'c': np.random.rand(20) // .01})
df.set_index(['a','b'], inplace=True)
df = df.sort_values(['a','c'], ascending=[True, False])
df['x'] = df.groupby('a').cumcount() + 1
df = df.reset_index().set_index(['a','x'])
returns
b c
a x
a 1 5.0 89.0
2 4.0 84.0
3 2.0 83.0
4 3.0 41.0
5 4.0 30.0
k 1 7.0 70.0
2 7.0 64.0
3 9.0 46.0
4 6.0 16.0
5 4.0 8.0
p 1 5.0 71.0
2 7.0 70.0
3 6.0 54.0
4 0.0 16.0
5 7.0 1.0
w 1 6.0 61.0
2 2.0 57.0
3 3.0 53.0
4 6.0 38.0
5 0.0 22.0
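Applied to the question's series directly (call it s; it is already sorted descending within each letter), the same idea could look like this sketch:
# hypothetical: s is the MultiIndex series from the question
ranks = s.groupby(level=0).cumcount() + 1
s.index = pd.MultiIndex.from_arrays(
    [s.index.get_level_values(0), ranks], names=s.index.names
)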

Plotting a heatmap for trajectory data from a pandas dataframe

I have a dataframe in pandas containing information that I would like to display as a heatmap of sorts. The dataframe holds the x and y co-ordinates of several objects at varying points in time and includes other information in extra columns (e.g. mass).
time object x y mass
3 1.0 216 12 12
4 1.0 218 13 12
5 1.0 217 12 12
6 1.0 234 13 13
1 2.0 361 289 23
2 2.0 362 287 22
3 2.0 362 286 22
5 3.0 124 56 18
6 3.0 126 52 17
I would like to create a heatmap with the x and y values corresponding to the x and y axes of the heatmap. The greater the number of objects at a particular x/y location, the more intense I would like the color to be. Any ideas on how you would accomplish this?
One idea is to use a seaborn heatmap. First pivot the dataframe into the grid you want to plot, in this case x by y with mass as the values:
In [4]: df
Out[4]:
time object x y mass
0 3 1.0 216 12 12
1 4 1.0 218 13 12
2 5 1.0 217 12 12
3 6 1.0 234 13 13
4 1 2.0 361 289 23
5 2 2.0 362 287 22
6 3 2.0 362 286 22
7 5 3.0 124 56 18
8 6 3.0 126 52 17
In [5]: d = df.pivot(index='x', columns='y', values='mass')
In [6]: d
Out[6]:
y 12 13 52 56 286 287 289
x
124 NaN NaN NaN 18.0 NaN NaN NaN
126 NaN NaN 17.0 NaN NaN NaN NaN
216 12.0 NaN NaN NaN NaN NaN NaN
217 12.0 NaN NaN NaN NaN NaN NaN
218 NaN 12.0 NaN NaN NaN NaN NaN
234 NaN 13.0 NaN NaN NaN NaN NaN
361 NaN NaN NaN NaN NaN NaN 23.0
362 NaN NaN NaN NaN 22.0 22.0 NaN
Then you can apply a simple heatmap with:
import seaborn as sns
ax = sns.heatmap(d)
The result is a heatmap over the pivoted grid. If you need a more complex attribute than the single mass column, add a new column to the original dataframe and pivot on that. The seaborn documentation has samples for defining colormaps, styles, etc.
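Since the question asks for intensity by the number of objects at each (x, y) rather than by mass, a count aggregation is probably closer to the goal. A minimal sketch, assuming the same df:
# count how many rows fall on each (x, y) cell instead of plotting mass
counts = df.pivot_table(index='x', columns='y', values='object', aggfunc='count')
ax = sns.heatmap(counts)
For continuous coordinates you would bin first, e.g. with np.histogram2d, instead of using the raw values as categories.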
