Grouping together lists in pandas - python

I have a database of patents citing other patents, which looks like this:
{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
'docdb_family_id': {0: 57904406,
1: 57904406,
2: 57906556,
12: 57909419,
21: 57942222},
'cited_docdbs': {0: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
1: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
2: [6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238],
12: [6293366,
7856452,
16980051,
23177359,
26477802,
27453602,
41135094,
53004244,
54332594,
55018863],
21: [7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]},
'doc_std_name': {0: 'SEEO INC',
1: 'BOSCH GMBH ROBERT',
2: 'SAMSUNG SDI CO LTD',
12: 'NAGAI TAKAYUKI',
21: 'SAMSUNG SDI CO LTD'}}
Now, what I would like to do is perform a groupby by firm as follows:
df_grouped_byfirm = data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id', "unique")).reset_index()
but merging together the lists of cited_docdbs. For instance, in the example above, the final list of cited_docdbs for SAMSUNG SDI CO LTD should become one big list in which the cited docdbs of both of its docdb_family_ids are merged together:
[6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238,
7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]
Thank you

You can flatten the nested lists and pass them through dict.fromkeys to remove duplicates while keeping the original order (dicts preserve insertion order in Python 3.7+):
f = lambda x: list(dict.fromkeys(z for y in x for z in y))
df = df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs', f))
print(df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
If order is not important, use a set to remove duplicates:
f = lambda x: list(set(z for y in x for z in y))
df = df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs', f))
print(df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [19700609, 19198211, 19198340, 44947081, 19198...
NAGAI TAKAYUKI [27453602, 7856452, 26477802, 23177359, 550188...
SAMSUNG SDI CO LTD [48222848, 18834564, 42075013, 58800268, 18704...
SEEO INC [19700609, 19198211, 19198340, 44947081, 19198...

You can just use sum in agg to concatenate the lists within each group (note that, unlike the previous answer, this keeps duplicates).
df.groupby("doc_std_name").agg({"cited_docdbs": sum}).reset_index()
This will give the following:
doc_std_name cited_docdbs
0 BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
1 NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
2 SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
3 SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
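One caveat worth knowing about the sum approach: adding two Python lists builds a brand-new list each time, so concatenating this way is quadratic in the total number of cited docdbs and can get slow on a large frame. A flat list comprehension does the same concatenation in linear time; a minimal sketch, equivalent to the sum version above (duplicates are likewise kept):
f = lambda lists: [item for lst in lists for item in lst]
df.groupby("doc_std_name").agg({"cited_docdbs": f}).reset_index()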

Related

Compare one column (df1) with two columns (df2) with python

I would like to compare one column (df1) with two columns (df2).
df1
name area
Cody California
Billy Connecticut
Jeniffer Indiana
Franc Georgia
Mark Illinois
Tamis Connecticut
Danye Illinois
Leesa Indiana
Hector Illinois
Coy California
df2
name1   name2     points
Billy   NA        20
Cody    NA        27.5
Coy     NA        25
Danye   NA        21
Franc   NA        19
NA      Hector    40
NA      Jeniffer  30
NA      Leesa     20
NA      Mark      50
NA      Tamis     90
Output
name area points
Cody California 27.5
Billy Connecticut 20
Jeniffer Indiana 30
Franc Georgia 19
Mark Illinois 50
Tamis Connecticut 90
Danye Illinois 21
Leesa Indiana 20
Hector Illinois 40
Coy California 25
You could try as follows:
import pandas as pd
import numpy as np
data = {'name': {0: 'Cody', 1: 'Billy', 2: 'Jeniffer', 3: 'Franc', 4: 'Mark',
5: 'Tamis', 6: 'Danye', 7: 'Leesa', 8: 'Hector', 9: 'Coy'},
'area': {0: 'California', 1: 'Connecticut', 2: 'Indiana', 3: 'Georgia',
4: 'Illinois', 5: 'Connecticut', 6: 'Illinois', 7: 'Indiana',
8: 'Illinois', 9: 'California'}}
df = pd.DataFrame(data)
data2 = {'name1': {0: 'Billy', 1: 'Cody', 2: 'Coy', 3: 'Danye', 4: 'Franc',
5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan},
'name2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: 'Hector',
6: 'Jeniffer', 7: 'Leesa', 8: 'Mark', 9: 'Tamis'},
'points': {0: 20.0, 1: 27.5, 2: 25.0, 3: 21.0, 4: 19.0, 5: 40.0,
6: 30.0, 7: 20.0, 8: 50.0, 9: 90.0}}
df2 = pd.DataFrame(data2)
# fill NaNs in `name2` based on `name1`
df2['name2'] = df2['name2'].fillna(df2['name1'])
# merge dfs
df_new = df.merge(df2[['name2','points']], left_on='name', right_on='name2')
print(df_new)
name area points
0 Cody California 27.5
1 Billy Connecticut 20.0
2 Jeniffer Indiana 30.0
3 Franc Georgia 19.0
4 Mark Illinois 50.0
5 Tamis Connecticut 90.0
6 Danye Illinois 21.0
7 Leesa Indiana 20.0
8 Hector Illinois 40.0
9 Coy California 25.0
Alternatively, instead of merge, you could use map to add the points column to your first df:
df['points'] = df['name'].map(df2.set_index('name2')['points'])
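A brief note on this variant: Series.map looks the values up through the index, so df2['name2'] must be unique for set_index to give an unambiguous mapping; pandas will refuse the lookup on duplicate labels. If uniqueness isn't guaranteed upstream, a quick guard before the map (my addition, not part of the original answer) makes the assumption explicit:
assert df2['name2'].is_unique, "name2 must be unique to use it as a lookup index"
With that guarantee, map is a lightweight alternative to merge when you only need to bring over a single column.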

Sequentially extract top n from group depending on group value

I am working on calculating some football stats.
I have the following dataframe:
{'Player': {8: 'Darrel Williams', 2: 'Mark Ingram', 3: 'Michael Carter', 4: 'Najee Harris', 10: 'James Conner', 0: 'Buffalo Bills', 15: 'Davante Adams', 1: 'Aaron Rodgers', 5: 'Tyler Bass', 11: 'Corey Davis', 6: 'Van Jefferson', 14: 'Matt Ryan', 7: 'T.J. Hockenson', 9: 'Antonio Brown', 12: 'Alvin Kamara', 13: 'Tyler Boyd'}, 'Position': {8: 'RB', 2: 'RB', 3: 'RB', 4: 'RB', 10: 'RB', 0: 'DEF', 15: 'WR', 1: 'QB', 5: 'K', 11: 'WR', 6: 'WR', 14: 'QB', 7: 'TE', 9: 'WR', 12: 'RB', 13: 'WR'}, 'Score': {8: 24.9, 2: 18.8, 3: 16.2, 4: 15.3, 10: 13.9, 0: 12.0, 15: 11.3, 1: 10.48, 5: 9.0, 11: 8.8, 6: 6.9, 14: 1.68, 7: 0.0, 9: 0.0, 12: 0.0, 13: 0.0}}
Player            Position  Score
Darrel Williams   RB        24.9
Mark Ingram       RB        18.8
Michael Carter    RB        16.2
Najee Harris      RB        15.3
James Conner      RB        13.9
Buffalo Bills     DEF       12
Davante Adams     WR        11.3
Aaron Rodgers     QB        10.48
Tyler Bass        K         9
Corey Davis       WR        8.8
Van Jefferson     WR        6.9
Matt Ryan         QB        1.68
T.J. Hockenson    TE        0
Antonio Brown     WR        0
Alvin Kamara      RB        0
Tyler Boyd        WR        0
What I am looking to do, given the following requirements_dictionary, is to extract the top rows by Score for each key (Position in the dataframe), taking as many rows as the dictionary value for that key:
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
What makes this challenging is the final key, FLEX, which matches no position in the dataframe directly, because a FLEX slot can be filled by any of RB, WR, or TE.
Final output should look like:
Player            Position  Score
Darrel Williams   RB        24.9
Mark Ingram       RB        18.8
Michael Carter    RB        16.2
Najee Harris      RB        15.3
Buffalo Bills     DEF       12
Davante Adams     WR        11.3
Aaron Rodgers     QB        10.48
Tyler Bass        K         9
Corey Davis       WR        8.8
T.J. Hockenson    TE        0
That is the top 2 RB, 1 QB, 2 WR, 1 TE, 1 K, 1 DEF, and 2 FLEX (both FLEX slots go to RBs here, since those have the highest remaining scores).
I have tried the following code which gets me close:
all_points.groupby('Position')['Score'].nlargest(2)
Position
DEF 0 12.00
K 5 9.00
QB 1 10.48
14 1.68
RB 8 24.90
2 18.80
TE 7 0.00
WR 15 11.30
11 8.80
Name: Score, dtype: float64
However, that does not account for the FLEX "position".
I could alternatively loop through the dataframe and do this manually, but that seems very intensive.
How can I achieve the intended result?
Create a custom function that selects a number of players for each group according to your requirements, and keep this index as idx_best. Then exclude all already-selected players and select FLEX additional players as idx_flex. Finally, take the union of these two indexes.
FLEX = requirements_dictionary['FLEX']
select_players = lambda x: x.nlargest(requirements_dictionary[x.name])
idx_best = df.groupby('Position')['Score'].apply(select_players).index.levels[1]
idx_flex = df.loc[df.index.difference(idx_best), 'Score'].nlargest(FLEX).index
out = df.loc[idx_best.union(idx_flex)].sort_values('Score', ascending=False)
Output:
>>> out
Player Position Score
8 Darrel Williams RB 24.90
2 Mark Ingram RB 18.80
3 Michael Carter RB 16.20
4 Najee Harris RB 15.30
0 Buffalo Bills DEF 12.00
15 Davante Adams WR 11.30
1 Aaron Rodgers QB 10.48
5 Tyler Bass K 9.00
11 Corey Davis WR 8.80
7 T.J. Hockenson TE 0.00
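A small robustness note on the idx_best line (my observation, not part of the original answer): .index.levels[1] happens to contain exactly the selected row labels here because the MultiIndex is freshly built by apply, but .index.get_level_values(1) states the intent directly and avoids surprises from unused level values. A sketch of that variant:
idx_best = df.groupby('Position')['Score'].apply(select_players).index.get_level_values(1)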
Use the requirements dictionary to get the rows matching each position, then sort by Score and take the head equal to the dictionary value for that position. FLEX is the top 2 across the RB, WR, and TE positions; I concatenate the FLEX results onto the others. My solution is more intuitive and easier to follow.
txt="""Player,Position,Score
Darrel Williams,RB,24.9
Mark Ingram,RB,18.8
Michael Carter,RB,16.2
Najee Harris,RB,15.3
Buffalo Bills,DEF,12
Davante Adams,WR,11.3
Aaron Rodgers,QB,10.48
Tyler Bass,K,9
Corey Davis,WR,8.8
T.J. Hockenson,TE,0"""
df = pd.read_csv(io.StringIO(txt),sep=',')
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
#print(df)
df_top_rows = pd.DataFrame()
for position in requirements_dictionary.keys():
df_top_rows = df_top_rows.append(df[df['Position'] == position].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
print(df_top_rows)
position='FLEX'
df_flex_rows = df_top_rows.append(df[df['Position'].isin(['RB','WR','TE'])].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
#print(df_flex_rows)
df_result=pd.concat([df_top_rows,df_flex_rows],axis=0)
df_result.drop_duplicates(inplace=True)
print(df_result)
Output:
Player Position Score
6 Aaron Rodgers QB 10.48
0 Darrel Williams RB 24.90
1 Mark Ingram RB 18.80
5 Davante Adams WR 11.30
8 Corey Davis WR 8.80
9 T.J. Hockenson TE 0.00
7 Tyler Bass K 9.00
4    Buffalo Bills      DEF  12.00

Extracting country information from description using geograpy

PROBLEM: I want to extract country information from a user description. So far, I'm giving the geograpy package a try. I like its behavior when the input is not very clear, for example Evesham or Rochdale; however, the package interprets some strings like Zaragoza, Spain as two mentions even though the user is clearly saying that the location is in Spain. Also, I don't know why amsterdam does not yield Holland as output... How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?
DATA: My data example is:
user_location
2 Socialist Republic of Alachua
3 Hérault, France
4 Gwalior, India
5 Zaragoza,España
7 amsterdam
8 Evesham
9 Rochdale
I want to get something like this:
user_location country
2 Socialist Republic of Alachua ['USSR', 'United States']
3 Hérault, France ['France']
4 Gwalior, India ['India']
5 Zaragoza,España ['Spain']
7 amsterdam ['Holland']
8 Evesham ['United Kingdom']
9 Rochdale ['United Kingdom', 'United States']
REPREX:
import pandas as pd
import geograpy  # installed by the geograpy3 package, which exposes the geograpy module
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#> user_location country
#> 2 Socialist Republic of Alachua [USSR, Union of Soviet Socialist Republics, Al...
#> 3 Hérault, France [France, Hérault]
#> 4 Gwalior, India [British Indian Ocean Territory, Gwalior, India]
#> 5 Zaragoza,España [Zaragoza, España, Spain, El Salvador]
#> 7 amsterdam []
#> 8 Evesham [Evesham, United Kingdom]
#> 9 Rochdale [Rochdale, United Kingdom, United States]
Created on 2020-06-02 by the reprexpy package
geograpy3 was no longer behaving correctly for country lookup, since it didn't check whether pycountry returned None. As a committer, I just fixed this.
I have added your example, slightly modified to avoid the pandas import, as a unit test case:
def testStackoverflow62152428(self):
    '''
    see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
    '''
    examples = {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
    for index, text in examples.items():
        places = geograpy.get_geoPlace_context(text=text)
        print("example %d: %s" % (index, places.countries))
and the result is now:
example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
Indeed, there is room for improvement for example 5. I have added an issue at https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned...

How to calculate a win streak in Python/Pandas

I'm trying to calculate the win streak or losing streak going into a game. My goal is to generate a betting decision based on these streak factors or a recent record. I am new to Python and pandas (and programming in general), so any detailed explanation of what the code does would be welcome.
Here's my data
Season Game Date Game Index Away Team Away Score Home Team Home Score Winner Loser
0 2014 Regular Season Saturday, March 22, 2014 2014032201 Los Angeles Dodgers 3 Arizona D'Backs 1 Los Angeles Dodgers Arizona D'Backs
1 2014 Regular Season Sunday, March 23, 2014 2014032301 Los Angeles Dodgers 7 Arizona D'Backs 5 Los Angeles Dodgers Arizona D'Backs
2 2014 Regular Season Sunday, March 30, 2014 2014033001 Los Angeles Dodgers 1 San Diego Padres 3 San Diego Padres Los Angeles Dodgers
3 2014 Regular Season Monday, March 31, 2014 2014033101 Seattle Mariners 10 Los Angeles Angels 3 Seattle Mariners Los Angeles Angels
4 2014 Regular Season Monday, March 31, 2014 2014033102 San Francisco Giants 9 Arizona D'Backs 8 San Francisco Giants Arizona D'Backs
5 2014 Regular Season Monday, March 31, 2014 2014033103 Boston Red Sox 1 Baltimore Orioles 2 Baltimore Orioles Boston Red Sox
6 2014 Regular Season Monday, March 31, 2014 2014033104 Minnesota Twins 3 Chicago White Sox 5 Chicago White Sox Minnesota Twins
7 2014 Regular Season Monday, March 31, 2014 2014033105 St. Louis Cardinals 1 Cincinnati Reds 0 St. Louis Cardinals Cincinnati Reds
8 2014 Regular Season Monday, March 31, 2014 2014033106 Kansas City Royals 3 Detroit Tigers 4 Detroit Tigers Kansas City Royals
9 2014 Regular Season Monday, March 31, 2014 2014033107 Colorado Rockies 1 Miami Marlins 10 Miami Marlins Colorado Rockies
Dictionary below:
{'Away Score': {0: 3, 1: 7, 2: 1, 3: 10, 4: 9},
'Away Team': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'Los Angeles Dodgers',
3: 'Seattle Mariners',
4: 'San Francisco Giants'},
'Game Date': {0: 'Saturday, March 22, 2014',
1: 'Sunday, March 23, 2014',
2: 'Sunday, March 30, 2014',
3: 'Monday, March 31, 2014',
4: 'Monday, March 31, 2014'},
'Game Index': {0: 2014032201,
1: 2014032301,
2: 2014033001,
3: 2014033101,
4: 2014033102},
'Home Score': {0: 1, 1: 5, 2: 3, 3: 3, 4: 8},
'Home Team': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'San Diego Padres',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Loser': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'Los Angeles Dodgers',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Season': {0: '2014 Regular Season',
1: '2014 Regular Season',
2: '2014 Regular Season',
3: '2014 Regular Season',
4: '2014 Regular Season'},
'Winner': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'San Diego Padres',
3: 'Seattle Mariners',
4: 'San Francisco Giants'}}
I've tried looping through the seasons and the teams, and then creating a streak count based on this GitHub project: https://github.com/nhcamp/EPL-Betting/blob/master/EPL%20Match%20Results%20DF.ipynb
I run into key errors early in building my loops, and I have trouble identifying which data I need.
game_table = pd.read_csv('MLB_Scores_2014_2018.csv')

# Get Team List
team_list = game_table['Away Team'].unique()

# Get Season List
season_list = game_table['Season'].unique()

# Defining "chunks" to append game data to the total dataframe
chunks = []
for season in season_list:
    # Looping through seasons. Streaks reset for each season
    season_games = game_table[game_table['Season'] == season]
    for team in team_list:
        # Looping through teams
        season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
        # Setting streak list and streak counter values
        streak_list = []
        streak = 0
        # Looping through each game
        for game in season_team_games.iterrows():
            # Check if team is a winner, and up the streak
            if game_table['Winner'] == team:
                streak_list.append(streak)
                streak += 1
            # If not the winner, append streak and set to zero
            elif game_table['Winner'] != team:
                streak_list.append(streak)
                streak = 0
            # Just in case something weird happens with the scores
            else:
                streak_list.append(streak)
        game_table['Streak'] = streak_list
        chunks.append(game_table)
And that's kind of where I lose it. How do I append separately depending on whether each team is the home team or the away team? Is there a better way to display this data?
As a general matter, I want to add a win-streak and/or losing-streak for each team in each game. Headers would look like this:
| Season | Game Date | Game Index | Away Team | Away Score | Home Team | Home Score | Winner | Loser | Away Win Streak | Away Lose Streak | Home Win Streak | Home Lose Streak |
Edit: the error message below has since been resolved.
I also got an error creating the dataframe season_team_games:
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
The error you are seeing comes from the statement
season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
When you combine two boolean conditions like this, you need to wrap each one in parentheses. This is because the | operator takes precedence over the == operator. So this should become:
season_team_games = season_games[(season_games['Away Team'] == team) | (season_games['Home Team'] == team)]
I know there is more to the question than this error, but as mentioned in the comment, once you provide some text-based data it might be easier to help.
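That said, the streak computation itself can be done without the explicit per-game loop, using a common pandas idiom: mark each game as a win or not, start a new run whenever that flag flips, and count positions within each run. A minimal sketch under the question's own names (season_team_games filtered and sorted chronologically for one team); the final shift is what turns a running streak into the streak going into each game:
won = season_team_games['Winner'] == team
# a new run starts every time the result flips from win to loss or back
run_id = (won != won.shift()).cumsum()
# position within the current run: 1, 2, 3, ...
running = won.groupby(run_id).cumcount() + 1
# win streak going INTO each game: zero out loss runs, then shift one game back
win_streak = running.where(won, 0).shift(1, fill_value=0)
# the losing streak is symmetric: running.where(~won, 0).shift(1, fill_value=0)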

How can I improve performance on my apply() with fuzzy matching statement

I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. It works great, but the performance is pretty bad on larger frames. Please take a look at my apply() that does the extracting/scoring, and let me know if you have any ideas that could speed it up.
import pandas as pd
import numpy as np
import fuzzywuzzy as fw
import fuzzywuzzy.process  # the package __init__ does not import the process submodule, so fw.process needs this
Create a frame of raw data
dfRaw = pd.DataFrame({'City': {0: u'St Louis',
1: 'Omaha',
2: 'Chicogo',
3: 'Kansas city',
4: 'Des Moine'},
'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 3 : 'MO', 4 : 'IA'}})
Which yields
City State
0 St Louis MO
1 Omaha NE
2 Chicogo IL
3 Kansas city MO
4 Des Moine IA
Then a frame that represents the good data that we want to look up
dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
1: u'Omaha',
2: u'Chicago',
3: u'Kansas City',
4: u'Des Moines'},
'State' : {0: 'MO', 1: 'NE', 2 : 'IL',
3 : 'KS', 4 : 'IA'},
u'Zip': {0: '63201', 1: '68104', 2: '60290',
3: '68101', 4: '50301'}})
Which yields
City State Zip
0 Saint Louis MO 63201
1 Omaha NE 68104
2 Chicago IL 60290
3 Kansas City KS 68101
4 Des Moines IA 50301
and now the muzz function. EDIT: Added the choices = right[match_col_name] line and used choices in the apply, per Brenbarn's suggestion. Also per Brenbarn's suggestion, I ran some tests with extractOne() outside the apply, and it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?
def muzz(left, right, on, match_col_name='match_on', score_col_name='score_match',
         right_suffix='_match', score_cutoff=80):
    # build the column to match on by concatenating the `on` columns
    right[match_col_name] = np.sum(right[on], axis=1)
    choices = right[match_col_name]
    ### The offending statement ###
    left[[match_col_name, score_col_name]] = \
        pd.Series(np.sum(left[on], axis=1)).apply(lambda x: pd.Series(
            fw.process.extractOne(x, choices, score_cutoff=score_cutoff)))
    dfTemp = pd.merge(left, right, how='left', on=match_col_name, suffixes=('', right_suffix))
    return dfTemp.drop(match_col_name, axis=1)
Calling muzz
muzz(dfRaw.copy(),dfLocations,on=['City','State'], score_cutoff=85)
Which yields
City State score_match City_match State_match Zip
0 St Louis MO 87 Saint Louis MO 63201
1 Omaha NE 100 Omaha NE 68104
2 Chicogo IL 89 Chicago IL 60290
3 Kansas city MO NaN NaN NaN NaN
4 Des Moine IA 96 Des Moines IA 50301
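If extractOne itself is the bottleneck, one direction worth trying (my suggestion, not something established in this thread) is the rapidfuzz package, which provides the same process.extractOne interface implemented in C++ and is usually much faster, so the surrounding apply can stay untouched. A hedged sketch of the swap:
from rapidfuzz import process as rf_process

choices = (dfLocations['City'] + dfLocations['State']).tolist()
# same call shape as fuzzywuzzy's extractOne; returns (match, score, index) or None below the cutoff
best = rf_process.extractOne('St LouisMO', choices, score_cutoff=85)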
