how to merge multiple datasets with differences in merge-index strings? - python

Hello, I am struggling to find a solution to what is probably a very common problem.
I want to merge two CSV files with soccer data. They basically store different data about the same games. Normally I would do this with .merge, but the problem is that the nomenclature differs for some teams between the two datasets. For example, Manchester City is called Man. City in the second data frame.
Here's roughly what df1 and df2 look like:
df1:
team1            team2    date                 some_value_i_want_to_compare
Manchester City  Arsenal  2022-05-20 22:00:00  0.28125
df2:
team1            team2    date                 some_value_i_want_to_compare
Man. City        Arsenal  2022-05-20 22:00:00  0.28123
Note that in the above case only team1 differs, but there could also be cases where team2 differs slightly. For example, Arsenal could be called FC Arsenal in the second dataset.
So my main question is: how could I automatically analyse the naming differences between the two datasets?
My second question is: how do I scale this to more than two datasets, so that the number of datasets ultimately doesn't matter?

As commenters and the existing answer have suggested, if the number of unique names is not too large, you can manually extract the mismatches and correct them. That is probably the best solution unless the number of mismatches is very large.
Another case which can occur is when you have a ground-truth list of allowed names (for example, the list of all soccer teams in a given league), but the data may contain many different attempts at spelling or abbreviating each team. If this is similar to your situation, you can use difflib to search for the most likely match for a given name. For example:
import difflib
true_names = ['Manchester United', 'Chelsea']
mismatch_names = ['Man. Unites', 'Chlsea', 'Chelsee']
# n=1 keeps only the single closest match for each misspelled name
best_matches = [difflib.get_close_matches(x, true_names, n=1) for x in mismatch_names]
for old, new in zip(mismatch_names, best_matches):
    print(f"Best match for {old} is {new[0]}")
output:
Best match for Man. Unites is Manchester United
Best match for Chlsea is Chelsea
Best match for Chelsee is Chelsea
Note that if the spelling is very bad, you can ask difflib for the closest n matches using the n= keyword argument. This can help reduce manual data-cleaning work, although some of it is often unavoidable.
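To tie this back to the merge and to your second question: one possible sketch (the true_names list, the team1/team2 column names, and merging on date are assumptions taken from your example) maps every dataframe's names onto the canonical list first, then reduces over any number of dataframes:
import difflib
import functools

def canonicalize(name, true_names):
    # return the closest canonical spelling, or the name unchanged if nothing is close
    match = difflib.get_close_matches(name, true_names, n=1)
    return match[0] if match else name

true_names = ['Manchester City', 'Arsenal']  # hypothetical canonical list
dfs = [df1, df2]  # works for any number of dataframes
for df in dfs:
    for col in ['team1', 'team2']:
        df[col] = df[col].apply(lambda x: canonicalize(x, true_names))

# merge all dataframes on the now-consistent keys
merged = functools.reduce(
    lambda left, right: left.merge(right, on=['team1', 'team2', 'date']),
    dfs)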
Hope it helps.

You could start by doing an anti-join to isolate the ones that don't match:
# Merge the two team datasets
teams_join = df1.merge(df2, on='team1', how='left', indicator=True)
# Select the team1 column where _merge is left_only
team_list = teams_join.loc[teams_join['_merge'] == 'left_only', 'team1']
# Print team names in df1 with no match in df2
print(df1[df1["team1"].isin(team_list)])
This will give you all the teams in df1 without a match in df2. You can do the same for df2 (just swap df1 and df2 in the code above). Then you can take the two lists of names that don't match and rename them manually, if there are few enough of them.
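If you prefer a single pass, the same indicator trick with an outer merge surfaces the unmatched names on both sides at once (a sketch, assuming team1 is the only key, as above):
both = df1.merge(df2, on='team1', how='outer', indicator=True)
print(both.loc[both['_merge'] == 'left_only', 'team1'].unique())   # only in df1
print(both.loc[both['_merge'] == 'right_only', 'team1'].unique())  # only in df2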

Related

how search on a specific rows of a dataframe to find a match in a second dataframe?

I have a large dataset and I want to do the following task efficiently. Suppose we have two data frames. For each element in df2, I want to search df1, but only in rows where the first two letters match; among those, the entry with the most tokens in common is chosen.
Let's see in an example:
df1:
name           first_two
common work    co
summer hot     su
apple          ap
colorful fall  co
support it     su
could comp     co
df2:
name               first_two
condition work it  co
common mistakes    co
could comp work    co
summer             su
appears            ap
Take the first row of df2 as an example (condition work it). I want to find the row in df1 that has the same first_two and the most tokens in common.
The first_two of condition work it is co, so I search the df1 rows where first_two is co: common work, colorful fall, could comp. Since condition work it has one token (work) in common with common work, common work is selected.
output:
df2:
name               first_two  match
condition work it  co         `common work`
common mistakes    co         `common work`
could comp work    co         `could comp`
summer             su         `summer hot`
appears            ap         None
The last row is None since there is no token in common between appears and apple.
I did the following:
df3 = (df1.groupby(['first_two'])
          .agg({'name': lambda x: ",".join(x)})
          .reset_index())
merge_ = df3.merge(df2, on='first_two', how='inner')
But now I have to search name_x for each name_y. How do I find the element of name_x that has the most tokens in common with name_y?
You have pretty much explained the most efficient method already:
1. Extract the first two letters using .str[:2] on the series and assign the result to a new column in both dataframes.
2. Extract the unique values of the two-letter column from df2.
3. Inner join the result of step 2 onto df1.
4. Group the result of step 3 by the two-letter column, count, sort descending on the count, and drop duplicates to keep the most repeated name per two-letter value.
5. Left join the result of step 4 onto df2.
A runnable sketch of the matching itself follows below.
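For completeness, here is a minimal sketch of the prefix-then-token-overlap matching described in the question (the name and first_two column names come from the question; splitting tokens on whitespace is an assumption):
import pandas as pd

df1 = pd.DataFrame({'name': ['common work', 'summer hot', 'apple',
                             'colorful fall', 'support it', 'could comp']})
df2 = pd.DataFrame({'name': ['condition work it', 'common mistakes',
                             'could comp work', 'summer', 'appears']})
for df in (df1, df2):
    df['first_two'] = df['name'].str[:2]  # step 1: two-letter prefix

def best_match(name, first_two, candidates):
    # restrict df1 to rows sharing the prefix, then count common tokens
    pool = candidates.loc[candidates['first_two'] == first_two, 'name']
    tokens = set(name.split())
    overlaps = {c: len(tokens & set(c.split())) for c in pool}
    if not overlaps or max(overlaps.values()) == 0:
        return None  # no candidate shares a token
    return max(overlaps, key=overlaps.get)

df2['match'] = [best_match(n, ft, df1) for n, ft in zip(df2['name'], df2['first_two'])]
print(df2)
This reproduces the expected output above, including None for appears.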

Find similarity between two dataframes, row by row

I have two dataframes, df1 and df2 with the same columns. I would like to find similarity between these two datasets. I have been following one of these two approaches.
The first one was to append one of the two dataframes to the other one and selecting duplicates:
df = pd.concat([df1, df2], join='inner')
mask = df.Check.duplicated(keep=False)
df[mask]  # it gives me duplicated rows
The second one is to consider a threshold value which, for each row from df1, finds a potential match in rows in df2.
Sample data (please note that the datasets have different lengths):
For df1
Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football
and for df2
Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy?
Trump did not win the last presidential election
In order to do this, I am using the following function:
from fuzzywuzzy import fuzz

def check(df1, thres, col):
    # flag rows of df1 whose Check value is similar enough to col
    matches = df1.apply(lambda row: (fuzz.ratio(row['Check'], col) / 100.0) >= thres, axis=1)
    return [df1.Check[i] for i, x in enumerate(matches) if x]
This should allow me to find rows which match.
The problem with the second approach (the one I am most interested in) is that it does not actually take both dataframes into account.
My expected result from the first approach would be two dataframes, one for df1 and one for df2, each having an extra column listing the similar rows found in the other dataframe; from the second approach, I should only assign a similarity value to them (I should have as many columns as the number of rows).
Please let me know if you need any further information or more code.
Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.
Any suggestion is welcome.
Expected output:
(it is an example; since I am setting a threshold the output may change)
df1
Check sim
how to join to first row []
large data work flows [small data work flows]
I have two dataframes []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here? []
add language identifier []
my dad loves watching football []
df2
Check sim
small data work flows [large data work flows]
I have tried to puzze out an answer []
mix grammatical or spelling errors [fix grammatical or spelling errors]
indent code by 2 spaces [indent code by 4 spaces]
indent code by 8 spaces [indent code by 4 spaces]
put returns between paragraphs []
add curry on the chicken curry []
mom!! mom!! mom!! []
create code fences with backticks []
are you crazy? []
Trump did not win the last presidential election []
I think your fuzzywuzzy solution is pretty good. I've implemented what you're looking for below. Note that this grows as len(df1)*len(df2), so it is pretty memory intensive, but at least it should be reasonably clear. You may find the output of gen_scores useful as well.
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

def gen_scores(df1, df2):
    # generates a score for all row combinations between the dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(
        lambda row: fuzz.ratio(row["c1"], row["c2"]) / 100.0, axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as a list to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2": "matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1": "matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])

    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)
output:
Check matches
0 how to join to first row []
1 large data work flows [small data work flows]
2 I have two dataframes []
3 fix grammatical or spelling errors [mix gramma... [mix grammatical or spelling errors]
4 indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spa...
5 why are you posting here? [are you crazy?]
6 add language identifier []
7 my dad loves watching football []

How do I systematically compare all rows in two Pandas dataframes using specific columns and return the differences?

I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable, but not all. The rows represent individuals and the columns contain data about those individuals. I want to check certain column values of one dataframe, row by row, against the 'master' dataframe and then return the omissions so these can be added to it.
I have been using the df.query() method to check individual cases using my own inputs because I can search the master dataframe using multiple columns - so, something like df.query('surname == "JONES" and initials == "D V" and town == "LONDON"'). I want to do something like this but by creating a query of each row of the other dataframe using data from specific columns.
I think I can work out how I might do this using for loops and if statements but that's going to be wildly inefficient and obviously not ideal. List comprehension might be more efficient but I can't work out the dataframe comparison part unless I create a new column whose data is built from the values I want to compare (JONES-DV-LONDON, but that seems wrong).
I think there is an answer on here somewhere, but it relies on the dataframes being more or less identical (which mine aren't, hence my wish to compare only certain columns).
I have been unable to find an example of someone doing the same, which might be my failure again. I am a novice and I have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...
EDIT - some sample data (not exactly what I'm using but hopefully helps show what I am trying to do)
df1 (my master list)
surname initials town
JONES D V LONDON
DAVIES H G BIRMINGHAM
df2
surname initials town
DAVIES H G BIRMINGHAM
HARRIS P J SOUTHAMPTON
JONES D V LONDON
I would like to identify the columns to use in the comparison (surname, initials, town here - assume there are more with data that cannot be matched) and then return the unique results from df2 - so in this case a dataframe containing:
surname initials town
HARRIS P J SOUTHAMPTON
Define the columns to join on:
cols = ['surname', 'initials', 'town']
Then you can use merge with the parameter indicator=True, which records where each row appears (left_only, right_only or both):
df_res = df1.merge(df2, 'outer', on=cols, indicator=True)
Then exclude the rows that appear in both dataframes:
df_res = df_res[df_res['_merge'] != 'both']
print(df_res)
  surname initials         town      _merge
2  HARRIS      P J  SOUTHAMPTON  right_only
You can then filter by left_only or right_only.
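To get exactly the expected output above (the rows unique to df2), keep the right_only rows and drop the indicator column:
# rows present only in df2: the omissions to add to the master list
missing = df_res[df_res['_merge'] == 'right_only'].drop(columns='_merge')
print(missing)
  surname initials         town
2  HARRIS      P J  SOUTHAMPTON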

Merge 2 data frames based on matching rows of 2 columns with Pandas

I have a very important problem that needs to be solved for a project!
So I have 2 data frames that look like these ones:
The first Dataframe is:
Date Winner Loser Tournament
2007-01-01 Luczak P. Hrbaty D. Grandslam
2007-01-02 Serra F. Johansson J. Grandslam
2007-01-02 ...... ......
The second Dataframe is:
Date Winner Loser Tournament
2007-05-28 Federer R. Russel M. Grandslam
2007-05-28 Ascione T. Cilic M. Grandslam
2007-05-28 ...... ......
The two data frames have the same number of rows, corresponding to the same matches from the same period, even though the first one starts from 2007-01-01 and the other from 2007-05-28. I checked this by looking at the Excel files from which I built the two data frames (they come from different sources).
The problem is that one dataframe (the first one) gives the exact date of each match, while the other (the second one) sets the date of each row to the starting date of the tournament, not the exact date the match was played. So I cannot merge the two data frames on the Date values.
However, I know for sure that the Winner/Loser pair in each row is the same, so I want to merge the two data frames on the rows where Winner and Loser match.
Does anybody know how I can do this? Thanks in advance!
You can do it with merge_asof, which (with the default backward direction) matches each df1 row to the most recent df2 date at or before it, within each Winner/Loser pair:
df = pd.merge_asof(df1.sort_values('Date'),
                   df2.sort_values('Date'),
                   on='Date', by=['Winner', 'Loser'])
Alternatively, since the Winner/Loser pairs identify the matches, a plain inner join on those columns also works:
df = pd.merge(df1, df2, how='inner',
              left_on=['Winner', 'Loser'], right_on=['Winner', 'Loser'])

Indexing by Character in a Column of a Python/Pandas DataFrame

I am working on a project in which I scraped NBA data from ESPN and created a DataFrame to store it. One of the columns of my DataFrame is Team. Certain players that have been traded within a season have a value such as LAL/LAC under team, rather than just having one team name like LAL. With these rows of data, I would like to make 2 entries instead of one. Both entries would have the same, original data, except for 1 of the entries the team name would be LAL and for the other entry the team name would be LAC. Some team abbreviations are 2 letters while others are 3 letters.
I have already managed to create a separate DataFrame with just these rows of data that have values in the form team1/team2. I figured a good way of getting the data the way I want it would be to first copy this DataFrame with the multiple team entries, and then with one DataFrame, keep everything in the Team column up until the /, and with the other, keep everything in the Team column after the slash. I'm not quite sure what the code would be for this in the context of a DataFrame. I tried the following but it is invalid syntax:
first_team = first_team['TEAM'].str[:first_team[first_team['TEAM'].index("/")]]
where first_team is my DataFrame with just the entries with multiple teams. Perhaps this can give you a better idea of what I'm trying to accomplish!
Thanks in advance!
You're probably better off using split first to separate the teams into columns (also see Pandas DataFrame, how do i split a column into two), something like this:
d = pd.DataFrame({'player': ['jordan', 'johnson'], 'team': ['LAL/LAC', 'LAC']})
pd.concat([d, pd.DataFrame(d.team.str.split('/').tolist(), columns=['team1', 'team2'])], axis=1)
    player     team team1 team2
0   jordan  LAL/LAC   LAL   LAC
1  johnson      LAC   LAC  None
Then if you want separate rows, you can use append.
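Note that DataFrame.append is deprecated in recent pandas versions (and removed in 2.0), so as an alternative sketch for getting one row per team, building on the d frame above, you can split and then explode:
# one row per team: split on '/' into lists, then explode the lists into rows
d_rows = d.assign(team=d['team'].str.split('/')).explode('team', ignore_index=True)
print(d_rows)
    player team
0   jordan  LAL
1   jordan  LAC
2  johnson  LAC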
