Group by fuzzy string matches with fuzzywuzzy and groupby - python

I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below:
Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1
My pseudocode would be something like:
import pandas as pd
from fuzzywuzzy import fuzz

minratio = 95
for idx1, name1 in df['Name'].iteritems():
    for idx2, name2 in df['Name'].iteritems():
        ratio = fuzz.WRatio(name1, name2)
        if ratio > minratio:
            grouped = df.groupby(['Name', 'ID'])['Value']\
                        .agg(Total_Value='sum', Group_Size='count')
This would then give me the desired output:
print(grouped)
Name ID Total_Value Group_Size
0 James 1 164 3 # all the James rows grouped
2 Bike 3 1198 2 # Bike and Bicycle grouped
5 Ants 6 54 1
6 Job 7 6 1
7 Michael 8 80017 3 # the Mike and Michael rows grouped
8 Arm 9 47 1
Obviously this doesn't work, and honestly, I am not sure if this is even possible, but this is what I'm trying to accomplish. Any advice that could get me on the right track would be useful.

Using affinity propagation clustering (not perfect but maybe a starting point):
import pandas as pd
import numpy as np
import io
from fuzzywuzzy import fuzz
from scipy import spatial
import sklearn.cluster
s="""Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1"""
df = pd.read_csv(io.StringIO(s), sep=r'\s\s+', engine='python')

names = df.Name.values
# fuzz.WRatio returns a 0-100 similarity score; affinity="precomputed" expects similarities
sim = spatial.distance.pdist(names.reshape((-1, 1)), lambda x, y: fuzz.WRatio(x, y))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", random_state=None)
affprop.fit(spatial.distance.squareform(sim))

res = df.groupby(affprop.labels_).agg(
    Names=('Name', ','.join),
    First_ID=('ID', 'first'),
    Total_Value=('Value', 'sum'),
    Group_Size=('Value', 'count')
)
Result
Names First_ID Total_Value Group_Size
0 James,James 2,James Marsh,Ants,Arm 1 265 5
1 Bike,Bicycle 3 1198 2
2 Job 7 6 1
3 Michael,Mike K,Michael k 8 80017 3
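Affinity propagation also nominates one member of each cluster as its exemplar, which makes a convenient canonical name for the group. A short sketch, assuming the fitted affprop and names from above (Canonical is a hypothetical column name):
# cluster_centers_indices_[k] is the row chosen as exemplar of cluster k
exemplars = {label: names[idx]
             for label, idx in enumerate(affprop.cluster_centers_indices_)}
df['Canonical'] = [exemplars[label] for label in affprop.labels_]
Grouping on Canonical instead of the raw labels yields the same clusters with readable keys.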

Related

joining a table to another table in pandas

I am trying to grab the data from https://www.espn.com/nhl/standings
When I grab it, the Florida Panthers end up one row too high, which messes up the data; all the team names need to be shifted down a row. I have tried to mutate the data with
dataset_one = dataset_one.shift(1)
and then joining with the stats table, but I am getting NaN.
The docs seem to show a lot of ways of joining and merging data with similar column headers, but I am not sure of the best solution here without a shared column header to join on.
Code:
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
combined_data = dataset_one.join(dataset_two)
print(combined_data)
Output:
FLAFlorida Panthers GP W L OTL ... GF GA DIFF L10 STRK
0 CBJColumbus Blue Jackets 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CARCarolina Hurricanes 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 DALDallas Stars 6 5 1 0 ... 18 10 8 5-1-0 W4
3 TBTampa Bay Lightning 6 4 1 1 ... 23 14 9 4-1-1 L2
4 CHIChicago Blackhawks 6 4 1 1 ... 19 14 5 4-1-1 W1
5 NSHNashville Predators 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 DETDetroit Red Wings 8 4 4 0 ... 20 24 -4 4-4-0 L1
Desired:
GP W L OTL ... GF GA DIFF L10 STRK
0 FLAFlorida Panthers 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CBJColumbus Blue Jackets 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 CARCarolina Hurricanes 6 5 1 0 ... 18 10 8 5-1-0 W4
3 DALDallas Stars 6 4 1 1 ... 23 14 9 4-1-1 L2
4 TBTampa Bay Lightning 6 4 1 1 ... 19 14 5 4-1-1 W1
5 CHIChicago Blackhawks 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 NSHNashville Predators 8 4 4 0 ... 20 24 -4 4-4-0 L1
7 DETDetroit Red Wings 10 2 6 2 ... 20 35 -15 2-6-2 L6
Providing an alternative approach to @Noah's answer. You can first add an extra row, shift the df down by one row, and then assign the header value to row 0.
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
# Shifting down by one row
dataset_one.loc[max(dataset_one.index) + 1, :] = None
dataset_one = dataset_one.shift(1)
dataset_one.iloc[0] = dataset_one.columns
dataset_one.columns = ['team']
combined_data = dataset_one.join(dataset_two)
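To see the shift trick in isolation, without hitting the network, here is a minimal self-contained sketch; the one-column frame below fakes what read_html produces when the first team name is absorbed into the header:
import pandas as pd

# Simulated page[0]: the first team name became the column header
dataset_one = pd.DataFrame({'FLAFlorida Panthers': ['CBJColumbus Blue Jackets',
                                                    'CARCarolina Hurricanes']})
dataset_one.loc[len(dataset_one)] = None   # make room for one extra row
dataset_one = dataset_one.shift(1)         # push every team down one slot
dataset_one.iloc[0] = dataset_one.columns  # put the header team back in row 0
dataset_one.columns = ['team']
print(dataset_one)
#                        team
# 0       FLAFlorida Panthers
# 1  CBJColumbus Blue Jackets
# 2    CARCarolina Hurricanes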
Just create the df slightly differently so it knows what the proper header is:
dataset_one = pd.DataFrame(page[0].values, columns=["Team Name"])
(.values is needed here; passing the frame itself with columns=["Team Name"] would try to select a non-existent column and yield all NaN.) Then when you join it should be aligned properly.
Another alternative is to select the single column as a Series first (to_frame only exists on Series):
dataset_one = page[0].iloc[:, 0].to_frame(name='Team Name')

How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s

I have a dataframe with genes (Ensembl IDs and common names), homologs, counts, and totals in groups of three, as such:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000019949 ENSG00000149257
1 serpinh1b SERPINH1
2 2 2 4
3 ENSDARG00000052437 ENSG00000268975
4 mia MIA-RAB4B
5 2 0 2
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000045580 ENSG00000139329
10 lum LUM
11 15 15 30
etc...
I want to sort these rows by the totals in descending order, such that the rows are kept intact in groups of 3 in the order shown. The ideal output would be:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329
1 lum LUM
2 15 15 30
3 ENSDARG00000019949 ENSG00000149257
4 serpinh1b SERPINH1
5 2 2 4
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000052437 ENSG00000268975
10 mia MIA-RAB4B
11 2 0 2
etc...
I tried writing the total into all 3 rows of each clump and then sorting with DataFrame.sort_values(), removing the previous 2 rows of each clump of 3, but it didn't work properly. Is there a way to group the rows together into clumps of 3, then sort them to maintain that structure? Thank you in advance for any assistance.
Update #1
If I try to use the code:
df['Total'] = df['Total'].bfill().astype(int)
df = df.sort_values(by='Total', ascending=False)
to add values to the total for each group of 3 and then sort, it partially works, but scrambles the rows like this:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329 30
1 lum LUM 30
2 15 15 30
4 serpinh1b SERPINH1 4
3 ENSDARG00000019949 ENSG00000149257 4
5 2 2 4
8 0 3 3
7 fstb FST 3
6 ENSDARG00000057992 ENSG00000134363 3
9 ENSDARG00000052437 ENSG00000268975 2
11 2 0 2
10 mia MIA-RAB4B 2
etc...
Even worse, if multiple genes have the same total counts, the rows become interchanged between genes, which gets confusing. Is this a dead end? Maybe I should just rewrite the code a different way :(
You need to create a second key to keep the records together when sorting; see below:
import numpy as np

df['Total'] = df['Total'].bfill()
df['helper'] = np.arange(len(df)) // 3  # same key for all 3 rows of a clump
df = df.sort_values(['Total', 'helper'], ascending=[False, True], kind='stable')
df = df.drop(columns='helper')
It looks like your Total column only has a value on the last row of each clump; the backfill propagates it upward so all three rows share the same sort key.
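To make the grouping key concrete: on the 12-row example the helper values are simply
import numpy as np
print(np.arange(12) // 3)  # [0 0 0 1 1 1 2 2 2 3 3 3]
so the three rows of each clump share a key and travel together through the sort.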
Approach 1
df['Total'] = df['Total'].bfill().astype(int)
df['idx'] = np.arange(len(df)) // 3  # clump id: 0,0,0,1,1,1,...
df = df.sort_values(by=['Total', 'idx'], ascending=False, kind='stable')
df = df.drop(['idx'], axis=1)
Zebrafish_Homolog Human_Homolog Total
9 ENSDARG00000045580 ENSG00000139329 30
10 lum LUM 30
11 15 15 30
0 ENSDARG00000019949 ENSG00000149257 4
1 serpinh1b SERPINH1 4
2 2 2 4
6 ENSDARG00000057992 ENSG00000134363 3
7 fstb FST 3
8 0 3 3
3 ENSDARG00000052437 ENSG00000268975 2
4 mia MIA-RAB4B 2
5 2 0 2
Note how the original index is kept. If you don't want that, reset it:
df = df.reset_index(drop=True)
Approach 2
A more manual way of sorting.
The approach is to sort the index and then loc the df. It looks complicated, but it's really just subtracting ints from a list. Note that nothing happens to the df until the very end, so there should be no speed issue for a larger df.
# Sort by total
df = df.reset_index().sort_values('Total', ascending=False)
# Get the original index of each non-null total (the last row of each clump)
uniq_index = df[df['Total'].notnull()]['index'].values
# Repeat it for all three rows of the clump...
index = uniq_index.repeat(3)
# ...and offset to point at the first, second and third row of each clump
groups = [-2, -1, 0] * (len(df) // 3)
new_index = index + groups
# Apply to the dataframe
df = df.loc[new_index]
Zebrafish_Homolog Human_Homolog Total
9 ENSDARG00000045580 ENSG00000139329 NaN
10 lum LUM NaN
11 15 15 30.0
0 ENSDARG00000019949 ENSG00000149257 NaN
1 serpinh1b SERPINH1 NaN
2 2 2 4.0
6 ENSDARG00000057992 ENSG00000134363 NaN
7 fstb FST NaN
8 0 3 3.0
3 ENSDARG00000052437 ENSG00000268975 NaN
4 mia MIA-RAB4B NaN
5 2 0 2.0

How do I give an error message if a condition is not met?

I'm doing an assignment for a basic programming course.
I have a dataframe (csv-file) containing the columns:
StudentID Name Assignment1 Assignment2 Assignment3
0 s123456 Michael Andersen 7 7 4
1 s123789 Bettina Petersen 12 10 10
2 s123468 Thomas Nielsen -3 7 2
3 s123579 Marie Hansen 10 12 12
4 s123579 Marie Hansen 10 12 12
5 s127848 Andreas Nielsen 2 2 2
6 s120799 Mads Westergaard 12 12 10
7 s123456 Michael Andersen 7 7 4
8 S184507 Andreas Døssing Mortensen 2 2 4
9 S129834 Jonas Jonassen 0 -3 4
10 S123481 Milad Mohammed 12 10 7
11 S128310 Abdul Jihad 10 4 7
12 S125493 Søren Sørensen 0 7 7
13 S128363 123 4 7 10
14 S127463 Jensen Jensen 5 2 10
15 S120987 Jeff Bezos 12 12 12
I need to make my program give an error message if a condition is not met; in this instance, if a student appears in the dataframe more than once, or if a grade given for an assignment is not on the scale of grades (-3, 0, 2, 4, 7, 10, 12).
The assignment is as follows:
If the user chooses to check for data errors, you must display a report of errors (if any) in the loaded data file. Your program must at least detect and display information about the following possible errors:
1. If two students in the data have the same student id.
2. If a grade in the data set is not one of the possible grades on the 7-step-scale.
How can I solve this?
I have tried the following, but with no luck:
doubles = dataDuplicate["Name"].duplicated()
print(doubles)

grades = np.array([-3, 0, 2, 4, 7, 10, 12])
dataSortGrades = dataSortGrades.iloc[:, 2:]  # this gives only the grade columns
gradesNotInList = np.isin(dataSortGrades, grades)

if dataDuplicate["Name"] in doubles == True:
    print("Error")
else:
    print(gradesNotInList)  # list of false values
The standard approach is:
if condition:
    print(message)
where condition (or not condition) and message should be adjusted to your specific needs.
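For instance, instantiated for the duplicate-id check in this assignment (a sketch; the df and its StudentID column are assumed from the question):
# keep=False flags every occurrence of a duplicated id, not just the later ones
dupes = df[df['StudentID'].duplicated(keep=False)]
if not dupes.empty:
    print("Error: the following students share a student id")
    print(dupes)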
You don't need to create 3 data frames. You can just create the dataframe once and then perform your selections based on your conditions.
import pandas as pd
import re

data = """0   s123456  Michael Andersen           7   7   4
1   s123789  Bettina Petersen           12  10  10
2   s123468  Thomas Nielsen             -3  7   2
3   s123579  Marie Hansen               10  12  12
4   s123579  Marie Hansen               10  12  12
5   s127848  Andreas Nielsen            2   2   2
6   s120799  Mads Westergaard           12  12  10
7   s123456  Michael Andersen           7   7   4
8   S184507  Andreas Døssing Mortensen  2   2   4
9   S129834  Jonas Jonassen             0   -3  4
10  S123481  Milad Mohammed             12  10  7
11  S128310  Abdul Jihad                10  4   7
12  S125493  Søren Sørensen             0   7   7
13  S128363  123                        4   7   10
14  S127463  Jensen Jensen              5   2   10
15  S120987  Jeff Bezos                 12  12  12"""

# Make the data frame (the regex splits on runs of 2+ spaces between columns)
data = [re.split(r"\s{2,}", line)[1:] for line in data.splitlines()]
df = pd.DataFrame(data, columns=['StudentID', 'Name', 'Assignment1', 'Assignment2', 'Assignment3'])

# Print the duplicates
print('###Duplicate studentIDs###')
print(df[df['StudentID'].duplicated()])

# Print invalid grades (the values are still strings here, so compare against strings)
valid_grades = ('-3', '0', '2', '4', '7', '10', '12')
print('###Invalid grades###')
print(df[
    ~df['Assignment1'].isin(valid_grades) |
    ~df['Assignment2'].isin(valid_grades) |
    ~df['Assignment3'].isin(valid_grades)
])
OUTPUT
###Duplicate studentIDs###
StudentID Name Assignment1 Assignment2 Assignment3
4 s123579 Marie Hansen 10 12 12
7 s123456 Michael Andersen 7 7 4
###Invalid grades###
StudentID Name Assignment1 Assignment2 Assignment3
14 S127463 Jensen Jensen 5 2 10
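Both checks can also be folded into a single report function. A sketch, assuming the grade columns still hold strings as in the frame built above:
def report_errors(df, valid_grades=('-3', '0', '2', '4', '7', '10', '12')):
    # Error 1: duplicated student ids
    dup = df[df['StudentID'].duplicated(keep=False)]
    if not dup.empty:
        print("Error: duplicate student ids")
        print(dup)
    # Error 2: grades outside the 7-step scale
    grade_cols = ['Assignment1', 'Assignment2', 'Assignment3']
    bad = ~df[grade_cols].isin(valid_grades).all(axis=1)
    if bad.any():
        print("Error: grades outside the 7-step scale")
        print(df[bad])

report_errors(df)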

How to calculate a groupby mean and variance in a pandas DataFrame?

I have a DataFrame and I want to calculate the mean and the variance for each row for each person. Moreover, there is a column date and the chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are just the number of days after the earliest date. The mean for a person's earliest date is simply the value in the column Points, and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the Points values for this date and the previous one. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np

data = [["Al", 0, 12], ["Bob", 2, 10], ["Carl", 5, 12], ["Al", 5, 5], ["Bob", 9, 2],
        ["Al", 22, 4], ["Bob", 22, 16], ["Carl", 33, 2], ["Al", 45, 7], ["Bob", 68, 4],
        ["Al", 72, 11], ["Bob", 79, 5]]
df = pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
    axis=1)
df['Variance'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.var(),
    axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.500000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
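The shift comes from df.Date < x.Date, which leaves the current row out of every window; using <= includes it. The same numbers also fall out of pandas' built-in expanding windows per group, sketched below (the frame is already in date order, and groupby preserves that order within each group):
df['Mean'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().mean())
df['Variance'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().var())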

Replacing and mapping string values in a Python dataframe with pandas

Hi, I've been trying to replace string values in a dataframe (the strings are abbreviations of NFL teams). I have something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 Phi Atl Phi Phi Phi
1 2 Bal Bal Bal Buf Bal
2 3 Ind Ind Cin Cin Ind
3 4 NE NE Hou NE NE
4 5 Jax Jax NYG NYG NYG
and a DataFrame with the mapping, something like this:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
...
31 WAS 32
I want to replace every string with the TeamID to compute basic statistics (frequencies). I've tried the following:
## Dataframe with strings and Team ID
dfDicTeams = dfTeams[['TEAM_YH','TeamID']].to_dict('dict')
## Dataframe with selections by users
dfW1.replace(dfDicTeams[['TEAM_YH']],dfDicTeams[['TeamID']]) ## Error: unhashable type: 'list'
dfW1.replace(dfDicTeams) ## Error: Replacement not allowed with overlapping keys and values
What am I doing wrong? Is it possible to do this?
I'm using Python 3, and I want something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 26 2 26 26 26
1 2 3 3 3 4 3
2 3 14 14 7 7 14
3 4 21 21 13 21 21
4 5 15 15 23 23 23
to aggregate the options:
IDMatch ATeam Count HTeam Count
1 26 4 2 1
2 3 4 4 1
3 14 3 7 2
4 21 4 13 1
5 15 2 23 3
Given a main input dataframe df and a mapping dataframe df_map, you can create a series mapping, then use pd.DataFrame.applymap with a custom function:
s = df_map.set_index('TEAM_YH')['TeamID']
df.iloc[:, 2:] = df.iloc[:, 2:].applymap(lambda x: s.get(x.upper(), -1))
print(df)
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 0 1 7 2 7 7 7
1 1 2 3 3 3 4 3
2 2 3 5 5 -1 -1 5
3 3 4 -1 -1 -1 -1 -1
4 4 5 6 6 -1 -1 -1
The example df_map used to calculate the above result:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
3 BUF 4
4 IND 5
5 JAX 6
6 PHI 7
...
32 WAS 32
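The original replace attempt failed because to_dict('dict') produces a nested {column: {row: value}} structure rather than a flat abbreviation-to-ID mapping. A sketch of the flat-dict route, under the same df and df_map assumptions as above:
# Flat mapping like {'ARI': 1, 'ATL': 2, ...}
mapping = dict(zip(df_map['TEAM_YH'], df_map['TeamID']))
user_cols = df.columns[2:]  # the Usr1..Usr5 columns
df[user_cols] = df[user_cols].apply(lambda col: col.str.upper()).replace(mapping)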
