data = {'col1': ['Country', 'State', 'City', 'park', 'avenue'],
        'col2': ['County', 'stats', 'PARK', 'Avenue', 'cities']}

      col1    col2
0  Country  County
1    State   stats
2     City    PARK
3     park  Avenue
4   avenue  cities
I am trying to match the values of the two columns with fuzzywuzzy and order the matches by score within each col1 group.
Desired output:
      col1    col2  score  order
0  Country  County     92      1
1  Country   stats     31      2
2  Country    PARK     18      3
3  Country  Avenue     17      4
4  Country  cities     16      5
5    State  County     80      1
6    State   stats     36      2
7    State    PARK     22      3
8    State  Avenue     18      4
9    State  cities     16      5
.....
What I did:

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

df = pd.DataFrame(data)
for i in df.col1:
    for j in df.col2:
        print(i, j, fuzz.token_set_ratio(i, j))
I got stuck here.
Let us do

df['score'] = df.apply(lambda x: fuzz.ratio(x['col1'], x['col2']), axis=1)
df['score']
0 92
1 60
2 0
3 0
4 17
dtype: int64
Then
df['order']=(-df['score']).groupby(df['col1']).rank(method='first')
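Note that the desired output pairs every col1 value with every col2 value, so a cross join is needed before scoring. A minimal sketch of that step, assuming fuzz.token_set_ratio as in the question's loop (merge(how='cross') needs pandas 1.2+):

import pandas as pd
from fuzzywuzzy import fuzz

data = {'col1': ['Country', 'State', 'City', 'park', 'avenue'],
        'col2': ['County', 'stats', 'PARK', 'Avenue', 'cities']}
df = pd.DataFrame(data)

# Build the full col1 x col2 cross product, then score each pair.
pairs = df[['col1']].merge(df[['col2']], how='cross')
pairs['score'] = pairs.apply(lambda r: fuzz.token_set_ratio(r['col1'], r['col2']), axis=1)

# Rank within each col1 group, highest score first.
pairs['order'] = (-pairs['score']).groupby(pairs['col1']).rank(method='first').astype(int)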
I have a dataframe df:
Name   Place   Price
Bob    NY      15
Jack   London  27
John   Paris   5
Bill   Sydney  3
Bob    NY      39
Jack   London  9
Bob    NY      2
Dave   NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name   Place   Price  Value
Bob    NY      15     1
Jack   London  27     1
John   Paris   5      1
Bill   Sydney  3      1
Bob    NY      39     2
Jack   London  9      2
Bob    NY      2      3
Dave   NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
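For reference, a self-contained sketch (values taken from the question) showing that cumcount works on the unsorted frame directly, so no pre-sorting is needed:

import pandas as pd

df = pd.DataFrame({
    'Name':  ['Bob', 'Jack', 'John', 'Bill', 'Bob', 'Jack', 'Bob', 'Dave'],
    'Place': ['NY', 'London', 'Paris', 'Sydney', 'NY', 'London', 'NY', 'NY'],
    'Price': [15, 27, 5, 3, 39, 9, 2, 7],
})

# cumcount numbers the rows within each (Name, Place) group in their original
# order, starting at 0, so add 1 to get values 1..N.
df['Value'] = df.groupby(['Name', 'Place']).cumcount().add(1)
print(df)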
I don't know if this is possible but I have a data frame like this one:
df
State County Homicides Man Woman Not_Register
Gto Celaya 2 2 0 0
NaN NaN 8 4 2 2
NaN NaN 3 2 1 0
NaN Yiriria 2 1 1 0
Nan Acambaro 1 1 0 0
Sin Culiacan 3 1 1 1
NaN Nan 5 4 0 1
Chih Juarez 1 1 0 0
I want to group by State and County and sum Homicides, Man, Woman and Not_Register, like this:
State County Homicides Man Woman Not_Register
Gto Celaya 13 8 3 2
Gto Yiriria 2 1 1 0
Gto Acambaro 1 1 0 0
Sin Culiacan 8 5 1 2
Chih Juarez 1 1 0 0
So far, I have been able to group by State and County and fill the NaN rows with the right County and State names. My result and code:
import pandas as pd
import numpy as np

df = df.fillna(method='pad')  # to repeat the State and County names in the right order

# To group
df = df.groupby(["State", "County"]).agg('sum')
df = df.reset_index()
df
State County Homicides
Gto Celaya 13
Gto Yiriria 2
Gto Acambaro 1
Sin Culiacan 8
Chih Juarez 1
But when I tried to add the Man and Woman columns:
df1 = df.groupby(["State","County", "Man", "Women", "Not_Register"]).agg('sum')
df1 =df.reset_index()
df1
My result repeats the Counties instead of giving me a unique County per State.
How can I resolve this issue?
Thanks for your help
Change to the following: convert the count columns to numeric first (they are most likely stored as strings, which is why sum() drops them), then group by State and County only:
df[['Homicides','Man','Woman','Not_Register']]=df[['Homicides','Man','Woman','Not_Register']].apply(pd.to_numeric,errors = 'coerce')
df = df.groupby(['State',"County"]).sum().reset_index()
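Putting the pieces together on the question's data, a sketch (it assumes the NaN/Nan entries are real missing values and that the count columns arrive as strings):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State':  ['Gto', np.nan, np.nan, np.nan, np.nan, 'Sin', np.nan, 'Chih'],
    'County': ['Celaya', np.nan, np.nan, 'Yiriria', 'Acambaro', 'Culiacan', np.nan, 'Juarez'],
    'Homicides':    ['2', '8', '3', '2', '1', '3', '5', '1'],
    'Man':          ['2', '4', '2', '1', '1', '1', '4', '1'],
    'Woman':        ['0', '2', '1', '1', '0', '1', '0', '0'],
    'Not_Register': ['0', '2', '0', '0', '0', '1', '1', '0'],
})

# Make the count columns numeric, fill the missing State/County down from the
# row above, then sum per (State, County).
cols = ['Homicides', 'Man', 'Woman', 'Not_Register']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df[['State', 'County']] = df[['State', 'County']].ffill()
out = df.groupby(['State', 'County'], sort=False).sum().reset_index()
print(out)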
I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below:
Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1
My pseudo code would be something like:
import pandas as pd
from fuzzywuzzy import fuzz

minratio = 95
for idx1, name1 in df['Name'].iteritems():
    for idx2, name2 in df['Name'].iteritems():
        ratio = fuzz.WRatio(name1, name2)
        if ratio > minratio:
            grouped = df.groupby(['Name', 'ID'])['Value']\
                        .agg(Total_Value='sum', Group_Size='count')
This would then give me the desired output:
print(grouped)
Name ID Total_Value Group_Size
0 James 1 164 3 # All James' grouped
2 Bike 3 1198 2 # Bike's and Bicycles grouped
5 Ants 6 54 1
6 Job 7 6 1
7 Michael 8 80017 3 # Mike's and Michael's grouped
8 Arm 9 47 1
Obviously this doesn't work, and honestly, I am not sure if this is even possible, but this is what I'm trying to accomplish. Any advice that could get me on the right track would be useful.
Using affinity propagation clustering (not perfect but maybe a starting point):
import pandas as pd
import numpy as np
import io
from fuzzywuzzy import fuzz
from scipy import spatial
import sklearn.cluster
s="""Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1"""
df = pd.read_csv(io.StringIO(s),sep='\s\s+',engine='python')
names = df.Name.values
sim = spatial.distance.pdist(names.reshape((-1,1)), lambda x,y: fuzz.WRatio(x,y))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", random_state=None)
affprop.fit(spatial.distance.squareform(sim))
res = df.groupby(affprop.labels_).agg(
    Names=('Name', ','.join),
    First_ID=('ID', 'first'),
    Total_Value=('Value', 'sum'),
    Group_Size=('Value', 'count')
)
Result
Names First_ID Total_Value Group_Size
0 James,James 2,James Marsh,Ants,Arm 1 265 5
1 Bike,Bicycle 3 1198 2
2 Job 7 6 1
3 Michael,Mike K,Michael k 8 80017 3
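If a representative name per cluster is wanted instead of the integer label, the exemplar rows chosen by affinity propagation can be used. A small sketch building on the fitted affprop, names and df above (cluster_centers_indices_ holds the exemplar row positions; it can be empty if the algorithm does not converge):

# Label each row with the exemplar name of its cluster.
exemplars = names[affprop.cluster_centers_indices_]
df['Group'] = exemplars[affprop.labels_]
print(df.groupby('Group', sort=False)['Value'].agg(Total_Value='sum', Group_Size='count'))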
I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe with hundreds of thousands of parsed address records df. Both df_sample and df share the exact same structure:
zip_code city state street_number street_name unit_number country
12345 FAKEVILLE FLORIDA 123 FAKE ST NaN US
What I want to do is match a single row in df_sample against every row in df, starting with state, and take into a new dataframe only the rows where fuzzy.ratio(df['state'], df_sample['state']) > 0.9. Once this new, smaller dataframe is created from those matches, I would continue to do this for city, zip_code, etc. Something like:
df_match = df[fuzzy.ratio(df_sample['state'], df['state']) > 0.9]
except that doesn't work.
My goal is to narrow down the number of matches each time I use a harder search criterion, and eventually end up with a dataframe with as few matches as possible based on narrowing it down by each column individually. But I am unsure as to how to do this for any single record.
Create your dataframes
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                   'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                          'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})
merged_df = df_sample.merge(df, on='key')
merged_df['fuzzy_ratio'] = merged_df.apply(lambda row: fuzz.ratio(row['state_x'], row['state_y']), axis=1)
merged_df
You get the fuzzy ratio for each pair:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
1 1 6 florida 2 Nevada 31
2 1 6 florida 3 Texas 17
3 1 6 florida 4 Florida 86
4 1 6 florida 5 Texas 17
5 1 7 Flor 1 Florida 73
6 1 7 Flor 2 Nevada 0
7 1 7 Flor 3 Texas 0
8 1 7 Flor 4 Florida 73
9 1 7 Flor 5 Texas 0
10 1 8 NY 1 Florida 0
11 1 8 NY 2 Nevada 25
12 1 8 NY 3 Texas 0
13 1 8 NY 4 Florida 0
14 1 8 NY 5 Texas 0
15 1 9 Florida 1 Florida 100
16 1 9 Florida 2 Nevada 31
17 1 9 Florida 3 Texas 17
18 1 9 Florida 4 Florida 100
19 1 9 Florida 5 Texas 17
20 1 10 Tx 1 Florida 0
21 1 10 Tx 2 Nevada 0
22 1 10 Tx 3 Texas 57
23 1 10 Tx 4 Florida 0
24 1 10 Tx 5 Texas 57
Then filter out what you don't want:
mask = (merged_df['fuzzy_ratio']>80)
merged_df[mask]
result:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
3 1 6 florida 4 Florida 86
15 1 9 Florida 1 Florida 100
18 1 9 Florida 4 Florida 100
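Back on the question's original df and df_sample, narrowing down by further columns works the same way: compare one sample record column by column and shrink the candidate set each time. A sketch (the column list and the threshold of 90 are assumptions; fuzz.ratio scores on a 0-100 scale):

candidates = df.copy()
row = df_sample.iloc[0]  # the single record we are trying to match

for col in ['state', 'city', 'zip_code']:
    scores = candidates[col].astype(str).apply(lambda v: fuzz.ratio(str(row[col]), v))
    candidates = candidates[scores > 90]  # keep only close matches on this column
    if candidates.empty:
        break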
I'm not familiar with fuzzy, so this is more of a comment than an answer. That said, you can do something like this:
# cross join
df_merge = pd.merge(*[d.assign(dummy=1) for d in (df, df_sample)],
                    on='dummy', how='left')

filters = pd.DataFrame()

# compute the fuzzy ratio for each pair of columns
for col in df.columns:
    filters[col] = (df_merge[[col+'_x', col+'_y']]
                    .apply(lambda x: fuzzy.ratio(x[col+'_x'], x[col+'_y']), axis=1))

# keep only the rows where every column's ratio exceeds 0.9
df_match = df_merge[filters.gt(0.9).all(1)]
You wrote that your df has a very big number of rows, so a full cross join followed by elimination may cause your code to run out of memory.
Take a look at another solution that requires less memory:
minRatio = 90
result = []

for idx1, t1 in df_sample.state.iteritems():
    for idx2, t2 in df.state.iteritems():
        ratio = fuzz.WRatio(t1, t2)
        if ratio > minRatio:
            result.append([idx1, t1, idx2, t2, ratio])

df2 = pd.DataFrame(result, columns=['idx1', 'state1', 'idx2', 'state2', 'ratio'])
It contains 2 nested loops running over both DataFrames.
The result is a DataFrame whose rows contain:
index and state from df_sample,
index and state from df,
the ratio.
This tells you which rows in the two DataFrames are "related" to each other.
The advantage is that you don't generate a full cross join and (for now) you operate only on the state columns instead of full rows.
You didn't describe what exactly the final result should be, but I think that based on the above code you will be able to proceed further.
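For example, the matched rows can be pulled back out of both frames via the stored indices, a small sketch assuming the df2 produced by the loop above:

# Recover the matching rows from each frame using the saved indices.
matched_samples = df_sample.loc[df2['idx1'].unique()]
matched_records = df.loc[df2['idx2'].unique()]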
If I have the following dataframe:
df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
                   'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
                   'num_children': [2, 0, 0, 3, 2, 1, 4],
                   'num_pets': [5, 1, 0, 5, 2, 2, 3]})
name gender state num_children num_pets
0 john M california 2 5
1 mary F dc 0 1
2 peter M california 0 0
3 jeff M dc 3 5
4 bill M california 2 2
5 lisa F texas 1 2
6 jose M texas 4 3
I want to create a new row and a new column pct. holding the percentage of zero values in the num_children and num_pets columns.
Expected output:
name gender state num_children num_pets pct.
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have calculated the percentage of zeros in each row for the target columns:
df['pct'] = df[['num_children', 'num_pets']].astype(bool).sum(axis=1)/2
df['pct.'] = 1-df['pct']
del df['pct']
df['pct.'] = pd.Series(["{0:.0f}%".format(val * 100) for val in df['pct.']], index = df.index)
name gender state num_children num_pets pct.
0 john M california 2 5 0%
1 mary F dc 0 1 50%
2 peter M california 0 0 100%
3 jeff M dc 3 5 0%
4 bill M california 2 2 0%
5 lisa F texas 1 2 0%
6 jose M texas 4 3 0%
But I don't know how to insert the results below as the pct. row of the expected output. Please help me get the expected result in a more pythonic way. Thanks.
df[['num_children', 'num_pets']].astype(bool).sum(axis=0)/len(df.num_children)
Out[153]:
num_children 0.714286
num_pets 0.857143
dtype: float64
UPDATE: the same thing but for calculating sums, great thanks to @jezrael:
df['sums'] = df[['num_children', 'num_pets']].sum(axis=1)
df1 = (df[['num_children', 'num_pets']].sum()
.to_frame()
.T
.assign(name='sums'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets sums
0 sums 12 18
1 john M california 2 5 7
2 mary F dc 0 1 1
3 peter M california 0 0 0
4 jeff M dc 3 5 8
5 bill M california 2 2 4
6 lisa F texas 1 2 3
7 jose M texas 4 3 7
You can use mean on a boolean mask created by comparing to 0 with DataFrame.eq (because sum/len = mean by definition), multiply by 100 and add the percent sign with apply:
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct'] = s.mul(100).apply("{0:.0f}%".format)
For the first row, create a new DataFrame with the same columns as the original and concat them together:
df1 = (df[['num_children', 'num_pets']].eq(0)
         .mean()
         .mul(100)
         .apply("{0:.1f}%".format)
         .to_frame()
         .T
         .assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets pct
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%