Iterate over groups of rows in Pandas - python

I am new to pandas and python in general - grateful for any direction you can provide!
I have a CSV file with 4 columns. I am trying to group together rows where the first three columns match across rows (Column A Row 1 = Column A Row 2, Column B Row 1 = Column B Row 2, and so on).
My data look like this:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
3 9991112222 NJ 2015-05-14 Apartment rental
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
6 4443337777 NJ 2015-05-14 condo
So in this data, rows 1, 2 and 3 would be in one group, and rows 4 and 5 would be in another group. Row 6 would not be in the group with 1, 2, and 3 because it has a different phone_number.
Then, for each row, I want to compare the string in the description column against each other description in that group using Levenshtein distance, and keep the rows where the descriptions are sufficiently similar.
"Condo" from row 1 would be compared to "Condo sales call" from row 2 and to "Apartment rental" in row 3. It would not be compared to "condo" from row 6.
In the end, the goal is to weed out rows where the description is not sufficiently similar to another description in the same group. Phrased differently, to print out all rows where description is at least somewhat similar to another (any other) description in that group. Ideal output:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
Row 6 does not print because it was never in a group. Row 3 doesn't print because "Apartment rental" is insufficiently similar to "Condo" or "Condo sales call"
This is the code I have so far. I can't tell if this is the best way to do it. And if I have done it right so far, I can't figure out how to print the full row of interest:
import Levenshtein
import itertools
import pandas as pd

test_data = pd.DataFrame.from_csv('phone_state_etc_test.csv', index_col=None)

for pn in test_data['phone_number']:
    for dt in test_data['date']:
        for st in test_data['state']:
            for a, b in itertools.combinations(test_data[
                        (test_data['phone_number'] == pn) &
                        (test_data['state'] == st) &
                        (test_data['date'] == dt)
                    ]['description'], 2):
                if Levenshtein.ratio(a, b) > 0.35:
                    print pn, "|", dt, "|", st, "|" #description
This prints a bunch of duplicates of these lines:
9991112222 | NJ | 2015-05-14 |
6668885555 | CA | 2015-05-06 |
But if I add description to the end of the print line, I get a
SyntaxError: invalid syntax
Any thoughts on how I can print the full row? Whether in pandas dataframe, or some other format, doesn't matter - I just need to output to csv.

Why don't you use pandas' groupby to find the unique groups (based on phone_number, state and date)? Doing this lets you treat all the description values within each group separately and do whatever you want with them.
For example, I'll group by the columns above and get the unique values of the description column within each group:
In [49]: df.groupby(['phone_number','state','date']).apply(lambda v: v['description'].unique())
Out[49]:
phone_number state date
4443337777 NJ 2015-05-14 [condo]
6668885555 CA 2015-05-06 [Apartment, Apartment-rental]
9991112222 NJ 2015-05-14 [Condo, Condo-sales-call, Apartment-rental]
dtype: object
You can use any function within the apply. More examples here - http://pandas.pydata.org/pandas-docs/stable/groupby.html
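For instance, here is a rough sketch of the actual filtering inside the apply, reusing the df from above plus the Levenshtein package and the 0.35 threshold from the question (the keep_similar name and the group_keys=False argument are my own additions):
import itertools
import Levenshtein

def keep_similar(group, threshold=0.35):
    # keep only descriptions that are similar enough to at least one
    # other description in the same group
    similar = set()
    for a, b in itertools.combinations(group['description'], 2):
        if Levenshtein.ratio(a, b) > threshold:
            similar.update([a, b])
    return group[group['description'].isin(similar)]

result = df.groupby(['phone_number', 'state', 'date'], group_keys=False).apply(keep_similar)
On the sample data this should keep rows 1, 2, 4 and 5, matching the ideal output in the question.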

I'm not entirely sure how best to do a calculation over all pairs of values in pandas; here I've made a matrix with the descriptions as both the rows and the columns (so the main diagonal of the matrix compares each description with itself), but it doesn't seem entirely idiomatic:
def find_similar_rows(group, threshold=0.35):
    sim_matrix = pd.DataFrame(index=group['description'],
                              columns=group['description'])
    for d1 in sim_matrix.index:
        for d2 in sim_matrix.columns:
            # Leave diagonal entries as nan
            if d1 != d2:
                sim_matrix.loc[d1, d2] = Levenshtein.ratio(d1, d2)
    keep = sim_matrix.gt(threshold, axis='columns').any()
    # A bit of possibly unnecessary mucking around with the index
    # here, could probably be cleaned up
    rows_to_keep = group.loc[keep[group['description']].tolist(), :]
    return rows_to_keep
grouped = test_data.groupby('phone_number', group_keys=False)
grouped.apply(find_similar_rows)
Out[64]:
phone_number state date description
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
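Since the question ultimately wants CSV output, the result of the apply can be written straight out (the file name here is just a placeholder):
result = grouped.apply(find_similar_rows)
result.to_csv('similar_descriptions.csv', index=False)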

It seems from the data provided that you want to keep rows for which the first word in the description matches the most common first word for that group.
If that's the case, you can do this:
test_data['description_root'] = test_data['description'].str.split().str[0]
# this adds a column with the first word from the description column
grouped = test_data.groupby(['phone_number', 'state', 'date'])
most_frequent_root = grouped.description_root.transform(
    lambda s: s.value_counts().idxmax())
# this is a series with the same index as the original df containing
# the most frequently occurring root for each group
test_data[test_data.description_root == most_frequent_root]
# this will give you the matching rows
You could also call .describe on grouped to get some additional information for each group. Sorry if this is off topic, but I think you might well find the Series string methods (.str) and groupby useful.

Related

When I apply the groupby function of pandas I don't get a dataframe but a strange output

This is an example of my dataframe and I would like to apply the groupby function but I get the following output:
Example dataframe:
x sampling time y
1 morning 19
2 morning 19.1
3 morning 20
1 midday 17
2 midday 18
3 midday 19
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11F3DEB0>
My code:
file = pd.read_excel(r'C:/Users/...
r'2021_Licor_sinusoidal.xlsx')
df = file[['x', 'Sampling time', 'Y']].copy()
df.columns = ['x', 'Sampling time', 'Y']
grouped = df.groupby(['Sampling time'])
print(grouped)
Thank you in advance
Pandas' groupby is lazy in nature, which means that it doesn't return any actual data until you specifically ask for it. As mentioned in the comments, it returns a grouped object, which can be used with built-in aggregations like sum() and count(). This tutorial will give you more insight regarding the operation and the way it works.
In a little more detail:
by_state = df.groupby("state")
print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>
Here you get the 'strange' object/output you mentioned, since groupby delays the split-apply-combine process until you invoke a method on the grouped data, as mentioned above. Since the object is iterable, you can of course loop over it:
for state, frame in by_state:
    print(f"First 2 entries for {state!r}")
    print("------------------------")
    print(frame.head(2), end="\n\n")
# First 2 entries for 'AK'
# ------------------------
# last_name first_name birthday gender type state party
# 6619 Waskey Frank 1875-04-20 M rep AK Democrat
# 6647 Cale Thomas 1848-09-17 M rep AK Independent
# First 2 entries for 'AL'
# ------------------------
# last_name first_name birthday gender type state party
# 912 Crowell John 1780-09-18 M rep AL Republican
# 991 Walker John 1783-08-12 M sen AL Republican
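To get actual numbers back instead of the lazy DataFrameGroupBy object, call an aggregation method on it. A small sketch using the column names from the question's own code (assuming the Y column is numeric):
grouped = df.groupby('Sampling time')
print(grouped['Y'].mean())                  # one mean per sampling time
print(grouped['Y'].agg(['mean', 'count']))  # several aggregations at once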

How to add rows in Data Frame while for loop?

I want to add a row in an existing data frame, where I don't have a matching regex value.
For example,
import pandas as pd
import numpy as np
import re
lst = ['Sarah Kim', 'Added on January 21']
df = pd.DataFrame(lst)
df.columns = ['Info']
name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"
for index, row in df.iterrows():
    if re.findall(name_pat, str(row['Info'])):
        print("Name matched")
    elif re.findall(title_pat, str(row['Info'])):
        print("Title matched")
        if re.findall(title_pat, str(row['Info'])) == None:
            # Add a row here in the dataframe
            pass
    elif re.findall(date_pat, str(row['Info'])):
        print("Date matched")
        if re.findall(date_pat, str(row['Info'])) == None:
            # Add a row here in the dataframe
            pass
So here in my dataframe df, I do not have a title, just a name and a date. While looping over df, I want to add an empty row for the missing title.
The output is:
Info
0 Sarah Kim
1 Added on January 21
My expected output is:
Info
0 Sarah Kim
1 None
2 Added on January 21
Is there any way that I can add an empty row, or is there a better way?
+++
The dataset I'm working with is just one column with many rows. The rows have some structure, that repeat data of "name, title, date". For example,
Info
0 Sarah Kim
1 Added on January 21
2 Jesus A. Moore
3 Marketer
4 Added on May 30
5 Bobbie J. Garcia
6 CEO
7 Anita Jobe
8 Designer
9 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I have sliced the data frame so that I can extract sections that look like this:
Info
0 Sarah Kim
1 Added on January 21
I'm trying to run a loop over each section, and if a date or title is missing, fill in an empty row, so that in the end I will have:
Info
0 Sarah Kim
1 **NULL**
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 **NULL**
9 Anita Jobe
10 Designer
11 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I see you have a long dataframe with information, and each set of information is different. I think your goal is probably to have a dataframe with 3 columns:
Name, Title and Date
Here is how I would approach this problem, with some code samples. I would take advantage of the df.shift method to tie related rows together and use your existing dataframe to create a new one.
I am also making some assumptions based on what you have listed above. First, I will assume that only the Title and Date fields can be missing. Second, I will assume that the order is Name, Title and Date, as you have mentioned above.
#first step create test data
test_list = ['Sarah Kim','Added on January 21','Jesus A. Moore','Marketer','Added on May 30','Bobbie J. Garcia','CEO','Anita Jobe','Designer','Added on January 3']
test_df =pd.DataFrame(test_list,columns=['Info'])
# second step use your regex to get what type of column each info value is
name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"
test_df['Col'] = test_df['Info'].apply(lambda x: 'Name' if re.findall(name_pat, x) else ('Date' if re.findall(date_pat,x) else 'Title'))
# third step is to get the next values from our dataframe using df.shift
test_df['Next_col'] = test_df['Col'].shift(-1)
test_df['Next_col2'] = test_df['Col'].shift(-2)
test_df['Next_val1'] = test_df['Info'].shift(-1)
test_df['Next_val2'] = test_df['Info'].shift(-2)
# Now filter to only the names and apply a function to get our name, title and date
new_df = test_df[test_df['Col']=='Name']
def apply_func(row):
    name = row['Info']
    title = None
    date = None
    if row['Next_col']=='Title':
        title = row['Next_val1']
    elif row['Next_col']=='Date':
        date = row['Next_val1']
    if row['Next_col2']=='Date':
        date = row['Next_val2']
    row['Name'] = name
    row['Title'] = title
    row['date'] = date
    return row
final_df = new_df.apply(apply_func,axis=1)[['Name','Title','date']].reset_index(drop=True)
print(final_df)
Name Title date
0 Sarah Kim None Added on January 21
1 Jesus A. Moore Marketer Added on May 30
2 Bobbie J. Garcia CEO None
3 Anita Jobe Designer Added on January 3
There is probably a way to do this in fewer lines of code; I welcome anyone who can make it more efficient, but I believe this should work. Also, if you wanted to flatten this back into a single column:
flattened_df = pd.DataFrame(final_df.values.flatten(),columns=['Info'])
print(flattened_df)
Info
0 Sarah Kim
1 None
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 None
9 Anita Jobe
10 Designer
11 Added on January 3

Pandas, join row from target file based on condition

I need to merge a row from a target DataFrame into my source DataFrame on a fuzzy matching condition that has already been developed; let's call the method fuzzyTest. If fuzzyTest returns True, I want to merge the matching row from the target file into my source file.
So basically do a left join where the TARGET COMPANY passes the fuzzyTest when compared to the SOURCE COMPANY.
Source DataFrame
SOURCE COMPANY
0 Cool Company
1 BigPharma
2 Tod Kompany
3 Wallmart
Target DataFrame
TARGET COMPANY
0 Kool Company
1 Big farma
2 Todd's Company
3 C-Mart
4 SuperMart
5 SmallStore
6 ShopRus
Hopefully after mapping through fuzzyTest the output would be:
SOURCE COMPANY TARGET COMPANY
0 Cool Company Kool Company
1 BigPharma Big farma
2 Tod Kompany Todd's Company
3 Wallmart NaN
So if your fuzzy logic only compares the two strings on each row, just wrap it in a function that takes the source column and the target column.
Put both columns in one dataframe, then run:
def FuzzyTest(source, target):
    .....
    if ...:
        return target
    else:
        return None

df['Target Company'] = df.apply(lambda x: FuzzyTest(x['Source'], x['Target']), axis=1)
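If instead the target has to be searched for the best match for each source row (as the expected output suggests), here is a minimal sketch using difflib from the standard library as a stand-in for fuzzyTest (the 0.6 cutoff and the best_match helper are my assumptions):
import difflib
import pandas as pd

source = pd.DataFrame({'SOURCE COMPANY': ['Cool Company', 'BigPharma', 'Tod Kompany', 'Wallmart']})
target = pd.DataFrame({'TARGET COMPANY': ['Kool Company', 'Big farma', "Todd's Company", 'C-Mart',
                                          'SuperMart', 'SmallStore', 'ShopRus']})

def best_match(name, candidates, cutoff=0.6):
    # return the closest candidate above the cutoff, or None if nothing is close enough
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

source['TARGET COMPANY'] = source['SOURCE COMPANY'].apply(
    lambda name: best_match(name, target['TARGET COMPANY'].tolist()))
On the sample data this should leave Wallmart paired with None, as in the expected output.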

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID Hobby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars NaN
5 Photo Andrew
6 Football NaN
.............. 1303 rows.
The number of names filled in might be larger than 2 as well. I would like to end up with the entire Name column filled, with the missing values split equally among the existing names (or +1 where the split isn't even). I already store the total number of names in a variable; in the above case it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID Hobby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars Kevin
5 Photo Andrew
6 Football Andrew
I tried replacing NaN with 0 in the Name column using fillna, filtering the column to end up with a dataframe that has only the NaN fields, then using len(df) to get the number of NaNs and from there creating 2 dataframes, each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names. There could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
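A quick check against the sample frame (assuming the missing names are real NaN values rather than the string 'NaN'):
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Hobby': ['Travel', 'Photo', 'Travel', 'Cars', 'Photo', 'Football'],
                   'Name': ['Kevin', 'Andrew', 'Kevin', np.nan, 'Andrew', np.nan]})
df['Name'] = df['Name'].ffill()
# rows 4 and 6 are now Kevin and Andrew, matching the expected output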

Fuzzy matching a sorted column with itself using python

I have a dataset of 200k rows with two columns: (1) a unique customer id and address combination, and (2) revenue. The table is sorted by revenue, and the goal is to clean up column 1 by fuzzy-matching it against itself: if there is a close enough customer-address combination with higher revenue, it should replace the lower-revenue combination, which most likely resulted from spelling differences.
Example:
In the above example the third row is very similar to the first row so I want it to take the value of the first row.
I have a working python code but it is too slow:
import pandas as pd
import datetime
import time
import numpy as np
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray
data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")
# Create lookup column from the dataframe itself:
lookup_data=data['UNIQUE_ID']
lookup_data=pd.Series.to_frame(lookup_data)
# Start iterating on row by row on lookup data to find the first closest fuzzy match and write that back into dataframe:
start = time.time()
for index,row in data.iterrows():
    if index%5000==0: print(index, time.time()-start)
    for index2, row2 in lookup_data.iterrows():
        ratio_val=normalized_damerau_levenshtein_distance(row['UNIQUE_ID'],row2['UNIQUE_ID'])
        if ratio_val<0.15:
            data.set_value(index,'UPDATED_ID',row2['UNIQUE_ID'])
            data.set_value(index,'Ratio_Val',ratio_val)
            break
Currently this fuzzy matching block of code is taking too long to run - about 8 hours for the first 15k rows with time increasing exponentially as one would expect. Any suggestion on how to more efficiently write this code?
One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:
if index2 <= index:
    continue
This alone will speed up the matching by a factor of 2.
I had the same issue and resolved it with a combination of the Levenshtein package (to create a distance matrix) and scikit-learn's DBSCAN to cluster similar strings and assign the same value to every element within the cluster.
You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())
>>> from e_utils.utils import clean_df
>>> from e_utils.utils import homog_lev
>>> series
0 Bad Bunny
1 bad buny
2 bag bunny
3 Ozuna
4 De La Ghetto
5 de la geto
6 Daddy Yankee
7 dade yankee
8 Nicky Jam
9 nicky jam
10 J Balvin
11 jbalvin
12 Maluma
13 maluma
14 Anuel AA
>>> series2 = clean_df(series)
>>> series2 = homog_lev(series2, eps=3)
>>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])
*Original* *Fixed*
0 Bad Bunny Bad Bunny
1 bad buny Bad Bunny
2 bag bunny Bad Bunny
3 Ozuna Ozuna
4 De La Ghetto De La Ghetto
5 de la geto De La Ghetto
6 Daddy Yankee Daddy Yankee
7 dade yankee Daddy Yankee
8 Nicky Jam Nicky Jam
9 nicky jam Nicky Jam
10 J Balvin J Balvin
11 jbalvin J Balvin
12 Maluma Maluma
13 maluma Maluma
14 Anuel AA Anuel Aa
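For reference, here is a minimal sketch of the same idea applied to the series above, without the helper library: build the distance matrix with python-Levenshtein and cluster with scikit-learn's DBSCAN (eps=3 and taking the first member of each cluster as the canonical spelling are assumptions, not the e_utils defaults):
import numpy as np
import pandas as pd
import Levenshtein
from sklearn.cluster import DBSCAN

names = series.str.lower().str.strip()
# pairwise edit-distance matrix between all strings
dist = np.array([[Levenshtein.distance(a, b) for b in names] for a in names])
labels = DBSCAN(eps=3, min_samples=1, metric='precomputed').fit_predict(dist)
# give every member of a cluster the value of its first member
canonical = names.groupby(labels).transform('first')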
