I have the following dataframe, where some of the rows have partial matches:
id name affiliation
1 David Department of Biology, State University
2 Peter Dept. of Chemistry
1 David Biology Department, University of State
2 Peter Chemistry Dept.
1 David Department for Zoology, 2nd State University
Ids 1 and 2 are duplicated based on the id and name columns; however, the affiliations make these rows unique (due to different spellings). I have tried the duplicated function in pandas, but it matches complete cell strings, and I could not find a partial string match function.
Is there a way to get common strings from the affiliations to make the rows unique?
Desired dataframe should look like:
id name affiliation
1 David Biology Department, State University
2 Peter Dept. Chemistry
1 David Department for Zoology, 2nd State University
or
id name affiliation
1 David Department Biology, State University
2 Peter Chemistry Dept.
1 David Department for Zoology, 2nd State University
You could try using the Python standard library difflib module, which provides helpers for computing deltas, like this:
Setup
from difflib import SequenceMatcher
import pandas as pd
# Original dataframe
df = pd.DataFrame(
    {
        "id": [1, 2, 1, 2, 1],
        "name": ["David", "Peter", "David", "Peter", "David"],
        "affiliation": [
            "Department of Biology, State University",
            "Dept. of Chemistry",
            "Biology Department, University of State",
            "Chemistry Dept.",
            "Department for Zoology, 2nd State University",
        ],
    }
)
print(df)
# Outputs
   id   name                                   affiliation
0   1  David       Department of Biology, State University
1   2  Peter                            Dept. of Chemistry
2   1  David       Biology Department, University of State
3   2  Peter                               Chemistry Dept.
4   1  David  Department for Zoology, 2nd State University
Find and delete similar rows
# Sort "affiliation" column by increasing length
length_increasing = df["affiliation"].str.len().sort_values().index
df = df.reindex(length_increasing)
# Iterate on each value
for affiliation in df["affiliation"].values:
    for row in df.iterrows():
        if row[1]["affiliation"] == affiliation:
            continue
        # Characters to compare (sorted, so word order is ignored)
        chars = sorted(str(affiliation))
        other_chars = sorted(str(row[1]["affiliation"])[: len(chars)])
        # Replace "affiliation" value if the match ratio is significant
        if SequenceMatcher(None, chars, other_chars).ratio() > 0.9:
            df.loc[row[0], "affiliation"] = affiliation
# Drop duplicated rows and clean up
df = (
    df.drop_duplicates(subset="affiliation")
    .sort_values(["id", "affiliation"])
    .reset_index(drop=True)
)
Result
print(df)
# Outputs
   id   name                                   affiliation
0   1  David  Department for Zoology, 2nd State University
1   1  David       Department of Biology, State University
2   2  Peter                            Dept. of Chemistry
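For intuition: sorting the characters makes the comparison insensitive to word order, so two reworded affiliations built from the same words compare as (nearly) identical. A quick check, reusing two strings from the example data:

from difflib import SequenceMatcher

a = sorted("Department of Biology, State University")
b = sorted("Biology Department, University of State"[: len(a)])
print(SequenceMatcher(None, a, b).ratio())  # 1.0 here: same characters, different order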
Related
I have a pandas dataframe df. I want to populate subsequent values in a column based on the value that preceded them, and when I come across another value, do the same for that one. That way the dept column is complete and I can merge this dataset with another to link department info to the PIs.
I don't know the best approach: is there a vectorized approach to this, or would it require looping, maybe using iterrows() or itertuples()?
data = {"dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
"pi": [NaN, "Tiger Woods", "Michael Jordan", "Roger Federer", NaN, "Serena Williams", "Alex Morgan"]
}
df = pd.DataFrame(data=data)
                 dept               pi
0  Emergency Medicine              NaN
1                          Tiger Woods
2                       Michael Jordan
3                        Roger Federer
4     Family Practice              NaN
5                      Serena Williams
6                          Alex Morgan
desired_df
                 dept               pi
0  Emergency Medicine
1  Emergency Medicine      Tiger Woods
2  Emergency Medicine   Michael Jordan
3  Emergency Medicine    Roger Federer
4     Family Practice
5     Family Practice  Serena Williams
6     Family Practice      Alex Morgan
Use where to mask the empty values with NaN, then ffill:
# if you have empty strings
mask = df['dept'].ne('')
df['dept'] = df['dept'].where(mask).ffill()
# otherwise, just
# df['dept'] = df['dept'].ffill()
Output:
                 dept               pi
0  Emergency Medicine              NaN
1  Emergency Medicine      Tiger Woods
2  Emergency Medicine   Michael Jordan
3  Emergency Medicine    Roger Federer
4     Family Practice              NaN
5     Family Practice  Serena Williams
6     Family Practice      Alex Morgan
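Equivalently, you can turn the empty strings into NaN up front and rely on ffill alone; a minimal sketch of the same idea:

import numpy as np

df['dept'] = df['dept'].replace('', np.nan).ffill()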
I have a CSV file that I'm converting to a pandas dataframe, but in some rows the columns 'name' and 'business' contain multiple names and businesses that should be in separate rows. These cells need to be split up while keeping the data from the other columns the same for each row produced by the split.
Here is the example data:
input:

software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J., Ricky B.          Unspecified, Unspecified, Self-employed
output I need:

software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
def       Ricky B         Self-employed
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column; the business column and a few other columns still need to be split up in step with the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
records = df.to_dict('records')
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')
Here's an alternative solution I was trying as well, but it gives me an error too:
review_path = r'data/base_data'
review_files = glob.glob(review_path + "/test_data.csv")
review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep='\t')
    print(df.head())
    df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
    outPutPath = Path('data/base_data/test_data.csv')
    df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
Try:
df["business"] = (
df["business"]
.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
.groupby(level=0)
.agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
software name business
0 abc Andrew Johnson Outsourcing/Offshoring, 201-500 employees
0 abc Steve Martin Health, Wellness and Fitness, 5001-10,000 employees
1 xyz Jack Jones Banking, 1001-5000 employees
1 xyz Rick Paul Construction, 51-200 employees
1 xyz Johnny Jones Consumer Goods, 10,001+ employees
2 def Tom D. Unspecified
2 def Connie J. Unspecified
2 def Ricky B. Self-employed
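If the regex is hard to read: each non-greedy match ends at 'Unspecified', 'employees', or 'Self-employed', and the leading non-capturing group swallows the separating commas and whitespace. You can see this by running str.extractall on a single value (the sample string is taken from the question's data):

import pandas as pd

s = pd.Series(["Banking, 1001-5000 employees,Construction, 51-200 employees"])
print(s.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))"))
# match 0 captures "Banking, 1001-5000 employees"
# match 1 captures "Construction, 51-200 employees"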
I have rows in a CSV file that I have converted to a pandas dataframe. However, in some rows the columns 'name' and 'business' contain multiple names and businesses that need to be split up and placed into individual rows, while keeping the data from the other columns the same for each row produced by the split.
Here is the example data:
input:

software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J.                    Unspecified, Unspecified
output I'd like to get:

software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column; the business column and a few other columns still need to be split up in step with the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
records = df.to_dict('records')
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')
Since pandas 1.3.0 it's possible to explode on multiple columns, so a simple solution is to:
1. Split name on commas, and business on commas that come after 'employees' or 'Unspecified' (implemented with a regex below).
2. Explode on both name and business.
This gives:
import pandas as pd
import io
data = '''software name business
abc Andrew Johnson, Steve Martin Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz Jack Jones, Rick Paul, Johnny Jones Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def Tom D., Connie J. Unspecified, Unspecified'''
df = pd.read_csv(io.StringIO(data), sep = '\t')
df['name'] = df['name'].str.split(', ')
df['business'] = df['business'].str.split(r'(?<=employees)\,\s*|(?<=Unspecified)\,\s*')
df = df.explode(['name','business']).reset_index(drop=True)
result:

   software            name                                             business
0       abc  Andrew Johnson            Outsourcing/Offshoring, 201-500 employees
1       abc    Steve Martin  Health, Wellness and Fitness, 5001-10,000 employees
2       xyz      Jack Jones                         Banking, 1001-5000 employees
3       xyz       Rick Paul                       Construction, 51-200 employees
4       xyz    Johnny Jones                    Consumer Goods, 10,001+ employees
5       def          Tom D.                                          Unspecified
6       def       Connie J.                                          Unspecified
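If you're stuck on pandas older than 1.3.0, where explode only accepts a single column, one workaround (a sketch, assuming name and business have already been split into equal-length lists as above) is to zip the two list columns, explode once, and unpack:

# pair up the i-th name with the i-th business in each row
df['pairs'] = [list(zip(n, b)) for n, b in zip(df['name'], df['business'])]
df = df.explode('pairs').reset_index(drop=True)
# unpack the (name, business) tuples back into the two columns
df[['name', 'business']] = pd.DataFrame(df['pairs'].tolist(), index=df.index)
df = df.drop(columns='pairs')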
I have a dataframe with two columns, Userid and Actors (a comma-separated string), as shown below:
Userid Actors
u1 Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward,.......
u2 Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos,......
It represents the actors from all movies watched by a particular user of the platform.
I want an output where we have the count of each actor for each user as shown below:
UserId Tony Ward Bruce LaBruce Kevin P. Scott Ivar Johnson Luke Wilson Owen Wilson Lumi Cavazos
u1 2 1 1 1 0 0 0
u2 1 1 1 0 1 1 1
It is something similar to CountVectorizer, I reckon, but I just have nouns here. Kindly help.
Assuming it's a pandas DataFrame, try this: DataFrame.explode transforms each element of a list-like (the result of the split) into a row, DataFrame.groupby aggregates the data, and DataFrame.unstack pivots it into the required format.
df['Actors'] = df['Actors'].str.replace(r",\s+", ",", regex=True).str.split(",")
(
    df.explode('Actors')
      .groupby(['Userid', 'Actors'])
      .size()
      .unstack(fill_value=0)
)
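An equivalent one-liner, assuming Actors has already been split into lists as above, is pd.crosstab on the exploded frame:

exploded = df.explode('Actors')
print(pd.crosstab(exploded['Userid'], exploded['Actors']))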
I have a dataframe with two columns, Person Name and Company Name. I want to create two more columns called Name and Name_Type. Name would be a concatenation of Person_name and Company_name, and the Name_Type column would indicate whether the name is a Person type or a Company type. Some rows have empty strings, which creates four possibilities:
1) Empty Person + Empty Company = can be left blank.
2) Empty Person + Company Name = the Company Name value.
3) Person Name + Empty Company = the Person Name value.
4) Both names = split them into two rows. I cannot figure out how to do that.
I am a Python and pandas beginner and haven't come across an answer online. Hoping to find something here. Please excuse format or other errors.
Input:
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
"Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
  Company_name Person_name
0                    Aaron
1      XYZ Inc
2      ABC LLC        Phil
3                      Joe
Expected output:
  Company_name Person_name     Name    Name_Type
0                    Aaron    Aaron  Person_name
1      XYZ Inc              XYZ Inc Company_name
2      ABC LLC        Phil     Phil  Person_name
2      ABC LLC        Phil  ABC LLC Company_name
3                      Joe      Joe  Person_name
You can use apply, unstack and merge
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
"Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
def logic(row):
if row.Company_name and row.Person_name:
return pd.Series([[row.Person_name, "Person_name"], [row.Company_name, "Company_name"]])
else:
return pd.Series([[row.Person_name, "Person_name"] if row.Person_name else [row.Company_name, "Company_name"]])
df2 = df.apply(logic, 1).unstack().apply(pd.Series).dropna().reset_index().set_index("level_1").sort_index()
dff = pd.merge(df,df2, left_index=True, right_index=True).iloc[:, [0,1,3,4]]
dff.columns = ["Company_name", "Person_name", "Name", "Name_Type"]
Output
  Company_name Person_name     Name    Name_Type
0                    Aaron    Aaron  Person_name
1      XYZ Inc              XYZ Inc Company_name
2      ABC LLC        Phil     Phil  Person_name
2      ABC LLC        Phil  ABC LLC Company_name
3                      Joe      Joe  Person_name
Use melt, after resetting the index into an 'index' column for melt and merge to key on:

import numpy as np

df1 = df.reset_index()
(df1.melt('index', var_name='Name_Type', value_name='Name')
    .replace('', np.nan).dropna()
    .merge(df1, on='index').sort_values('index')
    .set_index('index'))
Output:
          Name_Type     Name Person_name Company_name
index
0       Person_name    Aaron       Aaron
1      Company_name  XYZ Inc                  XYZ Inc
2       Person_name     Phil        Phil      ABC LLC
2      Company_name  ABC LLC        Phil      ABC LLC
3       Person_name      Joe         Joe