I have a dataframe with two columns, Person_name and Company_name. I want to create two more columns, Name and Name_Type: Name takes its value from Person_name or Company_name, and Name_Type records which of the two it came from. Some rows have empty strings, which creates four possibilities:
1) Empty Person + Empty Company = can be left blank.
2) Empty Person + Company Name = Company_name value.
3) Person Name + Empty Company = Person_name value.
4) Both names present = split them into two rows. I cannot figure out how to do that (one possible sketch follows the expected output below).
I am a Python and pandas beginner and haven't found an answer online. Hoping to find one here. Please excuse formatting or other errors.
Input:
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
                   "Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
  Company_name Person_name
0                    Aaron
1      XYZ Inc
2      ABC LLC        Phil
3                      Joe
Expected output:
  Company_name Person_name     Name     Name_Type
0                    Aaron    Aaron   Person_name
1      XYZ Inc             XYZ Inc  Company_name
2      ABC LLC        Phil     Phil   Person_name
2      ABC LLC        Phil  ABC LLC  Company_name
3                      Joe      Joe   Person_name
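For reference, the whole mapping can be read as: collect each non-empty name together with its source column, then split those lists into rows. A minimal sketch along those lines (not from the original thread; assumes pandas 0.25+ for Series.explode):

import pandas as pd

df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
                   "Company_name": ["", "XYZ Inc", "ABC LLC", ""]})

# One (value, source-column) tuple per non-empty cell, collected per row
pairs = df.apply(lambda r: [(v, c) for c, v in r.items() if v], axis=1)

# explode gives one row per tuple: case 4 yields two rows, cases 2-3 one,
# and case 1 (both empty) becomes NaN and is dropped here
long = pairs.explode().dropna()
extra = pd.DataFrame(long.tolist(), index=long.index,
                     columns=["Name", "Name_Type"])
out = df.join(extra)  # duplicate labels in extra duplicate row 2, as desired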
You can use apply, unstack and merge
import pandas as pd

df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
                   "Company_name": ["", "XYZ Inc", "ABC LLC", ""]})

def logic(row):
    if row.Company_name and row.Person_name:  # case 4: emit two pairs
        return pd.Series([[row.Person_name, "Person_name"],
                          [row.Company_name, "Company_name"]])
    return pd.Series([[row.Person_name, "Person_name"] if row.Person_name
                      else [row.Company_name, "Company_name"]])

# Expand the list-valued cells, restore the original row labels, merge back
df2 = (df.apply(logic, axis=1).unstack().apply(pd.Series).dropna()
         .reset_index().set_index("level_1").sort_index())
dff = (pd.merge(df, df2, left_index=True, right_index=True)
         .drop(columns="level_0")
         .rename(columns={0: "Name", 1: "Name_Type"}))
Output
  Company_name Person_name     Name     Name_Type
0                    Aaron    Aaron   Person_name
1      XYZ Inc             XYZ Inc  Company_name
2      ABC LLC        Phil     Phil   Person_name
2      ABC LLC        Phil  ABC LLC  Company_name
3                      Joe      Joe   Person_name
Use melt, with df1 = df.reset_index() so the original row labels are available as an 'index' column for the merge back:

import numpy as np

df1 = df.reset_index()
(df1.melt('index', var_name='Name_Type', value_name='Name')
    .replace('', np.nan).dropna()
    .merge(df1, on='index').sort_values('index')
    .set_index('index'))
Output:
         Name_Type     Name Person_name Company_name
index
0      Person_name    Aaron       Aaron
1     Company_name  XYZ Inc                  XYZ Inc
2      Person_name     Phil        Phil      ABC LLC
2     Company_name  ABC LLC        Phil      ABC LLC
3      Person_name      Joe         Joe
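For readers new to melt: the first half of that chain reshapes df1 to one row per original cell, carrying the source column name into Name_Type; the merge on 'index' then re-attaches the original columns, which duplicates row 2. Printing the intermediate frame (shown for illustration, on recent pandas where dict column order is preserved):

print(df1.melt('index', var_name='Name_Type', value_name='Name')
         .replace('', np.nan).dropna())

   index     Name_Type     Name
0      0   Person_name    Aaron
2      2   Person_name     Phil
3      3   Person_name      Joe
5      1  Company_name  XYZ Inc
6      2  Company_name  ABC LLC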
Related
I have a csv file that I am converting to a pandas DataFrame. In some rows, the 'name' and 'business' columns contain multiple names and businesses that should be in separate rows; they need to be split up while keeping the data from the other columns the same for each resulting row.
Here is the example data:
input:

software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J., Ricky B.          Unspecified, Unspecified, Self-employed
output I need:

software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
def       Ricky B         Self-employed
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column, leaving the business column and a few other columns that still need to be split in step with the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1),
                       data=name2.values, name='name2'))
records = df.to_dict('records')  # 'records' (not 'record'); avoid shadowing dict
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    # df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')  # the positional axis argument is deprecated
Here's an alternative solution I was trying as well, but it gives me an error too:
import glob
import io
from pathlib import Path

import pandas as pd

review_path = r'data/base_data'
review_files = glob.glob(review_path + "/test_data.csv")
review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep='\t')
    print(df.head())
    df["business"] = (df["business"]
                      .str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
                      .groupby(level=0)
                      .agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
outPutPath = Path('data/base_data/test_data.csv')
df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
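A likely cause of that empty DataFrame: glob.glob returns file paths, so io.StringIO(review_file) parses the path string itself as CSV text rather than opening the file, leaving a frame whose only "column" is the file name. Passing the path straight to pandas would read the file instead (a one-line sketch):

df = pd.read_csv(review_file, sep='\t')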
Try:
df["business"] = (
df["business"]
.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
.groupby(level=0)
.agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
  software            name                                             business
0      abc  Andrew Johnson            Outsourcing/Offshoring, 201-500 employees
0      abc    Steve Martin  Health, Wellness and Fitness, 5001-10,000 employees
1      xyz      Jack Jones                         Banking, 1001-5000 employees
1      xyz       Rick Paul                       Construction, 51-200 employees
1      xyz    Johnny Jones                    Consumer Goods, 10,001+ employees
2      def          Tom D.                                          Unspecified
2      def       Connie J.                                          Unspecified
2      def        Ricky B.                                        Self-employed
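To see why extractall is used for business instead of a plain comma split, here is a quick check of the pattern on a single cell (a sketch, reusing the same regex):

import pandas as pd

s = pd.Series(["Banking, 1001-5000 employees,Construction, 51-200 employees"])
print(s.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))"))
# Each lazy match runs up to the next 'employees'/'Unspecified'/'Self-employed',
# so commas inside a business segment do not split it.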
I have a csv file that I have converted to a pandas DataFrame. In some rows, the 'name' and 'business' columns contain multiple names and businesses that need to be split up and placed into individual rows, while keeping the data from the other columns the same for each resulting row.
Here is the example data:
input:

software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J.                    Unspecified, Unspecified
output I'd like to get:

software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column, leaving the business column and a few other columns that still need to be split in step with the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1),
                       data=name2.values, name='name2'))
records = df.to_dict('records')  # 'records' (not 'record'); avoid shadowing dict
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    # df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')  # the positional axis argument is deprecated
Since pandas 1.3.0 it's possible to explode on multiple columns, so a simple solution is to:
1) Split name on commas, and business on commas that come after 'employees' or 'Unspecified' (implemented with a regex lookbehind below).
2) Explode on both name and business.
That gives:
import pandas as pd
import io
data = '''software name business
abc Andrew Johnson, Steve Martin Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz Jack Jones, Rick Paul, Johnny Jones Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def Tom D., Connie J. Unspecified, Unspecified'''
df = pd.read_csv(io.StringIO(data), sep = '\t')
df['name'] = df['name'].str.split(', ')
df['business'] = df['business'].str.split(r'(?<=employees)\,\s*|(?<=Unspecified)\,\s*')
df = df.explode(['name','business']).reset_index(drop=True)
result:

  software            name                                             business
0      abc  Andrew Johnson            Outsourcing/Offshoring, 201-500 employees
1      abc    Steve Martin  Health, Wellness and Fitness, 5001-10,000 employees
2      xyz      Jack Jones                         Banking, 1001-5000 employees
3      xyz       Rick Paul                       Construction, 51-200 employees
4      xyz    Johnny Jones                    Consumer Goods, 10,001+ employees
5      def          Tom D.                                          Unspecified
6      def       Connie J.                                          Unspecified
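One caveat: df.explode(['name', 'business']) requires the two lists in each row to have matching lengths, otherwise pandas raises a ValueError. A quick sanity check between the split steps and the explode (a sketch on the same frame):

# rows where the name and business counts disagree would break the explode
mismatch = df[df['name'].str.len() != df['business'].str.len()]
print(mismatch)  # empty if every segment was split consistently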
If I have this table:
City     State  Person 1  Person 2
Atlanta  GA     Bob       Fred
But, I want to convert it to:
City     State  Person#  Person Name
Atlanta  GA     1        Bob
Atlanta  GA     2        Fred
What is the most efficient way to accomplish this?
Use melt:
out = df.melt(['City', 'State'], var_name='Person#', value_name='Person Name')
out['Person#'] = out['Person#'].str.extract(r'(\d+)')
>>> out
City State Person# Person Name
0 Atlanta GA 1 Bob
1 Atlanta GA 2 Fred
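For comparison, pd.wide_to_long can do the same reshape while parsing the trailing number out of the column names for you (a sketch, assuming the 'Person 1'/'Person 2' headers with a single-space separator and that City/State uniquely identify rows):

out = (pd.wide_to_long(df, stubnames='Person', i=['City', 'State'],
                       j='Person#', sep=' ')
         .rename(columns={'Person': 'Person Name'})
         .reset_index())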
I have the following dataframe, where some of the rows have partial matches:
id  name   affiliation
1   David  Department of Biology, State University
2   Peter  Dept. of Chemistry
1   David  Biology Department, University of State
2   Peter  Chemistry Dept.
1   David  Department for Zoology, 2nd State University
Ids 1 and 2 are duplicated based on the id and name columns; however, the affiliations make these rows unique (due to different spellings). I have tried the duplicated function in pandas, but it matches complete cell strings, and I could not find a partial string match function.
Is there a way to get common strings from the affiliations to make the rows unique?
Desired dataframe should look like:

id  name   affiliation
1   David  Biology Department, State University
2   Peter  Dept. Chemistry
1   David  Department for Zoology, 2nd State University

or

id  name   affiliation
1   David  Department Biology, State University
2   Peter  Chemistry Dept.
1   David  Department for Zoology, 2nd State University
You could try using the Python standard library difflib module, which provides helpers for computing deltas, like this:
Setup
from difflib import SequenceMatcher
import pandas as pd
# Original dataframe
df = pd.DataFrame(
    {
        "id": [1, 2, 1, 2, 1],
        "name": ["David", "Peter", "David", "Peter", "David"],
        "affiliation": [
            "Department of Biology, State University",
            "Dept. of Chemistry",
            "Biology Department, University of State",
            "Chemistry Dept.",
            "Department for Zoology, 2nd State University",
        ],
    }
)
print(df)
# Outputs
   id   name                                   affiliation
0   1  David       Department of Biology, State University
1   2  Peter                            Dept. of Chemistry
2   1  David       Biology Department, University of State
3   2  Peter                               Chemistry Dept.
4   1  David  Department for Zoology, 2nd State University
Find and delete similar rows
# Sort "affiliation" column by increasing length
length_increasing = df["affiliation"].str.len().sort_values().index
df = df.reindex(length_increasing)
# Iterate on each value
for affiliation in df["affiliation"].values:
    for row in df.iterrows():
        if row[1]["affiliation"] == affiliation:
            continue
        # Characters to compare (sorted, truncated to equal length)
        words = sorted(str(affiliation))
        other_words = sorted(str(row[1]["affiliation"])[: len(words)])
        # Replace "affiliation" value on a significant match ratio
        if SequenceMatcher(None, words, other_words).ratio() > 0.9:
            df.loc[row[0], "affiliation"] = affiliation

# Drop duplicated rows and clean up
df = (
    df.drop_duplicates(subset="affiliation")
    .sort_values(["id", "affiliation"])
    .reset_index(drop=True)
)
Result
print(df)
# Outputs
   id   name                                   affiliation
0   1  David  Department for Zoology, 2nd State University
1   1  David       Department of Biology, State University
2   2  Peter                            Dept. of Chemistry
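To get a feel for the 0.9 threshold, the character-sorted comparison can be run directly on the first matched pair (same import as above):

a = sorted("Department of Biology, State University")
b = sorted("Biology Department, University of State"[: len(a)])
print(SequenceMatcher(None, a, b).ratio())  # 1.0: same characters, words reordered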
I used pandas to get a list of all Email duplicates, but not all email duplicates are in fact duplicates of a contact: in a small company, for example, all employees may share the same email address.
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
a#company-a.com  Adam       Smith     34223  46456   Company A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
b#company-b.com  Elon       Musk      42342  34234   Company B
That's why I came up with the following condition to filter my Email duplicate list further down:
I want to extract all the cases where the Email, FirstName and LastName are equal in a dataframe because that almost certainly would mean that this is a real duplicate. The extracted dataframe should look like this in the end:
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
How can I get there? Is it possible to check for multiple equal conditions?
I would appreciate any feedback regarding the best practices.
Thank you!
Use DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first')
output
             Email FirstName LastName  Phone Mobile    Company
0  a#company-a.com      John      Doe  12342  65464  Company_a
2  a#company-a.com      Adam    Smith  34223  46456  Company A
3  b#company-b.com      Bill    Gates  23423  63453  Company B
5  b#company-b.com      Elon     Musk  42342  34234  Company B
To get the duplicates
df[~df.index.isin(df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first').index)]
output
             Email FirstName LastName  Phone Mobile    Company
1  a#company-a.com      John      Doe  43214  45645  Comp_ny A
4  b#company-b.com      Bill    Gates  32421  43244     Comp B
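Since the expected output keeps every row of each duplicate group (not only the later occurrences), DataFrame.duplicated with keep=False may match the stated goal more directly (a sketch with the same subset):

real_dupes = df[df.duplicated(subset=['Email', 'FirstName', 'LastName'], keep=False)]

On the sample data this returns rows 0, 1, 3 and 4: both John Doe rows and both Bill Gates rows.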