I have a list of companies with their subsidiaries; the data looks like this:
CompanyName                 Employees
Microsoft China             1
Microsoft India             1
Microsoft Europe            1
Apple Inc                   1
Apple Data Inc              1
Apple Customer Service Inc  1
Data Corp                   1
Data SHCH                   1
Data India                  1
City Corp                   1
Data City                   1
If two companies share a word (e.g. Apple Inc and Apple Data Inc), they are considered one company. I want to group those companies together and calculate their total number of employees.
The expected return should be:
Company    Employees
Microsoft  3
Apple      3
Data       3
City       2
The Company column should return the common word, and Employees should return the sum for the company and its subsidiaries.
Most of the pandas functions don't really work in this case. Any suggestions on a for loop?
As you requested in the comments. If the company is always the first word in CompanyName, you can do this:
# extract company as word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]
# groupby company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})
# display(dfg)
             CompanyName
CompanyName
Apple                  3
City                   1
Data                   4
Microsoft              3
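Since every company in the sample has Employees = 1, counting rows and summing employees happen to coincide. If the counts can differ, summing the Employees column is the safer aggregation; a minimal sketch along the same lines:

# group by the extracted first word and sum employees instead of counting rows
dfg = df.groupby('CompanyName', as_index=False).agg({'Employees': 'sum'})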
I don't think there's a very simple way to do what you want, but it's not too complex either.
First, you need to clearly define the criterion for deciding which names belong to the same company.
We can try "take the first word and see if it matches"; it's obviously not a perfect approach, but it'll do for now.
Then, create an object to store your new data. I would recommend a dictionary, with entries like company: total_employees.
Now iterate over the rows of the dataframe with apply and a function that does the work. It'll look like this:
totals = {}  # company -> total employees

def aggregator(row):
    # take the first word of the company name as the grouping key
    word1 = row.CompanyName.split(" ")[0]
    if word1 in totals:
        totals[word1] += row.Employees
    else:
        totals[word1] = row.Employees

df.apply(aggregator, axis=1)
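After the apply pass, the totals dictionary can be turned back into a frame. A minimal sketch, assuming pandas is imported as pd and the output columns from the question:

# one row per company, with the summed employee count
result = pd.DataFrame(list(totals.items()), columns=['Company', 'Employees'])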
I have a table like this:
University            Course
Stanford              Financial Accounting
Stanford              Managerial Accounting
Yale                  Managerial Accounting
Harvard               Corporate Finance
University of Oregon  Accounting & Finance
I would like to group them where part of the string matches, counting each University only once, like this:
Course      Count
Accounting  3
Finance     2
Managerial  2
Financial   1
Corporate   1
Currently I have this groupby statement that counts exact matches:
groups = dataFrame.groupby('Course')['University'].nunique().sort_values(ascending=False)
And it returns:
Course                 Count
Financial Accounting   1
Managerial Accounting  2
Corporate Finance      1
Does anyone know a way to group the courses based on partial matches rather than exact matches?
Any guidance would be greatly appreciated. Thank you.
Maybe you could use Series.str.get_dummies and manually pick out the columns:
counts = (
    df['Course']
      .drop_duplicates()
      .str.get_dummies(' ')
      .sum()[['Accounting', 'Finance']]
      .to_frame()
      .reset_index()
      .set_axis(['Course', 'Count'], axis=1)
)
Output:
>>> counts
       Course  Count
0  Accounting      3
1     Finance      2
Or maybe it's simpler and you just want to select the last word of each value:
counts = (
    df['Course']
      .drop_duplicates()
      .str.split(' ').str[-1]
      .value_counts()
      .reset_index()
      .set_axis(['Course', 'Count'], axis=1)
)
Output:
>>> counts
       Course  Count
0  Accounting      2
1     Finance      2
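Neither snippet counts each university only once, as the expected output requires. Here is a sketch that does, splitting every course into words and counting distinct universities per word; dropping the '&' token is an assumption about which words matter:

counts = (
    df.assign(word=df['Course'].str.split())
      .explode('word')                            # one row per (university, word)
      .query("word != '&'")                       # drop the connective token
      .groupby('word')['University'].nunique()    # each university counted once per word
      .sort_values(ascending=False)
      .rename_axis('Course')
      .reset_index(name='Count')
)

On the sample data this yields Accounting 3, Finance 2, Managerial 2, and 1 each for Financial and Corporate, matching the expected output.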
I am trying to find whether a news article contains the name of a company from a list I already have as a dataframe column. One dataframe contains the article text as a column, and another dataframe holds the company names. I would like to search each article's text to detect whether any name from the list appears, and create a separate column containing the name of the company found in the text. Someone recommended using merge, but since I do not have a common identifier, that was not possible. I hope the following example illustrates the idea.
First Dataframe (Article):
Index  Text
1      Apple decided to launch new product....
2      Tesla is ...
3      IBM is paying dividend......
4      Amazon is relocating.....
...    ...
Second Dataframe with company name (Compname):
Index  Name
1      BP
2      Tesla
3      Bank of America
4      Amazon
5      JP Morgan
6      Apple
...    ...
What I want to see in the end would be the following:
Index  Text                                      Name_found
1      Apple decided to launch new product....   Apple
2      Tesla is ...                              Tesla
3      IBM is paying dividend......              NaN
4      Amazon is relocating.....                 Amazon
...    ...                                       ...
I tried something like the following, but it didn't quite get the job done:
for x in compname['Name']:
    Article['Name_found'] = Article['Text'].str.contains(x, na=False)
Thank you for your help. Truly appreciate it.
Do you want this?
pattern = r'(' + '|'.join(df1['Name'].to_list()) + ')'
df2['Text'] = df2['Text'].str.extract(pat=pattern)
print(df2)
The idea is to build one regex pattern with multiple alternatives. For this case the pattern will look like this:
'(BP|Tesla|Bank of America|Amazon|JP Morgan|Apple)'
Output:
Index Text
0 1 Apple
1 2 Tesla
2 3 NaN
3 4 Amazon
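One caveat, as an aside: if any name contains regex metacharacters (a dot or parenthesis, say), escaping the names keeps the pattern literal. A hedged tweak:

import re

# escape each name so it is matched literally inside the alternation
pattern = '(' + '|'.join(re.escape(name) for name in df1['Name']) + ')'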
I am trying to count the total number of visitors to all restaurants in 2017 (the total number of people who visited any restaurant, not per-restaurant totals). I only want to count a restaurant's numbers if its store_id appears in the relation_table, but I can't get my code to work: I get a syntax error on "no_visitors".
UPDATE: My problem was with a previous line
total_visits = reservations.loc[reservations["store_id"].isin(relation_table["store_id"]) & (reservations.year==2017), "no_visitors"].sum()
Example dataframe
RESERVATIONS                       RELATION_TABLE
store_id    year  no_visitors      store_id
mcdonalds   2017  4                mcdonalds
kfc         2016  5                kfc
burgerking  2017  2
One way to filter your data (df) is df[filter_condition], which returns the rows for which the given condition is true. Then all you need is to take the sum of the column you are interested in (no_visitors).
# df = reservations
df[(df.store_id != "") & (df.year == 2017)].no_visitors.sum()
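The same pattern combines with the isin check from the question's update to keep only stores that appear in relation_table:

# restrict to stores present in relation_table, then sum 2017 visitors
valid = df["store_id"].isin(relation_table["store_id"])
df.loc[valid & (df.year == 2017), "no_visitors"].sum()  # 4 for the sample data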
I am looping through a dataframe column of headlines (sp500news) and comparing against a dataframe of company names (co_names_df). I am trying to update the frequency each time a company name appears in a headline.
My current code is below and is not updating the frequency columns. Is there a cleaner, faster implementation - maybe without the for loops?
for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1
co_names_df sample
Name Frequency
0 3M 0
1 A.O. Smith 0
2 Abbott 0
3 AbbVie 0
4 Accenture 0
5 Activision 0
6 Acuity Brands 0
7 Adobe Systems 0
...
sp500news['title'] sample
title
0 Italy will not dismantle Montis labour reform minister
1 Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4 Xis campaign to draw people back to graying rural China faces uphill battle
6 Romney begins to win over conservatives
8 Oregon mall shooting survivor in serious condition
9 Polands PGNiG to sign another deal for LNG supplies from US CEO
You can probably speed this up; you're using dataframes where other structures would work better. Here's what I would try.
from collections import Counter

counts = Counter()
# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])

for title in sp500news['title']:
    for word in title:  # did you mean title.split(" ")? or is title a list of strings?
        if word in company_names:
            counts.update([word])
counts is then a dictionary {company_name: count}. You can just do a quick loop over the elements to update the counts in your dataframe.
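For example, a minimal sketch of that final step, assuming the Name and Frequency columns from the question:

# map each name to its count; names never seen get 0
co_names_df['Frequency'] = co_names_df['Name'].map(counts).fillna(0).astype(int)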
I need to merge a row from a target DataFrame into my source DataFrame on a fuzzy matching condition that has already been developed; let's call the method fuzzyTest. If fuzzyTest returns True, I want to merge the matched row from the target file into my source file.
So basically, do a left join where the TARGET COMPANY passes the fuzzyTest when compared to the SOURCE COMPANY.
Source DataFrame
SOURCE COMPANY
0 Cool Company
1 BigPharma
2 Tod Kompany
3 Wallmart
Target DataFrame
TARGET COMPANY
0 Kool Company
1 Big farma
2 Todd's Company
3 C-Mart
4 SuperMart
5 SmallStore
6 ShopRus
Hopefully after mapping through fuzzyTest the output would be:
SOURCE COMPANY TARGET COMPANY
0 Cool Company Kool Company
1 BigPharma Big farma
2 Tod Kompany Todd's Company
3 Wallmart NaN
So if your fuzzy logic only compares the two strings on each row, just wrap it as a function that takes the source column and the target column.
Put both columns in one dataframe, then run:
def FuzzyTest(source, target):
    .....
    if ...:
        return target
    else:
        return None

df['Target Company'] = df.apply(lambda x: FuzzyTest(x['Source'], x['Target']), axis=1)
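Since fuzzyTest itself isn't shown, here is a hedged end-to-end sketch that scans every target for each source row (left-join semantics), using difflib.SequenceMatcher as a stand-in similarity test; the 0.6 threshold and the frame/column names are assumptions:

from difflib import SequenceMatcher

def fuzzy_test(a, b, threshold=0.6):
    # stand-in for the real fuzzyTest: overall character similarity
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def best_match(name, targets):
    # return the first target that passes the fuzzy test, else None
    for t in targets:
        if fuzzy_test(name, t):
            return t
    return None

targets = target_df['TARGET COMPANY'].tolist()
source_df['TARGET COMPANY'] = source_df['SOURCE COMPANY'].apply(
    lambda s: best_match(s, targets)
)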