Is there a way to groupby columns that are similar? - python

I have a table like this:
University            Course
Stanford              Financial Accounting
Stanford              Managerial Accounting
Yale                  Managerial Accounting
Harvard               Corporate Finance
University of Oregon  Accounting & Finance
I would like to group them where part of the string matches and it only counts once per University like this:
Course      Count
Accounting  3
Finance     2
Managerial  2
Financial   1
Corporate   1
Currently I have this group by statement that shows top string matches:
groups = dataFrame.groupby('Course')['University'].nunique().sort_values(ascending=False)
And it returns:
Course                 Count
Financial Accounting   1
Managerial Accounting  2
Corporate Finance      1
Does anyone know a way to group the similar courses based on similar matches rather than exact matches?
Any guidance would be greatly appreciated. Thank you

Maybe you could use Series.str.get_dummies and manually pick out the columns:
counts = df['Course'].drop_duplicates().str.get_dummies(' ').sum()[['Accounting', 'Finance']].to_frame().reset_index().set_axis(['Course', 'Count'], axis=1)
Output:
>>> counts
Course Count
0 Accounting 2
1 Finance 1
Or maybe it's simpler and you just want to select the last part of each value:
counts = df['Course'].drop_duplicates().str.split(' ').str[-1].value_counts().reset_index().set_axis(['Course', 'Count'], axis=1)
Output:
>>> counts
Course Count
0 Accounting 2
1 Finance 1
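A word-level variant that is closer to the expected output in the question: split each course into words, put one word per row, and count the distinct universities per word. This is only a minimal sketch. It assumes the sample table from the question and assumes that only alphabetic tokens count as words (so "&" is ignored); the `Word` column name is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "University": ["Stanford", "Stanford", "Yale", "Harvard", "University of Oregon"],
    "Course": ["Financial Accounting", "Managerial Accounting", "Managerial Accounting",
               "Corporate Finance", "Accounting & Finance"],
})

# One row per (University, word); findall keeps alphabetic tokens only
words = df.assign(Word=df["Course"].str.findall(r"[A-Za-z]+")).explode("Word")

# nunique counts each university at most once per word
counts = (
    words.groupby("Word")["University"].nunique()
         .sort_values(ascending=False)
         .rename_axis("Course")
         .reset_index(name="Count")
)
print(counts)
```

Because `nunique` is applied to `University`, Stanford contributes only once to "Accounting" even though it offers two accounting courses.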

Related

Count the number of strings with length in pandas

I am trying to calculate the number of strings in a column with length of 5 or more. These strings are in a column separated by comma.
df= pd.DataFrame(columns=['first'])
df['first'] = ['Jack Ryan, Tom O','Stack Over Flow, StackOverFlow','Jurassic Park, IT', 'GOT']
This is the code I have so far, but it does not create a new column with the count of strings of 5 or more characters:
df['countStrings'] = df['first'].str.split(',').count(r'[a-zA-Z0-9]{5,}')
Expected Output: Counting Strings of length 5 or More.
first                           countString
Jack Ryan, Tom O                0
Stack Over Flow, StackOverFlow  2
Jurassic Park, IT               1
GOT                             0
Edge Case: Strings of length more than 5 separated by comma and have multiple spaces
first                                     wrongCounts  rightCounts
Accounts Payable Goods for Resale         4            1
Corporate Finance, Financial Engineering  4            2
TBD                                       0            0
Goods for Not Resale, SAP                 2            1
The pandas str.len() method returns the length of each string in a Series. It only works on a Series of strings.
Since this is a string method, it has to be called through the .str accessor every time.
You can try this:
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = ['jack,utah,TOMHAWK Somer,SORITNO','jill','bob,texas','matt,AR','john']
# replace commas with spaces so that every name becomes a separate word
df['first'] = df['first'].str.replace(',', ' ')
# total number of words in the column (all lengths, not only 5 or more)
df['first'].str.count(r'\w+').sum()
You can match 5 chars and on the left and right match optional chars other than a comma.
[^,]*[A-Za-z0-9]{5}[^,]*
See a regex demo with the matches.
Example
import pandas as pd
df = pd.DataFrame(columns=['first'])
df['first'] = [
'Accounts Payable Goods for Resale',
'Corporate Finance, Financial Engineering',
'TBD',
'Goods for Not Resale, SAP',
'Jack Ryan, Tom O',
'Stack Over Flow, StackOverFlow',
'Jurassic Park, IT',
'GOT'
]
df['countStrings'] = df['first'].str.count(r'[^,]*[A-Za-z0-9]{5}[^,]*')
print(df)
Output
first countStrings
0 Accounts Payable Goods for Resale 1
1 Corporate Finance, Financial Engineering 2
2 TBD 0
3 Goods for Not Resale, SAP 1
4 Jack Ryan, Tom O 0
5 Stack Over Flow, StackOverFlow 2
6 Jurassic Park, IT 1
7 GOT 0
This is how I would get the number of comma-separated strings with len >= 5 in a column (a single total over all rows; note that spaces count toward the length):
data=[i for k in df['first']
for i in k.split(',')
if len(i)>=5]
result=len(data)
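For a per-row count that matches the expected output in the question, one option is to split each cell on commas and count the pieces that contain a run of at least 5 alphanumeric characters. This is a sketch under that reading of the question; `count_long` and `RUN` are illustrative names, not part of pandas.

```python
import re
import pandas as pd

# Rows copied from the question, including the edge cases
df = pd.DataFrame({"first": [
    "Accounts Payable Goods for Resale",
    "Corporate Finance, Financial Engineering",
    "TBD",
    "Goods for Not Resale, SAP",
    "Jack Ryan, Tom O",
    "Stack Over Flow, StackOverFlow",
    "Jurassic Park, IT",
    "GOT",
]})

# A run of 5 or more consecutive letters or digits
RUN = re.compile(r"[A-Za-z0-9]{5,}")

def count_long(cell):
    """Count comma-separated pieces that contain at least one such run."""
    return sum(1 for part in cell.split(",") if RUN.search(part))

df["countStrings"] = df["first"].apply(count_long)
print(df)
```

Each comma-separated piece is counted at most once, no matter how many long words it contains, which is what the rightCounts column in the question asks for.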

How to detect a string in dataframe column from a list of names in another dataframe column

I am trying to find whether a news article contains the name of a company from a list that I already have as a dataframe column. I have one dataframe that contains the article text as a column, and another dataframe with the company names. I would like to search each article text for any name from the list and create a separate variable containing the company name found in the text. Someone recommended using 'merge', but since the two dataframes have no common identifier, that was not possible. I hope the following example illustrates the idea.
First Dataframe (Article):
Index   Text
1       Apple decided to launch new product....
2       Tesla is ...
3       IBM is paying dividend......
4       Amazon is relocating.....
......  ........
Second Dataframe with company name (Compname):
Index  Name
1      BP
2      Tesla
3      Bank of America
4      Amazon
5      JP Morgan
6      Apple
.....  ......
What I want to see in the end would be the following:
Index  Text                                     Name_found
1      Apple decided to launch new product....  Apple
2      Tesla is ...                             Tesla
3      IBM is paying dividend......             NaN
4      Amazon is relocating.....                Amazon
....   .....                                    ......
I tried something like the following, but didn't quite get the job done
for x in compname['Name']:
    Article['Name_found']=Article['Text'].str.contains(x, na=False)
Thank you for your help. Truly appreciate it.
Is this what you want? Here df1 is the dataframe with the company names and df2 is the dataframe with the article text:
pattern = r'(' + '|'.join(df1['Name'].to_list()) + ')'
df2['Text'] = df2['Text'].str.extract(pat= pattern)
print(df2)
The idea is to build a regex pattern that joins all the names with "or" (|). For this case the pattern looks like this:
'(BP|Tesla|Bank of America|Amazon|JP Morgan|Apple)'
Output:
Index Text
0 1 Apple
1 2 Tesla
2 3 NaN
3 4 Amazon

Matching Strings and Count Frequency

I have a list of companies with their subsidiaries, the data looks as below:
CompanyName Employees
Microsoft China 1
Microsoft India 1
Microsoft Europe 1
Apple Inc 1
Apple Data Inc 1
Apple Customer Service Inc 1
Data Corp 1
Data SHCH 1
Data India 1
City Corp 1
Data City 1
If two companies have same words (e.g. Apple Inc and Apple Data Inc), they are considered one company. I will group those companies together, and calculate their total number of employees.
The expected return should be:
Company Employees
Microsoft 3
Apple 3
Data 3
City 2
The Company column returns the common word.
The Employees column returns the sum over the company and its subsidiaries.
Most pandas functions don't really work in this case. Any suggestions for a for loop?
As you requested in the comments
If the company is always the first word in CompanyName
# extract company as word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]
# groupby company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})
# display(dfg)
CompanyName
CompanyName
Apple 3
City 1
Data 4
Microsoft 3
I don't think there's a very simple way to do what you want, but it's not too complex either.
First, you need to define clearly the criterion that decides which names belong to the same company.
We can try "take the first word and see if it matches". It's obviously not a perfect approach, but it'll do for now.
Then you can create an object to store your new data. I would recommend a dictionary with entries like company: total employees.
Now iterate over the rows of the dataframe with apply and a function that does the aggregation. It'll look like this:
totals = {}  # avoid naming this "dict", which would shadow the built-in

def aggregator(row):
    # use the first word of the name as the company key
    word1 = row.CompanyName.split(" ")[0]
    if word1 in totals:
        totals[word1] += row.Employees
    else:
        totals[word1] = row.Employees

dataframe.apply(aggregator, axis=1)
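The same first-word criterion can also be expressed without an explicit loop by grouping on the first word and summing Employees. This is a sketch on the sample data from the question; the `Company` name for the grouping key is chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "CompanyName": [
        "Microsoft China", "Microsoft India", "Microsoft Europe",
        "Apple Inc", "Apple Data Inc", "Apple Customer Service Inc",
        "Data Corp", "Data SHCH", "Data India", "City Corp", "Data City",
    ],
    "Employees": [1] * 11,
})

# Grouping key: first whitespace-separated word of each name
key = df["CompanyName"].str.split().str[0].rename("Company")

# Sum Employees per key instead of counting rows
totals = df.groupby(key)["Employees"].sum().reset_index()
print(totals)
```

With this criterion "Data City" is assigned to "Data" only, so the result is Data 4 and City 1 rather than the Data 3 and City 2 shown in the question; getting those numbers would need a different matching rule.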

Pandas: sum of values in one dataframe based on the group in a different dataframe

I have a dataframe that contains companies with their sectors
Symbol Sector
0 MCM Industrials
1 AFT Health Care
2 ABV Health Care
3 AMN Health Care
4 ACN Information Technology
I have another dataframe that contains companies with their positions
Symbol Position
0 ABC 1864817
1 AAP -3298989
2 ABV -1556626
3 AXC 2436387
4 ABT 878535
What I want is to get a dataframe that contains the aggregate positions for sectors. So sum the positions of all the companies in a given sector. I can do this individually by
df2[df2.Symbol.isin(df1.groupby('Sector').get_group('Industrials')['Symbol'].to_list())]
I am looking for a more efficient pandas approach to do this rather than looping over each sector under the group_by. The final dataframe should look like the following:
Sector Sum Position
0 Industrials 14567232
1 Health Care -329173249
2 Information Technology -65742234
3 Energy 6574352342
4 Pharma 6342387658
Any help is appreciated.
If I understood the question correctly, one way to do it is to join both dataframes on Symbol and then group by sector and sum the Position column. Note that join aligns on the index, so Symbol is set as the index on both sides first:
df_agg = df1.set_index('Symbol').join(df2.set_index('Symbol')['Position'], how='inner')
df_agg.groupby('Sector')['Position'].sum()
Here df1 is the dataframe with sectors and df2 is the dataframe with positions.
You can map the Symbol column to sector and use that Series to group.
df2.groupby(df2.Symbol.map(df1.set_index('Symbol').Sector)).Position.sum()
You can also just merge:
df2.merge(df1,how='left').groupby('Sector').Position.sum()
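To check that the map and merge approaches agree, here is a small runnable sketch built from the two sample tables in the question. In that sample only ABV appears in both tables, so only Health Care gets a sum; the variable names `by_map` and `by_merge` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df1 = pd.DataFrame({
    "Symbol": ["MCM", "AFT", "ABV", "AMN", "ACN"],
    "Sector": ["Industrials", "Health Care", "Health Care",
               "Health Care", "Information Technology"],
})
df2 = pd.DataFrame({
    "Symbol": ["ABC", "AAP", "ABV", "AXC", "ABT"],
    "Position": [1864817, -3298989, -1556626, 2436387, 878535],
})

# Approach 1: map each symbol to its sector and group by the mapped Series
symbol_to_sector = df1.set_index("Symbol")["Sector"]
by_map = df2.groupby(df2["Symbol"].map(symbol_to_sector))["Position"].sum()

# Approach 2: left merge on Symbol, then group by Sector
by_merge = df2.merge(df1, on="Symbol", how="left").groupby("Sector")["Position"].sum()

print(by_map)
print(by_merge)
```

Symbols without a sector get NaN as their group key, and groupby drops NaN keys by default, so those rows do not contribute to any sum.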

nested for loops with pandas dataframe

I am looping through a dataframe column of headlines (sp500news) and comparing against a dataframe of company names (co_names_df). I am trying to update the frequency each time a company name appears in a headline.
My current code is below and is not updating the frequency columns. Is there a cleaner, faster implementation - maybe without the for loops?
for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1
co_names_df sample
Name Frequency
0 3M 0
1 A.O. Smith 0
2 Abbott 0
3 AbbVie 0
4 Accenture 0
5 Activision 0
6 Acuity Brands 0
7 Adobe Systems 0
...
sp500news['title'] sample
title
0 Italy will not dismantle Montis labour reform minister
1 Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4 Xis campaign to draw people back to graying rural China faces uphill battle
6 Romney begins to win over conservatives
8 Oregon mall shooting survivor in serious condition
9 Polands PGNiG to sign another deal for LNG supplies from US CEO
You can probably speed this up; you're using dataframes where other structures would work better. Here's what I would try.
from collections import Counter
counts = Counter()
# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])
for title in sp500news['title']:
    for word in title: # did you mean title.split(" ")? or is title a list of strings?
        if word in company_names:
            counts.update([word])
counts is then a dictionary {company_name: count}. You can just do a quick loop over the elements to update the counts in your dataframe.
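Word-by-word membership cannot find multi-word names such as "A.O. Smith", so another option is to search each headline with one combined regex and write the counts back to the dataframe. This is only a sketch: the headlines below are hypothetical (the sample headlines in the question contain no company names), and the lookarounds that stop a name from matching inside a longer word are an assumption about the desired behaviour.

```python
import re
from collections import Counter
import pandas as pd

# Names taken from the question's sample
co_names_df = pd.DataFrame({
    "Name": ["3M", "A.O. Smith", "Abbott", "AbbVie"],
    "Frequency": 0,
})
# Hypothetical headlines, made up for illustration
sp500news = pd.DataFrame({"title": [
    "3M raises outlook as Abbott lags",
    "AbbVie and 3M settle dispute",
    "Romney begins to win over conservatives",
]})

# Longest names first; the lookarounds reject matches inside longer words
names = sorted(co_names_df["Name"], key=len, reverse=True)
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
)

counts = Counter()
for title in sp500news["title"]:
    counts.update(pattern.findall(title))  # counts every occurrence

# A Counter returns 0 for names that never appeared
co_names_df["Frequency"] = co_names_df["Name"].map(lambda n: counts[n])
print(co_names_df)
```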
