I am looping through a dataframe column of headlines (sp500news) and comparing against a dataframe of company names (co_names_df). I am trying to update the frequency each time a company name appears in a headline.
My current code is below and is not updating the frequency columns. Is there a cleaner, faster implementation - maybe without the for loops?
for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1
co_names_df sample
Name Frequency
0 3M 0
1 A.O. Smith 0
2 Abbott 0
3 AbbVie 0
4 Accenture 0
5 Activision 0
6 Acuity Brands 0
7 Adobe Systems 0
...
sp500news['title'] sample
title
0 Italy will not dismantle Montis labour reform minister
1 Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4 Xis campaign to draw people back to graying rural China faces uphill battle
6 Romney begins to win over conservatives
8 Oregon mall shooting survivor in serious condition
9 Polands PGNiG to sign another deal for LNG supplies from US CEO
You can probably speed this up; you're using dataframes where other structures would work better. Here's what I would try.
from collections import Counter

counts = Counter()
# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])
for title in sp500news['title']:
    for word in title:  # did you mean title.split(" ")? or is title a list of strings?
        if word in company_names:
            counts.update([word])
counts is then a dictionary {company_name: count}. You can just do a quick loop over the elements to update the counts in your dataframe.
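As a rough sketch of that last step, the counts can be written back into the Frequency column with a lookup instead of an explicit loop. The sample headlines here are made up for illustration, and the code assumes each title is a plain string split on whitespace, so multi-word names such as "A.O. Smith" would not be matched by this approach.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data shaped like the frames in the question
co_names_df = pd.DataFrame({"Name": ["3M", "Abbott", "AbbVie"], "Frequency": [0, 0, 0]})
sp500news = pd.DataFrame({"title": [
    "3M and Abbott report earnings",
    "Abbott raises guidance",
    "Romney begins to win over conservatives",
]})

company_names = set(co_names_df["Name"])

# Count every whitespace-separated token that is exactly a company name
counts = Counter(
    word
    for title in sp500news["title"]
    for word in title.split()
    if word in company_names
)

# Look up each name in the counter; names that never appear get 0
co_names_df["Frequency"] = co_names_df["Name"].map(lambda name: counts.get(name, 0))
print(co_names_df)
```

Note that this counts every occurrence of a token, so a name appearing twice in one headline is counted twice.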
Related
I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in the given year. In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt" and there are 5 different entries in this column I want to count (government building, military, police, diplomatic building etc.). The targettype is also coded as a number if that is easier (i.e. there is another column where gov't building is 2, military installation is 4 etc.).
FYI This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID   iyear  country_txt  nkill  nwounded  nhostages  targettype_txt
10000102  2000   Nigeria      3      10        0          government building
10000103  2000   Mali         1      3         15         military installation
10000103  2000   Nigeria      15     0         0          government building
10000103  2001   Benin        1      0         0          police
10000103  2001   Nigeria      1      3         15         private business
...
And I would like it to look like this:
country_txt  iyear  total_nkill  total_nwounded  total_nhostages  total public_target
Nigeria      2000   200          300             300              15
Nigeria      2001   250          450             15               17
I was able to get the totals for nkill, nwounded, and nhostages using this super simple line:
df2 = cdf.groupby(['country','country_txt', 'iyear'])[['nkill', 'nwound','nhostkid']].sum()
But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!
Try:
cdf['CountCondition'] = ((cdf['targettype_txt']=='government building') |
                         (cdf['targettype_txt']=='military installation') |
                         (cdf['targettype_txt']=='police'))
df2 = cdf[cdf['CountCondition']].groupby(['country','country_txt', 'iyear', 'CountCondition']).count()
You create a new column 'CountCondition' which just marks as true or false if the condition in the statement holds. Then you just count the number of times the CountCondition is True. Hope this makes sense.
It is possible to combine all this into one statement and NOT create an additional column, but the statement gets quite convoluted and it is harder to see how it works:
df2 = cdf[(cdf['targettype_txt']=='government building') |
          (cdf['targettype_txt']=='military installation') |
          (cdf['targettype_txt']=='police')].groupby(['country','country_txt', 'iyear']).count()
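One possible way to get the sums and the target count in a single country-year table is to flag the rows with isin and then use named aggregation. This is only a sketch: the column names follow the sample table in the question (nwounded, nhostages) rather than the names used in the groupby line, and the list of "public" target types is an assumption to adjust as needed.

```python
import pandas as pd

# Sample rows copied from the table in the question
cdf = pd.DataFrame({
    "iyear": [2000, 2000, 2000, 2001, 2001],
    "country_txt": ["Nigeria", "Mali", "Nigeria", "Benin", "Nigeria"],
    "nkill": [3, 1, 15, 1, 1],
    "nwounded": [10, 3, 0, 0, 3],
    "nhostages": [0, 15, 0, 0, 15],
    "targettype_txt": [
        "government building", "military installation",
        "government building", "police", "private business",
    ],
})

# Assumed list of target types that should be counted
public_targets = ["government building", "military installation", "police"]

summary = (
    cdf.assign(public_target=cdf["targettype_txt"].isin(public_targets))
       .groupby(["country_txt", "iyear"], as_index=False)
       .agg(
           total_nkill=("nkill", "sum"),
           total_nwounded=("nwounded", "sum"),
           total_nhostages=("nhostages", "sum"),
           # summing a boolean column counts the True rows
           total_public_target=("public_target", "sum"),
       )
)
print(summary)
```

Country-years with no matching target still appear in the result with a count of 0, which filtering the rows before the groupby would not give you.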
I have a table like this:
University            Course
Stanford              Financial Accounting
Stanford              Managerial Accounting
Yale                  Managerial Accounting
Harvard               Corporate Finance
University of Oregon  Accounting & Finance
I would like to group them where part of the string matches and it only counts once per University like this:
Course      Count
Accounting  3
Finance     2
Managerial  2
Financial   1
Corporate   1
Currently I have this group by statement that shows top string matches:
groups = dataFrame.groupby('Course')['University'].nunique().sort_values(ascending=False)
And it returns:
Course                 Count
Financial Accounting   1
Managerial Accounting  2
Corporate Finance      1
Does anyone know a way to group the similar courses based on similar matches rather than exact matches?
Any guidance would be greatly appreciated. Thank you
Maybe you could use Series.str.get_dummies and manually pick out the columns:
counts = df['Course'].drop_duplicates().str.get_dummies(' ').sum()[['Accounting', 'Finance']].to_frame().reset_index().set_axis(['Course', 'Count'], axis=1)
Output:
>>> counts
Course Count
0 Accounting 2
1 Finance 1
Or maybe it's simpler and you just want to select the last part of each value:
counts = df['Course'].drop_duplicates().str.split(' ').str[-1].value_counts().reset_index().set_axis(['Course', 'Count'], axis=1)
Output:
>>> counts
Course Count
0 Accounting 2
1 Finance 1
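If the goal is to count every word once per university, as in the expected table in the question, one option is to split each course into words, explode to one row per word, and count distinct universities. This is a sketch under the assumption that a "match" means an exact whole word and that non-alphabetic tokens such as "&" should be ignored.

```python
import pandas as pd

# Data copied from the table in the question
df = pd.DataFrame({
    "University": ["Stanford", "Stanford", "Yale", "Harvard", "University of Oregon"],
    "Course": [
        "Financial Accounting", "Managerial Accounting", "Managerial Accounting",
        "Corporate Finance", "Accounting & Finance",
    ],
})

# One row per (University, word)
words = df.assign(Word=df["Course"].str.split()).explode("Word")
# Drop tokens that are not purely alphabetic, e.g. "&"
words = words[words["Word"].str.isalpha()]

counts = (
    words.groupby("Word")["University"].nunique()  # each university counts once per word
         .sort_values(ascending=False)
         .rename_axis("Course")
         .reset_index(name="Count")
)
print(counts)
```

Because nunique is taken over universities, Stanford contributes only once to "Accounting" even though it lists two accounting courses.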
I have a list of companies with their subsidiaries, the data looks as below:
CompanyName Employees
Microsoft China 1
Microsoft India 1
Microsoft Europe 1
Apple Inc 1
Apple Data Inc 1
Apple Customer Service Inc 1
Data Corp 1
Data SHCH 1
Data India 1
City Corp 1
Data City 1
If two companies have same words (e.g. Apple Inc and Apple Data Inc), they are considered one company. I will group those companies together, and calculate their total number of employees.
The expected return should be:
Company Employees
Microsoft 3
Apple 3
Data 3
City 2
The Company column should contain the common word.
The Employees column should contain the sum for the company and its subsidiaries.
Most of the pandas functions don't really work in this case. Any suggestions for a for loop?
As you requested in the comments
If the company is always the first word in CompanyName
# extract company as word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]
# groupby company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})
# display(dfg)
CompanyName
CompanyName
Apple 3
City 1
Data 4
Microsoft 3
I don't think there's a 'very' simple way to do what you want. But it's not too complex either.
First, you need to define clearly the criterion that decides which names belong to the same 'company'.
We can try with "get the first word and see if it matches". Obviously it's not a perfect approach, but it'll do for now.
Then, you can create an object to store your new data. I would recommend a dictionary, with entries like company: (total employees).
You'll now iterate over the rows of the dataframe, with apply and a function to do what you want. It'll look like this:
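totals = {}  # company -> total employees (avoid shadowing the built-in dict)

def aggregator(row):
    # use the first word of the name as the company key
    word1 = row.CompanyName.split(" ")[0]
    if word1 in totals:
        totals[word1] += row.Employees
    else:
        totals[word1] = row.Employees

dataframe.apply(aggregator, axis=1)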
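The same first-word rule can also be expressed as a groupby that sums Employees, which avoids the row-by-row apply. This is a sketch using the data from the question; the "Company" column name is just a label chosen here.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "CompanyName": [
        "Microsoft China", "Microsoft India", "Microsoft Europe",
        "Apple Inc", "Apple Data Inc", "Apple Customer Service Inc",
        "Data Corp", "Data SHCH", "Data India", "City Corp", "Data City",
    ],
    "Employees": [1] * 11,
})

# Grouping key: the first word of each company name
first_word = df["CompanyName"].str.split().str[0].rename("Company")

# sort=False keeps the groups in order of first appearance
totals = df.groupby(first_word, sort=False)["Employees"].sum().reset_index()
print(totals)
```

With this rule "Data City" is grouped under Data, so the result is Data 4 and City 1 rather than the Data 3 and City 2 in the expected output. Matching on any shared word would need a more precise definition of which word wins.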
I have a dataframe like the one below, and I am trying to keep only the rows whose text has more than 5 characters. Here is what I tried, but it removes 'of', 'U.', 'and', 'Arts', etc. from inside the rows. I just need to remove the rows whose length is less than 5.
id schools
1 University of Hawaii
2 Dept in Colorado U.
3 Dept
4 College of Arts and Science
5 Dept
6 Bldg
wrong output from my code:
0 University Hawaii
1 Colorado
2
3 College Science
4
5
Looking for output like this:
id schools
1 University of Hawaii
2 Dept in Colorado U.
4 College of Arts and Science
Code:
import pandas as pd

l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
df1 = pd.DataFrame({'id':l, 'schools':s})
df1 = df1['schools'].str.findall(r'\w{5,}').str.join(' ') # not working
df1
Using a regex is huge (and slow) overkill for this task. You can use simple pandas indexing:
filtered_df = df1[df1['schools'].str.len() > 5] # or >= depending on the required logic
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter
df2 = df1[mask].copy()
import pandas as pd

name = ['University of Hawaii','Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]
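If the real data may contain missing values or stray whitespace, a slightly more defensive version of the same length filter could look like the sketch below. The padded and missing entries are made-up additions to the sample, and stripping whitespace before measuring is an assumption about what "length" should mean here.

```python
import pandas as pd

# Sample from the question, with one padded value and one missing value added
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "schools": [
        "University of Hawaii", "Dept in Colorado U.", "Dept",
        "College of Arts and Science", "  Dept  ", None,
    ],
})

# Strip surrounding whitespace before measuring; missing values give NaN,
# and NaN > 5 evaluates to False, so those rows are dropped as well
mask = df1["schools"].str.strip().str.len().gt(5)
df2 = df1[mask].copy()
print(df2)
```

Without the strip, the padded "  Dept  " entry would have length 8 and pass the filter.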
I'm trying to extract outliers from my dataset and tag them accordingly.
Sample Data
Doctor Name Hospital Assigned Region Claims Illness Claimed
1 Albert Some hospital Center R-1 20 Sepsis
2 Simon Another hospital Center R-2 21 Pneumonia
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
So I'm trying to group every Doctor that Claimed a certain Illness in a certain Region and trying to find outliers among them.
Doctor Name Hospital Assigned Region Claims Illness Claimed is_outlier
1 Albert Some hospital Center R-1 20 Sepsis 1
2 Simon Another hospital Center R-2 21 Pneumonia 0
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
I can do this in Power BI. But being fairly new to Python, I can't seem to figure this out.
This is what I'm trying to achieve:
Algo goes like:
Read data
Group data by Illness
Group by Region
get IQR based on Claims Count
if claims count > Q3 + 1.5 * IQR
    then tag it as outlier = 1
else
    not an outlier = 0
Export data
Any ideas?
Assuming you use pandas for data analysis (and you should!), you can use the pandas DataFrame boxplot method to produce a plot similar to yours:
import pandas as pd
import numpy as np

# sample frame, values taken from the output shown below
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 100]})
df.boxplot(column=['b'], whis=[10, 90], vert=False,
           flierprops=dict(markerfacecolor='g', marker='D'))
or, if you want to mark them 0,1 as you requested, use dataframe quantile() method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
df.assign(outlier=df[df>=df.quantile(.9)].any(axis=1)).astype(np.int8)
a b outlier
0 1 1 0
1 2 10 0
2 3 100 1
3 4 100 1
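For the grouped IQR rule described in the question, a possible sketch is to compute Q1 and Q3 per (Illness, Region) group with groupby/transform and flag claims above Q3 + 1.5 * IQR. Note this swaps the 90th-percentile cut above for the usual IQR fence. The doctor labels and claim counts below are invented for illustration, and the column names are assumed to match the sample table.

```python
import pandas as pd

# Hypothetical data: two (Illness, Region) groups with made-up claim counts
df = pd.DataFrame({
    "Doctor Name": list("ABCDEFGHI"),
    "Region": ["R-1"] * 6 + ["R-2"] * 3,
    "Illness Claimed": ["Sepsis"] * 6 + ["Pneumonia"] * 3,
    "Claims": [20, 21, 19, 22, 20, 60, 21, 22, 23],
})

grouped = df.groupby(["Illness Claimed", "Region"])["Claims"]

# transform broadcasts each group's quartile back onto its rows
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
upper_fence = q3 + 1.5 * (q3 - q1)

# 1 if the claim count is above the group's upper fence, else 0
df["is_outlier"] = (df["Claims"] > upper_fence).astype(int)
print(df)
```

For the export step, the tagged frame can then be written out with df.to_csv. Only high outliers are flagged here, matching the rule in the question; a lower fence at Q1 - 1.5 * IQR could be added the same way.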