I need to merge a row from a target DataFrame into my source DataFrame on a fuzzy matching condition that has already been developed; let's call the method fuzzyTest. If fuzzyTest returns True for a pair, I want to merge that row from the target into my source.
So basically do a left join where the TARGET COMPANY passes the fuzzyTest when compared to the SOURCE COMPANY.
Source DataFrame
SOURCE COMPANY
0 Cool Company
1 BigPharma
2 Tod Kompany
3 Wallmart
Target DataFrame
TARGET COMPANY
0 Kool Company
1 Big farma
2 Todd's Company
3 C-Mart
4 SuperMart
5 SmallStore
6 ShopRus
Hopefully after mapping through fuzzyTest the output would be:
SOURCE COMPANY TARGET COMPANY
0 Cool Company Kool Company
1 BigPharma Big farma
2 Tod Kompany Todd's Company
3 Wallmart NaN
So if your fuzzy logic only compares the two strings on the same row, just wrap it as a function that takes the source column value and the target column value.
Put both columns in one dataframe, then run:
def FuzzyTest(source, target):
    ...
    if ...:
        return target
    else:
        return None

df['Target Company'] = df.apply(lambda x: FuzzyTest(x['Source'], x['Target']), axis=1)
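If the match can be against any row of the target (a true fuzzy left join rather than a row-by-row comparison), one common pattern is to cross-merge the two frames, keep the pairs that pass the test, and then left-join back. A minimal sketch, assuming pandas >= 1.2 for how='cross' and using a stand-in matcher in place of your real fuzzyTest:

import difflib
import pandas as pd

def fuzzyTest(source, target):
    # stand-in only - replace with your real fuzzy matcher
    return difflib.SequenceMatcher(None, source.lower(), target.lower()).ratio() > 0.6

source_df = pd.DataFrame({'SOURCE COMPANY': ['Cool Company', 'BigPharma', 'Tod Kompany', 'Wallmart']})
target_df = pd.DataFrame({'TARGET COMPANY': ['Kool Company', 'Big farma', "Todd's Company", 'C-Mart',
                                             'SuperMart', 'SmallStore', 'ShopRus']})

# pair every source row with every target row, then keep the pairs that pass the test
pairs = source_df.merge(target_df, how='cross')
pairs = pairs[pairs.apply(lambda r: fuzzyTest(r['SOURCE COMPANY'], r['TARGET COMPANY']), axis=1)]

# left join back so unmatched source rows keep NaN in TARGET COMPANY
result = source_df.merge(pairs, on='SOURCE COMPANY', how='left')
print(result)

If a source company passes the test against several targets you will get several rows for it; in practice you would usually score each pair and keep only the best match per source row.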
I am trying to find whether a news article contains the name of a company from a list I already have as a dataframe column. I have one dataframe that contains the text of each article as a column, and another dataframe with the names of the companies. I would like to search each article's text to detect whether any name from the list appears, and create a separate column containing the name of the company found within the text. Someone recommended using 'merge', but since I do not have a common identifier, that was not possible. I hope the following example illustrates the idea.
First Dataframe (Article):
Index  Text
1      Apple decided to launch new product....
2      Tesla is ...
3      IBM is paying dividend......
4      Amazon is relocating.....
...    ...
Second Dataframe with company name (Compname):
Index  Name
1      BP
2      Tesla
3      Bank of America
4      Amazon
5      JP Morgan
6      Apple
...    ...
What I want to see in the end would be the following:
Index  Text                                      Name_found
1      Apple decided to launch new product....   Apple
2      Tesla is ...                              Tesla
3      IBM is paying dividend......              NaN
4      Amazon is relocating.....                 Amazon
...    ...                                       ...
I tried something like the following, but it didn't quite get the job done:
for x in compname['Name']:
    Article['Name_found'] = Article['Text'].str.contains(x, na=False)
Thank you for your help. Truly appreciate it.
Do you want this?
pattern = r'(' + '|'.join(df1['Name'].to_list()) + ')'
df2['Text'] = df2['Text'].str.extract(pat=pattern)
print(df2)
The idea is to build a regex pattern with multiple alternatives joined by |. For this data the pattern will look like this:
'(BP|Tesla|Bank of America|Amazon|JP Morgan|Apple)'
Output:
Index Text
0 1 Apple
1 2 Tesla
2 3 NaN
3 4 Amazon
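If you want to keep the article text and put the match in a separate Name_found column (as in your expected output), a small variation of the same idea works. A sketch using the question's frame names Article and Compname; re.escape is there only in case a company name contains regex metacharacters:

import re
import pandas as pd

Article = pd.DataFrame({'Text': ['Apple decided to launch new product....',
                                 'Tesla is ...',
                                 'IBM is paying dividend......',
                                 'Amazon is relocating.....']})
Compname = pd.DataFrame({'Name': ['BP', 'Tesla', 'Bank of America',
                                  'Amazon', 'JP Morgan', 'Apple']})

# one alternation pattern built from all company names
pattern = '(' + '|'.join(re.escape(name) for name in Compname['Name']) + ')'

# extract into a new column, leaving the original text untouched
Article['Name_found'] = Article['Text'].str.extract(pattern, expand=False)
print(Article)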
I have a list of companies with their subsidiaries, the data looks as below:
CompanyName Employees
Microsoft China 1
Microsoft India 1
Microsoft Europe 1
Apple Inc 1
Apple Data Inc 1
Apple Customer Service Inc 1
Data Corp 1
Data SHCH 1
Data India 1
City Corp 1
Data City 1
If two companies share a word (e.g. Apple Inc and Apple Data Inc), they are considered one company. I want to group those companies together and calculate their total number of employees.
The expected return should be:
Company Employees
Microsoft 3
Apple 3
Data 3
City 2
The Company column should contain the common word.
The Employees column should be the sum for the company and its subsidiaries.
Most of the pandas functions don't really work in this case. Any suggestions on a for loop?
As you requested in the comments.
If the company is always the first word in CompanyName:
# extract company as word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]
# groupby company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})
# display(dfg)
CompanyName
CompanyName
Apple 3
City 1
Data 4
Microsoft 3
I don't think there's a 'very' simple way to do what you want, but it's not too complex either.
First, you need to clearly define the criterion for deciding which names belong to the same 'company'.
We can try "take the first word and see if it matches"; obviously it's not a perfect approach, but it'll do for now.
Then, you can create an object to store your new data. I would recommend a dictionary, with entries like company: total_employees.
You'll then iterate over the rows of the dataframe with apply and a function that does the aggregation. It'll look like this:
totals = {}  # company -> total employees

def aggregator(row):
    word1 = row.CompanyName.split(" ")[0]
    if word1 in totals:
        totals[word1] += row.Employees
    else:
        totals[word1] = row.Employees

df.apply(aggregator, axis=1)
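For comparison, the same "first word is the company" idea can also be written without an explicit loop; a minimal sketch using the question's column names (it has the same limitation as the loop: "Data City" lands under "Data"):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'CompanyName': ['Microsoft China', 'Microsoft India', 'Microsoft Europe',
                    'Apple Inc', 'Apple Data Inc', 'Apple Customer Service Inc',
                    'Data Corp', 'Data SHCH', 'Data India', 'City Corp', 'Data City'],
    'Employees': [1] * 11,
})

# group by the first word of CompanyName and sum the employees
totals = (df.groupby(df['CompanyName'].str.split().str[0])['Employees']
            .sum()
            .rename_axis('Company')
            .reset_index())
print(totals)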
I currently have this table in a CSV. MOL stands for Male and FOL stands for Female. The second part (e.g. SAL_Tagalog) is the language, and SOLSE_NotWell means they do not speak the language well, whereas SOLSE_Well means they speak it well.
MOL_SAL_English_SOLSE_NotWell FOL_SAL_English_SOLSE_Well Postcode State Year
0 8 2600 ACT 2016
0 9 2601 ACT 2016
0 16 2602 ACT 2016
0 7 2603 ACT 2016
3 6 2604 ACT 2016
0 20 2605 ACT 2016
I need to turn each cell into its own row like below:
Sex Language Proficiency Count Postcode State Year
Male English NotWell 0 2600 ACT 2016
Female English Well 8 2600 ACT 2016
Male English NotWell 0 2601 ACT 2016
Female English Well 9 2601 ACT 2016
Can anyone help me on how to approach this?
So far, I have a function that splits the column names by "_" so I end up with [MOL, SAL_English, SOLSE_NotWell]. These are meant to go into the cells of my new CSV.
I am currently trying to write it row by row into a new CSV file with the columns Sex, Language, Proficiency, Count, Postcode, State and Year. This is shown below, but I am unsure how to get the relevant Postcode, State, etc. from the same row.
for column_name, count in df.iteritems():
    new_df.Language = column_name
    new_df.Count = count
You could do it using pandas' built-in tools:
Flip the columns from wide to long using pandas' melt method; this allows easier extraction of details.
The rest is a combination of pandas string methods, regex, and map.
# mapping for the Sex column which we'll create
sex_map = {'MOL': 'Male', 'FOL': 'Female'}

(df
 # flip the columns from wide to long
 .melt(id_vars=['Postcode', 'State', 'Year'], value_name='count')
 # extract the required columns
 # Sex is either MOL or FOL, the characters before the first _
 .assign(Sex=lambda x: x.variable.str.split('_').str[0],
         # Proficiency is the group of characters after the last _
         Proficiency=lambda x: x.variable.str.split('_').str[-1],
         # a bit of regex here:
         # basically get the word between SAL_ and _SOL
         Language=lambda x: x.variable.str.extract(r'((?<=SAL_).*(?=_SOL))', expand=False))
 # use the sex_map dictionary to get the right entries (Male|Female)
 .assign(Sex=lambda x: x.Sex.map(sex_map))
 # remove the unwanted column
 .drop('variable', axis=1)
)
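If you want to run the chain above, here is the input frame rebuilt from the sample rows in the question (column names assumed to match the CSV exactly):

import pandas as pd

df = pd.DataFrame({
    'MOL_SAL_English_SOLSE_NotWell': [0, 0, 0, 0, 3, 0],
    'FOL_SAL_English_SOLSE_Well': [8, 9, 16, 7, 6, 20],
    'Postcode': [2600, 2601, 2602, 2603, 2604, 2605],
    'State': ['ACT'] * 6,
    'Year': [2016] * 6,
})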
And here is another option, possibly longer: a combination of wide_to_long, stack, and string methods.
sex_mapping = {'MOL_SAL': 'Male', 'FOL_SAL': 'Female'}

(pd.wide_to_long(df,
                 stubnames=['MOL_SAL', 'FOL_SAL'],
                 i=['Postcode', 'State', 'Year'],
                 j='proficiency',
                 sep='_',
                 suffix=r'[\w_]+')
 .stack()
 .reset_index()
 .rename(columns={'level_4': 'Sex', 0: 'count'})
 .assign(language=lambda x: x.proficiency.str.extract(r'(.*?(?=_))', expand=False),
         proficiency=lambda x: x.proficiency.str.rsplit('_').str[-1],
         Sex=lambda x: x.Sex.map(sex_mapping))
)
#first couple of rows
Postcode State Year proficiency Sex count language
0 2600 ACT 2016 NotWell Male 0.0 English
1 2600 ACT 2016 Well Female 8.0 English
2 2601 ACT 2016 NotWell Male 0.0 English
3 2601 ACT 2016 Well Female 9.0 English
4 2602 ACT 2016 NotWell Male 0.0 English
5 2602 ACT 2016 Well Female 16.0 English
This is an example of that common unfortunate scenario where the underscore _ is being used both as a delimiter separating attributes and as a space character.
First, pull the postcode, state and year out into the index, just to get them out of the way.
Then, use a helper function to split your remaining column names into meaningful tuples, and upgrade those tuples to a MultiIndex. At that point you're essentially done; all that remains is to reshape the data.
The helper function parses each column name into a tuple (Sex, Language, Proficiency). I also add a fourth element to indicate what kind of data it is; this ends up being the "Count" level in your final data.
Writing that helper function is usually the fiddliest part.
# Helper function
def helper_function(name):
    # Use the first character to get the gender
    if name[0] == 'F':
        sex = 'Female'
    elif name[0] == 'M':
        sex = 'Male'
    else:
        sex = 'Other'
    # The third word is the language
    lang = name.split('_')[2]
    # The last word is the proficiency
    prof = name.split('_')[-1]
    return (sex, lang, prof, 'Count')
Applying that to a column name will allow us to split out the important details.
# Step 1
new_df = df.set_index(['Postcode', 'State', 'Year'])
# Step 2
column_tuples = [helper_function(name) for name in new_df.columns]
# Step 3
new_df.columns = pd.MultiIndex.from_tuples(column_tuples, names=['Sex','Language', 'Proficiency', 'Measure'])
# Step 4
new_df = new_df.stack(['Sex','Language', 'Proficiency']).reset_index()
MultiIndexes are worth learning about!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html
I have a dataframe that contains companies with their sectors:
Symbol Sector
0 MCM Industrials
1 AFT Health Care
2 ABV Health Care
3 AMN Health Care
4 ACN Information Technology
I have another dataframe that contains companies with their positions
Symbol Position
0 ABC 1864817
1 AAP -3298989
2 ABV -1556626
3 AXC 2436387
4 ABT 878535
What I want is to get a dataframe that contains the aggregate positions for sectors. So sum the positions of all the companies in a given sector. I can do this individually by
df2[df2.Symbol.isin(df1.groupby('Sector').get_group('Industrials')['Symbol'].to_list())]
I am looking for a more efficient pandas approach rather than looping over each sector in the groupby. The final dataframe should look like the following:
Sector Sum Position
0 Industrials 14567232
1 Health Care -329173249
2 Information Technology -65742234
3 Energy 6574352342
4 Pharma 6342387658
Any help is appreciated.
If I understood the question correctly, one way to do it is to join both data frames, then group by sector and sum the Position column, like so:
df_agg = df1.join(df2['Position']).drop('Symbol', axis=1)
df_agg.groupby('Sector').sum()
Where, df1 is the df with Sectors and df2 is the df with Positions.
You can map the Symbol column to sector and use that Series to group.
df2.groupby(df2.Symbol.map(df1.set_index('Symbol').Sector)).Position.sum()
Let us just do merge:
df2.merge(df1,how='left').groupby('Sector').Position.sum()
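Putting the merge approach together with the sample frames from the question (only ABV appears in both samples, so the aggregate here only reflects that overlap; on the full data every symbol would contribute):

import pandas as pd

df1 = pd.DataFrame({'Symbol': ['MCM', 'AFT', 'ABV', 'AMN', 'ACN'],
                    'Sector': ['Industrials', 'Health Care', 'Health Care',
                               'Health Care', 'Information Technology']})
df2 = pd.DataFrame({'Symbol': ['ABC', 'AAP', 'ABV', 'AXC', 'ABT'],
                    'Position': [1864817, -3298989, -1556626, 2436387, 878535]})

# attach the sector to each position row, then sum positions per sector
sector_totals = (df2.merge(df1, on='Symbol', how='left')
                    .groupby('Sector')['Position']
                    .sum()
                    .reset_index(name='Sum Position'))
print(sector_totals)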
I am new to pandas and python in general - grateful for any direction you can provide!
I have a csv file with 4 columns. I am trying to group together rows where the first three columns are the same on all rows (Column A Row 1 = Column A Row 2, Column B Row 1 = Column B Row 2, and so on)
My data look like this:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
3 9991112222 NJ 2015-05-14 Apartment rental
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
6 4443337777 NJ 2015-05-14 condo
So in this data, rows 1, 2 and 3 would be in one group, and rows 4 and 5 would be in another group. Row 6 would not be in the group with 1, 2, and 3 because it has a different phone_number.
Then, for each row, I want to compare the string in the description column against each other description in that group using Levenshtein distance, and keep the rows where the descriptions are sufficiently similar.
"Condo" from row 1 would be compared to "Condo sales call" from row 2 and to "Apartment rental" in row 3. It would not be compared to "condo" from row 6.
In the end, the goal is to weed out rows where the description is not sufficiently similar to another description in the same group. Phrased differently, to print out all rows where description is at least somewhat similar to another (any other) description in that group. Ideal output:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
Row 6 does not print because it was never in a group. Row 3 doesn't print because "Apartment rental" is insufficiently similar to "Condo" or "Condo sales call".
This is the code I have so far. I can't tell if this is the best way to do it. And if I have done it right so far, I can't figure out how to print the full row of interest:
import Levenshtein
import itertools
import pandas as pd

test_data = pd.DataFrame.from_csv('phone_state_etc_test.csv', index_col=None)
for pn in test_data['phone_number']:
    for dt in test_data['date']:
        for st in test_data['state']:
            for a, b in itertools.combinations(test_data[
                    (test_data['phone_number'] == pn) &
                    (test_data['state'] == st) &
                    (test_data['date'] == dt)
                    ]['description'], 2):
                if Levenshtein.ratio(a, b) > 0.35:
                    print pn, "|", dt, "|", st, "|" #description
This prints a bunch of duplicates of these lines:
9991112222 | NJ | 2015-05-14 |
6668885555 | CA | 2015-05-06 |
But if I add description to the end of the print line, I get a
SyntaxError: invalid syntax
Any thoughts on how I can print the full row? Whether in pandas dataframe, or some other format, doesn't matter - I just need to output to csv.
Why don't you use pandas' groupby to find the unique groups (based on phone_number, state and date)? Doing this lets you treat all the description values separately and do whatever you want with them.
For example, I'll group by those columns and get the unique values of the description column within each group:
In [49]: df.groupby(['phone_number','state','date']).apply(lambda v: v['description'].unique())
Out[49]:
phone_number state date
4443337777 NJ 2015-05-14 [condo]
6668885555 CA 2015-05-06 [Apartment, Apartment-rental]
9991112222 NJ 2015-05-14 [Condo, Condo-sales-call, Apartment-rental]
dtype: object
You can use any function within the apply. More examples here - http://pandas.pydata.org/pandas-docs/stable/groupby.html
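To carry that idea through to the filtering step, here is a rough sketch of a function you could pass through groupby/apply, keeping a row only when its description clears the threshold against some other description in the same group. It uses the python-Levenshtein package and the 0.35 cutoff from the question; single-row groups drop out, which matches row 6 in the example:

import pandas as pd
import Levenshtein

# sample rows from the question
test_data = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14'] * 3 + ['2015-05-06'] * 2 + ['2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

def keep_similar(group, threshold=0.35):
    descs = group['description'].tolist()
    # keep row i if any other description j in the same group is similar enough
    mask = [any(Levenshtein.ratio(descs[i], descs[j]) > threshold
                for j in range(len(descs)) if j != i)
            for i in range(len(descs))]
    return group[mask]

result = (test_data
          .groupby(['phone_number', 'state', 'date'], group_keys=False)
          .apply(keep_similar))
print(result)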
I'm not entirely sure of the best way to do a calculation over all pairs of values in pandas. Here I've made a matrix with the descriptions as both the rows and columns (so the main diagonal of the matrix compares each description with itself), but it doesn't seem entirely idiomatic:
def find_similar_rows(group, threshold=0.35):
    sim_matrix = pd.DataFrame(index=group['description'],
                              columns=group['description'])
    for d1 in sim_matrix.index:
        for d2 in sim_matrix.columns:
            # Leave diagonal entries as nan
            if d1 != d2:
                sim_matrix.loc[d1, d2] = Levenshtein.ratio(d1, d2)
    keep = sim_matrix.gt(threshold, axis='columns').any()
    # A bit of possibly unnecessary mucking around with the index
    # here, could probably be cleaned up
    rows_to_keep = group.loc[keep[group['description']].tolist(), :]
    return rows_to_keep
grouped = test_data.groupby('phone_number', group_keys=False)
grouped.apply(find_similar_rows)
Out[64]:
phone_number state date description
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
It seems from the data provided that you want to keep rows for which the first word in the description matches the most common first word for that group.
If that's the case, you can do this:
test_data['description_root'] = test_data['description'].str.split().str[0]
# this adds a column with the first word from the description column

grouped = test_data.groupby(['phone_number', 'state', 'date'])
most_frequent_root = grouped.description_root.transform(
    lambda s: s.value_counts().idxmax())
# this is a series with the same index as the original df containing
# the most frequently occurring root for each group

test_data[test_data.description_root == most_frequent_root]
# this will give you the matching rows
You could also call .describe on grouped to get some additional information for each group. Sorry if this is off topic, but I think you might well find the Series string methods (.str) and groupby useful.