Contains function in Pandas - python

I am performing matching (a kind of fuzzy matching) between the company names of two data frames. To do so, I first perform a full merge between all the company names whose leading characters match, meaning all the companies starting with 'A' in one data frame are matched to all the companies starting with 'A' in the other data frame. This is done as follows:
df1['df1_Start'] = df1['company1'].astype(str).str.slice(0,2)
df2['df2_Start'] = df2['company2'].astype(str).str.slice(0,2)
Merge = pd.merge(df1,df2, left_on='df1_Start',right_on='df2_Start')
Now I want to keep all the rows from Merge where the company name in df1 contains the company name in df2. This is because the companies in df1 have elongated names.
Merge1=Merge[Merge['company1'].str.contains(Merge['company2'].str)]
This isn't working for me. How do I perform this task? Also, please suggest other ways I could match the company names, because the same company might appear in both data frames but not be written in exactly the same way.

I think you need to join all the values with | (regex "or") and pass the resulting pattern to str.contains:
Merge1 = Merge[Merge['company1'].str.contains("|".join(Merge['company2'].tolist()))]
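Note that the joined pattern tests each company1 against every value of company2, not just the one in the same row, and it can misbehave if names contain regex special characters (wrap each name in re.escape to be safe). If the check must be row-wise, a minimal sketch using apply on the merged frame from the question:

# Row-wise test: does this row's company1 contain this row's company2?
mask = Merge.apply(lambda row: str(row['company2']) in str(row['company1']), axis=1)
Merge1 = Merge[mask]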

Related

Creating a column based on matches from a list

I have a data frame with a column of job titles and the company name in the same string of each row, I also have a list of all possible company names.
How do I search that column of my data frame to see if it contains one of the companies in my list, and then create a new column holding just the company name for the rows where there is a match?
I tried a few solutions but can't find one that works.
The original logic I followed is:
df['Company'] = df['Title'].str.contains(x for x in joblist)
but obviously that throws an error.
Any help is appreciated, thanks.
Use Series.str.contains with all the values joined by | (regex "or") to test for matches:
df['test'] = df['Title'].str.contains('|'.join(joblist))
and if you want to extract the matched values from the list, use Series.str.extract:
df['Company'] = df['Title'].str.extract(f'({"|".join(joblist)})', expand=False)
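A small runnable sketch of both steps, with a made-up company list and made-up titles for illustration:

import pandas as pd

joblist = ['Acme', 'Globex']  # hypothetical company list
df = pd.DataFrame({'Title': ['Engineer - Acme', 'Analyst at Globex', 'Freelancer']})

df['test'] = df['Title'].str.contains('|'.join(joblist))
df['Company'] = df['Title'].str.extract(f'({"|".join(joblist)})', expand=False)
print(df)  # rows with no match get False in 'test' and NaN in 'Company'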
You need to iterate over all the items in the company list and compare them with each value of the "Title" column.
You can check whether one string contains another using the in operator.
all_titles = df['Title']
for title in all_titles:
    for company in joblist:
        if company in title:
            pass  # your code here

How to group values in a dataframe and keep the values associated

I am trying to solve a problem in Python, I feel like groupby is the solution but I can't tell how should I use it.
I have a dataframe budget where every team in the first league of soccer in France is associated with the budget of the team.
It looks like this: [screenshot of the budget dataframe]
The seasons go from '2010/2011' to '2019/2020'. I want to find a way to group each team with the budget associated with each season. I think I could do it by iterating through every column, finding the index associated with the team value, and then finding the budget's value for each season. But maybe there is a more efficient way that you could help me find.
Thank you very much
Based on the DB schema, I would first extract the data into a separate dataframe for each year, each with the columns ['Team', 'Budget', 'Year']: df_year2011, df_year2012, and so on.
Then I would concat them to create a df:
frames=[df_year2011,df_year2012,df_year2013]
df = pd.concat(frames)
And then, to group the data in the dataframe, I would apply this:
df_grouped=df.groupby('Team')
And if you need to group by multiple columns you do :
df_grouped=df.groupby(['Column1_Name','Column2_Name'])
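A minimal runnable sketch of that flow, with hypothetical per-season frames and the column names assumed above:

import pandas as pd

# Hypothetical per-season frames standing in for the extracted data
df_year2011 = pd.DataFrame({'Team': ['PSG', 'Lyon'], 'Budget': [100, 80], 'Year': '2010/2011'})
df_year2012 = pd.DataFrame({'Team': ['PSG', 'Lyon'], 'Budget': [120, 85], 'Year': '2011/2012'})

df = pd.concat([df_year2011, df_year2012], ignore_index=True)

# For example, collect each team's budgets across seasons
print(df.groupby('Team')['Budget'].apply(list))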

How to compare two dataframes and create a new one for those entries which are the same across two columns in the same row

I have been trying to compare two dataframes, creating a new dataframe for the entries which have the same values in two columns. I thought I had cracked it, but the code I have now just looks at the two columns of interest and considers it a match if the string is found anywhere in that column. I need the two strings to be common on the same row across the columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[(df_vto['source'].isin(df_jeff['source']) & df_vto['target'].isin(df_jeff['target']))].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table listing the rows where both the source and target strings appear somewhere in the other dataframe, but not necessarily on the same row. Can anyone help me compare specifically row by row?
You can use the pandas merge method:
result = pd.merge(df1, df2, on='key')
here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
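Applied to the question's frames, merging on both columns keeps only the rows where 'source' and 'target' match on the same row in each dataframe. A sketch with hypothetical stand-ins for df_vto and df_jeff:

import pandas as pd

df_vto = pd.DataFrame({'source': ['a', 'b', 'c'], 'target': ['x', 'y', 'z']})
df_jeff = pd.DataFrame({'source': ['a', 'c', 'd'], 'target': ['x', 'q', 'z']})

# Inner merge on both columns: only ('a', 'x') survives, because 'c' and 'z'
# appear in df_jeff but never together on one row
vto_in_jeff = pd.merge(df_vto, df_jeff, on=['source', 'target'], how='inner')
print(vto_in_jeff)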

How can I find very similar but NOT equal rows within two pandas dataframe columns?

I'm trying to compare two columns from two different dataframes to find similar values. The values are strings, so they are not exactly the same, just very similar. How can I get those similar values?
The dataframes that I use are like the following:
Dataframe 1, column "Company", row = "Company_name"
Dataframe 2, column "Company", row = "Company_name_INC"
What I would like to get:
Dataframe 3, column "Company_source_1" row = "Company_name", column "Company_source_2", row = "Company_name_INC"
I need to find those names that are almost the same, in order to find the companies that appear in both dataframes.
You can use regular expressions:
Regular expressions (https://docs.python.org/3/howto/regex.html) can be used to do exactly what you are asking. For example, if you are looking for a company related to 'Regex' such as:
Regex
Regex_inc
NotRegex
You can do the following:
[Note that I have wrapped the DataFrame column Name in a Series and used the .str.contains() method, which can then be used to index the appropriate rows of your original DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)]
import pandas as pd

data = [['Regex', 'company_1'],
        ['Regex_inc', 'company_2'],
        ['NotRegex', 'company_3']]
df = pd.DataFrame(data).rename(columns={0: 'Name', 1: 'Company'})
df_sorted = df[pd.Series(df['Name']).str.contains(r'^Regex')]
print(df)
print(df_sorted)
Returns
Name Company
0 Regex company_1
1 Regex_inc company_2
2 NotRegex company_3
for df, and:
Name Company
0 Regex company_1
1 Regex_inc company_2
for df_sorted
The argument to the pd.Series.str.contains() method was '^Regex', which requires a string to begin with 'Regex' in order to return True.
I used this regex cheatsheet (https://www.rexegg.com/regex-quickstart.html) for the special characters. I'm not an expert on regex, but plenty of material can be found online, including the links in this answer. Also, here (https://regex101.com/) is a regex tester that can be used to test your patterns.
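To build the Dataframe 3 from the question, one simple notion of "almost the same" is a prefix match: for each name in the first frame, keep the names in the second frame that start with it. A sketch with hypothetical data (re.escape guards names that contain regex special characters):

import re
import pandas as pd

df1 = pd.DataFrame({'Company': ['Company_name', 'Other_co']})
df2 = pd.DataFrame({'Company': ['Company_name_INC', 'Unrelated_LLC']})

pairs = []
for name1 in df1['Company']:
    # Names in df2 beginning with this df1 name
    matches = df2.loc[df2['Company'].str.contains('^' + re.escape(name1)), 'Company']
    pairs.extend((name1, name2) for name2 in matches)

df3 = pd.DataFrame(pairs, columns=['Company_source_1', 'Company_source_2'])
print(df3)  # pairs Company_name with Company_name_INC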

Finding rows which aren't in two dataframes

I have the following dataframe:
symbol  name
abc     Jumping Jack
xyz     Singing Sue
rth     Fat Frog
I then have another dataframe with the same structure (symbol + name). I need to output all the symbols which are in the first dataframe but not the second.
The name column is allowed to differ. For example I could have symbol = xyz in both dataframes but with different names. That is fine. I am simply trying to get the symbols which do not appear in both dataframes.
I am sure this can be done using pandas merge and then outputting the rows that didn't merge, but I just can't seem to get it right.
Use isin and negate the condition using ~:
df[~df['symbol'].isin(df1['symbol'])]
This will return the rows whose 'symbol' is present in your first df but not in the other df.
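The merge-based route the question alludes to also works: an outer merge with indicator=True labels each row 'left_only', 'right_only', or 'both', so the 'left_only' rows are the symbols unique to the first frame. A sketch with the question's sample data:

import pandas as pd

df = pd.DataFrame({'symbol': ['abc', 'xyz', 'rth'],
                   'name': ['Jumping Jack', 'Singing Sue', 'Fat Frog']})
df1 = pd.DataFrame({'symbol': ['xyz'], 'name': ['Different Sue']})

merged = df.merge(df1, on='symbol', how='outer', indicator=True, suffixes=('_1', '_2'))
print(merged[merged['_merge'] == 'left_only'])  # abc and rth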
