Creating a column based on matches from a list - python

I have a data frame where each row contains a job title and the company name in the same string, and I also have a list of all possible company names.
How do I search that column of my data frame to see whether it contains one of the companies in my list, and then create a new column holding just the company name for the rows where there is a match?
I tried a few solutions but can't find one that works.
The original logic I followed is:
df['Company'] = df['Title'].str.contains(x for x in joblist)
but that obviously throws an error, since str.contains expects a string pattern rather than a generator.
Any help is appreciated, thanks.

Use Series.str.contains with the list values joined by | into a single regex pattern to test for a match:
df['test'] = df['Title'].str.contains('|'.join(joblist))
and if you want to extract the matched value, use Series.str.extract:
df['Company'] = df['Title'].str.extract(f'({"|".join(joblist)})', expand=False)
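A self-contained sketch of both approaches; the joblist values here are invented, and re.escape guards against company names that contain regex metacharacters:

```python
import re
import pandas as pd

joblist = ["Acme Corp", "Globex"]  # hypothetical company names
df = pd.DataFrame({"Title": ["Data Analyst - Acme Corp",
                             "Globex Engineer",
                             "Freelancer"]})

# escape each name so metacharacters are treated literally, then join with |
pattern = "|".join(re.escape(c) for c in joblist)

df["test"] = df["Title"].str.contains(pattern)                      # boolean flag
df["Company"] = df["Title"].str.extract(f"({pattern})", expand=False)  # matched name or NaN
print(df)
```

Rows with no match get NaN in the extracted column, which makes them easy to filter out afterwards.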

You need to loop over the list of companies and compare each one with each value of the "Title" column.
You can check whether one string contains another using the in operator:
all_titles = df['Title']
for title in all_titles:
    for company in joblist:
        if company in title:
            # your code here
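A loop-based sketch of that idea with invented data, returning the first company name found in each title (find_company is a hypothetical helper):

```python
joblist = ["Acme", "Globex"]  # assumed list of company names
titles = ["Data Analyst - Acme", "Engineer at Globex"]

def find_company(title, companies):
    # return the first company whose name appears in the title, else None
    for company in companies:
        if company in title:
            return company
    return None

companies_found = [find_company(t, joblist) for t in titles]
print(companies_found)  # ['Acme', 'Globex']
```

Note that for large frames the vectorized str.extract approach above will be considerably faster than a Python loop.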


how to delete rows that contain a word from a list in python

As stated in the title, I have a pandas data frame with string sentences in the column "title". I now want to filter all rows where the title column contains one of the words specified in the list "keywords".
keywords = ["Simon", "Mustermann"]
df =
                        Title
0  Simon is a python beginner
1              Second balaola
2                       Simon
Since "Simon" is found in rows with index 0 and 2, they should be retained.
My code atm is the following:
new_df = df[df["title"].isin(keywords)]
However, it only contains the third row but not the first one. How can I fix this? Thanks a lot for your support and time!
This snippet should work for you
keywords = ["Simon", "Mustermann"]
# filter rows where column title contains one of the keywords
df_filtered = df[df["title"].str.contains("|".join(keywords))]
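A minimal runnable version with the rows from the question; add case=False to str.contains if matching should ignore capitalization, and re.escape the keywords if they may contain regex metacharacters:

```python
import pandas as pd

keywords = ["Simon", "Mustermann"]
df = pd.DataFrame({"title": ["Simon is a python beginner",
                             "Second balaola",
                             "Simon"]})

# keep rows whose title contains any of the keywords
df_filtered = df[df["title"].str.contains("|".join(keywords))]
print(df_filtered)
```

Unlike isin, which tests whole-cell equality, str.contains matches substrings, so both the exact "Simon" row and the sentence containing "Simon" are retained.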

Look up each item from a list in a second list; if there's a match return that value, otherwise delete the entire row

I have two lists that were created from columns from two different dataframes. The two dataframes have the following structure:
In [73][dev]: cw.shape
Out[73]: (4666, 13)
In [74][dev]: ml.shape
Out[74]: (815, 5)
and the two lists are identifier objects intended to match data from one dataframe with the other. My intention is conceptually equivalent to a VLOOKUP in Excel: look up whether an item from list ID is in list ID2, and if so, return the corresponding 'class1' value from the second list into the new "Class" column that I've created. If the lookup (pardon my Excel reference, but hopefully you catch my drift) doesn't find the relevant value, drop the entire row.
import pandas as pd
cw = pd.read_excel("abc.xlsx")
ml = pd.read_excel("xyz.xlsx")
ID = cw['Identifier']
cw["Class"] = ""
asc = cw["Class"]
ID2 = ml['num']
bac = ml['class1']
for item in ID:
    if item in ID2:
        asc[item] = bac[item]
    else:
        cw.drop(cw.index, inplace = True)
Unfortunately the pasted script drops all rows in cw, rendering it a blank dataframe. Not what I intended. Again, what I'm aiming for is to remove rows that don't get a match between the two ID identifiers, and to return class1 values into the new Class column for the rows with matching IDs.
In [76][dev]: cw.shape
Out[76]: (0, 13)
I hope I've made this clear. I suspect I didn't set up the if statement correctly, but I'm not sure. Thank you very much for helping a beginner here.
I found a simpler and more straightforward solution using pandas merge.
# Merge with master list
cw_ac = pd.merge(cw, ml, on='cusip', how='inner')
This acts like an inner join in SQL on the identifier column and removes non-matching IDs.
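A small sketch of that merge with invented frames; 'cusip' stands in for the shared identifier column:

```python
import pandas as pd

cw = pd.DataFrame({"cusip": ["A1", "B2", "C3"], "value": [10, 20, 30]})
ml = pd.DataFrame({"cusip": ["A1", "C3"], "class1": ["Equity", "Bond"]})

# inner join keeps only rows whose cusip appears in both frames,
# and brings class1 across in the same step
cw_ac = pd.merge(cw, ml, on="cusip", how="inner")
print(cw_ac)
```

This replaces both the row-dropping and the class1 lookup from the original loop in one vectorized call.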

Filter rows based on list of values and number in a column

I am trying to filter rows from a dataframe by matching a list of number values with the number present in a column. The problem is the column contains a mixture of numbers and characters.
Eg:
mylist = [2012, 2045]
Dept No
2012 - Management
2045 - Designing
I have tried the following, but it isn't working:
df_new = df[df['Dept No'].str.split(pat="-")[0].str.strip().isin(mylist)]
Can you suggest some other ways?
All we need to do is add .str[0] so that the first element of each split list is selected. Note that the extracted prefix is a string while mylist holds integers, so cast before comparing:
df_new = df[df['Dept No'].str.split(pat="-").str[0].str.strip().astype(int).isin(mylist)]
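A runnable version with sample rows; the astype(int) cast is needed because the split-off prefix is a string while mylist holds integers:

```python
import pandas as pd

mylist = [2012, 2045]
df = pd.DataFrame({"Dept No": ["2012 - Management",
                               "2045 - Designing",
                               "3001 - Sales"]})

# split on '-', take the leading number, strip whitespace, cast, then filter
df_new = df[df["Dept No"].str.split(pat="-").str[0].str.strip()
              .astype(int).isin(mylist)]
print(df_new)
```

Without the cast, isin compares the string "2012" against the integer 2012 and finds no matches.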

Create a list from rows values with no duplicates

I would need to extract the following words from a dataframe.
car+ferrari
The dataset is
                 Owner   Sold
type
car+ferrari      J.G     £500000
car+ferrari      R.R.T.  £276,550
car+ferrari
motobike+ducati
motobike+ducati
...
I need to create a list with words from type, but distinguishing them separately. So in this case I need only car and ferrari.
The list should be
my_list=['car','ferrari']
no duplicates.
So what I should do is select the car+ferrari type, extract all of its words, and add them to a list as shown above, without duplicates (I have many car+ferrari rows, but since I only need the terms for the list, each term should be extracted once).
Any help will be appreciated
EDIT: type column is an index
def lister(x):  # function to split a string on '+'
    return set(x.split('+'))

df['listcol'] = df['type'].apply(lister)  # apply the function on the type column, saving output to a new column
Adding #AMC's suggestion of the built-in pandas way to split a Series:
df['type'].str.split(pat='+')
For details, refer to pandas.Series.str.split.
Converting the pandas index to a Series:
pd.Series(df.index)
Applying a function on the index:
pd.Series(df.index).apply(lister)
or
pd.Series(df.index).str.split(pat = '+')
or
df.index.to_series().str.split("+")
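Putting it together, a sketch that builds the deduplicated list straight from the index (data invented to mirror the question):

```python
import pandas as pd

# sample frame with 'type' as the index, as in the question's EDIT
df = pd.DataFrame(
    {"Owner": ["J.G", "R.R.T.", None], "Sold": ["£500000", "£276,550", None]},
    index=pd.Index(["car+ferrari", "car+ferrari", "car+ferrari"], name="type"),
)

# split every index label on '+', flatten with explode, deduplicate with set
my_list = sorted(set(df.index.to_series().str.split("+").explode()))
print(my_list)  # → ['car', 'ferrari']
```

Because pat is a single character here, pandas treats "+" as a literal separator rather than a regex quantifier.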

Checking panda dataframe column for a match in a list

I have a pandas dataframe with two columns: a file id number and a list of keywords from that file. I essentially want to iterate through each row, check whether a chosen keyword is in the row's list of keywords, and if it is, print out the file id. Alternatively, I could build a new dataframe of all positive matches and print the file ids from there.
After researching it I was wanting to use
df.loc[df['key words'] == key_word, :]
which would give me a new dataframe of all the positive matches. The issue was that there were no positive matches, because I forgot that my 'key words' column holds a list of keywords in each row. Would anyone be able to help me find a solution? Much appreciated.
EDIT: I'm unable to provide a snippet of my table as the data is sensitive, however this is the general idea of what it's like:
A solution can be a pandas inner join. First convert your key_word array to a pandas dataframe; say you have saved the array as "key_words.csv" and labeled its column "my_key":
col_name = ['my_key']
df1 = pd.read_csv("key_words.csv", names=col_name, skiprows=[0], encoding='utf-8')
Use skiprows=[0] if your first line is a comment; otherwise omit it.
Note: it is very important that both keyword columns use exactly the same encoding, since they are strings; otherwise your code won't find any match.
To apply my comment, you can do the following (sometimes it works without convert_dtypes, sometimes not):
df1[col_name] = df1[col_name].astype(str)
df1 = df1.convert_dtypes()
You need to repeat the same dtype conversion for your df['key words'] column, too.
You can then use an inner join:
df12 = df1.merge(df, how='inner', left_on=key1, right_on=key)
with key1 and key being the labels of the columns you want to compare.
df12 includes only the rows with a common keyword string, which you can save to a separate file.
I managed to get the code right. I did:
for i in range(len(df['file id'])):
    if keyword in df.loc[i, 'key words']:
        print("https://www.website" + df.loc[i, 'file id'])
A bit easier than I thought. Thanks everyone for your answers though.
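For reference, the same lookup can be written without an explicit index loop by applying a membership test to each row's keyword list (column names and data here are invented):

```python
import pandas as pd

keyword = "python"  # hypothetical chosen keyword
df = pd.DataFrame({
    "file id": ["001", "002"],
    "key words": [["python", "pandas"], ["excel"]],
})

# boolean mask: True where the row's keyword list contains the chosen keyword
mask = df["key words"].apply(lambda kws: keyword in kws)
matches = df.loc[mask, "file id"].tolist()
print(matches)  # ['001']
```

This avoids positional range(len(...)) indexing, which breaks if the dataframe's index is not a clean 0..n-1 sequence.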
