Reference only DataFrames where condition is True Python Pandas - python

Similar to this question but somewhat different (and that answer did not work). I am trying to reference DataFrames where a condition is true. In my case, whether or not a word from a word bank is contained in the string. If the word is in the string, I want to be able to use that specific DataFrame later (like pull out the link if true and continue searching). So I have:
wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']
rss = pd.read_csv('RSSfeed2019.csv')
# print(rss.head())
feeds = []  # list of feed objects
for url in rss['URL'].head(5):
    feeds.append(feedparser.parse(url))
# print(feeds)
posts = []  # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        if hasattr(post, 'summary'):
            posts.append((post.title, post.link, post.summary))
        else:
            posts.append((post.title, post.link))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
if (df['summary'].str.find(wordBank)) or (df['title'].str.find(wordBank)):
    print(df['title'])
and tried from the other question...
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
for word in wordBank:
    mask = (df['summary'].str.find(word)) or (df['title'].str.find(word))
    df.loc[mask, 'summary'] = word
    df.loc[mask, 'title'] = word
How can I just get it to print the titles of the fields where the words are contained in either the summary or title? I want to be able to manipulate only those frames further. With current code, it prints every title in the DataFrame because I THINK since one is true, it thinks to print ALL the titles. How can I only reference titles where true?

Given the following setup:
posts = [["Global protest Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "There is a tropical depression"]]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
print(df)
SETUP
title ... summary
0 Global protest Breaks Record ... The world’s total cellular containership fleet...
1 Global TEU Breaks Record ... The world’s total cellular containership fleet...
2 Global TEU Breaks Record ... There is a tropical depression
You could:
# build one pattern; the (?:...) group makes \b apply to every alternative,
# not only to the first and last word in the list
pattern = rf"\b(?:{'|'.join(wordBank)})\b"
# create mask
mask = df['summary'].str.contains(pattern, case=False) | df['title'].str.contains(pattern, case=False)
# extract titles
titles = df['title'].values
# print them
for title in titles[mask]:
    print(title)
Output
Global protest Breaks Record
Global TEU Breaks Record
Notice that the first row has protest in the title, and the last row has tropical depression in the summary. The key idea is to use a regex that matches any one of the alternatives in wordBank. See more about regex syntax and the documentation of str.contains.
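A minimal, self-contained sketch of the same idea, in case it helps. It assumes you want the matching rows themselves (not only the titles) so you can keep working with their links. The placeholder links, the shortened summaries, and the names `word_bank`, `pattern` and `matches` are illustrative. The alternatives are wrapped in a non-capturing group so that `\b` applies to every word, and `re.escape` guards against regex metacharacters in the word bank.

```python
import re
import pandas as pd

word_bank = ['bomb', 'explosion', 'protest', 'port delay', 'port closure',
             'hijack', 'tropical storm', 'tropical depression']

# Toy data standing in for the parsed feed entries (links are placeholders).
posts = [
    ["Global protest Breaks Record", "link-1", "Fleet passes 23 million TEU."],
    ["Global TEU Breaks Record", "link-2", "Fleet passes 23 million TEU."],
    ["Global TEU Breaks Record", "link-3", "There is a tropical depression"],
]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

# \b(?:a|b|c)\b : the group keeps the word boundaries around every alternative.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in word_bank) + r")\b"

# Element-wise OR (|) of two boolean Series; Python's `or` does not work here.
mask = (df['title'].str.contains(pattern, case=False, regex=True)
        | df['summary'].str.contains(pattern, case=False, regex=True))

# Keep only the rows where the condition is True.
matches = df.loc[mask, ['title', 'link']]
print(matches)
```

`matches` is an ordinary DataFrame, so the links of the matching posts can be pulled out with `matches['link']` and processed further.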

Related

Need help running a full match between two large dataframes based off the full business name compared to the description of a newsfeed

I have two dataframes:
One with a single column of business names that I call 'bus_names_2' with a column name of 'BUSINESS_NAME'
One with an array of records and fields pulled from an RSS feed, which I call 'df_newsfeed'. The important field is 'Description_2', which holds the RSS feed's contents after scrubbing stopwords and symbols. The same scrubbing was applied to the 'bus_names_2' dataframe.
I am trying to look through each record in the 'Description_2' field of 'df_newsfeed' to see if its words contain a business name from the 'bus_names_2' dataframe. This is easily done using the following:
def IdentityResolution_demo(bus_names, df, col='Description_2', upper=True):
    n_rows = df.shape[0]
    description_col = df.columns.get_loc(col)
    df['Company'] = ''
    company_col = df.columns.get_loc('Company')
    if upper:
        df.loc[:,col] = df.loc[:,col].str.upper()
    for ind in range(n_rows):
        businesses = []
        description = df.iloc[ind,description_col]
        for bus_name in bus_names:
            if bus_name in description:
                businesses.append(bus_name)
        if len(businesses) > 0:
            company = '|'.join(businesses)
            df.iloc[ind,company_col] = company
    df = df[['Source', 'RSS', 'Company', 'Title', 'PublishedDate', 'Description', 'Link']].drop_duplicates()
    return df
bus_names_3 = list(set(bus_names_2['BUSINESS_NAME'].tolist()))
test = IdentityResolution_demo(bus_names_3, df_newsfeed.iloc[:10])
test[test['Company']!='']
The issue with this, aside from how long it takes, is that it returns everything on a substring ("contains") basis. I only want full-word matches. That is, if 'bus_names_2' contains a company called 'Bank of A', that name should go into the 'Company' column only when the full phrase 'Bank of A' exists in the 'Description_2' column of 'df_newsfeed', and not when 'Bank of America' shows up.
Essentially, I need something like this built into my function to produce the proper output for the 'Company' column, but I don't know how to implement it. The code below gets the point across.
Description_2 = 'GUARDFORCE AI CO LIMITED AI GFAIW RIVERSOFT INC PEAKWORK COMPANY GFAIS CONCIERGE GUARDFORCE AI RIVERSOFT ROBOT TRAVEL AGENCY'
bus_name_2 = ['GUARDFORCE AI CO']
for i in bus_name_2:
    bus_name = re.compile(fr'\b{i}\b')
    print(f"{i if bus_name.match(Description_2) else ''}")
This would produce an output of 'GUARDFORCE AI CO' but if I change the bus_name_2 to:
bus_name_2 = ['GUARDFORCE AI C']
It would produce a null output.
This function is written in the way it is because comparing two dataframes turned into a very long query and so optimization required a non-dataframe format.
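One possible way to fold the whole-word test into the per-row lookup is sketched below. It assumes the names contain only ordinary word characters at their edges (so `\b` behaves as expected). The sample names, the two sample descriptions, and the helper `find_businesses` are made up for illustration. Note that it uses `search` rather than `match`, because `match` only tests at the start of the string.

```python
import re
import pandas as pd

# Hypothetical sample data; in the question these come from bus_names_2 and df_newsfeed.
bus_names = ['GUARDFORCE AI CO', 'GUARDFORCE AI C', 'RIVERSOFT INC']
df_newsfeed = pd.DataFrame({'Description_2': [
    'GUARDFORCE AI CO LIMITED AI GFAIW RIVERSOFT INC PEAKWORK COMPANY',
    'ROBOT TRAVEL AGENCY',
]})

# Compile each name once; re.escape protects names containing regex metacharacters.
compiled = [(name, re.compile(rf'\b{re.escape(name)}\b')) for name in bus_names]

def find_businesses(description):
    """Return the '|'-joined names that occur as whole words in the description."""
    return '|'.join(name for name, rx in compiled if rx.search(description))

df_newsfeed['Company'] = df_newsfeed['Description_2'].apply(find_businesses)
print(df_newsfeed[df_newsfeed['Company'] != ''])
```

Here 'GUARDFORCE AI C' is not reported, because the character following it in the text is a word character ('O'), so there is no word boundary at that position.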

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV Series.
Based on specific keywords, I want to classify each row as Movie or Series. However, because there is no space between the brackets and the keywords, they are not being picked up by the str.contains() function and I need to do a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
['James Bond'],
['How I met your Mother (Avnsitt 3)'],
['random name'],
['Random movie 3 Episode 8383893']],
columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not look at a string being the exact same but just containing the given word? Similar to how in SQL you have the "Like" functionality?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
                               Title Film_Type
0      Love Death Robots (Episode 1)    Series
1                         James Bond     Movie
2  How I met your Mother (Avnsitt 3)    Series
3                        random name     Movie
4     Random movie 3 Episode 8383893    Series
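For what it's worth, str.contains already does a substring search, so the brackets themselves are not the obstacle. In the original pattern 'Episode | Avnsitt' the spaces around the | are part of the two alternatives, so the second alternative only matches " Avnsitt" with a leading space. A small sketch on a three-title sample (the variable names are illustrative) shows the difference:

```python
import pandas as pd

titles = pd.Series(['Love Death Robots (Episode 1)',
                    'How I met your Mother (Avnsitt 3)',
                    'James Bond'])

# Alternatives are "Episode " (trailing space) and " Avnsitt" (leading space).
with_spaces = titles.str.contains('Episode | Avnsitt', case=False)

# Alternatives are exactly "Episode" and "Avnsitt".
without_spaces = titles.str.contains('Episode|Avnsitt', case=False)

print(pd.DataFrame({'title': titles,
                    'with_spaces': with_spaces,
                    'without_spaces': without_spaces}))
```

The "(Avnsitt 3)" title is missed by the first pattern because "Avnsitt" is preceded by "(" rather than a space, and it is found by the second.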

Check words from a list in column of dataframe and print the words in another created column of the same dataframe [Python]

I have a list of keywords and I wish to write a Python program that can iterate each word in the list and check if the words from the list exist in each row of a column of data frame and print those words in another column of the same data frame.
e.g.
keywords = ['registration', 'al', 'branch']
df = pd.DataFrame({'message': ['wonderful registration process', 'i hate this branch', 'this branch has a great registration process','I don't like this place']})
I want to check the matched words in the list with each row of the message in the data frame and print the matched words in another created column named "keywords" of the data frame.
So the output of the above code should be
df
                                        message
0                wonderful registration process
1                            i hate this branch
2  this branch has a great registration process
3                       I don't like this place
df
                                        message              keywords
0                wonderful registration process          registration
1                            i hate this branch                branch
2  this branch has a great registration process  registration, branch
3                       I don't like this place                  none
It will be great if anyone could guide me.
Here is a working solution.
import pandas as pd
keywords = ['registration', 'al', 'branch']
df = pd.DataFrame({'message': ["wonderful registration process", "i hate this branch", "this branch has a great registration process","I don't like this place"]})
# when a string contains an apostrophe (like don't), delimit it with double quotes ("") rather than single quotes ('')
def operation(element):
    # collect every keyword that occurs as a substring of the message
    res = ",".join([keyword for keyword in keywords if keyword in element])
    if res == "":
        return "none"  # handling no keyword situation
    else:
        return res
# insert the new column right after 'message'
df.insert(1, "keywords", df["message"].map(operation))
print(df)
Happy coding; if you have any questions, you can message me on Stack Overflow.
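Note that `keyword in element` is a substring test, so a short keyword such as 'al' would also be reported for a message containing "final". If whole-word matches are wanted instead, a variant along these lines might do. This is a sketch: the helper name `matched_keywords` and the ", " separator are choices made here to mirror the expected output in the question.

```python
import re
import pandas as pd

keywords = ['registration', 'al', 'branch']
df = pd.DataFrame({'message': ["wonderful registration process",
                               "i hate this branch",
                               "this branch has a great registration process",
                               "I don't like this place"]})

# One compiled whole-word pattern per keyword, kept in keyword-list order.
compiled = [(kw, re.compile(rf'\b{re.escape(kw)}\b')) for kw in keywords]

def matched_keywords(message):
    """Return the keywords found as whole words, or 'none' if there are none."""
    found = [kw for kw, rx in compiled if rx.search(message)]
    return ', '.join(found) if found else 'none'

df['keywords'] = df['message'].apply(matched_keywords)
print(df)
```

The keywords are listed in the order of the `keywords` list rather than in the order they appear in the message.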

Python categorize data in excel based on key words from another excel sheet

I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
Service  Experience
fast     bad
slow     easy
Data Sheet
Review #  Location  Review
1         New York  "The service was fast!"
2         Texas     "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then to get matches use str.extractall and aggregate into summary + join to add back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!  [fast]     [easy]
1         2     Texas  Overall it was a bad experience for me      []      [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!    fast       easy
1         2     Texas  Overall it was a bad experience for me                bad
Alternatively for an existence test use any on level=0:
reviews = reviews.join(
    # any(level=0) was removed in pandas 2.0; grouping on index level 0 is equivalent
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
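For reference, here is a self-contained sketch of the named-group approach that can be run without the Excel files. The two frames are typed in by hand from the example above, and the helper `group_for` is hypothetical. It adds word boundaries and `re.escape`, which the snippets above do not use. One caveat: a named group must be a valid Python identifier, so a column header with a space (such as "Customer Service") would need renaming before it can be used as a group name.

```python
import re
import pandas as pd

# Hand-typed stand-ins for Categories_List.xlsx and Data.xlsx.
cat = pd.DataFrame({'Service': ['fast', 'slow'],
                    'Experience': ['bad', 'easy']})
reviews = pd.DataFrame({
    'Review #': [1, 2],
    'Location': ['New York', 'Texas'],
    'Review': ['The service was fast and easy!',
               'Overall it was a bad experience for me'],
})

def group_for(col):
    """Build one named group, e.g. (?P<Service>\\b(?:fast|slow)\\b)."""
    words = '|'.join(re.escape(w) for w in cat[col].dropna())
    return rf'(?P<{col}>\b(?:{words})\b)'

exp = '|'.join(group_for(col) for col in cat.columns)

# One row per match; the first index level is the original review row.
extracted = reviews['Review'].str.extractall(exp)

# True where at least one keyword of that category matched in the review.
flags = extracted.notna().groupby(level=0).any()
result = reviews.join(flags)
print(result)
```

Reviews with no match at all do not appear in `extracted`, so after the join their category columns hold NaN rather than False; `fillna(False)` can be applied if that matters.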

How can I extract a pattern from a text when it involves a new line?

Say I have following text in a cell of a dataset (csv file):
I want to extract the words/phrase that appears after the keywords Decision and reason. I can do it like so:
import pandas as pd
text = '''Decision: Postpone\n\nreason:- medical history - information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
new_df = pd.DataFrame(0, index=[0], columns=keywords)
a = text.split('\n')
for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':'))>1:
                new_df[keyword][0]=cell.split(':')[1]
new_df
However, in some of the cells, the words/phrases appear in a new line after the keyword, in which case this program is unable to extract it:
import pandas as pd
text = '''Decision: Postpone\n\nreason: \n- medical history \n- information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
new_df = pd.DataFrame(0, index=[0], columns=keywords)
a = text.split('\n')
for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':'))>1:
                new_df[keyword][0]=cell.split(':')[1]
new_df
How can I fix this?
Use a regular expression to split the data; this reduces the number of loops.
import re
import pandas as pd
text = '''Decision: Postpone\n\nreason: \n- medical history \n- information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
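# start from empty strings so that text can be stored in the cells
new_df = pd.DataFrame('', index=[0], columns=keywords)
text = text.lower()
# split the text into word tokens, dropping punctuation and newlines
tokens = re.findall(r"[\w']+", text)
for key in keywords:
    if key == 'decision':
        index = tokens.index(key)
        # the decision is the single token right after the keyword
        new_df.loc[0, key] = ''.join(tokens[index+1:index+2])
    if key == 'reason':
        index = tokens.index(key)
        meta = tokens.index('review')
        # everything between 'reason' and 'to review'
        new_df.loc[0, key] = " ".join(tokens[index + 1:meta - 1])
print(new_df)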
If the content is on another line, you should not split the source string into lines and then look for all the "tokens" in the current line only.
Instead you should:
prepare a regex with 2 capturing groups (keyword and content),
look for matches, e.g. using finditer.
Example code can be as follows:
keywords = ['decision', 'reason']
it = re.finditer(r'(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))',
                 text, flags=re.DOTALL)
row = dict.fromkeys(keywords, '')
for m in it:
    kwd = m.group('kwd').lower()
    cont = m.group('cont').strip()
    if kwd in keywords:
        row[kwd] = cont
# DataFrame.append was removed in pandas 2.0, so build the frame from the row
df = pd.DataFrame([row], columns=keywords)
Of course, you should start with import re and import pandas as pd.
And maybe you should also read a little about regular expressions.
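Since the question is about cells of a dataset, the same regex can be wrapped in a small function and applied to each cell. The sketch below assumes the cells are available as a Series of strings. The two shortened sample cells and the names `FIELD_RE` and `extract_fields` are made up for illustration.

```python
import re
import pandas as pd

# keyword, colon, optional newline, then content up to the next "word:" line or the end.
FIELD_RE = re.compile(r'(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))', flags=re.DOTALL)

def extract_fields(text, keywords):
    """Return {keyword: content} for one cell; missing keywords map to ''."""
    row = dict.fromkeys(keywords, '')
    for m in FIELD_RE.finditer(text):
        kwd = m.group('kwd').lower()
        if kwd in row:
            row[kwd] = m.group('cont').strip()
    return row

keywords = ['decision', 'reason']

# Shortened sample cells: content on the same line, and content on following lines.
cells = pd.Series([
    'Decision: Postpone\n\nreason:- medical history\n\nmib: F\n',
    'Decision: Postpone\n\nreason: \n- medical history \n- information obtained\n\nmib: F\n',
])

new_df = pd.DataFrame([extract_fields(cell, keywords) for cell in cells])
print(new_df)
```

Because the content group runs up to the next line that starts with a word and a colon, any free text between the reason and the next key (such as the "to review with ..." paragraph in the question) ends up in the reason field as well.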
