Pandas: Article category prediction performance slow - python

I'm working on a naive multinomial bayes classifier for articles in Pandas and have run into a bit of an issue with performance. My repo is here if you want the full code and the dataset I'm using: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb
Here's my current setup with two dataframes: df for the articles with lists of tokenized words and word_freq to store precomputed frequency and P(word | category) values.
for category in df['category'].unique():
category_filter = word_freq.loc[word_freq['category'] == category]
cat_articles = df.loc[df['category'] == category].shape[0] # The number of categorized articles
p_cat = cat_articles / df.shape[0] # P(Cat) = # of articles per category / # of articles
df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat
Example data:
df
category content
0 QUEER VOICES [online, dating, thoughts, first, date, grew, ...
1 COLLEGE [wishes, class, believe, generation, better, j...
2 RELIGION [six, inspiring, architectural, projects, revi...
3 WELLNESS [ultramarathon, runner, micah, true, died, hea...
4 ENTERTAINMENT [miley, cyrus, ball, debuts, album, art, cyrus...
word_freq
category word freq p_given_cat
46883 MEDIA seat 1.0 0.333333
14187 CRIME ends 1.0 0.333333
81317 WORLD NEWS seat 1.0 0.333333
12463 COMEDY living 1.0 0.200000
20868 EDUCATION director 1.0 0.500000
Please note that the word_freq table is a cross product of the categories x words, so every word appears once and only once in each category, so the table does contain duplicates. Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothed).
After running the above, I do this to find the max category P (each category's P is stored in a column after its name) and get the following:
df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)
category content \
0 POLITICS [bernie, sanders, campaign, split, whether, fi...
1 COMEDY [bill, maher, compares, police, unions, cathol...
2 WELLNESS [busiest, people, earth, find, time, relax, th...
3 ENTERTAINMENT [lamar, odom, gets, standing, ovation, first, ...
4 GREEN [lead, longer, life, go, gut]
predicted_category
0 ARTS
1 ARTS
2 ARTS
3 TASTE
4 GREEN
This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions and operating on only 1% of this is taking almost a minute. I know it's because I am looping through the categories instead of relying on vectorization, but I am very very new to Pandas and trying to formulate this in a groupby succinctly escapes me (especially with the two data tables, also might be unnecessary), so I'm looking for suggestions here.
Thanks!

Just in case someone happens to come across this later...
Instead of representing my categories x words as a cross product of every possible word of every category, which inflated to over 3 million rows in my data set, I decided to reduce them to only the necessary ones per category and provide a default value for ones that did not exist, which ended up being about 600k rows.
But the biggest speedup came from changing to the following:
for category in df['category'].unique():
# Calculate P(Category)
category_filter = word_freq.loc[word_freq['category'] == category]
cat_articles = df.loc[df['category'] == category].shape[0]
p_cat = cat_articles / df.shape[0]
# Create a word->P(word | category) dictionary for quick lookups
category_dict = category_filter.set_index('word').to_dict()['p_given_cat']
# For every article, find the product of P(word | category) values of the words, then multiply by P(category) to get bayes.
df[category] = df['content'].apply(lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat
I created a dictionary from the two columns word and the P(word | category) as the key-value respectively. This reduced the problem to a quick dictionary lookup for each element of each list and computing that product.
This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.

Related

sentiment analysis of a dataframe

i have a project that involves determining the sentiments of a text based on the adjectives. The dataframe to be used is the adjectives column which i derived like so:
def getAdjectives(text):
blob=TextBlob(text)
return [ word for (word,tag) in blob.tags if tag == "JJ"]
dataset['adjectives'] = dataset['text'].apply(getAdjectives)`
I obtained the dataframe from a json file using this code:
with open('reviews.json') as project_file:
data = json.load(project_file)
dataset=pd.json_normalize(data)
print(dataset.head())
i have done the sentiment analysis for the dataframe using this code:
dataset[['polarity', 'subjectivity']] = dataset['text'].apply(lambda text: pd.Series(TextBlob(text).sentiment))
print(dataset[['adjectives', 'polarity']])
this is the output:
adjectives polarity
0 [] 0.333333
1 [right, mad, full, full, iPad, iPad, bad, diff... 0.209881
2 [stop, great, awesome] 0.633333
3 [awesome] 0.437143
4 [max, high, high, Gorgeous] 0.398333
5 [decent, easy] 0.466667
6 [it’s, bright, wonderful, amazing, full, few... 0.265146
7 [same, same] 0.000000
8 [old, little, Easy, daily, that’s, late] 0.161979
9 [few, huge, storage.If, few] 0.084762
The code has no issue except I want it to output the polarity of each adjective with the adjective, like for example right, 0.00127, mad, -0.9888 even though they are in the same row of the dataframe.
Try this:
dataset = dataset.explode("adjectives")
Note that [] will result in a np.NaN row which you might want to remove beforehand/afterwards.

Python categorize data in excel based on key words from another excel sheet

I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
Service
Experience
fast
bad
slow
easy
Data Sheet
Review #
Location
Review
1
New York
"The service was fast!
2
Texas
"Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
if cs in reviews:
print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then to get matches use str.extractall and aggregate into summary + join to add back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
reviews['Review'].str.extractall(exp).groupby(level=0).agg(
lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
reviews['Review'].str.extractall(exp).groupby(level=0).agg(
lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively for an existence test use any on level=0:
reviews = reviews.join(
reviews['Review'].str.extractall(exp).any(level=0)
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True

Pandas map two dataframes using regex

I've two dataframes, one with text information and another with regex and patterns, what I need to do is to map a column from the second dataframe using regex
edit: What I need to do is to apply each regex on all df['text'] rows, and if there is a match, add the Pattern into a new column
Sample data
text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
'sales for the overseas customers',
'marketing approach is driving strong play from top tier customers',
'employees in India have been the continuance of remote work will impact productivity',
'sales due to higher customer']}
regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
'(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
'(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
tried the following, created a function that returns the Pattern in case there is a match, then I iterate over all the columns in the regex dataframe
def finding_keywords(regex, match, keyword):
if re.search(regex, match):
return keyword
else:
pass
for index, row in regex.iterrows():
df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['Regex'][index], x, regex['Pattern'][index]))
the problem with this is that in every iteration, it erases the previous mappings, as you can see below. As I'm foo foo was the last iteration, is the only one remaining with a pattern
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to run the iteration over regex dataframe, and then iterate over df, this way I avoid loosing information, but I'm looking for a fastest solution
You can loop through the unique values of the regex dataframe and apply to the text of the df frame and return the pattern in a new regex column. Then, merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna with each iteration so the columns didn't get overwritten.
import re
import numpy as np
srs = regex['regex'].unique()
df['regex'] = np.nan
for reg in srs:
df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg
if re.search(reg, x) else np.NaN))
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer

fuzzy matching for SQL using fuzzy wuzzy and pandas

i have the following table in SQL and want to use Fuzzy Wuzzy to compare all the records in the table for any potential duplicates which in this instance line 1 is a duplicate of line 2 (or vice versa). can someone explain how i can add two additional columns to this table (Highest Score and Record Line Num) using Fuzzy Wuzzy and pandas? thanks.
Input:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount
ABC 5/12/2019 5/10/2019 ABCDE56. 56
ABC 5/13/2019 5/10/2019 ABCDE56 56
TIM 4/15/2019 4/10/2019 RTET5SDF 100
Desired Output:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount Highest Score Record Line Num
ABC 5/12/2019 5/10/2019 ABCDE56. 56 96 2
ABC 5/13/2019 5/10/2019 ABCDE56 56 96 1
TIM 4/15/2019 4/10/2019 RTET5SDF 100 0 N/A
Since you are looking for duplicates, you should filter your data frame first using the vendor name. This is to ensure it doesn't match with invoices of other vendors and reduce the processing time. However, since you didn't mention anything about it, you can skip it.
Decide on a threshold for duplicates based on the length of your invoice references. For example if the average is 5 chars, make the threshould 80%. Then, use fuzzywuzzy to get the best match.
from fuzzywuzzy import fuzz, process
# Assuming no NaNs in invoices references
inv_list = df['Invoice Ref'].to_list()
for i, inv in enumerate(inv_list)
result = process.extractOne(inv, inv_list, scorer=fuzz.token_sort_ratio)
if result[1] >= your_threshould:
df.loc[i, 'Highest Score'] = result[1]
df.loc[i, 'Record Line Num'] = inv_list.index(result[0])

Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning more and more about python and using it for my day to day operation analyst position. One of the tasks I have is sorting through approx 60k unique record identifiers, and searching through another dataframe that has approx 120k records of interactions, the employee who authored the interaction and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifer(known as Case File Identifier), Note Text, Created On.
My set up currently runs it at approximately sorting through and matching my data at 2500 rows per minute, so approximately 25-30 minutes or so for a run. What I am curious is are there any steps I performed that are:
Redundant and inefficient overall slowing my process
A poor use of syntax to work around my lack of knowledge.
Below is my code:
nok_data = pd.read_csv("raw nok data.csv") #Data set from warehouse
main_data = pd.read_csv("exampledata.csv") #Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist()) #Commented out, used to grab header titles if needed.
data_length = len(main_data) #used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"]) #convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
list_data = []
nok = nok_data["Case File Identifier"] == row
matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True] #takes created on date only if nok shows row was true
if len(matching_dates) > 0:
try:
min_dates = matching_dates.min(axis=0)
earliest_nok[row] = [min_dates[0], min_dates[1]]
except ValueError:
error_count += 1
earliest_nok[row] = None
row_count += 1
print("{} out of {} records").format(row_count, data_length)
with open('finaloutput.csv','wb') as csv_file:
writer = csv.writer(csv_file)
for key, value in earliest_nok.items():
writer.writerow([key, value])
Looking for any advice or expertise from those performing code like this much longer then I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there not including any data type.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By: Case File Identifier: Note Text: Authored on
John Doe 2017-023594 Random Text 4/1/2017 13:24:35
John Doe 2017-023594 Random Text 4/1/2017 13:11:20
Jane Doe 2017-023590 Random Text 4/3/2017 09:32:00
Jane Doe 2017-023590 Random Text 4/3/2017 07:43:23
Jane Doe 2017-023590 Random Text 4/3/2017 7:41:00
John Doe 2017-023592 Random Text 4/5/2017 23:32:35
John Doe 2017-023592 Random Text 4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
.groupby('Case File Identifier:')['Authored on', 'Authored By:'].first()
)
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').iteritems()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.

Categories