I have a project that involves determining the sentiment of a text based on its adjectives. The data I need is the adjectives column, which I derived like so:
from textblob import TextBlob

def getAdjectives(text):
    blob = TextBlob(text)
    # keep only the words tagged as adjectives ("JJ")
    return [word for (word, tag) in blob.tags if tag == "JJ"]

dataset['adjectives'] = dataset['text'].apply(getAdjectives)
I obtained the dataframe from a JSON file using this code:
import json
import pandas as pd

with open('reviews.json') as project_file:
    data = json.load(project_file)
dataset = pd.json_normalize(data)
print(dataset.head())
I have done the sentiment analysis for the dataframe using this code:
dataset[['polarity', 'subjectivity']] = dataset['text'].apply(lambda text: pd.Series(TextBlob(text).sentiment))
print(dataset[['adjectives', 'polarity']])
This is the output:
                                          adjectives  polarity
0                                                 []  0.333333
1  [right, mad, full, full, iPad, iPad, bad, diff...  0.209881
2                             [stop, great, awesome]  0.633333
3                                          [awesome]  0.437143
4                        [max, high, high, Gorgeous]  0.398333
5                                     [decent, easy]  0.466667
6    [it’s, bright, wonderful, amazing, full, few...  0.265146
7                                       [same, same]  0.000000
8           [old, little, Easy, daily, that’s, late]  0.161979
9                       [few, huge, storage.If, few]  0.084762
The code runs without issues, but I want it to output the polarity of each adjective alongside the adjective itself, for example right, 0.00127 and mad, -0.9888, even though they are in the same row of the dataframe.
Try this:
dataset = dataset.explode("adjectives")
Note that an empty list [] will result in a NaN row, which you might want to remove beforehand or afterwards.
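To get from the exploded rows to one polarity per adjective, something along these lines might work. This is only a sketch: TOY_POLARITY and score_word are hypothetical stand-ins so the example runs without TextBlob, and the sample dataset is made up. With TextBlob installed you would probably score each word with TextBlob(word).sentiment.polarity instead.

```python
import pandas as pd

# Hypothetical stand-in lexicon (assumption, not TextBlob's real scores).
TOY_POLARITY = {"great": 0.8, "awesome": 1.0, "bad": -0.7}

def score_word(word):
    # Hypothetical scorer; swap in TextBlob(word).sentiment.polarity if available.
    return TOY_POLARITY.get(word.lower(), 0.0)

# Made-up sample shaped like the question's dataframe.
dataset = pd.DataFrame({
    "text": ["nothing here", "great and awesome", "bad"],
    "adjectives": [[], ["great", "awesome"], ["bad"]],
})

per_adjective = (
    dataset[["adjectives"]]
    .explode("adjectives")            # one row per adjective, [] becomes NaN
    .dropna(subset=["adjectives"])    # drop the rows that had no adjectives
    .rename(columns={"adjectives": "adjective"})
)
per_adjective["adjective_polarity"] = per_adjective["adjective"].map(score_word)
print(per_adjective)
```

The original row index is kept, so each adjective can still be traced back to the review it came from.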
I'm working on a naive multinomial bayes classifier for articles in Pandas and have run into a bit of an issue with performance. My repo is here if you want the full code and the dataset I'm using: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb
Here's my current setup with two dataframes: df for the articles with lists of tokenized words and word_freq to store precomputed frequency and P(word | category) values.
for category in df['category'].unique():
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]  # The number of categorized articles
    p_cat = cat_articles / df.shape[0]  # P(Cat) = # of articles per category / # of articles
    df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat
Example data:
df
category content
0 QUEER VOICES [online, dating, thoughts, first, date, grew, ...
1 COLLEGE [wishes, class, believe, generation, better, j...
2 RELIGION [six, inspiring, architectural, projects, revi...
3 WELLNESS [ultramarathon, runner, micah, true, died, hea...
4 ENTERTAINMENT [miley, cyrus, ball, debuts, album, art, cyrus...
word_freq
category word freq p_given_cat
46883 MEDIA seat 1.0 0.333333
14187 CRIME ends 1.0 0.333333
81317 WORLD NEWS seat 1.0 0.333333
12463 COMEDY living 1.0 0.200000
20868 EDUCATION director 1.0 0.500000
Please note that the word_freq table is a cross product of categories x words, so every word appears exactly once per category, which means the same word is repeated across categories. Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothing).
After running the above, I do this to find the max category P (each category's P is stored in a column after its name) and get the following:
df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)
category content \
0 POLITICS [bernie, sanders, campaign, split, whether, fi...
1 COMEDY [bill, maher, compares, police, unions, cathol...
2 WELLNESS [busiest, people, earth, find, time, relax, th...
3 ENTERTAINMENT [lamar, odom, gets, standing, ovation, first, ...
4 GREEN [lead, longer, life, go, gut]
predicted_category
0 ARTS
1 ARTS
2 ARTS
3 TASTE
4 GREEN
This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions, and operating on only 1% of it takes almost a minute. I know it's because I am looping through the categories instead of relying on vectorization, but I am very new to Pandas and I can't work out how to express this succinctly as a groupby (especially with the two data tables, which might also be unnecessary), so I'm looking for suggestions here.
Thanks!
Just in case someone happens to come across this later...
Instead of representing my categories x words as a cross product of every possible word of every category, which inflated to over 3 million rows in my data set, I decided to reduce them to only the necessary ones per category and provide a default value for ones that did not exist, which ended up being about 600k rows.
But the biggest speedup came from changing to the following:
for category in df['category'].unique():
    # Calculate P(Category)
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]
    p_cat = cat_articles / df.shape[0]
    # Create a word->P(word | category) dictionary for quick lookups
    category_dict = category_filter.set_index('word').to_dict()['p_given_cat']
    # For every article, find the product of P(word | category) values of the words, then multiply by P(category) to get bayes.
    df[category] = df['content'].apply(lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat
I created a dictionary from the two columns, with word as the key and P(word | category) as the value. This reduced the problem to a quick dictionary lookup for each element of each list and computing that product.
This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.
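One thing that may be worth considering with the dictionary approach: multiplying many small probabilities can underflow to 0.0 for long articles, so summing log-probabilities is a common alternative. The sketch below is not the code from the repo. The function log_score, the toy probabilities, and the default and prior values are all assumptions made for illustration.

```python
import math

def log_score(words, log_p_word, log_default, log_prior):
    # log P(category) + sum of log P(word | category); unseen words use the default
    return log_prior + sum(log_p_word.get(w, log_default) for w in words)

# Toy P(word | category) values (made up for the example).
p_given_cat = {"seat": 0.333333, "ends": 0.2}
log_p_word = {word: math.log(p) for word, p in p_given_cat.items()}

score = log_score(
    ["seat", "ends", "unknown"],
    log_p_word,
    log_default=math.log(0.001),  # assumed fallback probability for unseen words
    log_prior=math.log(0.5),      # assumed P(category)
)
print(score)
```

Because the logarithm is monotonic, picking the category with the largest log score gives the same prediction as picking the largest product, as long as the product has not underflowed.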
I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
Service  Experience
fast     bad
slow     easy
Data Sheet
Review #  Location  Review
1         New York  "The service was fast!"
2         Texas     "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd

# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then to get matches use str.extractall and aggregate into summary + join to add back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
   Review #  Location  Review                                  Service  Experience
0  1         New York  The service was fast and easy!          [fast]   [easy]
1  2         Texas     Overall it was a bad experience for me  []       [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
   Review #  Location  Review                                  Service  Experience
0  1         New York  The service was fast and easy!          fast     easy
1  2         Texas     Overall it was a bad experience for me           bad
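Alternatively, for an existence test, check which groups matched and use any per original row (level=0). The level argument of any was removed in pandas 2.0, so group on the index level instead:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
   Review #  Location  Review                                  Service  Experience
0  1         New York  The service was fast and easy!          True     True
1  2         Texas     Overall it was a bad experience for me  False    True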
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
   Review #  Location  Review                                  Service  Experience
0  1         New York  The service was fast and easy!          True     True
1  2         Texas     Overall it was a bad experience for me  False    True
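Two caveats may apply to the patterns above. A bare keyword such as fast also matches inside breakfast, and the named-group version needs column names that are valid Python identifiers, so a header like "Customer Service" would fail to compile. A possible variant of the str.contains loop, with escaped keywords and word boundaries, could look like the sketch below. The sample frames are made up to mirror the question, and the third review is an invented row to show the boundary behaviour.

```python
import re
import pandas as pd

# Made-up frames mirroring the question's two sheets.
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({"Review": [
    "The service was fast!",
    "Overall it was a bad experience for me",
    "Breakfast was fine",  # invented row: should not match "fast"
]})

for col in cat.columns:
    # escape each keyword so characters like "+" or "." are treated literally
    keywords = [re.escape(k) for k in cat[col].dropna()]
    # \b ... \b only matches whole words; (?: ) avoids a capture-group warning
    pattern = r"\b(?:" + "|".join(keywords) + r")\b"
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, regex=True)

print(reviews)
```

Since the column name never enters the pattern here, headers with spaces work unchanged.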
First of all, I have no background in programming and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
I want to create a new data frame that shows the 'city_number', the 'city' (which is a name), and the number of cafes in each city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I tried to use groupby, but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4,
          [['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3,
          [['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
       city    id  cafe_count
0  new_york  1234           5
1    boston  3456           4
2    jersey  7643           3
3        DC  8902           3
4  nowwhere  2674           2
And part of the example dataframe, city_directory.tail(), looks like this:
            shop      city    id
12  coffee_today        DC  8902
13  coffee_today        DC  8902
14  coffee_today        DC  8902
15     starbucks  nowwhere  2674
16     starbucks  nowwhere  2674
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
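A shorter route to the three columns asked for in the question might be to group on both identifying columns and count the rows with size. This is a sketch on made-up data: only city_number and city come from the question, while the cafe column and all the values are assumptions.

```python
import pandas as pd

# Made-up stand-in for cafe_df_merged; the "cafe" column is hypothetical.
cafe_df_merged = pd.DataFrame({
    "cafe": ["a", "b", "c", "d", "e"],
    "city_number": [1, 1, 2, 2, 2],
    "city": ["boston", "boston", "jersey", "jersey", "jersey"],
})

city_directory = (
    cafe_df_merged
    .groupby(["city_number", "city"])      # one group per city
    .size()                                # number of rows (cafes) in each group
    .reset_index(name="number_of_cafe")    # back to a regular 3-column dataframe
)
print(city_directory)
```

Grouping on both columns keeps city_number in the result instead of counting it, which seems to be what went wrong with groupby('city').count().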
I am facing a problem in applying fuzzy logic for data cleansing in python. My data looks something like this
data=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "Count":['140','120','50','45','30','20','10','5']})
data
I am using fuzzy logic to compare the values in the data frame. The final output should have a third column with result like this:
data_out=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "New_Column":["Deloitte",'Accenture','Accenture','Accenture','Ernst & young','Ernst & young','Tata Consultancy Services','Deloitte']})
data_out
In other words, each less frequent value should be mapped, in a new column, to the most frequent value of its kind. That is where fuzzy logic is helpful.
Most of your duplicate companies can be detected quite easily using fuzzy string matching. However, the replacement Ernst & young <-> EY is not really similar at all, which is why I am going to ignore it here. This solution uses my library RapidFuzz, but you could implement something similar using FuzzyWuzzy as well (with a little more code, since it does not have the extractIndices processor).
import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_colum(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]
    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
                           for employer in employers]
    deduped_employers = employers.copy()
    replaced = []
    for (i, (employer, processed_employer)) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue
        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)
        for (c, _) in duplicates:
            deduped_employers[i+c+1] = employer
            # by replacing the element with an empty string the index from
            # extractIndices stays correct but it can be skipped a lot
            # faster, since the compared strings will have very different
            # lengths
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)
    data['New_Column'] = deduped_employers

data = pd.DataFrame({
    'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd', 'Accenture USA', 'Ernst & young', ' EY', 'Tata Consultancy Services', 'Deloitte Uk'],
    "Count": ['140', '120', '50', '45', '30', '20', '10', '5']})
add_deduped_employer_colum(data)
print(data)
which results in the following dataframe:
                    Employer  Count                 New_Column
0                   Deloitte    140                   Deloitte
1                  Accenture    120                  Accenture
2    Accenture Solutions Ltd     50                  Accenture
3              Accenture USA     45                  Accenture
4              Ernst & young     30              Ernst & young
5                         EY     20                         EY
6  Tata Consultancy Services     10  Tata Consultancy Services
7                Deloitte Uk      5                   Deloitte
I have not used fuzzy matching, but I can assist as follows.
Data
df=pd.DataFrame({'Employer':['Accenture','Accenture Solutions Ltd','Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df
You did not give an explanation why Tata remains with the full name. Hence I assume it is special and mask it.
m=df.Employer.str.contains('Tata')
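I then use np.where (this needs import numpy as np) to strip everything after the first word for the remaining rows. Note that str.replace needs regex=True here, since recent pandas versions treat the pattern as a literal string by default:
df['New_Column'] = np.where(m, df['Employer'], df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))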
df
Output
I have a dataset with a column that contains comments. These comments are terms separated by commas.
df_pat['reason'] =
chest pain
chest pain, dyspnea
chest pain, hypertrophic obstructive cariomyop...
chest pain
chest pain
cad, rca stents
non-ischemic cardiomyopathy, chest pain, dyspnea
I would like to generate separate columns in the dataframe, one for each distinct term across all comments, with a 1 in the rows whose comment contained that term and a 0 otherwise.
For example:
df_pat['chest_pain'] =
1
1
1
1
1
1
0
1
df_pat['dyspnea'] =
0
1
0
0
0
0
1
And so on...
Thank you!
sklearn.feature_extraction.text has something for you! It looks like you may be trying to predict something. If so, and if you're planning to use scikit-learn at some point, then you can skip making a dataframe with len(set(words)) columns and just use CountVectorizer. This method will return a matrix with dimensions (rows, columns) = (number of rows in dataframe, number of unique terms in the entire 'reason' column).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'reason': ['chest pain', 'chest pain, dyspnea', 'chest pain, hypertrophic obstructive cariomyop', 'chest pain', 'chest pain', 'cad, rca stents', 'non-ischemic cardiomyopathy, chest pain, dyspnea']})
# turns body of text into a matrix of features
# split string on commas instead of spaces, and strip the space left after each comma
vectorizer = CountVectorizer(tokenizer=lambda x: [term.strip() for term in x.split(",")])
# X is now a n_documents by n_distinct_words-dimensioned matrix of features
X = vectorizer.fit_transform(df['reason'])
pandas plays really nicely with sklearn.
Or, a strict pandas solution that should probably be vectorized, but if you don't have that much data, should work:
# split on the comma instead of spaces to get "chest pain" instead of "chest" and "pain"
# strip the space left after each comma and drop repeated terms
reasons = sorted({reason.strip() for case in df['reason'] for reason in case.split(",")})
for reason in reasons:
    for idx in df.index:
        if reason in df.loc[idx, 'reason']:
            df.loc[idx, reason] = 1
        else:
            df.loc[idx, reason] = 0
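A vectorized pandas alternative to the nested loop could be Series.str.get_dummies, which builds the 0/1 columns directly. The sketch below uses a shortened, made-up version of the reason column. The whitespace normalisation and the underscore renaming are assumptions about the desired column names (for example chest_pain).

```python
import pandas as pd

# Shortened, made-up sample of the 'reason' column.
df = pd.DataFrame({"reason": ["chest pain", "chest pain, dyspnea", "cad,rca stents"]})

indicators = (
    df["reason"]
    .str.strip()
    .str.replace(r"\s*,\s*", ",", regex=True)  # normalise spacing around commas
    .str.get_dummies(sep=",")                  # one 0/1 column per distinct term
)
# optional: "chest pain" -> "chest_pain", matching the names in the question
indicators.columns = indicators.columns.str.replace(" ", "_")

df = df.join(indicators)
print(df)
```

Unlike the substring test in the loop, this compares whole comma-separated terms, so a term such as pain would not be counted for a row that only contains chest pain.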