How to use pandas apply to replace iterrows? - python

I am calculating a sentiment value for every row in the dataset based on the news headline. I used iterrows to achieve this:
field = 'headline'
dfp = pd.DataFrame(columns=('pos', 'neg', 'neu'))
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

for index, row in df.iterrows():
    text = row[field]
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                      }, ignore_index=True)
However, the processing is taking too long (>30 minutes of runtime and it is not done yet). I have 16.6k rows in my dataset.
This is a small section of the dataset:
datetime headline
0 2020-03-17 16:57:07 12 best noise-cancelling headphones: In-ear an...
1 2020-06-08 14:00:55 5G Stocks To Buy And Watch: Pricing of 5G Smar...
2 2020-06-19 10:00:00 10 best wireless printers that will make your ...
3 2020-08-19 00:00:00 Apple Confirms Solid New iOS 14 Security Move ...
4 2020-08-19 00:00:00 Apple Becomes First U.S. Company Worth More Th...
I have read that iterrows is not recommended in most situations unless the dataset is small and optimization is not a concern. The alternative, it seems, is to use apply, since apply goes through each pandas row and is optimized.
Some of the SO topics I read suggested creating a function and running it with apply. This is what I attempted:
def calPred(text):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                      }, ignore_index=True)

df['headline'].apply(lambda x: calPred(x))
It returned an error UnboundLocalError: local variable 'dfp' referenced before assignment.
I'd appreciate it if someone could guide me on how to optimize and use apply correctly. Thanks in advance.

The problem with your code is that assigning to dfp inside the function (dfp = dfp.append(...)) makes Python treat dfp as a local variable for the whole function body, so the read on the right-hand side happens before any local assignment exists. Either declare it global, or assign the result to another name, e.g. dfp_temp = dfp.append(...).
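A minimal illustration of that scoping rule (hypothetical names, not from your code):

counter = 0

def bump_broken():
    # Assigning to `counter` anywhere in this function makes it local,
    # so reading it on the right-hand side raises UnboundLocalError.
    counter = counter + 1

def bump_fixed():
    global counter          # or assign the result to a new name and return it
    counter = counter + 1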
However, I think apply is not what you want here. Most ML models take an array-like as input, so you can pass the whole column (or at least a big chunk of it) to the model instead of one row at a time.
Something like this:
field = 'headline'
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

texts = df[field].values
encoded_input = tokenizer(texts, return_tensors='pt')
output = model(**encoded_input)
probs = torch.nn.functional.softmax(output.logits, dim=-1)
probs = probs.cpu().detach().numpy()

dfp = pd.DataFrame({
    'pos': probs[:, 0],
    'neg': probs[:, 1],
    'neu': probs[:, 2]
})
Edit: since the tokenizer does not accept a numpy array, you can try vectorizing the tokenizer call like this.
NOTE: np.vectorize and apply will not give you any significant boost, since they still iterate over each element; it is still better to keep the use of apply and np.vectorize to a minimum.
...
tokenizer_func = lambda text: tokenizer(text, return_tensors='pt')
encoded_input = np.vectorize(tokenizer_func)(texts)
...
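If you want to avoid per-row tokenization altogether, another option is to encode and score the texts in batches with padding, which the Hugging Face tokenizer supports. A rough sketch (batch_size and the pos/neg/neu column order are assumptions carried over from the question):

import numpy as np
import torch

batch_size = 32                       # assumed value; tune to your memory budget
all_probs = []

with torch.no_grad():                 # inference only, no gradients needed
    for start in range(0, len(texts), batch_size):
        batch = [str(t) for t in texts[start:start + batch_size]]
        # pad/truncate so the whole batch becomes one tensor
        enc = tokenizer(batch, return_tensors='pt', padding=True, truncation=True)
        logits = model(**enc).logits
        all_probs.append(torch.nn.functional.softmax(logits, dim=-1).cpu().numpy())

probs = np.concatenate(all_probs)
dfp = pd.DataFrame({'pos': probs[:, 0], 'neg': probs[:, 1], 'neu': probs[:, 2]})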

Related

Properly calculate cosine similarities for low memory on large datasets?

I am following this tutorial to learn a bit about content recommenders: https://www.datacamp.com/community/tutorials/recommender-systems-python
but I ran into a MemoryError when running the "content based" part of the tutorial. Upon some reading I found that this has to do with how large the dataset being used is. I couldn't really find an exact way to run this specific case with low memory, so instead I modified it a little bit: I split the original dataframe into 6 pieces, ran the cosine similarity calculation for each split dataframe, merged the results, then ran this one last time to get a final result. Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]

# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)

# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

split_db = np.array_split(metadata, 6)
source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()

for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break

split_db.pop(db_remove_idx)

for x, db in enumerate(split_db):
    new_db_list.append(db.append(search_db, ignore_index=True))

del(split_db)

refined_db = None
for db in new_db_list:
    small_db = db.reset_index()
    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    #cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    # Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim))
    if type(refined_db) != pd.core.frame.DataFrame:
        refined_db = result.append(search_db, ignore_index=True)
    else:
        refined_db = refined_db.append(result, ignore_index=True)

final_db = refined_db.reset_index()

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()
final_result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True))
print(final_result)
I thought this would work, but the results are not even close to what is given in the tutorial:
11 Dracula: Dead and Loving It
13 Nixon
12 Balto
15 Casino
20 Get Shorty
18 Ace Ventura: When Nature Calls
14 Cutthroat Island
16 Sense and Sensibility
19 Money Train
17 Four Rooms
Name: title, dtype: object
Could anyone explain what I am doing wrong here? I figured that since the dataset was too large, splitting it up, running this "cosine similarity" process first as a refinement, and then running the process again on the merged results would give a similar answer, so why is the result I am getting so different from what is expected?
And this is the data I am using this against: https://www.kaggle.com/rounakbanik/the-movies-dataset/data
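One thing to keep in mind is that fitting the TfidfVectorizer separately on each split gives every split its own vocabulary and IDF weights, so the similarity scores are not comparable across splits. A lower-memory alternative is to compare only the query movie's TF-IDF vector against the full corpus instead of building the full pairwise matrix; a minimal sketch, assuming the same movies_metadata.csv and column names as the tutorial:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
metadata['overview'] = metadata['overview'].fillna('')

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])    # sparse matrix, modest memory

# position of the first row whose title matches the query
idx = metadata.index[metadata['title'] == 'The Dark Knight Rises'][0]

# one 1 x n_movies row of similarities instead of an n_movies x n_movies matrix
sim_row = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()
top10 = sim_row.argsort()[::-1][1:11]                       # drop the movie itself
print(metadata['title'].iloc[top10])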

Apply NLTK Rake to each row in Dataframe

I'd like to apply the Rake function (https://pypi.org/project/rake-nltk/) to each row in my dataframe.
I can apply the function individually to a specific row, but not append it to the dataframe.
This is what I have so far:
r = Rake(ranking_metric= Metric.DEGREE_TO_FREQUENCY_RATIO, language= 'english', min_length=1, max_length=4)
r.extract_keywords_from_text(test.document[177])
r.get_ranked_phrases() #prints a list of keywords
test['keywords'] = test.applymap(lambda x: r.extract_keywords_from_text(x)) #trying to apply it to each row.
It just runs indefinitely. I just want to append a new column to my dataframe 'test' called "keywords" that has the list of keywords from r.get_ranked_phrases().
r.extract_keywords_from_text(x) will return None; call r.get_ranked_phrases() afterwards to get the keywords:
import pandas as pd
from rake_nltk import Rake

r = Rake()
df = pd.DataFrame(data=['machine learning and fraud detection are a must learn',
                        'monte carlo method is great and so is hmm,pca, svm and neural net',
                        'clustering and cloud',
                        'logistical regression and data management and fraud detection'],
                  columns=['Comments'])

def rake_implement(x, r):
    r.extract_keywords_from_text(x)
    return r.get_ranked_phrases()

df['new_col'] = df['Comments'].apply(lambda x: rake_implement(x, r))
print(df['new_col'])
#o/p
0 [must learn, machine learning, fraud detection]
1 [monte carlo method, neural net, svm, pca, hmm...
2 [clustering, cloud]
3 [logistical regression, fraud detection, data ...
Name: new_col, dtype: object

Oversampling a class in classification problem

I have nearly 100,000 data points with 15 features, with 'disease' and 'no disease' as the target.
But my data is imbalanced: 97% of my data is no disease and 3% is disease.
To overcome this, I manually created disease data by making 7 copies of the actual data and merged them with the original data,
using this code:
# selecting data where disease is 1
# Even created a unique 'patient ID' by adding a dummy letter as a suffix to the original ID.
ia = df[df['disease'] == 1]
dup = pd.DataFrame()

for i, j in zip(['a', 'b', 'c', 'd', 'e', 'f'], ['B', 'C', 'E', 'F', 'G', 'H']):
    i = ia.copy()
    i['dum'] = j
    i["patient ID"] = i["Employee Code"] + i['dum']
    dup = pd.concat([dup, i])

# adding the copies to the original data
df = pd.concat([dup, df])
Please let me know if this is the correct method for oversampling.
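For comparison, a common way to oversample a minority class is sklearn.utils.resample, which draws duplicate rows with replacement up to a chosen target count. A sketch assuming the same df and 'disease' column (the 8x target size is illustrative):

from sklearn.utils import resample
import pandas as pd

minority = df[df['disease'] == 1]
majority = df[df['disease'] == 0]

# sample the minority class with replacement until it reaches the target size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(minority) * 8,   # illustrative target
                              random_state=42)

df_balanced = pd.concat([majority, minority_upsampled])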

Generate features from "comments" column in dataframe

I have a dataset with a column that has comments. These comments are words separated by commas.
df_pat['reason'] =
chest pain
chest pain, dyspnea
chest pain, hypertrophic obstructive cariomyop...
chest pain
chest pain
cad, rca stents
non-ischemic cardiomyopathy, chest pain, dyspnea
I would like to generate separate columns in the dataframe, one for each word in the full set of words, containing 1 or 0 depending on whether the row's comment initially had that word.
For example:
df_pat['chest_pain'] =
1
1
1
1
1
1
0
1
df_pat['dyspnea'] =
0
1
0
0
0
0
1
And so on...
Thank you!
sklearn.feature_extraction.text has something for you! It looks like you may be trying to predict something. If so, and if you're planning to use scikit-learn at some point, you can bypass making a dataframe with len(set(words)) columns and just use CountVectorizer. This method returns a matrix with dimensions (rows, columns) = (number of rows in the dataframe, number of unique words in the entire 'reason' column).
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'reason': ['chest pain', 'chest pain, dyspnea', 'chest pain, hypertrophic obstructive cariomyop', 'chest pain', 'chest pain', 'cad, rca stents', 'non-ischemic cardiomyopathy, chest pain, dyspnea']})
# turns body of text into a matrix of features
# split string on commas instead of spaces
vectorizer = CountVectorizer(tokenizer = lambda x: x.split(","))
# X is now a n_documents by n_distinct_words-dimensioned matrix of features
X = vectorizer.fit_transform(df['reason'])
pandas plays really nicely with sklearn.
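For example, if you do want the counts back as named columns in the dataframe, something along these lines should work (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

counts = pd.DataFrame(X.toarray(),
                      columns=vectorizer.get_feature_names_out(),
                      index=df.index)
df_with_features = df.join(counts)   # illustrative name for the combined frame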
Or, a strict pandas solution that should probably be vectorized, but if you don't have that much data, should work:
# split on the comma instead of spaces to get "chest pain" instead of "chest" and "pain";
# strip the whitespace so column names don't end up with leading spaces
reasons = [reason.strip() for case in df['reason'] for reason in case.split(",")]

for reason in reasons:
    for idx in df.index:
        if reason in df.loc[idx, 'reason']:
            df.loc[idx, reason] = 1
        else:
            df.loc[idx, reason] = 0

Python Scikit-Learn PCA: Get Component Score

I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the Principal Components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for an output sort of like this:
Town PrinComponent 1 PrinComponent 2 PrinComponent 3
Columbia 0.31989 -0.44216 -0.44369
Middletown -0.37101 -0.24531 -0.47020
Harrisburg -0.00974 -0.06105 0.32792
Newport -0.38678 0.40935 -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below:
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    #print pca_components_df
    #pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    #print filtered.T  # Transformed Dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print pca.explained_variance_ratio_
I pumped the transformed array into the data portion of the DataFrame function, and then defined the index and columns by passing them to index= and columns= respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
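Put together with the setup above, a minimal sketch (assuming df holds only the numeric features and is indexed by Town; scores_df is an illustrative name):

from sklearn import decomposition, preprocessing
import pandas as pd

scaled_data = preprocessing.scale(df.values)          # df.values instead of the deprecated as_matrix()
pca = decomposition.PCA(n_components=3)
transformed = pca.fit_transform(scaled_data)          # shape (n_rows, 3): the component scores

scores_df = pd.DataFrame(transformed,
                         columns=['PrinComponent 1', 'PrinComponent 2', 'PrinComponent 3'],
                         index=df.index)              # one score per town per component
print(scores_df.head())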
