Make Python Gensim Search Functions Efficient

I have a DataFrame that has a text column. I am splitting the DataFrame into two parts based on the value in another column. One of those parts is indexed into a gensim similarity model. The other part is then fed into the model to find the indexed text that is most similar. This involves a couple of search functions to enumerate over each item in the indexed part. With the toy data, it is fast, but with my real data, it is much too slow using apply. Here is the code example:
import pandas as pd
import gensim
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
d = {'number': [1,2,3,4,5], 'text': ['do you like python', 'do you hate python','do you like apples','who is nelson mandela','i am not interested'], 'answer':['no','yes','no','no','yes']}
df = pd.DataFrame(data=d)
df_yes = df[df['answer']=='yes']
df_no = df[df['answer']=='no']
df_no = df_no.reset_index()
docs = df_no['text'].tolist()
genDocs = [[w.lower() for w in word_tokenize(text)] for text in docs]
dictionary = gensim.corpora.Dictionary(genDocs)
corpus = [dictionary.doc2bow(genDoc) for genDoc in genDocs]
tfidf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
def search(row):
    query = [w.lower() for w in word_tokenize(row)]
    query_bag_of_words = dictionary.doc2bow(query)
    query_tfidf = tfidf[query_bag_of_words]
    return query_tfidf
def searchAll(row):
    max_similarity = max(sims[search(row)])
    index = [i for i, j in enumerate(sims[search(row)]) if j == max_similarity]
    return max_similarity, index
df_yes = df_yes.copy()
df_yes['max_similarity'], df_yes['index'] = zip(*df_yes['text'].apply(searchAll))
I have tried converting the operations to dask dataframes to no avail, as well as python multiprocessing. How would I make these functions more efficient? Is it possible to vectorize some/all of the functions?

Your code's intent and operation are very unclear. Assuming it works, explaining the ultimate goal and showing more example data, more example queries, and the desired results in your question would help.
Perhaps it could be improved to not repeat certain operations over and over. Some ideas could include:
only tokenize each row once, and cache the tokenization
only doc2bow() each row once, and cache the BOW representation
don't call sims[search(row)] twice inside searchAll()
don't iterate twice – once to find the max, then again to find the index – but just once
(More generally, though, efficient text keyword search often uses specialized reverse-indexes for efficiency, to avoid a costly iteration over every document.)
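Putting those ideas together, here is a minimal sketch that reuses the dictionary, tfidf, and sims objects from the question (the batch query at the end assumes a gensim similarity index accepts a whole corpus of queries and returns a 2D score array, as MatrixSimilarity does):
import numpy as np

def search_all(row):
    # Tokenize and build the TF-IDF query exactly once per row
    query = [w.lower() for w in word_tokenize(row)]
    scores = sims[tfidf[dictionary.doc2bow(query)]]  # one index lookup, returns a numpy array
    best = int(np.argmax(scores))  # single pass instead of max() plus a second scan
    return float(scores[best]), best

# Alternatively, vectorize across all queries in one call:
query_bows = [dictionary.doc2bow([w.lower() for w in word_tokenize(t)]) for t in df_yes['text']]
score_matrix = np.asarray(sims[tfidf[query_bows]])  # shape: (n_queries, n_indexed_docs)
df_yes['max_similarity'] = score_matrix.max(axis=1)
df_yes['index'] = score_matrix.argmax(axis=1)
Note that argmax keeps only the first index tied for the maximum, whereas the original code collected all of them.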

Related

Nested Loop Optimisation in Python for a list of 50K items

I have a csv file with roughly 50K rows of search engine queries. Some of the search queries are the same, just in a different word order, for example "query A this is " and "this is query A".
I've tested using fuzzywuzzy's token_sort_ratio function to find matching word order queries, which works well, however I'm struggling with the runtime of the nested loop, and looking for optimisation tips.
Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?
Code below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()
data = []
for kw_t1 in tqdm(table1):
    for kw_t2 in table2:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            data += [[kw_t1, kw_t2, score]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Any advice would be appreciated.
Thanks!
Since what you are looking for are strings consisting of identical words (just not necessarily in the same order), there is no need to use fuzzy matching at all. You can instead use collections.Counter to create a frequency dict for each string, and group the strings in a dict of lists keyed by their frequency dicts. You can then output the sub-lists whose lengths are greater than 1.
Since dicts are not hashable, you can make them keys of a dict by converting them to frozensets of tuples of key-value pairs first.
This improves the time complexity from the O(n ^ 2) of your code to O(n), while also avoiding the overhead of fuzzy matching.
from collections import Counter
matches = {}
for query in df['keyword']:
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)
data = [match for match in matches.values() if len(match) > 1]
Demo: https://replit.com/#blhsing/WiseAfraidBrackets
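For instance, with the reordered queries from the question as toy input (the DataFrame below is my own illustration):
import pandas as pd
from collections import Counter

df = pd.DataFrame({'keyword': ['query A this is', 'this is query A', 'another query']})
matches = {}
for query in df['keyword']:
    # frozenset of (word, count) pairs is hashable, so it can serve as a dict key
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)
data = [match for match in matches.values() if len(match) > 1]
print(data)  # [['query A this is', 'this is query A']]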
I don't think you need fuzzywuzzy here: you are just checking for equality (score == 100) of the sorted queries, but with token_sort_ratio you are sorting the queries over and over. So I suggest:
create a "base" list and a "sorted-elements" one
iterate on the elements.
This will still be O(n^2), but you will be sorting 50_000 strings instead of 2_500_000_000!
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table_base = df['keyword'].to_list()
table_sorted = [sorted(kw.split()) for kw in table_base]  # sort the words, not the characters, to mirror token_sort_ratio
data = []
ln = len(table_base)
for i in range(ln-1):
    for j in range(i+1, ln):
        if table_sorted[i] == table_sorted[j]:
            data += [[table_base[i], table_base[j], 100]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Apply in pandas usually works faster than an explicit Python loop, though each row still has to be compared against the full list:
kw_t2 = df['keyword'].to_list()
def compare(kw_t1):
    found_duplicates = []
    for kw in kw_t2:
        score = fuzz.token_sort_ratio(kw_t1, kw)
        if score == 100 and kw_t1 != kw:
            found_duplicates.append(kw)
    return found_duplicates
df["duplicates"] = df['keyword'].apply(compare)

Apriori Results in Python

I am trying to run an apriori algorithm in python. My specific problem is when I use the apriori function, I specify the min_length as 2. However, when I print the rules, I get rules that contain only 1 item. I am wondering why apriori does not filter out items less than 2, because I specified I only want rules with 2 things in the itemset.
from apyori import apriori
#store the transactions
transactions = []
total_transactions = 0
with open('browsing.txt', 'r') as file:
    for transaction in file:
        total_transactions += 1
        items = []
        for item in transaction.split():
            items.append(item)
        transactions.append(items)
#
support_threshold = (100/total_transactions)
print(support_threshold)
minimum_support = 100
frequent_items = apriori(transactions, min_length = 2, min_support = support_threshold)
association_results = list(frequent_items)
print(association_results[0])
print(association_results[1])
My results:
RelationRecord(items=frozenset({'DAI11223'}), support=0.004983762579981351, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'DAI11223'}), confidence=0.004983762579981351, lift=1.0)])
RelationRecord(items=frozenset({'DAI11778'}), support=0.0037619369152117293, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'DAI11778'}), confidence=0.0037619369152117293, lift=1.0)])
A look into the code here: https://github.com/ymoch/apyori/blob/master/apyori.py reveals that there is no min_length keyword (only max_length). The way apyori is implemented, it does not raise any warning or error when it is passed keyword arguments it does not use.
Why not filter the result afterwards?
association_results = filter(lambda x: len(x.items) > 1, association_results)
A limitation of the first approach is that the data has to be converted to a list format; in real life a store has many thousands of SKUs, in which case that conversion is computationally expensive.
The apyori package is outdated; there have been no updates for the past few years.
Its results come in an awkward format that needs extra computational work to present properly.
mlxtend uses a two-step approach that generates frequent itemsets and then association rules over them - check here for more info.
mlxtend is properly maintained and has community support.
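A minimal sketch of the mlxtend route (the toy transactions and item codes below are my own, mimicking the question's data):
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['DAI11223', 'DAI11778'], ['DAI11223', 'GRO99222'], ['DAI11778']]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets above a support threshold
frequent = apriori(onehot, min_support=0.3, use_colnames=True)

# Rules always have a non-empty antecedent and consequent, i.e. at least 2 items
rules = association_rules(frequent, metric='confidence', min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])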

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. SpaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would like to apply to each document, and I would hopefully not have to reinitialize this object for each future / function call. What is the best way to handle this?
Example of what I'm talking about:
def text_fields_to_sentences(
    dataframe: pd.DataFrame,
    ...
) -> pd.DataFrame:
    # THIS IS WHAT I WOULD LIKE TO CHANGE
    nlp, = setup_spacy(scispacy_version)
    def field_to_sentences(row):
        result = []
        doc = nlp(row[text_field])
        for sentence_tokens in doc.sents:
            sentence_text = "".join([t.string for t in sentence_tokens])
            r = text_data.copy()
            r[sentence_text_field] = sentence_text
            result.append(r)
        return result
    series = dataframe.apply(
        field_to_sentences,
        axis=1
    ).explode()
    return pd.DataFrame(
        [s[new_col_order].values for s in series],
        columns=new_col_order
    )
input_data.map_partitions(text_fields_to_sentences)
You could create the object as a delayed object
corpus = dask.delayed(make_corpus)("english")
And then use this lazy value in place of the full value:
df = df.text.apply(parse, corpus=corpus)
Dask will call make_corpus once on one machine and then pass it around to the workers as it is needed. It will not recompute any task.
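Fleshed out for the spaCy case in the question, the pattern might look like this sketch (make_corpus and parse are the hypothetical helpers named above, and the model name is an assumption):
import dask
import spacy

def make_corpus(lang):
    # Heavy object we want to load only once, not per partition
    return spacy.load("en_core_web_sm")  # assumed model name

def parse(text, corpus):
    # By the time this runs on a worker, corpus is a concrete Language object
    return [sent.text for sent in corpus(text).sents]

corpus = dask.delayed(make_corpus)("english")
df = df.text.apply(parse, corpus=corpus)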

Fast Named Entity Removal with NLTK

I wrote a couple of user defined functions to remove named entities (using NLTK) in Python from a list of text sentences/paragraphs. The problem I'm having is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?
import nltk
import string
# Function to reverse tokenization
def untokenize(tokens):
    return "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
# Remove named entities
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return untokenize(tokens)
To use the code I typically have a text list and call the ne_removal function through a list comprehension. Example below:
text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']
UPDATE: I tried switching to a batch version with this code, but it's only slightly faster. I will keep exploring. Thanks for the input so far.
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return untokenize(tokens)
def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:
all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)
This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code and consider whether you need to switch to more powerful tools, as @Lenz suggested.
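Profiling the batch pipeline can be as simple as running it under the standard library's cProfile (using the text_list from the question):
import cProfile
# Sort by cumulative time to see which NLTK stage dominates the runtime
cProfile.run('fast_ne_removal(text_list)', sort='cumtime')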

WordNet: Iterate over synsets

For a project I would like to measure the amount of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.
I came up with the following (simple) piece of code:
from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print(element)
Results in:
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:
count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1
I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centerdness) when a word does indeed belong to the ‘person’ or ‘human’ synset.
Secondly, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.
Thanks for your help!
wrt the error message
hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synsets.
Hence
if element == 'person':
tries to compare a Synset object against a string. That kind of comparison is not supported by Synset.
Try something like
target_synsets = wn.synsets('person')
if element in target_synsets:
    ...
or
if u'person' in element.lemma_names():
    ...
instead.
wrt efficiency
Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.
To speed up the lookup, you can pre-compile a list of "person related" words in advance by making use of the transitive closure over the hyponyms as explained here.
Something like
p = wn.synset('person.n.01')  # the 'person' synset the question is interested in
person_words = set(w for s in p.closure(lambda s: s.hyponyms()) for w in s.lemma_names())
should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.
A simple version of the word counter then becomes something on the lines of
from collections import Counter
word_count = Counter()
for word in (w.lower() for w in words if w in person_words):
    word_count[word] += 1
You might also need to pre-process the input words using stemming or other morphologic reductions before passing the words on to WordNet, though.
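For example, a minimal sketch of such a reduction with NLTK's WordNet lemmatizer (the input sentence is my own illustration):
import nltk
from nltk.stem import WordNetLemmatizer

text = "My friends are good people."
lemmatizer = WordNetLemmatizer()
# Map inflected forms like 'friends' onto the lemma 'friend' stored in WordNet
words = [lemmatizer.lemmatize(w.lower()) for w in nltk.word_tokenize(text)]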
To get all the hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3, dhke's closure trick doesn't work on this version):
def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())
Example:
from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526
