I have two lists: List X contains 1000 words and List Y contains 500 words. I am trying to find similar words for List X with respect to List Y.
I am using spaCy's similarity function.
The problem I am facing is that the for loop part of the execution takes a long time. From my research I understand that in Python, multithreading only gives the illusion of concurrency and hence does not give any real performance increase. I therefore think multiprocessing is the way to go, but I am new to multiprocessing, so I would appreciate some help.
How do I speed up the execution of the for loop part through multiprocessing in python?
The following is my code.
import en_vectors_web_lg
from operator import itemgetter

nlp = en_vectors_web_lg.load()

ListX = ['HSBC', 'JP Morgan',......] #500 words lists
ListY = ['Currency','Blockchain'.......] #1000 words lists

s_words = []
for token1 in ListY:
    list_to_sort = []
    for token2 in ListX:
        list_to_sort.append((token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))))

    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
You can try this:
import itertools
from multiprocessing import Pool

import en_vectors_web_lg

nlp = en_vectors_web_lg.load()

def compare_function(token1, token2, nlp):
    return token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))

tokenlist = [(a, b, nlp) for a, b in itertools.product(ListX, ListY)]

p = Pool(8)
results = p.starmap(compare_function, tokenlist)  # starmap unpacks each (token1, token2, nlp) tuple
If you are under Windows, wrap the call in a main guard:

if __name__ == '__main__':
    results = p.starmap(compare_function, tokenlist)
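Note that putting nlp into every tuple means the loaded model is pickled and sent to the workers for every task, which can easily dominate the runtime. Below is a minimal alternative sketch, assuming en_vectors_web_lg is available and ListX/ListY are defined as in the question: each worker loads the model once through a Pool initializer, and the results are grouped back per token afterwards (worker_init, compare_pair and the grouping code are illustrative helpers, not part of any library).

import itertools
from multiprocessing import Pool
from operator import itemgetter

worker_nlp = None  # one model instance per worker process

def worker_init():
    global worker_nlp
    import en_vectors_web_lg
    worker_nlp = en_vectors_web_lg.load()

def compare_pair(token1, token2):
    return token1, token2, worker_nlp(str(token1)).similarity(worker_nlp(str(token2)))

if __name__ == '__main__':
    pairs = list(itertools.product(ListY, ListX))
    with Pool(8, initializer=worker_init) as p:
        results = p.starmap(compare_pair, pairs)

    # regroup: keep the best ListX match for each token in ListY
    s_words = []
    for token1, group in itertools.groupby(sorted(results), key=itemgetter(0)):
        best = max(group, key=itemgetter(2))
        s_words.append(best[:2])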
I have been using parfor in MATLAB to run parallel for loops for quite some time. I need to do something similar in Python but I cannot find any simple solution. This is my code:
t = list(range(1,3,1))
G = list(range(0,3,2))
results = pandas.DataFrame(columns=['tau', 'p_value', 'G', 't_i'], index=range(0, len(G) * len(t)))
counter = 0

for iteration_G in list(range(0, len(G))):
    for iteration_t in list(range(0, len(t))):
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        results['tau'][counter] = tau
        results['p_value'][counter] = p_value
        results['G'][counter] = G[iteration_G]
        results['t_i'][counter] = t[iteration_t]
        counter = counter + 1
I would like to use the parfor equivalent in the first loop.
I'm not familiar with parfor, but you can use the joblib package to run functions in parallel.
In this simple example there is a function that prints its argument, and Parallel is used to execute it multiple times in parallel with a for loop:
import multiprocessing
from joblib import Parallel, delayed

# function that you want to run in parallel
def foo(i):
    print(i)

# define the number of cores (this is how many processes will run)
num_cores = multiprocessing.cpu_count()

# execute the function in parallel - `return_list` is a list of the results of the function
# in this case it will just be a list of None's
return_list = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in range(20))
If this doesn't work for what you want to do, you can try to use numba - it might be a bit more difficult to set up, but in theory with numba you can just add @njit(parallel=True) as a decorator to your function and numba will try to parallelise it for you.
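For illustration, here is a minimal numba sketch (assuming numba is installed); prange marks the loop that numba is allowed to spread over threads. Note that numba only compiles NumPy-style numeric code, so it would not handle the pandas and scipy.stats calls from the question directly.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_means(x):
    # each iteration of the prange loop may run on a different thread
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):
        out[i] = x[i, :].mean()
    return out

print(row_means(np.random.rand(1000, 100)))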
I found a solution using parfor. It is still a bit more complicated than MATLAB's parfor but it's pretty close to what I am used to.
from parfor import parfor

t = list(range(1,16,1))
G = list(range(0,62,2))

for iteration_t in list(range(0, len(t))):
    @parfor(list(range(0, len(G))))
    def fun(iteration_G):
        result = pandas.DataFrame(columns=['tau', 'p_value'], index=range(0, 1))
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        result['tau'] = tau
        result['p_value'] = p_value
        fun = numpy.array([tau, p_value])
        return fun
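For reference, assuming this is the parfor package from PyPI: the decorator runs the function over the iterable immediately, and the decorated name ends up holding the list of return values, one per iteration. A tiny sketch:

from parfor import parfor

@parfor(range(4))
def squares(i):
    return i ** 2

print(squares)  # [0, 1, 4, 9] - the decorated name is the list of results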
I have to analyse a large text dataset using spaCy. The dataset contains about 120000 records with a typical text length of about 1000 words. Lemmatizing the text takes quite some time, so I looked for methods to reduce that time. This article describes how to speed up the computations using joblib. That works reasonably well: 16 cores reduce the CPU time by a factor of 10, and the hyperthreads reduce it by an extra 7%.
Recently I realized that I wanted to compute similarities between docs, and probably run more analyses on the docs later on. So I decided to generate a spaCy document instance (<class 'spacy.tokens.doc.Doc'>) for all documents and use that for the analyses (lemmatizing, vectorizing, and probably more) later on. This is where the trouble started.
The parallel lemmatizer does its work in the function below:
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return lemma_list
(the full demo code can be found at the end of the post). All I have to do is return doc instead of lemma_list and I'm done. Or so I thought.
def lemmatize_pipe(doc):
    return doc
The sequential version runs in 73 seconds, the parallel version returning lemma_list takes 7 seconds, while the version returning doc runs in 127 seconds: twice as long as the sequential version. The full code is below.
import time
import pandas as pd
from joblib import Parallel, delayed
import gensim.downloader as api
import spacy
from pdb import set_trace as breakpoint
# Initialize spacy with the small english language model
nlp = spacy.load('en', disable=['parser', 'ner', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
# Import the dataset and get the text
dataset = api.load("text8")
data = [d for d in dataset]
doc_requested = False
print(len(data), 'documents in original data')
df_data = pd.DataFrame(columns=['content'])
df_data['content'] = df_data['content'].astype(str)
# Content is a list of words, convert it to strings
for doc in data:
    sentence = ' '.join([word for word in doc])
    df_data.loc[len(df_data)] = [sentence]
### === Sequential processing ===
def lemmatize(text):
    doc = nlp(text)
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return doc if doc_requested else lemma_list
cpu = time.time()
df_data['sequential'] = df_data['content'].apply(lemmatize)
print('\nSequential processing in {:.0f} seconds'.format(time.time() - cpu))
df_data.head(3)
### === Parallel processing ===
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return doc if doc_requested else lemma_list

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length,
                                                            chunksize))

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

def preprocess_parallel(data, chunksize):
    executor = Parallel(n_jobs=31, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(data, len(data), chunksize=chunksize))
    result = executor(tasks)
    flattened = [item for sublist in result for item in sublist]
    return flattened
cpu = time.time()
df_data['parallel'] = preprocess_parallel(df_data['content'], chunksize=1)
print('\nParallel processing in {:.0f} seconds'.format(time.time() - cpu))
I have searched and tried all kinds of things but could not find a solution. In the end I decided to compute the similarities together with the lemmas, but that is a workaround. What actually is the cause of the time increase? And is there a way to get the docs without losing that much time?
A pickled doc is quite large and contains a lot of data that isn't needed to reconstruct the doc itself, including the entire model vocab. Using doc.to_bytes() will be a major improvement, and you can improve it a bit more by using exclude to exclude data that you don't need, like doc.tensor:
data = doc.to_bytes(exclude=["tensor"])
...
reloaded_doc = Doc(nlp.vocab)
reloaded_doc.from_bytes(data)
To compare:
doc = nlp("test")
len(pickle.dumps(doc)) # 1749721
len(pickle.dumps(doc.to_bytes())) # 750
len(pickle.dumps(doc.to_bytes(exclude=["tensor"]))) # 316
You can also use doc.to_array() instead of doc.to_bytes() to export only the annotation layers that you need, but reloading the doc from the array is slightly more complicated.
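Applied to the pipeline from the question, a minimal sketch (rebuild is an illustrative helper name, not part of spaCy): the workers serialize the docs with to_bytes() and the parent process reconstructs them afterwards.

from spacy.tokens import Doc

def process_chunk(texts):
    # serialize in the worker; the bytes are cheap to send back to the parent
    return [doc.to_bytes(exclude=["tensor"]) for doc in nlp.pipe(texts, batch_size=20)]

def rebuild(byte_docs):
    # reconstruct the Doc objects in the parent, reusing the parent's vocab
    return [Doc(nlp.vocab).from_bytes(b) for b in byte_docs]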
See:
https://spacy.io/usage/saving-loading#docs
https://spacy.io/api/doc#serialization-fields
I have a corpus of about 5 million words that I want to put into a Fuzzy set.
Currently, it takes almost 5 minutes. Is there any faster way to do that?
This is my code:
import fuzzyset

fuzzy_set = fuzzyset.FuzzySet()
for word in list_of_words: # len(list_of_words)=~5M
    fuzzy_set.add(word)
I know that a for loop is not the fastest way to do things in Python, but I couldn't find any documentation on adding a whole list to a FuzzySet.
Thanks for the help.
You could use multiprocessing.
Split your list_of_words up into chunks and run it in a pool.
import fuzzyset
import multiprocessing as mp

fuzzy_set = fuzzyset.FuzzySet()

def add_words(chunk):
    for word in chunk:
        fuzzy_set.add(word)

if __name__ == '__main__':
    n = 500000 # or whatever size you want your chunks split up into
    chunks = [list_of_words[x:x + n] for x in range(0, len(list_of_words), n)]
    pool = mp.Pool(mp.cpu_count())
    pool.map(add_words, chunks)
I'm trying to measure four similarities (cosine_similarity, jaccard, SequenceMatcher similarity, jaccard_variants similarity) over 800K pairs of documents.
Every document file is in txt format and about 100 KB ~ 300 KB (about 1500000 characters).
I have two questions regarding how to make my python scripts faster:
MY PYTHON SCRIPTS:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    return cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)), \
           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
           SequenceMatcher(None, a, b).ratio()

#items in doc_pair list are like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        f1 = open(item[1], 'rb')
        doc1 = f1.read()
        f1.close()
        f2 = open(item[2], 'rb')
        doc2 = f2.read()
        f2.close()
        result[item[0]] = measure_sim(doc1, doc2)
    return result
However, this code uses only 10% of my CPU, and it would take almost 20 days for this task to be done. So I want to ask if there is any way to make this code more efficient.
Q1. Since the documents are saved on an HDD, I thought loading the text data would take some time. Hence, I suspect that loading only two documents each time the computer computes the similarities might not be efficient, so I am going to try loading 50 pairs of documents at once and computing the similarities respectively. Would that be helpful?
Q2. Most of the postings about "How to make your code run faster" say that I should use Python modules based on C code. However, since I'm using the sklearn module, which is known to be quite efficient, I wonder whether there is any better way.
Is there any way that could help this Python script use more computer resources and become faster?
There may be better solutions, but you can try something like this, if the computation of similarities is the bottleneck:
1) A separate process reads all the files one by one and puts them into a multiprocessing.Queue.
2) A pool of multiple worker processes computes the similarities and puts the results into another multiprocessing.Queue.
3) The main thread then simply loads the results from the results queue and saves them to a dictionary, as you have it now.
I don't know your hardware limitations (number and speed of CPU cores, RAM size, disk read speed) and I don't have any samples to test it on.
EDIT: Below is the code described above. Please try it, check whether it is faster and let me know. If the main bottleneck is loading the files, we can create more loader processes (e.g. 2 processes, each loading half of the files). If the bottleneck is computing the similarities, you can create more worker processes (just change worker_count). Finally, 'results' is the dictionary with all the results.
import multiprocessing
import os
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()
def calculate_similarities(doc_pairs_queue, results_queue):
    """ Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue. Repeat infinitely (until process is terminated). """
    while True:
        pair = doc_pairs_queue.get()
        pair_id = pair[0]
        doc1 = pair[1]
        doc2 = pair[2]
        a, b = doc1.split(), doc2.split()
        c, d = set(a), set(b)
        vectors = [t for t in get_tf_vectors(doc1, doc2)]
        results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)),
                           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
                           SequenceMatcher(None, a, b).ratio()))
def load_files(doc_pair_list, loaded_queue):
    """
    Pre-load files and put them to a queue, so working processes can get them.
    :param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
    :param loaded_queue: multiprocessing.Queue that will hold pre-loaded data
    """
    print("Started loading files...")
    for item in doc_pair_list:
        with open(item[1], 'rb') as f1:
            with open(item[2], 'rb') as f2:
                loaded_queue.put((item[0], f1.read(), f2.read()))  # if queue is full, this automatically waits until there is space
    print("Finished loading files.")
def data_analysis(doc_pair_list):
    # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
    # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
    loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
    loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
    loader.start()

    # create worker processes - these will do all calculations
    results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results to this queue
    worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
    workers = []  # create list of workers, so we can terminate them later
    for i in range(worker_count):
        worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
        worker.start()
        workers.append(worker)

    # main process just picks the results from queue and saves them to the dictionary
    results = {}
    i = 0  # results counter
    pairs_count = len(doc_pair_list)
    while i < pairs_count:
        res = results_queue.get(timeout=600)  # timeout is just in case something unexpected happened (results are calculated much quicker)
        # Queue.get() is blocking - if queue is empty, get() waits until something is put into queue and then gets it
        results[res[0]] = res[1:]  # save to dictionary by ID (first item in the result)
        i += 1  # count the processed pair

    # clean up the processes (so there aren't any zombies left)
    loader.terminate()
    loader.join()
    for worker in workers:
        worker.terminate()
        worker.join()

    return results
Please let me know about the results; I am quite interested in them and will assist you further if needed ;)
The first thing to do is to see whether you can find the real bottleneck, and I think using cProfile might confirm your suspicion or shed some more light on your problem.
You should be able to run your code unmodified using cProfile like this:
python -m cProfile -o profiling-results python-file-to-test.py
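Alternatively, if you prefer to profile from inside the script, cProfile can be invoked programmatically and write to the same output file (data_analysis(doc_pair_list) here stands for whatever call you want to measure):

import cProfile
cProfile.run('data_analysis(doc_pair_list)', 'profiling-results')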
After that you can analyze the results using pstats like this:
import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")
stats.print_stats(10)
More on profiling your code can be found in Marco Bonzanini's blog article My Python Code is Slow? Tips for Profiling.
This is my first time trying to use multiprocessing in Python. I'm trying to parallelize my function fun over my dataframe df by row. The callback function is just to append results to an empty list that I'll sort through later.
Is this the correct way to use apply_async? Thanks so much.
import multiprocessing as mp

function_results = []
async_results = []

p = mp.Pool() # by default should use number of processors

for row in df.iterrows():
    r = p.apply_async(fun, (row,), callback=function_results.extend)
    async_results.append(r)

for r in async_results:
    r.wait()

p.close()
p.join()
It looks like using map or imap_unordered (depending on whether you need your results to be ordered or not) would better suit your needs:
import multiprocessing as mp

#prepare stuff

if __name__ == "__main__":
    p = mp.Pool()
    function_results = list(p.imap_unordered(fun, df.iterrows())) #unordered
    #function_results = p.map(fun, df.iterrows()) #ordered
    p.close()
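If fun is cheap per row, the per-row inter-process overhead can dominate; both map and imap_unordered accept a chunksize argument that sends rows to the workers in batches, for example:

function_results = list(p.imap_unordered(fun, df.iterrows(), chunksize=100))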