The spaCy library has problems when it is shared across executors (it cannot be pickled). One workaround is to import it independently on each map execution, but loading it takes a while.
I'm new to Spark, so I don't understand the exact mechanism behind map. What will happen in the example below?
I'm afraid of the worst-case scenario, where each line of text is processed independently and spaCy is imported afresh for every one. A fresh import can take a good 10+ s, and we have 1,000,000+ lines of text.
from pyspark import SparkContext

class SpacyMagic(object):
    """Lazily loads spaCy models and caches them per Python process."""
    _spacys = {}

    @classmethod
    def get(cls, lang):
        if lang not in cls._spacys:
            import spacy
            cls._spacys[lang] = spacy.load(lang)
        return cls._spacys[lang]

def run_spacy(sent):
    nlp = SpacyMagic.get('en')
    return [wrd.text for wrd in nlp(sent)]

sc = SparkContext(appName="LineTokenizer")
data = sc.textFile(s3in)  # s3in: input path, defined elsewhere
res = data.map(run_spacy)
print(res.take(100))
sc.stop()
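To make the caching behaviour concrete, here is a minimal sketch that needs neither Spark nor spaCy. `slow_load` is a hypothetical stand-in for `spacy.load`, and the call counter exists only for illustration. Within one Python process the class-level dictionary is filled on the first call and reused afterwards, so the expensive load should happen once per worker process rather than once per line, assuming the worker process is reused between records.

```python
import time

LOAD_CALLS = {"count": 0}  # illustration only: counts expensive loads


def slow_load(lang):
    """Hypothetical stand-in for spacy.load(lang)."""
    LOAD_CALLS["count"] += 1
    time.sleep(0.01)  # pretend this is the 10+ s model load
    return lambda sent: sent.split()


class ModelCache(object):
    _models = {}

    @classmethod
    def get(cls, lang):
        # Load only on the first request in this process.
        if lang not in cls._models:
            cls._models[lang] = slow_load(lang)
        return cls._models[lang]


def tokenize(sent):
    nlp = ModelCache.get("en")
    return nlp(sent)


lines = ["first line of text", "second line", "third"]
tokens = [tokenize(line) for line in lines]
print(tokens[0], LOAD_CALLS["count"])
```

Three lines are tokenized here, but the loader is only entered on the first call; the later calls hit the cached entry.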
I am a fairly new programmer, both in Python and in general, and I'm currently trying to parallelize a heavily CPU-bound process in my code. I'm using Anaconda to create environments and Visual Studio Code to debug.
A summary of the code is as follows:
import concurrent.futures
from tkinter import filedialog
import myfuncs as mf

file_path = filedialog.askopenfilename(title='Ask for a file containing data')
# import data from file_path
a = input('Ask the user for input')
Next, calculations are made from these inputs, and I reach a stage where I need to iterate over a list of lists. Each inner list may contain up to two values, and the processing calls functions in a separate file.
For example, the inputs are:
sub_data1 = [test1]
sub_data2 = [test1, test2]
dataset = [sub_data1, sub_data2]
At this stage I use a concurrent.futures.ProcessPoolExecutor() instance and its .map() method:
with concurrent.futures.ProcessPoolExecutor() as executor:
    sm_res = executor.map(mf.process_distr, dataset)
Inside myfuncs.py, the mf.process_distr() function works like this:
def process_distr(tests):
    sm_reg = []
    for i in range(len(tests)):
        if i == 0:
            # do stuff
            sm_reg.append(result1)
        else:
            # do stuff
            sm_reg.append(result2)
    return sm_reg
The problem is that when I execute main.py, it seems to start running multiple times: it asks for user input and the file dialog pops up repeatedly (as many times as there are cores).
How can I resolve this?
Edit: After reading more about it, wrapping the whole main.py code in:
if __name__ == '__main__':
did the trick. Thank you to everyone who took the time to help with my rookie problem.
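Why the guard helps: with the spawn start method (the default on Windows and macOS), each worker process imports main.py again, so any top-level code, including prompts and dialogs, runs once per worker. A minimal sketch of the layout follows; `process_distr` here is a simplified stand-in for the function in myfuncs.py, and the hard-coded dataset replaces the file dialog and user input.

```python
import concurrent.futures


def process_distr(tests):
    # Simplified stand-in: square the first value, negate the rest.
    sm_reg = []
    for i, value in enumerate(tests):
        if i == 0:
            sm_reg.append(value * value)
        else:
            sm_reg.append(-value)
    return sm_reg


def main():
    # Interactive steps (file dialog, input()) belong here, so they run
    # only in the parent process and never when a worker re-imports the module.
    dataset = [[3], [4, 5]]
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(process_distr, dataset))


if __name__ == "__main__":
    print(main())
```

Function definitions stay at module level so the workers can import them; only the code that should run once goes under the guard.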
I am writing a genetic optimization algorithm based on the deap package in Python 2.7 (the goal is to migrate to Python 3 soon). As it is a pretty heavy process, some parts of the optimisation are run with the multiprocessing package. Here is a summary outline of my program:
1. Configurations are read in and saved in a config object.
2. Some additional pre-computations are made and also saved in the config object.
3. The optimisation starts (the population is initialized randomly, and mutation and crossover are applied to find a better solution); some parts of it (the evaluation function) are executed with multiprocessing.
4. The results are saved.
For the evaluation function, we need access to some parts of the config object (which stays constant after phase 2). Therefore we make it accessible to the different cores using a global (constant) variable:
from deap import base
import multiprocessing

toolbox = base.Toolbox()

def evaluate(ind):
    # compute evaluation using config object
    return (obj1, obj2)

toolbox.register('evaluate', evaluate)

def init_pool_global_vars(self, _config):
    global config
    config = _config

...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
                            initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
    # creating new individuals
    # computing in optimisation the objective function on the different individuals
    jobs = toolbox.map(toolbox.evaluate, ind)
    fits = jobs.get()
    # keeping best individuals
We basically iterate (one big loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, such as a big numpy array), the code runs much slower (fewer iterations in the same timespan) even though it is otherwise unchanged. So I thought I would make a specific config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as the global variable. However, on 3 cores it is slower than with the big config object, and on 72 cores it is slightly faster, but not by much.
What should I do to make sure my loops don't suffer in speed from the config object, or from any other data manipulation I do before launching the multiprocessing loops?
This runs in a Linux Docker image on a Linux VM in the cloud.
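For reference, the initializer pattern described above can be reduced to a small standard-library sketch. The names `init_worker`, `_CONFIG` and the toy `evaluate` are illustrative assumptions, not the program's actual code. The initializer runs once in each worker process when the pool starts, so the config is handed over once per worker, and each task only carries an individual.

```python
import multiprocessing

_CONFIG = None  # set once per worker by init_worker


def init_worker(config):
    """Store the read-only config in a module-level global of this process."""
    global _CONFIG
    _CONFIG = config


def evaluate(ind):
    # Toy objective: weighted sum of the individual's genes.
    return sum(w * g for w, g in zip(_CONFIG["weights"], ind))


def main():
    config = {"weights": [1.0, 2.0, 3.0]}
    population = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
    pool = multiprocessing.Pool(processes=2, initializer=init_worker,
                                initargs=(config,))
    try:
        return pool.map(evaluate, population)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(main())
```

Timing this kind of stripped-down version with a small and a large config is one way to check whether the slowdown really comes from the config object or from something else in the loop.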
The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only" like what you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load

folder = './joblib_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

def evaluate(ind, data):
    # compute evaluation using shared memory data
    return (obj1, obj2)

# just used to initialize memory mapped data
def init_memmap_data(original_data):
    data_filename_memmap = os.path.join(folder, 'data_memmap')
    dump(original_data, data_filename_memmap)
    shared_data = load(data_filename_memmap, mmap_mode='r')
    return shared_data

...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)
# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)
# get index of the maximum as the "best" individual
# (results is a plain list, so np.argmax is used; ranking by the first objective is an assumption)
best_fit_individual = indexes[int(np.argmax([fit[0] for fit in results]))]
Additionally, joblib supports a threading backend that may be faster than the process based one. It will be easy to test both with joblib.
I have written a Python script that contains the function below. The lemmatization function takes so much time that it hurts the code's efficiency. I am using the spaCy module for lemmatization.
def lemmatization(cleaned_data, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    try:
        logging.info("loading function lemmatization")
        texts = list(sent_to_words(cleaned_data))
        texts_out = []
        # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
        # Run in terminal: python3 -m spacy download en
        nlp = spacy.load('en', disable=['parser', 'ner'])
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] \
                else '' for token in doc if token.pos_ in allowed_postags]))
    except Exception as error:
        logging.info("Error occured in Lemmatization method. Error is %s", error)
    return texts_out
Is there any way to optimize it?
Thanks in advance!
Variable names and variable transformations. I do not quite understand what the data variables hold. cleaned_data is text, texts is again a list of words, and what is sent in texts? Things will improve if you rename variables, document the arguments in function docstrings, and add type annotations (Python 3.6+). This is very typical when you work with a program as a script, but unclear variables haunt both an outside reader like myself and probably the authors of the code 2-3 months from now, so better to change them.
Ideas for speedup. As for speedup, I think the following cases are possible:
- the nlp function is slow itself
- nlp() encounters lots of errors and does a lot of logging
- something is slow in the rest of the script (but these things seem rather minimal)
- sent_to_words() is not shown, maybe something happens there
Refactoring. To profile the program you need to split it into functions to see what actually takes a lot of time. See the refactoring below, hope it helps.
import logging
import spacy
from profilehooks import profile

# your actual function here
def sent_to_words(x):
    pass

# a small speedup comes from == vs in
def exclude_pron(token):
    x = token.lemma_
    if x == '-PRON-':
        return ''
    return x

# functional approach, could be faster than a single comprehension
def extract_lemmas(doc, allowed_postags):
    gen = (token for token in doc if token.pos_ in allowed_postags)
    return map(exclude_pron, gen)

def make_model():
    """Initialize spacy 'en' model, keeping only tagger component for efficiency.
    Run in terminal: python3 -m spacy download en
    """
    return spacy.load('en', disable=['parser', 'ner'])

def make_texts_out(texts, nlp, allowed_postags):
    texts_out = []
    for sent in texts:
        # really important and bothering = what is 'sent'?
        doc = nlp(" ".join(sent))
        res = extract_lemmas(doc, allowed_postags)
        # join the lemmas back into one string, as the original function did
        texts_out.append(" ".join(res))
    # return the whole list, not just the last result
    return texts_out

# FIXME:
# - *cleaned_data* is too generic a variable name, better rename
# - flow of variables is unclear: cleaned_data is split to words,
#   and then combined to text " ".join(sent) again,
#   it is not so clear what happens
@profile(immediate=True, entries=20)
def lemmatization(cleaned_data: list, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    logging.info("loading function lemmatization")
    texts = list(sent_to_words(cleaned_data))
    nlp = make_model()
    texts_out = []  # defined up front so the return below cannot fail
    try:
        texts_out = list(make_texts_out(texts, nlp, allowed_postags))
    except Exception as error:
        logging.info("Error occurred in lemmatization method. Error is %s", error)
    return texts_out
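If profilehooks is not available, the same question (which function takes the time?) can be answered with the standard library's cProfile. This is a minimal sketch: `tokenize` and `lemmatize_all` are hypothetical placeholders for the real functions, and `profile_call` is a helper name made up for the example.

```python
import cProfile
import io
import pstats


def tokenize(text):
    # Placeholder for the real tokenization step.
    return text.split()


def lemmatize_all(texts):
    # Placeholder for the real lemmatization loop.
    return [" ".join(word.lower() for word in tokenize(t)) for t in texts]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
    return result, stream.getvalue()


result, report = profile_call(lemmatize_all, ["Dogs Bark", "Cats Purr"] * 100)
print(report)
```

Sorting by cumulative time puts the outer functions first, which makes it easier to see whether the time goes into the model call or into the surrounding Python code.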
I would like to call model.wv.most_similar_cosmul on the same copy of the model object, using multiple cores, on batches of input pairs.
The multiprocessing module requires multiple copies of the model, which would take too much RAM because my model is 30+ GB in RAM.
I have tried evaluating my query pairs. The first round took ~12 hours, and there may be more rounds coming. That's why I am looking for a threading solution. I understand Python has the Global Interpreter Lock issue.
Any suggestions?
Forking off processes using multiprocessing after your text-vector model is in memory and unchanging might work to let many processes share the same object-in-memory.
In particular, you'd want to be sure that the automatic generation of unit-normed vectors (into a syn0norm or doctag_syn0norm) has already happened. It'll be automatically triggered the first time it's needed by a most_similar() call, or you can force it with the init_sims() method on the relevant object. If you'll only be doing most-similar queries between unit-normed vectors, never needing the original raw vectors, use init_sims(replace=True) to clobber the raw mixed-magnitude syn0 vectors in-place and thus save a lot of addressable memory.
Gensim also has options to use memory-mapped files as the source of a model's giant arrays. When multiple processes use the same read-only memory-mapped file, the OS is smart enough to map that file into physical memory only once, giving every process pointers to the shared array.
For more discussion of the tricky parts of using this technique in a similar-but-not-identical use case, see my answer at:
How to speed up Gensim Word2vec model load time?
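The fork-after-load idea can be sketched with the standard library alone. This is an illustration under assumptions: `BIG_TABLE` is a small stand-in for the multi-gigabyte model, `load_model` and `lookup` are hypothetical names, and the fork start method is only available on Unix-like systems. Children created by fork see the parent's memory pages without receiving a pickled copy, though Python's reference counting can still cause some pages to be copied over time.

```python
import multiprocessing

BIG_TABLE = None  # stand-in for a large, read-only model


def load_model(table):
    """Populate the module-level global before any worker is forked."""
    global BIG_TABLE
    BIG_TABLE = table


def lookup(i):
    # Workers only read the inherited global; nothing is sent per task
    # except the small index.
    return BIG_TABLE[i] * 2


def main():
    load_model(list(range(1000000)))
    if "fork" not in multiprocessing.get_all_start_methods():
        # No fork on this platform: fall back to in-process evaluation.
        return [lookup(i) for i in (0, 10, 999999)]
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(processes=2)
    try:
        return pool.map(lookup, [0, 10, 999999])
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(main())
```

The important ordering is that the data is loaded before the pool is created; anything loaded afterwards exists only in the parent.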
Gensim v4.x.x simplified a lot of what @gojomo described above, as he also explained in his other answer here. Based on those answers, here's an example of how you can multiprocess most_similar in a memory-efficient way, including logging of progress with tqdm. Swap in your own model/dataset to see how this works at scale.
import multiprocessing
from functools import partial
from typing import Dict, List, Tuple

import tqdm
from gensim.models.word2vec import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import common_texts


def get_most_similar(
    word: str, keyed_vectors: KeyedVectors, topn: int
) -> List[Tuple[str, float]]:
    try:
        return keyed_vectors.most_similar(word, topn=topn)
    except KeyError:
        return []


def get_most_similar_batch(
    word_batch: List[str], word_vectors_path: str, topn: int
) -> Dict[str, List[Tuple[str, float]]]:
    # Load the keyedvectors with mmap, so memory isn't duplicated
    keyed_vectors = KeyedVectors.load(word_vectors_path, mmap="r")
    return {word: get_most_similar(word, keyed_vectors, topn) for word in word_batch}


def create_batches_from_iterable(iterable, batch_size=1000):
    return [iterable[i : i + batch_size] for i in range(0, len(iterable), batch_size)]


if __name__ == "__main__":
    model = Word2Vec(
        sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4
    )
    # Save wv, so it can be reloaded with mmap later
    word_vectors_path = "word2vec.wordvectors"
    model.wv.save(word_vectors_path)
    # Dummy set of words to find most similar words for
    words_to_match = list(model.wv.key_to_index.keys())
    # Multiprocess
    batches = create_batches_from_iterable(words_to_match, batch_size=2)
    partial_func = partial(
        get_most_similar_batch,
        word_vectors_path=word_vectors_path,
        topn=5,
    )
    words_most_similar = dict()
    num_workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_workers) as pool:
        max_ = len(batches)
        with tqdm.tqdm(total=max_) as pbar:
            # imap required for tqdm to function properly
            for result in pool.imap(partial_func, batches):
                words_most_similar.update(result)
                pbar.update()
I have written a simple MapReduce flow to read lines from a CSV file on Google Cloud Storage and subsequently make an Entity from each. However, I can't seem to get it to run on more than one shard.
The code makes use of mapreduce.control.start_map and looks something like this.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'lines'})
I have shard_count in both places because I'm not sure which methods actually need it. Setting shard_count anywhere from 8 to 32 doesn't change anything: the status page always says 1/1 shards running. To separate things, I've made everything run on a backend queue with a large number of instances. I've tried adjusting the queue parameters per this wiki. In the end, it seems to just run serially.
Any ideas? Thanks!
Update (Still no success):
In trying to isolate things, I tried making the call using direct calls to pipeline like so:
class ImportHandler(webapp2.RequestHandler):
    def get(self, gsfile):
        pipeline = LoadEntitiesPipeline2(gsfile)
        pipeline.start(queue_name=get_queue_name("q-1"))
        self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)

class LoadEntitiesPipeline2(base_handler.PipelineBase):
    def run(self, gsfile):
        yield mapreduce_pipeline.MapperPipeline(
            'loadentities2_' + gsfile,
            'backend.line_processor',
            'mapreduce.input_readers.FileInputReader',
            params={'files': [gsfile], 'format': 'lines'},
            shards=32
        )
With this new code, it still only runs on one shard. I'm starting to wonder if mapreduce.input_readers.FileInputReader is capable of parallelizing input by line.
It looks like FileInputReader can only shard by file. The format parameter only changes the way the mapper function gets called. If you pass more than one file to the mapper, it will start to run on more than one shard. Otherwise it will use only one shard to process the data.
EDIT #1:
After digging deeper into the mapreduce library: MapReduce decides whether or not to split a file into pieces based on what the can_split method returns for each file format it defines. Currently, the only format that implements the split method is ZipFormat. So, if your file format is not zip, it won't split the file to run on more than one shard.
@classmethod
def can_split(cls):
    """Indicates whether this format support splitting within a file boundary.

    Returns:
      True if a FileFormat allows its inputs to be splitted into
      different shards.
    """
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/file_formats.py
But it looks like it is possible to write your own split method for a file format. You can try hacking a split method onto _TextFormat first and see whether more than one shard runs.
@classmethod
def split(cls, desired_size, start_index, opened_file, cache):
    pass
EDIT #2:
An easy workaround would be to let the FileInputReader run serially but move the time-consuming task to the parallel reduce stage.
import random

def line_processor(line):
    # serial
    yield (random.randrange(1000), line)

def reducer(key, values):
    # parallel
    entities = []
    for v in values:
        entities.append(CREATE_ENTITY_FROM_VALUE(v))
    db.put(entities)
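The effect of the random key can be seen in a plain-Python simulation, with no MapReduce library involved. The names `scatter` and `num_keys` are assumptions made for this sketch. The mapper tags each line with a random key, the shuffle groups lines by key, and each group could then be handled by a different reducer in parallel.

```python
import random
from collections import defaultdict


def line_processor(line, num_keys, rng):
    # Map step: tag the line with a random key in [0, num_keys).
    return rng.randrange(num_keys), line


def scatter(lines, num_keys, seed=0):
    """Group lines by random key, imitating the shuffle before reduce."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for line in lines:
        key, value = line_processor(line, num_keys, rng)
        groups[key].append(value)
    return dict(groups)


lines = ["row-%d" % i for i in range(100)]
groups = scatter(lines, num_keys=4)
for key in sorted(groups):
    print(key, len(groups[key]))
```

The number of distinct keys bounds how many reducers can be busy at once, so it is worth choosing it at least as large as the intended reduce parallelism.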
EDIT #3:
If you try to modify the FileFormat, here is an example (it hasn't been tested yet):
from file_formats import _TextFormat, FORMATS


class _LinesSplitFormat(_TextFormat):
    """Read file line by line."""

    NAME = 'split_lines'

    def get_next(self):
        """Inherited."""
        index = self.get_index()
        cache = self.get_cache()
        offset = sum(cache['infolist'][:index])
        # get_current_file is a method, so it must be called before seek()
        self.get_current_file().seek(offset)
        result = self.get_current_file().readline()
        if not result:
            raise EOFError()
        if 'encoding' in self._kwargs:
            result = result.encode(self._kwargs['encoding'])
        return result

    @classmethod
    def can_split(cls):
        """Inherited."""
        return True

    @classmethod
    def split(cls, desired_size, start_index, opened_file, cache):
        """Inherited."""
        if 'infolist' in cache:
            infolist = cache['infolist']
        else:
            infolist = []
            for i in opened_file:
                infolist.append(len(i))
            cache['infolist'] = infolist

        index = start_index
        while desired_size > 0 and index < len(infolist):
            desired_size -= infolist[index]
            index += 1
        return desired_size, index


FORMATS['split_lines'] = _LinesSplitFormat
The new file format can then be used by changing 'format' in mapper_parameters from lines to split_lines.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'split_lines'})
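The bookkeeping inside split can be checked in isolation. Below is a standalone sketch of the same loop, with no dependency on the mapreduce library; `split_by_size`, `shard_boundaries` and the sample line lengths are made up for illustration. Starting at `start_index`, it consumes line lengths until the requested size is used up and returns the leftover size together with the index where the next shard should start.

```python
def split_by_size(line_lengths, desired_size, start_index):
    """Advance over lines until desired_size bytes have been consumed."""
    index = start_index
    while desired_size > 0 and index < len(line_lengths):
        desired_size -= line_lengths[index]
        index += 1
    return desired_size, index


def shard_boundaries(line_lengths, shard_size):
    """Return (start, end) line-index pairs covering every line."""
    boundaries = []
    start = 0
    while start < len(line_lengths):
        _, end = split_by_size(line_lengths, shard_size, start)
        boundaries.append((start, end))
        start = end
    return boundaries


line_lengths = [10, 20, 30, 5, 5, 40]
print(shard_boundaries(line_lengths, shard_size=30))
```

A shard always ends on a line boundary, so a long line can push a shard past the requested size; the negative leftover value signals that overshoot.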
It looks to me like FileInputReader should be capable of sharding based on a quick reading of:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/input_readers.py
It looks like 'format': 'lines' should split using: self.get_current_file().readline()
Does it seem to be interpreting the lines correctly when it is working serially? Maybe the line breaks are the wrong encoding or something.
From experience, FileInputReader will do a maximum of one shard per file.
Solution: Split your big files. I use split_file in https://github.com/johnwlockwood/karl_data to shard files before uploading them to Cloud Storage.
If the big files are already up there, you can use a Compute Engine instance to pull them down and do the sharding because the transfer speed will be fastest.
FYI: karld is in the cheeseshop so you can pip install karld
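If installing karld is not an option, splitting by line count needs only the standard library. This is a rough sketch and not karld's API: `split_file_by_lines`, the `lines_per_shard` parameter and the output naming scheme are assumptions made for the example.

```python
import os
import tempfile


def split_file_by_lines(path, out_dir, lines_per_shard):
    """Write consecutive chunks of `path` to numbered shard files."""
    shard_paths = []
    shard = None
    with open(path, "r") as source:
        for line_number, line in enumerate(source):
            if line_number % lines_per_shard == 0:
                # Start a new shard file every lines_per_shard lines.
                if shard is not None:
                    shard.close()
                name = "shard_%05d.csv" % len(shard_paths)
                shard_path = os.path.join(out_dir, name)
                shard = open(shard_path, "w")
                shard_paths.append(shard_path)
            shard.write(line)
    if shard is not None:
        shard.close()
    return shard_paths


# Demo on a small temporary file.
work_dir = tempfile.mkdtemp()
big_file = os.path.join(work_dir, "big.csv")
with open(big_file, "w") as handle:
    for i in range(10):
        handle.write("%d,value-%d\n" % (i, i))

shards = split_file_by_lines(big_file, work_dir, lines_per_shard=4)
print([os.path.basename(p) for p in shards])
```

Each resulting file can then be listed in the 'files' parameter, which gives FileInputReader one shard per file.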