Memory limit hit with appengine-mapreduce - python

I'm working with the appengine-mapreduce library and have modified the demo to fit my purpose.
Basically I have over a million lines in the following format: userid, time1, time2. My goal is to find the difference between time1 and time2 for each userid.
However, as I run this on Google App Engine, I encountered this error message in the logs section:
Exceeded soft private memory limit with 180.56 MB after servicing 130 requests total
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
def time_count_map(data):
    """Time count map function."""
    (entry, text_fn) = data
    text = text_fn()
    try:
        q = text.split('\n')
        for m in q:
            reader = csv.reader([m.replace('\0', '')], skipinitialspace=True)
            for s in reader:
                """Calculate time elapsed"""
                sdw = s[1]
                start_date = time.strptime(sdw, "%m/%d/%y %I:%M:%S%p")
                edw = s[2]
                end_date = time.strptime(edw, "%m/%d/%y %I:%M:%S%p")
                time_difference = time.mktime(end_date) - time.mktime(start_date)
                yield (s[0], time_difference)
    except IndexError, e:
        logging.debug(e)
def time_count_reduce(key, values):
    """Time count reduce function."""
    time = 0.0
    for subtime in values:
        time += float(subtime)
    realtime = int(time)
    yield "%s: %d\n" % (key, realtime)
Can anyone suggest how I can better optimize my code? Thanks!!
Edited:
Here's the pipeline handler:
class TimeCountPipeline(base_handler.PipelineBase):
    """A pipeline to run Time count demo.

    Args:
        blobkey: blobkey to process as string. Should be a zip archive with
            text files inside.
    """

    def run(self, filekey, blobkey):
        logging.debug("filename is %s" % filekey)
        output = yield mapreduce_pipeline.MapreducePipeline(
            "time_count",
            "main.time_count_map",
            "main.time_count_reduce",
            "mapreduce.input_readers.BlobstoreZipInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                "blob_key": blobkey,
            },
            reducer_params={
                "mime_type": "text/plain",
            },
            shards=32)
        yield StoreOutput("TimeCount", filekey, output)
Mapreduce.yaml:
mapreduce:
- name: Make messages lowercase
  params:
  - name: done_callback
    value: /done
  mapper:
    handler: main.lower_case_posts
    input_reader: mapreduce.input_readers.DatastoreInputReader
    params:
    - name: entity_kind
      default: main.Post
    - name: processing_rate
      default: 100
    - name: shard_count
      default: 4
- name: Make messages upper case
  params:
  - name: done_callback
    value: /done
  mapper:
    handler: main.upper_case_posts
    input_reader: mapreduce.input_readers.DatastoreInputReader
    params:
    - name: entity_kind
      default: main.Post
    - name: processing_rate
      default: 100
    - name: shard_count
      default: 4
The rest of the files are exactly the same as the demo.
I've uploaded a copy of my codes on dropbox: http://dl.dropbox.com/u/4288806/demo%20compressed%20fail%20memory.zip

Also consider calling gc.collect() at regular points during your code. I've seen several SO questions about exceeding soft memory limits that were alleviated by calling gc.collect(), most having to do with blobstore.
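For example, a minimal, untested sketch of that idea applied to the map function from the question (the interval constant is an arbitrary value you would tune, not something prescribed by the library):

import csv
import gc
import logging
import time

LINES_PER_GC = 10000  # hypothetical interval; tune for your workload

def time_count_map(data):
    """Time count map function with periodic garbage collection."""
    (entry, text_fn) = data
    text = text_fn()
    try:
        for i, m in enumerate(text.split('\n')):
            if i and i % LINES_PER_GC == 0:
                gc.collect()  # release objects left over from previous iterations
            reader = csv.reader([m.replace('\0', '')], skipinitialspace=True)
            for s in reader:
                start_date = time.strptime(s[1], "%m/%d/%y %I:%M:%S%p")
                end_date = time.strptime(s[2], "%m/%d/%y %I:%M:%S%p")
                yield (s[0], time.mktime(end_date) - time.mktime(start_date))
    except IndexError as e:
        logging.debug(e)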

It is likely your input file exceeds the soft memory limit in size. For big files use either BlobstoreLineInputReader or BlobstoreZipLineInputReader.
These input readers pass something different to the map function: the start_position in the file and the line of text.
Your map function might look something like:
def time_count_map(data):
    """Time count map function."""
    text = data[1]
    try:
        reader = csv.reader([text.replace('\0', '')], skipinitialspace=True)
        for s in reader:
            """Calculate time elapsed"""
            sdw = s[1]
            start_date = time.strptime(sdw, "%m/%d/%y %I:%M:%S%p")
            edw = s[2]
            end_date = time.strptime(edw, "%m/%d/%y %I:%M:%S%p")
            time_difference = time.mktime(end_date) - time.mktime(start_date)
            yield (s[0], time_difference)
    except IndexError, e:
        logging.debug(e)
Using BlobstoreLineInputReader will allow the job to run much faster, as it can use more than one shard (up to 256), but it means you need to upload your files uncompressed, which can be a pain. I handle it by uploading the compressed files to an EC2 Windows server, then decompressing and uploading from there, since the upstream bandwidth from EC2 is so large.
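For reference, a hedged sketch of how the pipeline call might change with the line-based reader; the mapper parameter name is an assumption based on the reader's documented usage, so check the docstring of BlobstoreLineInputReader in your copy of the mapreduce library:

class TimeCountPipeline(base_handler.PipelineBase):
    """Same pipeline as above, but reading uncompressed text line by line."""

    def run(self, filekey, blobkey):
        output = yield mapreduce_pipeline.MapreducePipeline(
            "time_count",
            "main.time_count_map",
            "main.time_count_reduce",
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                # BlobstoreLineInputReader expects a list of blob keys; the
                # parameter name here is an assumption -- verify it against
                # the reader's docstring in your mapreduce library version.
                "blob_keys": [blobkey],
            },
            reducer_params={
                "mime_type": "text/plain",
            },
            shards=64)
        yield StoreOutput("TimeCount", filekey, output)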

Related

TransportError (429) 'Data too large' when loading JSON docs to ElasticSearch

I have a process running in Python 3.7 that loads JSON files, gathers file rows into chunks in async queues, and incrementally posts chunks to ElasticSearch for indexing.
Chunking is meant to avoid overloading the ElasticSearch connection.
def load_files_to_queue(file_queue, incomplete_files, doc_queue, index):
    logger.info("Initializing load files to queue")
    while True:
        try:
            current_file = file_queue.get(False)
            logger.info("Loading {} into queue.".format(current_file))
            iteration_counter = 0
            with open(current_file) as loaded_file:
                iterator = json_iterator(loaded_file)
                current_type = "doctype"
                chunk = []
                for row in iterator:
                    # Every so often check the queue size
                    iteration_counter += 1
                    if iteration_counter > 5000:
                        # If it gets too big, pause until it has gone
                        # down a bunch.
                        if doc_queue.qsize() > 30:
                            logger.info(
                                "Doc queue at {}, pausing until smaller.".format(
                                    doc_queue.qsize()
                                )
                            )
                            while doc_queue.qsize() > 10:
                                time.sleep(0.5)
                        iteration_counter = 0
                    for transformed in transform_single_doc(current_type, row, index):
                        if transformed:
                            chunk.append(transformed)
                    # NOTE: Send messages in chunks instead of single rows so that queue
                    # has less frequent locking
                    if len(chunk) >= DOC_QUEUE_CHUNK_SIZE:
                        doc_queue.put(chunk)
                        chunk = []
                if chunk:
                    doc_queue.put(chunk)
            incomplete_files.remove(current_file)
            logger.info("Finished loading {} into queue.".format(current_file))
            logger.info("There are {} files left to load.".format(file_queue.qsize()))
        except Empty:
            break
def bulk_load_from_queue(file_queue, incomplete_files, doc_queue, chunk_size=500):
    """
    Represents a single worker thread loading docs into ES
    """
    logger.info("Initialize bulk doc loader {}".format(threading.current_thread()))
    conn = Elasticsearch(settings.ELASTICSEARCH, timeout=180)
    dequeue_results(
        streaming_bulk(
            conn,
            load_docs_from_queue(file_queue, incomplete_files, doc_queue),
            max_retries=2,
            initial_backoff=10,
            chunk_size=chunk_size,
            yield_ok=False,
            raise_on_exception=True,
            raise_on_error=True,
        )
    )
    logger.info("Shutting down doc loader {}".format(threading.current_thread()))
Occasionally an error like this would happen in bulk_load_from_queue, which I interpret to mean the chunk was too large.
TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [1024404322/976.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1013836880/966.8mb], new bytes reserved: [10567442/10mb], usages [request=32880/32.1kb, fielddata=7440/7.2kb, in_flight_requests=164031664/156.4mb, accounting=46679308/44.5mb]')
Re-running usually resolved this, but the error has become too frequent. So I looked at enforcing a chunk size limit in load_files_to_queue, like so:
for transformed in transform_single_doc(current_type, row, index):
    if transformed:
        chunk_size = chunk_size + sys.getsizeof(transformed)
        chunk.append(transformed)
# NOTE: Send messages in chunks instead of single rows so that queue
# has less frequent locking
if (
    chunk_size >= DOC_QUEUE_CHUNK_SIZE
    or len(chunk) >= DOC_QUEUE_CHUNK_LEN
):
    doc_queue.put(chunk)
    chunk = []
    chunk_size = 0
if len(chunk) > 0:
    doc_queue.put(chunk)
This results in a handful of these errors towards the end of processing:
ConnectionResetError
[Errno 104] Connection reset by peer
and then:
EOFError multiprocessing.connection in _recv
Basically this means your request to Elasticsearch was too large for it to handle, so you could try reducing the chunk size.
Alternatively, look at using the _bulk API; there are helpers in the Python clients which should take most of the pain away for this.
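For instance, a hedged sketch rather than a drop-in fix: since you are already on the streaming_bulk helper, you can also cap the request payload in bytes via max_chunk_bytes in addition to chunk_size; the 10 MB value below is an arbitrary example, and load_docs_from_queue, settings and logger are the names from your code above:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

conn = Elasticsearch(settings.ELASTICSEARCH, timeout=180)

for ok, item in streaming_bulk(
    conn,
    load_docs_from_queue(file_queue, incomplete_files, doc_queue),
    chunk_size=200,                    # at most 200 docs per bulk request
    max_chunk_bytes=10 * 1024 * 1024,  # and at most ~10 MB per request (example value)
    max_retries=2,
    initial_backoff=10,
    raise_on_error=True,
):
    if not ok:
        logger.warning("Failed to index document: {}".format(item))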
We are encountering this again in our QA environment #brenden. Do you suggest further reducing the chunk size? It is currently being passed as 200.
for worker in range(doc_worker_count):
    job = doc_pool.apply_async(
        bulk_load_from_queue,
        args=(file_queue, incomplete_files, doc_queue, 200),
        error_callback=error_callback,
    )
    jobs.append(job)

Tensorflow 2.3: How to parallelize reading text from big file?

I need to break down my 4 GB data-set file into small chunks. As part of optimizing the time consumption, I would like to maximize the parallel processing. Currently, I can observe that the CPU and GPU cores are underutilized. See the attached output in the image here.
My code snippet looks like below:
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_row(text, rating):
    # Create a dictionary mapping the feature name to the tf.Example-compatible data type.
    feature = {
        'text': _bytes_feature(text),
        'rating': _float_feature(rating),
    }
    # Create a Features message using tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

def transform(example):
    str_example = example.decode("utf-8")
    json_example = json.loads(str_example)
    overall = json_example.get('overall', -99)
    text = json_example.get('reviewText', '')
    if type(text) is str:
        text = bytes(text, 'utf-8')
    tf_serialized_string = serialize_row(text, overall)
    return tf_serialized_string

line_dataset = tf.data.TextLineDataset(filenames=[file_path])
line_dataset = line_dataset.map(lambda row: tf.numpy_function(transform, [row], tf.string))
line_dataset = line_dataset.shuffle(2)
line_dataset = line_dataset.batch(NUM_OF_RECORDS_PER_BATCH_FILE)
'''
Perform batchwise transformation of the population.
'''
start = time.time()
for idx, line in line_dataset.enumerate():
    FILE_NAMES = 'test{0}.tfrecord'.format(idx)
    end = time.time()
    time_taken = end - start
    tf.print('Processing for file - {0}'.format(FILE_NAMES))
    DIRECTORY_URL = '/home/gaurav.gupta/projects/practice/'
    filepath = os.path.join(DIRECTORY_URL, 'data-set', 'electronics', FILE_NAMES)
    batch_ds = tf.data.Dataset.from_tensor_slices(line)
    writer = tf.data.experimental.TFRecordWriter(filepath)
    writer.write(batch_ds)
    tf.print('Processing for file - {0} took {1}'.format(FILE_NAMES, time_taken))
tf.print('Done')
Logs to showcase execution flow
Processing for file - test0.tfrecord took 14.350863218307495
Processing for file - test1.tfrecord took 12.695453882217407
Processing for file - test2.tfrecord took 12.904462575912476
Processing for file - test3.tfrecord took 12.344425439834595
Processing for file - test4.tfrecord took 11.188365697860718
Processing for file - test5.tfrecord took 11.319620609283447
Processing for file - test6.tfrecord took 11.285977840423584
Processing for file - test7.tfrecord took 11.169529438018799
Processing for file - test8.tfrecord took 11.289997816085815
Processing for file - test9.tfrecord took 11.431073188781738
Processing for file - test10.tfrecord took 11.428141593933105
Processing for file - test11.tfrecord took 3.223125457763672
Done
I have tried the num_parallel_reads argument but couldn't see much difference. I believe it can be handy while reading multiple files instead of a single big file.
I am seeking your suggestions for parallelizing this task to reduce the time consumption.
I would try something like this (I like to use joblib as it is quite simple to drop into existing code; you could probably do something similar with many other frameworks; furthermore, joblib does not use the GPU and does not do any JIT compilation):
from joblib import Parallel, delayed
from tqdm import tqdm

...

def process_file(idx, line):
    FILE_NAMES = 'test{0}.tfrecord'.format(idx)
    end = time.time()
    time_taken = end - start
    tf.print('Processing for file - {0}'.format(FILE_NAMES))
    DIRECTORY_URL = '/home/gaurav.gupta/projects/practice/'
    filepath = os.path.join(DIRECTORY_URL, 'data-set', 'electronics', FILE_NAMES)
    batch_ds = tf.data.Dataset.from_tensor_slices(line)
    writer = tf.data.experimental.TFRecordWriter(filepath)
    writer.write(batch_ds)
    #tf.print('Processing for file - {0} took {1}'.format(FILE_NAMES, time_taken))
    return FILE_NAMES, time_taken

times = Parallel(n_jobs=12, prefer="processes")(delayed(process_file)(idx, line) for idx, line in tqdm(line_dataset.enumerate(), total=len(line_dataset)))
print('Done.')
This is untested code and I am also unsure how it will work with the tf code, but I would give it a try.
The tqdm is totally unnecessary; it is just something I prefer to use as it provides a nice progress bar.
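Independently of the framework choice, one low-effort thing worth checking (an untested assumption, reusing the transform function and constants from the question) is whether the per-line decode/serialize step is the bottleneck; tf.data can run that map step in parallel and prefetch ahead of the writer. Note that tf.numpy_function still runs Python code under the GIL, so the gain may be modest:

import tensorflow as tf

# Parallelize the per-line transform across CPU cores; AUTOTUNE lets tf.data
# pick the degree of parallelism, and prefetch overlaps producing the next
# batch with writing the current one.
line_dataset = tf.data.TextLineDataset(filenames=[file_path])
line_dataset = line_dataset.map(
    lambda row: tf.numpy_function(transform, [row], tf.string),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
line_dataset = line_dataset.shuffle(2)
line_dataset = line_dataset.batch(NUM_OF_RECORDS_PER_BATCH_FILE)
line_dataset = line_dataset.prefetch(tf.data.experimental.AUTOTUNE)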

What's the fastest way to expand url in python

I have a checkin list which contains about 600,000 checkins, and there is a URL in each checkin. I need to expand them back to the original long ones. I do so by:
now = time.time()
files_without_url = 0
for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1
    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()
But this takes far too long: from 0 to 1000 checkins, it takes 3241 seconds! Is this normal? What's the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are from Bitly while others are not, and I am not sure where they come from. In this case, I want to simply use the urllib2 module.
for your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short url)
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)
    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url
    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will
return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
With a pool of 10 workers, I can resolve 500 URLs in 900 seconds.
If I increase the number of workers to 100, I can resolve 500 URLs in 30 seconds.
If I increase the number of workers to 200, I can resolve 500 URLs in 25 seconds.
This is hopefully enough to get you started.
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually just grab for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
Threads are most appropriate in the case of network I/O. But you could try the following first.
pat = re.compile("(?P<url>https?://[^\s]+)")  # always compile it

missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't lookup .url if you don't need it later
        except URLError:  # or just Exception
            global bad_urls
            bad_urls += 1

for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, then, now that we have a nice check function, create a thread pool and feed it; a speedup is all but guaranteed. Using processes for network I/O is pointless.
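A minimal sketch of that thread-pool step, assuming the check function and NYC_checkins from above (note the global counters are incremented from multiple threads, so for exact counts you would want a lock or per-thread tallies):

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool, same API as multiprocessing.Pool

pool = ThreadPool(50)          # 50 concurrent connections; tune to taste
pool.map(check, NYC_checkins)  # blocks until every checkin has been checked
pool.close()
pool.join()

print(bad_urls, missing_urls)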

Sublime Text 3 plugin: ValueError with Edit objects?

I'm building a Sublime Text 3 plugin to shorten URLs using the goo.gl API. Bear in mind that the following code is hacked together from other plugins and tutorial code. I have no previous experience with Python.
The plugin does actually work as it is. The URL is shortened and replaced inline. Here is the plugin code:
import sublime
import sublime_plugin
import urllib.request
import urllib.error
import json
import threading


class ShortenUrlCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        sels = self.view.sel()
        threads = []
        for sel in sels:
            url = self.view.substr(sel)
            thread = GooglApiCall(sel, url, 5)  # Send the selection, the URL and timeout to the class
            threads.append(thread)
            thread.start()
        # Wait for threads
        for thread in threads:
            thread.join()

        self.view.sel().clear()
        self.handle_threads(edit, threads, sels)

    def handle_threads(self, edit, threads, sels, offset=0, i=0, dir=1):
        next_threads = []
        for thread in threads:
            sel = thread.sel
            result = thread.result
            if thread.is_alive():
                next_threads.append(thread)
                continue
            if thread.result == False:
                continue
            offset = self.replace(edit, thread, sels, offset)
        thread = next_threads

        if len(threads):
            before = i % 8
            after = (7) - before
            if not after:
                dir = -1
            if not before:
                dir = 1
            i += dir
            self.view.set_status("shorten_url", "[%s=%s]" % (" " * before, " " * after))
            sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
            return

        self.view.erase_status("shorten_url")
        selections = len(self.view.sel())
        sublime.status_message("URL shortener successfully ran on %s URL%s" %
                               (selections, "" if selections == 1 else "s"))

    def replace(self, edit, thread, sels, offset):
        sel = thread.sel
        result = thread.result
        if offset:
            sel = sublime.Region(edit, thread.sel.begin() + offset, thread.sel.end() + offset)
        self.view.replace(edit, sel, result)
        return


class GooglApiCall(threading.Thread):
    def __init__(self, sel, url, timeout):
        self.sel = sel
        self.url = url
        self.timeout = timeout
        self.result = None
        threading.Thread.__init__(self)

    def run(self):
        try:
            apiKey = "xxxxxxxxxxxxxxxxxxxxxxxx"
            requestUrl = "https://www.googleapis.com/urlshortener/v1/url"
            data = json.dumps({"longUrl": self.url})
            binary_data = data.encode("utf-8")
            headers = {
                "User-Agent": "Sublime URL Shortener",
                "Content-Type": "application/json"
            }
            request = urllib.request.Request(requestUrl, binary_data, headers)
            response = urllib.request.urlopen(request, timeout=self.timeout)
            self.result = json.loads(response.read().decode())
            self.result = self.result["id"]
            return
        except (urllib.error.HTTPError) as e:
            err = "%s: HTTP error %s contacting API. %s." % (__name__, str(e.code), str(e.reason))
        except (urllib.error.URLError) as e:
            err = "%s: URL error %s contacting API" % (__name__, str(e.reason))

        sublime.error_message(err)
        self.result = False
The problem is that I get the following error in the console every time the plugin runs:
Traceback (most recent call last):
File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 51, in <lambda>
sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 39, in handle_threads
offset = self.replace(edit, thread, sels, offset)
File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 64, in replace
self.view.replace(edit, sel, result)
File "/Applications/Sublime Text.app/Contents/MacOS/sublime.py", line 657, in replace
raise ValueError("Edit objects may not be used after the TextCommand's run method has returned")
ValueError: Edit objects may not be used after the TextCommand's run method has returned
I'm not sure what the problem is from that error. I have done some research and I understand that the solution may be held in the answer to this question, but due to my lack of Python knowledge I can't figure out how to adapt it to my use case.
I was searching for a Python autocompletion plugin for Sublime and found this question. I like your plugin idea. Did you ever get it working? The ValueError is telling you that you are trying to use the edit argument to ShortenUrlCommand.run after ShortenUrlCommand.run has returned. I think you could do this in Sublime Text 2 using begin_edit and end_edit, but in 3 your plugin has to finish all of its edits before run returns (https://www.sublimetext.com/docs/3/porting_guide.html).
In your code, the handle_threads function is checking the GooglApiCall threads every 100 ms and executing the replacement for any thread that has finished. But handle_threads has a typo that causes it to run forever: thread = next_threads where it should be threads = next_threads. This means that finished threads are never removed from the list of active threads and all threads get processed in each invocation of handle_threads (eventually throwing the exception that you see).
You actually don't need to worry about whether the GooglApiCall threads are finished in handle_threads, though, because you call join on each one before calling handle_threads (see the Python threading docs for more detail on join: https://docs.python.org/2/library/threading.html). You know the threads are finished, so you can just do something like:
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread, sels, offset)
    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))
This still has problems: it does not properly handle multiple selections and it blocks the UI thread in Sublime.
Multiple Selections
When you replace multiple selections you have to consider that the replacement text might not be the same length as the text it replaces. This shifts the text after it and you have to adjust the indexes for subsequent selected regions. For example, suppose the URLs are selected in the following text and that you are replacing them with shortened URLs:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://example.com/long blah blah http://example.com/longer blah
The second URL occupies indexes 44 to 68. After replacing the first URL we have:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://goo.gl/abc blah blah http://example.com/longer blah
Now the second URL occupies indexes 38 to 62. It is shifted by -6: the difference between the length of the string we just replaced and the length of the string we replaced it with. You need to keep track of that difference and update it after each replacement as you go along. It looks like you had this in mind with your offset argument, but never got around to implementing it.
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread.sel, thread.result, offset)
    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))

def replace(self, edit, selection, replacement_text, offset):
    # Adjust the selection region to account for previous replacements
    adjusted_selection = sublime.Region(selection.begin() + offset,
                                        selection.end() + offset)
    self.view.replace(edit, adjusted_selection, replacement_text)

    # Update the offset for the next replacement
    old_len = selection.size()
    new_len = len(replacement_text)
    delta = new_len - old_len
    new_offset = offset + delta

    return new_offset
Blocking the UI Thread
I'm not familiar with Sublime plugins, so I looked at how this is handled in the Gist plugin (https://github.com/condemil/Gist). They block the UI thread for the duration of the HTTP requests. This seems undesirable, but I think there might be a problem if you don't block: the user could change the text buffer and invalidate the selection indexes before your plugin finishes its updates. If you want to go down this road, you might try moving the URL shortening calls into a WindowCommand. Then once you have the replacement text you could execute a replacement command on the current view for each one. This example gets the current view and executes ShortenUrlCommand on it. You will have to move the code that collects the shortened URLs out into ShortenUrlWrapperCommand.run:
class ShortenUrlWrapperCommand(sublime_plugin.WindowCommand):
    def run(self):
        view = self.window.active_view()
        view.run_command("shorten_url")
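A hedged sketch of what that split might look like (the command and class names below are hypothetical, GooglApiCall is the class from the question, and this version still blocks while the threads join, which is the trade-off discussed above). Command arguments must be plain JSON-like values, so regions are passed as begin/end integers:

import sublime
import sublime_plugin


class ShortenUrlWrapperCommand(sublime_plugin.WindowCommand):
    """Hypothetical wrapper: do the slow API calls here, then hand the
    results to a TextCommand so every edit happens inside its run()."""

    def run(self):
        view = self.window.active_view()
        threads = []
        for sel in view.sel():
            thread = GooglApiCall(sel, view.substr(sel), 5)  # as in the original plugin
            threads.append(thread)
            thread.start()
        for thread in threads:
            thread.join()
        # Regions are not JSON-serializable, so pass plain ints and strings.
        replacements = [[t.sel.begin(), t.sel.end(), t.result]
                        for t in threads if t.result]
        view.run_command("apply_shortened_urls", {"replacements": replacements})


class ApplyShortenedUrlsCommand(sublime_plugin.TextCommand):
    def run(self, edit, replacements):
        # All edits happen inside this run call, so the edit object is valid.
        offset = 0
        for begin, end, text in replacements:
            region = sublime.Region(begin + offset, end + offset)
            self.view.replace(edit, region, text)
            offset += len(text) - region.size()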

in python: child processes going defunct while others are not, unsure why

Edit: the answer was that the OS was killing processes because I was consuming all the memory.
I am spawning enough subprocesses to keep the load average 1:1 with cores. However, at some point within the hour (this script could run for days), 3 of the processes go:
tipu 14804 0.0 0.0 328776 428 pts/1 Sl 00:20 0:00 python run.py
tipu 14808 64.4 24.1 2163796 1848156 pts/1 Rl 00:20 44:41 python run.py
tipu 14809 8.2 0.0 0 0 pts/1 Z 00:20 5:43 [python] <defunct>
tipu 14810 60.3 24.3 2180308 1864664 pts/1 Rl 00:20 41:49 python run.py
tipu 14811 20.2 0.0 0 0 pts/1 Z 00:20 14:04 [python] <defunct>
tipu 14812 22.0 0.0 0 0 pts/1 Z 00:20 15:18 [python] <defunct>
tipu 15358 0.0 0.0 103292 872 pts/1 S+ 01:30 0:00 grep python
I have no idea why this is happening. Attached are the master and the slave; I can attach the mysql/pg wrappers if needed as well. Any suggestions?
slave.py:
from boto.s3.key import Key
import multiprocessing
import gzip
import os
from mysql_wrapper import MySQLWrap
from pgsql_wrapper import PGSQLWrap
import boto
import re


class Slave:

    CHUNKS = 250000
    BUCKET_NAME = "bucket"
    AWS_ACCESS_KEY = ""
    AWS_ACCESS_SECRET = ""
    KEY = Key(boto.connect_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET).get_bucket(BUCKET_NAME))
    S3_ROOT = "redshift_data_imports"
    COLUMN_CACHE = {}
    DEFAULT_COLUMN_VALUES = {}

    def __init__(self, job_queue):
        self.log_handler = open("logs/%s" % str(multiprocessing.current_process().name), "a")
        self.mysql = MySQLWrap(self.log_handler)
        self.pg = PGSQLWrap(self.log_handler)
        self.job_queue = job_queue

    def do_work(self):
        self.log(str(os.getpid()))

        while True:
            #sample job in the abstract: mysql_db.table_with_date-iteration
            job = self.job_queue.get()

            #queue is empty
            if job is None:
                self.log_handler.close()
                self.pg.close()
                self.mysql.close()
                print("good bye and good day from %d" % (os.getpid()))
                self.job_queue.task_done()
                break

            #curtail iteration
            table = job.split('-')[0]

            #strip redshift table from job name
            redshift_table = re.sub(r"(_[1-9].*)", "", table.split(".")[1])

            iteration = int(job.split("-")[1])
            offset = (iteration - 1) * self.CHUNKS

            #columns redshift is expecting
            #bad tables will slip through and error out, so we catch it
            try:
                colnames = self.COLUMN_CACHE[redshift_table]
            except KeyError:
                self.job_queue.task_done()
                continue

            #mysql fields to use in SELECT statement
            fields = self.get_fields(table)

            #list subtraction determining which columns redshift has that mysql does not
            delta = (list(set(colnames) - set(fields.keys())))

            #subtract columns that have a default value and so do not need padding
            if delta:
                delta = list(set(delta) - set(self.DEFAULT_COLUMN_VALUES[redshift_table]))

            #concatenate columns with padded \N
            select_fields = ",".join(fields.values()) + (",\\N" * len(delta))

            query = "SELECT %s FROM %s LIMIT %d, %d" % (select_fields, table,
                                                        offset, self.CHUNKS)
            rows = self.mysql.execute(query)

            self.log("%s: %s\n" % (table, len(rows)))

            if not rows:
                self.job_queue.task_done()
                continue

            #if there is more data potentially, add it to the queue
            if len(rows) == self.CHUNKS:
                self.log("putting %s-%s" % (table, (iteration + 1)))
                self.job_queue.put("%s-%s" % (table, (iteration + 1)))

            #various characters need escaping
            clean_rows = []
            redshift_escape_chars = set(["\\", "|", "\t", "\r", "\n"])
            in_chars = ""

            for row in rows:
                new_row = []
                for value in row:
                    if value is not None:
                        in_chars = str(value)
                    else:
                        in_chars = ""

                    #escape any naughty characters
                    new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
                new_row = "\t".join(new_row)
                clean_rows.append(new_row)

            rows = ",".join(fields.keys() + delta)
            rows += "\n" + "\n".join(clean_rows)

            offset = offset + self.CHUNKS

            filename = "%s-%s.gz" % (table, iteration)
            self.move_file_to_s3(filename, rows)

            self.begin_data_import(job, redshift_table, ",".join(fields.keys() +
                                                                 delta))

            self.job_queue.task_done()

    def move_file_to_s3(self, uri, contents):
        tmp_file = "/dev/shm/%s" % str(os.getpid())
        self.KEY.key = "%s/%s" % (self.S3_ROOT, uri)
        self.log("key is %s" % self.KEY.key)

        f = gzip.open(tmp_file, "wb")
        f.write(contents)
        f.close()

        #local saving allows for debugging when copy commands fail
        #text_file = open("tsv/%s" % uri, "w")
        #text_file.write(contents)
        #text_file.close()

        self.KEY.set_contents_from_filename(tmp_file, replace=True)

    def get_fields(self, table):
        """
        Returns a dict used as:
            {"column_name": "altered_column_name"}
        Currently only the debug column gets altered
        """
        exclude_fields = ["_qproc_id", "_mob_id", "_gw_id", "_batch_id", "Field"]
        query = "show columns from %s" % (table)
        fields = self.mysql.execute(query)

        #key raw field, value mysql formatted field
        new_fields = {}

        #for field in fields:
        for field in [val[0] for val in fields]:
            if field in exclude_fields:
                continue
            old_field = field

            if "debug_mode" == field.strip():
                field = "IFNULL(debug_mode, 0)"

            new_fields[old_field] = field

        return new_fields

    def log(self, text):
        self.log_handler.write("\n%s" % text)

    def begin_data_import(self, table, redshift_table, fields):
        query = "copy %s (%s) from 's3://bucket/redshift_data_imports/%s' \
            credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '\\t' \
            gzip NULL AS '' COMPUPDATE ON ESCAPE IGNOREHEADER 1;" \
            % (redshift_table, fields, table, self.AWS_ACCESS_KEY, self.AWS_ACCESS_SECRET)

        self.pg.execute(query)
master.py:
from slave import Slave as Slave
import multiprocessing
from mysql_wrapper import MySQLWrap as MySQLWrap
from pgsql_wrapper import PGSQLWrap as PGSQLWrap
class Master:
SLAVE_COUNT = 5
def __init__(self):
self.mysql = MySQLWrap()
self.pg = PGSQLWrap()
def do_work(table):
pass
def get_table_listings(self):
"""Gathers a list of MySQL log tables needed to be imported"""
query = 'show databases'
result = self.mysql.execute(query)
#turns list[tuple] into a flat list
databases = list(sum(result, ()))
#overriding during development
databases = ['db1', 'db2', 'db3']]
exclude = ('mysql', 'Database', 'information_schema')
scannable_tables = []
for database in databases:
if database in exclude:
continue
query = "show tables from %s" % database
result = self.mysql.execute(query)
#turns list[tuple] into a flat list
tables = list(sum(result, ()))
for table in tables:
exclude = ("Tables_in_%s" % database, "(", "201303", "detailed", "ltv")
#exclude any of the unfavorables
if any(s in table for s in exclude):
continue
scannable_tables.append("%s.%s-1" % (database, table))
return scannable_tables
def init(self):
#fetch redshift columns once and cache
#get columns from redshift so we can pad the mysql column delta with nulls
tables = ('table1', 'table2', 'table3')
for table in tables:
#cache columns
query = "SELECT column_name FROM information_schema.columns WHERE \
table_name = '%s'" % (table)
result = self.pg.execute(query, async=False, ret=True)
Slave.COLUMN_CACHE[table] = list(sum(result, ()))
#cache default values
query = "SELECT column_name FROM information_schema.columns WHERE \
table_name = '%s' and column_default is not \
null" % (table)
result = self.pg.execute(query, async=False, ret=True)
#turns list[tuple] into a flat list
result = list(sum(result, ()))
Slave.DEFAULT_COLUMN_VALUES[table] = result
def run(self):
self.init()
job_queue = multiprocessing.JoinableQueue()
tables = self.get_table_listings()
for table in tables:
job_queue.put(table)
processes = []
for i in range(Master.SLAVE_COUNT):
process = multiprocessing.Process(target=slave_runner, args=(job_queue,))
process.daemon = True
process.start()
processes.append(process)
#blocks this process until queue reaches 0
job_queue.join()
#signal each child process to GTFO
for i in range(Master.SLAVE_COUNT):
job_queue.put(None)
#blocks this process until queue reaches 0
job_queue.join()
job_queue.close()
#do not end this process until child processes close out
for process in processes:
process.join()
#toodles !
print("this is master saying goodbye")
def slave_runner(queue):
slave = Slave(queue)
slave.do_work()
There's not enough information to be sure, but the problem is very likely to be that Slave.do_work is raising an unhandled exception. (There are many lines of your code that could do that in various different conditions.)
When you do that, the child process will just exit.
On POSIX systems… well, the full details are a bit complicated, but in the simple case (what you have here), a child process that exits will stick around as a <defunct> process until it gets reaped (because the parent either waits on it, or exits). Since your parent code doesn't wait on the children until the queue is finished, that's exactly what happens.
So, there's a simple duct-tape fix:
def do_work(self):
    self.log(str(os.getpid()))
    while True:
        try:
            # the rest of your code
        except Exception as e:
            self.log("something appropriate {}".format(e))
            # you may also want to post a reply back to the parent
You might also want to break the massive try up into different ones, so you can distinguish between all the different stages where things could go wrong (especially if some of them mean you need a reply, and some mean you don't).
However, it looks like what you're attempting to do is duplicate exactly the behavior of multiprocessing.Pool, but you have missed the mark in a couple of places. Which raises the question: why not just use Pool in the first place? You could then simplify/optimize things even further by using one of the map family of methods. For example, your entire Master.run could be reduced to:
self.init()
pool = multiprocessing.Pool(Master.SLAVE_COUNT, initializer=slave_setup)
pool.map(slave_job, tables)
pool.close()
pool.join()
And this will handle exceptions for you, and allow you to return values/exceptions if you later need that, and let you use the built-in logging library instead of trying to build your own, and so on. And it should only take about a dozen lines of minor code changes to Slave, and then you're done.
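The slave_setup and slave_job names in the snippet above are hypothetical; here is a hedged sketch of what they might look like, assuming a process_one_job helper is factored out of Slave.do_work (roughly its per-job body, returning any follow-up jobs instead of queue.put()-ing them):

# Hypothetical glue for the Pool-based version: one Slave-like worker per
# process, created once via the Pool initializer and reused for every job.
_worker = None

def slave_setup():
    """Runs once in each pool process: open DB connections, log file, etc."""
    global _worker
    _worker = Slave(job_queue=None)  # queue unused; jobs arrive via map()

def slave_job(job):
    """Process a single 'db.table-iteration' job; return any follow-up jobs."""
    # process_one_job is an assumed refactoring of Slave.do_work's loop body.
    return _worker.process_one_job(job)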
If you want to submit new jobs from within jobs, the easiest way to do this is probably with a Future-based API (which turns things around, making the future result the focus and the pool/executor the dumb thing that provides them, instead of making the pool the focus and the result the dumb thing it gives back), but there are multiple ways to do it with Pool as well. For example, right now you're not returning anything from each job, so you can just return a list of tables to execute. Here's a simple example that shows how to do it:
import multiprocessing

def foo(x):
    print(x, x**2)
    return list(range(x))

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [5]
    while jobs:
        jobs, oldjobs = [], jobs
        for job in oldjobs:
            jobs.extend(pool.apply(foo, [job]))
    pool.close()
    pool.join()
Obviously you can condense this a bit by replacing the whole loop with, e.g., a list comprehension fed into itertools.chain, and you can make it a lot cleaner-looking by passing "a submitter" object to each job and adding to that instead of returning a list of new jobs, and so on. But I wanted to make it as explicit as possible to show how little there is to it.
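For instance, the inner loop in the example above could be collapsed to something like this (same behavior, same names, just denser):

import itertools

while jobs:
    # Flatten the per-job result lists into the next round of jobs.
    jobs = list(itertools.chain.from_iterable(pool.apply(foo, [job]) for job in jobs))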
At any rate, if you think the explicit queue is easier to understand and manage, go for it. Just look at the source for multiprocessing.worker and/or concurrent.futures.ProcessPoolExecutor to see what you need to do yourself. It's not that hard, but there are enough things you could get wrong (personally, I always forget at least one edge case when I try to do something like this myself) that it's worth looking at code that gets it right.
Alternatively, it seems like the only reason you can't use concurrent.futures.ProcessPoolExecutor here is that you need to initialize some per-process state (the boto.s3.key.Key, MySqlWrap, etc.), for what are probably very good caching reasons. (If this involves a web-service query, a database connection, etc., you certainly don't want to do that once per task!) But there are a few different ways around that.
For example, you can subclass ProcessPoolExecutor and override the undocumented method _adjust_process_count (see the source for how simple it is) to pass your setup function, and… that's all you have to do.
Or you can mix and match. Wrap the Future from concurrent.futures around the AsyncResult from multiprocessing.
