Tensorflow 2.3: How to parallelize reading text from big file? - python

I need to break down my data-set file of size 4GB into chunks small chunks. As part of optimizing the time consumption, I would like to maximize the parallel processing. Currently, I can observe that cores of CPU and GPU are under utilized. See the attached output in the image here.
My Code snippet looks like below
def _bytes_feature(value):
"""Returns a bytes_list from a string / byte."""
if isinstance(value, type(tf.constant(0))):
value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _float_feature(value):
"""Returns a float_list from a float / double."""
return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
def _int64_feature(value):
"""Returns an int64_list from a bool / enum / int / uint."""
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def serialize_row(text, rating):
# Create a dictionary mapping the feature name to the tf.Example-compatible data type.
feature = {
'text': _bytes_feature(text),
'rating': _float_feature(rating),
}
# Create a Features message using tf.train.Example.
example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
return example_proto.SerializeToString()
def transform(example):
str_example = example.decode("utf-8")
json_example = json.loads(str_example)
overall = json_example.get('overall', -99)
text = json_example.get('reviewText', '')
if type(text) is str:
text = bytes(text, 'utf-8')
tf_serialized_string = serialize_row(text, overall)
return tf_serialized_string
line_dataset = tf.data.TextLineDataset(filenames=[file_path])
line_dataset = line_dataset.map(lambda row: tf.numpy_function(transform, [row], tf.string))
line_dataset = line_dataset.shuffle(2)
line_dataset = line_dataset.batch(NUM_OF_RECORDS_PER_BATCH_FILE)
'''
Perform batchwise transformation of the population.
'''
start = time.time()
for idx, line in line_dataset.enumerate():
FILE_NAMES = 'test{0}.tfrecord'.format(idx)
end = time.time()
time_taken = end - start
tf.print('Processing for file - {0}'.format(FILE_NAMES))
DIRECTORY_URL = '/home/gaurav.gupta/projects/practice/'
filepath = os.path.join(DIRECTORY_URL, 'data-set', 'electronics', FILE_NAMES)
batch_ds = tf.data.Dataset.from_tensor_slices(line)
writer = tf.data.experimental.TFRecordWriter(filepath)
writer.write(batch_ds)
tf.print('Processing for file - {0} took {1}'.format(FILE_NAMES, time_taken))
tf.print('Done')
Logs to showcase execution flow
Processing for file - test0.tfrecord took 14.350863218307495
Processing for file - test1.tfrecord took 12.695453882217407
Processing for file - test2.tfrecord took 12.904462575912476
Processing for file - test3.tfrecord took 12.344425439834595
Processing for file - test4.tfrecord took 11.188365697860718
Processing for file - test5.tfrecord took 11.319620609283447
Processing for file - test6.tfrecord took 11.285977840423584
Processing for file - test7.tfrecord took 11.169529438018799
Processing for file - test8.tfrecord took 11.289997816085815
Processing for file - test9.tfrecord took 11.431073188781738
Processing for file - test10.tfrecord took 11.428141593933105
Processing for file - test11.tfrecord took 3.223125457763672
Done
I have tried num_parallel_reads argument but couldn't see much difference. I believe it can be handy while reading multiple files instead of single big file.
I am seeking for your suggestion to parallelize this task to reduce the time consumption.

I would try something like this (I like to use joblib as it is quite simple to put into existing code, you could probably do something similar with many other frameworks, furthermore, joblib does not use GPU nor it does not use any JITting):
from joblib import Parallel, delayed
from tqdm import tqdm
...
def process_file(idx, line):
FILE_NAMES = 'test{0}.tfrecord'.format(idx)
end = time.time()
time_taken = end - start
tf.print('Processing for file - {0}'.format(FILE_NAMES))
DIRECTORY_URL = '/home/gaurav.gupta/projects/practice/'
filepath = os.path.join(DIRECTORY_URL, 'data-set', 'electronics', FILE_NAMES)
batch_ds = tf.data.Dataset.from_tensor_slices(line)
writer = tf.data.experimental.TFRecordWriter(filepath)
writer.write(batch_ds)
#tf.print('Processing for file - {0} took {1}'.format(FILE_NAMES, time_taken))
return FILE_NAMES, time_taken
times = Parallel(n_jobs=12, prefer="processes")(delayed(process_file)(idx, line) for idx, line in tqdm(line_dataset.enumerate(), total=len(line_dataset)))
print('Done.')
This is untested code and I am also unsure how it will work with the tf code, but I would give it a try.
The tqdm is totally unnecessary, it is just something I prefer to use as it provides a nice progress bar.

Related

How can i optimize my Python loop for speed

I wrote some code that uses OCR to extract text from screenshots of follower lists and then transfer them into a data frame.
The reason I have to do the hustle with "name" / "display name" and removing blank lines is that the initial text extraction looks something like this:
Screenname 1
name 1
Screenname 2
name 2
(and so on)
So I know in which order each extraction will be.
My code works well for 1-30 images, but if I take more than that its gets a bit slow. My goal is to run around 5-10k screenshots through it at once. I'm pretty new to programming so any ideas/tips on how to optimize the speed would be very appreciated! Thank you all in advance :)
from PIL import Image
from pytesseract import pytesseract
import os
import pandas as pd
from itertools import chain
list_final = [""]
list_name = [""]
liste_anzeigename = [""]
list_raw = [""]
anzeigename = [""]
name = [""]
sort = [""]
f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"
os.listdir(f)
for file in os.listdir(f):
f_img = f+"/"+file
img = Image.open(f_img)
img = img.crop((240, 400, 800, 2400))
img.save(f_img)
for file in os.listdir(f):
f_img = f + "/" + file
test = pytesseract.image_to_string(PIL.Image.open(f_img), config=myconfig)
lines = test.split("\n")
list_raw = [line for line in lines if line.strip() != ""]
sort.append(list_raw)
name = {list_raw[0], list_raw[2], list_raw[4],
list_raw[6], list_raw[8], list_raw[10],
list_raw[12], list_raw[14], list_raw[16]}
list_name.append(name)
anzeigename = {list_raw[1], list_raw[3], list_raw[5],
list_raw[7], list_raw[9], list_raw[11],
list_raw[13], list_raw[15], list_raw[17]}
liste_anzeigename.append(anzeigename)
reihenfolge_name = list(chain.from_iterable(list_name))
index_anzeigename = list(chain.from_iterable(liste_anzeigename))
sortieren = list(chain.from_iterable(sort))
print(list_raw)
sort_name = sorted(reihenfolge_name, key=sortieren.index)
sort_anzeigename = sorted(index_anzeigename, key=sortieren.index)
final = pd.DataFrame(zip(sort_name, sort_anzeigename), columns=['name', 'anzeigename'])
print(final)
Use a multiprocessing.Pool.
Combine the code under the for-loops, and put it into a function process_file.
This function should accept a single argument; the name of a file to process.
Next using listdir, create a list of files to process.
Then create a Pool and use its map method to process the list;
import multiprocessing as mp
def process_file(name):
# your code goes here.
return anzeigename # Or watever the result should be.
if __name__ is "__main__":
f = r'/Users/PycharmProjects/pythonProject/images'
p = mp.Pool()
liste_anzeigename = p.map(process_file, os.listdir(f))
This will run your code in parallel in as many cores as your CPU has.
For a N-core CPU this will take approximately 1/N times the time as doing it without multiprocessing.
Note that the return value of the worker function should be pickleable; it has to be returned from the worker process to the parent process.

Using concurrent.futures within a for statement

I store QuertyText within a pandas dataframe. Once I've loaded all the queries into I want to conduct an analysis again each query. Currently, I have ~50k to evaluate. So, doing it one by one, will take a long time.
So, I wanted to implement concurrent.futures. How do I take the individual QueryText stored within fullAnalysis as pass it to concurrent.futures and return the output as a variable?
Here is my entire code:
import pandas as pd
import time
import gensim
import sys
import warnings
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
fullAnalysis = pd.DataFrame()
def fetch_data(jFile = 'ProcessingDetails.json'):
print("Fetching data...please wait")
#read JSON file for latest dictionary file name
baselineDictionaryFileName = 'Dictionary/Dictionary_05-03-2020.json'
#copy data to pandas dataframe
labelled_data = pd.read_json(baselineDictionaryFileName)
#Add two more columns to get the most similar text and score
labelled_data['SimilarText'] = ''
labelled_data['SimilarityScore'] = float()
print("Data fetched from " + baselineDictionaryFileName + " and there are " + str(labelled_data.shape[0]) + " rows to be evalauted")
return labelled_data
def calculateScore(inputFunc):
warnings.filterwarnings("ignore", category=DeprecationWarning)
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
inp = inputFunc
print(inp)
out = dict()
strEvaluation = inp.split("most_similar ",1)[1]
#while inp != 'quit':
split_inp = inp.split()
try:
if split_inp[0] == 'help':
pass
elif split_inp[0] == 'similarity' and len(split_inp) >= 3:
pass
elif split_inp[0] == 'most_similar' and len(split_inp) >= 2:
for pair in model.most_similar(positive=[split_inp[1]]):
out.update({pair[0]: pair[1]})
except KeyError as ke:
#print(str(ke) + "\n")
inp = input()
return out
def main():
with ThreadPoolExecutor(max_workers=5) as executor:
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
#for item in executor.map(calculateScore, arg):
output = executor.map(calculateScore, arg)
return output
if __name__ == "__main__":
fullAnalysis = fetch_data()
results = main()
print(f'results: {results}')
The Python Global Interpreter Lock or GIL allows only one thread to hold control of the Python interpreter. Since your function calculateScore might be cpu-bound and requires the interpreter to execute its byte code, you may be gaining little by using threading. If, on the other hand, it were doing mostly I/O operations, it would be giving up the GIL for most of its running time allowing other threads to run. But that does not seem to be the case here. You probably should be using the ProcessPoolExecutor from concurrent.futures (try it both ways and see):
def main():
with ProcessPoolExecutor(max_workers=None) as executor:
the_futures = {}
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
future = executor.submit(calculateScore, arg)
the_futures[future] = i # map future to request
for future in as_completed(the_futures): # results as they become available not necessarily the order of submission
i = the_futures[future] # the original index
result = future.result() # the result
If you omit the max_workers parameter (or specify a value of None) from the ProcessPoolExecutor constructor, the default will be the number of processors you have on your machine (not a bad default). There is no point in specifying a value larger than the number of processors you have.
If you do not need to tie the future back to the original request, then the_futures can just be a list to which But simplest yest in not even to bother to use the as_completed method:
def main():
with ProcessPoolExecutor(max_workers=5) as executor:
the_futures = []
for i in range(len(fullAnalysis)):
text = fullAnalysis['QueryText'][i]
arg = 'most_similar'+ ' ' + text
future = executor.submit(calculateScore, arg)
the_futures.append(future)
# wait for the completion of all the results and return them all:
results = [f.result() for f in the_futures()] # results in creation order
return results
It should be mentioned that code that launches the ProcessPoolExecutor functions should be in a block governed by a if __name__ = '__main__':. If it isn't you will get into a recursive loop with each subprocess launching the ProcessPoolExecutor. But that seems to be the case here. Perhaps you meant to use the ProcessPoolExecutor all along?
Also:
I don't know what the line ...
model = gensim.models.Word2Vec.load('w2v_model_bigdata')
... in function calculateStore does. It may be the one i/o-bound statement. But this appears to be something that does not vary from call to call. If that is the case and model is not being modified in the function, shouldn't this statement be moved out of the function and computed just once? Then this function would clearly run faster (and be clearly cpu-bound).
Also:
The exception block ...
except KeyError as ke:
#print(str(ke) + "\n")
inp = input()
... is puzzling. You are inputting a value that will never be used right before returning. If this is to pause execution, there is no error message being output.
With Booboo assistance, I was able to update code to include ProcessPoolExecutor. Here is my updated code. Overall, processing has been speed up by more than 60%.
I did run into a processing issue and found this topic BrokenPoolProcess that addresses the issue.
output = {}
thePool = {}
def main(labelled_data, dictionaryRevised):
args = sys.argv[1:]
with ProcessPoolExecutor(max_workers=None) as executor:
for i in range(len(labelled_data)):
text = labelled_data['QueryText'][i]
arg = 'most_similar'+ ' '+ text
output = winprocess.submit(
executor, calculateScore, arg
)
thePool[output] = i #original index for future to request
for output in as_completed(thePool): # results as they become available not necessarily the order of submission
i = thePool[output] # the original index
text = labelled_data['QueryText'][i]
result = output.result() # the result
maximumKey = max(result.items(), key=operator.itemgetter(1))[0]
maximumValue = result.get(maximumKey)
labelled_data['SimilarText'][i] = maximumKey
labelled_data['SimilarityScore'][i] = maximumValue
return labelled_data, dictionaryRevised
if __name__ == "__main__":
start = time.perf_counter()
print("Starting to evaluate Query Text for labelling...")
output_Labelled_Data, output_dictionary_revised = preProcessor()
output,dictionary = main(output_Labelled_Data, output_dictionary_revised)
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Python data generation script slows down with time

EDIT 1:
As fizzybear pointed out it looks as though my memory usage is steadily increasing but I can't say why, any ideas would be greatly appreciated.
I'm running a script which uses the staticfg library to generate a tonne of control flow graphs from python programs, approximately 150,000 programs. My code simply loops through every program's file location and generates a corresponding control flow graph.
From a frequently updated progress bar I can see that when the script begins running it easily generates around 1000 CFGs in a few seconds, but half an hour into running it can barely generate 100 CFGs within a minute.
In an attempt to sped things up I implemented multi threading using python's multiprocessing map() function but this doesn't help enough.
Furthermore, the cpu utilization (for all cores) shoots up to around 80-90% at the beginning of the script but drops to around 30-40% after running for a few minutes.
I've tried running it on Windows 10 and Ubuntu 18.04 and both slow down to an almost unbearable speed.
Code for building control-flow-graph
from staticfg import CFGBuilder
def process_set():
content = get_file_paths()
iterate(build_cfg, ERROR_LOG_FILE, content)
def build_cfg(file_path):
cfg = CFGBuilder().build_from_file(os.path.basename(file_path), os.path.join(DATA_PATH, file_path))
cfg.build_visual(get_output_data_path(file_path), format='dot', calls=False, show=False)
os.remove(get_output_data_path(file_path)) # Delete the other weird file created
Code for running the cfg building
from threading import Lock
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
def iterate(task, error_file_path, content):
progress_bar = ProgressBar(0, content.__len__(), prefix='Progress:', suffix='Complete')
progress_bar.print_progress_bar()
error_file_lock = Lock()
increment_work_lock = Lock()
increment_errors_lock = Lock()
def an_iteration(file):
try:
task(file)
except Exception as e:
with increment_errors_lock:
progress_bar.increment_errors()
with error_file_lock:
handle_exception(error_file_path, file, 'Error in doing thing', e)
finally:
with increment_work_lock:
progress_bar.increment_work()
progress_bar.print_progress_bar()
pool = multiprocessing.dummy.Pool(multiprocessing.cpu_count())
pool.map(an_iteration, content)
Code for error handling
def handle_exception(error_log_file_path, file_path, message, stacktrace):
with open(error_log_file_path, 'a+', encoding='utf8') as f:
f.write('\r{},{},{},{}\n'.format(str(datetime.datetime.now()), message, file_path, stacktrace))
As far as I can tell (?) there is no object ever increasing in size and no increasing lookup time somewhere, so I'm a little lost as to why the script should be slowing down at all. Any help would be greatly appreciated.
I'm also pretty sure that it's not the contention for the locks that is slowing down the program as I was having this problem before I implemented multi threading, and contention should be pretty low anyway because the CFG building should take up a lot more time than updating the progress bar. Furthermore, errors aren't that frequent so writing to the error log doesn't happen too often, not enough to justify a lot of contention.
Cheers.
Edit 2:
Code for progress bar in case that affects the memory usage
class ProgressBar:
def __init__(self, iteration, total, prefix='', suffix='', decimals=1, length=100, fill='█'):
self.iteration = iteration
self.total = total
self.prefix = prefix
self.suffix = suffix
self.decimals = decimals
self.length = length
self.fill = fill
self.errors = 0
def increment_work(self):
self.iteration += 1
def increment_errors(self):
self.errors += 1
def print_progress_bar(self):
percent = ("{0:." + str(self.decimals) + "f}").format(100 * (self.iteration / float(self.total)))
filled_length = int(self.length * self.iteration // self.total)
bar = self.fill * filled_length + '-' * (self.length - filled_length)
print('%s |%s| %s%% (%s/%s) %s, %s %s' % (self.prefix, bar, percent, self.iteration, self.total, self.suffix, str(self.errors), 'errors'), end='\r')
# Print New Line on Complete
if self.iteration == self.total:
print()

How do I gather performance metrics for GDI and user Objects using python

Think this is my first question I have asked on here normally find all the answers I need (so thanks in advance)
ok my problem I have written a python program that will in threads monitor a process and output the results to a csv file for later. This code is working great I am using win32pdhutil for the counters and WMI, Win32_PerfRawData_PerfProc_Process for the CPU %time. I have now been asked to monitor a WPF application and specifically monitor User objects and GDI objects.
This is where I have a problem, it is that i can't seem to find any python support for gathering metrics on these two counters. these two counters are easily available in the task manager I find it odd that there is very little information on these two counters. I am specifically looking at gathering these to see if we have a memory leak, I don't want to install anything else on the system other than python that is already installed. Please can you peeps help with finding a solution.
I am using python 3.3.1, this will be running on a windows platform (mainly win7 and win8)
This is the code i am using to gather the data
def gatherIt(self,whoIt,whatIt,type,wiggle,process_info2):
#this is the data gathering function thing
data=0.0
data1="wobble"
if type=="counter":
#gather data according to the attibutes
try:
data = win32pdhutil.FindPerformanceAttributesByName(whoIt, counter=whatIt)
except:
#a problem occoured with process not being there not being there....
data1="N/A"
elif type=="cpu":
try:
process_info={}#used in the gather CPU bassed on service
for x in range(2):
for procP in wiggle.Win32_PerfRawData_PerfProc_Process(name=whoIt):
n1 = int(procP.PercentProcessorTime)
d1 = int(procP.Timestamp_Sys100NS)
#need to get the process id to change per cpu look...
n0, d0 = process_info.get (whoIt, (0, 0))
try:
percent_processor_time = (float (n1 - n0) / float (d1 - d0)) *100.0
#print whoIt, percent_processor_time
except ZeroDivisionError:
percent_processor_time = 0.0
# pass back the n0 and d0
process_info[whoIt] = (n1, d1)
#end for loop (this should take into account multiple cpu's)
# end for range to allow for a current cpu time rather that cpu percent over sampleint
if percent_processor_time==0.0:
data=0.0
else:
data=percent_processor_time
except:
data1="N/A"
else:
#we have done something wrong so data =0
data1="N/A"
#endif
if data == "[]":
data=0.0
data1="N/A"
if data == "" :
data=0.0
data1="N/A"
if data == " ":
data=0.0
data1="N/A"
if data1!="wobble" and data==0.0:
#we have not got the result we were expecting so add a n/a
data=data1
return data
cheers
edited for correct cpu timings issue if anyone tried to run it :D
so after a long search i was able to mash something together that gets me the info needed.
import time
from ctypes import *
from ctypes.wintypes import *
import win32pdh
# with help from here http://coding.derkeiler.com/Archive/Python/comp.lang.python/2007-10/msg00717.html
# the following has been mashed together to get the info needed
def GetProcessID(name):
object = "Process"
items, instances = win32pdh.EnumObjectItems(None, None, object, win32pdh.PERF_DETAIL_WIZARD)
val = None
if name in instances :
tenQuery = win32pdh.OpenQuery()
tenarray = [ ]
item = "ID Process"
path = win32pdh.MakeCounterPath( ( None, object, name, None, 0, item ) )
tenarray.append( win32pdh.AddCounter( tenQuery, path ) )
win32pdh.CollectQueryData( tenQuery )
time.sleep( 0.01 )
win32pdh.CollectQueryData( tenQuery )
for tencounter in tenarray:
type, val = win32pdh.GetFormattedCounterValue( tencounter, win32pdh.PDH_FMT_LONG )
win32pdh.RemoveCounter( tencounter )
win32pdh.CloseQuery( tenQuery )
return val
processIDs = GetProcessID('OUTLOOK') # Remember this is case sensitive
PQI = 0x400
#open a handle on to the process so that we can query it
OpenProcessHandle = windll.kernel32.OpenProcess(PQI, 0, processIDs)
# OK so now we have opened the process now we want to query it
GR_GDIOBJECTS, GR_USEROBJECTS = 0, 1
print(windll.user32.GetGuiResources(OpenProcessHandle, GR_GDIOBJECTS))
print(windll.user32.GetGuiResources(OpenProcessHandle, GR_USEROBJECTS))
#so we have what we want we now close the process handle
windll.kernel32.CloseHandle(OpenProcessHandle)
hope that helps
For GDI count, I think a simpler, cleaner monitoring script is as follows:
import time, psutil
from ctypes import *
def getPID(processName):
for proc in psutil.process_iter():
try:
if processName.lower() in proc.name().lower():
return proc.pid
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
pass
return None;
def getGDIcount(PID):
PH = windll.kernel32.OpenProcess(0x400, 0, PID)
GDIcount = windll.user32.GetGuiResources(PH, 0)
windll.kernel32.CloseHandle(PH)
return GDIcount
PID = getPID('Outlook')
while True:
GDIcount = getGDIcount(PID)
print(f"{time.ctime()}, {GDIcount}")
time.sleep(1)

Memory limit hit with appengine-mapreduce

I'm working on appengine-mapreduce function and have modified the demo to fit my purpose.
Basically I have a million over lines in the following format: userid, time1, time2. My purpose is to find the difference between time1 and time2 for each userid.
However, as I run this on Google App Engine, I encountered this error message in the logs section:
Exceeded soft private memory limit with 180.56 MB after servicing 130 requests total
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
def time_count_map(data):
"""Time count map function."""
(entry, text_fn) = data
text = text_fn()
try:
q = text.split('\n')
for m in q:
reader = csv.reader([m.replace('\0', '')], skipinitialspace=True)
for s in reader:
"""Calculate time elapsed"""
sdw = s[1]
start_date = time.strptime(sdw,"%m/%d/%y %I:%M:%S%p")
edw = s[2]
end_date = time.strptime(edw,"%m/%d/%y %I:%M:%S%p")
time_difference = time.mktime(end_date) - time.mktime(start_date)
yield (s[0], time_difference)
except IndexError, e:
logging.debug(e)
def time_count_reduce(key, values):
"""Time count reduce function."""
time = 0.0
for subtime in values:
time += float(subtime)
realtime = int(time)
yield "%s: %d\n" % (key, realtime)
Can anyone suggest how else I can optimize my code better? Thanks!!
Edited:
Here's the pipeline handler:
class TimeCountPipeline(base_handler.PipelineBase):
"""A pipeline to run Time count demo.
Args:
blobkey: blobkey to process as string. Should be a zip archive with
text files inside.
"""
def run(self, filekey, blobkey):
logging.debug("filename is %s" % filekey)
output = yield mapreduce_pipeline.MapreducePipeline(
"time_count",
"main.time_count_map",
"main.time_count_reduce",
"mapreduce.input_readers.BlobstoreZipInputReader",
"mapreduce.output_writers.BlobstoreOutputWriter",
mapper_params={
"blob_key": blobkey,
},
reducer_params={
"mime_type": "text/plain",
},
shards=32)
yield StoreOutput("TimeCount", filekey, output)
Mapreduce.yaml:
mapreduce:
- name: Make messages lowercase
params:
- name: done_callback
value: /done
mapper:
handler: main.lower_case_posts
input_reader: mapreduce.input_readers.DatastoreInputReader
params:
- name: entity_kind
default: main.Post
- name: processing_rate
default: 100
- name: shard_count
default: 4
- name: Make messages upper case
params:
- name: done_callback
value: /done
mapper:
handler: main.upper_case_posts
input_reader: mapreduce.input_readers.DatastoreInputReader
params:
- name: entity_kind
default: main.Post
- name: processing_rate
default: 100
- name: shard_count
default: 4
The rest of the files are exactly the same as the demo.
I've uploaded a copy of my codes on dropbox: http://dl.dropbox.com/u/4288806/demo%20compressed%20fail%20memory.zip
Also consider calling gc.collect() at regular points during your code. I've seen several SO questions about exceeding soft memory limits that were alleviated by calling gc.collect(), most having to do with blobstore.
It is likely your input file exceeds the soft memory limit in size. For big files use either BlobstoreLineInputReader or BlobstoreZipLineInputReader.
These input readers pass something different to the map function, they pass the start_position in the file and the line of text.
Your map function might look something like:
def time_count_map(data):
"""Time count map function."""
text = data[1]
try:
reader = csv.reader([text.replace('\0', '')], skipinitialspace=True)
for s in reader:
"""Calculate time elapsed"""
sdw = s[1]
start_date = time.strptime(sdw,"%m/%d/%y %I:%M:%S%p")
edw = s[2]
end_date = time.strptime(edw,"%m/%d/%y %I:%M:%S%p")
time_difference = time.mktime(end_date) - time.mktime(start_date)
yield (s[0], time_difference)
except IndexError, e:
logging.debug(e)
Using BlobstoreLineInputReader will allow the job to run much faster as it can use more than one shard, up to 256, but it means you need to upload your files uncompressed, which can be a pain. I handle it by uploading the compressed files to an EC2 windows server, then decompress and upload from there, since upstream bandwidth is so big.

Categories