Ordering data from returned pool.apply_async


I am currently writing a steganography program, and most of what I want is working. However, I want to rebuild my message using multiple processes, which means the bits returned from the processes need to be put back in order. So currently I have:
OK, I'm home now, so I will put some actual code up.
def message_unhide(data):
    inp = cv.LoadImage(data[0])  # data[0] = path to image
    steg = LSBSteg(inp)
    bin = steg.unhideBin()
    return bin

# code in main program underneath
count = 0
f = open(files[2], "wb")  # files[2] = name of file to rebuild
fat = open("fat.txt", 'w+')
inp = cv.LoadImage(files[0][count])  # files[0] = directory path of images
steg = LSBSteg(inp)
bin = steg.unhideBin()
fat.write(bin)
fat.close()
fat = open("fat.txt", 'rb')
num_files = fat.read()  # amount of images message hidden across
fat.close()
count += 1
pool = Pool(5)
binary = []
''' Just something I was testing
for x in range(int(num_files)):
    binary.append(0)
print (binary)
'''
while count <= int(num_files):
    data = [files[0][count], count]
    #f.write(pool.apply(message_unhide, args=(data, ))) #
    #binary[count - 1] = [pool.apply_async(message_unhide, (data, ))] #
    #again just another few ways i was trying to overcome
    binary = [pool.apply_async(message_unhide, (data, ))]
    count += 1
pool.close()
pool.join()
bits = [b.get() for b in binary]
print(binary)
#for b in bits:
#    f.write(b)
f.close()
This method just overwrites binary:
binary = [pool.apply_async(message_unhide, (data, ))]
This method fills the entire binary list, but I lose the .get():
binary[count - 1] = [pool.apply_async(message_unhide, (data, ))]
Sorry for the sloppy coding; I am certainly no expert.

Your main issue has to do with overwriting binary in the loop. You only have one item in the list because you're throwing away the previous list and recreating it each time. Instead, you should use append to modify the existing list:
binary.append(pool.apply_async(message_unhide, (data, )))
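A minimal sketch of the append approach, assuming a hypothetical fake_unhide in place of message_unhide so that no OpenCV or LSBSteg is needed. It uses multiprocessing.pool.ThreadPool, which exposes the same apply_async interface as Pool, only so that the snippet runs without a __main__ guard. The results come back in submission order because get() is called on the AsyncResult objects in the order they were appended, regardless of which task finishes first.

```python
import time
from multiprocessing.pool import ThreadPool  # same apply_async interface as Pool


def fake_unhide(data):
    """Hypothetical stand-in for message_unhide; data is [path, index]."""
    path, index = data
    # Make earlier tasks slower so they finish out of order.
    time.sleep(0.01 * max(0, 3 - index))
    return "bits-from-" + path


def collect_in_order(paths):
    pool = ThreadPool(5)
    pending = []
    for index, path in enumerate(paths):
        # Keep every AsyncResult; the list position records the submission order.
        pending.append(pool.apply_async(fake_unhide, ([path, index],)))
    pool.close()
    pool.join()
    # get() is called in submission order, not completion order.
    return [result.get() for result in pending]
```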
But you might have a much nicer time if you use pool.map instead of rolling your own version. It expects an iterable yielding a single argument to pass to the function on each iteration, and it returns a list of the return values. The map call blocks until all the values are ready, so you don't need any other synchronization logic.
Here's an implementation using a generator expression to build the data argument items on the fly. You could simplify things and just pass files[0] to map if you rewrote message_unhide to accept the filename as its argument directly, without indexing a list (you never use the index, it seems):
# no loop this time
binary = pool.map(message_unhide, ([file, i] for i, file in enumerate(files[0])))
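A sketch of the simplified form, assuming message_unhide were rewritten to take the filename directly (unhide_path below is a hypothetical stand-in). ThreadPool is swapped in for Pool purely so the snippet runs without a __main__ guard; map behaves the same way on both. Since the question reads the file count from the first image, slicing that image off before mapping may be what is wanted; the file names here are made up.

```python
from multiprocessing.pool import ThreadPool  # same map() interface as Pool


def unhide_path(path):
    """Hypothetical stand-in for a message_unhide that takes a filename."""
    return "bits:" + path


def unhide_all(paths, workers=5):
    pool = ThreadPool(workers)
    try:
        # map() returns results in the same order as the input iterable.
        return pool.map(unhide_path, paths)
    finally:
        pool.close()
        pool.join()


image_paths = ["fat.png", "part1.png", "part2.png"]  # made-up names
payload = unhide_all(image_paths[1:])  # skip the image that holds the count
```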

Related

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)
new_dataset = parallel_function(parallel_request)
for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and then again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request, then continue on to the next item? Once parallel_request is filled, the parallel function would execute, and the loop would then resume for each item, restoring the previously saved context (local variables).
EDIT: I think one solution would be to use a function instead of a loop and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
new_data = None

def process_data(dataset, index=0):
    global new_data
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if index + 1 < len(dataset):
        process_data(dataset, index + 1)
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata, etc.
    # Frames unwind last-to-first, so final_output ends up in reverse order.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Yet another would be to just recompute them, which would also require code duplication.
So I suspect to achieve this you'd need some specific Python facility that enables this.

I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(data, index, parallel_requests, results, final_output):
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata, etc.
    final_data = merge(subdata, results[index], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
results = []  # filled in place once the parallel call returns
funcs = [process_data(datum, i, parallel_requests, results, final_output)
         for i, datum in enumerate(dataset)]
for f in funcs:
    next(f)  # run each generator up to its yield
results.extend(process_requests(parallel_requests))
for f in funcs:
    next(f, None)  # resume; the default swallows StopIteration
The output list and generator calls are general enough that you can abstract these lines away in a helper function that sets everything up and calls the generators for you. The result is very clean: the code overhead is one line for the function definition and one line to call the helper.
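One possible shape for such a helper, as a sketch only. The names run_two_phase, increment_step and square_all are hypothetical, and the sketch assumes the parallel function returns its results in the same order as the requests it was given.

```python
def run_two_phase(step, dataset, parallel_function):
    """Run step(item, ...) up to its yield for every item, call
    parallel_function once on the collected requests, then resume each step."""
    requests, results, output = [], [], []
    gens = [step(item, i, requests, results, output)
            for i, item in enumerate(dataset)]
    for g in gens:
        next(g)  # serial phase: runs until the yield
    results.extend(parallel_function(requests))
    for g in gens:
        next(g, None)  # resume phase: the default swallows StopIteration
    return output


def increment_step(item, index, requests, results, output):
    prepared = item + 1  # serial work whose result must survive the pause
    requests.append(prepared)
    yield
    output.append((item, prepared, results[index]))


def square_all(requests):
    """Stand-in for the parallel function; answers in request order."""
    return [r * r for r in requests]


merged = run_two_phase(increment_step, [1, 2, 3], square_all)
```

Each generator keeps its own local variables alive across the yield, so nothing is recomputed and nothing has to be stored in a side array.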

Delete element of list in pool.map() python

processPool.map(parserMethod, ((inputFile[line:line + chunkSize], sharedQueue) for line in xrange(0, lengthOfFile, chunkSize)))
Here, I am passing control to parserMethod with a tuple of parameters: inputFile[line:line + chunkSize] and a sharedQueue.
Can anyone tell me how I can delete the elements of inputFile[line:line + chunkSize] after they are passed to parserMethod?
Thanks!

del inputFile[line:line + chunkSize]
will remove those items. However, your map is stepping through the entire file, which makes me wonder: are you trying to remove them as they're parsed? This requires the map or parser to alter an input argument, which invites trouble.
If you're only trying to save memory, it's a little late: you already saved the entire file in inputFile. If you need only to clean up after the parsing, then use the extreme form of delete, once, after the parsing is finished:
del inputFile[:]
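A small sketch of why deleting while stepping through by index invites trouble: after each slice deletion the remaining items shift left, so offsets computed up front no longer point at the intended lines. The helper name drop_chunk and the sample data are made up for illustration.

```python
def drop_chunk(lines, start, size):
    """Return lines[start:start + size] and delete that slice in place."""
    chunk = lines[start:start + size]  # slicing makes a copy
    del lines[start:start + size]      # the copy is unaffected by this
    return chunk


lines = ["l0", "l1", "l2", "l3", "l4", "l5"]
first = drop_chunk(lines, 0, 2)
# The list has shifted left, so the next chunk also starts at index 0,
# not at index 2 as a precomputed range(0, len, 2) would assume.
second = drop_chunk(lines, 0, 2)
```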
If you want to reduce the memory requirement up front, you have to back up a step. Instead of putting the entire file into a list, try making a nice input pipeline. You didn't post the context of this code, so I'm going to use a generic case with a couple of name assumptions:
def line_chunk_stream(input_stream, chunk_size):
    # Generator to return a stream of parsing units,
    # <chunk_size> lines each.
    # To make sure you could check the logic here,
    # I avoided several Pythonic short-cuts.
    line_count = 0
    parse_chunk = []
    for line in input_stream:
        line_count += 1
        parse_chunk.append(line)
        if line_count % chunk_size == 0:
            yield parse_chunk
            # Start a fresh list so the chunk just yielded is not emptied.
            parse_chunk = []
    if parse_chunk:
        yield parse_chunk  # final, shorter chunk

input_stream = open("source_file", 'r')
parse_stream = line_chunk_stream(input_stream, chunk_size)
parserMethod(parse_stream)
I hope that at least one of these solves your underlying problem.
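For comparison, a more compact sketch of the same chunking idea using itertools.islice. The name chunked_lines is hypothetical, and io.StringIO stands in for an open text file; each yielded chunk could then be handed to the parser one at a time.

```python
import io
from itertools import islice


def chunked_lines(stream, chunk_size):
    """Yield lists of up to chunk_size lines, reading lazily from stream."""
    stream = iter(stream)  # make sure islice continues where it left off
    while True:
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            return
        yield chunk


source = io.StringIO(u"a\nb\nc\nd\ne\n")  # stand-in for an open file
chunks = list(chunked_lines(source, 2))
```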

Merging lists obtained by a loop

I've only started python recently but am stuck on a problem.
# function that tells how to read the urls and how to process the data the
# way I need it.
def htmlreader(i):
    # makes variable websites because it is used in a loop.
    pricedata = urllib2.urlopen(
        "http://website.com/" + (",".join(priceids.split(",")[i:i + 200]))).read()
    # here my information processing begins but that is fine.
    pricewebstring = pricedata.split("},{")
    # results in [[1234,2345,3456],[3456,4567,5678]] for example.
    array1 = [re.findall(r"\d+", a) for a in pricewebstring]
    # writes obtained array to my text file
    itemtxt2.write(str(array1) + '\n')

i = 0
while i <= totalitemnumber:
    htmlreader(i)
    i = i + 200
See the comments in the script as well.
This is in a loop and will each time give me an array (defined by array1).
Because I print this to a txt file it results in a txt file with separate arrays.
I need one big array so it needs to merge the results of htmlreader(i).
So my output is something like:
[[1234,2345,3456],[3456,4567,5678]]
[[6789,4567,2345],[3565,1234,2345]]
But I want:
[[1234,2345,3456],[3456,4567,5678],[6789,4567,2345],[3565,1234,2345]]
Any ideas how I can approach this?

Since you want to gather all the elements in a single list, you can collect them in one shared list by extending it on every call, like this:
def htmlreader(i, result):
    ...
    result.extend([re.findall(r"\d+", a) for a in pricewebstring])

i, result = 0, []
while i <= totalitemnumber:
    htmlreader(i, result)
    i = i + 200
itemtxt2.write(str(result) + '\n')
In this case, each inner list produced by re.findall is added to result as a separate element, so result stays a single list of rows. Finally, the entire list is written to the file once.
If the method shown above is confusing, change it like this:
def htmlreader(i):
    ...
    return [re.findall(r"\d+", a) for a in pricewebstring]

i, result = 0, []
while i <= totalitemnumber:
    result.extend(htmlreader(i))
    i = i + 200
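A self-contained sketch of the second variant, with the network call left out. The names parse_batch and merge_batches and the sample strings are made up for illustration; note that re.findall returns strings, so the merged rows contain digit strings rather than integers.

```python
import re


def parse_batch(raw):
    """Split one response on "},{" and pull the digit runs out of each piece."""
    return [re.findall(r"\d+", part) for part in raw.split("},{")]


def merge_batches(raw_batches):
    merged = []
    for raw in raw_batches:
        # extend() adds each inner list as its own element,
        # so the result is one list of rows rather than a list of batches.
        merged.extend(parse_batch(raw))
    return merged


sample = ['{"a":1234,"b":2345},{"a":3456,"b":4567}', '{"a":6789}']
rows = merge_batches(sample)
```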

List integration as argument (beginner)

I am writing a script in python, but I am a beginner (started yesterday).
Basically, I just create chunks that I fill with ~10 pictures, align them, build the model, and build the texture. Now I have my chunks and I want to align them...
From the manual:
PhotoScan.alignChunks(chunks, reference, method='points', accuracy='high', preselection=False)
Aligns specified set of chunks.
Parameters
chunks (list) – List of chunks to be aligned.
reference (Chunk) – Chunk to be used as a reference.
method (string) – Alignment method in ['points', 'markers'].
accuracy (string) – Alignment accuracy in ['high', 'medium', 'low'].
preselection (boolean) – Enables image pair preselection.
Returns Success of operation.
Return type boolean
I tried to align the chunks, but the script throws an error at line 26:
TypeError: expected a list of chunks as an argument
Do you have any idea how I can make it work?
This is my current code:
import PhotoScan
doc = PhotoScan.app.document
main_doc = PhotoScan.app.document
chunk = PhotoScan.Chunk()
proj = PhotoScan.GeoProjection()
proj.init("EPSG::32641")
gc = chunk.ground_control
gc.projection = proj
working_path = "x:\\New_agisoft\\ok\\Optical\\"
for i in range (1,3):
    new_chunk = PhotoScan.Chunk()
    new_chunk.label = str(i)
    loop = i*10
    loo = (i-1)*10
    doc.chunks.add(new_chunk)
    for j in range (loo,loop):
        file_path = working_path + str(j) + ".jpg"
        new_chunk.photos.add(file_path)
    gc = new_chunk.ground_control
    gc.loadExif()
    gc.apply()
    main_doc.active = len(main_doc.chunks) - 1
    doc.activeChunk.alignPhotos(accuracy="low", preselection="ground control")
    doc.activeChunk.buildModel(quality="lowest", object="height field", geometry="smooth", faces=50000)
    doc.activeChunk.buildTexture(mapping="generic", blending="average", width=2048, height=2048)
PhotoScan.alignChunks(,1,method="points",accuracy='low', preselection=True)

PhotoScan.alignChunks(,1,method="points",accuracy='low', preselection=True)
                      ^
Before the ',' you need the chunks!

Note: I have never used this module.
You're calling PhotoScan.alignChunks with an empty first argument, while the documentation states that it expects a list of chunks.
You could initialize an empty list before your loop:
chunks = []
And add completed chunks to the list from inside the loop:
# ...
chunks.append(new_chunk)
Then call the function:
PhotoScan.alignChunks(chunks, chunks[0], ...)
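The collect-then-pass pattern can be tried without PhotoScan installed. In the sketch below, FakeChunk and fake_align_chunks are hypothetical stand-ins written for this example; they are not part of the PhotoScan API and only mimic the argument check behind the error message.

```python
class FakeChunk(object):
    """Hypothetical stand-in for a chunk object; not part of PhotoScan."""

    def __init__(self, label):
        self.label = label


def fake_align_chunks(chunks, reference):
    """Hypothetical stand-in that only mimics the argument check."""
    if not isinstance(chunks, list):
        raise TypeError("expected a list of chunks as an argument")
    return reference in chunks


def build_and_align(count):
    chunks = []
    for i in range(1, count + 1):
        new_chunk = FakeChunk(str(i))
        # ... add photos, align them, build the model and texture ...
        chunks.append(new_chunk)  # remember each finished chunk
    # Pass the whole list, with the first chunk as the reference.
    return chunks, fake_align_chunks(chunks, chunks[0])
```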

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on, some 10,000 lines in each file.
I have a variable whose value is, say, a = .3344.
From the file I want to get the row number of the row whose first column is closest to this variable. For example, it should give row_num = 3, as .3432 is closest to it.
I have tried loading the first column's elements into a list, comparing the variable to each element, and taking the index of the closest one.
This method is very time consuming and slows down my model. I need a very quick method, as this needs to be called at least 1,000 times.
I want a method with the least overhead. Can anyone please tell me how this can be done very fast?
As the file size is at most 100 KB, can this be done directly without loading it into a list or anything similar? If yes, how?
Any method quicker than the one mentioned above is welcome. I am desperate to improve the speed, so please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, which copies the file into a list. Please suggest better alternatives.

Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        # Integer division, so that __len__ returns an int
        self._len = file.tell() // self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise IndexError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
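import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    if i == 0:
        return 0
    if i == len(fw):
        return len(fw) - 1
    # Here fw[i - 1] < target <= fw[i]; pick whichever is nearer.
    if target - fw[i - 1] <= fw[i] - target:
        return i - 1
    else:
        return i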
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
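One way to check that concern is to try bisect on a small sequence-like class directly. The sketch below uses a hypothetical FirstColumn class over an in-memory io.BytesIO holding the sample rows, and assumes a non-empty file in which every line has the same byte length.

```python
import bisect
import io


class FirstColumn(object):
    """Read-only sequence view of the first column of a fixed-width file."""

    def __init__(self, handle):
        self._handle = handle
        handle.seek(0)
        self._width = len(handle.readline())  # bytes per line, newline included
        handle.seek(0, 2)
        self._count = handle.tell() // self._width

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        self._handle.seek(index * self._width)
        return float(self._handle.readline().split()[0])


sample = io.BytesIO(b".2323 1\n.2327 1\n.3432 1\n.4543 1\n")
column = FirstColumn(sample)
position = bisect.bisect_left(column, 0.3344)
```

Opening the real file in binary mode keeps seek offsets and line lengths in bytes, which is what the arithmetic above relies on.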

Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_row_num = None
    closest_value = None
    for row_num, row in enumerate(open('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print(closest(.3344))
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted, then there are some optimizations that will make this a very fast process. Lines of equal length let you seek directly to a particular line (you can't do this in a normal text file with lines of different lengths), which in turn enables a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.

Load it into a list then use bisect.
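A sketch of that suggestion, assuming the first column is sorted in ascending order. The names load_first_column and nearest_index are made up for illustration. The file is parsed once, and each of the many lookups is then a binary search on the in-memory list; the returned index is zero-based.

```python
import bisect
from itertools import islice


def load_first_column(lines, header=0):
    """Parse the first column of every line after the header into floats."""
    return [float(line.split()[0]) for line in islice(lines, header, None)]


def nearest_index(values, target):
    """Return the zero-based index of the value closest to target."""
    i = bisect.bisect_left(values, target)
    if i == 0:
        return 0
    if i == len(values):
        return len(values) - 1
    # values[i - 1] < target <= values[i]; pick whichever is nearer.
    if target - values[i - 1] <= values[i] - target:
        return i - 1
    return i


rows = [".2323 1\n", ".2327 1\n", ".3432 1\n", ".4543 1\n"]
values = load_first_column(rows)  # do this once per file
row = nearest_index(values, 0.3344)
```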
