Throttling a step in beam application

Throttling a step in beam application - python

I'm using python beam on google dataflow, my pipeline looks like this:
Read image urls from file >> Download images >> Process images
The problem is that I can't let Download images step scale as much as it needs because my application can get blocked from the image server.
Is it a way that I can throttle the step ? Either on input or output per minute.
Thank you.

One possibility, maybe naïve, is to introduce a sleep in the step. For this you need to know the maximum number of instances of the ParDo that can be running at the same time. If autoscalingAlgorithm is set to NONE you can obtain that from numWorkers and workerMachineType (DataflowPipelineOptions). Precisely, the effective rate will be divided by the total number of threads: desired_rate/(num_workers*num_threads(per worker)). The sleep time will be the inverse of that effective rate:
Integer desired_rate = 1; // QPS limit
if (options.getNumWorkers() == 0) { num_workers = 1; }
else { num_workers = options.getNumWorkers(); }
if (options.getWorkerMachineType() != null) {
machine_type = options.getWorkerMachineType();
num_threads = Integer.parseInt(machine_type.substring(machine_type.lastIndexOf("-") + 1));
}
else { num_threads = 1; }
Double sleep_time = (double)(num_workers * num_threads) / (double)(desired_rate);
Then you can use TimeUnit.SECONDS.sleep(sleep_time.intValue()); or equivalent inside the throttled Fn. In my example, as a use case, I wanted to read from a public file, parse out the empty lines and call the Natural Language Processing API with a maximum rate of 1 QPS (I initialized desired_rate to 1 previously):
p
.apply("Read Lines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
.apply("Omit Empty Lines", ParDo.of(new OmitEmptyLines()))
.apply("NLP requests", ParDo.of(new ThrottledFn()))
.apply("Write Lines", TextIO.write().to(options.getOutput()));
The rate-limited Fn is ThrottledFn, notice the sleep function:
static class ThrottledFn extends DoFn<String, String> {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
// Instantiates a client
try (LanguageServiceClient language = LanguageServiceClient.create()) {
// The text to analyze
String text = c.element();
Document doc = Document.newBuilder()
.setContent(text).setType(Type.PLAIN_TEXT).build();
// Detects the sentiment of the text
Sentiment sentiment = language.analyzeSentiment(doc).getDocumentSentiment();
String nlp_results = String.format("Sentiment: score %s, magnitude %s", sentiment.getScore(), sentiment.getMagnitude());
TimeUnit.SECONDS.sleep(sleep_time.intValue());
Log.info(nlp_results);
c.output(nlp_results);
}
}
}
With this I get a 1 element/s rate as seen in the image below and avoid hitting quota when using multiple workers, even if requests are not really spread out (you might get 8 simultaneous requests and then 8s sleep, etc.). This was just a test, possibly a better implemention would be using guava's rateLimiter.
If the pipeline is using autoscaling (THROUGHPUT_BASED) then it would be more complicated and the number of workers should be updated (for example, Stackdriver Monitoring has a job/current_num_vcpus metric). Other general considerations would be controlling the number of parallel ParDos by using a dummy GroupByKey or splitting the source with splitIntoBundles, etc. I'd like to see if there are other nicer solutions.

Related

On python, how do I compare CurrentValue to previousValue and output max

I have always used Arduino IDE but now, I am using MuEditor.
I used Arduino IDE and it was easier to compare the current value to its previous value, and output max value after a certain period. Code from the Arduino is shown below.
int sample1 = 0;
void loop() {
int sensorValue = 0;
sensorValue = analogRead(A0);
for (int i = 0; i <= 100; i++) {
if (sensorValue > sample1) {
sample1 = sensorValue;
}
}
Serial.println(sample1 * (5.0 / 1023.0));
}
I want to use the same concept on Mu Editor, and I cannot seems to get it done right.
I would like to continuously compare the current value to previous value and output max value after certain period. This is what I came up with on MuEditor. I would appreciate your help on this.
import time
import board
from analogio import AnalogIn
analog_in = AnalogIn(board.A1)
Sample = 0
def get_voltage(pin):
return (pin.value * 3.3) / 65536
while True:
for x in range(1000):
if Sample < analog_in:
Sample = analog_in
print((get_voltage(Sample1),))
time.sleep(0.1)
Result:

With circuitpython's analogio.AnalogIn() you need to refer to the .value property of the AnalogIn instance
import analogio
with analogio.AnalogIn(board.A1) as pin_sensor:
pin_sensor.value # voltage
# pin_sensor is closed when leaving the above scope
Just get the max() at each loop if you're discarding the other values
...
result = 0
with analogio.AnalogIn(board.A1) as pin_sensor:
for x in range(..):
result = max(result, pin_sensor.value)
print(result)
However, note that especially with electronics, you more likely want a more complex mechanism-
collect frequently enough to catch the frequency of what you're sampling (see Nyquist-Shannon theorem), ideally as often as you can (however, if you have a collection of digital logic, this may be infeasible, or drive a faster processor choice, etc.)
carefully discard outliers (which can be transients/static/contact bounce..)
refer to a moving average (take the average of the last N samples)
collect with a precise timer and interrupt (such that your sampling rate isn't dependent on unrelated logic)

OpenCl simple unsuccessful optimization

I'm learning to use opencl in python and I wanted to optimize one of the function. I learned that this can be done by storing global memory in local memory. However, it doesn't work as it should, the duration is twice as long. This is well done? Can I optimize this code more?
__kernel void sumOP( __global float *input,
__global float *weights,
int layer_size,
__global float *partialSums,__local float* cache)
{
private const int i = get_global_id(0);
private const int in_layer_s = layer_size;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
const int localsize = get_local_size(0);
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
float total1 = 0;
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
partialSums[i] = sigmoid(total1);
}
Python call
l = opencl.LocalMemory(len(inputs))
event = program.sumOP(queue, output.shape, np.random.randn(6,).shape, inputs.data, weights.data,np.int32(len(inputs)),output.data,l)
Thanks for some advice

Besides doing a data write race condition with writing to same shared memory address cache[x] by all workitems of a group (as Dithermaster said) and lack of barrier() function, some optimizations can be added after those are fixed:
First loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
scans a different memory area for each work item, one element at a time. This is probably wrong in terms of global memory performance because each workitem in their own loop, could be using same memory channel or even same memory bank, hence, all workitems access that channel or bank serially. This is worse if in_layer_s gets a larger value and especially if it is power of 2. To solve this problem, all workitems should access contiguous addresses with their neighbors. GPU works better when global memory is accessed uniformly with workitems. On local memory, it is less of an issue to access randomly or with gaps between workitems. Thats why its advised to use uniform save/load on global while doing random/scatter/gather on local.
Second loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
is using only single accumulator. This a dependency chain that needs each loop cycle to be completed before moving on to next. Use at least 2 temporary "total" variables and unroll the loop. Here, if in_layer_s is small enough, input array could be moved into local or constant memory to access it faster (repeatedly by all workitems, since all workitems access same input array) (maybe half of input to constant memory and other half to local memory to increase total bandwidth)
Is weights[i*in_layer_s + x]; an array of structs? If yes, you can achieve a speedup by making it a struct of arrays and get rid of first loop's optimization altogether, with an increase of code bloat in host side but if priority is speed then struct of arrays is both faster and readable on gpu side. This also makes it possible to upload only the necessary weights data(an array of SOA) to gpu from host side, decreasing total latency (upload + compute + download) further.
You can also try asynchronous local<-->global transfer functions to make loading and computing overlapped for each workitem group, to hide even more latency as a last resort. https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html

Python stop multiple process when one returns a result?

I am trying to write a simple proof-of-work nonce-finder in python.
def proof_of_work(b, nBytes):
nonce = 0
# while the first nBytes of hash(b + nonce) are not 0
while sha256(b + uint2bytes(nonce))[:nBytes] != bytes(nBytes):
nonce = nonce + 1
return nonce
Now I am trying to do this multiprocessed, so it can use all CPU cores and find the nonce faster. My idea is to use multiprocessing.Pool and execute the function proof_of_work multiple times, passing two params num_of_cpus_running and this_cpu_id like so:
def proof_of_work(b, nBytes, num_of_cpus_running, this_cpu_id):
nonce = this_cpu_id
while sha256(b + uint2bytes(nonce))[:nBytes] != bytes(nBytes):
nonce = nonce + num_of_cpus_running
return nonce
So, if there are 4 cores, every one will calculate nonces like this:
core 0: 0, 4, 8, 16, 32 ...
core 1: 1, 5, 9, 17, 33 ...
core 2: 2, 6, 10, 18, 34 ...
core 3: 3, 7, 15, 31, 38 ...
So, I have to rewrite proof_of_work so when anyone of the processes finds a nonce, everyone else stops looking for nonces, taking into account that the found nonce has to be the lowest value possible for which the required bytes are 0. If a CPU speeds up for some reason, and returns a valid nonce higher than the lowest valid nonce, then the proof of work is not valid.
The only thing I don't know how to do is the part in which a process A will only stop if process B found a nonce that is lower than the nonce that is being calculated right now by process A. If its higher, A keeps calculating (just in case) until it arrives to the nonce provided by B.
I hope I explained myself correctly. Also, if there is a faster implementation of anything I wrote, I would love to hear about it. Thank you very much!

One easy option is to use micro-batches and check if an answer was found. Too small batches incur overhead from starting parallel jobs, too large size causes other processes to do extra work while one process already found an answer. Each batch should take 1 - 10 seconds to be efficient.
Sample code:
from multiprocessing import Pool
from hashlib import sha256
from time import time
def find_solution(args):
salt, nBytes, nonce_range = args
target = '0' * nBytes
for nonce in xrange(nonce_range[0], nonce_range[1]):
result = sha256(salt + str(nonce)).hexdigest()
#print('%s %s vs %s' % (result, result[:nBytes], target)); sleep(0.1)
if result[:nBytes] == target:
return (nonce, result)
return None
def proof_of_work(salt, nBytes):
n_processes = 8
batch_size = int(2.5e5)
pool = Pool(n_processes)
nonce = 0
while True:
nonce_ranges = [
(nonce + i * batch_size, nonce + (i+1) * batch_size)
for i in range(n_processes)
]
params = [
(salt, nBytes, nonce_range) for nonce_range in nonce_ranges
]
# Single-process search:
#solutions = map(find_solution, params)
# Multi-process search:
solutions = pool.map(find_solution, params)
print('Searched %d to %d' % (nonce_ranges[0][0], nonce_ranges[-1][1]-1))
# Find non-None results
solutions = filter(None, solutions)
if solutions:
return solutions
nonce += n_processes * batch_size
if __name__ == '__main__':
start = time()
solutions = proof_of_work('abc', 6)
print('\n'.join('%d => %s' % s for s in solutions))
print('Solution found in %.3f seconds' % (time() - start))
Output (a laptop with Core i7):
Searched 0 to 1999999
Searched 2000000 to 3999999
Searched 4000000 to 5999999
Searched 6000000 to 7999999
Searched 8000000 to 9999999
Searched 10000000 to 11999999
Searched 12000000 to 13999999
Searched 14000000 to 15999999
Searched 16000000 to 17999999
Searched 18000000 to 19999999
Searched 20000000 to 21999999
Searched 22000000 to 23999999
Searched 24000000 to 25999999
Searched 26000000 to 27999999
Searched 28000000 to 29999999
Searched 30000000 to 31999999
Searched 32000000 to 33999999
Searched 34000000 to 35999999
Searched 36000000 to 37999999
37196346 => 000000f4c9aee9d427dc94316fd49192a07f1aeca52f6b7c3bb76be10c5adf4d
Solution found in 20.536 seconds
With single core it took 76.468 seconds. Anyway this isn't by far the most efficient way to find a solution but it works. For example if the salt is long then the SHA-256 state could be pre-computed after the salt has been absorbed and continue brute-force search from there. Also byte array could be more efficient than the hexdigest().

A general method to do this is to:
think of work packets, e.g. to perform the calculation for a particular range, a range should not take long, say 0.1 seconds to a second
have some manager distribute the work packets to the worker
after a work packet has been concluded, tell the manager the result and request a new work packet
if the work is done and a result has been found accept the results from workers and give them a signal that no more work is to be performed - the workers can now safely terminate
This way you don't have to check with the manager each iteration (which would slow down everything), or do nasty things like stopping a thread mid-session. Needless to say, the manager needs to be thread safe.
This fits perfectly with your model, as you still need the results of the other workers, even if a result has been found.
Note that in your model, it could be that a thread may go out of sync with the other threads, lagging behind. You don't want to do another million calculations once a result is found. I'm just reiterating this from the question because I think the model is wrong. You should fix the model instead of fixing the implementation.

You can use multiprocessing.Queue(). Have a Queue per CPU/process. When a process finds a nonce, it puts it on the queue of other processes. Other processes check their queue (non-blocking) in each iteration of the while loop and if there is anything on it, they decide to continue or terminate based on the value in the queue:
def proof_of_work(b, nBytes, num_of_cpus_running, this_cpu_id, qSelf, qOthers):
nonce = this_cpu_id
while sha256(b + uint2bytes(nonce))[:nBytes] != bytes(nBytes):
nonce = nonce + num_of_cpus_running
try:
otherNonce = qSelf.get(block=False)
if otherNonce < nonce:
return
except:
pass
for q in qOthers:
q.put(nonce)
return nonce
qOthers is a list of queues ( each queue=multiprocessing.Queue() ) belonging to other processes.
If you decide to use queues as I suggested, you should be able to write a better/nicer implementation of above approach.

I like to improve NikoNyrh's answer by changing pool.map to pool.imap_unordered. Using imap_unordered will return the result immediately from any of the workers without waiting for all of them to be completed. So once any of the results returns the tuple, we can exit the while loop.
def proof_of_work(salt, nBytes):
n_processes = 8
batch_size = int(2.5e5)
with Pool(n_processes) as pool:
nonce = 0
while True:
nonce_ranges = [
(nonce + i * batch_size, nonce + (i+1) * batch_size)
for i in range(n_processes)
]
params = [
(salt, nBytes, nonce_range) for nonce_range in nonce_ranges
]
print('Searched %d to %d' % (nonce_ranges[0][0], nonce_ranges[-1][1]-1))
for result in pool.imap_unordered(find_solution, params):
if isinstance(result,tuple): return result
nonce += n_processes * batch_size

Efficient way to modify the time format of a field for all documents in MongoDB

I have a collection contains three hundred million documents.
Each document has a "created_at" field that specifies the time in a string format like this
'Thu Feb 05 09:25:38 +0000 2015'
I want to change all the "created_at" field to a MongoDB supported time format.
So I wrote a simple Ruby script:
collection.find.each do |document|
document[:created_at] = Time.parse document[:created_at]
collection.save(document)
end
It did change the time format as I wished, but my script has been running for 50 hours, and there is no signs of finishing.
Is there a better way to do this task?
A MongoDB shell script or Python script is also doable to me.
By the way, this collection is not indexed since it's continuously inserting documents

Using mongo bulk update you can changed date to ISODATE format as below :
var bulk = db.collectionName.initializeOrderedBulkOp();
var counter = 0;
db.collectionName.find().forEach(function(data) {
var updoc = {
"$set": {}
};
var myKey = "created_at";
updoc["$set"][myKey] = new Date(Date.parse(data.created_at));
// queue the update
bulk.find({
"_id": data._id
}).update(updoc);
counter++;
// Drain and re-initialize every 1000 update statements
if(counter % 1000 == 0) {
bulk.execute();
bulk = db.collectionName.initializeOrderedBulkOp();
}
})
// Add the rest in the queue
if(counter % 1000 != 0) bulk.execute();

Python Fast Input Output Using Buffer Competitive Programming

I have seen people using buffer in different languages for fast input/output in Online Judges. For example this http://www.spoj.pl/problems/INTEST/ is done with C like this:
#include <stdio.h>
#define size 50000
int main (void){
unsigned int n=0,k,t;
char buff[size];
unsigned int divisible=0;
int block_read=0;
int j;
t=0;
scanf("%lu %lu\n",&t,&k);
while(t){
block_read =fread(buff,1,size,stdin);
for(j=0;j<block_read;j++){
if(buff[j]=='\n'){
t--;
if(n%k==0){
divisible++;
}
n=0;
}
else{
n = n*10 + (buff[j] - '0');
}
}
}
printf("%d",divisible);
return 0;
How can this be done with python?

import sys
file = sys.stdin
size = 50000
t = 0
while(t != 0)
block_read = file.read(size)
...
...
Most probably this will not increase performance though – Python is interpreted language, so you basically want to spend as much time in native code (standard library input/parsing routines in this case) as possible.
TL;DR either use built-in routines to parse integers or get some sort of 3rd party library which is optimized for speed.

I tried solving this one in Python 3 and couldn't get it to work no matter how I tried reading the input. I then switched to running it under Python 2.5 so I could use
import psyco
psyco.full()
After making that change I was able to get it to work by simply reading input from sys.stdin one line at a time in a for loop. I read the first line using raw_input() and parsed the values of n and k, then used the following loop to read the remainder of the input.
for line in sys.stdin:
count += not int(line) % k

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Throttling a step in beam application - python

Related

On python, how do I compare CurrentValue to previousValue and output max

OpenCl simple unsuccessful optimization

Python stop multiple process when one returns a result?

Efficient way to modify the time format of a field for all documents in MongoDB

Python Fast Input Output Using Buffer Competitive Programming

Categories

Resources