I have always used the Arduino IDE, but now I am using the Mu Editor.
In the Arduino IDE it was easy to compare the current reading to the previous one and output the maximum value after a certain period. The Arduino code is shown below.
int sample1 = 0;

void loop() {
  int sensorValue = 0;
  sensorValue = analogRead(A0);
  for (int i = 0; i <= 100; i++) {
    if (sensorValue > sample1) {
      sample1 = sensorValue;
    }
  }
  Serial.println(sample1 * (5.0 / 1023.0));
}
I want to use the same concept in the Mu Editor, but I cannot seem to get it right.
I would like to continuously compare the current value to the previous value and output the maximum value after a certain period. This is what I came up with in the Mu Editor. I would appreciate your help.
import time
import board
from analogio import AnalogIn

analog_in = AnalogIn(board.A1)
Sample = 0

def get_voltage(pin):
    return (pin.value * 3.3) / 65536

while True:
    for x in range(1000):
        if Sample < analog_in:
            Sample = analog_in
    print((get_voltage(Sample1),))
    time.sleep(0.1)
With CircuitPython's analogio.AnalogIn() you need to refer to the .value property of the AnalogIn instance:
import analogio

with analogio.AnalogIn(board.A1) as pin_sensor:
    pin_sensor.value  # raw reading (0-65535); scale it to get a voltage
# pin_sensor is closed when leaving the above scope
Just take the max() at each loop iteration if you're discarding the other values:
...
result = 0
with analogio.AnalogIn(board.A1) as pin_sensor:
    for x in range(..):
        result = max(result, pin_sensor.value)
print(result)
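Putting that together with the question's get_voltage() scaling and a periodic print, a minimal sketch (untested on hardware, assuming the same A1 pin and 3.3 V reference as in the question):

import time
import board
from analogio import AnalogIn

def get_voltage(raw):
    return (raw * 3.3) / 65536

with AnalogIn(board.A1) as pin_sensor:
    while True:
        peak = 0
        for _ in range(1000):                  # sample for a short window
            peak = max(peak, pin_sensor.value)
        print(get_voltage(peak))               # report the window's peak as a voltage
        time.sleep(0.1)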
However, note that, especially with electronics, you more likely want a more complex mechanism:
- collect frequently enough to catch the frequency of what you're sampling (see the Nyquist-Shannon theorem), ideally as often as you can (however, if you have a collection of digital logic, this may be infeasible, or drive a faster processor choice, etc.)
- carefully discard outliers (which can be transients, static, contact bounce, ...)
- refer to a moving average (take the average of the last N samples; a small sketch follows this list)
- collect with a precise timer and interrupt (so that your sampling rate isn't dependent on unrelated logic)
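A minimal moving-average sketch for the bullet above (the window size is an arbitrary assumption):

from collections import deque

WINDOW = 10                      # number of recent samples to average (arbitrary)
samples = deque(maxlen=WINDOW)   # old samples fall off the left automatically

def smoothed(raw):
    samples.append(raw)
    return sum(samples) / len(samples)

# smoothed(pin_sensor.value) can then replace the raw reading in the loops above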
I am trying to model a scheduling task using IBM's DOcplex Python API. The goal is to optimize EV charging schedules and minimize charging costs. However, I am having problems working with the CPO interval variable.
Charging costs are defined by different price windows, e.g., charging between 00:00-06:00 costs $0.10 per kWh while charging between 06:00-18:00 costs $0.15 per kWh.
My initial idea was this:
schedule_start = start_of(all_trips[trip_id].interval)
schedule_end = end_of(all_trips[trip_id].interval)

cost_windows = {
    "morning": {"time": range(0, 44), "cost": 10},
    "noon": {"time": range(44, 64), "cost": 15},
    "afternoon": {"time": range(64, 84), "cost": 15},
    "night": {"time": range(84, 97), "cost": 10},
}

time_low = 0
time_high = 0

for i in range(schedule_start, schedule_end):
    for key in cost_windows.keys():
        if i in cost_windows.get(key).get("time"):
            if cost_windows.get(key).get("cost") == 10:
                time_low += 1
            else:
                time_high += 1

cost_total = ((time_low * 10 * power) + (time_high * 15 * power)) / 400
As seen above, the idea was to loop from the interval's start to its end (the interval size can be at most 96, each unit representing a 15-minute time block) and check which price window each block falls into. We later calculate the total cost by multiplying the number of blocks in each window by the power (an integer variable) and the price.
However, this approach does not work because start_of(interval) cannot be used like a regular integer. Is there a way to get the start and end values of an interval and use them like regular integers? Or is there another approach that I am missing?
Regards
Have you tried overlap_length? An example can be seen in "How to initiate the interval variable bounds in docplex (python)?".
start_of and end_of do not return plain values but expressions whose values are not known until the model is solved.
What you were trying to do is a bit like
using CP;
dvar int l;
dvar interval a in 0..10 size 3;
subject to
{
  l==sum(i in 0..10) ((startOf(a)<=i) && (endOf(a)>i));
}
execute
{
  writeln("l=",l);
}
in OPL, but that enumerates time, which is not the right way to do it.
Here is a small example with overlapLength and 3 time windows with 3 prices:
using CP;
dvar int l;

tuple pricewindow
{
  int s;
  int e;
  float price;
}

{pricewindow} windows={<0,5,1>,<5,6,0>,<6,10,0.5>};

dvar interval pwit[w in windows] in w.s..w.e size (w.e-w.s);
dvar interval a in 0..10 size 6;

dexpr float cost=sum(w in windows) overlapLength(a,pwit[w])*w.price;

minimize cost;
subject to
{
}
which gives
// solution with objective 3
a = <1 4 10 6>;
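Since the question uses the DOcplex Python API, here is a rough, untested Python sketch of the same overlap_length idea; the window data, interval sizes and variable names are illustrative, not taken from the question:

from docplex.cp.model import CpoModel, interval_var, overlap_length, minimize

mdl = CpoModel()

# price windows as (start block, end block, cost per 15-minute block) -- illustrative
windows = [(0, 44, 10), (44, 64, 15), (64, 84, 15), (84, 96, 10)]

# one fixed interval per price window
pwit = [interval_var(start=s, end=e, name="w%d" % i)
        for i, (s, e, c) in enumerate(windows)]

# the charging interval to schedule (6 blocks long, purely as an example)
charge = interval_var(start=(0, 96), end=(0, 96), size=6, name="charge")

# cost = sum over windows of (overlap of the charging interval with the window) * price
cost = sum(overlap_length(charge, w) * c for w, (s, e, c) in zip(pwit, windows))
mdl.add(minimize(cost))

sol = mdl.solve()
print(sol.get_var_solution("charge"), sol.get_objective_values())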
I wrote this code to print the sensor values in Python, but the problem is that the soil_sensor prints twice.
This is the Arduino code:
#include <DHT.h>
#include <DHT_U.h>

#define DHTPIN 8
#define DHTTYPE DHT11

int msensor = A0;
int msvalue = 0;
int min = 0;
int max = 1024;

DHT dht(DHTPIN, DHTTYPE);

void setup() {
  Serial.begin(9600);
  pinMode(msensor, INPUT);
  dht.begin();
}

void loop() {
  msvalue = analogRead(msensor);
  float percentage = (float)((msvalue - min) * 100) / (max - min);
  percentage = map(msvalue, max, min, 0, 100);
  Serial.print("r "); Serial.println(percentage);
  int h = dht.readHumidity();
  int t = dht.readTemperature();
  Serial.print("h ");
  Serial.println(h);
  Serial.print("c ");
  Serial.println(t);
  delay(2000);
}
And this is the Python code:
from time import sleep
import serial

arduinoP1 = serial.Serial(port="/dev/ttyUSB0", baudrate=9600)

def rtot():
    arduino_data = arduinoP1.read(6)
    str_rn = arduino_data.decode()
    sleep(1)
    return str_rn

for x in range(3):
    i = rtot()
    if "r" in i:
        v1 = int(float(i[1:5].strip('\\r\\nr')))
        print(v1, 'soil_sensor')
    if "c" in i:
        print(i[1:2], 'temperature_sensor')
    if "h" in i:
        v3 = int(i[2:4])
        print(v3, 'Humidity_sensor')
As you can see, the soil sensor value is printed twice.
I want the values to be displayed correctly and as plain numbers.
The first thing you should notice is that sending numbers through the serial interface results in strings of different lengths, depending on the number of digits.
So reading a fixed number of 6 bytes is not a good idea (actually, this is almost never a good idea).
You terminate each sensor reading with a line break, so why not use readline() instead of read(6)?
Here, v1 = int(float(i[1:5].strip('\\r\\nr'))) is trying to remove \r, \n and r from the received string. Unfortunately you escaped the backslashes, so you're actually stripping \, r and n.
\r is a case where you need the backslash to represent the carriage return character. Don't escape it!
In the first run, loop() will send something like:
r 0.00\r\nh 40\r\nc 25\r\n
So the first 6 bytes are r 0.00, and i[1:5] is 0.0.
As you can see, there is nothing to strip there. Also, index 5 is excluded, so you would have to use i[2:6] to get 0.00. But as mentioned above, using fixed lengths for numbers is a bad idea: you can receive anything between 0.00 and 100.00 here.
So using readline() you'll receive
r 0.00\r\n
The first two and the last two characters are always there, so we can use [2:-2] to get the number in between, regardless of its length.
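A minimal sketch of the reading loop rewritten along these lines (untested, assuming the same port settings as in the question):

import serial

arduinoP1 = serial.Serial(port="/dev/ttyUSB0", baudrate=9600)

while True:
    line = arduinoP1.readline().decode()   # e.g. "r 0.00\r\n"
    value = line[2:-2]                     # drop the "x " prefix and the trailing "\r\n"
    if line.startswith("r"):
        print(int(float(value)), 'soil_sensor')
    elif line.startswith("h"):
        print(int(value), 'Humidity_sensor')
    elif line.startswith("c"):
        print(int(value), 'temperature_sensor')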
I'm using Python Beam on Google Dataflow; my pipeline looks like this:
Read image urls from file >> Download images >> Process images
The problem is that I can't let the Download images step scale as much as it needs to, because my application can get blocked by the image server.
Is there a way to throttle that step, either on input or output per minute?
Thank you.
One possibility, maybe naïve, is to introduce a sleep in the step. For this you need to know the maximum number of instances of the ParDo that can be running at the same time. If autoscalingAlgorithm is set to NONE you can obtain that from numWorkers and workerMachineType (DataflowPipelineOptions). Precisely, each thread's rate is the desired rate divided by the total number of threads: desired_rate/(num_workers*num_threads per worker). The sleep time is the inverse of that per-thread rate:
Integer desired_rate = 1; // QPS limit

if (options.getNumWorkers() == 0) { num_workers = 1; }
else { num_workers = options.getNumWorkers(); }

if (options.getWorkerMachineType() != null) {
    machine_type = options.getWorkerMachineType();
    num_threads = Integer.parseInt(machine_type.substring(machine_type.lastIndexOf("-") + 1));
}
else { num_threads = 1; }

Double sleep_time = (double)(num_workers * num_threads) / (double)(desired_rate);
Then you can use TimeUnit.SECONDS.sleep(sleep_time.intValue()); or equivalent inside the throttled Fn. In my example, as a use case, I wanted to read from a public file, parse out the empty lines and call the Natural Language Processing API with a maximum rate of 1 QPS (I initialized desired_rate to 1 previously):
p
    .apply("Read Lines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
    .apply("Omit Empty Lines", ParDo.of(new OmitEmptyLines()))
    .apply("NLP requests", ParDo.of(new ThrottledFn()))
    .apply("Write Lines", TextIO.write().to(options.getOutput()));
The rate-limited Fn is ThrottledFn, notice the sleep function:
static class ThrottledFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {

        // Instantiates a client
        try (LanguageServiceClient language = LanguageServiceClient.create()) {

            // The text to analyze
            String text = c.element();
            Document doc = Document.newBuilder()
                .setContent(text).setType(Type.PLAIN_TEXT).build();

            // Detects the sentiment of the text
            Sentiment sentiment = language.analyzeSentiment(doc).getDocumentSentiment();
            String nlp_results = String.format("Sentiment: score %s, magnitude %s", sentiment.getScore(), sentiment.getMagnitude());

            TimeUnit.SECONDS.sleep(sleep_time.intValue());

            Log.info(nlp_results);
            c.output(nlp_results);
        }
    }
}
With this I get a rate of about 1 element/s and avoid hitting quota when using multiple workers, even if the requests are not really spread out (you might get 8 simultaneous requests and then 8 s of sleep, etc.). This was just a test; a better implementation would possibly use Guava's RateLimiter.
If the pipeline is using autoscaling (THROUGHPUT_BASED) then it gets more complicated and the number of workers should be updated dynamically (for example, Stackdriver Monitoring has a job/current_num_vcpus metric). Other general considerations would be controlling the number of parallel ParDos by using a dummy GroupByKey, splitting the source with splitIntoBundles, etc. I'd like to see if there are other, nicer solutions.
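Since the question's pipeline is written in Python, here is a rough Python sketch of the same sleep-based throttling idea (the constants and download_image() are placeholders, not part of the original code):

import time
import apache_beam as beam

DESIRED_RATE = 1    # target QPS across the whole job (assumption)
NUM_WORKERS = 1     # match --num_workers when autoscaling is off
NUM_THREADS = 1     # DoFn threads per worker

SLEEP_TIME = float(NUM_WORKERS * NUM_THREADS) / DESIRED_RATE

def download_image(url):
    # placeholder for the real download logic (e.g. urllib / requests)
    return url

class ThrottledDownloadFn(beam.DoFn):
    def process(self, url):
        result = download_image(url)
        time.sleep(SLEEP_TIME)      # spread this thread's requests out over time
        yield result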
I'm trying to implement a scan algorithm (Hillis-Steele) and I'm having some trouble understanding how to do it properly in CUDA. This is a minimal example using PyCUDA:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule

# compile cuda code
mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int idx = 1; idx <= threadIdx.x; idx <<= 1){
        int new_value = addresses[threadIdx.x] + addresses[threadIdx.x - idx];
        __syncthreads();
        addresses[threadIdx.x] = new_value;
    }
}
''')
func = mod.get_function("scan")

# Initialize an array with 1's
addresses_h = np.full((896,), 1, dtype='i4')
addresses_d = cuda.to_device(addresses_h)

# Launch kernel and copy back the result
threads_x = 896
func(addresses_d, block=(threads_x, 1, 1), grid=(1, 1))
addresses_h = cuda.from_device(addresses_d, addresses_h.shape, addresses_h.dtype)

# Check the result is correct
for i, n in enumerate(addresses_h):
    assert i+1 == n
My question is about __syncthreads(). As you can see, I'm calling __syncthreads() inside the for loop and not every thread will execute that code the same number of times:
ThreadID - Times it will execute the for loop
0       : 0 times
1       : 1 time
2-3     : 2 times
4-7     : 3 times
8-15    : 4 times
16-31   : 5 times
32-63   : 6 times
64-127  : 7 times
128-255 : 8 times
256-511 : 9 times
512-895 : 10 times
There can be threads in the same warp with a different number of calls to __syncthreads(). What will happen in that case? How can threads that are not executing the same code be synchronized?
In the sample code, we start with an array full of 1's and the output contains index+1 as the value at each position, so it is computing the correct answer. Is that "by chance", or is the code correct?
If this is not a proper use of __syncthreads(), how could I implement such an algorithm in CUDA?
If this is not a proper use of __syncthreads(), how could I implement such an algorithm in CUDA?
One typical approach is to separate the conditional code from the __syncthreads() calls. Use the conditional code to determine what threads will participate.
Here's a simple transformation of your posted code that should give the same result without any violations (i.e., all threads will participate in every __syncthreads() operation):
mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int i = 1; i < blockDim.x; i <<= 1){
        int new_value;
        if (threadIdx.x >= i) new_value = addresses[threadIdx.x] + addresses[threadIdx.x - i];
        __syncthreads();
        if (threadIdx.x >= i) addresses[threadIdx.x] = new_value;
        __syncthreads();  // make this step's writes visible before the next step's reads
    }
}
''')
I'm not suggesting this is a complete and proper scan, or that it is optimal, or anything of the sort. I'm simply showing how your code can be transformed to avoid the violation inherent in what you have.
If you want to learn more about scan methods, this is one source. But if you actually need a scan operation, I would suggest using Thrust or CUB.
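Since the question already uses PyCUDA, one library-backed option is its built-in scan kernel; a minimal sketch (assuming pycuda.gpuarray and pycuda.scan are available in your installation):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.scan import InclusiveScanKernel

# library-provided inclusive scan: output element i is the sum of inputs 0..i
knl = InclusiveScanKernel(np.int32, "a+b")

a_gpu = gpuarray.to_gpu(np.full((896,), 1, dtype=np.int32))
knl(a_gpu)  # scans in place on the GPU
assert (a_gpu.get() == np.arange(1, 897, dtype=np.int32)).all()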
I'm learning to use OpenCL in Python and I wanted to optimize one of my functions. I learned that this can be done by caching global memory in local memory. However, it doesn't work as it should: the run time is twice as long. Is this done correctly? Can I optimize this code further?
__kernel void sumOP( __global float *input,
                     __global float *weights,
                     int layer_size,
                     __global float *partialSums,
                     __local float *cache)
{
    private const int i = get_global_id(0);
    private const int in_layer_s = layer_size;
    private const int item_id = get_local_id(0);
    private const int group_id = get_group_id(0);
    private const int group_count = get_num_groups(0);
    const int localsize = get_local_size(0);

    for ( int x = 0; x < in_layer_s; x++ )
    {
        cache[x] = weights[i*in_layer_s + x];
    }

    float total1 = 0;
    for ( int x = 0; x < in_layer_s; x++ )
    {
        total1 += cache[x] * input[x];
    }

    partialSums[i] = sigmoid(total1);
}
Python call
l = opencl.LocalMemory(len(inputs))
event = program.sumOP(queue, output.shape, np.random.randn(6,).shape,
                      inputs.data, weights.data, np.int32(len(inputs)), output.data, l)
Thanks for some advice
Besides the write race condition caused by all work items of a group writing to the same shared memory addresses cache[x] (as Dithermaster said) and the missing barrier() call, some optimizations can be added once those are fixed:
First loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
    cache[x] = weights[i*in_layer_s + x];
}
scans a different memory area for each work item, one element at a time. This is probably wrong in terms of global memory performance, because each work item, in its own loop, could be using the same memory channel or even the same memory bank; hence all work items access that channel or bank serially. This gets worse if in_layer_s has a larger value, and especially if it is a power of 2. To solve this problem, neighboring work items should access contiguous addresses: the GPU works better when global memory is accessed uniformly across work items. On local memory, it is less of an issue to access randomly or with gaps between work items. That's why it's advised to use uniform saves/loads on global memory while doing random/scatter/gather accesses on local memory.
Second loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
    total1 += cache[x] * input[x];
}
is using only a single accumulator. This is a dependency chain that requires each loop iteration to complete before moving on to the next. Use at least 2 temporary "total" variables and unroll the loop. Here, if in_layer_s is small enough, the input array could be moved into local or constant memory so it can be accessed faster (repeatedly by all work items, since all work items read the same input array); maybe half of input in constant memory and the other half in local memory, to increase total bandwidth.
Is weights[i*in_layer_s + x] an array of structs? If yes, you can achieve a speedup by making it a struct of arrays and getting rid of the first loop's optimization altogether, at the cost of some code bloat on the host side; but if the priority is speed, then a struct of arrays is both faster and more readable on the GPU side. This also makes it possible to upload only the necessary weights data (one array of the SoA) to the GPU from the host side, decreasing the total latency (upload + compute + download) further.
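As a host-side illustration of the array-of-structs vs. struct-of-arrays layouts (a NumPy sketch with made-up field names, not taken from the question's code):

import numpy as np

n_neurons, in_layer_s = 64, 128

# Array of structs: each record interleaves a weight block with other per-neuron fields
aos = np.zeros(n_neurons, dtype=[("weights", np.float32, (in_layer_s,)),
                                 ("bias", np.float32)])

# Struct of arrays: one contiguous buffer per field; only the buffer the kernel
# actually needs has to be uploaded to the device
soa_weights = np.ascontiguousarray(aos["weights"])  # shape (n_neurons, in_layer_s)
soa_bias = np.ascontiguousarray(aos["bias"])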
As a last resort, you can also try the asynchronous local<-->global transfer functions to overlap loading and computing for each work group and hide even more latency: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html