I'm asking for help/tips with system design.
I have an IoT system with sensors: PIR (motion), reed contacts (contactrons), temperature & humidity, and so on.
Nothing fancy.
I'm collecting and filtering the raw data to build some observations on top of it.
So far I have some event_rule classes that are bound to sensors and return True/False depending on the data that constantly arrives from the queue (fed by the sensors).
I know I also need to run periodic analyses on existing data (e.g. detecting when motion sensors stop reporting) or on a mix of incoming and existing data, which means loading the data and analyzing it over some time window (counts, averages, etc.).
That time-window approach could help answer questions like:
temperature increased by more than 10 degrees in the last hour, no motion detected for the past 10 minutes, or high/low/no movement detected over the last 30 minutes.
My naive approach was to run a semi-cron Python thread that executes the rules one by one and checks each rule's output every N seconds, e.g. every 30 s. Some rules include a state machine and handle transitions from one state to another.
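Roughly, the loop looks like this (a minimal sketch; evaluate() and on_triggered() are hypothetical names for the rule interface):

import threading
import time

CHECK_INTERVAL = 30  # seconds between rule sweeps

def rule_loop(rules):
    # Naive semi-cron: evaluate every rule, sleep, repeat.
    while True:
        for rule in rules:
            if rule.evaluate():      # hypothetical True/False rule check
                rule.on_triggered()  # e.g. drive a state-machine transition
        time.sleep(CHECK_INTERVAL)

# rules = the polymorphic rule objects loaded from the SQL database
# threading.Thread(target=rule_loop, args=(rules,), daemon=True).start()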
But this seems really bad to me: imagine the system scales up and all of a sudden it has to check hundreds of rules every N seconds.
I know some generic approach is needed.
How should I tackle this case? What is the correct approach? In the microcontroller world I'd phrase it as: how do I properly generate the system clock that checks the rules, but not all of them at once, and in a configurable manner?
I'd be thankful for tips; maybe there are already some Python libraries that address this. I'm using pandas for the analyses and a state machine for the state transitions; the event rules are defined in a SQL database and cast to polymorphic Python classes based on the rule type.
Using pandas' rolling window could be a solution (sources: pandas.pydata.org: Window, "How to use rolling in pandas?").
In general this means:
Step 1:
Define a window based either on a number of rows (increasing index) or on time (increasing timestamps).
Step 2:
Apply this window to the dataset.
The code snippet below applies basic calculations (mean, min, max) to a dataframe and adds the results as new columns in the dataframe.
To keep the original dataframe clean, I suggest working on a copy:
import pandas as pd

# Load the source data (placeholder path)
df = pd.read_csv('[PathToDataSource]')

# Work on a copy so the original dataframe stays untouched
df_copy = df.copy()

# Rolling statistics over a 10-row window, each stored in its own column
df_copy['moving-average'] = df_copy['SourceColumn'].rolling(window=10).mean()
df_copy['moving-min'] = df_copy['SourceColumn'].rolling(window=10).min()
df_copy['moving-max'] = df_copy['SourceColumn'].rolling(window=10).max()
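For the time-based questions in the original post (e.g. "no motion detected for the past 10 minutes"), the window can also be given as a time offset, provided the dataframe has a DatetimeIndex. A minimal sketch, assuming hypothetical 'timestamp' and 'motion' columns (motion = 1 per detected event):

import pandas as pd

# Index by timestamp so that time-offset windows become available
df_copy = df_copy.set_index(pd.to_datetime(df_copy['timestamp']))

# Count motion events inside a sliding 10-minute window;
# a count of 0 means no motion in the last 10 minutes.
df_copy['motion-count-10min'] = df_copy['motion'].rolling('10min').sum()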
What are you trying to achieve?:
I am currently running a reaction time and accuracy task that involves comparing visually presented numerical information and auditory numerical information. The visual numerical information will be presented in three forms: Arabic numerals (e.g. 5), number words (e.g. five), and non-symbolic magnitudes (e.g. a picture of 5 dots). The visual and auditory numerical information will be presented sequentially. After the presentation of the 2nd stimulus, participants respond whether the two stimuli convey the same information or not: they press “a” if the numerical information is the same and “l” if it is different.
Apart from varying the format of the visual numerical stimuli, I also intend to vary the stimulus onset asynchrony (SOA)/time interval between the two stimuli. I have 7 levels of SOA (±750, ±500, and ±250 ms, plus 0 ms), resulting in me creating my experiment in this manner (see attached picture).
One set of fixation_cross and VA_750ms (for example) constitutes a block. Hence, in total, there are 7 blocks here (only 4 are pictured though). I have already randomized the trials within each block. The next step for me is to randomize the presentation of these blocks, with one block denoting one level of SOA/time interval (e.g +750ms). To do this, I’ve placed a loop around all the blocks, with this loop titled “blocknames” in the picture. While the experiment still works fine, randomization still doesn’t occur.
I understand that there was a post addressing the randomization of blocks, but I felt that it was more specific to experiments that only have one routine. This is not very feasible for my case considering that I would have to vary the time interval between two numerical stimuli within a trial.
What did you try to make it work?:
Nevertheless, I’ve tried to create an Excel file with the names of the Excel files for each condition. Across all routines, the Excel files actually contain the same information; they’re just named differently according to the condition name (e.g. AV500ms, VA750ms). In this case, the experiment still works, but the blocks are still not being randomized.
What specifically went wrong when you tried that?:
With the same excel file, I also tried to label my conditions as $condsFile instead of using the exact document location, but this was what I got instead.
At the same time, I was wondering if I could incorporate my SOA/time interval levels into Excel instead - how would this be carried out in Builder?
This might be some useful background info on my Psychopy software and laptop.
OS (e.g. Win10): Win 10
PsychoPy version (e.g. 1.84.x): 2020.1.3
Standard Standalone? (y/n): Yes
I apologize if this has been posted a few times before. I’ve tried to apply the existing solutions according to what my experiment requires, but to no avail. I’m also quite a new user to PsychoPy and am not very sure how to proceed from here. I would really appreciate any advice on this!
This isn't really a programming question per se, as it can be addressed entirely by using the graphical Builder interface of PsychoPy. In the future, you should probably address such questions to the dedicated support forum at https://discourse.psychopy.org rather than here at Stack Overflow.
In essence, your experiment should have a much simpler structure. Embed your two trial routines within a trials loop. After that loop, insert your break routine. Lastly, embed the whole lot within an outer blocks loop, i.e. your experiment will show only three routines and two loops, not the very long structure you currently have. The nested loops mean the two trial routines will run on every trial, while the break routine will run only once per block.
The key aspect to controlling the block order is the outer blocks loop. Connect it to a conditions file that looks like this:
condition_file
block_1.csv
block_2.csv
block_3.csv
block_4.csv
block_5.csv
block_6.csv
block_7.csv
And set the loop to be "random".
In the inner trials loop, put the variable name $condition_file in the conditions file field. You will then have the order of blocks randomised across your subjects.
The other key aspect you need to learn is to control more of the task using variables contained within each of your conditions files. e.g. you are currently creating a separate routine for each ISI value (e.g. AV500ms and AV750ms). Instead, you should just have a single routine, called say AV. Make the timings of the stimulus components within that routine be controlled by a variable from your conditions file.
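For example, each block file (block_1.csv and so on) might then look something like this; the column names are purely hypothetical, you would define your own and reference them in the component fields as $visual_stim, $SOA, etc.:

visual_format,visual_stim,auditory_stim,SOA
arabic,5,five.wav,0.75
word,five,five.wav,-0.25
dots,five_dots.png,five.wav,0

The SOA value could then drive the start time of the second stimulus component within the single AV routine.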
A key principle of programming is DRY: Don't Repeat Yourself (and although you aren't directly programming, under the hood, PsychoPy Builder will generate a Python program for you). Creating multiple routines that differ only in one respect is an indicator that things are not being specified optimally. By having only one routine, if you need to alter it in some way, you only have to do it once, rather than repeat it 7 times. The latter approach is very fragile and hard to maintain, and can easily lead to errors.
There is a resource on controlling blocks of trials here:
https://www.psychopy.org/builder/blocksCounterbalance.html
I built a device based on a microcontroller with some sensors attached to it, one of them is an orientation sensor that currently delivers information about pitch, yaw, roll and acceleration for x,y,z. I would like to be able to detect movement "events" when the device is well... moved around.
For example I would like to detect a "repositioned" event which basically would consist of series of other events - "up" (picked up), "move" (moved in air to some other point), "down" (device put back down).
Since I am just starting to figure out how to make this possible, I would like to ask whether I am getting the right ideas or wasting my time.
My current idea is to use the data I probed to create a dataset and try machine learning to detect whether each element belongs to one of the events I am trying to detect. So basically I took the device and first rotated it on the table a few times, then picked it up several times, then moved it in the air, and finally put it down several times. This generated a set of data with a structure like this:
yaw,pitch,roll,accelx,accely,accelz,state
-140,178,178,17,-163,-495,stand
110,-176,-166,-212,-97,-389,down
118,-177,178,123,16,-146,up
166,-174,-171,-375,-145,-929,up
157,-178,178,4,-61,-259,down
108,177,-177,-55,76,-516,move
152,178,-179,35,98,-479,stand
175,177,-178,-30,-168,-668,move
100,177,178,-42,26,-447,stand
-14,177,179,42,-57,-491,stand
-155,177,179,28,-57,-469,stand
92,-173,-169,347,-373,-305,down
[...]
The last "state" column was added by me: I labelled the rows after each test movement type and then shuffled the rows.
I got about 450 records this way, and the idea is to use machine learning to predict the "state" column for each record coming from the running device. Then I could queue up the outcomes, and if in some short period the "up" events are the majority, I can take it that the device is being picked up.
Maybe instead of using each single reading as a data row I should take the last 10 readings (let's say) and try to predict per column: i.e. if I know what the last 10 yaw readings looked like while I was moving the device up, I should use that pattern. The 10 readings from each of the 6 columns would then be processed as one row, giving 6 results, and again the ratio of result types might make it possible to detect the movement event that happened during those 10 readings.
I am currently about 30% into an online ML course and enjoying it but I'd really like to hear some comments from more experienced people.
Are my ideas a reasonable solution or am I totally failing to understand how I can use ML? If so, what resources shall I use to get myself started?
Your idea to regroup the readings seems interesting, but it all depends on how often you get a record and how you plan to group them.
If you get a record every 10-100 ms, grouping could be a good idea: it will reduce noise and give you more accurate data. You could take the mean of each column to get rid of that noise and help your classifier better distinguish your different states.
Otherwise, if you only get a record every second, regrouping is probably a bad idea, since you would most certainly mix several actions together.
The best way would be to try out both approaches if you have the time ^^
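To make the grouping idea concrete, here is a minimal sketch, assuming the CSV format shown in the question, rows still in time order (i.e. before shuffling), and scikit-learn as an example classifier:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('readings.csv')  # hypothetical file name
features = ['yaw', 'pitch', 'roll', 'accelx', 'accely', 'accelz']
WINDOW = 10  # consecutive readings per sample

# Average each column over non-overlapping windows to reduce noise;
# label each window with its most frequent state.
groups = df.groupby(df.index // WINDOW)
X = groups[features].mean()
y = groups['state'].agg(lambda s: s.mode().iloc[0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print('held-out accuracy:', clf.score(X_test, y_test))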
I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that, depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole setup has run successfully for 10,000 simulations in the past; the difference now is that there are nearly 500,000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels, hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running at least...) and because, as you can probably tell by now, I have very little idea of what could be causing it and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer.
If you believe it can be relevant, the main idea of the script is:
1. Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row, some columns to store the respective values of the objective functions are pre-allocated too.
2. Generate a dask dataframe that is the DataFrame above split into 8 partitions.
3. Define a function that takes a partition and, for each row:
   - runs the model with the coefficient values defined in the row,
   - retrieves the values of the objective functions,
   - assigns these values to the respective (pre-allocated) columns of the current row in the partition,
   - and then returns the partition with the objective-function columns populated with the calculated values.
4. map_partitions() this function over the dask dataframe (see the sketch below).
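In code, the structure is roughly this (a sketch; run_model and the parameter names are placeholders for the real model):

import itertools
import pandas as pd
import dask.dataframe as dd

# Cartesian product of parameter values, one combination per row
params = {'a': [0.1, 0.2, 0.3], 'b': [1, 2, 3]}
df = pd.DataFrame([dict(zip(params, combo))
                   for combo in itertools.product(*params.values())])
df['objective'] = float('nan')  # pre-allocated result column

def run_model(a, b):
    return a * b  # placeholder for the real ~2-second model run

def process(partition):
    # Populate the pre-allocated column for every row of this partition
    partition = partition.copy()
    partition['objective'] = [run_model(row.a, row.b)
                              for row in partition.itertuples()]
    return partition

ddf = dd.from_pandas(df, npartitions=8)
result = ddf.map_partitions(process).compute()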
Any thoughts?
This shows how simple the script is:
The dashboard:
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame at the end via .compute(), I had the dask dataframe written to Parquet (this way each partition was written to a separate file). Later, reading all the files back into a dask dataframe and computing it to a pandas DataFrame wasn't difficult, and if something went wrong in the middle, at least I wouldn't lose the partitions that had already been successfully processed and written.
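A rough sketch of that pattern, reusing df and process from the outline above (paths hypothetical; to_parquet requires pyarrow or fastparquet):

import dask.dataframe as dd

n_cores = 4
ddf = dd.from_pandas(df, npartitions=n_cores * 200)

# Each partition lands in its own file, so a crash only loses
# the partitions that have not been written yet.
ddf.map_partitions(process).to_parquet('results/')

# Later: read everything back and materialise one pandas DataFrame
final = dd.read_parquet('results/').compute()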
This is what it looked like at a given point:
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.
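For example, a minimal sketch of switching to the local distributed scheduler (the dashboard_link attribute is available in recent distributed versions):

from dask.distributed import Client

client = Client()  # local cluster: one worker process per core by default
print(client.dashboard_link)  # typically http://127.0.0.1:8787/status

# Any dask computation run after this point appears live on the dashboard.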
I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame where the latitude and longitude appear next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way?
Thank you very much (I am still new to this).
Edit:
My info file looks like this (about 500 observations, one per station):
My data file looks like this (there are other variables not shown here; about 15 million observations, one per trip):
And what I am looking to get is that, when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add the 2 new columns to data and perform the equivalent mapping; a sketch of that alternative follows the code below.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']

# calculate Boolean mask on year
mask = data['year'] == '2018'

# apply mappings; where no match is found, fillna keeps the original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
                                 .fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
                                  .fillna(data.loc[mask, 'longitude'])
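The pandas.merge alternative mentioned above would look roughly like this (a sketch; suffixes are needed because data already has latitude/longitude columns):

# left-join the station coordinates onto the big frame in one pass
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))

# overwrite only the 2018 rows, keeping original values where no match exists
mask = merged['year'] == '2018'
for col in ('latitude', 'longitude'):
    merged.loc[mask, col] = merged.loc[mask, col + '_info'] \
                                  .fillna(merged.loc[mask, col])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])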
This is a very recurrent and important issue when anyone starts to deal with large datasets. Big data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of the data, optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should avoid for loops that you break out of. Whenever you don't know in advance how many iterations you will need, a while (or do...while) loop is usually the better fit.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is fast for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks, which aim at doing everything in parallel. They rely on a concept named MapReduce.
The first distributed data storage framework was Hadoop (eg. Hadoop Distributed File System or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again this subject is a whole different world but might very well be adapted to your situation. Feel free to document yourself about all these concepts and frameworks, and leave your questions if needed in the comments.
I have a live feed of logging data coming in over the network. I need to calculate live statistics, like the ones in my previous question. How would I design this module? I mean, it seems unrealistic (read: bad design) to keep applying a groupby function to the entire df every single time a message arrives. Can I just update one row and have its calculated column auto-update?
JFYI, I'd be running another thread that reads values from the df and prints them to a webpage every 5 seconds or so.
Of course, I could run groupby-apply every 5 seconds instead of doing it in real time, but I thought it'd be better to keep the df and the calculation independent of the printing module.
Thoughts?
groupby is pretty damn fast, and if you preallocate slots for new items you can make it even faster. In other words, try it and measure it for a reasonable amount of fake data. If it's fast enough, use pandas and move on. You can always rewrite it later.
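If you want to follow that advice, a quick benchmark sketch on fake data of a realistic size (column names hypothetical):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': np.random.randint(0, 100, size=1_000_000),
    'value': np.random.randn(1_000_000),
})

start = time.perf_counter()
stats = df.groupby('key')['value'].agg(['mean', 'count'])
print(f'groupby over 1M rows: {time.perf_counter() - start:.3f}s')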