InfluxDB: How to deal with missing data? - python

Question Description
We run a lot of time-series queries, usually through a Python API, and they sometimes fail completely because data is missing.
Because of this, we are not sure where to educate ourselves and get an answer to this specific question: how do we deal with missing data in our time-series (InfluxDB) database?
Example
To describe the problem with an example:
We have some time-series data; let's say we measure the temperature of a room. We have many rooms, and sometimes sensors die or stop working for a week or two before we replace them, so the data for that timeframe is missing.
When we then try to perform certain calculations, they fail. Say we want to calculate the average temperature per day: this fails because on some days we have no measurement input from the sensors.
One approach we thought of is to simply interpolate the data for those days: take the last and the first available values and fill the days with no data.
This has major downsides, the main one being that you cannot trust fake data, and for our more serious processes we would prefer not to store fake (interpolated) data.
We are wondering what the possible alternatives are and where we can find resources to educate ourselves on this topic.

Answer
The idea is to fill the missing values, the gaps, with null/None data. This way we can use InfluxDB's built-in fill.
https://docs.influxdata.com/influxdb/cloud/query-data/flux/fill/
As in that example, we can fill the null values and then perform any additional queries and analysis on the data.
The reference above covers the methods available for filling in missing data values.
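For illustration, here is a minimal sketch of what this could look like from Python with the influxdb-client library and Flux. The bucket, measurement and time range are placeholders for your own setup, and aggregateWindow(createEmpty: true) is one way to make the gaps appear as nulls before fill() decides how to treat them.

from influxdb_client import InfluxDBClient

# Placeholder connection details for an InfluxDB 2.x / Cloud instance.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# Aggregate to daily means; createEmpty: true emits a null row for each day
# with no data, and fill() then decides how those nulls are handled
# (usePrevious: true carries the last known value forward).
flux = '''
from(bucket: "rooms")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 1d, fn: mean, createEmpty: true)
  |> fill(usePrevious: true)
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())

If you would rather not store or return interpolated values, you can drop the fill() step and let the nulls flow through to your analysis code, which keeps the "fake data" concern explicit.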

Related

How can I use machine learning to convert orientation sensor data into movement events?

I built a device based on a microcontroller with some sensors attached to it. One of them is an orientation sensor that currently delivers pitch, yaw, roll and acceleration for x, y, z. I would like to be able to detect movement "events" when the device is, well... moved around.
For example, I would like to detect a "repositioned" event, which would basically consist of a series of other events: "up" (picked up), "move" (moved through the air to some other point), "down" (device put back down).
Since I am just starting to figure out how to make this possible, I would like to ask whether I am getting the right ideas or wasting my time.
My current idea is to use the data I probed to create a dataset and try to use machine learning to detect whether each element belongs to one of the events I am trying to detect. So basically I took the device, first rotated it on the table a few times, then picked it up several times, then moved it in the air, and finally put it down several times. This generated a dataset with a structure like this:
yaw,pitch,roll,accelx,accely,accelz,state
-140,178,178,17,-163,-495,stand
110,-176,-166,-212,-97,-389,down
118,-177,178,123,16,-146,up
166,-174,-171,-375,-145,-929,up
157,-178,178,4,-61,-259,down
108,177,-177,-55,76,-516,move
152,178,-179,35,98,-479,stand
175,177,-178,-30,-168,-668,move
100,177,178,-42,26,-447,stand
-14,177,179,42,-57,-491,stand
-155,177,179,28,-57,-469,stand
92,-173,-169,347,-373,-305,down
[...]
The last "state" column was added by me after each test movement type; I then shuffled the rows.
I got about 450 records this way. The idea is to use machine learning to predict the "state" column for each record coming from the running device; I could then queue up the outcomes, and if the "up" events form a majority within some short period, I can take it that the device is being picked up.
Maybe instead of using each reading as a data row I should take the last 10 readings (let's say) and try to predict what happens per column, i.e. if I know the last 10 yaw readings were the changes that occurred while I was moving the device up, I should use that. So 10 readings from each of the 6 columns would be processed as one row, giving 6 results; again, the ratio of result types may make it possible to detect the "movement" event that happened during those 10 readings.
I am currently about 30% into an online ML course and enjoying it, but I'd really like to hear some comments from more experienced people.
Are my ideas a reasonable solution or am I totally failing to understand how I can use ML? If so, what resources shall I use to get myself started?
Your idea to regroup the readings seems interesting, but it all depends on how often you get a record and how you plan to group them.
If you get a record every 10-100 ms, it could be a good idea to group them, since this will help you get more accurate data by reducing noise. You could take the mean of each column to get rid of that noise and help your classifier better separate the different states.
Otherwise, if you have a record every second, I think it's a bad idea to regroup the records, since you will most certainly mix several actions together.
The best way would be to try out both ways if you have the time ^^
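As a rough sketch of the grouping idea, assuming the readings are loaded into a pandas DataFrame with the columns shown in the question and are still in time order (not the shuffled training file), and using scikit-learn as one possible classifier:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file with the columns from the question, in time order.
df = pd.read_csv("readings.csv")

feature_cols = ["yaw", "pitch", "roll", "accelx", "accely", "accelz"]

# Group consecutive readings into windows of 10 and average each column to
# reduce noise; each window is labelled with its most common state.
window = 10
grouped = df.groupby(df.index // window)
X = grouped[feature_cols].mean()
y = grouped["state"].agg(lambda s: s.mode().iloc[0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("window accuracy:", clf.score(X_test, y_test))

On the running device you would collect the same 10-reading windows, average them, and feed them to clf.predict() to get a stream of predicted states to queue up.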

Slow Python loop to search data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this).
Edit:
My 'info' file looks like this (about 500 observations, one for each station).
My 'data' file looks like this (there are other variables not shown here; about 15 million observations, one for each travel).
What I am looking to get is that, when the station numbers match, the resulting data would look like this:
This is one solution. You could also use pandas.merge to add the two new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
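For reference, here is a hedged sketch of the pandas.merge route mentioned above, assuming the column names from the question; the suffixes keep the original latitude/longitude values for rows outside 2018 or without a matching station.

# merge the station coordinates onto data; info's columns get the '_info' suffix
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))
mask = merged['year'] == '2018'
# only overwrite the 2018 rows, falling back to the original values when no match was found
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info'].fillna(merged.loc[mask, 'latitude'])
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info'].fillna(merged.loc[mask, 'longitude'])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])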
This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing the datasets. Create subsets of the data that are optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should not use for loops and break out of them. Whenever you don't know precisely how many iterations you will need in the first place, you should use while or do...while loops instead.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is faster for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
The aim is to do everything in parallel. This relies on a concept named MapReduce.
The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well be suited to your situation. Feel free to read up on all of these concepts and frameworks, and leave your questions in the comments if needed.
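To make the idea concrete, here is a minimal, hedged PySpark sketch of the earlier station lookup expressed as a distributed join; the file names are placeholders, and this only illustrates how the work gets parallelized rather than being a recommendation for this particular dataset size.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("station-lookup").getOrCreate()

# Placeholder file names -- substitute your own datasets.
data = spark.read.csv("data.csv", header=True)
info = spark.read.csv("info.csv", header=True)

# The join is mapped over partitions and combined in parallel, replacing the
# nested Python loops. drop() silently ignores columns that do not exist.
enriched = (data.drop("latitude", "longitude")
                .join(info.select("station", "latitude", "longitude"),
                      on="station", how="left"))
enriched.write.mode("overwrite").csv("enriched_data", header=True)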

What is a viable approach for loading a large quantity of data originating in csv files?

Background
Ultimately I'm trying to load a large quantity of data into GBQ. Records exist for the majority of seconds across many dates.
I want to have a table for each day's data so I may query the set using the syntax:
_TABLE_SUFFIX BETWEEN '20170403' AND '20170408'...
Some of the columns will be nested, and sets of columns currently originate from a large number of csv files, each between 30 and 150 MB in size.
Approach
My approach to processing the files is to:
1) Create tables with correct schema
2) Fill each table with a record for each second in preparation for cycling through data and running UPDATE queries
3) Cycle through csv files and load data
All this is intended to be driven by the GBQ client library for python.
Step (2) may seem odd, but I think it's needed to simplify the loading, as I will then be able to guarantee that the UPDATE DML will work in all cases.
I do realise that (3) has little chance of working given the issues with (2), but I have run a similar operation at a slightly smaller scale leading me to believe that my approach might not be entirely hopeless at this larger scale.
I'm also left wondering whether this is the most appropriate method for loading data of this nature. Some of the options presented in the GBQ documentation seem like overkill for a one-time load, i.e. I would not want to learn Apache Beam unless I know I have to, given that my chief concern is analysing the dataset. But I'm not beyond learning something like Apache Beam if I know it's the only way of completing such a load, as it could come in handy in future.
Problem that I've run into
In order to run (2) I have three nested for loops cycling through each hour, minute and second, constructing a long string to go into the INSERT query:
(TIMESTAMP '{timestamp}', TIMESTAMP '2017-04-03 00:00:00', ... + ~86k times for each day)
GBQ doesn't allow you to run such a long query; there is a limit, from memory around 250k characters (I could be wrong). For that reason the length of the query is tested at each iteration, and if it exceeds a number slightly below 250k characters, the INSERT query is performed.
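A hedged sketch of the chunking pattern just described; the character limit is the one remembered in the question rather than a verified quota, the table name is the placeholder used in the query below, and how each statement is actually run is left to the client call of your choice.

from datetime import datetime, timedelta

MAX_QUERY_CHARS = 240_000  # stay a little under the assumed ~250k limit
PREFIX = "INSERT INTO `project.test_dataset.table_20170403` (timestamp) VALUES "

def chunked_insert_queries(day_start):
    """Yield INSERT statements for one day, each kept under the length limit."""
    values = []
    for second in range(86_400):
        ts = day_start + timedelta(seconds=second)
        values.append("(TIMESTAMP '{:%Y-%m-%d %H:%M:%S}')".format(ts))
        if len(PREFIX) + sum(len(v) + 2 for v in values) > MAX_QUERY_CHARS:
            yield PREFIX + ", ".join(values)
            values = []
    if values:
        yield PREFIX + ", ".join(values)

for sql in chunked_insert_queries(datetime(2017, 4, 3)):
    pass  # run each statement with the BigQuery client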
Although this worked for a similar type of operation involving more columns, when I run it this time the process terminates when GBQ returns:
"google.cloud.exceptions.BadRequest: 400 Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.."
When I reduce the size of the strings and run smaller queries, this error is returned less frequently, but it does not disappear altogether. Unfortunately, at this reduced query size the script would take too long to complete to be a feasible option for step (2). It takes >600 seconds, and the entire dataset would take in excess of 20 days to run.
I had thought that changing the query mode to 'BATCH' might overcome this, but it's been pointed out to me that this won't help. I'm not yet clear why, but I'm happy to take someone's word for it. Hence I've detailed the problem further, as it seems likely that my approach may be wrong.
This is what I see as the relevant code block:
from google.cloud import bigquery
from google.cloud.bigquery import Client
... ...
bqc = Client()
dataset = bqc.dataset('test_dataset')
job = bqc.run_async_query(str(uuid.uuid4()), sql)
job.use_legacy_sql = False
job.begin()
What would be the best approach to loading this dataset? Is there a way of getting my current approach to work?
Query
INSERT INTO `project.test_dataset.table_20170403` (timestamp) VALUES (TIMESTAMP '2017-04-03 00:00:00'), ... for ~7000 distinct values
Update
It seems as if the GBQ client library has had a not insignificant upgrade since I last downloaded it: Client.run_async_query() appears to have been replaced with Client.query(). I'll experiment with the new code and report back.
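For reference, a minimal sketch of what the newer call could look like, using the same placeholder project and dataset names as the query above; this is only an assumption about how the updated script would be wired up.

from google.cloud import bigquery

bqc = bigquery.Client()
sql = ("INSERT INTO `project.test_dataset.table_20170403` (timestamp) "
       "VALUES (TIMESTAMP '2017-04-03 00:00:00')")
job = bqc.query(sql)  # standard SQL is the default in the newer library
job.result()          # block until the job finishes and surface any errors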

Split large collection into smaller ones?

I have a collection that is potentially going to be very large. Now I know MongoDB doesn't really have a problem with this, but I don't really know how to go about designing a schema that can handle a very large dataset comfortably. So I'm going to give an outline of the problem.
We are collecting large amounts of data for our customers. Basically, when we gather this data it is represented as a 3-tuple, let's say (a, b, c), where b and c are members of sets B and C respectively. In this particular case we know that the B and C sets will not grow very much over time; for our current customers we are talking about ~200,000 members. However, the A set is the one that keeps growing: currently we are at about ~2,000,000 members per customer, but this is going to grow (possibly rapidly). Also, there are 1->n relations between b->a and c->a.
The workload on this dataset is basically split into 3 use cases. The collections will be periodically updated, where A will get the most writes, and B and C will get some, but not many. The second use case is random access into B, then aggregating over some number of documents in C that pertain to a given b ∈ B. The last use case is basically streaming a large subset of A and B to generate some new data.
The problem we are facing is that the indexes are getting quite big. Currently we have a test setup with about 8 small customers; the total dataset is about 15 GB in size at the moment, and the indexes are running at about 3 GB to 4 GB. The problem here is that we don't really have any hot zones in our dataset: it's basically going to get an evenly distributed load across all documents.
Basically we've come up with 2 options. The one I described above, where all data for all customers is piled into one collection; this means we'd have to create an index on some field that links the documents in that collection to a particular customer.
The other option is to throw all b's and c's together (these sets are relatively small) but divide up the C collection, one per customer. I can imagine this last solution being a bit harder to manage, but since we rarely access data for multiple customers at the same time, it would prevent memory problems: MongoDB would be able to load the customer's index into memory and just run from there.
What are your thoughts on this?
P.S.: I hope this wasn't too vague, if anything is unclear I'll go into some more details.
It sounds like the larger set (A, if I followed along correctly) could reasonably be put into its own database. I say database rather than collection because, now that 2.2 is released, you would want to minimize lock contention between the busier database and the others, and a separate database is the best way to do that (2.2 introduced database-level locking). That is looking at this from a single replica set model, of course.
Also the index sizes sound a bit out of proportion to your data size - are you sure they are all necessary? Pruning unneeded indexes, combining and using compound indexes may well significantly reduce the pain you are hitting in terms of index growth (it would potentially make updates and inserts more efficient too). This really does need specifics and probably belongs in another question, or possibly a thread in the mongodb-user group so multiple eyes can take a look and make suggestions.
If we look at it with the possibility of sharding thrown in, then the truly important piece is to pick a shard key that allows you to make sure locality is preserved on the shards for the pieces you will frequently need to access together. That would lend itself more toward a single sharded collection (preserving locality across multiple related sharded collections is going to be very tricky unless you manually split and balance the chunks in some way). Sharding gives you the ability to scale out horizontally as your indexes hit the single instance limit etc. but it is going to make the shard key decision very important.
Again, specifics for picking that shard key are beyond the scope of this more general discussion, similar to the potential index review I mentioned above.
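As a hedged illustration only, here is roughly what enabling sharding with a customer-scoped compound shard key could look like from Python; the database, collection and field names are hypothetical placeholders, and the actual key choice needs the kind of review described above.

from pymongo import MongoClient

# Connect to a mongos router in the sharded cluster (placeholder address).
client = MongoClient("mongodb://mongos-host:27017")

# Shard the large collection on a compound key so documents for the same
# customer stay together on the same shard (field names are hypothetical).
client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.a_collection",
    key={"customer_id": 1, "b_ref": 1},
)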

store a calculated value in the datastore or just calculate it on the fly

I'm writing an app in Python for Google App Engine where each user can submit a post, and each post has a ranking determined by its votes and comment count. The ranking is just a simple calculation based on these two parameters. I am wondering whether I should store this value in the datastore (and take up space there) or simply calculate it every time I need it. Just FYI, the posts will be sorted by ranking, so that needs to be taken into account.
I am mostly thinking about efficiency, trying to work out whether I should save datastore room or save read/write quota.
I would think it would be better to simply store it, but then I need to recalculate and rewrite the ranking value every time anyone votes or comments on the post.
Any input would be great.
What about storing the ranking as a property on the post? That would make sense for querying/sorting, wouldn't it?
If you store the ranking at the same time (meaning in the same entity) as you store the votes/comment count, then the only increase in write cost would be for the index (OK, the initial write cost too, but that's what, 2 writes? Very small anyway).
You need to do a database operation every time anyone votes or comments on the post anyway, right?! How else could you track votes/comments?
Actually, I imagine you will end up using text search to find data in the posts. If so, I would look into storing the ranking as a property in the search index and using it to rank matching results.
Don't we need to consider how you are selecting the posts to display? Is ranking by votes and comments the only criterion?
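A hedged sketch of storing the ranking as a property with the App Engine ndb library; the ranking formula here is made up purely for illustration, since the question only says it is a simple function of votes and comment count.

from google.appengine.ext import ndb

class Post(ndb.Model):
    votes = ndb.IntegerProperty(default=0)
    comment_count = ndb.IntegerProperty(default=0)
    # Recomputed automatically whenever the entity is written, and indexed,
    # so queries can sort on it. The formula is just an example.
    ranking = ndb.ComputedProperty(lambda self: self.votes * 2 + self.comment_count)

# Sorting posts by ranking for display:
top_posts = Post.query().order(-Post.ranking).fetch(20)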
Caching is most useful when the calculation is expensive. If the calculation is simple and cheap, you might as well recalculate as needed.
If you're depending on keeping a running vote count in an entity, then you either have to be willing to lose an occasional vote, or you have to use transactions. If you use transactions, you're rate limited as to how many transactions you can do per second. (See the doc on transactions and entity groups). If you're liable to have a high volume of votes, rate limiting can be a problem.
For a low rate of votes, keeping a count in an entity might work fine. But if you have any significant peaks in voting rate, storing separate Vote entities that periodically get rolled up into a cached count, perhaps adjusted by (possibly unreliable) incremental counts kept in memcache, might work better for you.
It really depends on what you want to optimize for. If you're trying to minimize disk writes by keeping a vote count cached non-transactionally, you risk losing votes.
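A hedged sketch of the transactional read-modify-write mentioned above, reusing the hypothetical Post model from the earlier example; the entity-group write rate limit described in the docs still applies to this pattern.

from google.appengine.ext import ndb

@ndb.transactional
def add_vote(post_key):
    # The read and write happen inside one transaction, so concurrent votes
    # are not silently lost; throughput per entity group is limited, though.
    post = post_key.get()
    post.votes += 1
    post.put()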
