Upsampling high rate data using pandas - python

Yes, you read that correctly. I want to upsample multi-hertz data by a factor of 5. I have a spreadsheet with dozens of columns and thousands of rows, and I would like to upsample all the data at once. So I cracked open the pandas web pages and tutorials. I read the CSV file in with pandas "read_csv", then used the floating-point seconds column to create a datetime-like column and set that as the index, since resampling seems to want one.
Then I tried both "resample" and "asfreq" on it, e.g. df.resample("Xms").sum() and df.asfreq(freq="Xms"), where X is the number of milliseconds I want to upsample to. "resample" filled all the rows in between with zeros, and "interpolate" won't touch those. "asfreq" keeps my first data row, followed by rows of NaNs, but my subsequent data rows seem to have disappeared! Note that the floating-point seconds values were not necessarily on clean Xms boundaries. And yet when I interpolate that data it becomes meaningful again (albeit, for some reason, it only gave me the first 25k points). I have no idea how...
I note with dismay that all of the examples I can find for these functions deal with data taken over hours, days, weeks, months, years... so I'm beginning to think this isn't the right way to go about it. Does anyone have tips to help me understand what I'm seeing / how to proceed? Thanks.
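For reference, here is a minimal sketch of one way to upsample irregular high-rate data, assuming a float-seconds column named t_sec, a file named data.csv, and a 20 ms target step as a stand-in for the "Xms" in the question (all of these names are assumptions). Interpolating on the union of the original timestamps and the new regular grid avoids the way asfreq silently drops samples that do not land exactly on the grid:

import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical file name
df.index = pd.to_datetime(df["t_sec"], unit="s")  # float seconds -> DatetimeIndex (assumes unique timestamps)

target = pd.date_range(df.index[0], df.index[-1], freq="20ms")  # placeholder for the "Xms" target
upsampled = (
    df.reindex(df.index.union(target))  # keep original samples, add the grid points
      .interpolate(method="time")       # time-weighted linear interpolation
      .reindex(target)                  # keep only the regular grid
)

A resample-based variant such as df.resample("20ms").mean().interpolate() also works, but it snaps each original sample to the edge of its bin instead of keeping its exact timestamp.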

Related

InfluxDB: How to deal with missing data?

Question Description
We perform a lot of time-series queries, usually through an API (Python), and they sometimes fail completely because data is missing.
Because of this, we are not sure where to educate ourselves and find an answer to this specific question: how do we deal with missing data in our time-series (InfluxDB) database?
Example
To describe the problem with an example...
We have some time-series data; let's say we measure the temperature of a room. We have many rooms, and sometimes sensors die or stop working for a week or two before we replace them, so the data for that timeframe is missing.
Now we try to perform certain calculations and they fail. Let's say we want to calculate the average temperature for each day; this fails because on some days we have no measurement input from the sensors.
One approach we thought of is to simply interpolate the data for those days: use the last and the first available values and fill the days with no data in between.
This has many downsides, the major one being that the data is fake: you can't trust it, and for our more serious processes we would prefer not to store fake (interpolated) data.
We were wondering what the possible alternatives are, and where we can find resources to educate ourselves on this topic.
Answer
The idea is to fill the gaps, the missing values, with null/None data. That way we can use InfluxDB's built-in fill.
https://docs.influxdata.com/influxdb/cloud/query-data/flux/fill/
As in the linked example, we can fill the null values and then perform any additional queries and analysis on the data.
The reference above covers the methods available for filling in missing data values.
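As a rough illustration only (the bucket, measurement, and connection details below are made-up placeholders), a Flux query run through the Python client can combine aggregateWindow(), which emits empty (null) windows, with fill(usePrevious: true) so that daily averages are still produced across gaps:

from influxdb_client import InfluxDBClient  # pip install influxdb-client

query = '''
from(bucket: "rooms")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 1d, fn: mean, createEmpty: true)
  |> fill(usePrevious: true)
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    # Returns the daily means as a pandas DataFrame; days with no readings
    # carry the last known value instead of failing the calculation.
    df = client.query_api().query_data_frame(query)

fill(value: ...) is the alternative when carrying the previous value forward is not acceptable; the linked documentation walks through both options.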

How to deal with consecutive missing values of stock price in a time series using python?

I have a data frame consisting of two time series describing two different stock prices, spanning five years at an interval of approximately 2 minutes. I am struggling to decide how to deal with the missing values in order to build a meaningful model.
Some info about the data frame:
Total number of rows: 1315440
Number of missing values in Series_1: 1113923
Number of missing values in Series_2: 378952
Often the missing values span 100+ consecutive rows, which is what makes me unsure how to deal with this dataset.
Below is a portion of the data, with plots of Series_1 (column 2) and Series_2 (column 3). [Data sample and visualisations of Series_1 and Series_2 not included.]
Any advice would be appreciated. Thanks.
Depending on where your data come from, missing data at a given time may mean that at that particular timestamp an order was executed for one of the two stocks but not for the other. In fact, there is no reason two different stocks should trade at exactly the same time. Certain dormant stocks with no liquidity can go a long time without being traded, while others are more active. Moreover, given that the precision of the data is down to the microsecond, it is no surprise that trades on both stocks do not necessarily happen at the exact same microsecond. In such cases it is reasonable to assume that the price of the stock is that of the last recorded transaction, and to fill the missing values accordingly. Assuming you are using pandas, you can harmonize the two series with a forward fill. Just make sure to sort your data frame beforehand:
# Sort by time so the forward fill carries the last recorded trade forward.
df = df.sort_values('Time')
# fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent.
df['Series1'] = df['Series1'].ffill()
df['Series2'] = df['Series2'].ffill()

How to incrementally save a pandas dataframe to a file?

I have a Python program that controls some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). A single run of this program can last as long as one week, so the resulting dataframe is large.
What are safe and correct ways to store this data? By safe I mean that if something fails on day 6, I still have all the data from days 1-6. By correct I mean not re-writing the whole dataframe to a file on every loop iteration.
My current solution is a CSV file to which I just write each row manually. This is both safe and correct, but the problem is that CSV does not preserve data types and also takes up more space. So I would like to know if there is a binary alternative. I like the Feather format as it is really fast, but it does not allow appending rows.
I can think of two easy options:
Store chunks of data (e.g. every 30 seconds, or whatever suits your use case) in separate files; you can then post-process them back into a single dataframe.
Store each row in an SQL database as it comes in (a short sketch follows below). SQLite will likely be a good start, but I would really go for PostgreSQL. That's what databases are meant for, after all.
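A minimal sketch of the second option, assuming a made-up database file, table name, and column names (none of which come from the question); pandas' to_sql with if_exists="append" writes each row as it arrives, so everything stored so far survives a crash later in the run:

import sqlite3
import pandas as pd

con = sqlite3.connect("run_data.db")  # hypothetical database file

def store_row(row: dict) -> None:
    # Append a single row; the insert is committed, so previously written
    # data is safe even if the program dies mid-run.
    pd.DataFrame([row]).to_sql("measurements", con, if_exists="append", index=False)

store_row({"t": 0.05, "temperature": 21.3, "pressure": 1.01})

If per-row inserts ever become a bottleneck, buffering a few seconds' worth of rows and appending them in one to_sql call keeps the same safety property with far fewer transactions.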

Undersampling a large dataset under a specific condition applied to another column in python/pandas

I'm currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike pandas' resample method, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is very clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but of course it can only compute the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variations are largest.
Thanks in advance for any suggestions on how to proceed with this resampling.
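Not a pandas one-liner, but a single pass over plain NumPy arrays is usually far faster than iterating over DataFrame rows with iterrows; here is a minimal sketch, assuming the columns are literally named time (in seconds) and temperature (in °C), which is a guess at the real column names:

import pandas as pd

def undersample(df: pd.DataFrame, dt: float = 1.0, dtemp: float = 2.0) -> pd.DataFrame:
    # Keep a row whenever the elapsed time or the temperature change since the
    # last kept row reaches the threshold.
    t = df["time"].to_numpy(dtype=float)
    temp = df["temperature"].to_numpy(dtype=float)
    keep = [0]                          # always keep the first row
    last_t, last_temp = t[0], temp[0]
    for i in range(1, len(t)):
        if (t[i] - last_t) >= dt or abs(temp[i] - last_temp) >= dtemp:
            keep.append(i)
            last_t, last_temp = t[i], temp[i]
    return df.iloc[keep]

Because the reference point updates every time a row is kept, the selection is inherently sequential, which is why a purely vectorised diff or pct_change approach does not map onto it directly.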

Standardizing GPX traces

I have two GPX files (from a race I ran twice, obtained via the Strava API) and I would like to be able to compare the effort across both. However, the sampling frequency is irregular (i.e. data is not recorded every second, or every meter), so a straightforward comparison is not possible and I need to standardize the data first. Preferably, I would resample the data so that I have a data point every 10 meters, for example.
I'm using pandas, so I'm currently standardizing a single file by inserting rows every 10 meters and interpolating the heart rate, duration, lat/lng, etc. from the surrounding data points. This works, but doesn't make the data comparable across files, as the recordings do not start at exactly the same location.
An alternative is to first standardize the course coordinates using something like geohashing and then map both efforts onto this standardized course. Since coordinates cannot easily be sorted, however, I'm not sure how to do that correctly.
Any pointers are appreciated, thanks!
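For the resampling part, here is a minimal sketch assuming each trace has a cumulative, strictly increasing distance column in metres (starting near 0) plus numeric columns such as heartrate, lat and lng; all of these column names are assumptions. Putting both runs onto the same 0, 10, 20, ... metre grid at least makes row i of one file correspond to row i of the other:

import numpy as np
import pandas as pd

def resample_by_distance(df: pd.DataFrame, step: float = 10.0) -> pd.DataFrame:
    grid = np.arange(0.0, df["distance"].iloc[-1], step)
    indexed = df.set_index("distance")
    # Insert the grid points, interpolate against the distance index,
    # then keep only the grid points themselves.
    resampled = (
        indexed.reindex(indexed.index.union(grid))
               .interpolate(method="index")
               .loc[grid]
    )
    return resampled.reset_index()

Aligning the two traces onto a shared course (the geohashing idea) is a separate problem; this only removes the irregular sampling, so the comparison still depends on both grids being measured from the same start line.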
