I have run into a puzzling situation with xarray.
I need to find the daily maximum of an xarray object while retaining the hourly time step at which each maximum occurs; that information is lost with the resample functionality. Other libraries usually offer an option such as on="time" that would let you keep that field, but xarray does not seem to support it.
Do you know of an elegant way to do what I need?
I put a code snippet below:
import xarray as xr
obj = xr.open_dataset("/atmos.2019010100-2019123123.u_ref.nc") # can be any nc file
obj = obj.resample(time='1D').max()  # daily max, but the hour at which it occurred is lost
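One workaround is to compute the daily maximum and then, in a second pass, recover the hourly timestamp at which it occurred with idxmax. A minimal sketch, assuming a one-dimensional series and a variable name (u_ref) guessed from the file name:

import xarray as xr

ds = xr.open_dataset("/atmos.2019010100-2019123123.u_ref.nc")
da = ds["u_ref"]  # assumed variable name

# Daily maximum, as in the snippet above (the hourly stamp is dropped).
daily_max = da.resample(time="1D").max()

# idxmax returns, within each daily group, the time label at which the
# maximum occurs.
time_of_max = da.resample(time="1D").map(lambda x: x.idxmax("time"))

# Attach it as an extra coordinate so value and timestamp travel together.
daily_max = daily_max.assign_coords(time_of_max=("time", time_of_max.data))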
Related
Question Description
We perform a lot of timeseries queries, usually through an API (Python), and they sometimes fail completely because data is missing.
Because of this, we are not sure where to educate ourselves and find the answer to this specific question: how to deal with missing data in our timeseries (InfluxDB) database.
Example
To describe the problem with an example:
We have some timeseries data; say we measure the temperature of a room. We have many rooms, and sometimes sensors die or stop working for a week or two before we replace them, so the data for that timeframe is missing.
Now certain calculations fail. Say we want to calculate the average temperature per day: this fails because on some days we have no measurement input from the sensors.
One approach we thought of is to interpolate the data for those days: take the last and first available values and fill the gap with them.
This has many downsides, the major one being that the data is fake; you can't trust it, and for our more serious processes we would prefer not to store fake (interpolated) data.
We were wondering what the possible alternatives are, and where we can find resources to educate ourselves on this topic.
Answer
The idea is to fill the missing values, the gaps, with null/None data. This way we can use InfluxDB's built-in fill().
https://docs.influxdata.com/influxdb/cloud/query-data/flux/fill/
As in that example, we can fill null values and then perform any additional queries and analysis on the data.
The reference linked above covers all of the methods we can use to fill in the missing data values.
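For instance, with the influxdb-client Python package you can ask aggregateWindow to emit empty (null) windows and then fill them. A minimal sketch; the bucket and measurement names, URL, token, and org are assumptions:

from influxdb_client import InfluxDBClient

# createEmpty: true inserts null rows for windows with no data, which
# fill() then replaces (here with the previous value).
query = '''
from(bucket: "rooms")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 1d, fn: mean, createEmpty: true)
  |> fill(usePrevious: true)
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    tables = client.query_api().query(query)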
I'm asking for help/tips with system design.
I have an IoT system with sensors: PIR (motion), reed switches (magnetic contacts), temperature & humidity, ...
Nothing fancy.
I'm collecting and filtering the raw data to build some observations on top of it.
So far I have some event_rules classes that are bound to sensors and return True/False depending on the data constantly coming in from the queue (from the sensors).
I know I need to run periodic analyses on existing data (e.g. when motion sensors stop reporting) or on both incoming and existing data, which involves loading the data and analyzing it in some time window (counting, averaging, etc.).
That time-window approach could help answer questions like:
temperature increased by over 10 deg in the last 1 h, or no motion detected for the past 10 min, or high/low/no movement detected over the last 30 min.
My naive approach was to run a semi-cron Python thread that executes the rules one by one and checks each rule's output every N seconds, e.g. every 30 s. Some rules include a state machine and handle transitions from one state to another.
But this is really bad, IMHO: imagine the system scales up, and all of a sudden it is checking hundreds of rules every N seconds.
I know some generic approach is needed.
How should I tackle this case? What is the correct approach? In the microcontroller world I'd put it as: how do I properly generate the system clock that checks the rules, but not all at once, and in a configurable manner?
I'd be thankful for tips; maybe there are already some Python libraries that address this. I'm using pandas for the analyses and a state machine for the state transitions; the event rules are defined in an SQL database and cast to polymorphic Python classes based on the rule type.
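Regarding the every-N-seconds loop in the question: one way to spread the load is to give each rule its own interval and keep a priority queue ordered by next due time, so only rules that are actually due get evaluated. A minimal sketch, with a hypothetical Rule interface:

import heapq
import time

class Rule:
    def __init__(self, name, interval_s, check):
        self.name = name
        self.interval_s = interval_s  # each rule has its own cadence
        self.check = check            # callable returning True/False

def run(rules):
    # Heap of (next_due, tiebreaker, rule); pop whatever is due next.
    heap = [(time.monotonic() + r.interval_s, i, r) for i, r in enumerate(rules)]
    heapq.heapify(heap)
    while heap:
        due, i, rule = heapq.heappop(heap)
        time.sleep(max(0.0, due - time.monotonic()))
        if rule.check():
            print(f"{rule.name} fired")
        # Reschedule this rule for its own next slot.
        heapq.heappush(heap, (due + rule.interval_s, i, rule))

run([Rule("no_motion_10min", 30, lambda: False),
     Rule("temp_rise_1h", 300, lambda: False)])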
Using a pandas rolling window could be a solution (sources: pandas.pydata.org: Window, How to use rolling in pandas?).
In general this means:
Step 1:
Define a window based either on a number of rows (increasing index id) or on time (increasing timestamp)
Step 2:
Apply this window to the dataset
The code snippet below applies basic calculations (mean, min, max) to a dataframe and adds the results as new columns in the dataframe.
To keep the original dataframe clean, I suggest working on a copy:
import pandas as pd

df = pd.read_csv('[PathToDataSource]')
df_copy = df.copy()

# Rolling window of 10 rows; each statistic gets its own column.
df_copy['moving-average'] = df_copy['SourceColumn'].rolling(window=10).mean()
df_copy['moving-min'] = df_copy['SourceColumn'].rolling(window=10).min()
df_copy['moving-max'] = df_copy['SourceColumn'].rolling(window=10).max()
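The same idea works with a time-based window if the index is a DatetimeIndex; '30min' then uses the timestamps themselves, so irregular sampling is handled. The 'timestamp' and 'temp' column names below are assumptions:

df_time = df_copy.set_index(pd.to_datetime(df_copy['timestamp'])).sort_index()
df_time['temp-30min-mean'] = df_time['temp'].rolling('30min').mean()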
I've got a dataset with multiple time values as below.
Area,Year,Month,Day of Week,Time of Day,Hour of Day
x,2016,1,6.0,108,1.0
z,2016,1,6.0,140,1.0
n,2016,1,6.0,113,1.0
p,2016,1,6.0,150,1.0
r,2016,1,6.0,158,1.0
I have been trying to transform this into a single datetime object to simplify the dataset and to be able to do proper time-series analysis against it.
For some reason I have been unable to get the right outcome using Python's datetime library. Would anyone be able to point me in the right direction?
Update - Example of stats here.
https://data.pa.gov/Public-Safety/Crash-Incident-Details-CY-1997-Current-Annual-Coun/dc5b-gebx/data
I don't think there is a week column. Hmm. I wonder if I've missed something?
Any suggestions would be great. I'm really just looking to simplify this dataset, and maybe even create another table/sheet for the causes of the crashes, as there are a lot of superfluous columns taking up a lot of space that could be labeled with simple ints.
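Since the sample has Year, Month, and Hour of Day but no day-of-month column, the exact date cannot be reconstructed (Day of Week alone does not pin it down). A sketch that builds an approximate datetime by pinning the day to the 1st; the file name is a placeholder:

import pandas as pd

df = pd.read_csv('crashes.csv')  # hypothetical file name
parts = pd.DataFrame({
    'year': df['Year'],
    'month': df['Month'],
    'day': 1,  # no day-of-month column in the sample, so pin the 1st
    'hour': df['Hour of Day'].astype(int),
})
df['datetime'] = pd.to_datetime(parts)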
I am new to python pandas, and I am trying to find the strongest month within a given series of timestamped sales data. The question to answer for n products is: when is the demand for the given product the highest?
I am not looking for a complete solution but rather some ideas, how to approach this problem.
I already looked into seasonal_decompose to get some indication of seasonality, but I feel that this might be a bit too complicated an approach.
I don't have 50 reputation to add a comment, hence the answer. Some insight into the solution you need would be great, because your requirement isn't clear to me. That said, if you can split and load the time series data as timestamp and demand values, you can easily do this with regular Python methods like max(), then look up the timestamp at which the maximum demand occurred.
Well, I am not sure what your data looks like, so the answer may not help, but from what you said you are trying to find the month with the highest sales. Given the product, you will probably want to use pandas groupby on the month, which leaves you with a DataFrame grouped by month.
Imagine a DataFrame named Data:
months = np.array([1,2,3,4,5,6,7,8,9,10,11,12] * number_of_years)  # one month label per row
mean_buy = Data.groupby(months).mean()
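If the data carries a timestamp column instead, the helper array is unnecessary: derive the month from the timestamps and group on that. The 'timestamp' and 'units_sold' column names are assumptions:

import pandas as pd

Data['month'] = pd.to_datetime(Data['timestamp']).dt.month
demand_by_month = Data.groupby('month')['units_sold'].mean()
strongest_month = demand_by_month.idxmax()  # month with the highest average demand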
I have been using matplotlib.finance to pull stock information. quotes_historical_yahoo() is a really easy function to use, but it seems to only let me pull data at daily resolution.
Is there a way using matplotlib to pull stock values in intervals of 5 minutes?
If not, can I get a suggestion for some other Python software that will do what I want?
There are several sources of historical data at varying resolutions, but they don't go back very far. For example, you can only get ten days' worth of data at the 1-minute interval from Google Finance.
I use pandas for historical data via DataReader(), and read_csv() for the sources above (but that can get tricky, and you will need to write your own code to format some of them and make them useful).
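A minimal sketch of the DataReader() route, using the pandas-datareader package (the successor to the old pandas.io.data module); the ticker and source are examples, and data sources come and go, so substitute one that is still live:

import datetime as dt
import pandas_datareader.data as web  # pip install pandas-datareader

start = dt.datetime(2016, 1, 1)
end = dt.datetime(2016, 6, 1)
# 'yahoo' was a commonly used source when this answer was written.
daily = web.DataReader('AAPL', 'yahoo', start, end)
print(daily.head())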