Pandas interpolate does not work on hourly time series data - python

I have the plot below of temperature from a time series aggregated hourly.
What I am trying to do is interpolate the missing values between 2019 and 2020 using pandas' interpolate(), and generate hourly results (the same frequency as the rest of the data in weather_data). My data is called weather_data, the index column is called date_time (dtype is float64), and the temperature column also has dtype float64. Here is what I have tried:
test = weather_datetime_index.temperature.interpolate("cubicspline")
test.plot()
This gave the same plot. I also tried (based on this post):
interpolated_temp = weather_datetime_index["temperature"].astype(float).interpolate(method="time")
This still gave the same plot.
I also tried (as per this post):
test = weather_datetime_index.temperature.interpolate("spline",limit_direction="forward", order=1)
test.plot()
but this still gave me the same plot.
How can I interpolate this data using pd.interpolate?
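One thing worth ruling out, offered as a sketch rather than a diagnosis: interpolate() only fills values that are NaN. If the rows between 2019 and 2020 are absent from the index altogether (rather than present with NaN temperatures), there is nothing to fill and the plot stays the same. The example below uses a synthetic series, and the names temps, gappy, full_index and filled are hypothetical. It also assumes the index is a real DatetimeIndex, which method="time" requires.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data: hourly temperatures where a
# block of timestamps is missing entirely (no NaN rows, just absent rows).
idx = pd.date_range("2019-12-30", periods=96, freq=pd.Timedelta(hours=1))
temps = pd.Series(10 + np.sin(np.arange(96) / 6.0), index=idx, name="temperature")
gappy = temps.drop(idx[30:60])

# With no NaN values present, interpolate() returns the series unchanged.
unchanged = gappy.interpolate(method="time").equals(gappy)

# Reindex onto a complete hourly grid so the gap becomes NaN rows,
# then interpolate across it.
full_index = pd.date_range(gappy.index.min(), gappy.index.max(),
                           freq=pd.Timedelta(hours=1))
with_nans = gappy.reindex(full_index)
filled = with_nans.interpolate(method="time")
```

If the real index is numeric (the question mentions float64), it would first need converting with pd.to_datetime before this applies.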

Related

Xarray dataset to Pandas dataframe too slow with new xarray library updates

I am trying to convert an xarray Dataset to a Pandas dataframe. It used to take minutes, but after an xarray library update, it takes hours.
I read in a list of 40 large netcdf datasets (each dataset is 1GB, totaling 40GB) using the command:
with xr.open_mfdataset(infile_list, combine='by_coords') as ds:
The data has lon, lat, day, crs dimensions. I select a single lat and lon slice using the command:
station = ds.sel(lon=-84.7250, lat=42.3583, method='nearest')
I try to convert this slice to a pandas dataframe. The output dataframe I expect would have 8 data variables.
df = station.to_dataframe()
I am able to convert a single data variable from the slice to a Pandas series in 1.5 minutes using the command:
df = station["wind_speed"].to_series()
Do I need to change how I read in the large datasets? Or is there a workaround to get to a pandas dataframe faster?

Matplotlib plot plotting the wrong data values

I am trying to plot random rows in a dataset, where the data consists of data collated across different dates. I have plotted it in such a way that the x-axis is labelled for the specific dates, and there is no interpolation between dates.
The issue I am having is that the values plotted by matplotlib do not match the values in the dataset. I am unsure what is happening here. Would anyone be able to provide some insight, and possibly suggest how I would fix it?
I have attached an image of the dataset and the plot, with the code contained below.
The code for generating the x-ticks, is as follows:
In: #creating a flat dates object such that dates are integer objects
flat_Dates_dates = flat_Dates[2:7]
flat_Dates_dates
Out: [20220620, 20220624, 20220627, 20220701, 20220708]
In: #creating datetime object(pandas, not datetime module) to only plot specific dates and remove interpolation of dates
date_obj_pd = pd.to_datetime(flat_Dates_dates, format=("%Y%m%d"))
Out: DatetimeIndex(['2022-06-20', '2022-06-24', '2022-06-27', '2022-07-01',
'2022-07-08'],
dtype='datetime64[ns]', freq=None)
As you can see from the dataset, the plotted trends should not take that form: the data values are wildly different from where they should be on the graph.
Edit: Apologies, I forgot to mention x = date_obj_pd - which is why I added the code, essentially just the array of datetime objects.
y is just the name of the pandas DataFrame (data table) I have included in the image.
You are plotting columns instead of rows. The blue line contains elements 1:7 from the first column, namely these:
If you transpose the dataframe you should get the desired result:
plt.plot(x, y[1:7].transpose(), 'o--')
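To illustrate why the transpose matters, here is a minimal sketch with a hypothetical 5x5 table (not the data from the question). When y is two-dimensional, plt.plot draws one line per column, so rows only become lines after transposing. The Agg backend is an assumption made so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = pd.to_datetime([20220620, 20220624, 20220627, 20220701, 20220708],
                   format="%Y%m%d")

# Hypothetical table: one row per series, one column per date.
y = pd.DataFrame(np.arange(25.0).reshape(5, 5))

fig, (ax_cols, ax_rows) = plt.subplots(1, 2)
col_lines = ax_cols.plot(x, y.to_numpy(), "o--")    # one line per column
row_lines = ax_rows.plot(x, y.to_numpy().T, "o--")  # one line per row

# Values carried by the first line in each panel.
first_col_line = np.asarray(col_lines[0].get_ydata(), dtype=float)
first_row_line = np.asarray(row_lines[0].get_ydata(), dtype=float)
```

Here first_col_line holds the first column of the table, while first_row_line holds its first row, which is what the question expected to see.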

how to add to plot gaps when observations are missed?

Here is what I have (a time series) in a pandas DataFrame:
screenshot
(the dates were converted from timestamps)
My goal is to plot not only the observations but the whole range of dates. I need to see a horizontal line or a gap where there are no new observations.
Dealing with data that is not observed equidistant in time is a typical challenge with real-world time series data. Given your problem, this code should work.
from datetime import datetime
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# sample Frame
df = pd.DataFrame({'time' : ['2022,7,3,0,1,21', '2022,7,3,0,2,47', '2022,7,3,0,2,47', '2022,7,3,0,5,5',
'2022,7,3,0,5,5'],
'balance' : [12.6, 12.54, 12.494426, 12.482481, 12.449206]})
df['time'] = pd.to_datetime(df['time'], format='%Y,%m,%d,%H,%M,%S')
# aggregate time duplicates by mean
df = df.groupby('time').mean()
df.reset_index(inplace=True)
# pick equidistant time grid
df_new = pd.DataFrame({'time' : pd.date_range(start=df.loc[0]['time'], end=df.loc[2]['time'], freq='S')})
df = pd.merge(left=df_new, right=df, on='time', how='left')
# fill nan
df['balance'] = df['balance'].ffill()  # same effect as fillna(method='pad'), without the deprecated chained inplace call
df.set_index("time", inplace=True)
# plot
_ = df.plot(title='Time Series of Balance')
There are several caveats to this solution.
First, your data has a high temporal resolution (seconds), yet there are hours-long gaps between observations. You can either coarsen the timestamps by rounding (e.g. to minutes or hours) or keep the second-by-second resolution and accept that most of your balance values will be filled-in values rather than true observations.
Second, you have different balance values for the same timestamp which indicates faulty entries or a misspecified timestamp. I unified those entries via grouping by timestamp and averaged the balance over those non-unique timestamps.
Third, filled-in gaps and true observations have the same visual representation in the plot (blue dots in the graph). Commenting out the fill step (the line under "# fill nan") would show only the true observations and leave everything in between blank.
Finally, the missing values are merely filled in via padding. Look up different values of the argument method in the documentation in case you want to linearly interpolate etc.
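As a rough illustration of that last caveat, the sketch below contrasts padding with time-weighted linear interpolation on a small made-up series. The values and the names obs, grid, padded and linear are hypothetical, not taken from the question's data.

```python
import pandas as pd

# Hypothetical observations at irregular seconds.
obs = pd.Series(
    [12.0, 10.0, 13.0],
    index=pd.to_datetime([
        "2022-07-03 00:00:00",
        "2022-07-03 00:00:04",
        "2022-07-03 00:00:10",
    ]),
    name="balance",
)

# Equidistant one-second grid spanning the observations.
grid = pd.date_range(obs.index[0], obs.index[-1], freq=pd.Timedelta(seconds=1))
on_grid = obs.reindex(grid)

padded = on_grid.ffill()                      # hold the last observed value
linear = on_grid.interpolate(method="time")   # straight line between observations
```

For a balance that only changes when an event occurs, padding is arguably the more faithful choice. Interpolation suits quantities that vary continuously between measurements.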
Summary
The problems described above are typical for event-driven time series data. Since you are dealing with a (financial) balance, a state that is only changed by events (orders), I believe the assumptions made above are reasonable and can be adjusted easily for your use case or many others.
This helped:
data = data.set_index('time').resample('1M').mean()

pandas Identify cycles in a timeseries data

I have timeseries data and I want to identify cycles and duration of each cycle.
The datetime index does not have a frequency (there is no fixed time step between data points).
I tried to decompose the series using seasonal_decompose from statsmodels.tsa.seasonal, but I got the following error:
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
First, resample your processed DataFrame to a regular frequency and store the result in a variable, say y (make sure the index is of datetime type).
Then pass that variable to seasonal_decompose.
Example:
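import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# resample to month-start frequency so the index gets a freq
y = df['Sales'].resample('MS').mean()
x = seasonal_decompose(y)
# plot the decomposed data
x.plot()
# show the plot
plt.show()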
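A small sketch of why the resampling step helps, using made-up daily sales (the series and its values are hypothetical): an irregular DatetimeIndex has freq None, while the resampled one carries a freq. Empty bins come back as NaN, which the sketch fills by interpolation as one possible choice before any decomposition.

```python
import pandas as pd

# Hypothetical irregularly spaced observations.
stamps = pd.to_datetime([
    "2023-01-01 03:00", "2023-01-01 17:30", "2023-01-02 08:15",
    "2023-01-04 12:00", "2023-01-05 09:45", "2023-01-07 22:10",
])
sales = pd.Series([10.0, 14.0, 11.0, 9.0, 12.0, 15.0], index=stamps)

irregular_freq = sales.index.freq   # None: no fixed step between points

# Daily means on a regular grid; days without data become NaN.
daily = sales.resample("D").mean()

# Fill the empty days (an assumption: linear interpolation is acceptable here).
y = daily.interpolate()

regular_freq = y.index.freq         # a daily offset, no longer None
```

A series like y could then be handed to seasonal_decompose. If the cycle length is known, passing it explicitly through the period argument named in the error message is another option.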

Why are my histogram bars all displaying frequencies of 1

I have a series (114 rows) with indexed timestamps and percentages (astype float).
testseries.head()
Out[100]:
Timestamps
2018-04-19 13:23:57-04:00 0.000161238
2018-04-06 13:59:50-04:00 -0.0169348
2018-04-04 11:39:41-04:00 0.0475188
2018-04-03 14:53:37-04:00 -0.00231244
2018-03-29 14:09:57-04:00 0.0209815
Name: Change, dtype: object
I'm trying to create a histogram of the distribution of these, as I've done several times before, but am getting an unexpected result when I call
testseries.hist()
link to image of output hist
I've tried various options, like setting density=True, changing the number of bins, or plotting in matplotlib vs. pandas, but the result is always a series of thin bars with height equal to the maximum on the y-axis.
What's causing this?
The histogram is correctly showing you that each value appears once. To show something smoother, you might bin the values into intervals with pd.cut, aggregate within each bin, and display the histogram of the result:
testseries.groupby(pd.cut(testseries.astype(float), 10)).sum().hist()
Example
import pandas as pd
import numpy as np
testseries = pd.Series(np.random.randn(100000))
testseries.groupby(pd.cut(testseries.astype(float), 10)).sum().hist();
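Separately, the head() output in the question reports dtype: object, so it may be worth checking whether the values are stored as Python objects rather than floats. Whether that explains the plot in this case is an assumption, not something the post confirms. A minimal sketch with a hypothetical stand-in series (raw, numeric, counts and edges are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: numeric values stored with dtype object.
raw = pd.Series(
    [0.000161238, -0.0169348, 0.0475188, -0.00231244, 0.0209815],
    dtype=object,
    name="Change",
)

# Convert to a real float dtype before binning or plotting.
numeric = pd.to_numeric(raw)

# Bin counts as a histogram would compute them on the converted data.
counts, edges = np.histogram(numeric, bins=3)
```

Calling numeric.hist(bins=3) would plot those same bin counts.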
