interpolate values between sample years with Pandas - python

I'm trying to get interpolated values for the metric shown below using Pandas time series.
test.csv
year,metric
2020,290.72
2025,221.763
2030,152.806
2035,154.016
Code
import pandas as pd
df = pd.read_csv('test.csv', parse_dates={'Timestamp': ['year']},
index_col='Timestamp')
As far as I understand, this gives me a time series with January 1 of each year as the index. Now I need to fill in values for the missing years (2021, 2022, 2023, 2024, 2026, etc.).
Is there a way to do this with Pandas?

If you're using a newer version of Pandas, your DataFrame object should have an interpolate method that can be used to fill in the gaps.

It turns out that interpolation only fills in values where there are none (NaN). In my case above, I first had to re-index so that there is one row every 12 months.
# reindex with interval 12 months (M: month, S: beginning of the month)
df_reindexed = df.reindex(pd.date_range(start='20200101', end='20350101', freq='12MS'))
# method=linear works because the intervals are equally spaced out now
df_interpolated = df_reindexed.interpolate(method='linear')
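For reference, here is a minimal self-contained sketch of the same reindex-then-interpolate idea. It assumes the sample data from the question, read from an in-memory string instead of test.csv; the names csv_text, yearly_index and yearly are only illustrative.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without test.csv.
csv_text = """year,metric
2020,290.72
2025,221.763
2030,152.806
2035,154.016
"""

df = pd.read_csv(io.StringIO(csv_text))

# Turn the integer year into a January 1 timestamp and use it as the index.
timestamps = pd.to_datetime(df["year"].astype(str), format="%Y")
df = df.set_index(timestamps).drop(columns="year")

# One row per year; years without a sample become NaN after reindexing.
yearly_index = pd.date_range(start="2020-01-01", end="2035-01-01", freq="12MS")
yearly = df.reindex(yearly_index).interpolate(method="linear")

print(yearly)
```

Note that method='linear' ignores the index values and treats rows as equally spaced, which is fine here because the reindexed rows are exactly one year apart.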

Related

How to slice a pandas DataFrame between two dates (day/month) ignoring the year?

I want to filter a pandas DataFrame with a DatetimeIndex spanning multiple years, keeping the rows between the 15th of April and the 16th of September. Afterwards I want to set a value using the mask.
I was hoping for a function similar to between_time(), but this doesn't exist.
My current solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
    # normal approach
    # df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
    # similar approach, slightly faster
    df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1]=1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and September 30th, what about using the month?
df.loc[df.index.month.isin(range(4, 10)), 'target'] = 1
If you want to match any date range while ignoring the year, you can replace the year with 2000 (a leap year) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2000-09-16'), 'target'] = 1
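As a possible alternative to the string round trip, the month and day can be combined into a single integer (month * 100 + day) and compared directly. This is a sketch, not the answer's method: it assumes a daily index rather than the hourly one in the question, and the names month_day and mask are only illustrative.

```python
import pandas as pd

# Daily index over two full years (the question uses hourly data).
df = pd.DataFrame({"target": 0},
                  index=pd.date_range("2020-01-01", "2022-01-01", freq="D"))

# Encode month and day as one integer, e.g. April 15 -> 415, September 16 -> 916.
month_day = df.index.month * 100 + df.index.day

# Inclusive range April 15 to September 16, independent of the year.
mask = (month_day >= 415) & (month_day <= 916)
df.loc[mask, "target"] = 1

print(df["target"].sum())
```

Because the comparison is purely numeric, it avoids both the loop over years and the strftime call.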

Creating a matplotlib line graph using datetime objects while ignoring the year value

I have a dataset of highest and lowest temperatures recorded for each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year for this period (so there will be only one max and min temperature for each day plotted). I was able to create a df from the data set of the absolute min and maxs for each day, here's the example of the max:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
                              ['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot based on month/day, disregarding the year so it's in order. My thought was that I could do this by changing the year to be the same for every data point (as it won't be data that will be in the final graph anyway) and this is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
but I got this error:
ValueError: day is out of range for month
I have also tried to take my separate Day, Month, Year columns that I used to group by, include those with the max_temps df, change the year, and then move those all to a new column and convert them to a datetime object, but I get a similar error
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime(max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
As @Jody Klymak pointed out, max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005)) fails because your full dataset probably contains a leap year, so February 29th is included. When you set the year to 2005, pandas tries to create the date 2005-02-29, which raises ValueError: day is out of range for month. You can fix this by choosing the leap year 2004 instead of 2005.
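A small sketch of that behaviour, using three made-up dates (one of them a February 29th) rather than the real temperature data:

```python
import pandas as pd

# Hypothetical dates; the first one only exists in leap years.
dates = pd.Series(pd.to_datetime(["2008-02-29", "2005-03-01", "2012-12-31"]))

# Moving everything to a non-leap year fails on February 29.
try:
    dates.apply(lambda ts: ts.replace(year=2005))
    failed_2005 = False
except ValueError:
    failed_2005 = True

# Moving everything to a leap year works for every calendar day.
same_year = dates.apply(lambda ts: ts.replace(year=2004))

print(failed_2005)
print(same_year)
```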
My solution would be to disregard the year entirely and create a new column that holds the month and day in the format "01-01". Since the month comes first and both parts are zero-padded, sorting these strings puts them in calendar order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
## use string formatting to create a new column with Month-Day
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()

Python: slice yearly data between February and June with pandas

I have a dataset with 10 years of data from 2000 to 2010. The initial datetime is 2000-01-01, and the data is resampled to daily. I also have a weekly counter, so when I apply the slice() function I only ask for week 5 to week 21 (February 1 to May 30).
I am a little stuck with how I can slice it every year, does it involve a loop or is there a timeseries function in python that will know to slice for a specific period in every year? Below is the code I have so far, I had a for loop that was supposed to slice(5, 21) but that didn't work.
Any suggestions how might I get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (e.g. the mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
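The two steps can be tried end to end with the sketch below. It assumes synthetic daily data for 2000 through 2009 and a hypothetical column named value, and it derives the Week column from isocalendar(), which may differ from how the question computed it.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the resampled series in the question.
index = pd.date_range("2000-01-01", "2009-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(index), dtype=float)}, index=index)

# ISO week number for every day.
df["Week"] = df.index.isocalendar().week.astype("int64")

# Keep weeks 5 to 21 of every year, then aggregate per calendar year.
subset = df.loc[df["Week"].between(5, 21)]
yearly_mean = subset.groupby(subset.index.year)["value"].mean()

print(yearly_mean)
```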

"groupby" method throwing "Length of values does not match length of index"

I have hourly data from 2013-2019 that measures wind speed from a weather station every hour. I'd like to group that by year and graph each year (see code below). The only thing is that 2013 starts in September and 2016 was a leap year, so I think the reason I'm getting the error is the uneven number of data points per year. Would I be right on this? How might I work around it?
# create stacked lined plot
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('2013-2019 MR Wind.csv', header=0, index_col=0,
                  parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values  # code fails here
years.plot(subplots=True, legend=False)
pyplot.show()
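One possible workaround, sketched here under assumptions rather than taken from an answer: assigning arrays of different lengths as columns of one DataFrame raises this error, so each year can instead be kept as a Series with a fresh positional index and pandas can pad the shorter years with NaN. The data below is synthetic daily data (not the hourly wind file), and the name wind_speed is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic series: a partial first year and a leap year, as in the question.
index = pd.date_range("2013-09-01", "2016-12-31", freq="D")
series = pd.Series(np.arange(len(index), dtype=float), index=index, name="wind_speed")

# One column per year; resetting the index makes the columns align by position,
# and the DataFrame constructor fills missing positions with NaN.
years = pd.DataFrame({
    year: group.reset_index(drop=True)
    for year, group in series.groupby(series.index.year)
})

print(years.shape)
```

The resulting frame can then be passed to years.plot(subplots=True, legend=False) as in the question.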

Pandas: extend resampling window

Let's say I have a 'low-frequency' series with data points every 2 hours, which I'd like to upsample to a 1-hour freq.
Is it possible in the code snippet below to have the high-freq signal have 24 rows (instead of 23)? More precisely, I'd like the new index to range from 00:00 to 23:00 with a NaN value (instead of stopping at 22:00).
I've played quite a bit with the several options but I still couldn't find out a clean way to do it.
import pandas as pd
import numpy as np
low_f = pd.Series(np.random.randn(12),
                  index=pd.date_range(start='01/01/2017', freq='2H', periods=12),
                  name='2H').cumsum()
high_f = low_f.resample('1H').mean()
print(high_f.tail(1).index)
#Yields DatetimeIndex(['2017-01-01 22:00:00'], dtype='datetime64[ns]', freq='H')
#I'd like DatetimeIndex(['2017-01-01 23:00:00'], dtype='datetime64[ns]', freq='H')
#(w/ 24 elements)
You can use the DatetimeIndex.shift method to shift the dates forward by 1 hour, then take the union of the old index and the newly shifted index.
Finally, reindex the series on this combined index. As the series has no value at the new last timestamp, that entry is filled with NaN, the default fill_value of reindex.
high_f.reindex(high_f.index.union(high_f.index.shift(1, 'H')))
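A self-contained sketch of the same idea follows. It assumes deterministic values instead of random ones and passes pd.Timedelta objects rather than the 'H' frequency strings, since the accepted string aliases vary between pandas versions; the names one_hour, two_hours and extended are only illustrative.

```python
import numpy as np
import pandas as pd

two_hours = pd.Timedelta(hours=2)
one_hour = pd.Timedelta(hours=1)

# Low-frequency series: 12 points, every 2 hours, from 00:00 to 22:00.
low_f = pd.Series(np.arange(12, dtype=float),
                  index=pd.date_range("2017-01-01", periods=12, freq=two_hours),
                  name="2H")

# Upsampling stops at the last original timestamp (22:00), giving 23 rows.
high_f = low_f.resample(one_hour).mean()

# Union of the index with a copy shifted by one hour adds the 23:00 slot.
extended_index = high_f.index.union(high_f.index.shift(1, freq=one_hour))
extended = high_f.reindex(extended_index)

print(extended.tail(2))
```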
