"groupby" method throwing "Length of values does not match length of index" - python

I have hourly wind speed data from a weather station covering 2013-2019. I'd like to group it by year and graph each year (see code below). The only thing is that the 2013 data starts in September and 2016 was a leap year, so I think the reason I'm getting the error is the uneven number of data points per year. Would I be right on this? How might I work around it?
# create stacked lined plot
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('2013-2019 MR Wind.csv', header=0, index_col=0,
                  parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values #code fails here
years.plot(subplots=True, legend=False)
pyplot.show()
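A likely cause, assuming the error comes from the line marked in the code: each yearly group holds a different number of hourly readings (2013 starts in September, 2016 has an extra day), and assigning an array to a DataFrame column requires its length to match the existing index. One possible workaround, sketched here on synthetic data rather than the original CSV, is to collect each year as a Series with a positional index and let pd.concat pad the shorter years with NaN. The generated data and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings standing in for the wind-speed CSV (assumption).
index = pd.date_range("2013-09-01", "2016-12-31 23:00", freq=pd.Timedelta(hours=1))
series = pd.Series(np.arange(len(index), dtype=float), index=index)

columns = {}
for year, group in series.groupby(series.index.year):
    # Drop the datetime index so every year lines up on positions 0..n-1.
    columns[int(year)] = group.reset_index(drop=True)

# Outer-joins on position: shorter years are padded with NaN instead of raising.
years = pd.concat(columns, axis=1)
print(years.shape)
```

With a frame built this way, years.plot(subplots=True, legend=False) should draw one panel per year, with gaps where a year has no data.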

Related

Creating a matplotlib line graph using datetime objects while ignoring the year value

I have a dataset of highest and lowest temperatures recorded for each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year for this period (so there will be only one max and min temperature for each day plotted). I was able to create a df of the absolute minimum and maximum for each day from the data set; here's the example for the max:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
                              ['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot based on month/day, disregarding the year so it's in order. My thought was that I could do this by changing the year to be the same for every data point (as it won't be data that will be in the final graph anyway) and this is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
but I got this error:
ValueError: day is out of range for month
I have also tried to take my separate Day, Month, Year columns that I used to group by, include those with the max_temps df, change the year, and then move those all to a new column and convert them to a datetime object, but I get a similar error
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime(max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
As @Jody Klymak pointed out, max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005)) fails because your full dataset probably includes a leap year, and with it February 29. When you set the year to 2005, pandas tries to create the date 2005-02-29, which raises
ValueError: day is out of range for month. You can fix this by choosing a leap year such as 2004 instead of 2005.
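A small sketch of that behaviour, using three made-up dates rather than the real dataset: replacing the year with 2005 fails on February 29, while replacing it with the leap year 2004 succeeds.

```python
import pandas as pd

# Made-up dates; the second one only exists in leap years.
dates = pd.Series(pd.to_datetime(["2008-02-28", "2008-02-29", "2005-03-01"]))

try:
    dates.apply(lambda d: d.replace(year=2005))
    failed = False
except ValueError:
    # 2005-02-29 does not exist, so Timestamp.replace raises.
    failed = True

# 2004 is a leap year, so every month/day combination is valid.
same_year = dates.apply(lambda d: d.replace(year=2004))
print(failed, same_year.tolist())
```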
My solution would be to disregard the year entirely and create a new column that holds the month and day in the format "01-01". Since the month comes first and both parts are zero-padded, sorting these strings puts them in chronological order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
## use string formatting to create a new column with Month-Day
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()

Produce daily forecasts from monthly averages using Python Pandas

I have daily data going back years. I first want to compute the monthly average of these values, then project that monthly average forward as a forecast for the next few years. I have written the following code to do this.
For example, my forecast for the next few January's will be the average of the last few January's, and the same for Feb, Mar etc. Over the past few years my January number is 51.8111, so for the January's in my forecast period I want every day in every January to be this 51.8111 number (i.e. moving the monthly to daily granularity).
My question is, my code seems a bit long winded and with the loop, could potentially be a little slow? For my own learning I was wondering, what is a better way of taking daily data, averaging it by a time period, then projecting out this time period? I was looking at map and apply functions within Pandas, but couldn't quite work it out.
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
# create random dataframe of daily values
df = pd.DataFrame(np.random.randint(low=0, high=100,size=2317),
                  columns=['value'],
                  index=pd.date_range(start='2014-01-01', end=dt.date.today()-dt.timedelta(days=1), freq='D'))
# gain average by month over entire date range
df_by_month = df.groupby(df.index.month).mean()
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = 0
# project forward the monthly average to each day
for val in df_forecast.index:
    df_forecast.loc[val]['value'] = df_by_month.loc[val.month]
# create new dataframe joining together the historical value and forecast
df_complete = df.append(df_forecast)
I think you need Index.map: map the month of each forecast date to the value column of df_by_month:
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
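For a self-contained illustration of the same idea, here is a minimal sketch on synthetic data. The fixed date range, the random values, and the names monthly_mean and forecast_index are assumptions made for the example. It also joins the two frames with pd.concat, since DataFrame.append is no longer available in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

# Synthetic daily history over a fixed range (assumption, not the asker's data).
rng = np.random.default_rng(0)
hist_index = pd.date_range("2014-01-01", "2019-12-31", freq="D")
df = pd.DataFrame({"value": rng.integers(0, 100, size=len(hist_index))}, index=hist_index)

# One mean per calendar month (1..12) across all years.
monthly_mean = df.groupby(df.index.month)["value"].mean()

# Forecast dates start the day after the history ends.
forecast_index = pd.date_range(hist_index[-1] + pd.Timedelta(days=1), periods=1095, freq="D")
df_forecast = pd.DataFrame(index=forecast_index)
# Look up each date's month in monthly_mean; no Python-level loop needed.
df_forecast["value"] = df_forecast.index.month.map(monthly_mean)

# Stack history and forecast into one frame.
df_complete = pd.concat([df, df_forecast])
print(df_complete.tail())
```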

How do I tell pandas to group the same months across multiple years?

I've got a DataFrame of datetimes in the format 21-JAN-2016, which I convert with pd.to_datetime(df[0]). I'm trying to group my data so that the same month, across a span of several years, is plotted side by side. For example, the number of occurrences in January for 2015, 2016, 2017, etc. (so there'd be four bars clumped together side by side), and then the number of occurrences in February for 2015, 2016, 2017, etc.
Right now I have the code below, which I believe is mostly working, but I'm not sure because the x-axis is not labeling the months correctly. As written, it throws AttributeError: 'MultiIndex' object has no attribute 'strftime'. If I remove index.strftime("%Y-%b") it plots, but with a bad x-axis label, and I don't understand how to make the label show each of the 4 years with the month beneath it. This is my code as is:
#!/usr/bin/python
import pandas as pd
import matplotlib.pyplot as plt
import calendar
file = 'dates.txt'
# Convert datetimes
df = pd.read_csv("dates.txt", header=None) # Format: 359 21-JAN-2016
df["dates"] = pd.to_datetime(df[0]) # Format: 388 3-JUL-2015 2015-07-03
### Group data by year per month
by_year_per_month = df["dates"].groupby([(df.dates.dt.month),(df.dates.dt.year)]).count()
labels_by_year_per_month = by_year_per_month.index.strftime("%Y-%b")
### Label
by_year_per_month.plot(kind="bar", ax=ax)
ax.set_xticklabels(labels_by_year_per_month)
# Show plot
plt.show()
I thought I could format the month label using df["dates"].groupby([(df.dates.dt.month.to_period('M')),(df.dates.dt.year)]).count() but that gave me AttributeError: 'RangeIndex' object has no attribute 'to_period'.
BONUS:
Not sure if I can ask a second question here so please let me know if I should open a separate question but as a bonus I'd really like to know how to display each cluster of months on the graph such that they are side-by-side and there's a bit of a gap between it and the other groupings. i.e. Jan[15,16,17,18] is grouped up, then there's a space before Feb[15,16,17,18] rather than having even space between everything. Basically just to clean it up and make it easier to read.
EDIT 1:
Updated code to:
#!/usr/bin/python
import pandas as pd
import matplotlib.pyplot as plt
import calendar
file = 'dates.txt'
# Convert datetimes
df = pd.read_csv("dates.txt", header=None) # Format: 359 21-JAN-2016
df["dates"] = pd.to_datetime(df[0]) # Format: 388 3-JUL-2015 2015-07-03
### Group data by month per year
result = df["dates"].groupby([df.dates.dt.month, df.dates.dt.year]).count().unstack()
#result.columns = result.columns.droplevel(0)
result.index.name = 'month'
result.plot(kind="bar")
# Show plot
plt.show()
Which gives me:
You are currently grouping by month and year. You just need to unstack the result into a table.
by_year_per_month.unstack()
You should then be able to plot your data.
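import numpy as np
import pandas as pd
# pd.date_range replaces the DatetimeIndex(start=...) constructor, which newer pandas no longer accepts
dates = pd.date_range(start='2016-01-01', freq='D', periods=356 * 4)
df = pd.DataFrame({'date': dates, 'value': np.random.randn(356 * 4)})
# Summing sample data. You want `count` in your example.
# Group keys go in a list; [['value']] keeps a column level for droplevel below
result = df.groupby([df.date.dt.month, df.date.dt.year])[['value']].sum().unstack()
result.columns = result.columns.droplevel(0)
result.index.name = 'month'
result.plot()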
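To get month names on the x-axis with one cluster of bars per month, one option is to count, unstack the years into columns, and relabel the month numbers with calendar.month_abbr. This is a sketch on a handful of made-up dates, not the contents of dates.txt, and the variable names are hypothetical.

```python
import calendar
import pandas as pd

# Made-up sample dates (assumption); the real ones come from dates.txt.
dates = pd.Series(pd.to_datetime([
    "2016-01-21", "2015-07-03", "2015-01-05",
    "2016-01-09", "2016-07-14", "2017-02-01",
]))

# Count per (month, year), then pivot the years into columns.
counts = (
    dates.groupby([dates.dt.month.rename("month"), dates.dt.year.rename("year")])
    .count()
    .unstack(fill_value=0)
)

# Replace month numbers with abbreviated names for the tick labels.
counts.index = [calendar.month_abbr[m] for m in counts.index]
print(counts)
```

Calling counts.plot(kind="bar") on a table shaped like this should draw one group of bars per month, one bar per year, with a gap between the groups.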

What is the right daytime format for plotting time series in python for a big dataset?

I am trying to plot a time series for my dataset which has 215 rows of weekly data for the years 2010 to 2018. However, I keep getting Value error and Type Error as shown in the code and my screenshot below:
from pandas import read_csv
from pandas import datetime
from matplotlib import pyplot
def parser(x):
    return datetime.strptime(x, '%d-%m-%Y')
series = read_csv('testdatafortimeseries.csv', header=0, nrows=215, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
print(series.head())
#series.plot()
I am new to learning time series and was only trying to implement an example that used only 36 rows of data. I have tried leaving the dates as 4-Oct-2010 but still no difference. This is what my excel sheet looks like if this helps to identify the problem:
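Read the file without a custom date parser, then convert the date column with pd.to_datetime. Something like this:
import pandas as pd
data = pd.read_csv('DatesExample.csv')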
data.head()
data.Date = pd.to_datetime(data.Date)
data.head()
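Since the parser in the question expects day-month-year strings, a hedged sketch of the same conversion with an explicit format may help. The inline CSV text and the column names Date and Value are invented for illustration; the real file's columns may differ.

```python
import io
import pandas as pd

# Invented weekly sample in day-month-year order (assumption).
csv_text = "Date,Value\n04-10-2010,10\n11-10-2010,12\n18-10-2010,9\n"
data = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids day/month ambiguity when parsing.
data["Date"] = pd.to_datetime(data["Date"], format="%d-%m-%Y")

# A Series indexed by date is what series.plot() expects for a time series.
series = data.set_index("Date")["Value"]
print(series.head())
```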

interpolate values between sample years with Pandas

I'm trying to get interpolated values for the metric shown below using Pandas time series.
test.csv
year,metric
2020,290.72
2025,221.763
2030,152.806
2035,154.016
Code
import pandas as pd
df = pd.read_csv('test.csv', parse_dates={'Timestamp': ['year']},
index_col='Timestamp')
As far as I understand, this gives me a time series with January 1 of each year as the index. Now I need to fill in values for the missing years (2021, 2022, 2023, 2024, 2026, etc.).
Is there a way to do this with Pandas?
If you're using a newer version of Pandas, your DataFrame object should have an interpolate method that can be used to fill in the gaps.
It turns out that interpolation only fills in values that are missing (NaN); it does not create rows for the missing years. In my case above, I first had to reindex so that the index has one row every 12 months.
# reindex with interval 12 months (M: month, S: beginning of the month)
df_reindexed = df.reindex(pd.date_range(start='20120101', end='20350101', freq='12MS'))
# method=linear works because the intervals are equally spaced out now
df_interpolated = df_reindexed.interpolate(method='linear')
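Putting the two steps together on the test.csv data from the question gives the sketch below. It assumes the range should start at 2020-01-01, the first year in the sample, and it builds the index with pd.to_datetime rather than the parse_dates dict; both choices are assumptions for the example.

```python
import io
import pandas as pd

# Same rows as test.csv in the question, inlined so the sketch is self-contained.
csv_text = "year,metric\n2020,290.72\n2025,221.763\n2030,152.806\n2035,154.016\n"
df = pd.read_csv(io.StringIO(csv_text))

# Turn the year column into a January 1 datetime index.
df.index = pd.to_datetime(df.pop("year").astype(str), format="%Y")

# Reindexing inserts a NaN row for every missing year.
df_reindexed = df.reindex(pd.date_range(start="2020-01-01", end="2035-01-01", freq="12MS"))

# Linear interpolation then fills those NaN rows between the known points.
df_interpolated = df_reindexed.interpolate(method="linear")
print(df_interpolated)
```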
