Rolling Mean ValueError - python

I am trying to plot the rolling mean on a double-axis graph. However, I get this error: ValueError: view limit minimum -36867.6 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units. My columns do contain datetime objects, so I am not sure why this is happening.
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
lns1 = ax1.plot(df5['TIME'],
df5["y"])
lns2 = ax2.plot(df3_plot.rolling(window=3).mean(),
color='black')
df5 looks like this:
TIME y
0 1990-01-01 3.380127
1 1990-02-01 3.313274
2 1990-03-01 4.036463
3 1990-04-01 3.813060
4 1990-05-01 3.847867
...
355 2019-08-01 8.590325
356 2019-09-01 7.642616
357 2019-10-01 8.362921
358 2019-11-01 7.696176
359 2019-12-01 8.206370
And df3_plot looks like this:
date y
0 1994-01-01 239.274414
1 1994-02-01 226.126581
2 1994-03-01 211.591748
3 1994-04-01 214.708679
4 1995-01-01 223.093071
...
99 2018-04-01 181.889699
100 2019-01-01 174.500096
101 2019-02-01 179.803310
102 2019-03-01 175.570419
103 2019-04-01 176.697451
Furthermore, the graph comes out fine if I don't use the rolling mean for df3_plot, which suggests that the x-axis is a datetime for both. When I have
lns2 = ax2.plot(df3_plot['date'],
df3_plot['y'],
color='black')
I get this graph
Edit
Suppose that df5 has another column 'y2' whose rolling mean is computed correctly along with 'y'. How can I plot and label both properly? I currently have
df6 = df5.rolling(window=12).mean()
lns1 = ax1.plot(
df6,
label = 'y', # how do I add 'y2' label correctly?
linewidth = 2.0)
df6 looks like this:
TIME y y2
0 1990-01-01 NaN NaN
1 1990-02-01 NaN NaN
2 1990-03-01 NaN NaN
3 1990-04-01 NaN NaN
4 1990-05-01 NaN NaN
... ... ... ...
355 2019-08-01 10.012447 8.331901
356 2019-09-01 9.909044 8.263813
357 2019-10-01 9.810155 8.185539
358 2019-11-01 9.711690 8.085016
359 2019-12-01 9.619968 8.035330

Making 'date' into the index of my dataframe did the trick: df3_plot.set_index('date', inplace=True).
However, I'm not sure why the error messages are different for @dm2 and me.

You already caught this, but the problem is that rolling by default works on the index. There is also an on parameter for setting a column to work on instead:
rolling = df3_plot.rolling(window=3, on='date').mean()
lns2 = ax2.plot(rolling['date'], rolling['y'], color='black')
Note that if you just do df3_plot.rolling(window=3).mean(), you get this:
y
0 NaN
1 NaN
2 0.376586
3 0.168073
4 0.258431
.. ...
299 0.285585
300 0.327987
301 0.518088
302 0.300169
303 0.299366
[304 rows x 1 columns]
Seems like matplotlib tries to plot y here since there is only one column. But the index is int, not dates, so I believe that leads to the error you saw when trying to plot over the other date axis.
When you use on to create rolling in my example, the result still has date and y columns, so you still need to reference the appropriate columns when plotting.
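As a quick check of the two fixes discussed here, this minimal sketch builds a small made-up frame (the values are illustrative, not the asker's data) and compares rolling with on='date' against rolling after set_index('date'). Both should give the same means; only where the dates end up differs.

```python
import pandas as pd

# Illustrative data only: a small stand-in for df3_plot.
df3_plot = pd.DataFrame({
    "date": pd.to_datetime(["1994-01-01", "1994-02-01", "1994-03-01",
                            "1994-04-01", "1995-01-01"]),
    "y": [239.0, 226.0, 211.0, 214.0, 223.0],
})

# Option 1: keep 'date' as a column and tell rolling about it.
# The result still has both a 'date' and a 'y' column.
rolled_on = df3_plot.rolling(window=3, on="date").mean()

# Option 2: move 'date' into the index, then roll over the remaining column.
# The result has the dates as its index and a single 'y' column.
rolled_idx = df3_plot.set_index("date").rolling(window=3).mean()

print(rolled_on)
print(rolled_idx)
```

With option 1 you would pass rolled_on['date'] and rolled_on['y'] to ax2.plot; with option 2 the dates travel with the index, which is presumably why set_index fixed the original call.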

Related

Find cumcount and agg func based on past records of each group

I have a dataframe like as shown below
df = pd.DataFrame(
{'stud_name' : ['ABC', 'ABC','ABC','DEF',
'DEF','DEF'],
'qty' : [123,31,490,518,70,900],
'trans_date' : ['13/11/2020','10/1/2018','11/11/2017','27/03/2016','13/05/2010','14/07/2008']})
I would like to do the below
a) for each stud_name, look at their past data (full past data) and compute the min, max and mean of qty column
Please note that the 1st record/row for every unique stud_name will be NA because there is no past data (history) to look at and compute the aggregate statistics
I tried something like below but the output is incorrect
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to be like as shown below
We can use custom function to calculate the past statistics per student
def past_stats(q):
    return (
        q.expanding()
        .agg(['max', 'min', 'mean'])
        .shift().add_prefix('past_')
    )
df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
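Depending on the pandas version, groupby(...).apply may prepend the group key to the index of its result, in which case df.join would no longer line up by row label. Here is a minimal sketch that sidesteps this with transform, which returns values aligned to the original index. The dates are written in ISO format to avoid day/month ambiguity; that is an assumption about the intended dates, chosen to match the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "stud_name": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF"],
    "qty": [123, 31, 490, 518, 70, 900],
    # Assumed ISO equivalents of the dates in the question.
    "trans_date": ["2020-11-13", "2018-10-01", "2017-11-11",
                   "2016-03-27", "2010-05-13", "2008-07-14"],
})
df["trans_date"] = pd.to_datetime(df["trans_date"])
df = df.sort_values(["stud_name", "trans_date"])

g = df.groupby("stud_name")["qty"]

# Expanding statistic per student, shifted by one row so that each row
# only sees earlier transactions; the first row per student becomes NaN.
df["past_max"] = g.transform(lambda s: s.expanding().max().shift())
df["past_min"] = g.transform(lambda s: s.expanding().min().shift())
df["past_mean"] = g.transform(lambda s: s.expanding().mean().shift())

print(df)
```

The shift is what turns "statistics up to and including this row" into "statistics over past rows only".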

How can I plot a line graph from the data which miss some values in some dates?

I don't understand why my plot did not show me what I had expected. My plot looks unorganized and I am not sure whether it is because the table missed some dates. How should I fix it with the code?
my data
date count
0 2020-03-06 1
1 2020-03-17 2
2 2020-03-18 1
3 2020-03-21 1
4 2020-03-23 1
... ... ...
196 2020-12-27 25
197 2020-12-28 5
198 2020-12-29 19
199 2020-12-30 25
200 2020-12-31 23
my code
plt.plot(data['date'],data['count'])
plt.setp(plt.gca().xaxis.get_majorticklabels(),rotation=45)
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=10))
plt.xlim('2020-03-06','2020-12-31')
plt.ylim((0,50))
plt.savefig('03_clean_tweet_count_by_month_2020.tiff', dpi=300, format='tiff', bbox_inches='tight')
Result
I can't be sure without having the data to test on, but if the rows of your dataframe are not sorted properly, you could get an output like this. Try:
data.sort_values('date', inplace=True)
plt.plot(data['date'],data['count'])
plt.setp(plt.gca().xaxis.get_majorticklabels(),rotation=45)
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=10))
plt.xlim('2020-03-06','2020-12-31')
plt.ylim((0,50))
plt.savefig('03_clean_tweet_count_by_month_2020.tiff', dpi=300, format='tiff', bbox_inches='tight')
Try sorting your DataFrame by the date column before plotting:
data=data.sort_values(by=['date'])
plt.plot(data['date'],data['count'])
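Gaps between dates should not scramble a line plot by themselves, since matplotlib simply connects consecutive points in the order given. A small sketch of the sort step is below, with made-up rows and the assumption that the date column may still hold strings, in which case parsing it first makes both the sort and the x-axis chronological.

```python
import pandas as pd

# Illustrative rows, deliberately out of order and with gaps between dates.
data = pd.DataFrame({
    "date": ["2020-03-18", "2020-03-06", "2020-03-23", "2020-03-17"],
    "count": [1, 1, 1, 2],
})

# Parse the strings so that sorting and plotting treat them as dates.
data["date"] = pd.to_datetime(data["date"])

# Sort chronologically and renumber the rows.
data = data.sort_values("date").reset_index(drop=True)

print(data)
```

After this, plt.plot(data['date'], data['count']) draws the points from left to right.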

DataError: No numeric types to aggregate when creating plot in loop

I want to draw multiple lines in one lineplot using a loop like this, but it raises DataError: No numeric types to aggregate. Why does it raise that error, and how can I fix it?
plt.figure()
cases = pd.DataFrame(data=covid[['date','acc_released','acc_deceased','acc_negative','acc_confirmed']])
for col in cases.columns:
    sns.lineplot(x=cases['date'],y=covid[col],data=cases)
Without loop it will be like this, which is not efficient but works fine
plt.figure()
sns.lineplot(x=covid['date'], y=covid['acc_confirmed'])
sns.lineplot(x=covid['date'], y=covid['acc_deceased'])
sns.lineplot(x=covid['date'], y=covid['acc_negative'])
sns.lineplot(x=covid['date'], y=covid['acc_released'])
plt.xticks(rotation=90)
plt.legend(['acc_confirmed', 'acc_deceased', 'acc_negative', 'acc_released'],
loc='upper left')
plt.title('Numbers of cases')
This is my data
date acc_released acc_deceased acc_negative acc_confirmed
0 2020-03-02 0 0 335 2
1 2020-03-03 0 0 337 2
2 2020-03-04 0 0 356 2
3 2020-03-05 0 0 371 2
4 2020-03-06 0 0 422 4
5 2020-03-07 0 0 422 4
It's supposed to look this way
If you set the date as your index, you can pass the whole DataFrame to data:
sns.lineplot(data=cases)
To set the index (df and 'Time' here stand for your DataFrame and its date column):
df.index = df['Time']
Then you can drop the now redundant column:
df = df.drop(columns=['Time'])
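One likely reason the loop fails is that cases.columns also contains 'date', so one iteration asks for y=covid['date'], which is not numeric. The sketch below applies the index approach to the first rows of the question's data and checks that only numeric columns are left; set_index does the "set index, then drop the column" steps in one call.

```python
import pandas as pd

# First rows of the data shown in the question.
covid = pd.DataFrame({
    "date": ["2020-03-02", "2020-03-03", "2020-03-04"],
    "acc_released": [0, 0, 0],
    "acc_deceased": [0, 0, 0],
    "acc_negative": [335, 337, 356],
    "acc_confirmed": [2, 2, 2],
})
covid["date"] = pd.to_datetime(covid["date"])

# Move 'date' into the index so that only numeric columns remain.
cases = covid.set_index("date")

# These are the columns a loop or a wide-form plot call would draw.
value_cols = list(cases.columns)
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in cases.dtypes)

print(value_cols, all_numeric)
```

A frame in this shape can be handed to sns.lineplot(data=cases) as described above, or looped over with value_cols if you prefer one call per line.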

How do I get a simple scatter plot of a dataframe (preferably with seaborn)

I'm trying to scatter plot the following dataframe:
mydf = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9],
'y':[9,8,7,6,5,4,3,2,1],
'z':np.random.randint(0,9, 9)},
index=["12:00", "1:00", "2:00", "3:00", "4:00",
"5:00", "6:00", "7:00", "8:00"])
x y z
12:00 1 9 1
1:00 2 8 1
2:00 3 7 7
3:00 4 6 7
4:00 5 5 4
5:00 6 4 2
6:00 7 3 2
7:00 8 2 8
8:00 9 1 8
I would like to see the times "12:00, 1:00, ..." as the x-axis and x,y,z columns on the y-axis.
When I try to plot with pandas via mydf.plot(kind="scatter"), I get the error ValueError: scatter requires an x and y column. Do I have to break down my dataframe into appropriate parameters? What I would really like to do is get this scatter plotted with seaborn.
Just running
mydf.plot(style=".")
works fine for me:
Seaborn is actually built around pandas.DataFrames. However, your data frame needs to be "tidy":
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Since you want to plot x, y, and z on the same plot, it seems like they are actually different observations. Thus, you really have three variables: time, value, and the letter used.
The "tidy" standard comes from Hadley Wickham, who implemented it in the tidyr package.
First, I convert the index to a Datetime:
mydf.index = pd.DatetimeIndex(mydf.index)
Then we do the conversion to tidy data:
pivoted = mydf.unstack().reset_index()
and rename the columns
pivoted = pivoted.rename(columns={"level_0": "letter", "level_1": "time", 0: "value"})
Now, this is what our data looks like:
letter time value
0 x 2019-03-13 12:00:00 1
1 x 2019-03-13 01:00:00 2
2 x 2019-03-13 02:00:00 3
3 x 2019-03-13 03:00:00 4
4 x 2019-03-13 04:00:00 5
Unfortunately, seaborn doesn't play with DateTimes that well, so you can just extract the hour as an integer:
pivoted["hour"] = pivoted["time"].dt.hour
With a data frame in this form, seaborn takes in the data easily:
import seaborn as sns
sns.set()
sns.scatterplot(data=pivoted, x="hour", y="value", hue="letter")
Outputs:
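As an alternative sketch of the same reshaping, melt can produce the tidy frame in one step without renaming level_0 and level_1 afterwards. The z values are fixed here so the result is reproducible (an assumption; the original uses random integers), and the hour is taken from the text before the colon instead of going through a DatetimeIndex.

```python
import pandas as pd

# Same layout as mydf, with fixed z values for reproducibility.
mydf = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6, 7, 8, 9],
     "y": [9, 8, 7, 6, 5, 4, 3, 2, 1],
     "z": [1, 1, 7, 7, 4, 2, 2, 8, 8]},
    index=["12:00", "1:00", "2:00", "3:00", "4:00",
           "5:00", "6:00", "7:00", "8:00"],
)

# Name the index, turn it into a column, then melt x/y/z into long form:
# one row per (time, letter) pair.
tidy = (
    mydf.rename_axis("time")
        .reset_index()
        .melt(id_vars="time", var_name="letter", value_name="value")
)

# Hour as an integer, taken from the text before the colon.
tidy["hour"] = tidy["time"].str.split(":").str[0].astype(int)

print(tidy.head())
```

The resulting frame has the same letter, value, and hour columns that the scatterplot call above expects.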

Pandas stack columns in dataframe and make a histogram

I have a dataframe that looks like this:
df_vspd=df.loc[:,['VSPD1','VSPD2','VSPD3','VSPD4','VSPD5','VSPD6','VSPD7']]
df_vspd.head()
VSPD1 VSPD2 VSPD3 VSPD4 VSPD5 VSPD6 VSPD7
0 NaN NaN NaN NaN NaN NaN NaN
1 21343 37140 30776 12961 1934 14 0
2 6428 9526 9760 12075 4262 0 0
3 11795 14188 16702 18917 612 0 0
4 43571 60684 41611 12168 11264 79 0
I would like to plot a histogram of the data. However I want to stack the columns and do the histogram. Seems like a simple task, however I can not do it!!
Help please
What I want to do is stack the columns (VSPD1-VSPD7), and make them the index column. I tried:
cnames = list(df_vspd.columns)
df_test = df_vspd.set_index(cnames)
However it does not do me any good.
Do you want:
df_vspd.stack(0).hist()
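To see what stack does before plotting, here is a small sketch using the five rows shown in the question. It collapses the seven columns into one long Series; the explicit dropna is an assumption added so that the all-NaN first row is discarded regardless of how a given pandas version treats missing values in stack.

```python
import numpy as np
import pandas as pd

# The five rows shown in the question.
df_vspd = pd.DataFrame(
    [[np.nan] * 7,
     [21343, 37140, 30776, 12961, 1934, 14, 0],
     [6428, 9526, 9760, 12075, 4262, 0, 0],
     [11795, 14188, 16702, 18917, 612, 0, 0],
     [43571, 60684, 41611, 12168, 11264, 79, 0]],
    columns=[f"VSPD{i}" for i in range(1, 8)],
)

# One long Series indexed by (row label, column name), without NaN entries.
stacked = df_vspd.stack().dropna()

print(len(stacked))
print(stacked.head())
```

Calling stacked.hist() on this Series then draws a single histogram over all values from VSPD1 to VSPD7.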
