Missing xticks on chart for matplotlib on Python 3 - python

I am following this section of a course. I realize this code was made using Python 2, but in the notebook the xticks show dates on the 'Start Date' axis and mine do not; my chart only shows the label 'Start Date' and no dates.
# Set as_index=False to keep the 0,1,2,... index. Then we'll take the mean of the polls on that day.
poll_df = poll_df.groupby(['Start Date'],as_index=False).mean()
# Let's go ahead and see what this looks like
poll_df.head()
Start Date Number of Observations Obama Romney Undecided Difference
0 2009-03-13 1403 44 44 12 0.00
1 2009-04-17 686 50 39 11 0.11
2 2009-05-14 1000 53 35 12 0.18
3 2009-06-12 638 48 40 12 0.08
4 2009-07-15 577 49 40 11 0.09
Great! Now plotting the Difference versus time should be straightforward.
# Plotting the difference in polls between Obama and Romney
fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple')
https://nbviewer.jupyter.org/github/jmportilla/Udemy-notes/blob/master/Data%20Project%20-%20Election%20Analysis.ipynb
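A likely cause, echoed in the related answer below: after loading, 'Start Date' is a plain string column, so matplotlib cannot lay out date ticks. A minimal sketch of the usual fix, assuming the same poll_df as in the notebook:
import pandas as pd
# Parse the string dates into datetime64 before grouping and plotting.
poll_df['Start Date'] = pd.to_datetime(poll_df['Start Date'])
# numeric_only=True may be needed on newer pandas, where non-numeric
# columns are no longer dropped silently by mean().
poll_df = poll_df.groupby('Start Date', as_index=False).mean(numeric_only=True)
ax = poll_df.plot('Start Date', 'Difference', figsize=(12, 4),
                  marker='o', linestyle='-', color='purple')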

Related

How to plot straight lines in correspondence of highest values?

I have the following data:
Date
01/27/2020 55
03/03/2020 44
02/25/2020 39
03/11/2020 39
01/28/2020 39
02/05/2020 38
03/17/2020 37
03/16/2020 37
03/19/2020 37
03/14/2020 35
03/09/2020 35
03/26/2020 33
03/06/2020 33
01/29/2020 33
03/23/2020 27
03/15/2020 27
02/26/2020 27
03/27/2020 26
03/02/2020 25
02/28/2020 25
03/24/2020 24
03/04/2020 24
01/21/2020 23
03/01/2020 21
02/27/2020 21
01/22/2020 21
02/18/2020 18
01/31/2020 18
03/22/2020 18
01/26/2020 18
03/31/2020 18
02/24/2020 17
01/20/2020 16
01/23/2020 16
03/12/2020 16
03/21/2020 15
02/29/2020 14
03/28/2020 13
02/19/2020 13
03/08/2020 13
02/04/2020 13
02/12/2020 12
02/01/2020 12
02/07/2020 12
03/30/2020 12
02/20/2020 11
03/07/2020 11
03/29/2020 11
02/09/2020 11
02/06/2020 11
This data was obtained using groupby; on the right I have the frequency of values by date. The plot is generated by
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Date']).count()['NN'].plot(ax=ax)
I would like to add vertical lines at the dates with the highest values, i.e.
01/27/2020 55
03/03/2020 44
02/25/2020 39
03/11/2020 39
01/28/2020 39
How could I add these lines to my plot?
The .axvline method should do the trick, regarding the vertical lines. If you try to plot a pandas DataFrame/Series using a set of strings for the index, pandas does some fancy footwork in the background.
You could mess around with the xticks and all sorts, but the easiest thing to do is to convert your column to datetime64.
First, let's make some fluff data:
import random
import pandas as pd
from string import ascii_lowercase
# Make some fluff
dates = [f'01/{random.randint(1,28)}/1901' for _ in range(100)]
fluff = [ascii_lowercase[random.randint(1,26):random.randint(1,26)]
         for _ in range(100)]
# Pack into a DataFrame
df = pd.DataFrame({'Date': dates, 'NN': fluff})
# Aggregate
counted = df.groupby('Date').count()
Taking a quick peek:
>>> counted
NN
Date
01/10/1901 2
01/11/1901 6
01/12/1901 2
... ...
You can substitute this for whatever data you have. It's probably easiest if you convert your column before doing the groupby, so:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
agg_df = df.groupby(['Date']).count()
fig, ax = plt.subplots(figsize=(8,6))
agg_df['NN'].plot(ax=ax)
The plot is similar to the one above. Note that I'm using 8 by 6 for the figsize so that the figure fits more easily on the StackOverflow page; change it back to 15 by 7 when running your code.
I've used %m/%d/%Y format, as that appears to be what you are using. See here for more info on date formatting: official datetime doc
Finally, get the vertical lines by using a datetime directly:
import datetime
# Note: leading-zero integers such as 01 are a SyntaxError in Python 3.
ax.axvline(datetime.datetime(1901, 1, 10), color='k')
If you want to get vertical straight lines for the highest values, sort your aggregated DataFrame, then whack it in a for-loop.
for d in agg_df.sort_values('NN', ascending=False).index[:5]:
    ax.axvline(d, color='k')
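Putting the pieces together, a minimal end-to-end sketch using the same steps (variable names follow the question):
import pandas as pd
import matplotlib.pyplot as plt
# Parse the dates, aggregate, and plot the daily counts.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
agg_df = df.groupby(['Date']).count()
fig, ax = plt.subplots(figsize=(8, 6))
agg_df['NN'].plot(ax=ax)
# Vertical lines at the five highest counts.
for d in agg_df.sort_values('NN', ascending=False).index[:5]:
    ax.axvline(d, color='k')
plt.show()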

Unintended Additional line drawn by Plotly express in Python

Plotly draws an extra diagonal line from the start to the endpoint of the original line graph.
Other data, other graphs work fine.
Only this data adds the line.
Why does this happen?
How can I fix this?
Below is the code
temp = pd.DataFrame(df[{KEY_WORD}])
temp['date'] = temp.index
fig=px.line(temp.melt(id_vars="date"), x='date', y='value', color='variable')
fig.show()
plotly.offline.plot(fig,filename='Fig_en1')
Just had the same issue -- try checking for duplicate values on the X axis. I was using the following code:
fig = px.line(df, x="weekofyear", y="interest", color="year")
fig.show()
That created the following plot:
I realised that this was because, in certain years, some dates fell in the previous year's ISO weeks 52/53 and therefore created duplicate (year, weekofyear) pairs, e.g. indices 93 and 145 below:
date interest query year weekofyear
39 2015-12-20 44 home insurance 2015 51
40 2015-12-27 55 home insurance 2015 52
41 2016-01-03 69 home insurance 2016 53
92 2016-12-25 46 home insurance 2016 51
93 2017-01-01 64 home insurance 2017 52
144 2017-12-24 51 home insurance 2017 51
145 2017-12-31 79 home insurance 2017 52
196 2018-12-23 46 home insurance 2018 51
197 2018-12-30 64 home insurance 2018 52
248 2019-12-22 57 home insurance 2019 51
249 2019-12-29 73 home insurance 2019 52
By amending these (for rows with high week numbers but dates in January, I subtracted 1 from the year column) I seem to have got rid of the phenomenon:
NB: there may be some other differences between the charts due to the dataset being somewhat fluid.
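A minimal sketch of that amendment, assuming 'date' is already datetime64 and the column names match the table above:
# January dates carrying the previous year's ISO week 52/53:
# shift their 'year' back by one so (year, weekofyear) pairs stay unique.
mask = (df['date'].dt.month == 1) & (df['weekofyear'] >= 52)
df.loc[mask, 'year'] = df.loc[mask, 'year'] - 1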
A similar question has been asked and answered in the post How to disable trendline in plotly.express.line?, but in your case I'm pretty sure the problem lies in temp.melt(id_vars="date"), x='date', y='value', color='variable'. It seems you're transforming your data from a wide to a long format. You're using color='variable' without specifying it in temp.melt(id_vars="date"). And when the color specification does not properly correspond to the structure of your dataset, an extra line like yours can occur. Just take a look at this:
Command 1:
fig = px.line(data_frame=df_long, x='Timestamp', y='value', color='stacked_values')
Plot 1:
Command 2:
fig = px.line(data_frame=df_long, x='Timestamp', y='value')
Plot 2:
See the difference? That's why I think there's a mis-specification in your fig=px.line(temp.melt(id_vars="date"), x='date', y='value', color='variable').
So please share your data, or a sample of your data that reproduces the problem, and I'll have a better chance of verifying your problem.
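For reference, a minimal sketch of the wide-to-long reshape px.line expects; by default melt names the new columns 'variable' and 'value' (the data here is illustrative):
import pandas as pd
import plotly.express as px
wide = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=5, freq='D'),
    'series_a': [1, 3, 2, 5, 4],
    'series_b': [2, 2, 3, 3, 4],
})
long = wide.melt(id_vars='date')  # columns: date, variable, value
# Sorting by x also guards against a line doubling back on itself.
fig = px.line(long.sort_values('date'), x='date', y='value', color='variable')
fig.show()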

Overlaying bar charts in python

Can I overlay 3 barcharts in python? The code I used to produce the three barcharts can be seen below:
fig3.set_title('Sample 2 (2019-10-05) - Average bikes used per hour')
fig3.set_xlabel('Hour')
fig3.set_ylabel('Average Percentage')
fig3.set_ylim(ymin=70)
fig4 = average_bikes_used_hours3.plot.bar(y='Average bikes used in a hour', x='hour', figsize=(20,10))
fig4.set_title('Sample 3 (2019-08-31) - Average bikes used per hour')
fig4.set_xlabel('Hour')
fig4.set_ylabel('Average Percentage')
fig4.set_ylim(ymin=70)
fig5 = average_bikes_used_hours4.plot.bar(y='Average bikes used in a hour', x='hour', figsize=(20,10))
fig5.set_title('Sample 4 (2019-08-31) - Average bikes used per hour')
fig5.set_xlabel('Hour')
fig5.set_ylabel('Average Percentage')
fig5.set_ylim(ymin=70)
The most intuitive way is to:
- create a single DataFrame,
- with an index of consecutive hours,
- with a separate column for each sample.
Something like:
Sample 2 Sample 3 Sample 4
Hour
8 20 25 21
9 22 27 27
10 23 34 29
11 21 30 22
12 19 22 24
Then just plot:
df.plot.bar();
and you will have all samples in a single picture.
For the above data, I got:
If you want some extra space for the legend, pass ylim parameter, e.g.:
df.plot.bar(ylim=(0,40));
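A minimal sketch of building that combined DataFrame from the question's own variables (the name average_bikes_used_hours2 for Sample 2 is an assumption, since its creation line is cut off above):
import pandas as pd
col = 'Average bikes used in a hour'
df = pd.DataFrame({
    'Sample 2': average_bikes_used_hours2.set_index('hour')[col],  # assumed name
    'Sample 3': average_bikes_used_hours3.set_index('hour')[col],
    'Sample 4': average_bikes_used_hours4.set_index('hour')[col],
})
df.index.name = 'Hour'
df.plot.bar(figsize=(20, 10))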

Two DataFrames Random Sample by Day grouping instead of hour

I have two dataframes: one is Price and the other is Volume. They are both hourly and cover the same timeframe (one year).
dfP = pd.DataFrame(np.random.randint(5, 10, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
dfV = pd.DataFrame(np.random.randint(50, 100, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
Each day is a SET in the sense that the values have to stay together. When a sample is generated, it needs to be a full day; so a sample would be, for example, all 24 hours of Feb 2, 2008. I would like to generate a 185-day (50%) sample set for dfP and take the Volumes from the same days so I can compute a sum product.
dfProduct = dfP_Sample * dfV_Sample
I am lost on how to achieve this. Any help is appreciated.
It sounds like you're expecting to get the sum of the volumes and prices for each day and then multiply them together?
If that's the case, try the following. If not, please clarify your question.
priceGroup = dfP.groupby(by=dfP.index.date).sum()
volumeGroup = dfV.groupby(by=dfV.index.date).sum()
dfProduct = priceGroup*volumeGroup
If you want to just look at a specific date range, try
import datetime
# Note: leading-zero integers such as 08 are a SyntaxError in Python 3.
dfProduct[np.logical_and(dfProduct.index > datetime.date(2006, 8, 9), dfProduct.index < datetime.date(2007, 1, 2))]
First of all, we'll generate a column that holds each date's day index within the year; for example, 2008-01-01 is assigned 1 because it is the first day of the year, and so on.
day_order = [date.timetuple().tm_yday for date in dfP.index]
dfP['day_order'] = day_order
Then generate 185 random days from 1 to 365; these represent day positions in the year, so for example the random number 1 indicates 2008-01-01.
random_days = np.random.choice(np.arange(1, 366), size=185, replace=False)
Then slice your original data frame to keep only the rows from the random sample, according to the day_order column we created previously:
dfP_sample = dfP[dfP.day_order.isin(random_days)]
Then you can merge both frames on the index, and you can do whatever you want:
final = pd.merge(dfP_sample, dfV, left_index=True, right_index=True)
final.head()
Out[47]:
Col1_x Col2_x Col3_x Col4_x day_order Col1_y Col2_y Col3_y Col4_y
2008-01-03 00:00:00 9 6 9 9 3 66 85 62 82
2008-01-03 01:00:00 5 8 9 8 3 54 89 65 98
2008-01-03 02:00:00 7 5 5 9 3 83 58 60 96
2008-01-03 03:00:00 9 5 7 6 3 59 54 67 78
2008-01-03 04:00:00 9 5 8 9 3 92 66 66 55
If you don't want to merge both frames, you can apply the same logic to dfV, and then you will get samples from both data frames on the same days.
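A minimal sketch of that last step, reusing the random_days drawn above:
# Slice dfV on the same randomly chosen days, then multiply element-wise.
dfV_day_order = [date.timetuple().tm_yday for date in dfV.index]
dfV_sample = dfV[pd.Series(dfV_day_order, index=dfV.index).isin(random_days)]
dfProduct = dfP_sample.drop(columns='day_order') * dfV_sample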

data cleaning a python dataframe

I have a Python dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's numbers on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, then build a CDF of the numbers associated with each weekday.
1. To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the week in which the holiday occurred.
2. After this, I think I should insert a column in the dataframe that labels all my data with Monday through Friday, for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order and query by day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   .....:     if len(df) != 5:
   .....:         return pandas.Series()
   .....:     return pandas.Series(df.Value.rank().values, index=df.index.weekday)
   .....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
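The holiday cleaning in #1 can be sketched along similar lines. A hedged sketch, assuming holidays is a list of dates you supply and the dataframe has the DatetimeIndex shown above:
import pandas as pd
# Hypothetical holiday calendar -- substitute your own.
holidays = pd.to_datetime(['2012-09-03', '2012-11-22'])
# Collect the (ISO year, ISO week) of each holiday and of the week after it.
bad_weeks = set()
for h in holidays:
    for d in (h, h + pd.Timedelta(weeks=1)):
        iso = d.isocalendar()
        bad_weeks.add((iso[0], iso[1]))
iso_idx = df.index.isocalendar()  # needs pandas >= 1.1
keep = [(y, w) not in bad_weeks
        for y, w in zip(iso_idx['year'].astype(int), iso_idx['week'].astype(int))]
df = df[keep]
# For #2, label each row with its weekday name for sorting and querying.
df['weekday'] = df.index.day_name()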
