Why can matplotlib not format monthly data - python

The formatting of the x-axis works just fine for daily data but fails for monthly data. Then, it just uses 1970 as the base year. Why is that and how can I solve this issue?
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
# generate time series
# switch between 'M' and 'D'
rows = 100  # number of periods; not defined in the original snippet, value assumed here
years = pd.date_range(start='1/1/2018', periods=rows, freq='D') #'M')
# randomly generated dummy data
np.random.seed(0)
data1 = np.random.randn(len(years))
# put data together
ts1 = pd.Series(data=data1, index=years)
data = {'check': ts1}
df = pd.concat(data, axis=1)
# generate a plot
ax = df.plot(figsize=(20, 5))
# format xaxis
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y %b"))
[Figures: resulting plots for the daily and the monthly data]
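A likely explanation is that pandas draws regular time series with its own period-based axis units, which `mdates.DateFormatter` then interprets as Matplotlib date numbers close to the 1970 epoch. The following is a minimal sketch of one possible workaround, not a definitive fix: it passes `x_compat=True` to `df.plot` so that the axis carries ordinary Matplotlib dates. The value of `rows`, the `"MS"` (month start) frequency, the three-month tick interval, and the `Agg` backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

rows = 24  # assumed number of periods; the question does not define it
months = pd.date_range(start="2018-01-01", periods=rows, freq="MS")  # month starts
rng = np.random.default_rng(0)
df = pd.DataFrame({"check": rng.standard_normal(rows)}, index=months)

# x_compat=True asks pandas to skip its own time-series tick handling,
# so the x axis uses regular Matplotlib date numbers
ax = df.plot(figsize=(20, 5), x_compat=True)

# With Matplotlib date numbers on the axis, the usual locators and formatters apply
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y %b"))

# Render once so the tick labels are populated and can be inspected
ax.figure.canvas.draw()
labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Plotting with Matplotlib directly, for example `ax.plot(df.index, df["check"])`, should behave similarly, since it also bypasses the pandas axis units.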

Related

barplot with 2 variables python

I have the following dataframe coming from an excel file:
df = pd.read_excel('base.xlsx')
My excel file contains the following columns:
data - datetime64[ns]
stock- float64
demand - float64
origem - object
I need to plot a bar chart where the x-axis will be the date and the bars the stock and demand. Blue would be the demand and orange the stock:
This can be done with the pandas bar plot function. Note that if there are dates that are not recorded in your dataset (e.g. weekends or national holidays), they will not be automatically displayed as gaps in the bar plot. This is because bar plots in pandas (and other packages) are made primarily for categorical data.
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
# Create a random time series with the date as index
# In your case where you are importing your dataset from excel you
# would assign your date column to the df index like this:
# df = df.set_index('data')
rng = np.random.default_rng(123)
days = 7
df = pd.DataFrame(dict(demand = rng.uniform(100, size=days),
                       stock = rng.uniform(100, size=days),
                       origin = np.random.choice(list('ABCD'), days)),
                  index = pd.date_range(start='2020-12-14', freq='D', periods=days))
# Create pandas bar plot
fig, ax = plt.subplots(figsize=(10,5))
df.plot.bar(ax=ax, color=['tab:blue', 'tab:orange'])
# Assign ticks with custom tick labels
# Date format codes for xticklabels:
# https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
plt.xticks(ax.get_xticks(), [ts.strftime('%A') for ts in df.index], rotation=0)
plt.legend(frameon=False)
plt.show()
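If missing dates should appear as empty slots, one option is to reindex the data to a full daily range before plotting. This is a sketch under assumptions: the data below are hypothetical (business days only, so a weekend is missing), and `asfreq("D")` is just one way to insert the missing rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Hypothetical data recorded on business days only: 14-18 and 21-22 December 2020
idx = pd.date_range(start="2020-12-14", periods=7, freq="B")
df_gaps = pd.DataFrame({"demand": rng.uniform(1, 100, size=7),
                        "stock": rng.uniform(1, 100, size=7)},
                       index=idx)

# asfreq("D") inserts rows of NaN for the calendar days that are not in the index
df_full = df_gaps.asfreq("D")

# Each calendar day now gets its own slot; the NaN rows show up as gaps
fig, ax = plt.subplots(figsize=(10, 5))
df_full.plot.bar(ax=ax, color=["tab:blue", "tab:orange"])
ax.set_xticklabels([ts.strftime("%a %d") for ts in df_full.index], rotation=0)
```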

How to format Pandas / Matplotlib graph so the x-axis ticks are ONLY hours and minutes?

I am trying to plot temperature with respect to time data from a csv file.
My goal is to have a graph which shows the temperature data per day.
My problem is the x-axis: I would like the time to be shown uniformly, in hours and minutes only, at 15-minute intervals, for example: 00:00, 00:15, 00:30.
The csv is loaded into a pandas dataframe, where I filter the data to be shown based on the day. In the code below, I want only the temperature data for the 18th day of the month.
Here is the csv data that I am loading in:
date,temp,humid
2020-10-17 23:50:02,20.57,87.5
2020-10-17 23:55:02,20.57,87.5
2020-10-18 00:00:02,20.55,87.31
2020-10-18 00:05:02,20.54,87.17
2020-10-18 00:10:02,20.54,87.16
2020-10-18 00:15:02,20.52,87.22
2020-10-18 00:20:02,20.5,87.24
2020-10-18 00:25:02,20.5,87.24
here is the python code to make the graph:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
df = pd.read_csv("saveData2020.csv")
#make new columns in dataframe so data can be filtered
df["New_Date"] = pd.to_datetime(df["date"]).dt.date
df["New_Time"] = pd.to_datetime(df["date"]).dt.time
df["New_hrs"] = pd.to_datetime(df["date"]).dt.hour
df["New_mins"] = pd.to_datetime(df["date"]).dt.minute
df["day"] = pd.DatetimeIndex(df['New_Date']).day
#filter the data to be only day 18
ndf = df[df["day"]==18]
#display dataframe in console
pd.set_option('display.max_rows', ndf.shape[0]+1)
print(ndf.head(10))
#plot a graph
ndf.plot(kind='line',x='New_Time',y='temp',color='red')
#edit graph to be sexy
plt.setp(plt.gca().xaxis.get_majorticklabels(),'rotation', 30)
plt.xlabel("time")
plt.ylabel("temp in C")
#show graph with the sexiness edits
plt.show()
here is the graph I get:
Answer
First of all, you have to convert the "New_Time" column (your x axis) to datetime type with:
ndf["New_Time"] = pd.to_datetime(ndf["New_Time"], format = "%H:%M:%S")
Then you can simply add this line of code before showing the plot (and import the proper matplotlib library, matplotlib.dates as md) to tell matplotlib you want only hours and minutes:
plt.gca().xaxis.set_major_formatter(md.DateFormatter('%H:%M'))
And this line of code to fix the 15 minutes span for the ticks:
plt.gca().xaxis.set_major_locator(md.MinuteLocator(byminute = [0, 15, 30, 45]))
For more info on x axis time formatting you can check this answer.
Code
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as md
df = pd.read_csv("saveData2020.csv")
#make new columns in dataframe so data can be filtered
df["New_Date"] = pd.to_datetime(df["date"]).dt.date
df["New_Time"] = pd.to_datetime(df["date"]).dt.time
df["New_hrs"] = pd.to_datetime(df["date"]).dt.hour
df["New_mins"] = pd.to_datetime(df["date"]).dt.minute
df["day"] = pd.DatetimeIndex(df['New_Date']).day
#filter the data to be only day 18
ndf = df[df["day"]==18].copy()  # copy, so the column assignment below does not act on a slice of df
ndf["New_Time"] = pd.to_datetime(ndf["New_Time"], format = "%H:%M:%S")
#display dataframe in console
pd.set_option('display.max_rows', ndf.shape[0]+1)
print(ndf.head(10))
#plot a graph
ndf.plot(kind='line',x='New_Time',y='temp',color='red')
#edit graph to be sexy
plt.setp(plt.gca().xaxis.get_majorticklabels(),'rotation', 30)
plt.xlabel("time")
plt.ylabel("temp in C")
plt.gca().xaxis.set_major_locator(md.MinuteLocator(byminute = [0, 15, 30, 45]))
plt.gca().xaxis.set_major_formatter(md.DateFormatter('%H:%M'))
#show graph with the sexiness edits
plt.show()
Plot
Notes
If you do not need the "New_Date", "New_Time", "New_hrs", "New_mins" and "day" columns for anything other than plotting, you can use a shorter version of the above code that drops those columns and applies the day filter directly to the "date" column:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md
df = pd.read_csv("saveData2020.csv")
# convert date from string to datetime
df["date"] = pd.to_datetime(df["date"], format = "%Y-%m-%d %H:%M:%S")
#filter the data to be only day 18
ndf = df[df["date"].dt.day == 18]
#display dataframe in console
pd.set_option('display.max_rows', ndf.shape[0]+1)
print(ndf.head(10))
#plot a graph
ndf.plot(kind='line',x='date',y='temp',color='red')
#edit graph to be sexy
plt.setp(plt.gca().xaxis.get_majorticklabels(),'rotation', 30)
plt.xlabel("time")
plt.ylabel("temp in C")
plt.gca().xaxis.set_major_locator(md.MinuteLocator(byminute = [0, 15, 30, 45]))
plt.gca().xaxis.set_major_formatter(md.DateFormatter('%H:%M'))
#show graph with the sexiness edits
plt.show()
This code will reproduce exactly the same plot as before.
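To try the locator and formatter without the CSV file, here is a small self-contained sketch. The readings are synthetic stand-ins (one value every 5 minutes over two hours), and the `Agg` backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.dates as md
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV: one reading every 5 minutes for two hours
times = pd.date_range("2020-10-18 00:00:02", periods=24, freq="5min")
temps = 20.5 + 0.05 * np.sin(np.arange(24))

fig, ax = plt.subplots()
ax.plot(times, temps, color="red")

# Ticks on every quarter hour, labelled with hours and minutes only
ax.xaxis.set_major_locator(md.MinuteLocator(byminute=[0, 15, 30, 45]))
ax.xaxis.set_major_formatter(md.DateFormatter("%H:%M"))
fig.autofmt_xdate(rotation=30)

# Render once so the tick labels are populated and can be inspected
fig.canvas.draw()
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
```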

matplotlib is not automatically reading dataframe as date

Since the last pandas update, the x axis is not reading the index as a date. Any clues on what changed? As an example, the following code (Source) creates a random df. The matplotlib part is exactly what I have been doing with my real dataset (dates in my data were made using time.strftime("%Y-%m-%d")):
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
# creates graph:
import matplotlib.pyplot as plt
fig = plt.plot(df.index, df["col2"])
fig = plt.xticks(rotation=30), plt.legend(loc='best'), plt.xlabel("Weeks")
fig = plt.style.use(['bmh', 'seaborn-paper'])
fig = plt.title("Index", fontsize=14, fontweight='bold')
plt.show()
The resulting graph has the x axis in number format. Before updating, my graphs automatically had dates in the index (because the index is in date format).
Pandas used to import the units handlers for datetime64, but as of 0.21 stopped (though it may be back for 0.22). The way to get the old behaviour without explicit conversion is
from pandas.tseries import converter as pdtc
pdtc.register()
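In more recent pandas versions the same registration is exposed as `pandas.plotting.register_matplotlib_converters()`, so the import above may not be available there. A minimal sketch, assuming such a version is installed; the data and the `Agg` backend are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import num2date
from pandas.plotting import register_matplotlib_converters

# Explicitly register the pandas date converters with Matplotlib
register_matplotlib_converters()

# Illustrative data: eight consecutive days
days = pd.date_range("2018-01-01", periods=8, freq="D")
df = pd.DataFrame({"col2": np.arange(8)}, index=days)

fig, ax = plt.subplots()
ax.plot(df.index, df["col2"])
fig.autofmt_xdate(rotation=30)

# The left axis limit, converted back from Matplotlib's date number to a datetime
left_limit = num2date(ax.get_xlim()[0])
```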
Solution 1
Use pandas .plot on the dataframe:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
# creates graph:
import matplotlib.pyplot as plt
sub = df.plot()
fig = plt.xticks(rotation=30), plt.legend(loc='best'), plt.xlabel("Weeks")
fig = plt.style.use(['bmh', 'seaborn-paper'])
fig = plt.title("Index", fontsize=14, fontweight='bold')
Solution 2
Convert them to Python datetime objects:
fig = plt.plot(df.index.to_pydatetime(), df["col2"])
Result of both approaches

How do I change the year interval on a Pandas DataFrame area plot?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dts
def use_matplot():
    ax = df.plot(x='year', kind="area" )
    years = dts.YearLocator(20)
    ax.xaxis.set_major_locator(years)
    fig = ax.get_figure()
    fig.savefig('output.pdf')
dates = np.arange(1990,2061, 1)
dates = dates.astype('str').astype('datetime64')
df = pd.DataFrame(np.random.randint(0, dates.size, size=(dates.size,3)), columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
use_matplot()
In the above code, I get an error, "ValueError: year 0 is out of range" when trying to set the YearLocator so as to ensure the X-Axis has year labels for every 20th year. By default the plot has the years show up every 10 years. What am I doing wrong? Desired outcome is simply a plot with 1990, 2010, 2030, 2050 on the bottom. (Instead of default 1990, 2000, 2010, etc.)
Since the years are simple numbers, you may opt for not using them as dates at all and keeping them as numbers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = np.arange(1990,2061, 1)
df = pd.DataFrame(np.random.randint(0,dates.size,size=(dates.size,3)),columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
ax = df.plot(x='year', kind="area" )
ax.set_xticks(range(1990,2061,20))  # ticks at 1990, 2010, 2030, 2050, as asked for in the question
plt.show()
Apart from that, using Matplotlib locators and formatters on date axes created via pandas will most often fail. This is due to pandas using a completely different datetime convention. In order to have more freedom for setting custom tickers for datetime axes, you may use matplotlib. A stackplot can be plotted with plt.stackplot. On such a matplotlib plot, the use of the usual matplotlib tickers is unproblematic.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dts
dates = np.arange(1990,2061, 1)
df = pd.DataFrame(np.random.randint(0,dates.size,size=(dates.size,3)),columns=list('ABC'))
df['year'] = pd.to_datetime(dates.astype(str))
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
plt.stackplot(df["year"].values, df[list('ABC')].values.T)
years = dts.YearLocator(20)
plt.gca().xaxis.set_major_locator(years)
plt.margins(x=0)
plt.show()
Consider using set_xticklabels to specify values of x axis tick marks:
ax.set_xticklabels(sum([[i,''] for i in range(1990, 2060, 20)], []))
# [1990, '', 2010, '', 2030, '', 2050, '']
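Note that `set_xticklabels` only renames the ticks that already exist, so the labels line up only if the current tick positions happen to match. A sketch that sets positions and labels together, assuming the years are kept as plain numbers; the random data and the `Agg` backend are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = np.arange(1990, 2061)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, years.size, size=(years.size, 3)),
                  columns=list("ABC"))
df.insert(0, "year", years)  # years stay numeric, no datetime conversion

ax = df.plot(x="year", kind="area")

# Derive the tick positions from the data: every 20th year, starting at the first year
tick_years = np.arange(years.min(), years.max() + 1, 20)
ax.set_xticks(tick_years)
ax.set_xticklabels([str(year) for year in tick_years])
```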

Pandas Line Graph by Month, Grouped by Industry from Timestamped SQL Export

Newbie question, thank you in advance!
I'm trying to group the data by both date and industry and display a chart that shows the different industry revenue numbers across the time series in monthly increments.
I am working from a SQL export that has timestamps, and I am having a bear of a time getting this to work.
Posted sample csv data file here:
https://drive.google.com/open?id=0B4xdnV0LFZI1WGRMN3AyU2JERVU
Here's a small data example:
Industry Date Revenue
Fast Food 01-05-2016 12:18:02 100
Fine Dining 01-08-2016 09:17:48 110
Carnivals 01-18-2016 10:48:52 200
My failed attempt is here:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
df = pd.read_csv('2012_to_12_27_2016.csv')
df['Ship_Date'] = pd.to_datetime(df['Ship_Date'], errors = 'coerce')
df['Year'] = df.Ship_Date.dt.year
df['Ship_Date'] = pd.DatetimeIndex(df.Ship_Date).normalize()
df.index = df['Ship_Date']
df_skinny = df[['Shipment_Piece_Revenue', 'Industry']]
groups = df_skinny[['Shipment_Piece_Revenue', 'Industry']].groupby('Industry')
groups = groups.resample('M').sum()
groups.index = df['Ship_Date']
fig, ax = plt.subplots()
groups.plot(ax=ax, legend=False)
names = [item[0] for item in groups]
ax.legend(ax.lines, names, loc='best')
plt.show()
You could use Series.unique on the Industry column to get a list of all industries and then, using DataFrame.loc, define a new DataFrame object that only contains data from a single Industry.
Then if we set the Ship Date column as the index of the new DataFrame, we can use DataFrame.resample, specify the frequency as months and call sum() to get the total revenue for that month.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Graph_Sample_Data.csv')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], errors='coerce')
fig, ax = plt.subplots()
for industry in df.Industry.unique():
    # keep only this industry's rows, indexed by ship date
    industry_df = df.loc[df.Industry == industry].set_index('Ship Date')
    # total revenue per month
    industry_df = industry_df[['Revenue']].resample('M').sum()
    # the DatetimeIndex is used as the x axis by default
    industry_df.plot(y='Revenue', ax=ax, label=industry)
plt.show()
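An alternative to the loop is to group by month and industry in one step and pivot the industries into columns. This is only a sketch: the rows below are hypothetical values in the shape of the question's sample, and grouping on `dt.to_period("M")` is a swapped-in technique rather than the `resample` call used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption), so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows in the shape of the question's sample data
df = pd.DataFrame({
    "Industry": ["Fast Food", "Fine Dining", "Carnivals",
                 "Fast Food", "Carnivals", "Fine Dining"],
    "Ship Date": pd.to_datetime(
        ["01-05-2016 12:18:02", "01-08-2016 09:17:48", "01-18-2016 10:48:52",
         "02-03-2016 08:00:00", "02-20-2016 15:30:00", "03-10-2016 10:00:00"],
        format="%m-%d-%Y %H:%M:%S"),
    "Revenue": [100, 110, 200, 50, 75, 90],
})

# Map every timestamp to its calendar month
month = df["Ship Date"].dt.to_period("M").rename("Month")

# Sum revenue per (month, industry), then turn the industries into columns;
# months without shipments for an industry are filled with 0
monthly = (df.groupby([month, "Industry"])["Revenue"]
             .sum()
             .unstack("Industry", fill_value=0))

# One line per industry, with the months on the x axis
ax = monthly.plot()
ax.set_ylabel("Revenue")
```

With the industries as columns, a single `plot` call draws all lines and builds the legend from the column names.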
