I'm trying to display data from a weather station with matplotlib. For a reason I can't quite figure out, the last values on the graph jump around, going back in time on the x axis.
x axis is the dates,
y axis is the water level
y1 axis is the discharge flow
Here's a picture of the result
Graph
import pandas as pd
import matplotlib.pyplot as plt
url_hourly = "https://dd.weather.gc.ca/hydrometric/csv/BC/hourly/BC_08MG005_hourly_hydrometric.csv"
url_daily = "https://dd.weather.gc.ca/hydrometric/csv/BC/daily/BC_08MG005_daily_hydrometric.csv"
fields = ["Date","Water Level / Niveau d'eau (m)", "Discharge / Débit (cms)"]
#Read csv files
hourly_data = pd.read_csv(url_hourly, usecols=fields)
day_data = pd.read_csv(url_daily, usecols=fields)
#Merge csv files
water_data = pd.concat([day_data,hourly_data])
#Convert date to datetime
water_data['Date'] = pd.to_datetime(water_data['Date']).dt.normalize()
water_data['Date'] = water_data['Date'].dt.strftime('%m/%d/%Y')
# CSV files contains 288 data entries per day (12per hour * 24hrs). Selecting every 288th element to represent one day
data_24hr = water_data[::288]
# Assigning columns to x, y, y1 axis
x = data_24hr[fields[0]]
y1 = data_24hr[fields[1]]
y2= data_24hr[fields[2]]
#Ploting the graph
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
curve1 = ax1.plot(x,y1, label='Water Level', color = 'r', marker="o")
curve2 = ax2.plot(x,y2,label='Discharge Volume', color = 'b',marker="o")
plt.plot()
plt.show()
Any tips would be greatly appreciated as I'm quite new to this
thank you
Okay, I went through the code and removed the duplicates in the "Date" column (as suggested by Arne). I also made the graph formatting slightly more readable. This plotted without going back in time:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
url_hourly = "https://dd.weather.gc.ca/hydrometric/csv/BC/hourly/BC_08MG005_hourly_hydrometric.csv"
url_daily = "https://dd.weather.gc.ca/hydrometric/csv/BC/daily/BC_08MG005_daily_hydrometric.csv"
fields = ["Date","Water Level / Niveau d'eau (m)", "Discharge / Débit (cms)"]
#Read csv files
hourly_data = pd.read_csv(url_hourly, usecols=fields)
day_data = pd.read_csv(url_daily, usecols=fields)
#Merge csv files
water_data = pd.concat([day_data,hourly_data])
#Convert date to datetime
water_data['Date'] = pd.to_datetime(water_data['Date']).dt.normalize()
water_data['Date'] = water_data['Date'].dt.strftime('%m/%d/%Y')
# CSV files contains 288 data entries per day (12per hour * 24hrs). Selecting every 288th element to represent one day
# Drop duplicates on the date column in the same step, so no in-place change is made on a slice
data_24hr = water_data.iloc[::288].drop_duplicates(subset="Date")
# Assigning columns to x, y, y1 axis
x = data_24hr[fields[0]]
y1 = data_24hr[fields[1]]
y2= data_24hr[fields[2]]
print(len(x), len(y1))
#Ploting the graph
fig, ax1 = plt.subplots()
ax2 = plt.twinx()
curve1 = ax1.plot(x, y1, label='Water Level', color = 'r', marker="o")
curve2 = ax2.plot(x, y2, label='Discharge Volume', color = 'b',marker="o")
fig.autofmt_xdate(rotation=90)
plt.show()
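A variant of the same fix, as a sketch rather than a drop-in replacement: keep the Date column as real datetimes instead of formatting it to strings, so that sorting is chronological and matplotlib treats the x axis as time. The small `daily` and `hourly` frames and the `level` column below are made-up stand-ins for the two CSV downloads, not the real station data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for the daily and hourly files; they overlap on 2021-01-03.
daily = pd.DataFrame({
    "Date": ["2021-01-01 00:00", "2021-01-02 00:00", "2021-01-03 00:00"],
    "level": [1.0, 1.2, 1.1],
})
hourly = pd.DataFrame({
    "Date": ["2021-01-03 00:00", "2021-01-03 12:00", "2021-01-04 00:00"],
    "level": [1.1, 1.3, 1.4],
})

water = pd.concat([daily, hourly], ignore_index=True)
# Keep datetimes (no strftime); normalize() only drops the time of day.
water["Date"] = pd.to_datetime(water["Date"]).dt.normalize()

# One row per calendar day, in chronological order.
per_day = (
    water.drop_duplicates(subset="Date")
         .sort_values("Date")
         .reset_index(drop=True)
)

fig, ax = plt.subplots()
ax.plot(per_day["Date"], per_day["level"], color="r", marker="o", label="Water Level")
fig.autofmt_xdate(rotation=90)
```

With datetimes on the x axis, matplotlib places each point by its actual date, so a repeated or out-of-order day can no longer pull the line backwards once the frame is de-duplicated and sorted.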
I have a lot of quarter-hourly data (consumption versus time). I need to average these data, and I would like to display the averages by day of the week and time of day.
So I am looking to show the day and the time together on the same graph. The expected result is possible in Excel, but I want to do it in Python with matplotlib (using dataframes).
If you have any idea, thanks a lot!
Guillaume
Here is some code that gives a decent result, but I would like something better.
I'm sorry, but I can't attach an image directly because I'm new on the forum.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
columns = ["Date/Time","Value"]
# 672 quarter-hourly samples = 7 days * 96 samples per day
Jour1 = pd.to_datetime('02/01/2021')
value = np.random.randint(100, 150, size=(672,))
# Build the frame in one go; DataFrame.append was removed in pandas 2.0
timestamps = pd.date_range(Jour1, periods=672, freq='15min')
new_df = pd.DataFrame({columns[0]: timestamps, columns[1]: value})
new_df['Day of week Name'] = new_df['Date/Time'].dt.dayofweek.astype(str) + ' - '+ new_df['Date/Time'].dt.day_name()
new_df["Time"] = new_df['Date/Time'].dt.time
new_df = new_df.groupby(['Day of week Name','Time'])['Value'].sum().reset_index()
new_df['TimeShow'] = new_df['Day of week Name'] +' '+ new_df['Time'].astype(str)
fig = plt.figure(figsize=(18,10))
ax=fig.add_subplot(111)
ax.plot(new_df['TimeShow'], new_df['Value'], label="Test", linewidth = 2)
plt.xticks(['0 - Monday 00:00:00','1 - Tuesday 00:00:00','2 - Wednesday 00:00:00','3 - Thursday 00:00:00','4 - Friday 00:00:00','5 - Saturday 00:00:00','6 - Sunday 00:00:00'])
plt.show()
Image in python
Image in excel - day not in order
EDIT :
Thanks to your help, I finally found something that works for me. I don't know whether the code is optimal, but it works. Here it is, in case anyone needs it:
fig = plt.figure(figsize=(18,10))
ax=fig.add_subplot(111)
date_rng = pd.date_range('2021-01-01 00:00:00','2021-01-08 00:00:00', freq='6h')
xlabels = pd.DataFrame(index=date_rng)
xlabels = xlabels.index.strftime('%H:%M').tolist()
liste_saisons = df['Saison'].unique().tolist()
for saisons in liste_saisons :
    df_show = df.loc[(df['Saison'] == saisons)]
    df_show = df_show.groupby(['Jour Semaine Nom','Time'],as_index=False)['SUM(CORR_VALUE)'].mean()
    df_show['TimeShow'] = df_show['Jour Semaine Nom'] +' '+ df_show['Time'].astype(str)
    ax.plot(df_show.index, df_show['SUM(CORR_VALUE)'], label=saisons, linewidth = 3)
fig.suptitle('Evolution de la charge BT quart-horaire moyenne semaine', fontsize=20)
plt.xlabel('Jour de la semaine + Heure', fontsize=20)
plt.ylabel('Charge BT quart-horaire moyenne [MW]', fontsize = 20)
plt.rc('legend', fontsize=16)
ax.legend(loc='upper left')
plt.grid(color='k', linestyle='-.', linewidth=1)
ax.set_xticklabels(xlabels)
plt.xticks(np.arange(0, 96*7, 4*6))
plt.ylim(50,350)
xdays = df_show["Jour Semaine Nom"].tolist()
graph_pos = plt.gca().get_position()
points = np.arange(48, len(xdays), 96)
day_points = np.arange(0, len(xdays), 96)
offset = -65.0
trans = ax.get_xaxis_transform()
for i,d in enumerate(xdays):
    if i in points:
        ax.text(i, graph_pos.y0 - offset, d, ha='center',bbox=dict(facecolor='cyan', edgecolor='black', boxstyle='round'), fontsize=12)
plt.show()
Result
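For the averaging step itself, here is a minimal sketch of one way to keep the days in calendar order: group on the numeric weekday (0 = Monday) and only build the readable label afterwards. The toy data, the `DAY_NAMES` list and the `weekday`, `time` and `label` column names are assumptions made for illustration; they are not taken from the real consumption data.

```python
import numpy as np
import pandas as pd

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Two weeks of made-up quarter-hourly readings, starting on a Monday.
rng = np.random.default_rng(0)
ts = pd.date_range("2021-02-01", periods=2 * 672, freq="15min")
df = pd.DataFrame({"Date/Time": ts, "Value": rng.integers(100, 150, size=len(ts))})

# Group on the numeric weekday and the time of day, so the result is
# sorted Monday..Sunday instead of alphabetically by day name.
weekly = (
    df.groupby([df["Date/Time"].dt.dayofweek.rename("weekday"),
                df["Date/Time"].dt.time.rename("time")])["Value"]
      .mean()
      .reset_index()
)

# Readable label, built only after the ordering is fixed.
weekly["label"] = (
    weekly["weekday"].map(dict(enumerate(DAY_NAMES)))
    + " "
    + weekly["time"].astype(str)
)
```

The resulting frame has one row per weekday and quarter hour (672 rows), and its integer index can be used as the x position when plotting, with `label` supplying the tick text.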
There are many possible approaches to this kind of task; I used the text and plot functions. To add the date labels, I took the position of the graph and subtracted an offset value from its y0 value to determine where the text goes. For each day boundary, I manually set the y values of a short vertical line drawn below the axis.
PS: To get a faster answer, post your code even if it is unfinished, attach an image instead of a link, and include toy data as text. This is necessary.
import pandas as pd
import numpy as np
date_rng = pd.date_range('2021-01-01','2021-03-01', freq='1h')
value = np.random.randint(100, 150, size=(1417,))
df = pd.DataFrame({'date':pd.to_datetime(date_rng),'value':value})
import matplotlib.pyplot as plt
w = 0.7
fig,ax = plt.subplots(figsize=(20,4))
ax.bar(df.date[:100].apply(lambda x:x.strftime('%Y-%m-%d %H:%M:%S')), df.value[:100], color='C0', width=w, align="center")
xlabels = df.date[:100].apply(lambda x:x.strftime('%H:%M:%S')).tolist()
xdays = df.date[:100].apply(lambda x:x.strftime('%d-%b')).tolist()
ax.set_xticklabels(xlabels, rotation=90)
graph_pos = plt.gca().get_position()
points = np.arange(12, len(xlabels), 24)
day_points = np.arange(0, len(xlabels), 24)
offset = 50.0
trans = ax.get_xaxis_transform()
for i,d in enumerate(xdays):
    if i in points:
        ax.text(i, graph_pos.y0 - offset, d, ha='center')
    if i in day_points:
        ax.plot([i, i], [0, -0.3], color='gray', transform=trans, clip_on=False)
ax.set_xlim(-1, len(xlabels))
plt.show()
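A different technique from the manual text placement above, sketched under the assumption that the x values can stay as real datetimes: matplotlib's date locators can put the time on minor ticks and the date on major ticks. I use a line plot here instead of bars to keep the sketch short, and the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to get a window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 100 hours of made-up data on a real datetime axis.
date_rng = pd.date_range("2021-01-01", periods=100, freq="h")
value = np.random.randint(100, 150, size=len(date_rng))

fig, ax = plt.subplots(figsize=(20, 4))
ax.plot(date_rng, value)

# Minor ticks: every 6 hours, labelled with the time of day.
ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=range(0, 24, 6)))
ax.xaxis.set_minor_formatter(mdates.DateFormatter("%H:%M"))

# Major ticks: once per day; the leading newline pushes the date onto a second row.
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("\n%d-%b"))
```

This avoids computing text offsets by hand, at the cost of less control over exactly where the day label sits.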
When using matplotlib to graph time series data, I get strange mangled outputs
This only happens when I convert the time series string in the csv to a datetime object. I need to do this conversion to make use of Matplotlib's time series x axis labelling.
abc = pd.read_csv(path + "Weekly average capacity factor and demand - regular year" + ".csv", parse_dates=True, index_col="Month", header=0)
fig, ax = plt.subplots()
x = abc.index
y1 = abc["Wind"]
curve1 = ax.plot(x, y1)
pd.read_csv(parse_dates=True) creates the index as a datetime64[ns] object. Perhaps this isn't optimized for use by matplotlib?
How can I make this work?
Your date index is not in order. Matplotlib does not sort the data prior to plotting, it assumes list order is the data order and tries to connect the points (You can check this by plotting a scatter plot and your data will look fine). You have to sort your data before trying to plot it.
abc = pd.read_csv(path + "Weekly average capacity factor and demand - regular year" + ".csv", parse_dates=True, index_col="Month", header=0)
x, y1 = zip(*sorted(zip(abc.index, abc["Wind"])))
fig, ax = plt.subplots()
curve1 = ax.plot(x, y1)
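The same sorting can also be done on the DataFrame itself with `sort_index()`, which keeps every column aligned with the dates. A minimal sketch with made-up values standing in for the CSV:

```python
import pandas as pd

# Made-up weekly data whose date index is out of order, standing in for the CSV.
abc = pd.DataFrame(
    {"Wind": [0.30, 0.10, 0.20]},
    index=pd.to_datetime(["2020-01-15", "2020-01-01", "2020-01-08"]),
)

abc = abc.sort_index()  # chronological order; each row keeps its own values
x = abc.index
y1 = abc["Wind"]
```

After this, `ax.plot(x, y1)` connects the points in date order, as in the question's plotting code.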
I have parsed data from a .json file and plotted it, but I only want a certain range of it,
e.g. year-month = 2014-12 to 2020-03
The code is:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_json("observed-solar-cycle-indices.json", orient='records')
data = pd.DataFrame(data)
print(data)
x = data['time-tag']
y = data['ssn']
plt.plot(x, y, 'o')
plt.xlabel('Year-day'), plt.ylabel('SSN')
plt.show()
Here is the result; as you can see, there are too many points
here is the json file: https://services.swpc.noaa.gov/json/solar-cycle/observed-solar-cycle-indices.json
How can I either parse out certain values from the JSON file or plot only a certain range?
The following should work:
Select the data using a start and end date
ndata = data[ (data['time-tag'] > '2014-01') & (data['time-tag'] < '2020-12')]
Plot the data. The x-axis labeling is adapted to display only every 12th label
x = ndata['time-tag']
y = ndata['ssn']
fig, ax = plt.subplots()
plt.plot(x, y, 'o')
every_nth = 12
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
plt.xlabel('Year-Month')
plt.xticks(rotation='vertical')
plt.ylabel('SSN')
plt.show()
You could do a search for the index value of your start and end dates for both x and y values. Use this to create a smaller set of lists that you can plot.
For example, it might be something like
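# A pandas Series has no .index() method, so work with plain Python lists
x = data['time-tag'].tolist()
y = data['ssn'].tolist()
start_index = x.index('2014-12')
end_index = x.index('2020-03')
# +1 so that the end month itself is included in the slice
x_subsection = x[start_index : end_index + 1]
y_subsection = y[start_index : end_index + 1]
plt.plot(x_subsection, y_subsection, 'o')
plt.xlabel('Year-Month'), plt.ylabel('SSN')
plt.show()
The .index() lookup is a list method, which is why the columns are converted with .tolist() first.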
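Another option, sketched here on made-up records: parse the tags into datetimes and select the range with `Series.between`, which is inclusive on both ends. The sample assumes the tags are "YYYY-MM" strings, matching the range given in the question; the values are invented.

```python
import pandas as pd

# Made-up records, assuming "time-tag" holds "YYYY-MM" strings.
data = pd.DataFrame({
    "time-tag": ["2014-11", "2014-12", "2017-06", "2020-03", "2020-04"],
    "ssn": [70.1, 78.0, 19.2, 1.5, 5.2],
})

# Real datetimes make the comparison chronological rather than textual.
data["time-tag"] = pd.to_datetime(data["time-tag"], format="%Y-%m")

in_range = data["time-tag"].between("2014-12-01", "2020-03-01")  # both ends included
ndata = data[in_range]
```

Plotting `ndata['time-tag']` against `ndata['ssn']` then also gives matplotlib a date axis, so it picks a readable number of tick labels on its own.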
I have very simple code:
from matplotlib import dates
import matplotlib.ticker as ticker
my_plot=df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90)
I've got:
but I would like to have fewer labels on the X axis. To do this I've added:
my_plot.xaxis.set_major_locator(ticker.MaxNLocator(12))
It generates fewer labels, but the labels show the wrong values (the first few labels from the whole list).
What am I doing wrong?
I have added some additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
The box plot groups the data by 'Day', and I would like to show the 'Day' values as labels, but not all of them, for example every third one.
You were almost there. Just set ticker.MultipleLocator.
The pandas.DataFrame.boxplot also returns axes, which is an object of class matplotlib.axes.Axes. So you can use this code snippet to customize your labels:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
center = np.random.randint(50,size=(10, 20))
spread = np.random.rand(10, 20) * 30
flier_high = np.random.rand(10, 20) * 30 + 30
flier_low = np.random.rand(10, 20) * -30
y = np.concatenate((spread, center, flier_high, flier_low))
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(y)
x = ['Label '+str(i) for i in range(20)]
ax.set_xticklabels(x)
ax.set_xlabel('Day')
# Set a tick on each integer multiple of a base within the view interval.
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.xticks(rotation=90)
I think there is a compatibility issue with Pandas plots and Matplotlib formatters.
With the following code:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
You get this chart:
But I notice that this is grouped by month instead of by day. It may not be what you wanted.
Adding the day component to the string 'Day' messes up the chart as there seems to be too many boxes.
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 15 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
The for loop keeps a tick label only every N periods and blanks out the rest. In the first chart the labels were set every 3 months. In the second one, every 15 days.
If you would like to see fewer grid lines:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
new_locs = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        loc = locs[i]
        i += 1
        new_labels.append(label)
        new_locs.append(loc)
    else:
        i += 1
ax.set_xticks(new_locs)
ax.set_xticklabels(new_labels)
ax.grid(axis='y')
plt.show()
I've read about x_compat in Pandas plot in order to apply Matplotlib formatters, but I get an error when trying to apply it. I'll give it another shot later.
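The loops above can also be written with slicing: take every nth tick position together with every nth label, so the two always stay paired. This is a sketch on made-up data using matplotlib's own `boxplot`, which draws the boxes at x = 1, 2, 3, ...; the `labels` and `samples` names are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to get a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
labels = [f"2017-{m:02d}" for m in range(1, 13)]  # made-up group names
samples = [rng.normal(size=50) for _ in labels]   # one made-up sample per box

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(samples)  # boxes are drawn at x = 1, 2, ..., len(labels)

every_nth = 3
positions = np.arange(1, len(labels) + 1)

# Slice positions and labels with the same step so each tick keeps its own label.
ax.set_xticks(positions[::every_nth])
ax.set_xticklabels(labels[::every_nth], rotation=90)
```

Setting the positions and the labels together is what avoids the mismatch from the question, where a locator picked new positions while the label list was still consumed from its start.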
Old unsuccessful answer
The tick labels seem to be dates. If they are set as datetime in your dataframe, you can:
months = mdates.MonthLocator(bymonth=(1,4,7,10)) #Choose the months you like the most
ax.xaxis.set_major_locator(months)
Otherwise, you can let Matplotlib know they are dates by:
ax.xaxis_date()
Your comment:
I have add additional information:
I've forgoten to show what is inside DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
Box plot group date by 'Day' and I would like to show values 'Day" as a label but not all values, for example every third one.
Based on your comment in italic above, I would use reg_Date as the input and the following lines:
days = mdates.DayLocator(interval=3)
daysFmt = mdates.DateFormatter('%Y-%m') #to format display
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
I forgot to mention that you will need to:
import matplotlib.dates as mdates
Does this work?
I am trying to convert a line graph to a bar graph using Python pandas.
Here is my code, which gives a perfect line graph as per my requirement.
conn = sqlite3.connect('Demo.db')
collection = ['ABC','PQR']
df = pd.read_sql("SELECT * FROM Table where ...", conn)
df['DateTime'] = df['Timestamp'].apply(lambda x: dt.datetime.fromtimestamp(x))
df.groupby('Type').plot(x='DateTime', y='Value',linewidth=2)
plt.legend(collection)
plt.show()
Here is my DataFrame df
http://postimg.org/image/75uy0dntf/
Here is my Line graph output from above code.
http://postimg.org/image/vc5lbi9xv/
I want to draw a bar graph instead of a line graph. I want the month name on the x axis and the value on the y axis. I want a colorful bar graph.
Attempt made
df.plot(x='DateTime', y='Value',linewidth=2, kind='bar')
plt.show()
It gives an improper bar graph with date and time (instead of month and year) on the x axis. Thank you for your help.
Here is a code that might do what you want.
In this code, I first sort your database by time. This step is important, because I use the indices of the sorted database as abscissa of your plots, instead of the timestamp. Then, I group your data frame by type and I plot manually each group at the right position (using the sorted index). Finally, I re-define the ticks and the tick labels to display the date in a given format (in this case, I chose MM/YYYY but that can be changed).
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
types = ['ABC','BCD','PQR']*3
vals = [126,1587,141,10546,1733,173,107,780,88]
ts = [1414814371, 1414814371, 1406865621, 1422766793, 1422766793, 1425574861, 1396324799, 1396324799, 1401595199]
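aset = list(zip(types, vals, ts))
df = pd.DataFrame(data=aset, columns=['Type', 'Value', 'Timestamp'])
# DataFrame.sort was removed from pandas; sort_values replaces it.
# Resetting the index makes the index follow the sorted order, so it can be used as the x position.
df = df.sort_values(['Timestamp', 'Type']).reset_index(drop=True)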
df['Date'] = df['Timestamp'].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%m/%Y'))
groups = df.groupby('Type')
ngroups = len(groups)
colors = ['r', 'g', 'b']
fig = plt.figure()
ax = fig.add_subplot(111, position=[0.15, 0.15, 0.8, 0.8])
offset = 0.1
width = 1-2*offset
#
for j, group in enumerate(groups):
    x = group[1].index+offset
    y = group[1].Value
    # align='edge' keeps each bar inside its own [index, index+1] slot
    ax.bar(x, y, width=width, color=colors[j], label=group[0], align='edge')
xmin, xmax = min(df.index), max(df.index)+1
ax.set_xlim([xmin, xmax])
# Use booleans to hide the tick marks on the x axis
ax.tick_params(axis='x', which='both', top=False, bottom=False)
plt.xticks(np.arange(xmin, xmax)+0.5, list(df['Date']), rotation=90)
ax.legend()
plt.show()
I hope this works for you. This is the output that I get, given my subset of your database.
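If one bar per month and type is enough, pandas can do the grouping and the plotting directly. This is a minimal sketch on made-up rows, not the data from the question; the `Month` column and the `monthly` name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the same shape as the question's data frame.
df = pd.DataFrame({
    "Type": ["ABC", "PQR", "ABC", "PQR"],
    "Value": [126, 141, 107, 88],
    "DateTime": pd.to_datetime(["2014-11-01", "2014-11-01", "2014-04-01", "2014-04-01"]),
})

# Monthly periods sort chronologically, unlike formatted strings.
df["Month"] = df["DateTime"].dt.to_period("M")

# One row per month, one column per type.
monthly = df.pivot_table(index="Month", columns="Type", values="Value", aggfunc="sum")

# Format the labels only after the rows are in order.
monthly.index = monthly.index.strftime("%m/%Y")

ax = monthly.plot(kind="bar")  # one colored bar per type within each month
ax.set_ylabel("Value")
```

Using `"%b %Y"` in the `strftime` call gives month names instead of numbers.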