Pandas: Histogram Plotting - python

I have a dataframe with dates (datetime) in python. How can I plot a histogram with 30 min bins from the occurrences using this dataframe?
starttime
1 2016-09-11 00:24:24
2 2016-08-28 00:24:24
3 2016-07-31 05:48:31
4 2016-09-11 00:23:14
5 2016-08-21 00:55:23
6 2016-08-21 01:17:31
.............
989872 2016-10-29 17:31:33
989877 2016-10-02 10:00:35
989878 2016-10-29 16:42:41
989888 2016-10-09 07:43:27
989889 2016-10-09 07:42:59
989890 2016-11-05 14:30:59
I have tried looking at examples from Plotting series histogram in Pandas and A per-hour histogram of datetime using Pandas. But they seem to be using a bar plot which is not what I need. I have attempted to create the histogram using temp.groupby([temp["starttime"].dt.hour, temp["starttime"].dt.minute]).count().plot(kind="hist") giving me the results as shown below
If possible I would like the X axis to display the time(e.g 07:30:00)

I think you need a bar plot. To get times on the axis, the simplest approach is to convert the datetimes to strings with strftime:
temp = temp.resample('30min', on='starttime').count()
ax = temp.groupby(temp.index.strftime('%H:%M')).sum().plot(kind="bar")
# Hide every other tick label so the bar labels stay readable
spacing = 2
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
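As an alternative sketch, the timestamps can be floored to 30-minute bins by time of day and counted directly. The sample rows are copied from the question, while the names `bins` and `counts` are made up here for illustration. Bins with no occurrences simply do not appear in the result.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small sample standing in for the question's `temp` dataframe
temp = pd.DataFrame({"starttime": pd.to_datetime([
    "2016-09-11 00:24:24", "2016-08-28 00:24:24", "2016-07-31 05:48:31",
    "2016-09-11 00:23:14", "2016-08-21 00:55:23", "2016-08-21 01:17:31",
])})

# Floor each timestamp to the start of its 30-minute bin,
# then keep only the time of day as an "HH:MM" string
bins = temp["starttime"].dt.floor("30min").dt.strftime("%H:%M")

# Count occurrences per bin, ordered by time of day
counts = bins.value_counts().sort_index()

# width=1.0 removes the gaps between bars so the plot reads like a histogram
ax = counts.plot(kind="bar", width=1.0)
ax.set_xlabel("time of day")
ax.set_ylabel("occurrences")
# Call plt.show() in an interactive session to display the figure
```

Because the bin labels are zero-padded strings, sorting the index alphabetically also sorts it chronologically.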

Related

Change tick frequency for datetime axis [duplicate]

I have the following dataframe:
Date Prod_01 Prod_02
19 2018-03-01 49870 0.0
20 2018-04-01 47397 0.0
21 2018-05-01 53752 0.0
22 2018-06-01 47111 0.0
23 2018-07-01 53581 0.0
24 2018-08-01 55692 0.0
25 2018-09-01 51886 0.0
26 2018-10-01 56963 0.0
27 2018-11-01 56732 0.0
28 2018-12-01 59196 0.0
29 2019-01-01 57221 5.0
30 2019-02-01 55495 472.0
31 2019-03-01 65394 753.0
32 2019-04-01 59030 1174.0
33 2019-05-01 64466 2793.0
34 2019-06-01 58471 4413.0
35 2019-07-01 64785 6110.0
36 2019-08-01 63774 8360.0
37 2019-09-01 64324 9558.0
38 2019-10-01 65733 11050.0
And I need to plot a time series of the 'Prod_01' column.
The 'Date' column is in the pandas datetime format.
So I used the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(rotation=45, horizontalalignment='right');
Output:
However, I want to change the frequency of the xticks to one month, so I get one tick and one label for each month.
I have tried the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(np.arange(1, len(test), 1), test['Date'] ,rotation=45, horizontalalignment='right');
But I get this:
How can I solve this problem?
Thanks in advance.
I'm not very familiar with pandas data frames. However, I can't see why this wouldn't work with any pyplot figure.
According to the top SO answer on a related post, by ImportanceOfBeingErnest:
The spacing between ticklabels is exclusively determined by the space between ticks on the axes.
So, to change the distance between ticks, and the labels you can do this:
Suppose a cluttered and base-10 centered person displays the following graph:
It takes the following code and importing matplotlib.ticker:
import numpy as np
import matplotlib.pyplot as plt
# Import this, too
import matplotlib.ticker as ticker
# Arbitrary graph with x-axis = [-32..32]
x = np.linspace(-32, 32, 1024)
y = np.sinc(x)
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(5))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(1))
# -----------------------------------------------------
# Plot and show graph
plt.plot(x, y)
plt.show()
To change where the labels are placed, you can change the distance between the 'major ticks'. You can also change the smaller 'minor ticks' in between, which don't have a number attached. E.g., on a clock, the hour ticks have numbers on them and are larger (major ticks) with smaller, unlabeled ones between marking the minutes (minor ticks).
By changing the --- Look Here --- part to:
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(8))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(4))
# -----------------------------------------------------
You can generate the cleaner and more elegant graph below:
Hope that helps!
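For a datetime x-axis specifically, matplotlib.dates offers locators that work in calendar units. Here is a minimal sketch, assuming the question's dataframe is rebuilt by hand as `test` with the Date and Prod_01 values shown above. It is one possible approach, not necessarily what the linked duplicate recommends.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Rebuild the question's data: monthly dates from 2018-03-01 to 2019-10-01
test = pd.DataFrame({
    "Date": pd.date_range("2018-03-01", periods=20, freq="MS"),
    "Prod_01": [49870, 47397, 53752, 47111, 53581, 55692, 51886, 56963,
                56732, 59196, 57221, 55495, 65394, 59030, 64466, 58471,
                64785, 63774, 64324, 65733],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(test["Date"], test["Prod_01"], linewidth=2, color="steelblue")

# Place one major tick at the start of every month
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
# Label each tick as year-month
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Rotate the labels so they do not overlap
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment="right")
# Call plt.show() in an interactive session to display the figure
```

Raising `interval` (for example to 3) would thin the ticks to one per quarter.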

how to visualize columns of a dataframe python as a plot?

I have a dataframe that looks like below:
DateTime ID Temperature
2019-03-01 18:36:01 3 21
2019-04-01 18:36:01 3 21
2019-18-01 08:30:01 2 18
2019-12-01 18:36:01 2 12
I would like to visualize this as a plot with DateTime on the x-axis and Temperature on the y-axis, with a hue for the IDs. I tried the code below, but I need to see the Temperature distribution for every point more clearly. Is there any other visualization technique?
x= df['DateTime'].values
y= df['Temperature'].values
hue=df['ID'].values
plt.scatter(x, y,hue,color = "red")
you can try:
df.set_index('DateTime').plot()
output:
or you can use:
df.set_index('DateTime').plot(style="x-", figsize=(15, 10))
output:
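To get one color per ID, which is what the question means by hue, one option is to plot each ID group as its own line. This is a rough sketch that uses the four rows from the question. It assumes the strings are in year-day-month order, because "2019-18-01" cannot be month 18. That format is a guess and should be checked against the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "DateTime": ["2019-03-01 18:36:01", "2019-04-01 18:36:01",
                 "2019-18-01 08:30:01", "2019-12-01 18:36:01"],
    "ID": [3, 3, 2, 2],
    "Temperature": [21, 21, 18, 12],
})

# Assumption: year-day-month order, inferred from "2019-18-01"
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%d-%m %H:%M:%S")

fig, ax = plt.subplots()
# One line (with point markers) per ID; matplotlib cycles the colors
for sensor_id, group in df.sort_values("DateTime").groupby("ID"):
    ax.plot(group["DateTime"], group["Temperature"],
            marker="o", label=f"ID {sensor_id}")

ax.set_xlabel("DateTime")
ax.set_ylabel("Temperature")
ax.legend()
# Call plt.show() in an interactive session to display the figure
```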

pandas display categories incorrect displayed in matplotlib

I am trying to represent categories in matplotlib and for some reason I have categories overlapping on x-axis, as well as missing categories, but y-axis values present. I marked this with red arrows in the picture from the bottom of the question.
The data is contained in sales.csv file that looks like this:
date,first name,last name,city,cost,rooms,bathrooms,type,status
2018-03-04 12:13:21,Linda,Evangelista,Balm Beach,333000,2,2,townhouse,sold
2018-02-01 07:20:20,Rita,Ford,Balm Beach,818000,2,2,detached,sold
2018-03-08 07:13:00,Ali,Hassan,Bowmanville,413000,2,2,bungalow,forsale
2018-05-08 21:00:00,Rashid,Forani,Bowmanville,467000,2,2,townhouse,sold
2018-02-07 16:43:00,Kumar,Yoshi,Bowmanville,613000,3,3,bungalow,sold
2018-01-05 13:43:00,Srini,Santinaram,Bowmanville,723000,2,2,bungalow,forsale
2018-01-03 14:19:00,Maria,Dugall,Brampton,900000,4,3,semidetached,forsale
2018-05-04 19:22:00,Zina,Evangel,Burlington,221000,1,1,townhouse,forsale
2018-05-01 19:44:00,Pierre,Merci,Gatineau,3199000,14,14,bungalow,forsale
2018-05-31 18:10:00,Istvan,Kerekes,Kingston,1110000,4,5,bungalow,sold
2018-03-25 08:22:00,Dumitru,Plamada,Kingston,1650000,5,5,bungalow,forsale
2018-01-01 11:54:00,John,Smith,Markham,1200000,3,3,bungalow,sold
2018-05-07 15:30:00,Arturo,Gonzales,Mississauga,187000,3,3,bungalow,forsale
2018-03-07 22:20:00,Lei,Zhang,North York,122000,1,1,townhouse,forsale
2018-05-04 20:04:00,William,King,Oaks,,3,3,bungalow,sold
2018-03-04 13:05:00,Jeffrey,Kong,Oakville,,2,2,townhouse,forsale
2018-01-04 17:23:00,Abdul,Karrem,Orillia,883000,3,4,townhouse,sold
2018-03-01 13:09:00,Jean,Paumier,Ottawa,1520000,4,4,townhouse,sold
2018-02-01 10:00:00,Ken,Beaufort,Ottawa,3440000,5,5,bungalow,forsale
2018-02-15 11:33:00,Gheorghe,Ionescu,Richmond Hill,1630000,4,3,bungalow,forsale
2018-01-05 10:32:00,Ion,Popescu,Scarborough,1420000,5,3,semidetached,sold
2018-02-07 11:44:00,Xu,Yang,Toronto,422000,2,2,townhouse,forsale
2018-05-29 00:33:00,Giovanni,Gianparello,Toronto,1917000,4,4,bungalow,forsale
2018-03-25 08:27:00,John,Saint-Claire,Toronto,3337000,5,4,bungalow,forsale
2018-01-06 14:06:00,Ann,Murdoch Pyrell,Toronto,1427000,5,4,bungalow,forsale
2018-02-15 13:12:00,Claire,Coldwell,Toronto,3777000,5,4,bungalow,forsale
2018-01-02 09:37:00,Kyle,MCDonald,Toronto,,2,2,townhouse,forsale
2018-02-01 21:22:00,Miriam,Berg,Toronto,,4,4,townhouse,forsale
The code to load the data and display the graph is below:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
sales_brute = pd.read_csv('sales.csv', parse_dates=True, index_col='date')
# Fix the columns names by stripping the extra spaces
sales_brute = sales_brute.rename(columns=lambda x: x.strip())
# Fix the N/A from cost column
sales_brute['cost'].fillna(sales_brute['cost'].mean(), inplace=True)
# Draws a scattered plot, price by cities. Change the colors of plot.
plt.scatter(sales_brute['city'], sales_brute['cost'], color='red')
# Rotates the ticks with 70 grd
plt.xticks(sales_brute['city'], rotation=70)
plt.tight_layout()
# Add grid
plt.grid()
plt.show()
and the result looks strangely like this (figure: incorrect display of categories):
Maybe we have different versions of matplotlib, but I can't use plt.scatter at all with sales_brute['city'] as first argument.
ValueError: could not convert string to float: 'Toronto'
Instead I made up a new x-axis:
x = range(len(sales_brute))
plt.scatter(x=x, y=sales_brute['cost'], color='red')
plt.xticks(x, sales_brute['city'], rotation=70)
plt.show()
Which results in:
(some stretching required to see the full names)
plt.scatter seems to be happy to take strings as the x-coordinate and arrange them in alphabetical order. plt.xticks, however, wants a list matching the number of ticks and in the same order.
If you change:
plt.xticks(sales_brute['city'], rotation=70)
to
plt.xticks(sales_brute['city'].sort_values().unique(), rotation=70)
you'll get the effect you want.
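A version-independent way to keep points and labels aligned is to map each city to an integer position and label those positions explicitly. The sketch below uses a five-row subset of sales.csv typed in by hand. The names `codes` and `cities` are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hand-copied subset of the question's sales.csv
sales = pd.DataFrame({
    "city": ["Toronto", "Ottawa", "Balm Beach", "Toronto", "Ottawa"],
    "cost": [422000, 1520000, 333000, 1917000, 3440000],
})

# codes[i] is the position of row i's city in the sorted list of unique cities
codes, cities = pd.factorize(sales["city"], sort=True)

fig, ax = plt.subplots()
ax.scatter(codes, sales["cost"], color="red")

# One tick per unique city, labelled in the same order as the codes
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(cities, rotation=70)
ax.grid()
fig.tight_layout()
# Call plt.show() in an interactive session to display the figure
```

Since the x-values are plain integers, every city gets exactly one tick and repeated cities stack vertically above it.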

Plot datetime.date / time series in a pandas dataframe

I created a pandas dataframe from some value counts on particular calendar dates. Here is how I did it:
time_series = pd.DataFrame(df['Operation Date'].value_counts().reset_index())
time_series.columns = ['date', 'count']
Basically, it is two columns, the first "date" is a column with datetime.date objects and the second column, "count" are simply integer values. Now, I'd like to plot a scatter or a KDE to represent how the value changes over the calendar days.
But when I try:
time_series.plot(kind='kde')
plt.show()
I get a plot where the x-axis is from -50 to 150 as if it is parsing the datetime.date objects as integers somehow. Also, it is yielding two identical plots rather than just one.
Any idea how I can plot them and see the calendars day along the x-axis?
Are you sure the column is datetime? I just tried this and it worked fine:
df =
date count
7 2012-06-11 16:51:32 1.0
3 2012-09-28 08:05:14 12.0
19 2012-10-01 18:01:47 4.0
2 2012-10-03 15:18:23 29.0
6 2012-12-22 19:50:43 4.0
1 2013-02-19 19:54:03 28.0
9 2013-02-28 16:08:40 17.0
12 2013-03-12 08:42:55 6.0
4 2013-04-04 05:27:27 6.0
17 2013-04-18 09:40:37 29.0
11 2013-05-17 16:34:51 22.0
5 2013-07-07 14:32:59 16.0
14 2013-10-22 06:56:29 13.0
13 2014-01-16 23:08:46 20.0
15 2014-02-25 00:49:26 10.0
18 2014-03-19 15:58:38 25.0
0 2014-03-31 05:53:28 16.0
16 2014-04-01 09:59:32 27.0
8 2014-04-27 12:07:41 17.0
10 2014-09-20 04:42:39 21.0
df = df.sort_values('date', ascending=True)
plt.plot(df['date'], df['count'])
plt.xticks(rotation='vertical')
EDIT:
if you want a scatter plot you can:
plt.plot(df['date'], df['count'], '*')
plt.xticks(rotation='vertical')
If the column is datetime dtype (not object), then you can call plot() directly on the dataframe. You don't need to sort by date either, it's done behind the scenes if x-axis is datetime.
df['date'] = pd.to_datetime(df['date'])
df.plot(x='date', y='count', kind='scatter', rot='vertical');
You can also pass many arguments to make the plot nicer (add titles, change figsize and fontsize, rotate ticklabels, set subplots axis etc.) See the docs for full list of possible arguments.
df.plot(x='date', y='count', kind='line', rot=45, legend=None,
title='Count across time', xlabel='', fontsize=10, figsize=(12,4));
You can even use another column to color scatter plots. In the example below, the months are used to assign color. Tip: To get the full list of possible colormaps, pass any gibberish string to colormap and the error message will show you the full list.
df.plot(x='date', y='count', kind='scatter', rot=90, c=df['date'].dt.month, colormap='tab20', sharex=False);
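To tie this back to the question's setup, here is a small sketch that starts from datetime.date objects, counts them, and converts the result to a datetime column before plotting. The six sample dates are invented for illustration. The column name 'Operation Date' comes from the question.

```python
import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Invented sample: a column of datetime.date objects (object dtype)
df = pd.DataFrame({"Operation Date": [
    datetime.date(2014, 3, 31), datetime.date(2014, 3, 31),
    datetime.date(2014, 4, 1), datetime.date(2014, 4, 27),
    datetime.date(2014, 4, 1), datetime.date(2014, 3, 31),
]})

# Count occurrences per calendar day
counts = df["Operation Date"].value_counts()

# Build the two-column frame, converting the dates to datetime64
time_series = pd.DataFrame({
    "date": pd.to_datetime(counts.index),
    "count": counts.to_numpy(),
})
time_series = time_series.sort_values("date").reset_index(drop=True)

# With a datetime64 x column, pandas draws calendar dates on the x-axis
ax = time_series.plot(x="date", y="count", style="o-", rot=45, legend=False)
# Call plt.show() in an interactive session to display the figure
```

Passing x and y explicitly also avoids the numeric-looking axis described in the question, because kind='kde' estimates a density of the numeric columns instead of plotting count against date.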

How do I get a simple scatter plot of a dataframe (preferably with seaborn)

I'm trying to scatter plot the following dataframe:
mydf = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9],
'y':[9,8,7,6,5,4,3,2,1],
'z':np.random.randint(0,9, 9)},
index=["12:00", "1:00", "2:00", "3:00", "4:00",
"5:00", "6:00", "7:00", "8:00"])
x y z
12:00 1 9 1
1:00 2 8 1
2:00 3 7 7
3:00 4 6 7
4:00 5 5 4
5:00 6 4 2
6:00 7 3 2
7:00 8 2 8
8:00 9 1 8
I would like to see the times "12:00, 1:00, ..." as the x-axis and x,y,z columns on the y-axis.
When I try to plot with pandas via mydf.plot(kind="scatter"), I get the error ValueError: scatter requires and x and y column. Do I have to break down my dataframe into appropriate parameters? What I would really like to do is get this scatter plotted with seaborn.
Just running
mydf.plot(style=".")
works fine for me:
Seaborn is actually built around pandas.DataFrames. However, your data frame needs to be "tidy":
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Since you want to plot x, y, and z on the same plot, it seems like they are actually different observations. Thus, you really have three variables: time, value, and the letter used.
The "tidy" standard comes from Hadley Wickham, who implemented it in the tidyr package.
First, I convert the index to a Datetime:
mydf.index = pd.DatetimeIndex(mydf.index)
Then we do the conversion to tidy data:
pivoted = mydf.unstack().reset_index()
and rename the columns
pivoted = pivoted.rename(columns={"level_0": "letter", "level_1": "time", 0: "value"})
Now, this is what our data looks like:
letter time value
0 x 2019-03-13 12:00:00 1
1 x 2019-03-13 01:00:00 2
2 x 2019-03-13 02:00:00 3
3 x 2019-03-13 03:00:00 4
4 x 2019-03-13 04:00:00 5
Unfortunately, seaborn doesn't play with DateTimes that well, so you can just extract the hour as an integer:
pivoted["hour"] = pivoted["time"].dt.hour
With a data frame in this form, seaborn takes in the data easily:
import seaborn as sns
sns.set()
sns.scatterplot(data=pivoted, x="hour", y="value", hue="letter")
Outputs:
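The same wide-to-tidy reshaping can also be sketched with melt, which names the resulting columns in one step. This is an alternative to the unstack and rename steps above, not the answer's own method. The three-row frame and the name `tidy` are made up for illustration.

```python
import pandas as pd

# Shortened, made-up version of the question's dataframe
mydf = pd.DataFrame(
    {"x": [1, 2, 3], "y": [9, 8, 7], "z": [4, 0, 5]},
    index=["12:00", "1:00", "2:00"],
)

# Move the index into a "time" column, then stack x/y/z into
# a "letter" column with the matching numbers in "value"
tidy = (
    mydf.rename_axis("time")
        .reset_index()
        .melt(id_vars="time", var_name="letter", value_name="value")
)

print(tidy.head())
```

The resulting frame has one row per (time, letter) pair, which is the shape that seaborn functions such as scatterplot expect for their data, x, y and hue arguments.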
