Histogram for a dataframe column [duplicate] - python

I would like to construct a histogram (or empirical distribution function) for a dataframe column (i.e. a column containing a number of daily observations).
The dataframe has the following structure:
Thanks in advance!
import pandas as pd
df1 = pd.DataFrame({"date": pd.to_datetime(["2021-3-22", "2021-4-7", "2021-4-18", "2021-5-12", "2022-3-22", "2022-4-7", "2022-4-18", "2022-5-12"]),
                    "x": [1, 1, 1, 3, 2, 3, 4, 2]})
        date  x
0 2021-03-22  1
1 2021-04-07  1
2 2021-04-18  1
3 2021-05-12  3
4 2022-03-22  2
5 2022-04-07  3
6 2022-04-18  4
7 2022-05-12  2

pandas has built-in plotting that uses matplotlib as its default backend, so you can do it like this:
df1.x.hist()
More: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

You can do this with pyplot:
from matplotlib import pyplot as plt
plt.hist(df1.x)
# if you want to save the plot to a file, do it before calling plt.show(),
# otherwise the saved image may be empty
plt.savefig('filename.png')
# if you just want to look at the plot
plt.show()
Here's the documentation with all the options: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html.
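Neither answer covers the empirical distribution function that the question also mentions. A minimal sketch of one way to compute it, assuming NumPy is available; the helper name ecdf is made up for illustration, and the x values are the ones from the question:

```python
import numpy as np
import pandas as pd

def ecdf(values):
    """Return the sorted values and, for each one, the fraction of
    observations that are less than or equal to it."""
    xs = np.sort(np.asarray(values))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

# Same x column as in the question
df1 = pd.DataFrame({"x": [1, 1, 1, 3, 2, 3, 4, 2]})
xs, ys = ecdf(df1["x"])
```

Passing the two arrays to plt.step(xs, ys, where="post") should then draw the usual staircase curve.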

Related

Create separate graph of each series and save as pdf in Python [duplicate]

I have a pandas dataframe as below:
   Well Name   READTIME  WL
0          A  02-Jul-20  12
1          B  03-Aug-22  18
2          C  05-Jul-21  14
3          A  03-May-21  16
4          B  01-Jan-19  19
5          C  12-Dec-20  20
6          D  14-Nov-21  14
7          A  01-Mar-22  17
8          B  15-Feb-21  11
9          C  10-Oct-20  10
10         D  14-Sep-21   5
groupByName = df.groupby(['Well Name', 'READTIME'])
After grouping them by 'Well Name' and 'READTIME', I got the following:
Well Name  READTIME    WL
A          2020-07-02  12
           2021-05-03  16
           2022-03-01  17
B          2019-01-01  19
           2021-02-15  11
           2022-08-03  18
C          2020-10-10  10
           2020-12-12  20
           2021-07-05  14
D          2021-09-14   5
           2021-11-14  14
I have got the following graph by running this code:
sns.relplot(data=df, x="READTIME", y="WL", hue="Well Name",kind="line", height=4, aspect=3)
I want a separate graph for each "Well Name", saved as a PDF. I would really appreciate your help with this. Thank you.
To separate out the plots, you can iterate over the four unique Well Names in your dataset and filter the dataset for each Well Name before plotting:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# I saved your data as an Excel file
df = pd.read_excel('Book1.xlsx')
print(df)
# Get the set of unique Well Names
well_names = set(df['Well Name'].to_list())
for wn in well_names:
    # Create dataframe containing only rows with this Well Name
    this_wn = df[df['Well Name'] == wn]
    # Plot, save, and show
    sns.relplot(data=this_wn, x="READTIME", y="WL", hue="Well Name", kind="line", height=4, aspect=3)
    plt.savefig(f'{wn}.png')
    plt.show(block=True)
This generated the following 4 image files:
For saving in a PDF file, please see this answer.
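Since the linked answer is not reproduced here, the following is only a sketch of one possible approach, not necessarily the one it describes. It uses plain matplotlib instead of seaborn and collects one page per well in a single file through matplotlib's PdfPages. The data is typed in from the question, and the file name wells.pdf and the Agg backend are assumptions made for the example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, we only write files
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# Data copied from the question
df = pd.DataFrame({
    "Well Name": list("ABCABCDABCD"),
    "READTIME": pd.to_datetime(
        ["02-Jul-20", "03-Aug-22", "05-Jul-21", "03-May-21", "01-Jan-19",
         "12-Dec-20", "14-Nov-21", "01-Mar-22", "15-Feb-21", "10-Oct-20",
         "14-Sep-21"],
        format="%d-%b-%y",
    ),
    "WL": [12, 18, 14, 16, 19, 20, 14, 17, 11, 10, 5],
})

# "wells.pdf" is a hypothetical output name
with PdfPages("wells.pdf") as pdf:
    # Sort by date so each line is drawn in chronological order
    for name, group in df.sort_values("READTIME").groupby("Well Name"):
        fig, ax = plt.subplots(figsize=(12, 4))
        ax.plot(group["READTIME"], group["WL"], marker="o")
        ax.set_title(f"Well {name}")
        ax.set_xlabel("READTIME")
        ax.set_ylabel("WL")
        pdf.savefig(fig)  # adds this figure as a new page
        plt.close(fig)
    n_pages = pdf.get_pagecount()
```

If one file per well is preferred, calling fig.savefig(f"{name}.pdf") inside the loop should work as well, because savefig picks the output format from the file extension.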
In this case, specifying a row results in a faceted graph.
sns.relplot(data=df, x="READTIME", y="WL", hue="Well Name", kind="line", row='Well Name', height=4, aspect=3)

Plot with Histogram an attribute from a dataframe

I have a dataframe with the weight and the number of measures of each user. The df looks like:
id_user  weight  number_of_measures
1        92.16   4
2        80.34   5
3        71.89   11
4        81.11   7
5        77.23   8
6        92.37   2
7        88.18   3
I would like to see a histogram with an attribute of the table on the x-axis (weight, but I want to do it for both attributes) and the frequency on the y-axis.
Does anyone know how to do it with matplotlib?
Ok, it seems to be quite easy:
import pandas as pd
import matplotlib.pyplot as plt
hist = df.hist(bins=50)
plt.show()
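Note that df.hist() draws one histogram per numeric column, so id_user gets a panel too. A small sketch that restricts the plot to the two attributes from the question through the column parameter of DataFrame.hist; the bin count, the output file name, and the Agg backend are arbitrary choices for the example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, we only write a file
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "id_user": [1, 2, 3, 4, 5, 6, 7],
    "weight": [92.16, 80.34, 71.89, 81.11, 77.23, 92.37, 88.18],
    "number_of_measures": [4, 5, 11, 7, 8, 2, 3],
})

# One panel per requested column; id_user is left out
axes = df.hist(column=["weight", "number_of_measures"], bins=5, figsize=(8, 3))
for ax in axes.ravel():
    ax.set_ylabel("frequency")
plt.tight_layout()
plt.savefig("histograms.png")  # hypothetical file name
```

For a single attribute, df["weight"].plot.hist(bins=5) is an alternative that draws just one panel.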

How to create a groupby dataframe without a multi-level index

I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
The following code computes the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but it returns as follows.
I would like the season to be filled in on each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is that you want a dataframe without a multi-level index.
The index can't be reset when a name in the index and a column name are the same.
Use pandas.Series.reset_index, and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
This works with the following implementation, because a pandas.Series is created with .groupby.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
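The same pattern can be checked by hand on a toy frame. The values below are made up for illustration, and the column name probability is an arbitrary choice:

```python
import pandas as pd

# Toy data: 4 observations in fall, 2 in spring
data = pd.DataFrame({
    "season": ["fall", "fall", "fall", "fall", "spring", "spring"],
    "bin": [1, 1, 2, 3, 0, 0],
})

# Per-season proportions as a flat DataFrame with a plain integer index
probs = (
    data.groupby("season")["bin"]
    .value_counts(normalize=True)
    .reset_index(name="probability")
)

# Look up individual proportions by (season, bin)
lookup = probs.set_index(["season", "bin"])["probability"].to_dict()
```

Here bin 1 makes up 2 of the 4 fall rows, so its proportion is 0.5, and the proportions within each season add up to 1.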
Using the OP code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index.
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
Other Resources
See Pandas unable to reset index because name exist to reset by a level.
Plotting
It is easier to plot from the multi-index Series, by using pandas.Series.unstack(), and then use pandas.DataFrame.plot.bar
For side-by-side bars, set stacked=False.
Each stacked bar has a total height of 1, because this is normalized data.
import matplotlib.pyplot as plt
s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for the normalize parameter:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it here:

Pandas: Histogram Plotting

I have a dataframe with dates (datetime) in python. How can I plot a histogram with 30 min bins from the occurrences using this dataframe?
starttime
1 2016-09-11 00:24:24
2 2016-08-28 00:24:24
3 2016-07-31 05:48:31
4 2016-09-11 00:23:14
5 2016-08-21 00:55:23
6 2016-08-21 01:17:31
.............
989872 2016-10-29 17:31:33
989877 2016-10-02 10:00:35
989878 2016-10-29 16:42:41
989888 2016-10-09 07:43:27
989889 2016-10-09 07:42:59
989890 2016-11-05 14:30:59
I have tried looking at examples from Plotting series histogram in Pandas and A per-hour histogram of datetime using Pandas. But they seem to be using a bar plot which is not what I need. I have attempted to create the histogram using temp.groupby([temp["starttime"].dt.hour, temp["starttime"].dt.minute]).count().plot(kind="hist") giving me the results as shown below
If possible, I would like the x-axis to display the time (e.g. 07:30:00).
I think you need a bar plot, and for an axis with times the simplest way is to convert the datetimes to strings with strftime:
# '30min' is the non-deprecated spelling of the '30T' frequency alias
temp = temp.resample('30min', on='starttime').count()
ax = temp.groupby(temp.index.strftime('%H:%M')).sum().plot(kind="bar")
# for a nicer bar plot, hide some of the tick labels
spacing = 2
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
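To make the binning step easier to follow, here is a small self-contained sketch that counts occurrences per 30-minute slot of the day. It is an alternative route using dt.floor rather than the resample call above, and it only uses the six timestamps visible in the question:

```python
import pandas as pd

# The six timestamps shown at the top of the question
temp = pd.DataFrame({"starttime": pd.to_datetime([
    "2016-09-11 00:24:24",
    "2016-08-28 00:24:24",
    "2016-07-31 05:48:31",
    "2016-09-11 00:23:14",
    "2016-08-21 00:55:23",
    "2016-08-21 01:17:31",
])})

# Round each timestamp down to its 30-minute slot and keep only the time part
slots = temp["starttime"].dt.floor("30min").dt.strftime("%H:%M")
counts = slots.value_counts().sort_index()

# Add the empty slots so that all 48 half-hours of the day appear on the axis
all_slots = pd.date_range("2000-01-01", periods=48, freq="30min").strftime("%H:%M")
counts = counts.reindex(all_slots, fill_value=0)
```

Calling counts.plot(kind="bar") on the result should then give one bar per half-hour, labelled 00:00, 00:30, and so on.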

xticks values as dataframe column values in matplotlib plot [duplicate]

I have the dataframe below:
values years
0 24578.0 2007-09
1 37491.0 2008-09
2 42905.0 2009-09
3 65225.0 2010-09
4 108249.0 2011-09
5 156508.0 2012-09
6 170910.0 2013-09
7 182795.0 2014-09
8 233715.0 2015-09
9 215639.0 2016-09
10 215639.0 TTM
The plotted image is attached. The issue is that I want the years values, '2007-09' to 'TTM', as the xtick labels in the plot.
One way to do this is to take the current positions of the xticks, keep those that are valid row positions, and use them to select the labels from df.years:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
df.plot(ax=ax)
# keep only tick positions that correspond to an existing row
tick_idx = [int(t) for t in ax.get_xticks() if t == int(t) and 0 <= t < len(df)]
ax.set_xticks(tick_idx)
ax.set_xticklabels(df.years.iloc[tick_idx])
You could also set the x axis to display all years like so:
fig, ax = plt.subplots()
df.plot(ax=ax, xticks=df.index, rot=45)
ax.set_xticklabels(df.years)
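A further option, not taken from the answer above: when strings are passed as x values, matplotlib treats them as categories and labels the ticks with them, so no manual tick handling is needed. A minimal sketch with the data from the question; the output file name and the Agg backend are assumptions for the example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, we only write a file
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "values": [24578.0, 37491.0, 42905.0, 65225.0, 108249.0, 156508.0,
               170910.0, 182795.0, 233715.0, 215639.0, 215639.0],
    "years": ["2007-09", "2008-09", "2009-09", "2010-09", "2011-09",
              "2012-09", "2013-09", "2014-09", "2015-09", "2016-09", "TTM"],
})

fig, ax = plt.subplots()
# String x values are plotted as categories, one tick per label
ax.plot(df["years"], df["values"])
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
fig.savefig("values_by_year.png")  # hypothetical file name
```

This keeps 'TTM' as an ordinary label at the end of the axis, at the cost of evenly spaced ticks regardless of the actual dates.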
