Plot stacked bar chart from pandas data frame - python

I have a dataframe:
payout_df.head(10)
What would be the easiest, smartest, and fastest way to replicate the following Excel plot?
I've tried different approaches but couldn't get everything into place.
Thanks

If you just want a stacked bar chart, then one way is to use a loop to plot each column in the dataframe and just keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval + payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
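Since the original payout_df is not shown, here is a minimal self-contained sketch of the same cumulative-bottom idea on made-up data. Only the `payout` column name comes from the question; the `salary` and `bonus` columns, their values, and the bar width are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for payout_df; only the 'payout' column name is from the question.
payout_df = pd.DataFrame({
    "payout": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
    "salary": [10.0, 12.0, 9.0],
    "bonus": [2.0, 0.0, 4.0],
})

value_cols = [c for c in payout_df.columns if c != "payout"]
bottom = pd.Series(0.0, index=payout_df.index)  # running height of each stack

fig, ax = plt.subplots(figsize=(8, 4))
for col in value_cols:
    # Each layer starts where the previous layers ended.
    ax.bar(payout_df["payout"], payout_df[col], bottom=bottom, width=20, label=col)
    bottom = bottom + payout_df[col]
ax.legend()
fig.autofmt_xdate(rotation=30)
```

After the loop, `bottom` holds the total height of each stack, which is a quick way to check that the layers add up to the row sums. Call plt.show() to display the figure.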

Although no sample data was provided, I think the following code will produce the desired graph:
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
grouped.plot(x=grouped.index.year, kind='bar', stacked=True)
plt.show()
I don't know how to reproduce the fancy x-axis style from the Excel plot. Also, your payout column must be a datetime; otherwise pd.Grouper won't work (see the list of available frequencies).
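As a variant of the monthly grouping, the sketch below uses `dt.to_period("M")` instead of pd.Grouper and lets pandas do the stacking. This is a swapped-in technique, not the answerer's exact method, and the `salary` and `bonus` columns and their values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: several payouts per month in two made-up value columns.
df = pd.DataFrame({
    "payout": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2020-03-15"]),
    "salary": [5.0, 5.0, 12.0, 9.0],
    "bonus": [1.0, 1.0, 0.0, 4.0],
})

# Sum the value columns per calendar month; the result is indexed by monthly periods.
monthly = df.groupby(df["payout"].dt.to_period("M"))[["salary", "bonus"]].sum()
monthly.index = monthly.index.strftime("%Y-%m")  # readable tick labels

# stacked=True makes pandas place each column on top of the previous one.
ax = monthly.plot(kind="bar", stacked=True, rot=30)
ax.set_xlabel("month")
```

Formatting the index as strings before plotting is one simple way to control the tick labels; it does not reproduce the two-level axis of the Excel chart.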

Related

How can one create histograms with subplots according to grouped variables in seaborn?

I am attempting to create a histogram using seaborn and census data that displays 3 subplots for age composition, and I have the data grouped the way that I would like it, but I am struggling to turn that into a histogram.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
filename = "/scratch/%s_class_root/%s_class/materials/data/pums_short.csv.gz"
acs = pd.read_csv(filename)
R65_agg = acs.groupby(["R65", "PUMA"])["HINCP"]
R65_meds = R65_agg.agg(np.median).unstack()
R65_f = R65_meds.dropna()
R65_f = R65_meds.reset_index(drop = True)
I was expecting this code to give me data that I could plug into a histogram, but instead of forming distinct subplots, the "0.0, 1.0, 2.0" groups in the final variable just get added together when I apply the .describe() function. Any advice on how I can convert this into a form that's readable by the sns.histplot() function?

Python vs matplotlib - Chart generation issue

I have the Python code below, but its output is a chart like the one in the attachment, and it looks really messy. Can anybody tell me how to fix this and put the days in ascending order on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
At first, I didn't convert the day to str, so it came out like this.
Because it had decimal numbers, I converted it to str, and now this happens.
What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1, 21, size=100),
                   'cases': np.random.randint(0, 50000, size=100),
                   })
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case; you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (e.g., the mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
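Since the question asked for the days in ascending order, a minimal sketch of aggregating and sorting before plotting might look like this. It assumes the `day` column is kept numeric rather than converted to str, and it uses randomly generated data because the Excel file is not available.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in for the Excel data: many rows per day, in no particular order.
rng = np.random.default_rng(0)
df = pd.DataFrame({"day": rng.integers(1, 21, size=100),
                   "cases": rng.integers(0, 50000, size=100)})

# Keep 'day' numeric, average the repeated days, and sort so the line runs left to right.
daily = (df.groupby("day", as_index=False)["cases"].mean()
           .sort_values("day"))

fig, ax = plt.subplots()
ax.plot(daily["day"], daily["cases"], marker="o")
ax.set_xlabel("day")
ax.set_ylabel("mean cases")
```

With one value per day and the rows sorted, the line no longer doubles back on itself. Call plt.show() to display the figure.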

python plotting multiple bars

I've been trying to do this for several hours and I get an error every time. I want to create 3 bar plots in one graph. The y-axis should range from 0 to 1000.
The end result should be this
That's my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import csv
df = pd.read_csv('razemKM.csv')
dfn = pd.read_csv('razemNPM.csv')
print(df)
y=[0,1000]
a=(df["srednia"]-df["odchStand"])
a1=df["srednia"]
a2=(df["srednia"]+df["odchStand"])
plt.bar(y,a,width=0.1,color='r')
plt.bar(y,a1,width=0.1,color='g')
plt.bar(y,a2,width=0.1,color='y')
plt.show()
You can use the pandas plot function:
df['Sum'] = df["srednia"] + df["odchStand"]
df['Diff'] = df["srednia"] - df["odchStand"]
df.plot.bar(y=['Diff', 'srednia', 'Sum'], width=0.1)
plt.show()
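For a version that runs without the CSV files, here is a small sketch with made-up numbers. The values are hypothetical; only the column names `srednia` and `odchStand` come from the question, and the y-axis limit follows the 0 to 1000 range the question asked for.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values standing in for razemKM.csv (mean and standard deviation per row).
df = pd.DataFrame({"srednia": [400.0, 550.0, 700.0],
                   "odchStand": [50.0, 80.0, 120.0]})
df["Diff"] = df["srednia"] - df["odchStand"]
df["Sum"] = df["srednia"] + df["odchStand"]

# One group of three bars per row, coloured as in the question.
ax = df.plot.bar(y=["Diff", "srednia", "Sum"], color=["r", "g", "y"], rot=0)
ax.set_ylim(0, 1000)
```

Passing a list to `y` is what puts the three bars side by side in each group, instead of drawing them on top of each other as repeated plt.bar calls at the same x positions would.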

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms in Python using Seaborn, and I have a CSV file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot it, but a large number of the rows are not represented in the heatmap. What am I doing wrong?
This is the code I have right now, but the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained; it does not display all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file:
import numpy as np
import matplotlib.pyplot as plt
import csv
# NOTE: this assumes the CSV has 'TYPE', 'X' and 'Y' columns and that `types`
# holds the TYPE value to plot; define it before running.
temp = np.zeros((128, 128), dtype=int)
with open('diff_exp_gene.csv') as in_file:
    data = csv.DictReader(in_file)
    for row in data:
        if row['TYPE'] == types:
            # Count how many points fall on each (Y, X) cell
            temp[int(row['Y'])][int(row['X'])] += 1
plt.imshow(temp, cmap='hot', origin='lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, since sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is pass yticklabels=True as a keyword argument to sns.clustermap(). That will make the labels for all 900 rows appear.
By default, it is set to "auto" to avoid overlapping labels. The same applies to xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
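A quick way to try this out without the original file is the sketch below. The data is random and the row and column names are invented; the `figsize` value is just a guess at something tall enough to keep the labels readable.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up expression matrix: 60 "genes" x 6 "samples" (the real file has about 900 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 6)),
                  index=[f"gene_{i}" for i in range(60)],
                  columns=[f"sample_{j}" for j in range(6)])

# yticklabels=True asks for one label per row; a taller figure keeps them legible.
g = sns.clustermap(df, cmap="RdBu", yticklabels=True, figsize=(6, 12))
n_labels = len(g.ax_heatmap.get_yticklabels())
```

For about 900 rows, the figure height or the label font size will probably need adjusting, otherwise the labels overlap even though they are all drawn.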

Time-series boxplot in pandas

How can I create a boxplot for a pandas time-series where I have a box for each day?
Sample dataset of hourly data where one box should consist of 24 values:
import pandas as pd
from numpy.random import randn
n = 480
ts = pd.Series(randn(n),
               index=pd.date_range(start="2014-02-01",
                                   periods=n,
                                   freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great.
There is a similar question for R/ggplot2 here, if it helps to clarify what I want.
If it's an option for you, I would recommend using Seaborn, which is a wrapper for Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
Which gives:
Note that I'm passing the day of year as the grouper to seaborn; if your data spans multiple years, this won't work. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper, but it only works if there are no minutes or smaller units defined:
import datetime
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]
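As an alternative to string formatting, pandas can round timestamps down directly. The sketch below is my own variation, not part of the answer above: it builds daily and 3-hourly group keys with DatetimeIndex.normalize() and DatetimeIndex.floor(), which stay unique across years and remain datetimes.

```python
import numpy as np
import pandas as pd

n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

day_key = ts.index.normalize()         # midnight of each timestamp's day
three_hour_key = ts.index.floor("3h")  # start of each 3-hour block

# Count the values that would end up in each box.
per_day = ts.groupby(day_key).size()
per_block = ts.groupby(three_hour_key).size()
print(per_day.head())
```

Either key can be passed as the x values of a boxplot in place of the day of year; floor() also discards minutes and seconds, so it does not depend on the timestamps falling on whole hours.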
(Not enough rep to comment on accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) it needs a numpy import, and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
I have a solution that may be helpful. It uses only native pandas and allows for hierarchical date-time grouping (i.e., spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), you can access all of the datetime convenience functions for resampling!
Try this:
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)
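To see what the function-based grouping produces without drawing anything, here is a small sketch on the same kind of random hourly series. It computes the per-day quartiles, which are the numbers a boxplot draws as the box edges and the median line.

```python
import numpy as np
import pandas as pd

n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

# The function receives each index element (a Timestamp) and returns its group label.
daily = ts.groupby(lambda t: t.strftime("%Y-%m-%d"))

# One row per day, one column per quantile.
quartiles = daily.quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles.head())
```

Because the label includes the year, the same code keeps days from different years apart, which is the advantage over grouping by day of year.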
