How to make a bar chart with multiple series and count - python

I want to have x-axis = 'brand', y-axis = 'count', and 2 series for 'online_order' (True & False)
How can I do this in Python (using Jupyter)?
Right now, my Y axis comes out on a scale of 0-1. I want the Y axis to scale automatically based on the values.
This is the result I am getting:

The plot code is not included, so this is only a guess, but the plot was probably made with something like the following:
df.groupby(['brand', 'online_order'])['count'].size().unstack().plot.bar(legend=True)
The issue is that .size() does not return the values in 'count'. It is GroupBy.size, which counts the rows in each group, and each brand/online_order group has exactly 1 row, so every bar has a height of 1.
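To make the difference concrete, here is a minimal sketch comparing .size() with .sum() on the same grouping. It uses the test data from this answer; the variable names grouped, sizes and totals are only for illustration.

```python
import pandas as pd

# Same test data as used elsewhere in this answer
df = pd.DataFrame({
    'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'],
    'online_order': [False, True, True, False],
    'count': [2122, 2047, 1640, 1604],
})

grouped = df.groupby(['brand', 'online_order'])['count']

# Number of rows in each group: every group has exactly one row
sizes = grouped.size().unstack()

# Sum of the values stored in the 'count' column for each group
totals = grouped.sum().unstack()

print(sizes)
print(totals)
```

Plotting totals with .plot.bar() instead of sizes should give a y-axis scaled to the actual counts.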
Using seaborn
The easiest way to get the desired plot is using seaborn, which is a high-level API for matplotlib.
Use seaborn.barplot with hue='online_order'.
The dataframe does not need to be reshaped.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'], 'online_order': [False, True, True, False], 'count': [2122, 2047, 1640, 1604]})
# plot
plt.figure(figsize=(7, 5))
sns.barplot(x='brand', y='count', hue='online_order', data=df)
Using pandas.DataFrame.pivot
.pivot changes the shape of the dataframe to accommodate the plot API
This option also uses pandas.DataFrame.plot.bar
# pandas 2.0 and later require keyword arguments for pivot
df.pivot(index='brand', columns='online_order', values='count').plot.bar()
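As a quick illustration of what the reshaping does, the sketch below pivots the same test data and prints the result without plotting it. The name wide is just an illustrative choice.

```python
import pandas as pd

# Same test data as used elsewhere in this answer
df = pd.DataFrame({
    'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'],
    'online_order': [False, True, True, False],
    'count': [2122, 2047, 1640, 1604],
})

# One row per brand, one column per online_order value
wide = df.pivot(index='brand', columns='online_order', values='count')
print(wide)
```

Each column of the wide frame becomes one series in the bar chart, and each index value becomes one group on the x-axis.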

If the data is a csv file, you can import matplotlib and pandas to create a graph and view the data.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("file name here")
# data.count is the DataFrame.count method, so select the column with brackets
plt.bar(data['brand'], data['count'])
plt.xlabel("brand")
plt.ylabel("count")

Related

How can one create histograms with subplots according to grouped variables in seaborn?

I am attempting to create a histogram using seaborn and census data that displays 3 subplots for age composition, and I have the data grouped the way that I would like it, but I am struggling to turn that into a histogram.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
filename = "/scratch/%s_class_root/%s_class/materials/data/pums_short.csv.gz"
acs = pd.read_csv(filename)
R65_agg = acs.groupby(["R65", "PUMA"])["HINCP"]
R65_meds = R65_agg.agg(np.median).unstack()
R65_f = R65_meds.dropna()
R65_f = R65_meds.reset_index(drop = True)
I was expecting this code to give me data that I could plug into a histogram, but instead of being distinct subplots, the "0.0, 1.0, 2.0" in the final variable just get added together when I apply the .describe() function. Any advice on how I can convert this into a form that's readable by the sns.histplot() function?
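One possible direction, offered only as a sketch: keep the grouped medians in long form (one row per R65/PUMA pair) rather than unstacking them, since seaborn generally works with long-form data. The data below is synthetic; the column names follow the question, but the values and the name long_df are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the census file: column names follow the question,
# the values are randomly generated and carry no real meaning
rng = np.random.default_rng(0)
acs = pd.DataFrame({
    "R65": rng.choice([0.0, 1.0, 2.0], size=300),
    "PUMA": rng.integers(100, 130, size=300),
    "HINCP": rng.normal(60000, 15000, size=300),
})

# Median HINCP per (R65, PUMA), kept in long form instead of unstacked
long_df = acs.groupby(["R65", "PUMA"])["HINCP"].median().reset_index()
print(long_df.head())
```

A frame shaped like this could then be passed to something like sns.displot(data=long_df, x="HINCP", col="R65"), which draws one histogram per R65 value.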

Seaborn pairplot with log scale only for specific columns

I have a dataframe and I'm using seaborn pairplot to plot one target column vs the rest of the columns.
Code is below,
import seaborn as sns
import matplotlib.pyplot as plt
tgt_var = 'AB'
var_lst = ['A','GH','DL','GT','MS']
pp = sns.pairplot(data=df,
                  y_vars=[tgt_var],
                  x_vars=var_lst)
pp.fig.set_figheight(6)
pp.fig.set_figwidth(20)
The var_lst is not a static list, I just provided an example.
What I need is to plot tgt_var on Y axis and each var_lst on x axis.
I'm able to do this with above code, but I also want to use log scale on X axis only if the var_lst item is 'GH' or 'MS', for the rest normal scale. Is there any way to achieve this?
Iterate pp.axes.flat and set xscale="log" if the xlabel matches "GH" or "MS":
log_columns = ["GH", "MS"]
for ax in pp.axes.flat:
    if ax.get_xlabel() in log_columns:
        ax.set(xscale="log")
Full example with the iris dataset where the petal columns are xscale="log":
import seaborn as sns
df = sns.load_dataset("iris")
pp = sns.pairplot(df)
log_columns = ["petal_length", "petal_width"]
for ax in pp.axes.flat:
    if ax.get_xlabel() in log_columns:
        ax.set(xscale="log")

How to remove certain values before plotting data

I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset data going into graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
                   'b':np.random.rand(20),
                   'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms on Python using Seaborn and I have a csv file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot that but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a csv file:
import numpy as np
import matplotlib.pyplot as plt
import csv
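infile = open('diff_exp_gene.csv', newline='')
reader = csv.DictReader(infile)
temp = np.zeros((128, 128), dtype=int)
# 'TYPE', 'X' and 'Y' are assumed column names, and `types` must be set to the
# TYPE value you want to plot; neither is defined in the question
for row in reader:
    if row['TYPE'] == types:
        temp[int(row['Y']), int(row['X'])] += 1
infile.close()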
plt.imshow(temp, cmap = 'hot', origin = 'lower')
plt.show()
As far as I know, keyword arguments that apply to seaborn heatmaps also apply to clustermap, since sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is set yticklabels=True as a keyword argument in sns.clustermap(). That will make the labels for all 900 rows appear.
By default, it is set as "auto" to avoid overlap. The same applies to the xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to use a loop to plot each column in the dataframe and just keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval+payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
Besides the lack of data, I think the following code will produce the desired graph
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
# x= expects a column label, so put the years in the index to use them as bar labels
grouped.index = grouped.index.year
grouped.plot(kind='bar', stacked=True)
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (available frequencies).
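To illustrate the datetime requirement, here is a small sketch with made-up data. The question's payout_df is not shown, so the fee and bonus columns are hypothetical. It also groups by monthly periods via .dt.to_period rather than pd.Grouper, which is a different technique from the one in the answer above.

```python
import pandas as pd

# Hypothetical data: the real payout_df is not shown in the question,
# so these column names and values are invented for illustration
df = pd.DataFrame({
    'payout': ['2020-01-15', '2020-01-20', '2020-02-03', '2020-02-28'],
    'fee': [10, 5, 7, 3],
    'bonus': [1, 2, 3, 4],
})

# Plain strings carry no date information, so convert them first
df['payout'] = pd.to_datetime(df['payout'])

# Group by calendar month using monthly periods (not pd.Grouper)
monthly = df.groupby(df['payout'].dt.to_period('M'))[['fee', 'bonus']].sum()
print(monthly)
```

Calling monthly.plot(kind='bar', stacked=True) on the result should then draw one stacked bar per month.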
