ggplot geom_histogram behaves differently between Python and R - python

I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd

# ageSeries is the integer age column of the data frame
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot port for Python and installed it. I tried the same ggplot call:
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results we can see that the different categories are treated as different series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to maintain the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?
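If it helps, one way to get a stacked, category-coloured age histogram without the ggplot port is plain matplotlib, which accepts a list of arrays and can stack them. A minimal sketch (my own suggestion, not from the thread), assuming the data frame is called data with Age and Category columns as in the question:

import numpy as np
import matplotlib.pyplot as plt

# Split the ages into one array per category, then let matplotlib stack them.
grouped = data.groupby("Category")["Age"]
labels = list(grouped.groups.keys())
ages_by_category = [values.to_numpy() for _, values in grouped]

plt.hist(ages_by_category, bins=np.arange(-0.5, 116.5, 1),
         stacked=True, label=labels)
plt.legend(title="Category")
plt.xlabel("Age")
plt.show()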

Related

Grouped bar chart of multiindex

First of all: I'm completely new to Python.
I'm trying to visualize some measured data. Each entry has a quadrant, number and sector. The original data lies in a .xlsx file. I've managed to use a .pivot_table to sort the data according to its sector. Due to overlapping, number and quadrant also have to be indexed. Now I want to plot it as a bar chart, where the bars are grouped by sector and the colors represent the quadrant.
But because number also has to be indexed, it shows up in the bar chart as a separate group. There should only be three groups, 0, i and a.
MWE:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d = {'quadrant': ["0","0","0","0","0","0","I","I","I","I","I","I","I","I","I","I","I","I","II","II","II","II","II","II","II","II","II","II","II","II","III","III","III","III","III","III","III","III","III","III","III","III","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV"], 'sector': [0,"0","0","0","0","0","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i"], 'number': [1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6], 'Rz_m': [67.90,44.17,44.30,63.43,49.87,39.33,61.17,69.37,66.20,44.20,64.77,39.93,44.33,50.97,55.90,51.33,58.23,44.53,50.03,47.40,58.67,71.57,57.60,70.77,63.93,47.37,46.90,34.73,41.27,48.23,58.30,47.07,50.53,51.20,32.67,50.37,37.50,55.50,41.20,48.07,56.80,49.77,40.87,44.43,44.00,60.03,63.73,72.80,51.60,45.53,60.27,71.00,59.63,48.70]}
df = pd.DataFrame(data=d)
B = df.pivot_table(index=['sector','number', 'quadrant'])
B.unstack().plot.bar(y='Rz_m')
The data viz ecosystem in Python is pretty diverse, and there are multiple libraries you can use to produce the same chart. Matplotlib is a very powerful library, but it's also quite low-level, meaning you often have to do a lot of preparatory work before getting to the chart. That's why you'll usually find people use seaborn for static visualisations, especially if there is a scientific element to them (it has built-in support for things like error bars, etc.).
Out of the box, it has a lot of chart types to support exploratory data analysis and is built on top of matplotlib. For your example, if I understood it right, it would be as simple as:
import seaborn as sns
sns.catplot(x="sector", y="Rz_m", hue="quadrant", data=df, ci=None,
            height=6, kind="bar", palette="muted")
And the output would look like this:
Note that in your example, you missed out the "" for one of the zeroes, so 0 and "0" are plotted as separate columns. If you're using seaborn, you don't need to pivot the data; just feed it the df as you've defined it.
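As a side note (my own addition, not something the original answer included), a quick way to avoid that mixed-type column is to cast sector to string before plotting, so 0 and "0" collapse into a single category:

# Hypothetical fix: force a single dtype so 0 and "0" become one category.
df["sector"] = df["sector"].astype(str)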
For interactive visualisations (with tooltips, zoom, pan, etc.), you can also check out bokeh.
There is an interesting wrinkle to this - how to center the nested bars on the label. By default the bars are drawn with center alignment, which works fine for an odd number of columns; for an even number, however, you'd want them to be aligned on the right edge. You can make a small alteration to seaborn's source code (categorical.py, lines beginning around 1642) like so:
# Draw the bars
offpos = barpos + self.hue_offsets[j]
barfunc(offpos, self.statistic[:, j], -self.nested_width,
color=self.colors[j], align="edge",
label=hue_level, **kws)
Save the .png and then change the source back - it's not ideal, and probably worth flagging up to the library maintainers.

Python: A histogram for selected columns split by a variable

I do most of my work in R and am trying to explore a bit more of Python. My fluency in the latter is pretty rubbish, so explaining anything super simple won't offend me :)
I am starting some exploratory analysis and want to show the distribution of each variable by what will become the target variable. The outcome I would like is a histogram for every column in the DF, with the data split by the target. Writing this in R is super simple; in the example below x, y and z are the columns and 'cut' is the target.
How could I reproduce this in Python?
# R
library(ggplot2)
library(tidyr)
shinyStuff <- gather(diamonds,KPI,numbers,x:z)
ggplot(data = shinyStuff)+geom_histogram(aes(x=numbers,color=cut),stat='count') + facet_wrap(~KPI)
I have tried looping over DF like this:
# Python
for num, col in enumerate(diamonds):
    print(num)
    plt.figure()
    axs[num].hist(diamonds[diamonds['cut']=='Fair'].iloc[:,num], alpha=0.6)
    axs[num].hist(diamonds[diamonds['cut']=='Good'].iloc[:,num], alpha=0.6)
This didn't work full stop.
I have tried splitting the DF and mapping
# Python
fig, ax = plt.subplots()
diamonds[diamonds['Cut']=='Fair'].hist(figsize = (16,20),color='red',ax=ax,alpha=0.6)
diamonds[diamonds['Cut']=='Good'].hist(figsize = (16,20),color='blue',ax=ax,alpha=0.6);
This just overwrites the first.
I have tried a few more things which I won't post - they may well have been along the right lines, but I am not versed enough in Python to get them right, so I don't think a list of failed examples will help here.
I am using Python 3 and open to all solutions using any dependencies.
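For what it's worth, here is a sketch of one way to get close to the R output with seaborn (my own suggestion, not from the thread): melt the numeric columns into long form, then facet, which mirrors the gather + facet_wrap pattern. It assumes diamonds is a pandas DataFrame with columns x, y, z and cut, e.g. loaded via sns.load_dataset("diamonds").

import seaborn as sns

# Assumes a pandas DataFrame named `diamonds` with columns x, y, z and cut.
diamonds = sns.load_dataset("diamonds")

# Long form: one row per (cut, KPI, value), like tidyr::gather(diamonds, KPI, numbers, x:z)
long_df = diamonds.melt(id_vars="cut", value_vars=["x", "y", "z"],
                        var_name="KPI", value_name="numbers")

# One faceted panel per melted column, bars coloured (and stacked) by cut.
sns.displot(data=long_df, x="numbers", hue="cut", col="KPI",
            multiple="stack", bins=30)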

How to graph a function in Python using plotnine library

I've been a longtime R user, recently transitioning over to Python, and I've been trying to carry over my knowledge of plotting with ggplot2, since it is so intuitive. Plotnine is supposedly the most ggplot2-esque plotting library, and I've successfully recreated most graphs with it, except critically how to plot regular functions.
In base R, you can easily define an equation as a function (as in the example below), pass it into a stat_function() layer, set the limits of the graph in place of the data argument, and successfully plot a parabola or the like. However, the syntax for setting the graph's limits must be different in Python (perhaps using numpy?), and equations are defined using sympy, which is another divergence for me.
So how would I go about plotting functions with plotnine? The above two hurdles are the two differences with ggplot2 that I think are causing me trouble, since plotnine has so few examples online.
P.S. This is an example of what I want to recreate in Python using plotnine:
> library(ggplot2)
> basic_plot <- function(x) x^2 + 2.5
> graph <- ggplot(data.frame(x=c(-5,5)), aes(x=x)) +
+ stat_function(fun = basic_plot)
> graph
You do not need numpy, it works just fine the "standard" way! :)
from plotnine import *
import pandas as pd
(ggplot(pd.DataFrame(data={"x": [-5, 5]}), aes(x="x"))
+ stat_function(fun=lambda x: x**2+2.5))
One of the main differences that caused me problems was the same one described in the question. Specifically:
in R
aes(x = x) or aes(x)
in plotnine
aes(x = 'x')
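If you do want to go the numpy route mentioned in the question, here is a sketch (my own addition, not part of the answers above): pre-compute the curve and draw it with geom_line, which avoids stat_function entirely.

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_line

# Evaluate the function on a grid, then plot the resulting points as a line.
xs = np.linspace(-5, 5, 200)
curve = pd.DataFrame({"x": xs, "y": xs**2 + 2.5})

ggplot(curve, aes(x="x", y="y")) + geom_line()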

Plot Subset of Dataframe without Being Redundant

A bit of a Python newb here. As a beginner it's easy to learn different functions and methods from training classes, but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?
You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column, you can set it as the index of the df. If you're reading the data from a CSV, you can do this with the parse_dates keyword argument; if you already have the df, you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
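A short sketch of that index-based route (my own illustration, assuming spdf has the speed and datetime columns from the question):

# Set 'datetime' as the index once; subsequent plots use it as the x-axis.
spdf = spdf.set_index('datetime')
spdf.loc[spdf['speed'] > 0, 'speed'].dropna().plot(title='SP1 over Time')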

Plotting one scatterplot with multiple dataframes with ggplot in python

I am trying to get data from two separate dataframes onto the same scatterplot. I have seen solutions in R that use something like:
ggplot() + geom_point(data = df1, aes(df1.x, df1.y)) + geom_point(data = df2, aes(df2.x, df2.y))
But in python, with the ggplot module, I get errors when I try to use ggplot() with no args. Is this just a limitation of the module? I know I can likely use another tool to do the plotting but I would prefer a ggplot solution if possible.
My first data frame consists of Voltage information every 2 minutes and temperature information every one hour, so combining the two dataframes is not 1 to 1. Also, I would prefer to stick with Python because the rest of my solution is in python.
Just giving one dataframe as the argument to ggplot() and the other inside the second geom_point declaration should do the trick:
(ggplot(aes(x='x', y='y'), data=df1)
 + geom_point()
 + geom_point(aes(x='x', y='y'), data=df2))
(I prefer using the column-name notation; I think it is more elegant, but this is just a personal preference.)
