I do most of my work in R and am trying to explore a bit more of Python. My fluency in the latter is pretty rubbish, so explaining anything super simple won't offend me :)
I am starting some exploratory analysis and want to show the distribution of each variable split by what will become the target variable. The outcome I would like is a histogram for every column in the DF, with the data split by the target. Writing this in R is super simple; in the example below x, y and z are the columns and 'cut' the target.
How could I reproduce this in Python?
# R
library(ggplot2)
library(tidyr)
shinyStuff <- gather(diamonds,KPI,numbers,x:z)
ggplot(data = shinyStuff)+geom_histogram(aes(x=numbers,color=cut),stat='count') + facet_wrap(~KPI)
I have tried looping over the DF like this:
# Python
for num, col in enumerate(diamonds):
    print(num)
    plt.figure()
    axs[num].hist(diamonds[diamonds['cut']=='Fair'].iloc[:,num], alpha=0.6)
    axs[num].hist(diamonds[diamonds['cut']=='Good'].iloc[:,num], alpha=0.6)
This didn't work full stop.
I have tried splitting the DF and mapping
# Python
fig, ax = plt.subplots()
diamonds[diamonds['cut']=='Fair'].hist(figsize=(16,20), color='red', ax=ax, alpha=0.6)
diamonds[diamonds['cut']=='Good'].hist(figsize=(16,20), color='blue', ax=ax, alpha=0.6);
This just overwrites the first.
I have tried a few more things which I won't post - they may well have been along the right lines, but I am not versed enough in Python to get them right, so I don't think a list of failed examples will help here.
I am using Python 3 and open to all solutions using any dependencies.
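A minimal sketch of one way to reproduce this with pandas and seaborn (assuming the diamonds data can be loaded from seaborn's sample datasets): melt the x/y/z columns into long form, then facet on the melted column name and color by cut, roughly the analogue of gather() + facet_wrap().

import seaborn as sns
import matplotlib.pyplot as plt

# assumption: the diamonds data is available via seaborn's sample datasets
diamonds = sns.load_dataset('diamonds')

# melt x, y, z into long form -- the pandas analogue of tidyr::gather()
long_df = diamonds.melt(id_vars='cut', value_vars=['x', 'y', 'z'],
                        var_name='KPI', value_name='numbers')

# one panel per KPI, one color per cut -- analogue of facet_wrap(~KPI)
g = sns.FacetGrid(long_df, col='KPI', hue='cut', sharex=False)
g.map(plt.hist, 'numbers', alpha=0.6)
g.add_legend()
plt.show()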
Related
I'm trying to plot a bar graph with error bars acquired from my tests. I found some code on the internet showing how to make it, but the code does not produce the layout I want.
I've tried leaving things out, but I don't understand the dataframe well enough to know what kind of code I need to process the data correctly.
import pandas as pd
import matplotlib.pyplot as plt

# em1..em6 and en1..en6 are the measured strength values from the tests
order = pd.MultiIndex.from_arrays([['402515', '402515', '402515', '402510', '402510', '402510'],
                                   ['z', 'z', 'z', 'z', 'z', 'z']],
                                  names=['letter', 'word'])
datas = pd.DataFrame({'first cracking strength': [em1, em2, em3, em4, em5, em6],
                      'flexural strength': [en1, en2, en3, en4, en5, en6]},
                     index=order)
gp4 = datas.groupby(level=['letter', 'word'])
means = gp4.mean()
errors = gp4.std()
print(means)

fig, ax = plt.subplots()
means.plot.bar(yerr=errors, ax=ax, capsize=4)
The MultiIndex code requires two labels per row (the 'z' and the '402515'/'402510'), but I only want one (the '402515'/'402510'). What code would do that instead?
[Screenshot: how it looks when I run the code.]
[Screenshot: how I want it to look.]
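A minimal sketch of one possible reading (placeholder values stand in for em1..em6 and en1..en6): if only the '402515'/'402510' label is wanted, a plain single-level Index can replace the MultiIndex, and grouping on that one level gives a single label per bar group.

import pandas as pd
import matplotlib.pyplot as plt

# placeholder measurements standing in for em1..em6 / en1..en6
em = [10.2, 11.1, 9.8, 14.0, 13.5, 14.2]
en = [20.1, 21.4, 19.7, 25.3, 24.8, 25.9]

# single-level index: only the '402515'/'402510' label, no 'word' level
order = pd.Index(['402515', '402515', '402515',
                  '402510', '402510', '402510'], name='letter')
datas = pd.DataFrame({'first cracking strength': em,
                      'flexural strength': en}, index=order)

gp4 = datas.groupby(level='letter')
means = gp4.mean()
errors = gp4.std()

fig, ax = plt.subplots()
means.plot.bar(yerr=errors, ax=ax, capsize=4)
plt.show()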
I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd

# ageSeries: the integer age column of the data frame (a pandas Series)
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot2 library for Python and installed it. I tried to do the same ggplot command...
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results we can see that the different categories are treated as different series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to maintain the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?
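A minimal sketch without the ggplot port, using plain matplotlib (the column names Age and Category and the toy data are assumptions): passing one array of ages per category with stacked=True keeps the overall distribution while coloring the bars by category.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# toy data standing in for the real frame (names 'Age' and 'Category' assumed)
data = pd.DataFrame({
    'Age': np.random.randint(0, 100, size=500),
    'Category': np.random.choice(['A', 'B', 'C'], size=500),
})

# one array of ages per category, stacked so the overall distribution is kept
groups = data.groupby('Category')['Age']
plt.hist([ages.values for _, ages in groups],
         bins=np.arange(-0.5, 116.5, 1), stacked=True,
         label=[name for name, _ in groups])
plt.legend()
plt.xlabel('Age')
plt.show()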
A bit of a Python newb here. As a beginner it's easy to learn different functions and methods from training classes, but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?
You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column, you can set that to be the index of the df. If you're reading the data from a csv you can do this using the parse_dates keyword argument, or if you already have the df you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
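A quick sketch of that suggestion (the file name and columns are assumptions): parse 'datetime' on read, set it as the index, and the filtered 'speed' column then plots against time directly.

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical CSV with 'datetime' and 'speed' columns
spdf = pd.read_csv('speeds.csv', parse_dates=['datetime']).set_index('datetime')

# filter once, keep only the speed column, and plot against the datetime index
spdf.loc[spdf['speed'] > 0, 'speed'].dropna().plot(title='SP1 over Time')
plt.show()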
Forgive me if this question is trivial; I am just having some trouble finding a solution online, and I'm a bit new to Python. Essentially, I have a dataset full of numbers, all arranged in this format:
6.1101,17.592
5.5277,9.1302
8.5186,13.662
I'm trying to write some Python to get the number on either side of the comma. I assume it's some type of splitting, but I can't seem to find anything that works for this problem specifically, since I want to take ALL the numbers on the left and store them in one variable, then take ALL the numbers on the right and store them in another variable. The goal is to plot the data points; normally I would modify the data set, but it's a challenge problem, so I am trying to figure this out with the data as is.
Here's one way:
with open('mydata.csv') as f:
    lines = f.read().splitlines()

left_numbers, right_numbers = [], []
for line in lines:
    numbers = line.split(',')
    left_num = float(numbers[0])
    right_num = float(numbers[1])
    left_numbers.append(left_num)
    right_numbers.append(right_num)
Edit: added float conversion
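Since the stated goal is to plot the data points, one possible follow-up, using the two lists built above with matplotlib, would be:

import matplotlib.pyplot as plt

# left_numbers / right_numbers come from the loop above
plt.scatter(left_numbers, right_numbers)
plt.show()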
I am trying to get data from two separate dataframes onto the same scatterplot. I have seen solutions in R that use something like:
ggplot() + geom_point(data = df1, aes(df1.x, df1.y)) + geom_point(data = df2, aes(df2.x, df2.y))
But in python, with the ggplot module, I get errors when I try to use ggplot() with no args. Is this just a limitation of the module? I know I can likely use another tool to do the plotting but I would prefer a ggplot solution if possible.
My first data frame consists of voltage readings every 2 minutes and temperature readings every hour, so combining the two dataframes is not 1-to-1. Also, I would prefer to stick with Python because the rest of my solution is in Python.
Just giving one dataframe as the argument for ggplot() and the other inside the second geom_point declaration should do the trick:
(ggplot(aes(x='x', y='y'), data=df1) + geom_point() +
 geom_point(aes(x='x', y='y'), data=df2))
(I prefer using the column-name notation; I think it is more elegant, but that's just a personal preference.)
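If the ggplot module keeps misbehaving, a plain-matplotlib sketch of the same idea (assuming df1 and df2 each have x and y columns, as above) is to call scatter twice on the same axes:

import matplotlib.pyplot as plt

# df1, df2: the two dataframes from the question, each with 'x' and 'y' columns
fig, ax = plt.subplots()
ax.scatter(df1['x'], df1['y'], label='df1')
ax.scatter(df2['x'], df2['y'], label='df2')
ax.legend()
plt.show()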