Plotting one scatterplot with multiple dataframes with ggplot in python

I am trying to get data from two separate dataframes onto the same scatterplot. I have seen solutions in R that use something like:
ggplot() + geom_point(data = df1, aes(x, y)) + geom_point(data = df2, aes(x, y))
But in python, with the ggplot module, I get errors when I try to use ggplot() with no args. Is this just a limitation of the module? I know I can likely use another tool to do the plotting but I would prefer a ggplot solution if possible.
My data consists of voltage readings every 2 minutes in one dataframe and temperature readings every hour in the other, so the two dataframes cannot be combined one-to-one. Also, I would prefer to stick with Python because the rest of my solution is in Python.

Passing one dataframe as an argument to ggplot() and the other inside the second geom_point call should do the job:
(ggplot(aes(x='x', y='y'), data=df1) + geom_point() +
 geom_point(aes(x='x', y='y'), data=df2))
(I prefer the column-name notation; I think it is more elegant, but this is just a personal preference.)

Related

Why am I unable to make a plot containing subplots in plotly using a px.scatter plot?

I have been trying to make a figure using plotly that combines multiple figures together. In order to do this, I have been trying to use the make_subplots function, but I have found it very difficult to have the plots added in such a way that they are properly formatted. I can currently make singular plots (as seen directly below):
However, whenever I try to combine these singular plots using make_subplots, I end up with this:
This figure has the subplots set up completely wrong, since I need each of the four subplots to contain data pertaining to the four methods (A, B, C, and D). In other words, I would like to have four subplots that look like my singular plot example above.
I have set up the code in the following way:
for sequence in sequences:
    # process for making sequence profile is done here
    sequence_df = pd.DataFrame(sequence_profile)
    row_number = 1
    grand_figure = make_subplots(rows=4, cols=1)
    # there are four groups per sequence, so the grand figure should have four subplots in total
    for group in sequence_df["group"].unique():
        figure_df_group = sequence_df[(sequence_df["group"] == group)]
        figure_df_group.sort_values("sample", ascending=True, inplace=True)
        figure = px.line(figure_df_group, x=figure_df_group["sample"], y=figure_df_group["intensity"], color=figure_df_group["method"])
        figure.update_xaxes(title="sample")
        figure.update_traces(mode='markers+lines')
        # note: the next line fails, since data must be extracted from the figure, hence why it is commented out
        # grand_figure.append_trace(figure, row=row_number, col=1)
        figure.update_layout(title_text="{} Profile Plot".format(sequence))
        grand_figure.append_trace(figure.data[0], row=row_number, col=1)
        row_number += 1
        figure.write_image(os.path.join(output_directory + "{}_profile_plot_subplots_in_{}.jpg".format(sequence, group)))
    grand_figure.write_image(os.path.join(output_directory + "grand_figure_{}_profile_plot_subplots.jpg".format(sequence)))
I have tried following directions (like for example, here: ValueError: Invalid element(s) received for the 'data' property) but I was unable to get my figures added as is as subplots. At first it seemed like I needed to use the graph object (go) module in plotly (https://plotly.com/python/subplots/), but I would really like to keep the formatting/design of my current singular plot. I just want the plots to be conglomerated in groups of four. However, when I try to add the subplots like I currently do, I need to use the data property of the figure, which causes the design of my scatter plot to be completely messed up. Any help for how I can ameliorate this problem would be great.

Ok, so I found a solution here. Rather than using the make_subplots function, I just instead exported all the figures onto an .html file (Plotly saving multiple plots into a single html) and then converted it into an image (HTML to IMAGE using Python). This isn't exactly the approach I would have preferred to have, but it does work.
UPDATE
I have found that plotly express offers another solution: px.line accepts facet parameters (such as facet_col) that lay out multiple subplots within a single figure. My code is now set up like this; unlike the code above, the dataframe does not need to be iterated over by group in a for loop:
sequence_df = pd.DataFrame(sequence_profile)
figure = px.line(sequence_df, x = sequence_df["sample"], y = sequence_df["intensity"], color= sequence_df["method"], facet_col= sequence_df["group"])
Although it still needs more formatting, my plot now looks like this, which works much better for my purposes:

Python: A histogram for selected columns split by a variable

I do most of my work in R and am trying to explore a bit more of Python. My fluency of the latter is pretty rubbish so explaining anything super simple won't offend me :)
I am starting some exploratory analysis and want to show the distribution of each variable by what will become the target variable. The outcome I would like is a histogram for every column in the DF, with the data split by the target. In R this is super simple; in the example below x, y and z are the columns and 'cut' is the target.
How could I reproduce this in Python?
# R
library(ggplot2)
library(tidyr)
shinyStuff <- gather(diamonds,KPI,numbers,x:z)
ggplot(data = shinyStuff)+geom_histogram(aes(x=numbers,color=cut),stat='count') + facet_wrap(~KPI)
I have tried looping over DF like this:
# Python
for num, col in enumerate(diamonds):
    print(num)
    plt.figure()
    axs[num].hist(diamonds[diamonds['cut']=='Fair'].iloc[:,num],alpha=0.6)
    axs[num].hist(diamonds[diamonds['cut']=='Good'].iloc[:,num],alpha=0.6)
This didn't work full stop.
I have tried splitting the DF and mapping
# Python
fig, ax = plt.subplots()
diamonds[diamonds['Cut']=='Fair'].hist(figsize = (16,20),color='red',ax=ax,alpha=0.6)
diamonds[diamonds['Cut']=='Good'].hist(figsize = (16,20),color='blue',ax=ax,alpha=0.6);
This just over writes the first.
I tried a few more things which I won't post. They may well have been along the right lines, but I am not versed enough in Python to get them right, so I don't think a list of failed examples will help here.
I am using Python 3 and open to all solutions using any dependencies.

ggplot geom_histogram behaves differently between Python and R

I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot2 library for Python and installed it. I tried to do the same ggplot command...
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results, we can see that the different categories are treated as separate series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to preserve the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?

Plot Subset of Dataframe without Being Redundant

A bit of a Python newb here. As a beginner it's easy to learn different functions and methods from training classes, but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?

You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column, you can set it as the index of the df. If you're reading the data from a csv, you can do this with the index_col and parse_dates keyword arguments of pd.read_csv; if you already have the df, you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
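As a rough sketch of the filter-once idea, the following assumes spdf has 'datetime' and 'speed' columns as described in the question; the values here are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for spdf; the real data is not shown in the question.
spdf = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:05",
        "2020-01-01 00:10", "2020-01-01 00:15",
    ]),
    "speed": [0.0, 12.5, None, 8.0],
})

# Filter once, then index by datetime so it becomes the x-axis when plotting.
# Missing speeds compare as False against 0, so they are dropped by the filter too.
moving = spdf.loc[spdf["speed"] > 0].set_index("datetime")["speed"]
```

With matplotlib installed, moving.plot(title='SP1 over Time') would then draw speed against datetime without repeating the subset expression.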

Python xarray: grouping by multiple parameters

When using the xarray package for Python 2.7, is it possible to group over multiple parameters like you can in pandas? In essence, an operation like:
data.groupby(['time.year','time.month']).mean()
if you wanted to get mean values for each year and month of a dataset.

Unfortunately, xarray does not support grouping with multiple arguments yet. This is something we would like to support and it would be relatively straightforward, but nobody has had the time to implement it yet (contributions would be welcome!).

An easy way around is to construct a multiindex and group by that "new" coordinate:
da_multiindex = da.stack(my_multiindex=['time.year','time.month'])
da_mean = da.groupby("my_multiindex").mean()
da_mean.unstack() # go back to normal index
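Note that stack expects the names of existing dimensions, so depending on the xarray version the snippet above may need the year and month added as coordinates first. A different single-key variant, sketched here under the assumption that da has a datetime 'time' coordinate, builds one combined year-month label and groups by it. The sample array and the name year_month are made up for illustration, and this swaps the multiindex for a plain string label.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily data covering January through April 2000.
time = pd.date_range("2000-01-01", periods=120, freq="D")
da = xr.DataArray(np.arange(120.0), coords={"time": time}, dims="time")

# One label per timestamp, e.g. "2000-01"; grouping by it splits by year and month at once.
year_month = da["time"].dt.strftime("%Y-%m").rename("year_month")
monthly_mean = da.groupby(year_month).mean()
```

The result has a single year_month dimension with one entry per calendar month present in the data.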
