Plotting multiple overlapping histograms with columns from two separate pandas data frames - python

I've got two pandas data frames with the same column names ('Height', 'Weight', 'Alc'), but one is with dirty data and one with cleaned data.
To visualise my cleaning, I want to make a histogram with overlapping bars showing the dirty data and the cleaned data for a given variable on the same chart. Eg for 'Height' I want a chart looking like the one pictured below, with dirty 'Height' data in one colour, and cleaned 'Height' data in another.
Importantly, I want to make such a chart for all of my variables (ie 'Height', 'Weight', and 'Alc').
In order to achieve this I've tried appending the data frames and plotting the histograms from a single data frame. Unfortunately it produces just regular histograms, not overlapping ones.
Thanks for any help!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# DIRTY DATA FRAME
data = [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9], [2.0, 918, 0.9]]
df_dirty = pd.DataFrame(data, columns = ['Height', 'Weight', 'Alc'])
# CLEAN DATA FRAME
data2 = [[1.0, 10, np.nan], [0.0, 12, 0.4], [np.nan, np.nan, 0.9], [np.nan, np.nan, 0.9]]
df_clean = pd.DataFrame(data2, columns = ['Height', 'Weight', 'Alc'])
# ADDITION OF GROUP-BY VARIABLE
my_dfs = [df_dirty, df_clean]
my_status = ['dirty','clean']
for dataframe, status in zip(my_dfs, my_status):
    dataframe['groupbyvar'] = status
# DATA FRAME FOR PLOTTING
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_plotting = pd.concat([df_dirty, df_clean], ignore_index=True)
# HISTOGRAMS FOR ALL COLUMNS IN DATA FRAME FOR PLOTTING
for col in df_plotting:
    df_plotting.hist(by="groupbyvar", column=col)
EDIT:
The above chart is just a picture to explain what I want, my current code does NOT make it.
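One possible way to get the overlapping bars described above is to skip the combined data frame and draw both frames onto the same matplotlib axes with partial transparency. This is only a sketch under assumptions: the choice of 10 bins, the alpha value, and the one-row subplot layout are illustrative and do not come from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

columns = ["Height", "Weight", "Alc"]
df_dirty = pd.DataFrame(
    [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9], [2.0, 918, 0.9]],
    columns=columns,
)
df_clean = pd.DataFrame(
    [[1.0, 10, np.nan], [0.0, 12, 0.4], [np.nan, np.nan, 0.9], [np.nan, np.nan, 0.9]],
    columns=columns,
)

# One subplot per variable, each holding two overlapping histograms
fig, axes = plt.subplots(1, len(columns), figsize=(12, 3))
for ax, col in zip(axes, columns):
    dirty = df_dirty[col].dropna()
    clean = df_clean[col].dropna()
    # Shared bin edges so that the dirty and clean bars line up exactly
    bins = np.histogram_bin_edges(pd.concat([dirty, clean]), bins=10)
    ax.hist(dirty, bins=bins, alpha=0.5, label="dirty")
    ax.hist(clean, bins=bins, alpha=0.5, label="clean")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the result. Computing the bin edges from both frames together matters here: letting each call pick its own bins would make the bars of the two data sets differ in width.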

Related

How to subplot histogram using multiple columns with plotly?

So I have data that I transformed up to this point (pic below). How can I now subplot histograms that will show me the programming languages used per job? I tried with just 2 columns at first:
px.histogram(languages_job_title_grouped, x =['Python','SQL'], facet_col = 'Role', facet_col_wrap = 4,height = 1000)
But it didn't work - it plots histogram by job, but the bars are the same for every role (2nd picture below).
How can I do it the right way?
From the context of your question, it seems like you are looking for a bar plot instead.
If I understand correctly, you are starting from a dataframe equivalent to the one constructed in the code below, and you are trying to plot a faceted bar chart where the facets are the index, the x-axis is each column, and the bar heights are the values in the dataframe.
The code that generates this is:
import pandas as pd
import plotly.express as px
df = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.2, 0.1, 0.8], [0.5, 0.3, 0.9]],
    columns=["a", "b", "c"],
    index=["index1", "index2", "index3"],
)
px.bar(
    df.melt(ignore_index=False).reset_index(),
    facet_col="index",
    x="variable",
    y="value",
    barmode="group",
)
The key is to reshape your DataFrame from wide to long format with melt before trying to plot it with plotly.express.
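To make the reshaping step concrete, the following sketch prints what melt produces for the small example frame from the answer. Only pandas is needed; the variable name long_df is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.2, 0.1, 0.8], [0.5, 0.3, 0.9]],
    columns=["a", "b", "c"],
    index=["index1", "index2", "index3"],
)

# Wide to long: one row per (index, column) pair.
# ignore_index=False keeps the original index, and reset_index()
# turns it into a regular column named "index".
long_df = df.melt(ignore_index=False).reset_index()
print(long_df)
```

Each of the three resulting columns ("index", "variable", "value") maps onto one argument of the px.bar call: facet_col, x, and y respectively.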

Using Altair to generate an unstacked barplot with an already stacked data

I created a DataFrame with stacked data, i.e., the tested count is already included in the total count.
import pandas as pd
import altair as alt
df = pd.DataFrame({
    'date': ['2021-01-01', '2021-02-01', '2021-03-01'],
    'total': [10, 15, 20],
    'tested': [0, 5, 10]
})
dfm = df.melt(id_vars=['date'])
I would like to plot a stacked bar plot with Altair. Since the tested column is already contained in the total column, I would expect a chart with the max values of the total, but the result shows the sum.
alt.Chart(dfm).mark_bar().encode(
    x='date:O',
    y='value:Q',
    color='variable:O'
)
I know I can use pandas to create an untested column and generate the plot using tested and untested columns, but I would like to know if I can achieve this result without transforming the data.
To create an unstacked bar chart you can set stack=False:
alt.Chart(dfm).mark_bar().encode(
    x='date:O',
    y=alt.Y('value:Q', stack=False),
    color='variable:O'
)
Note that it will always show the rightmost column on top (tested in the image above).
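For comparison, the pandas route that the question mentions (deriving an untested column so that ordinary stacking adds up to total) might look like the sketch below. The column name untested is taken from the question; this is the data-transforming alternative, not the stack=False approach of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-02-01', '2021-03-01'],
    'total': [10, 15, 20],
    'tested': [0, 5, 10]
})

# Split total into two disjoint parts so that stacking them reproduces total
df['untested'] = df['total'] - df['tested']
dfm = df.melt(id_vars=['date'], value_vars=['tested', 'untested'])

# Sanity check: the stacked height per date equals the original total
stacked = dfm.groupby('date')['value'].sum()
print(stacked)
```

The resulting dfm can be passed to the stacked alt.Chart call from the question unchanged, at the cost of the extra transformation the asker wanted to avoid.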

Plotly: Grouped and Stacked barchart, different colors for stacked bars?

I have a bar chart that is both grouped and stacked. Now I want to color the values that are stacked in a different color. My plot currently looks like this (with a marker on the parts that should be colored differently):
The group corresponds to a category and the stacking is based on delta between two value columns (the delta is stacked on top of the lower values).
I want to color the delta (i.e. the upper part of the stacked bar) in a slightly lighter color (here: light blue for Cat1 and light red for Cat2).
However, I cannot find an appropriate option for doing so.
My data looks like this:
import pandas as pd
data = [
    ["Station1", 2.5, 3.0, "Cat1"],
    ["Station1", 3.7, 4.2, "Cat2"],
    ["Station2", 1.7, 2.1, "Cat1"],
    ["Station2", 3.9, 4.0, "Cat2"],
]
df = pd.DataFrame(data, columns=["station", "val1", "val2", "category"])
df["delta"] = df["val2"] - df["val1"]
And here is the plotting function:
import plotly.express as px
px.bar(df, x="station", y=["val1", "delta"], color="category", barmode="group").show()
How can I color the delta differently? The solution can also be in plotly (does not have to be plotly express).
Is there a way to avoid the workaround of manually computing the delta column?
I've put together a suggestion that should fit your needs pretty well, at least visually, as long as delta >= 0. To my knowledge, grouped and stacked bar charts are still a bit tricky. But building two different figures using px.bar and then combining their traces (taken from the private fig._data attribute) in a go.Figure() object will let you build this:
As you will see from the code, delta has not been explicitly implemented, but rather appears visually between the opaque bars and the non-opaque bars. If this is something you can use I'll see if I can include delta in the hoverinfo.
Complete code:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
data = [
    ["Station1", 2.5, 3.0, "Cat1"],
    ["Station1", 3.7, 4.2, "Cat2"],
    ["Station2", 1.7, 2.1, "Cat1"],
    ["Station2", 3.9, 4.0, "Cat2"],
]
df = pd.DataFrame(data, columns=["station", "val1", "val2", "category"])
df["delta"] = df["val2"] - df["val1"]
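# Build one grouped bar figure per value column and keep only the traces
# (_data is a private plotly attribute holding the traces as plain dicts)
fig1_data = px.bar(df, x="station", y=["val1"], color="category", barmode="group")._data
fig2_data = px.bar(df, x="station", y=['val2'], color="category", barmode="group")._data
# Opaque colors for val1, the same hues with low opacity for val2
fig1_data[0]['marker']['color'] = 'rgba(255,0,0,1)'
fig1_data[1]['marker']['color'] = 'rgba(0,0,255,1)'
fig2_data[0]['marker']['color'] = 'rgba(255,0,0,0.2)'
fig2_data[1]['marker']['color'] = 'rgba(0,0,255,0.2)'
# Combine both sets of traces in a single figure
dat = fig1_data+fig2_data
fig = go.Figure(dat)
fig3 = go.Figure(fig._data)
# Note: this layout update is applied to fig, while fig3 (shown below)
# is a separate figure that keeps the default layout
fig.update_layout(barmode='relative', title_text='Relative Barmode')
fig3.show()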

Plotly: How to display individual value on histogram?

I am trying to make dynamic plots with plotly. I want to plot a count of data that have been aggregated (using groupby).
I want to facet the plot by color (and maybe even by column). The problem is that I want the value count to be displayed on each bar. With histogram, I get smooth bars but I can't find how to display the count:
With a bar plot I can display the count, but I don't get smooth bars, and the count appears for each case composing a bar rather than once for the whole bar.
Here is my code for the barplot
val = pd.DataFrame(data2.groupby(["program", "gender"])["experience"].value_counts())
px.bar(x=val.index.get_level_values(0), y=val, color=val.index.get_level_values(1), barmode="group", text=val)
It's basically the same for the histogram.
Thank you for your help!
px.histogram does not seem to have a text argument. So if you're willing to do the binning yourself before producing your plot, I would use px.bar. Normally, you add text to a bar plot using px.bar(..., text=<something>), but this gives the result you've described, with text for every subcategory of your data. Since px.bar adds traces in the order in which the source data is organized, we can simply attach the text to the last subcategory using fig.data[-1].text = sums. The only remaining challenge is some data munging to retrieve the correct sums.
Plot:
Complete code with data example:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data
df = pd.DataFrame({'x':['a', 'b', 'c', 'd'],
                   'y1':[1, 4, 9, 16],
                   'y2':[1, 4, 9, 16],
                   'y3':[6, 8, 4.5, 8]})
df = df.set_index('x')
# calculations
# column sums for transposed dataframe
sums = []
for col in df.T:
    sums.append(df.T[col].sum())
# change dataframe format from wide to long for input to plotly express
df = df.reset_index()
df = pd.melt(df, id_vars = ['x'], value_vars = df.columns[1:])
fig = px.bar(df, x='x', y='value', color='variable')
fig.data[-1].text = sums
fig.update_traces(textposition='inside')
fig.show()
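The sums computed by the loop over the transposed frame are simply the row totals, so a shorter equivalent is available in pandas. The sketch below recomputes them with sum(axis=1) on the same example data and checks that both routes agree; the name loop_sums is introduced here only for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c', 'd'],
                   'y1': [1, 4, 9, 16],
                   'y2': [1, 4, 9, 16],
                   'y3': [6, 8, 4.5, 8]}).set_index('x')

# Row totals: one value per x category, summed over y1, y2 and y3
sums = df.sum(axis=1).tolist()

# Same result as the explicit loop over the transposed frame
loop_sums = [df.T[col].sum() for col in df.T]
print(sums)
```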
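If your first graph was built with the graph objects library, you can try:
# Use textposition='auto' to draw each value on its bar.
# go.Bar has no color or barmode arguments: barmode belongs to the layout,
# and per-group colors need marker_color or one go.Bar trace per group.
fig = go.Figure(data=[go.Bar(x=val.index.get_level_values(0),
                             y=val, text=val, textposition='auto')])
fig.update_layout(barmode="group")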

Generate and save plots from rows of a data frame

Python newbie here.
My problem is the following. I have a (80, 1002) DataFrame of continuous data loaded from a .csv file. My goal is to go through every row of this df (80 rows) and plot each row with a basic pyplot.plot. In this df, the first 2 columns are to be used as the title so that each plot has its specific name (here it's the time of the recording and the name of the electrode).
What I did to plot one row is :
import matplotlib.pyplot as plt
import pandas as pd
Location = r'/pathtothefile/name.csv'
df=pd.read_csv(Location,sep=';')
time = range(1, 1001)
plt.plot(time,df.loc[0, "0":"999"],'g')
plt.axhline(0, color='black',linewidth=0.5)
plt.xlabel('Time (ms)')
plt.ylabel('Power (mV)')
plt.axis([1, 1000, -5, 5])
plt.title(str(df.iloc[0,0]) + str(df.iloc[0,1]))
# Save before plt.show(): the figure may already be cleared once the window is closed
plt.savefig('/pathwheretosave/name.eps',
            format='eps', dpi=1000)
plt.show()
The "time" variable is to be plotted against each row's data. From here I want to loop over the rows of the data frame and plot and save each row in a separate file, but so far I have failed. Any idea how to do that?
Ideally I want to write the title of the plot in the name of the file to be saved.
You will want to loop over each row. This can be achieved by using the itertuples method as follows.
Example data:
sales = [{'values': [1,2,3,4], 'title': 'title 1'},
         {'values': [2,3,5,7], 'title': 'title 2'},
         {'values': [4,5,5,7], 'title': 'title 3'}]
df = pd.DataFrame(sales)
Produce a plot for the values in each row with the title specified in each row
for row in df.itertuples():
    plt.plot(row.values,marker='o')
    plt.title(row.title)
    plt.savefig(row.title + '.png')
    plt.clf()
The output of this is 3 separate plots (one for each row in the dataframe).
How about this? It will get a little more complicated if you want your x-axis to use time, rather than timestamp-labels, but it sounds like you are taking 1 measurement per second, across your electrodes.
import pandas as pd
import datetime
import matplotlib.pyplot as plt
# Make a sample DataFrame
ts = datetime.datetime.now()
df = pd.DataFrame({'time': [ts, ts, ts, ts],
                   'electrode': [1, 2, 3, 4],
                   1: [0.1, 0.1, 0.1, 0.1],
                   2: [0.22, 0.2, 0.2, 0.2],
                   3: [0.37, 0.3, 0.3, 0.3]},
                  columns = ['time', 'electrode', 1, 2, 3] )
number_of_measurements = df.shape[1] - 2
for i in range(0, len(df)):
    # One figure per row; columns from position 2 onwards hold the measurements
    fig, ax = plt.subplots(figsize=(8, 8))
    df.iloc[i][2:].plot(ax=ax, xticks=range(1, number_of_measurements + 1, 1))
    ax.set_xlabel("Time (ms)")
    ax.set_ylabel("Power (mV)")
    fig.suptitle('{} electrode:{}'.format(df.iloc[i].time, df.iloc[i].electrode))
    fig.savefig('plot{}-{}.png'.format(df.iloc[i].time, df.iloc[i].electrode), bbox_inches='tight')
    plt.close(fig)  # release the figure before the next iteration
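Putting the pieces together for the layout described in the question (two label columns followed by the samples), a loop could look like the sketch below. Everything concrete in it is an assumption made for illustration: the tiny random frame standing in for the CSV, the temporary output directory, the PNG format, and the rule for turning a title into a file name.

```python
import re
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: figures are only written to disk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: 2 label columns + 5 sample columns
rng = np.random.default_rng(0)
labels = pd.DataFrame({"time": ["t1", "t2", "t3"],
                       "electrode": ["Fz", "Cz", "Pz"]})
samples = pd.DataFrame(rng.normal(size=(3, 5)),
                       columns=[str(i) for i in range(5)])
df = pd.concat([labels, samples], axis=1)

out_dir = Path(tempfile.mkdtemp())  # placeholder for the real output folder
saved = []
for _, row in df.iterrows():
    title = f"{row.iloc[0]} {row.iloc[1]}"
    fig, ax = plt.subplots()
    # x runs from 1 to the number of samples; y is the row without its labels
    ax.plot(range(1, len(row) - 1), row.iloc[2:].astype(float), "g")
    ax.axhline(0, color="black", linewidth=0.5)
    ax.set_xlabel("Time (ms)")
    ax.set_ylabel("Power (mV)")
    ax.set_title(title)
    # Replace characters that are awkward in file names with underscores
    safe_name = re.sub(r"[^\w.-]+", "_", title)
    path = out_dir / f"{safe_name}.png"
    fig.savefig(path)
    plt.close(fig)  # free the figure so 80 rows do not pile up in memory
    saved.append(path)
```

Creating a fresh figure per row and closing it after saving keeps the plots independent of each other, which is the same purpose plt.clf() serves in the itertuples answer.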
