Pandas: grouping and xticks in plots - python

I'm working with some data that has the following structure:
I want to group the data by column B and take the mean of each group, so that I can plot and compare the results.
sub_data = data_composite.groupby(['B']).aggregate(np.mean)
ax = sub_data.plot()
Obtaining:
However, I would like the corresponding group labels as xticks in the figure, i.e. KP40, KP08, etc. Something like this:
Is there any way to do that?
Thank you very much. Kind regards,

This should work for you:
import numpy as np
import matplotlib.pyplot as plt
tmp_labels = data_composite.drop_duplicates(subset='B', keep = 'first')
xlabels = tmp_labels['B'].values
plt.xticks(np.arange(sub_data.shape[0]),list(xlabels), rotation=90)

I found out this can be done more easily by using the index of the grouped data as the input for plt.xticks.
sub_data = data_composite.groupby(['B']).aggregate(np.mean)
ax = sub_data.plot()
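ax.set_xlabel(x)  # x: the x-axis label, defined elsewhere in the original code
ax.set_ylabel('Quantity of {}'.format(y))  # y: name of the plotted quantity, defined elsewhere
# one tick per group, labelled with the group key from the index
plt.xticks(np.arange(sub_data.shape[0]), list(sub_data.index), rotation=90)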
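A minimal self-contained sketch of the index-based approach, in case it helps to try it out. The contents of data_composite are not shown in the question, so the frame below is a hypothetical stand-in (columns v1 and v2 are made up), and the Agg backend is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_composite: a label column B plus numeric columns.
data_composite = pd.DataFrame({
    "B": ["KP40", "KP08", "KP40", "KP08", "KP15"],
    "v1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "v2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# groupby sorts the group keys, so the index holds the labels in plotting order.
sub_data = data_composite.groupby("B").mean()

ax = sub_data.plot()
positions = np.arange(len(sub_data))
ax.set_xticks(positions)                               # one tick per group
ax.set_xticklabels(list(sub_data.index), rotation=90)  # label ticks with group keys

tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```

Taking the labels from sub_data.index rather than from the raw column keeps them in the same (sorted) order as the plotted means.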

Related

How to make clustered heatmap of a large dataset look nicer?

I have a distance matrix which I normalized, trimmed the row and column headers with python regular expressions and tried to make a clustered heatmap from it with the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv('distance_matrix_Mult_Align(distance).csv', index_col=0)
row_sums = df.sum(axis=1)
new_matrix = df.div(row_sums, axis=0)  # divide each row by its sum
def acc_id(s):
    import re
    # keep only the text between the first and last '|'
    match = re.search(r'\|(.*)\|', s)
    if match:
        return match.group(1)
sns.clustermap(new_matrix.rename(columns=acc_id, index=acc_id),
               row_cluster=False,
               xticklabels=True,
               yticklabels=True,
               cmap='RdBu',
               center=0,
               vmin=0,
               vmax=1)
plt.figure()
plt.show()
My clustered map looks like this:
I have tried reading the documentation for clustermap and pyplot: https://seaborn.pydata.org/generated/seaborn.clustermap.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
But I can't work out how to make the plot look useful. I would really appreciate any help. Thanks!
The problem is your vmax=1 argument. If you look at the maximum value in the whole dataset using new_matrix.max().max(), it is about 0.17, so most of the colour range is never used.
So either remove vmax altogether, or set it to a lower value.
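To see why vmax=1 washes out the colours, here is a small sketch that row-normalises a matrix and checks its maximum. The 4x4 random matrix is a hypothetical stand-in for the distance matrix, which is not included in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 matrix standing in for the real distance matrix.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 4)), index=list("abcd"), columns=list("abcd"))

# Divide every row by its sum, so each row adds up to 1.
new_matrix = df.div(df.sum(axis=1), axis=0)

# Largest value in the normalised data: a sensible upper limit for the colour scale.
data_max = new_matrix.to_numpy().max()
print(data_max)
```

Because each row sums to 1, individual entries are well below 1 once there are many columns. Passing vmax=data_max to sns.clustermap (or leaving vmax out) should let the colour map span the actual range of the data.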

Changing the order of pandas/matplotlib line plotting without changing data order

Given the following example:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
df.plot(linewidth=10)
The order of plotting puts the last column on top:
How can I make this keep the data & legend order but change the behaviour so that it plots X on top of Y on top of Z?
(I know I can change the data column order and edit the legend order, but I am hoping for a simpler method that leaves the data as is.)
UPDATE: final solution used:
(Thanks to r-beginners) I used get_lines to modify the z-order of each plotted line:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
fig = plt.figure()
ax = fig.add_subplot(111)
df.plot(ax=ax, linewidth=10)
lines = ax.get_lines()
for i, line in enumerate(lines, -len(lines)):
    line.set_zorder(abs(i))
fig
In a notebook produces:
Get the artists from the axes, check their default order, and then set the zorder of each line to the desired value.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
ax = df.plot(linewidth=10)
l = ax.get_children()
print(l)
l[0].set_zorder(3)
l[1].set_zorder(1)
l[2].set_zorder(2)
Before definition
After defining zorder
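A slightly more general sketch of the same idea, assuming the goal is simply "earlier columns on top": loop over ax.get_lines() instead of hard-coding one zorder per line. The small fractional offsets are my own choice, so the lines stay below other artists such as the legend.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame(np.random.randint(1, 10, size=(8, 3)), columns=list("XYZ"))
ax = df.plot(linewidth=10)

lines = ax.get_lines()  # one Line2D per column, in column order
n = len(lines)
for i, line in enumerate(lines):
    # Earlier columns get a slightly higher zorder: X over Y, Y over Z.
    line.set_zorder(line.get_zorder() + (n - i) / 10)

zorders = [line.get_zorder() for line in lines]
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Only the drawing order changes here; the data and the legend order are left untouched.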
I will just put this answer here because it is a solution to the problem, but probably not the one you are looking for.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
# read columns in reverse order and plot them
# so normally, the legend will be inverted as well, but if we invert it again, you should get what you want
df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
Note that in this example, you don't change the order of your data, you just read it differently, so I don't really know if that's what you want.
You can also make it easier on the eyes by creating a corresponding method.
def plot_dataframe(df: pd.DataFrame) -> None:
    df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
# then you just have to call this
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
plot_dataframe(df)

Plot stacked bar chart from pandas data frame

I have a dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following Excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to loop over the columns of the dataframe, plot each one, and keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar:
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval+payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
Besides the lack of data, I think the following code will produce the desired graph
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
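# label each bar with the year of its month (x= expects a column label, so set the index instead)
grouped.set_index(grouped.index.year).plot(kind='bar', stacked=True)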
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (see the list of available frequencies in the pandas documentation).
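Since the question's data is not included, here is a small runnable sketch of the monthly-sum-then-stack idea on made-up data. The columns a and b are hypothetical, and grouping by dt.to_period("M") is a swapped-in alternative to pd.Grouper that avoids depending on a particular frequency alias.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for payout_df; columns a and b are made up.
payout_df = pd.DataFrame({
    "payout": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10", "2021-03-15"]),
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
})

# Sum the value columns per calendar month.
months = payout_df["payout"].dt.to_period("M")
monthly = payout_df.drop(columns="payout").groupby(months).sum()

# One stacked bar per month, one segment per column.
ax = monthly.plot(kind="bar", stacked=True, figsize=(12, 8))
ax.set_xticklabels([str(p) for p in monthly.index], rotation=30)
```

With stacked=True pandas keeps track of the running bottom itself, so no manual cumulative sum is needed.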

How to mark the beginning of a new year while plotting pandas Series?

I am plotting such data:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
# pd.date_range replaces the removed DatetimeIndex(start=..., end=...) constructor
a = pd.date_range(start='2010-01-01', end='2011-06-01', freq='M')
b = pd.Series(np.random.randn(len(a)), index=a)
I would like the plot to be in the format of bars, so I use this:
b.plot(kind='bar')
This is what I get:
As you can see, the dates are formatted in full, which is very ugly and unreadable. I happened to test this command which creates a very nice Date format:
b.plot()
As you can see:
I like this format very much, it includes the months, marks the beginning of the year and is easily readable.
After some searching, the closest I could get to that format is this:
fig, ax = plt.subplots()
ax.plot(b.index, b)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
However the output looks like this:
I am able to have month names on x axis this way, but I like the first formatting much more. That is much more elegant. Does anyone know how I can get the same exact xticks for my bar plot?
Here's a solution that will get you the format you're looking for. You can build the tick labels yourself and pass them to the set_major_formatter() method:
import matplotlib.ticker as mticker
fig, ax = plt.subplots()
ax.bar(b.index, b)
# month abbreviation for every bar, plus the year on every 12th label
ticklabels = [item.strftime('%b') for item in b.index]
ticklabels[::12] = [item.strftime('%b\n%Y') for item in b.index[::12]]
ax.xaxis.set_major_formatter(mticker.FixedFormatter(ticklabels))
ax.set_xticks(b.index)
plt.gcf().autofmt_xdate()
Output:
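A variant of the same idea, sketched under a couple of assumptions of my own: the bars are placed at integer positions (as b.plot(kind='bar') does), and the year is attached to every January rather than to every 12th label. The month-start index used here is hypothetical test data, not the question's month-end index.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical monthly data: 17 month-start dates from January 2010.
idx = pd.date_range("2010-01-01", periods=17, freq="MS")
np.random.seed(0)
b = pd.Series(np.random.randn(len(idx)), index=idx)

# Month abbreviation everywhere, with the year added under each January.
labels = [d.strftime("%b\n%Y") if d.month == 1 else d.strftime("%b") for d in idx]

positions = np.arange(len(b))
fig, ax = plt.subplots()
ax.bar(positions, b.values)
ax.set_xticks(positions)     # one tick per bar
ax.set_xticklabels(labels)   # labels in the same order as the bars
```

Keying the year label on d.month == 1 keeps the year marker on January even when the series does not start at the beginning of a year.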

Multiple single plots in seaborn with pandas groupby data

My issue is very specific, I guess, but I can't seem to find a proper solution, and I'm clueless about the error output that I get.
Anyway, I have a pandas dataframe loaded from an SQLite database.
data_frame = pd.read_sql_query(
"SELECT (total_comb + total_comb_rc) as total_comb, p_val, w_length from {tn}".format(
tn=table_name), conn)
With that loaded, I group the data by the 'w_length' value.
for i, group in data_frame.groupby('w_length'):
Now, I want to plot a scatter plot for each group using seaborn's lmplot.
for i, group in data_frame.groupby('w_length'):
    sns.lmplot(x=group['total_comb'], y=group['p_val'],
               data=group,
               fit_reg=False)
    sns.despine()
    plt.savefig('test_scatter'+i+'.png', dpi=400)
But for some reason I'm getting this output:
'[ 6.95485628e-02 3.53641178e-01 3.46862200e+06 4.11684800e+06] not in index'
and no plot file.
I know I'm doing something wrong, but I can't figure out what.
PS: I know I can do something like this:
sns.lmplot(x='total_comb', y='p_val',
           data=data_frame,
           fit_reg=False,
           hue="w_length", x_jitter=.1, col="w_length", col_wrap=3, size=4)
but I also need separate plots for each 'w_length'.
Thanks!!
Supposing the problem is not due to the data collection from the sql database, it's probably due to the fact that you call
sns.lmplot(x=group['total_comb'], y=group['p_val'], data=group)
instead of
sns.lmplot(x='total_comb', y='p_val', data=group)
Here is a working example, which produces two separate plots:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np; np.random.seed(42)
x = np.arange(24)
y = np.random.randint(1,10, len(x))
cat = np.random.choice(["A", "B"], size=len(x))
df = pd.DataFrame({"x": x, "y": y, "cat": cat})
for i, group in df.groupby('cat'):
    sns.lmplot(x="x", y="y", data=group, fit_reg=False)
    plt.savefig(__file__+str(i)+".png")
plt.show()
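One more detail that may matter for the original loop: if 'w_length' is numeric, 'test_scatter'+i+'.png' would raise a TypeError, so the group key has to be converted to a string. The sketch below shows the save-one-file-per-group pattern with plain matplotlib scatter plots (swapped in for lmplot so it runs without seaborn). The data values and the temporary output directory are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the column names from the question.
data_frame = pd.DataFrame({
    "total_comb": [10, 20, 30, 40, 50, 60],
    "p_val": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "w_length": [4, 4, 4, 6, 6, 6],
})

out_dir = tempfile.mkdtemp()  # placeholder output directory
saved = []
for w_length, group in data_frame.groupby("w_length"):
    fig, ax = plt.subplots()
    ax.scatter(group["total_comb"], group["p_val"])
    ax.set_xlabel("total_comb")
    ax.set_ylabel("p_val")
    # The f-string converts the numeric group key to text for the file name.
    path = os.path.join(out_dir, f"test_scatter{w_length}.png")
    fig.savefig(path, dpi=100)
    plt.close(fig)  # free the figure before moving on to the next group
    saved.append(path)
```

Saving through the figure object (fig.savefig) rather than plt.savefig makes it explicit which figure ends up in which file.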
