Seaborn Python Filter Columns [duplicate] - python

I have two queries:
I want to remove the empty bar from the bar graph (present in the first column).
I have to use this graph in a PowerPoint presentation. How can I increase the height of the bar graph so that it fits the height of the slide? I have tried to increase the height, but it does not increase any further. Is it possible? If not, what other options can I try?
plt.figure(figsize=(40,20))
g = sns.catplot(x = 'Subject', y = 'EE Score',data = df , hue = 'Session',col='Grade',sharey = True,sharex = True,
                hue_order=["2017-18", "2018-19", "2019-20"], kind="bar");
#plt.legend(bbox_to_anchor=(1, 1), loc=2)
g.set(ylim=(0, 100))
g.set_axis_labels("Subject", "EE Score")
ax = g.facet_axis(0,0)
for p in ax.patches:
    ax.text(p.get_x() + 0.015,
            p.get_height() * 1.02,
            '{0:.1f}'.format(p.get_height()),
            color='black', rotation='horizontal', size=12)
ax = g.facet_axis(0,1)
for p in ax.patches:
    ax.text(p.get_x() + 0.015,
            p.get_height() * 1.02,
            '{0:.1f}'.format(p.get_height()),
            color='black', rotation='horizontal', size=12)
ax = g.facet_axis(0,2)
for p in ax.patches:
    ax.text(p.get_x() + 0.015,
            p.get_height() * 1.02,
            '{0:.1f}'.format(p.get_height()),
            color='black', rotation='horizontal', size=12)
ax = g.facet_axis(0,3)
for p in ax.patches:
    ax.text(p.get_x() + 0.015,
            p.get_height() * 1.02,
            '{0:.1f}'.format(p.get_height()),
            color='black', rotation='horizontal', size=12)
#g.set_ylabel('')
plt.savefig('2.png', bbox_inches = 'tight')

Like @JohanC, I initially thought it was not possible to remove an empty category from a catplot(). However, Michael's comment provides the solution: sharex=False.
This solution will not work if the column used for x= is a category dtype, which can be checked with pandas.DataFrame.info()
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
seaborn is a high-level api for matplotlib
object dtype x-axis
import seaborn as sns
titanic = sns.load_dataset('titanic')
# remove one category
titanic.drop(titanic.loc[(titanic['class']=='First')&(titanic['who']=='child')].index, inplace=True)
g = sns.catplot(x="who", y="survived", col="class", data=titanic, kind="bar", ci=None, sharex=False, hue='embarked', estimator=sum)
categorical dtype x-axis
See that tips.day is categorical and sharex=False will not work
The column can be converted to object dtype with tips.day = tips.day.astype('str'), in which case, sharex=False will work, but the days of the week will not be ordered.
tips = sns.load_dataset('tips')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
g = sns.catplot(x="day", y="total_bill", col="time", kind="bar", data=tips, ci=None, sharex=False, hue='smoker')
After converting the column to an object dtype
Note the days are no longer ordered.
tips.day = tips.day.astype('str')
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null object
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(3), float64(2), int64(1), object(1)
memory usage: 8.8+ KB
g = sns.catplot(x="day", y="total_bill", col="time", kind="bar", data=tips, ci=None, sharex=False, hue='smoker')
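The dtype check and conversion described above can be sketched with pandas alone. The small frame below is made up for illustration (it is not the tips dataset); it only shows that a category column remembers unused levels, and that converting to str drops them along with the declared level order.

```python
import pandas as pd

# Hypothetical data: 'day' is categorical with a declared level order
df = pd.DataFrame({
    "day": pd.Categorical(["Sat", "Thur", "Sun", "Sat"],
                          categories=["Thur", "Fri", "Sat", "Sun"]),
    "total_bill": [20.0, 15.5, 30.1, 12.3],
})

# A category dtype keeps every declared level, even the unused one ("Fri")
is_categorical = isinstance(df["day"].dtype, pd.CategoricalDtype)
declared_levels = list(df["day"].cat.categories)

# Converting to str drops the category machinery and the declared level order
df["day"] = df["day"].astype(str)
levels_after = sorted(df["day"].unique())

print(is_categorical, declared_levels, levels_after)
```

After the conversion only the values that actually occur remain, which is presumably why sharex=False can then drop the empty positions per facet.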

Related

Getting a strange error while calculating z-score

I want to calculate the z-score of my whole dataset. I have tried two versions of the code, but both gave me the same error.
My first attempt:
zee=stats.zscore(df)
print(zee)
My second attempt:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df))
print(z)
I am using Jupyter.
The error I got:
-----
TypeError Traceback (most recent call last)
<ipython-input-23-ef429aebacfd> in <module>
1 from scipy import stats
2 import numpy as np
----> 3 z = np.abs(stats.zscore(df))
4 print(z)
~/.local/lib/python3.8/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~/.local/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Here is the info of my dataframe, in case something is wrong with it:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Region 100 non-null object
1 Country 100 non-null object
2 Item Type 100 non-null object
3 Sales Channel 100 non-null object
4 Order Priority 100 non-null object
5 Order Date 100 non-null object
6 Order ID 100 non-null int64
7 Ship Date 100 non-null object
8 Units Sold 100 non-null int64
9 Unit Price 100 non-null float64
10 Unit Cost 100 non-null float64
11 Total Revenue 100 non-null float64
12 Total Cost 100 non-null float64
13 Total Profit 100 non-null float64
dtypes: float64(5), int64(2), object(7)
memory usage: 11.1+ KB
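Thanks in advance.
Your df contains non-numeric (object) columns such as Region and Order Date, so the mean inside zscore ends up trying to divide a string by an integer. Pass only the int/float columns to zscore, for example: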
stats.zscore(df[['Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']])
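Rather than listing the numeric columns by hand, they can be selected by dtype. This is a minimal sketch on made-up data (column names borrowed from the question, values invented). The z-score is computed directly with pandas using ddof=0, which is also the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical frame mixing text and numeric columns
df = pd.DataFrame({
    "Region": ["Asia", "Europe", "Asia", "Africa"],
    "Units Sold": [10, 20, 30, 40],
    "Unit Price": [1.5, 2.5, 3.5, 4.5],
})

# Keep only int/float columns, then standardize each column
numeric = df.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std(ddof=0)

print(z)
```

On the dataframe from the question, select_dtypes would also pick up Order ID, which is an identifier rather than a measurement, so it is probably worth dropping it first.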

Altair Stripplot - bring columns together

with this dataframe structure, df_ppp:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MeanPPP 628 non-null object
1 StdPPP 626 non-null object
2 MeanPPG 628 non-null object
3 MeanPrice 628 non-null object
4 MeanSelected 628 non-null object
5 TotalMinutes 628 non-null object
6 TotalPoints 628 non-null object
7 Position 628 non-null object
8 Team 628 non-null object
9 Player 628 non-null object
10 Color 628 non-null object
and the following code:
stripplot = alt.Chart(df_ppp, width=120).mark_circle().encode(
    x=alt.X(
        'jitter:Q',
        title=None,
        axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
        scale=alt.Scale(),
    ),
    y=alt.Y('MeanPPP:Q'),
    color=alt.Color('Color:N', legend=None, scale=None),
    tooltip = [alt.Tooltip('Player:N'),
               alt.Tooltip('Position:N'),
               alt.Tooltip('Team:N'),
               alt.Tooltip('MeanPPP:Q'),
               alt.Tooltip('MeanPPG:Q'),
               alt.Tooltip('MeanPrice:Q'),
               alt.Tooltip('MeanSelected:Q'),
               alt.Tooltip('TotalMinutes:Q'),
               alt.Tooltip('TotalPoints:Q')],
    column=alt.Column(
        'Team:N',
        header=alt.Header(
            labelAngle=-90,
            titleOrient='top',
            labelOrient='bottom',
            labelAlign='right',
            labelPadding=10,
        ),
    ),
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter='sqrt(-2*log(random()))*cos(2*PI*random())'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).configure_axis(
    grid=False
).properties(height=300, width=50)
I'm plotting this:
This is the result I'm aiming at, with stripplots closer to each value.
Altair examples - stripplot
How do I bring the columns closer together?
The Altair code was fine.
The problem with the column width was not caused by Altair but by the Streamlit call used to display the chart:
Streamlit was overriding the column width.
So I changed:
st.altair_chart(stripplot, use_container_width=True)
to:
st.altair_chart(stripplot)
and now I plot:

Filter by conditions and plot batch graphs in python

I have a dataset df as shown below:
      id                timestamp  data  group_id
99   265  2019-11-28 15:44:34.027  22.5         1
100  266  2019-11-28 15:44:34.027  23.5         2
101  267  2019-11-28 15:44:34.027  27.5         3
102  273  2019-11-28 15:44:38.653  22.5         1
104  275  2019-11-28 15:44:38.653  22.5         2
I have plotted a graph for a chunk of data filtered by a particular group_id and date, e.g. group_id == 3, date = 2020-01-01, using the code below:
df['timestamp'] = pd.to_datetime(df['timestamp'])
GROUP_ID = 2
df = df[df['group_id'] == GROUP_ID]
df['Date'] = [datetime.datetime.date(d) for d in df['timestamp']]
df = df[df['Date'] == pd.to_datetime('2020-01-01')]
df.plot(x='timestamp', y='data', figsize=(42, 16))
plt.axhline(y=40, color='r', linestyle='-')
plt.axhline(y=25, color='b', linestyle='-')
df['top_lim'] = 40
df['bottom_lim'] = 25
plt.fill_between(df['timestamp'], df['bottom_lim'], df['data'],
                 where=(df['data'] >= df['bottom_lim'])&(df['data'] <= df['top_lim']),
                 facecolor='orange', alpha=0.3)
mask = (df['data'] <= df['top_lim'])&(df['data'] >= df['bottom_lim'])
plt.scatter(df['timestamp'][mask], df['data'][mask], marker='.', color='black')
cumulated_time = df['timestamp'][mask].diff().sum()
plt.gcf().subplots_adjust(left = 0.3)
plt.xlabel('Timestamp')
plt.ylabel('Data')
plt.show()
Now I want to plot a graph for each group_id for each date. How can I do it? Is there a way to group the data by these two conditions and plot the graphs? Or is it better to use a for-loop?
Using a for-loop, you can take the following approach. Assuming each group has 2 dates, a convenient layout is a grid of subplots with 2 columns and one row per group:
# assumes df, pd and plt are already defined as in the question
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = df['timestamp'].dt.date
groups = sorted(df['group_id'].unique())
dates = sorted(df['Date'].unique())
rows = len(groups)  # one row per group
cols = len(dates)   # one column per date (2 under the assumption above)
TOP_LIM = 40
BOTTOM_LIM = 25
# squeeze=False keeps ax two-dimensional even with a single row or column
fig, ax = plt.subplots(rows, cols, figsize=(13, 8), sharex=False, sharey=False, squeeze=False)
for g, group in enumerate(groups):
    for d, date in enumerate(dates):
        # filter into a new variable so the original df is not overwritten
        sub = df[(df['group_id'] == group) & (df['Date'] == date)]
        if sub.empty:
            continue
        # draw on the subplot for this group/date instead of a new figure
        sub.plot(x='timestamp', y='data', ax=ax[g][d], legend=False)
        ax[g][d].axhline(y=TOP_LIM, color='r', linestyle='-')
        ax[g][d].axhline(y=BOTTOM_LIM, color='b', linestyle='-')
        mask = (sub['data'] <= TOP_LIM) & (sub['data'] >= BOTTOM_LIM)
        ax[g][d].fill_between(sub['timestamp'], BOTTOM_LIM, sub['data'],
                              where=mask, facecolor='orange', alpha=0.3)
        ax[g][d].scatter(sub['timestamp'][mask], sub['data'][mask], marker='.', color='black')
        cumulated_time = sub['timestamp'][mask].diff().sum()
fig.text(0.5, -0.01, 'Timestamp', ha='center', va='center', fontsize=20)
fig.text(-0.01, 0.5, 'Data', ha='center', va='center', rotation='vertical', fontsize=20)
plt.subplots_adjust(left=0.3)
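As an alternative to nested loops with manual filtering, pandas groupby can hand back one subset per (group_id, date) pair, which answers the "group by these two conditions" part of the question. The sketch below uses made-up rows shaped like the question's data and draws one figure per pair; the Agg backend and the limits as constants are choices made here for the example, not part of the original posts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data in the shape shown in the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-11-28 15:44:34", "2019-11-28 15:44:38",
        "2019-11-29 10:00:00", "2019-11-29 10:00:05",
        "2019-11-28 15:44:34", "2019-11-28 15:44:38",
    ]),
    "data": [22.5, 27.5, 30.0, 41.0, 23.5, 26.0],
    "group_id": [1, 1, 1, 1, 2, 2],
})
df["Date"] = df["timestamp"].dt.date

BOTTOM_LIM, TOP_LIM = 25, 40
plotted = []

# Each iteration yields the rows for one (group_id, Date) combination
for (group_id, date), sub in df.groupby(["group_id", "Date"]):
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(sub["timestamp"], sub["data"])
    ax.axhline(TOP_LIM, color="r")
    ax.axhline(BOTTOM_LIM, color="b")
    # Mark the points that fall between the two limits (inclusive)
    in_range = sub["data"].between(BOTTOM_LIM, TOP_LIM)
    ax.scatter(sub["timestamp"][in_range], sub["data"][in_range],
               marker=".", color="black")
    ax.set_title(f"group_id={group_id}, date={date}")
    plotted.append((group_id, date))
    plt.close(fig)  # free the figure; call fig.savefig(...) before this to keep it

print(plotted)
```

Only combinations that actually occur in the data are visited, so there is no need to guard against empty subsets.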

In seaborn, how to increase the graph and save as image?

In python3 and pandas I have this dataframe:
gastos_anuais.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
ano 5 non-null int64
valor_pago 5 non-null float64
dtypes: float64(1), int64(1)
memory usage: 280.0 bytes
gastos_anuais.reset_index()
index ano valor_pago
0 0 2014 13,082,008,854.37
1 3 2017 9,412,069,205.73
2 2 2016 7,617,420,559.22
3 1 2015 7,470,391,492.24
4 4 2018 7,099,199,179.11
I did a pointplot chart:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.pointplot(x='ano', y='valor_pago', data=gastos_anuais)
plt.xticks(rotation=65)
plt.grid(True, linestyle="--")
plt.title("Gastos Destinados pelo Governo Federal (2014-2018)\n")
plt.xlabel("Anos")
plt.ylabel("Em bilhões de R$")
plt.show()
It worked, but I would also like to:
Increase the size of the chart that appears on the screen
Save the image to a file, for example as .jpeg
I also do not understand why '1e10' appears below the title of the graph.
Please, does anyone know how I can do this?
Increase the size of the chart that appears on the screen
Add sns.set(rc={'figure.figsize':(w, h)}) before plotting. For example:
sns.set(rc={'figure.figsize':(20, 5)})
Save as jpg
Keep a reference to the plot, get the figure and save it:
p = sns.pointplot(x='ano', y='valor_pago', data=gastos_anuais)
plt.xticks(rotation=65)
#...
# All your other customizations with `plt`
#...
fig = p.get_figure()
fig.savefig("gastos_anuais.jpg")
What is the 1e10 in the corner?
It is the scale factor of the y axis. It means that the tick values shown on the y axis must be multiplied by 10^10 to recover the actual values of the data.
If you want to remove it, you can use:
plt.ticklabel_format(style='plain', axis='y')
But the tick labels will then show the full-length numbers, which clutter the figure, so you will probably want to rescale the values (for example to billions).
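One way to deal with the large values is to leave the data unchanged and format the tick labels in billions instead. The sketch below uses plain matplotlib with the figures from the question; the helper name billions and the figure size are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Values taken from the table in the question
anos = [2014, 2015, 2016, 2017, 2018]
valores = [13_082_008_854.37, 7_470_391_492.24, 7_617_420_559.22,
           9_412_069_205.73, 7_099_199_179.11]

def billions(value, pos=None):
    """Format a raw value as billions with one decimal place."""
    return f"{value / 1e9:.1f}"

# figsize sets the size in inches for this one figure
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(anos, valores, marker="o")
ax.set_xticks(anos)
# Replace the default scalar formatter, so no '1e10' offset text appears
ax.yaxis.set_major_formatter(FuncFormatter(billions))
ax.set_ylabel("Em bilhões de R$")
```

With a seaborn plot the same formatter can be applied to the Axes that sns.pointplot returns, and fig.savefig("gastos_anuais.jpg") then writes the result to disk.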

Plotting Pandas' pivot_table from long data

I have an xls file with data organized in long format. It has four columns: the variable name, the country name, the year and the value.
After importing the data in Python with pandas.read_excel, I want to plot the time series of one variable for different countries. To do so, I create a pivot table that transforms the data in wide format. When I try to plot with matplotlib, I get an error
ValueError: could not convert string to float: 'ZAF'
(where 'ZAF' is the label of one country)
What's the problem?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('raw_emissions_energy.xls','raw data', index_col = None, thousands='.',parse_cols="A,C,F,M")
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data[(data['VAR']=='CO2_PBPROD')], index='COU', columns='Year')
plt.plot(data_CO2PROD)
The xls file with raw data looks like:
[Image: raw data Excel view]
This is what I get from data_CO2PROD.info()
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, ARE to ZAF
Data columns (total 16 columns):
(Value, 1990) 104 non-null float64
(Value, 1995) 105 non-null float64
(Value, 2000) 105 non-null float64
(Value, 2001) 105 non-null float64
(Value, 2002) 105 non-null float64
(Value, 2003) 105 non-null float64
(Value, 2004) 105 non-null float64
(Value, 2005) 105 non-null float64
(Value, 2006) 105 non-null float64
(Value, 2007) 105 non-null float64
(Value, 2008) 105 non-null float64
(Value, 2009) 105 non-null float64
(Value, 2010) 105 non-null float64
(Value, 2011) 105 non-null float64
(Value, 2012) 105 non-null float64
(Value, 2013) 105 non-null float64
dtypes: float64(16)
memory usage: 13.9+ KB
None
Using data_CO2PROD.plot() instead of plt.plot(data_CO2PROD) allowed me to plot the data. http://pandas.pydata.org/pandas-docs/stable/visualization.html.
Simple code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data= pd.DataFrame(np.random.randn(3,4), columns=['VAR','COU','Year','VAL'])
data['VAR'] = ['CC','CC','KK']
data['COU'] =['ZAF','NL','DK']
data['Year']=['1987','1987','2006']
data['VAL'] = [32,33,35]
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')], index='COU', columns='Year')
data_CO2PROD.plot()
plt.show()
I think you need to add the values parameter to pivot_table:
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')],
                              index='COU',
                              columns='Year',
                              values='VAL')  # 'VAL' in the sample above; 'Value' in the original data
data_CO2PROD.plot()
plt.show()
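To see what the values parameter changes, here is a small pandas-only sketch on made-up long-format data (same column names as the sample above, values invented). It also transposes the result, on the assumption that the goal is one line per country with the year along the x-axis.

```python
import pandas as pd

# Hypothetical long-format data: one row per variable, country and year
data = pd.DataFrame({
    "VAR": ["CC", "CC", "CC", "CC", "KK"],
    "COU": ["ZAF", "ZAF", "NL", "NL", "DK"],
    "Year": ["1990", "1995", "1990", "1995", "1990"],
    "VAL": [32, 34, 33, 36, 35],
})

# Keep one variable and only the columns needed for the pivot
subset = data.loc[data["VAR"] == "CC", ["COU", "Year", "VAL"]]

# Without values=, the columns form a two-level index such as ('VAL', '1990')
wide_multi = pd.pivot_table(subset, index="COU", columns="Year")

# With values=, the columns are just the years
wide = pd.pivot_table(subset, index="COU", columns="Year", values="VAL")

# Transpose so that each country is a column and each year is a row
by_year = wide.T
print(by_year)
```

Calling by_year.plot() would then draw one line per country over the years, which seems to be the time-series view the question is after.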
