Xtick frequency in pandas boxplot - python

I am using pandas groupby to plot wind speed vs. direction as a box-and-whisker plot. However, the x-axis is not readable because there are so many wind direction values close to each other.
I have tried the oc_params ax.set_xticks, but I end up with either an empty x-axis or an x-axis showing different values.
The head of my dataframe:
    Kvit_TIU  dir_cat
0   0.064740       14
1   0.057442       15
2   0.056750       15
3   0.069002       17
4   0.068464       17
5   0.067057       17
6   0.071901       12
7   0.050464        5
8   0.066165        1
9   0.073993       27
10  0.090784       34
11  0.121366       33
12  0.087172       34
13  0.066197       30
14  0.073020       17
15  0.071784       16
16  0.081699       17
17  0.088014       14
18  0.076758       14
19  0.078574       14
I grouped by dir_cat (by='dir_cat') to create a box plot:
fig = plt.figure() # create the canvas for plotting
ax1 = plt.subplot(1,1,1)
ax1 = df_KvTr10hz.boxplot(column='Kvit_TIU', by='dir_cat', showfliers=False, showmeans=True)
ax1.set_xticks([30,90, 180,270, 330])
I would like the x-axis to be plotted with fewer ticks so that the plot is readable.

ax1 = df_KvTr10hz.dropna().boxplot(column='Kvit_TIU', by='dir_cat', showfliers=False, showmeans=True)
EDIT: Using OP sample dataframe
However, if we substitute with NaNs the Kvit_TIU values for 'dir_cat'>=30
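A possible way to thin the ticks, sketched under the assumption that the boxes are drawn at positions 1..N (matplotlib's default for boxplot) in sorted dir_cat order. Under that assumption, ax1.set_xticks([30, 90, 180, 270, 330]) addresses box positions rather than direction values, which would explain the empty or shifted axis. The synthetic data, the 36 categories, and the name step are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df_KvTr10hz (assumption: 36 direction categories)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Kvit_TIU": rng.random(720) * 0.1,
    "dir_cat": np.tile(np.arange(1, 37), 20),
})

fig, ax = plt.subplots()
df.boxplot(column="Kvit_TIU", by="dir_cat", ax=ax,
           showfliers=False, showmeans=True)

# Assumed layout: one box per category at positions 1..N, in sorted order
cats = sorted(df["dir_cat"].unique())
positions = np.arange(1, len(cats) + 1)

step = 5  # keep every 5th label (arbitrary choice)
ax.set_xticks(positions[::step])
ax.set_xticklabels([str(c) for c in cats[::step]])
```

The key point is that the tick positions and the tick labels are sliced with the same step, so each remaining label still sits under its own box.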

Related

Python- compress lower end of y-axis in contourf plot

The issue
I have a contourf plot I made with a pandas dataframe that plots some 2-dimensional value with time on the x-axis and vertical pressure level on the y-axis. The field, time, and pressure data I'm pulling are all from a netCDF file. I can plot it fine, but I'd like to scale the y-axis to better represent the real atmosphere. (The default scaling is linear, but the pressure levels in the file imply a different kind of scaling.) Basically, it should look something like the plot below on the y-axis. It's like a log scale, but compressing the bottom part of the axis instead of the top. (I don't know the term for this... like a log scale but inverted?) It doesn't need to be exact.
Working example (written in Jupyter notebook)
#modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker, colors
#data
time = np.arange(0,10)
lev = np.array([900,800,650,400,100])
df = pd.DataFrame(np.arange(50).reshape(5,10),index=lev,columns=time)
df.index.name = 'Level'
print(df)
0 1 2 3 4 5 6 7 8 9
Level
900 0 1 2 3 4 5 6 7 8 9
800 10 11 12 13 14 15 16 17 18 19
650 20 21 22 23 24 25 26 27 28 29
400 30 31 32 33 34 35 36 37 38 39
100 40 41 42 43 44 45 46 47 48 49
#lists for plotting
levtick = np.arange(len(lev))
clevels = np.arange(0,55,5)
#Main plot
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.contourf(df,levels=clevels,cmap='RdBu_r')
#x-axis customization
plt.xticks(time)
ax.set_xticklabels(time)
ax.set_xlabel('Time')
#y-axis customization
plt.yticks(levtick)
ax.set_yticklabels(lev)
ax.set_ylabel('Pressure')
#title and colorbar
ax.set_title('Some mean time series')
cbar = plt.colorbar(im,values=clevels,pad=0.01)
tick_locator = ticker.MaxNLocator(nbins=11)
cbar.locator = tick_locator
cbar.update_ticks()
The Question
How can I scale the y-axis such that values near the bottom (900, 800) are compressed while values near the top (200) are expanded and given more plot space, like in the sample above my code? I tried using ax.set_yscale('function', functions=(forward, inverse)) but didn't understand how it works. I also tried simply ax.set_yscale('log'), but log isn't what I need.
You can use a custom scale transformation with ax.set_yscale('function', functions=(forward, inverse)) as you suggested. From the documentation:
forward and inverse are callables that return the scale transform
and its inverse.
In this case, define in forward() the function you want, such as the inverse of the log function, or a more custom one for your need. Call this function before your y-axis customization.
def forward(x):
    return 2**x
def inverse(x):
    return np.log2(x)
ax.set_yscale('function', functions=(forward,inverse))
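To see what this does to the spacing, here is a minimal self-contained sketch built on the question's example. It assumes, as in the question, that the y-axis holds the row positions 0..4 and that the pressure values are only tick labels. The variable spacing is added here purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

def forward(y):
    # data -> axis space: grows faster for larger y, so the low rows are compressed
    return 2.0 ** np.asarray(y, dtype=float)

def inverse(y):
    # axis space -> data: must undo forward()
    return np.log2(y)

lev = np.array([900, 800, 650, 400, 100])
data = np.arange(50).reshape(5, 10)

fig, ax = plt.subplots(figsize=(10, 5))
im = ax.contourf(data, levels=np.arange(0, 55, 5), cmap="RdBu_r")

# Set the scale first; ticks are customized afterwards
ax.set_yscale("function", functions=(forward, inverse))
ax.set_yticks(np.arange(len(lev)))
ax.set_yticklabels(lev)

# Distance between neighbouring rows in axis space (before normalization)
spacing = np.diff(forward(np.arange(len(lev))))
```

With 2**x, each step up the axis gets twice the room of the step below it, so the 900 to 800 band is the narrowest and the 400 to 100 band the widest. A different forward function (with its matching inverse) would give a gentler or stronger compression.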

legends not print fully when multiple plots are plotted on same figure

I have the code as below to plot multiple plots on the same figure
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg) #this line is only to see the variable legend has the proper content
    ax.legend(leg)
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
I get the plot shown in the picture below, where the legend entries are the first 5 letters of the string, one letter per entry, even though the variable leg has the right content.
There was another similar question, and the solution there was to wrap the legend variable in square brackets. I tried this with the code below.
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg)#this line is only to see the variable legend has the proper content
    ax.legend([leg])
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
Now I get the full legend but only the first legend is shown as the pic below
Can someone let me know how to get the full legend for all the plots? Thanks.
dummy data (the plot in pic will NOT match)
14nm  15nm  16nm  17nm  18nm  19nm  layer_thickness
   1     2     3     4     5     6                0
   1     2     3     4     5     6                0
   3     5     7     9    11    13             5700
   1     2     3     4     5     6                0
   3     5     7     9    11    13             8600
   1     2     3     4     5     6                0
   3     5     7     9    11    13             5000
   1     2     3     4     5     6                0
  45    55    65    75    85    95              100
   1     2     3     4     5     6                0
   8    15    22    29    36    43            16600
wave_lengths=['15nm','16nm','14nm','18nm']
Answer Update
Based on answer from Quang Hoang. The output pics using scatter plot from matplotlib & sns.scatterplot
With plt it is pretty natural:
def wl_ratioplot(wavelength1,wavelength2, dataframe,
                 x1=0.1,x2=1.5,y1=-500,y2=25000,
                 ax=None):
    leg = "{} vs {}".format(wavelength1,wavelength2)
    # set the label here, and let plt deal with it
    # also, you don't need to copy the dataframe:
    ax.scatter(x=dataframe[wavelength1]/dataframe[wavelength2],
               y=dataframe['layer_thickness'],label=leg)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
fig, ax = plt.subplots(figsize=(25, 10))
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
ax.legend()
Output:
Every time you call wl_ratioplot, the legend is reset, so only a single label survives. Use a list to collect all the labels, then draw the legend once after all the calls.
ax.legend([leg]) # this resets the legend on each call
Create the list before the calls:
legends = []
Inside the function, append the label string (not a one-element list):
legends.append(leg)
After all the function calls, draw the legend once:
ax.legend(legends)
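A minimal sketch of the collect-then-draw approach, using four rows of the dummy data from the question and plain ax.scatter. The list pairs and the variable labels are illustrative names, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Four rows taken from the dummy data in the question
df = pd.DataFrame({
    "14nm": [1, 3, 45, 8],
    "15nm": [2, 5, 55, 15],
    "16nm": [3, 7, 65, 22],
    "18nm": [5, 11, 85, 36],
    "layer_thickness": [0, 5700, 100, 16600],
})
wave_lengths = ["15nm", "16nm", "14nm", "18nm"]
pairs = [(2, 0), (0, 1), (3, 1)]  # index pairs into wave_lengths

fig, ax = plt.subplots()
legends = []
for i, j in pairs:
    w1, w2 = wave_lengths[i], wave_lengths[j]
    ax.scatter(df[w1] / df[w2], df["layer_thickness"])
    legends.append("{} vs {}".format(w1, w2))  # append the string itself

# One call after all plotting: the i-th label is matched to the i-th artist
legend = ax.legend(legends)
labels = [t.get_text() for t in legend.get_texts()]
```

Passing label= to each scatter call and calling ax.legend() with no arguments, as in the other answer, avoids having to keep the list in sync with the plotting order.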

pandas DataFrame.value_counts().plot.bar() and DataFrame.value_counts().cumsum().plot() not using the same axis

I am trying to draw a frequency bar plot and a cumulative "ogive" in the same plot. If I draw them separately, both are shown correctly, but when they are drawn in the same figure, the cumulative curve is shifted. The code I used is below.
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df['Correctas'].value_counts(sort = False).plot.bar();
df['Correctas'].value_counts(sort = False).cumsum().plot();
plt.show()
The cumulative frequency data is
2 1
3 3
4 7
5 14
6 20
7 24
8 27
9 29
10 30
11 31
12 32
So the cumulative curve should start at 2 on the x-axis, but it starts at 4.
image showing the error
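This happens because the bar chart uses a categorical x-axis: the bars are placed at positions 0, 1, 2, ... and only labelled with the index values, while the line plot uses the numeric index values themselves as x coordinates. The line therefore starts at x = 2, which is the position of the third bar (labelled 4). Converting the index to strings makes both plots use the same positions. Here is a quick fix: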
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df_counts = df['Correctas'].value_counts(sort = False)
df_counts.index = df_counts.index.astype('str')
df_counts.plot.bar(alpha=.8);
df_counts.cumsum().plot(color='k', kind='line');
plt.show();
Output:
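The same idea can be written with explicit positions in plain matplotlib, which makes the shared x coordinates visible. This is only a sketch: the names counts and positions are introduced here, and sort_index() is added on the assumption that value_counts(sort=False) may not return the scores in ascending order in every pandas version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,
                                 3,6,9,5,11,5,12,7,7,5,4,6]})

# Frequencies ordered by score rather than by count
counts = df['Correctas'].value_counts().sort_index()
positions = np.arange(len(counts))  # one integer slot per distinct score

fig, ax = plt.subplots()
ax.bar(positions, counts.to_numpy(), alpha=0.8)             # frequencies
ax.plot(positions, counts.cumsum().to_numpy(), color='k')   # ogive on the same slots
ax.set_xticks(positions)
ax.set_xticklabels([str(v) for v in counts.index])
```

Because both artists are drawn against positions, the cumulative line starts above the bar labelled 2 and ends above the bar labelled 12.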

Line chart in matplotlib with a double axis(strings on the axis)

I am trying to create a chart in Python from data in an Excel sheet. The data looks like this:
Location Values
Trial 1 Edge 12
M-2 13
Center 14
M-4 15
M-5 12
Top 13
Trial 2 Edge 10
N-2 11
Center 11
N-4 12
N-5 13
Top 14
Trial 3 Edge 15
R-2 13
Center 12
R-4 11
R-5 10
Top 3
I want my graph to look like this:
Chart-1
The chart should have the Location column values, i.e. string objects, on the x-axis. This can be done easily (by using/creating Location as an array):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
datalink=('/Users/Maxwell/Desktop/W1.xlsx')
df=pd.read_excel(datalink,skiprows=2)
x1=df.loc[:,['Location']]
x2=df.loc[:,['Values']]
x3=np.linspace(1,len(x2),num=len(x2),endpoint=True)
vals=['Location','Edge','M-2','Center','M-4','M-5','Top','Edge','N-2','Center','N-4','N-5','Top','Edge','R-2']
plt.figure(figsize=(12,8),dpi=300)
plt.subplot(1,1,1)
plt.xticks(x3,vals)
plt.plot(x3,x2)
plt.show()
But I also want to show Trial-1, Trial-2, ... on the x-axis. Up to now I have been using Excel to generate the chart, but I have a lot of similar data and want to use Python to automate the task.
With your Excel sheet laid out as in the question, you can use matplotlib to create the plot you wanted. It is not straightforward, but it can be done. See below:
EDIT: earlier I suggested factorplot, but it is not applicable because your location values for each trial are not constant.
# note: older pandas versions called this argument parse_cols; current versions use usecols
df = pd.read_excel(r'test_data.xlsx', header = 1, usecols = "D:F",
                   names = ['Trial', 'Location', 'Values'])
'''
Trial Location Values
0 Trial 1 Edge 12
1 NaN M-2 13
2 NaN Center 14
3 NaN M-4 15
4 NaN M-5 12
5 NaN Top 13
6 Trial 2 Edge 10
7 NaN N-2 11
8 NaN Center 11
9 NaN N-4 12
10 NaN N-5 13
11 NaN Top 14
12 Trial 3 Edge 15
13 NaN R-2 13
14 NaN Center 12
15 NaN R-4 11
16 NaN R-5 10
17 NaN Top 3
'''
# this will replace the NaN with the corresponding trial number for each set of trials
df = df.ffill()
'''
Trial Location Values
0 Trial 1 Edge 12
1 Trial 1 M-2 13
2 Trial 1 Center 14
3 Trial 1 M-4 15
4 Trial 1 M-5 12
5 Trial 1 Top 13
6 Trial 2 Edge 10
7 Trial 2 N-2 11
8 Trial 2 Center 11
9 Trial 2 N-4 12
10 Trial 2 N-5 13
11 Trial 2 Top 14
12 Trial 3 Edge 15
13 Trial 3 R-2 13
14 Trial 3 Center 12
15 Trial 3 R-4 11
16 Trial 3 R-5 10
17 Trial 3 Top 3
'''
from matplotlib import rcParams
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
rcParams.update({'font.size': 10})
fig1 = plt.figure()
f, ax1 = plt.subplots(1, figsize = (10,3))
ax1.plot(list(df.Location.index), df['Values'],'o-')
ax1.set_xticks(list(df.Location.index))
ax1.set_xticklabels(df.Location, rotation=90 )
ax1.yaxis.set_label_text("Values")
# create a secondary axis
ax2 = ax1.twiny()
# hide all the spines that we dont need
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
pos1 = ax2.get_position() # get the original position
pos2 = [pos1.x0 + 0, pos1.y0 -0.2, pos1.width , pos1.height ] # create a new position by offseting it
ax2.xaxis.set_ticks_position('bottom')
ax2.set_position(pos2) # set a new position
trials_ticks = 1.0 * df.Trial.value_counts().cumsum()/ (len(df.Trial)) # create a series object for ticks for each trial group
trials_ticks_positions = [0]+list(trials_ticks) # add a additional zero. this will make tick at zero.
trials_labels_offset = 0.5 * df.Trial.value_counts()/ (len(df.Trial)) # create an offset for the tick label, we want the tick label to between ticks
trials_label_positions = trials_ticks - trials_labels_offset # create the position of tick labels
# set the ticks and ticks labels
ax2.set_xticks(trials_ticks_positions)
ax2.xaxis.set_major_formatter(ticker.NullFormatter())
ax2.xaxis.set_minor_locator(ticker.FixedLocator(trials_label_positions))
ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(list(trials_label_positions.index)))
ax2.tick_params(axis='x', length = 10,width = 1)
plt.show()
results in
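The tick arithmetic above can be checked in isolation. The sketch below uses a shortened, made-up version of the table (three locations per trial) and groupby(..., sort=False).size() instead of value_counts(), on the assumption that the trials should stay in their order of appearance. The names sizes, ends, and centres are illustrative.

```python
import pandas as pd

# Shortened stand-in for the Excel data: Trial is only filled on the first row of each block
df = pd.DataFrame({
    "Trial": ["Trial 1", None, None, "Trial 2", None, None],
    "Location": ["Edge", "M-2", "Top", "Edge", "N-2", "Top"],
    "Values": [12, 13, 13, 10, 11, 14],
})
df["Trial"] = df["Trial"].ffill()  # propagate each trial name down its block

# Rows per trial, in order of first appearance
sizes = df.groupby("Trial", sort=False).size()

# Right edge of each trial block as a fraction of the axis width (major ticks)
ends = sizes.cumsum() / len(df)

# Midpoint of each block: where the trial name is drawn (minor ticks)
centres = ends - 0.5 * sizes / len(df)
```

Here ends gives 0.5 and 1.0 and centres gives 0.25 and 0.75, so the separators fall between the blocks and each trial name is centred under its own block.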

Pandas Plotting Y-Axis indexing issue

I have this pandas data frame set up:
FY NY_State
0 1986-87 89431973
1 1987-88 95958200
2 1988-89 100664606
3 1989-90 99703990
4 1990-91 95446076
5 1991-92 91487047
6 1992-93 92658482
7 1993-94 88026334
8 1994-95 90845207
9 1995-96 80070860
10 1996-97 77357591
11 1997-98 87040859
12 1998-99 89547598
13 1999-00 93484650
14 2000-01 118696779
15 2001-02 132748185
16 2002-03 111932612
17 2003-04 116911977
18 2004-05 119898693
19 2005-06 149293542
20 2006-07 161647387
21 2007-08 193891526
22 2008-09 170071041
23 2009-10 180069745
24 2010-11 174704520
FWIW:
In [50]: totalData.dtypes
Out[50]:
FY object
NY_State int64
dtype: object
I want to make a bar chart with the FY on the x-axis and the y-axis being the amount in the NY_State column.
I've been getting some progress with this:
totalData.plot(x=totalData.FY, kind='bar')
but that gives me this:
Then I tried this:
totalData.plot(x=totalData.FY, kind='bar', ylim=(70000000, 240000000))
And that gave me this:
Which is better, but still not what I want. I tried:
totalData.plot(x=totalData.FY, y=totalData.NY_State, kind='bar')
but that gives me an exception of
IndexError: indices are out-of-bounds
...which makes no sense whatsoever to me how that's possible.
Would really appreciate help.
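One thing that may be worth trying, offered as an assumption since no answer is shown here: DataFrame.plot documents x and y as column labels (or positions), not as the column data itself, so passing totalData.FY and totalData.NY_State may be what leads to the indexing error. A minimal sketch with the first three rows of the data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

# First three rows of the data frame from the question
totalData = pd.DataFrame({
    "FY": ["1986-87", "1987-88", "1988-89"],
    "NY_State": [89431973, 95958200, 100664606],
})

# Pass column names, not the Series themselves
ax = totalData.plot(x="FY", y="NY_State", kind="bar")

heights = [p.get_height() for p in ax.patches]
ticklabels = [t.get_text() for t in ax.get_xticklabels()]
```

With column names, each bar takes its height from NY_State and its tick label from FY. The ylim argument from the earlier attempt can still be added to the same call.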
