Plotting three dimensions of categorical data in Python - python

My data has three categorical variables I'm trying to visualize:
City (one of five)
Occupation (one of four)
Blood type (one of four)
So far, I've succeeded in grouping the data in a way that I think will be easy to work with:
import numpy as np, pandas as pd
# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
'Occupation': np.random.choice(occupations,500),
'Blood Type':np.random.choice(bloodtypes,500)})
# You need to make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)
# This is now what I'd like to plot
df.groupby(by=['City','Occupation','Blood Type']).count().unstack(level=1)
Returns:
Dummy
Occupation Doctor Drone security officer Engineer Lawyer
City Blood Type
Anaheim A 7 7 7 7
AB 6 10 8 5
B 2 10 4 2
O 4 3 3 6
Atlantis A 6 5 5 7
AB 12 7 7 10
B 7 4 7 3
O 7 4 6 4
Las Vegas A 8 4 8 5
AB 5 6 8 9
B 6 10 6 6
O 6 9 5 9
Los Angeles A 7 4 8 8
AB 9 8 8 8
B 3 6 4 1
O 9 11 11 9
Tijuana A 3 4 5 3
AB 9 5 5 7
B 3 6 4 9
O 3 5 5 8
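As an aside, the dummy column is probably not needed. Here is a minimal sketch, assuming the same three columns as above, that counts rows with `groupby(...).size()` or with `pd.crosstab`. The fixed random seed and the names `counts` and `table` are my own additions for illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as above, but seeded so the result is repeatable (assumption)
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# size() counts the rows in each group, so no helper column is required
counts = (df.groupby(['City', 'Occupation', 'Blood Type'])
            .size()
            .unstack(level='Occupation', fill_value=0))

# crosstab builds the same kind of count table in one call
table = pd.crosstab([df['City'], df['Blood Type']], df['Occupation'])
print(table)
```

Both results are indexed by (City, Blood Type) with one column per Occupation, which is the layout printed above.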
My goal is to create something like the Seaborn swarmplot shown below, which comes from the Seaborn documentation. Seaborn applies jitter to the quantitative data so that you can see the individual data points and their hues:
With my data, I'd like to plot City on the x-axis and Occupation on the y-axis, applying jitter to each, and then hue by Blood type. However, sns.swarmplot requires one of the axes to be quantitative:
sns.swarmplot(data=df,x='City',y='Occupation',hue='Blood Type')
returns an error.
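One possible workaround, sketched here under assumptions rather than offered as a Seaborn feature: convert each categorical axis to integer codes, add a small random offset by hand, and draw an ordinary scatter plot coloured by blood type. The jitter width of 0.3, the seed, and the helper column names `x` and `y` are arbitrary choices of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# Map each category to an integer position on its axis
city_codes = pd.Categorical(df['City'], categories=cities).codes
occ_codes = pd.Categorical(df['Occupation'], categories=occupations).codes

# Add uniform jitter so points in the same cell do not overlap exactly
jitter = 0.3  # assumed width; tune to taste
df['x'] = city_codes + rng.uniform(-jitter, jitter, len(df))
df['y'] = occ_codes + rng.uniform(-jitter, jitter, len(df))

fig, ax = plt.subplots(figsize=(10, 6))
for bt in bloodtypes:
    sub = df[df['Blood Type'] == bt]
    ax.scatter(sub['x'], sub['y'], s=15, label=bt)  # one colour per blood type

# Put the category names back on the axes
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(cities)
ax.set_yticks(range(len(occupations)))
ax.set_yticklabels(occupations)
ax.set_xlabel('City')
ax.set_ylabel('Occupation')
ax.legend(title='Blood type')
fig.savefig('jittered_categories.png')
```

This gives up the non-overlapping layout of a true swarm plot in exchange for working with two categorical axes.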
An acceptable alternative might be to create 20 categorical bar plots, one for each intersection of City and Occupation, which I would do by running a for loop over each category, but I can't imagine how I'd feed that to matplotlib subplots to get them in a 4x5 grid.
The most similar question I could find was in R, and the asker only wanted to indicate the most common value for the third variable, so I didn't get any good ideas from there.
Thanks for any help you can provide.

Alright, I got to work on the "acceptable alternative" today and I have found a solution using basically pure matplotlib (but I stuck the Seaborn styling on top of it, just because).
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import get_cmap  # matplotlib.cm.get_cmap is deprecated in recent Matplotlib
from matplotlib.patches import Patch
import seaborn as sns
# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
'Occupation': np.random.choice(occupations,500),
'Blood Type':np.random.choice(bloodtypes,500)})
# Make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)
# This is now what I'd like to plot
grouped = df.groupby(by=['City','Occupation','Blood Type']).count().unstack()
# List of blood types, to use later as categories in subplots
kinds = grouped.columns.levels[1]
# colors for bar graph
colors = [get_cmap('viridis')(v) for v in np.linspace(0,1,len(kinds))]
sns.set(context="talk")
nxplots = len(grouped.index.levels[0])
nyplots = len(grouped.index.levels[1])
fig, axes = plt.subplots(nxplots,
                         nyplots,
                         sharey=True,
                         sharex=True,
                         figsize=(10,12))
fig.suptitle('City, occupation, and blood type')
# plot the data
for a, b in enumerate(grouped.index.levels[0]):
    for i, j in enumerate(grouped.index.levels[1]):
        axes[a,i].bar(kinds,grouped.loc[b,j],color=colors)
        axes[a,i].xaxis.set_ticks([])
axeslabels = fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)
plt.grid(False)
axeslabels.set_ylabel('City',rotation='horizontal',y=1,weight="bold")
axeslabels.set_xlabel('Occupation',weight="bold")
# x- and y-axis labels
for i, j in enumerate(grouped.index.levels[1]):
    # Label the bottom row; axes[nyplots,i] only worked because 4 occupations == 5 cities - 1
    axes[-1,i].set_xlabel(j)
for i, j in enumerate(grouped.index.levels[0]):
    axes[i,0].set_ylabel(j)
# Tune this manually to make room for the legend
fig.subplots_adjust(right=0.82)
fig.legend([Patch(facecolor = i) for i in colors],
kinds,
title="Blood type",
loc="center right")
Returns this:
I'd appreciate any feedback, and I'd still love it if someone could provide the preferred solution.
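For what it's worth, the same grid of bar plots can probably be had with less code from Seaborn's `catplot` with `kind="count"`, faceting by City and Occupation. This is only a sketch under assumptions: the seed, the `height` and `aspect` values, and the output file name are mine, and the styling differs from the figure above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# One count plot per (City, Occupation) cell; bars are the blood types
g = sns.catplot(data=df, x='Blood Type', row='City', col='Occupation',
                kind='count', order=bloodtypes, height=2, aspect=1.3)
g.set_titles('{row_name} | {col_name}')
g.savefig('city_occupation_bloodtype.png')
```

No dummy column or manual subplot bookkeeping is needed here, since `catplot` does the counting and the faceting itself.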

Related

Create separate graph of each series and save as pdf in Python [duplicate]

This question already has answers here:
Pandas dataframe groupby plot (3 answers)
Saving plots (AxesSubPlot) generated from python pandas with matplotlib's savefig (6 answers)
How to save a Seaborn plot into a file (10 answers)
Closed 6 months ago.
I have a pandas dataframe as below:
    Well Name   READTIME  WL
0           A  02-Jul-20  12
1           B  03-Aug-22  18
2           C  05-Jul-21  14
3           A  03-May-21  16
4           B  01-Jan-19  19
5           C  12-Dec-20  20
6           D  14-Nov-21  14
7           A  01-Mar-22  17
8           B  15-Feb-21  11
9           C  10-Oct-20  10
10          D  14-Sep-21   5
groupByName = df.groupby(['Well Name', 'READTIME'])
After grouping them by 'Well Name' and 'READTIME', I got the following:
Well Name  READTIME    WL
A          2020-07-02  12
           2021-05-03  16
           2022-03-01  17
B          2019-01-01  19
           2021-02-15  11
           2022-08-03  18
C          2020-10-10  10
           2020-12-12  20
           2021-07-05  14
D          2021-09-14   5
           2021-11-14  14
I have got the following graph by running this code:
sns.relplot(data=df, x="READTIME", y="WL", hue="Well Name",kind="line", height=4, aspect=3)
I want a separate graph for each "Well Name", each saved as a PDF. I would really appreciate your help with this. Thank you.
To separate out the plots, you can iterate over the four unique Well Names in your dataset and filter the dataset for each Well Name before plotting:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# I saved your data as an Excel file
df = pd.read_excel('Book1.xlsx')
print(df)
# Get the set of unique Well Names
well_names = set(df['Well Name'].to_list())
for wn in well_names:
    # Create dataframe containing only rows with this Well Name
    this_wn = df[df['Well Name'] == wn]
    # Plot, save, and show
    sns.relplot(data=this_wn, x="READTIME", y="WL", hue="Well Name",kind="line", height=4, aspect=3)
    plt.savefig(f'{wn}.png')
    plt.show(block=True)
This generated the following 4 image files:
For saving in a PDF file, please see this answer.
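In case the link is not at hand, here is a rough sketch of one way to get PDF output, using Matplotlib's `PdfPages` to write one page per well into a single file. It uses plain Matplotlib instead of `relplot`. The helper name `save_wells_to_pdf`, the file name `wells.pdf`, and the inline copy of the data are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# Inline copy of the data from the question
df = pd.DataFrame({
    "Well Name": ["A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"],
    "READTIME": ["02-Jul-20", "03-Aug-22", "05-Jul-21", "03-May-21",
                 "01-Jan-19", "12-Dec-20", "14-Nov-21", "01-Mar-22",
                 "15-Feb-21", "10-Oct-20", "14-Sep-21"],
    "WL": [12, 18, 14, 16, 19, 20, 14, 17, 11, 10, 5],
})
df["READTIME"] = pd.to_datetime(df["READTIME"], format="%d-%b-%y")


def save_wells_to_pdf(frame, pdf_path):
    """Write one line plot per well to pdf_path and return the page count."""
    n_pages = 0
    with PdfPages(pdf_path) as pdf:
        for name, group in frame.sort_values("READTIME").groupby("Well Name"):
            fig, ax = plt.subplots(figsize=(12, 4))
            ax.plot(group["READTIME"], group["WL"], marker="o")
            ax.set_title(f"Well {name}")
            ax.set_xlabel("READTIME")
            ax.set_ylabel("WL")
            pdf.savefig(fig)   # append this figure as a new page
            plt.close(fig)     # free the figure before the next well
            n_pages += 1
    return n_pages


n_pages = save_wells_to_pdf(df, "wells.pdf")
```

If one file per well is preferred, passing a name ending in `.pdf` to `plt.savefig` in the loop above should also work, since Matplotlib picks the output format from the file extension.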
In this case, specifying a row results in a faceted graph.
sns.relplot(data=df, x="READTIME", y="WL", hue="Well Name", kind="line", row='Well Name', height=4, aspect=3)

Plot with Histogram an attribute from a dataframe

I have a dataframe with the weight and the number of measures of each user. The df looks like:
id_user  weight  number_of_measures
1        92.16   4
2        80.34   5
3        71.89   11
4        81.11   7
5        77.23   8
6        92.37   2
7        88.18   3
I would like to see a histogram with an attribute of the table on the x-axis (weight, but I also want to do it for number_of_measures) and the frequency on the y-axis.
Does anyone know how to do it with matplotlib?
Ok, it seems to be quite easy:
import pandas as pd
import matplotlib.pyplot as plt
hist = df.hist(bins=50)
plt.show()
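Note that `df.hist()` on the whole frame also draws a histogram for `id_user`. A small sketch, assuming the column names from the question, that restricts the plot to the two attributes of interest and also shows a single-column variant. The bin count of 5 is an arbitrary choice for this tiny sample.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Inline copy of the table from the question
df = pd.DataFrame({
    "id_user": [1, 2, 3, 4, 5, 6, 7],
    "weight": [92.16, 80.34, 71.89, 81.11, 77.23, 92.37, 88.18],
    "number_of_measures": [4, 5, 11, 7, 8, 2, 3],
})

# One histogram panel per selected column, skipping id_user
axes = df.hist(column=["weight", "number_of_measures"], bins=5, figsize=(8, 3))
n_panels = axes.size

# Single attribute with plain matplotlib; counts holds the frequency per bin
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(df["weight"], bins=5)
ax.set_xlabel("weight")
ax.set_ylabel("frequency")
fig.savefig("weight_hist.png")
```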

legends not print fully when multiple plots are plotted on same figure

I have the code as below to plot multiple plots on the same figure
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg) #this line is only to see the variable legend has the proper content
    ax.legend(leg)
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
I get the plot in the picture below, where the legend shows only the first five letters of the label, each as a separate entry, even though the variable leg has the right content.
There was another similar question & the solution was to put a square bracket to the variable legend. I tried this with the code as below.
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg)#this line is only to see the variable legend has the proper content
    ax.legend([leg])
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
Now the label is shown in full, but only the first legend entry appears, as in the picture below.
Can someone let me know how to get the full legend for all the plots? Thanks.
dummy data (the plot in pic will NOT match)
14nm 15nm 16nm 17nm 18nm 19nm layer_thickness
1 2 3 4 5 6 0
1 2 3 4 5 6 0
3 5 7 9 11 13 5700
1 2 3 4 5 6 0
3 5 7 9 11 13 8600
1 2 3 4 5 6 0
3 5 7 9 11 13 5000
1 2 3 4 5 6 0
45 55 65 75 85 95 100
1 2 3 4 5 6 0
8 15 22 29 36 43 16600
wave_lengths=['15nm','16nm','14nm','18nm']
Answer Update
Based on answer from Quang Hoang. The output pics using scatter plot from matplotlib & sns.scatterplot
With plt it is pretty natural:
def wl_ratioplot(wavelength1,wavelength2, dataframe,
                 x1=0.1,x2=1.5,y1=-500,y2=25000,
                 ax=None):
    leg = "{} vs {}".format(wavelength1,wavelength2)
    # set the label here, and let plt deal with it
    # also, you don't need to copy the dataframe:
    ax.scatter(x=dataframe[wavelength1]/dataframe[wavelength2],
               y=dataframe['layer_thickness'],label=leg)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
fig, ax = plt.subplots(figsize=(25, 10))
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
ax.legend()
Output:
Every time you call wl_ratioplot, ax.legend([leg]) replaces the legend, so a single entry is all that is ever shown. Store the labels in a list instead and draw the legend once, after all the calls.
ax.legend([leg]) # this resets the legend on every call
# create the list once, before the calls:
legends = []
# inside wl_ratioplot, instead of ax.legend([leg]):
legends.append(leg)
# after all function calls, draw the legend once:
ax.legend(legends)
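To make the idea concrete, here is a small self-contained sketch, assuming a few rows of the dummy data from the question and three arbitrary wavelength pairs. It keeps the scatter handles next to the labels and calls `ax.legend` a single time. The variable names `pairs`, `handles`, and `labels` are my own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few distinct rows taken from the dummy data in the question
df = pd.DataFrame({
    "14nm": [1, 3, 45, 8],
    "15nm": [2, 5, 55, 15],
    "16nm": [3, 7, 65, 22],
    "layer_thickness": [0, 5700, 100, 16600],
})
pairs = [("16nm", "15nm"), ("15nm", "14nm"), ("16nm", "14nm")]  # assumed pairs

fig, ax = plt.subplots()
handles, labels = [], []
for w1, w2 in pairs:
    # Keep the artist returned by scatter so it can be matched to its label
    handles.append(ax.scatter(df[w1] / df[w2], df["layer_thickness"]))
    labels.append("{} vs {}".format(w1, w2))

# One legend call after all plotting, with one entry per scatter
ax.legend(handles, labels)
legend_texts = [t.get_text() for t in ax.get_legend().get_texts()]
fig.savefig("ratio_legend.png")
```

Passing a bare string to `ax.legend`, as in the first attempt, makes Matplotlib iterate over it character by character, which is why single letters showed up as entries.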

Using pandas series date as xtick label

I have this dataframe called 'dfArrivalDate' (with the first 11 rows shown)
arrival_date count
0 2013-06-08 9
1 2013-06-27 8
2 2013-03-06 8
3 2013-06-01 8
4 2013-06-28 6
5 2012-11-28 6
6 2013-06-11 5
7 2013-06-29 5
8 2013-06-09 4
9 2013-06-03 3
10 2013-05-31 3
sortedArrivalDate = transform.sort('arrival_date')
I wanted to plot them in a bar chart to see the count by arrival date. I called
sortedArrivalDate.plot(kind = 'bar')
but I'm getting the index as the tick labels of my bar chart. I figured I need to use 'xticks'.
sortedArrivalDate.plot(kind = 'bar', xticks = sortedArrivalDate.arrival_date)
but I run into the error: TypeError: Cannot compare type 'Timestamp' with type 'float'
I tried a different approach.
fig, ax = plt.subplots()
ax.plot(sortedArrivalDate.arrival_date, sortedArrivalDate.count)
This time the error is ValueError: x and y must have same first dimension
I'm thinking this might just be an easy fix and since I don't have much experience coding in pandas and matplotlib, I might be missing a very simple thing here. Care to guide me in the right direction? thanks.
IIUC:
df = df.sort_values(by='arrival_date')
df.plot(x='arrival_date', y='count', kind='bar')
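Some background on the second error, plus a sketch: `sortedArrivalDate.count` most likely resolves to the DataFrame's `count()` method and not to the column, so bracket access (`df['count']`) is the safer way to reach it. The example below assumes a handful of rows from the question and formats the dates by hand so the tick labels come out as plain `YYYY-MM-DD` strings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# First few rows of the frame from the question
df = pd.DataFrame({
    "arrival_date": pd.to_datetime(
        ["2013-06-08", "2013-06-27", "2013-03-06", "2013-06-01"]),
    "count": [9, 8, 8, 8],
})
df = df.sort_values(by="arrival_date")

# df.count is the DataFrame method; the column must be read with brackets
count_is_method = callable(df.count)

fig, ax = plt.subplots()
positions = list(range(len(df)))
ax.bar(positions, df["count"].tolist())
ax.set_xticks(positions)
ax.set_xticklabels(df["arrival_date"].dt.strftime("%Y-%m-%d").tolist(),
                   rotation=90)
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
fig.tight_layout()
fig.savefig("arrivals.png")
```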

Line chart in matplotlib with a double axis(strings on the axis)

I am trying to create a chart using python from a data in an Excel sheet. The data looks like this
Location Values
Trial 1 Edge 12
M-2 13
Center 14
M-4 15
M-5 12
Top 13
Trial 2 Edge 10
N-2 11
Center 11
N-4 12
N-5 13
Top 14
Trial 3 Edge 15
R-2 13
Center 12
R-4 11
R-5 10
Top 3
I want my graph to look like this:
Chart-1
The chart should have the Location column values on the x-axis, i.e. as string objects. This can be done easily (by using/creating Location as an array):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
datalink=('/Users/Maxwell/Desktop/W1.xlsx')
df=pd.read_excel(datalink,skiprows=2)
x1=df.loc[:,['Location']]
x2=df.loc[:,['Values']]
x3=np.linspace(1,len(x2),num=len(x2),endpoint=True)
vals=['Location','Edge','M-2','Center','M-4','M-5','Top','Edge','N-2','Center','N-4','N-5','Top','Edge','R-2']
plt.figure(figsize=(12,8),dpi=300)
plt.subplot(1,1,1)
plt.xticks(x3,vals)
plt.plot(x3,x2)
plt.show()
But I also want to show Trial-1, Trial-2, ... on the x-axis. Up to now I have been using Excel to generate the chart, but I have a lot of similar data and want to use Python to automate the task.
With your Excel sheet holding the data in three columns (Trial, Location, Values), you can use matplotlib to create the plot you wanted. It is not straightforward but can be done. See below:
EDIT: earlier I suggested factorplot, but it is not applicable because your location values for each trial are not constant.
# usecols replaces the older parse_cols argument, which current pandas no longer accepts
df = pd.read_excel(r'test_data.xlsx', header = 1, usecols = "D:F",
                   names = ['Trial', 'Location', 'Values'])
'''
Trial Location Values
0 Trial 1 Edge 12
1 NaN M-2 13
2 NaN Center 14
3 NaN M-4 15
4 NaN M-5 12
5 NaN Top 13
6 Trial 2 Edge 10
7 NaN N-2 11
8 NaN Center 11
9 NaN N-4 12
10 NaN N-5 13
11 NaN Top 14
12 Trial 3 Edge 15
13 NaN R-2 13
14 NaN Center 12
15 NaN R-4 11
16 NaN R-5 10
17 NaN Top 3
'''
# this will replace the nan with corresponding trial number for each set of trials
df = df.ffill()  # same as fillna(method='ffill'), which newer pandas deprecates
'''
Trial Location Values
0 Trial 1 Edge 12
1 Trial 1 M-2 13
2 Trial 1 Center 14
3 Trial 1 M-4 15
4 Trial 1 M-5 12
5 Trial 1 Top 13
6 Trial 2 Edge 10
7 Trial 2 N-2 11
8 Trial 2 Center 11
9 Trial 2 N-4 12
10 Trial 2 N-5 13
11 Trial 2 Top 14
12 Trial 3 Edge 15
13 Trial 3 R-2 13
14 Trial 3 Center 12
15 Trial 3 R-4 11
16 Trial 3 R-5 10
17 Trial 3 Top 3
'''
from matplotlib import rcParams
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
rcParams.update({'font.size': 10})
fig1 = plt.figure()
f, ax1 = plt.subplots(1, figsize = (10,3))
ax1.plot(list(df.Location.index), df['Values'],'o-')
ax1.set_xticks(list(df.Location.index))
ax1.set_xticklabels(df.Location, rotation=90 )
ax1.yaxis.set_label_text("Values")
# create a secondary axis
ax2 = ax1.twiny()
# hide all the spines that we dont need
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
pos1 = ax2.get_position() # get the original position
pos2 = [pos1.x0 + 0, pos1.y0 -0.2, pos1.width , pos1.height ] # create a new position by offseting it
ax2.xaxis.set_ticks_position('bottom')
ax2.set_position(pos2) # set a new position
trials_ticks = 1.0 * df.Trial.value_counts().cumsum()/ (len(df.Trial)) # create a series object for ticks for each trial group
trials_ticks_positions = [0]+list(trials_ticks) # add a additional zero. this will make tick at zero.
trials_labels_offset = 0.5 * df.Trial.value_counts()/ (len(df.Trial)) # create an offset for the tick label, we want the tick label to between ticks
trials_label_positions = trials_ticks - trials_labels_offset # create the position of tick labels
# set the ticks and ticks labels
ax2.set_xticks(trials_ticks_positions)
ax2.xaxis.set_major_formatter(ticker.NullFormatter())
ax2.xaxis.set_minor_locator(ticker.FixedLocator(trials_label_positions))  # 'trials' was undefined
ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(list(trials_label_positions.index)))
ax2.tick_params(axis='x', length = 10,width = 1)
plt.show()
results in
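A shorter variant that might be worth trying, offered as a sketch under assumptions and not as a replacement for the code above: compute the centre of each trial's block of points with a groupby and label those centres on a secondary x-axis placed below the main one. The `pos` helper column and the offset of -0.3 are my own choices, and `secondary_xaxis` needs a reasonably recent Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Inline copy of the sheet: the trial name appears only on the first row of each block
df = pd.DataFrame({
    "Trial": (["Trial 1"] + [None] * 5 + ["Trial 2"] + [None] * 5
              + ["Trial 3"] + [None] * 5),
    "Location": ["Edge", "M-2", "Center", "M-4", "M-5", "Top",
                 "Edge", "N-2", "Center", "N-4", "N-5", "Top",
                 "Edge", "R-2", "Center", "R-4", "R-5", "Top"],
    "Values": [12, 13, 14, 15, 12, 13, 10, 11, 11, 12, 13, 14,
               15, 13, 12, 11, 10, 3],
})
df["Trial"] = df["Trial"].ffill()   # propagate each trial name down its block
df["pos"] = np.arange(len(df))      # x position of every point

# Centre of each trial's block, in the same data coordinates as the plot
centers = df.groupby("Trial", sort=False)["pos"].mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df["pos"], df["Values"], "o-")
ax.set_xticks(df["pos"].tolist())
ax.set_xticklabels(df["Location"].tolist(), rotation=90)
ax.set_ylabel("Values")

# Second row of labels: an extra x-axis drawn below the first one
sec = ax.secondary_xaxis(location=-0.3)
sec.set_xticks(centers.to_numpy())
sec.set_xticklabels(centers.index.tolist())
sec.tick_params(axis="x", length=0)
fig.subplots_adjust(bottom=0.4)     # leave room for both label rows
fig.savefig("trials.png")
```

Because the secondary axis shares the data coordinates of the main axis, the trial labels stay centred under their points even if the number of locations per trial differs.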
