Using condition to label selection of points based on a value - python

How can I label only the points where X >= 3? I don't see any points labelled with this output.
This is very similar to the simple labelled points example but I feel like I am missing something simple.
import altair as alt
import pandas as pd
source = pd.DataFrame({
'x': [1, 3, 5, 7, 9],
'y': [1, 3, 5, 7, 9],
'label': ['A', 'B', 'C', 'D', 'E']
})
points = alt.Chart(source).mark_point().encode(
x='x:Q',
y='y:Q'
)
text = points.mark_text(
align='left',
baseline='middle',
dx=7
).encode(
text=alt.condition(alt.FieldGTEPredicate('x:Q', 3), 'label', alt.value(' '))
)
points + text

Predicates do not recognize encoding type shorthands; you should use the field name directly:
text=alt.condition(alt.FieldGTEPredicate('x', 3), 'label', alt.value(' '))
Even better, since this is essentially a filter operation, is to use a filter transform in place of the conditional value:
import altair as alt
import pandas as pd
source = pd.DataFrame({
'x': [1, 3, 5, 7, 9],
'y': [1, 3, 5, 7, 9],
'label': ['A', 'B', 'C', 'D', 'E']
})
points = alt.Chart(source).mark_point().encode(
x='x:Q',
y='y:Q'
)
text = points.transform_filter(
alt.datum.x >= 3
).mark_text(
align='left',
baseline='middle',
dx=7
).encode(
text='label'
)
points + text


How to create a Crosstab Plot?

I would like to create a 'Crosstab' plot like the below using matplotlib or seaborn:
Using the following dataframe:
import pandas as pd
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data = data, columns = ['col', 'row', 'val'])
  col row  val
0   A   C    2
1   A   D    8
2   B   C   25
3   B   D   30
One option in matplotlib is to add Rectangle patches at the origin via plt.gca() and add_patch. The problem is that I had to do everything manually, like this:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
plt.xlim(-10, 40)
plt.ylim(-40, 40)
plt.rcParams['figure.figsize'] = (10,16)
someX, someY = 0, 0
currentAxis = plt.gca()
currentAxis.add_patch(Rectangle((someX, someY), 30, 30, facecolor="purple"))
ax.text(15, 15, '30')
currentAxis.add_patch(Rectangle((someX, someY), 25, -25, facecolor="blue"))
ax.text(12.5, -12.5, '25')
currentAxis.add_patch(Rectangle((someX, someY), -2, -2, facecolor="red"))
ax.text(-1, -1, '2')
currentAxis.add_patch(Rectangle((someX, someY), -8, 8, facecolor="green"))
ax.text(-4, 4, '8')
Output:
As you can see, the plot doesn't look that nice. So I was wondering if it is possible to somehow automatically create 'Crosstab' plots using matplotlib or seaborn?
I am not sure whether matplotlib or seaborn has a dedicated function for this type of plot, but using plt.bar and plt.bar_label instead of Rectangle and ax.text might help automate things a little (label placement, etc.).
See code below:
import matplotlib.pyplot as plt
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
pos={'A':-1,'B':0,'C':-1,'D':1}
fig,ax=plt.subplots(figsize=(10,10))
for col, row, val in data:
    # x is the left edge (-val for A, 0 for B); the height is signed by the row (C down, D up)
    bars = ax.bar(pos[col] * val, pos[row] * val, width=val, align='edge')
    ax.bar_label(bars, labels=[val], label_type='center', fontsize=18)
ax.set_aspect('equal')
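
To avoid hard-coding the pos dictionary, the same idea can be wrapped in a small helper that reads the dataframe from the question directly. This is only a sketch: crosstab_plot is a hypothetical name, and it assumes exactly two distinct values in each of the col and row columns, with the first value seen going left/down and the second going right/up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def crosstab_plot(df, col="col", row="row", val="val"):
    """Draw one val-by-val square per (col, row) pair around the origin.

    Assumption: two distinct values per axis; the first one seen goes
    left/down, the second goes right/up.
    """
    cols = list(pd.unique(df[col]))
    rows = list(pd.unique(df[row]))
    x_sign = {cols[0]: -1, cols[1]: 1}
    y_sign = {rows[0]: -1, rows[1]: 1}

    fig, ax = plt.subplots(figsize=(6, 6))
    for c, r, v in zip(df[col], df[row], df[val]):
        left = -v if x_sign[c] < 0 else 0  # bar spans [left, left + v]
        bars = ax.bar(left, y_sign[r] * v, width=v, align="edge")
        ax.bar_label(bars, labels=[str(v)], label_type="center")
    ax.set_aspect("equal")
    return ax


df = pd.DataFrame(
    [["A", "C", 2], ["A", "D", 8], ["B", "C", 25], ["B", "D", 30]],
    columns=["col", "row", "val"],
)
ax = crosstab_plot(df)
```

Calling ax.figure.savefig(...) or plt.show() afterwards renders the four squares; the colours come from matplotlib's default colour cycle.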

How to add prefix to column name according to data in another column

There is a dataframe like the one below:
import pandas as pd
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8],
'LABEL': ['text', 'logo', 'logo', 'person', 'text', 'text', 'person', 'logo'],
'cluster_label': ['c_0', 'c_0', 'c_0', 'c_1', 'c_1', 'c_2', 'c_2', 'c_3']}
df = pd.DataFrame(data)
I want to make dummy columns for the "cluster_label" column:
pd.get_dummies(df, columns=['cluster_label'])
However, I need to add a prefix taken from the LABEL column.
Basically, the columns must be text_c_0, logo_c_0, …
How can I do that?
Many thanks in advance.
Try this:
import pandas as pd
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8],
'LABEL': ['text', 'logo', 'logo', 'person', 'text', 'text', 'person', 'logo'],
'cluster_label': ['c_0', 'c_0', 'c_0', 'c_1', 'c_1', 'c_2', 'c_2', 'c_3']
}
df = pd.DataFrame(data)
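# Dummies of cluster_label alone (no LABEL prefix); the result is not stored
pd.get_dummies(df, columns=['cluster_label'])
# Combined key such as 'text_c_0', built row by row
df['dummy'] = df.apply(lambda row: row['LABEL'] + '_' + row['cluster_label'], axis=1)
# One dummy column per combined key
pd.get_dummies(df['dummy'])
# If you want to keep ['ID', 'LABEL', 'cluster_label'] in your df: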
df = df.join(pd.get_dummies(df['dummy']))
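
The combined key can also be built without a row-wise apply, since pandas string columns concatenate element-wise. A minimal sketch of that variant follows; the names combined, dummies and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'LABEL': ['text', 'logo', 'logo', 'person', 'text', 'text', 'person', 'logo'],
    'cluster_label': ['c_0', 'c_0', 'c_0', 'c_1', 'c_1', 'c_2', 'c_2', 'c_3'],
})

# Element-wise string concatenation instead of df.apply(..., axis=1).
combined = df['LABEL'] + '_' + df['cluster_label']

# One indicator column per distinct LABEL/cluster combination.
dummies = pd.get_dummies(combined)

# Keep the original columns next to the dummies.
result = df.join(dummies)
print(list(dummies.columns))
```

Only combinations that actually occur in the data get a column, so there is no person_c_0 column here, for example.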

Finding original variable names of the important attributes in a FAMD PCA using Prince

I'm using the package Prince to perform a FAMD on data that consists of mixed data (so both categorical and non-categorical).
My code is the following:
famd = prince.FAMD(n_components=10, n_iter=3, copy=True, check_input=True, engine='auto', random_state=42)
famd = famd.fit(df_pca)
Which gives as output
Explained inertia
[0.08057161 0.05946225 0.03875787 0.03203083 0.02978785 0.02868602
0.02499968 0.02416245 0.02207422 0.02055546]
I have already tried df = pd.DataFrame(pca.components_, columns=list(dfPca.columns)) as mentioned in "PCA on sklearn - how to interpret pca.components_". In addition, I have attempted to implement the solution offered by user seralouk, with some minor changes to make it fit the Prince FAMD.
n_pcs = len(inertia)
most_important = [inertia[i].argmax() for i in range(n_pcs)]
initial_feature_names = df_pca.columns
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())
However this does not appear to work for the Prince FAMD. Are there any ways to link the output of the FAMD to the original variable names?
The link you cited is for a PCA in sklearn. You are now using a FAMD from another package, which is quite different altogether.
In the link cited, the solution by @seralouk basically goes through the eigenvector for each PC and takes out the column with the highest absolute weight. Note that this is NOT linking each component to the original column. It is finding the original column which contributes most to the PC.
You can do something like below, but I would suggest reading up on FAMD / PCA in this book to be sure of what you are actually extracting.
Below is a rough implementation to get the columns that contribute most to each component, using the V matrix. Using the example in the prince help page:
import pandas as pd
X = pd.DataFrame(
data=[
['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5]
],
columns=['E1 fruity', 'E1 woody', 'E1 coffee',
'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
'E3 fruity', 'E3 butter', 'E3 woody'],
index=['Wine {}'.format(i+1) for i in range(6)]
)
import prince
import numpy as np
n_pcs = 3
famd = prince.FAMD(n_components=n_pcs)
famd = famd.fit(X)
most_important = np.abs(famd.V_).argmax(axis=1)
initial_feature_names = X.columns
most_important_names = initial_feature_names[most_important]
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())
     0             1
0  PC1     E1 coffee
1  PC2      E1 woody
2  PC3  E2 red fruit
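
The "highest absolute weight" selection can be seen in isolation on a plain NumPy array. The loadings and feature names below are made up purely for illustration and do not come from prince; the only assumption is that rows are components and columns are original features.

```python
import numpy as np

# Hypothetical loadings: one row per component, one column per feature.
feature_names = np.array(['f0', 'f1', 'f2', 'f3'])
components = np.array([
    [0.10, -0.80, 0.30, 0.20],
    [0.70, 0.10, -0.20, 0.05],
    [0.05, 0.20, 0.10, -0.90],
])

# For every component (row), index of the feature with the largest |weight|.
top_idx = np.abs(components).argmax(axis=1)

top_features = {f'PC{i + 1}': str(feature_names[j]) for i, j in enumerate(top_idx)}
print(top_features)
```

Taking the absolute value means a strongly negative loading (f1 on PC1, f3 on PC3) counts as much as a strongly positive one; the sign tells you the direction of the contribution, not its size.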

Creating stacked chart in Altair with multiple axes and gaps

I am trying to create a chart depicting two different processes on the same timescale in Altair. Here is an example in Excel.
I have generated the stacked horizontal bar chart below in Excel using the following data. The numbers in red are offsets/gaps, not to be displayed in the final plot. There is nothing special about these numbers; please feel free to use any other set of numbers.
I would have posted an attempt, but I am completely out of my depth guessing what functionality to begin with. Any help would be greatly appreciated.
Here's an example of how you might make a chart like this, using a conditional opacity to hide the offset values:
import altair as alt
import pandas as pd
df = pd.DataFrame({
'axis': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [0.5, 0.9, 2, 1, 3, 1, 0.8, 1, 1.4, 1.1, 4.1, 0.3, 1.1],
'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', None, 'C', None, 'F', 'G']
})
alt.Chart(df.reset_index()).mark_bar().encode(
y=alt.Y('axis:O', scale=alt.Scale(domain=[2, 1])),
x='value:Q',
color=alt.Color('label:N', legend=None),
opacity=alt.condition('isValid(datum.label)', alt.value(1), alt.value(0)),
order=alt.Order('index:Q', sort='ascending')
)
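
To sanity-check where each visible segment should land, the stacking can be reproduced in pandas with a per-axis running total. This is a sketch only; the start and end column names are my own. Explicit positions like these could in principle also drive a chart through x and x2 encodings instead of hiding the gaps with opacity, though that variant is not shown here.

```python
import pandas as pd

df = pd.DataFrame({
    'axis': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'value': [0.5, 0.9, 2, 1, 3, 1, 0.8, 1, 1.4, 1.1, 4.1, 0.3, 1.1],
    'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', None, 'C', None, 'F', 'G'],
})

# Right edge of every segment is the running total within its axis.
df['end'] = df.groupby('axis')['value'].cumsum()
# Left edge is the right edge minus the segment's own length.
df['start'] = df['end'] - df['value']

# Rows without a label are the offsets/gaps, so drop them.
visible = df.dropna(subset=['label'])
print(visible[['axis', 'label', 'start', 'end']])
```

On axis 2, for example, segment C starts at about 1.4 because it is preceded by an unlabelled offset of 1.4.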

Bar chart legend based on coloring of bars by group, not value

I've created a bar chart as described here, where I have multiple variables (indicated in the 'value' column) that belong to repeated groups. I've colored the bars by their group membership.
I want to create a legend ultimately equivalent to the colors dictionary, showing the color corresponding to a given group membership.
Code here:
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
plt.legend(df['group'])
In this way, I just get a legend with one color (1) instead of (1, 2, 3).
Thanks!
You can use seaborn:
import seaborn as sns
sns.barplot(data=df, x=df.index, y='value',
hue='group', palette=colors, dodge=False)
Output:
With pandas, you could create your own legend as follows:
from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f'group {i}' for i in colors]
plt.legend(handles, labels)
plt.show()
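
The reason the original plt.legend(df['group']) call shows a single entry is that all bars are drawn as one artist container, so there is only one handle to label. The sketch below checks that and builds the proxy handles only for groups that actually occur in the data; present, n_containers and legend are illustrative names, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
from matplotlib import patches as mpatches
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 4, 5, 7, 10], 'group': [1, 2, 3, 2, 2, 3]})
colors = {1: 'r', 2: 'b', 3: 'g'}

ax = df['value'].plot(kind='bar', color=[colors[g] for g in df['group']])

# All six bars live in one container, hence one automatic legend entry.
n_containers = len(ax.containers)

# Proxy patches: one per group that is present in the data.
present = sorted(df['group'].unique())
handles = [mpatches.Patch(color=colors[g]) for g in present]
labels = [f'group {g}' for g in present]
legend = ax.legend(handles=handles, labels=labels)
```

Deriving present from the data keeps the legend in sync if the colors dictionary contains groups that do not appear in a particular dataframe.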
