Related
I am having difficulty with this. I have the results from my initial model (`Unfiltered´), that I plot like so:
df = pd.DataFrame(
{'class': ['foot', 'bike', 'bus', 'car', 'metro'],
'Precision': [0.7, 0.66, 0.41, 0.61, 0.11],
'Recall': [0.58, 0.35, 0.13, 0.89, 0.02],
'F1-score': [0.64, 0.45, 0.2, 0.72, 0.04]}
)
groups = df.melt(id_vars=['class'], var_name=['Metric'])
sns.barplot(data=groups, x='class', y='value', hue='Metric')
To produce this nice plot:
Now, I obtained a second results from my improved model (filtered), so I add a column (status) to my df to indicate the results from each model like this:
df2 = pd.DataFrame(
{'class': ['foot','foot','bike','bike','bus','bus',
'car','car','metro','metro'],
'Precison': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
'status': ['Unfiltered', 'Filtered', 'Unfiltered','Filtered','Unfiltered',
'Filtered','Unfiltered','Filtered','Unfiltered','Filtered']}
)
df2.head()
class Precison Recall F1-score status
0 foot 0.70 0.58 0.64 Unfiltered
1 foot 0.62 0.93 0.74 Filtered
2 bike 0.66 0.35 0.45 Unfiltered
3 bike 0.96 0.40 0.56 Filtered
4 bus 0.41 0.13 0.20 Unfiltered
And I want to plot this, in similar grouping as above (i.e. foot, bike, bus, car, metro). However, for each of the metrics, I want to place the two values side-by-side. Take for example, the foot group, I would have two bars Precision[Unfiltered, filtered], then 2 bars for Recall[Unfiltered, filtered] and also 2 bars for F1-score[Unfiltered, filtered]. Likewise all other groups.
My attempt:
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue='Metric')
Totally not what I want.
You can pass in hue any sequence as long as it has the same length as your data, and assign colours through it.
So you could try with
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue=group2[['Metric','status']].agg(tuple, axis=1))
plt.legend(fontsize=7)
But the result is a bit hard to read:
Seaborn grouped barplots don't allow for multiple grouping variables. One workaround is to recode the two grouping variables (Metric and status) as one variable with 6 levels. Another possibility is to use facets. If you are open to another plotting package, I might recommend plotnine, which allows multiple grouping variables as follows:
import plotnine as p9
fig = (
p9.ggplot(group2)
+ p9.geom_col(
p9.aes(x="class", y="value", fill="Metric", color="Metric", alpha="status"),
position=p9.position_dodge(1),
size=1,
width=0.5,
)
+ p9.scale_color_manual(("red", "blue", "green"))
+ p9.scale_fill_manual(("red", "blue", "green"))
)
fig.draw()
This generates the following image:
I have a DataFrame called "DataExample" and an ascending sorted list called "normalsizes".
import pandas as pd
if __name__ == "__main__":
DataExample = [[0.6, 0.36 ,0.00],
[0.6, 0.36 ,0.00],
[0.9, 0.81 ,0.85],
[0.8, 0.64 ,0.91],
[1.0, 1.00 ,0.92],
[1.0, 1.00 ,0.95],
[0.9, 0.81 ,0.97],
[1.2, 1.44 ,0.97],
[1.0, 1.00 ,0.97],
[1.0, 1.00 ,0.99],
[1.2, 1.44 ,0.99],
[1.1, 1.21 ,0.99]]
DataExample = pd.DataFrame(data = DataExample, columns = ['Lx', 'A', 'Ratio'])
normalsizes = [0, 0.75, 1, 1.25, 1.5, 1.75 ,2, 2.25, 2.4, 2.5, 2.75, 3,
3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]
# for i in example.index:
#
# numb = example['Lx'][i]
What I am looking for is that each “DataExample [‘ Lx ’]” is analyzed and located within a range of normalsizes, for example:
For DataExample [‘Lx’] [0] = 0.6 -----> then it is between the interval of [0, 0.75] -----> 0.6> 0 and 0.6 <= 0.75 -----> so I take the largest of that interval, that is, 0.75. This for each row.
With this I should have the following result:
Lx A Ratio
1 0.36 0
1 0.36 0
1 0.81 0.85
1 0.64 0.91
1.25 1 0.92
1.25 1 0.95
1 0.81 0.97
1.25 1.44 0.97
1.25 1 0.97
1.25 1 0.99
1.25 1.44 0.99
1.25 1.21 0.99
numpy.searchsorted will get you what you want
import numpy as np
normalsizes = np.array(normalsizes) # convert to numpy array
DataExample["Lx"] = normalsizes[np.searchsorted(normalsizes, DataExample["Lx"])]
I have a dataframe that I would like to make a strip plot out of, the array consists of the following
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on, the number of elements in the list of 'Sentiment' are the same as the number of mentions, I would like to make a strip plot with the Symbol as the x axis and sentiment as the y axis, I believe the problem that I'm encountering is because of the different lengths of list, the actual error reading I'm getting is
ValueError: setting an array element with a sequence.
the code that I'm trying to use to create the strip plot is this
def symbolSentimentVisualization(dataset):
sns.stripplot(x='Symbol',y='Sentiment',data=dataset.loc[:9])
plt.show()
the other part of my issue I would guess has something to do with numpy trying to set multidimensional arrays with different lengths before being put into a seaborn plot, but not 100% on that, if the solution is to plot one row at a time and then merge plots that would definitely work but I'm not sure what exactly I should call to do that because trying it out with the following doesn't seem to work either.
def symbolSentimentVisualization(dataset):
sns.stripplot(x=dataset['Symbol'][0],y=dataset['Sentiment'][0],data=dataset.loc[:9])
plt.show()
IIUC explode 'Sentiment' first then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph:
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
I have a heatmap that I created from Pandas in this way:
tukey = tukey.set_index('index')
fix,ax = plt.subplots(figsize=(12,6))
ax.set_title(str(date)+' '+ str(hour)+':'+'00',fontsize=14)
heatmap_args = {'linewidths': 0.35, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.75, 0.35, 0.04, 0.3]}
sp.sign_plot(tukey, **heatmap_args)
I have tried to do this with seaborn but I haven't gotten the desired output:
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(tukey, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 6))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(tukey, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
As seen, it still shows square where it is supposed to be masked and obviously the cbar is different.
My question is if there is any way to make it diagonal without using seaborn? Or at least just to get rid of the repeating part?
Edit: sample of my dataframe (the tukey):
>>> 1_a 1_b 1_c 1_d 1_e 1_f
index
1_a 1.00 0.900 0.75 0.736 0.900 0.400
1_b 0.9000 1.000 0.72 0.715 0.900 0.508
1_c 0.756 0.342 1.000 0.005 0.124 0.034
1_d 0.736 0.715 0.900 1.000 0.081 0.030
1_e 0.900 0.900 0.804 0.793 1.000 0.475
1_f 0.400 0.508 0.036 0.030 0.475 1.000
*I might have typo mistakes , the two diagonal sides suppose to be equal.
edit:
imports:
import scikit_posthocs as sp
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
scikit_posthocs' sign_plot() seems to create a QuadMesh (as does sns.heatmap). Setting an edge color to such a mesh will show horizontal and vertical lines for the full width and height of the mesh. To make the edges invisible in the "empty" region, they can be colored the same as the background (for example white). Making individual cells invisible can be done by setting their values to NaN such as in the code below.
Removing a column and a row (e.g. tukey.drop('1_f', axis=1, inplace=True) and
tukey.drop('1_a', axis=0, inplace=True)), doesn't help to make the plot a bit smaller because sign_plot adds them back in automatically.
import matplotlib.pyplot as plt
import scikit_posthocs as sp
import pandas as pd
import numpy as np
from io import StringIO
data_str = ''' 1_a 1_b 1_c 1_d 1_e 1_f
1_a 1.00 0.900 0.75 0.736 0.900 0.400
1_b 0.9000 1.000 0.72 0.715 0.900 0.508
1_c 0.756 0.342 1.000 0.005 0.124 0.034
1_d 0.736 0.715 0.900 1.000 0.081 0.030
1_e 0.900 0.900 0.804 0.793 1.000 0.475
1_f 0.400 0.508 0.036 0.030 0.475 1.000'''
tukey = pd.read_csv(StringIO(data_str), delim_whitespace=True)
cols = tukey.columns
for i in range(len(cols)):
for j in range(i, len(cols)):
tukey.iloc[i, j] = np.nan
fix, ax = plt.subplots(figsize=(12, 6))
heatmap_args = {'linewidths': 0.35, 'linecolor': 'white', 'clip_on': False, 'square': True,
'cbar_ax_bbox': [0.75, 0.35, 0.04, 0.3]}
sp.sign_plot(tukey, **heatmap_args)
plt.show()
I made dataframe and set column names by using np.arange(). However instead of exact numbers it (sometimes) sets them to numbers like 0.300000004.
I tried both rounding entire dataframe and using np.around() on np.arange() output but none of these seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is return statement of my function:
stepT = 0.1
%net is some numpy array
return pd.DataFrame(net, columns = np.arange(0,1+stepT, stepT),
index = np.around(np.arange(0,1+stepS,stepS),decimals = 3)).round(3)
Is there any function that will allow me to have these names as numbers with only one digit after comma?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how the arrays are displayed. but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64