Plotting arrays with different lengths in seaborn - python

I have a dataframe that I would like to make a strip plot out of. The dataframe consists of the following:
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on. The number of elements in each 'Sentiment' list is the same as the number of mentions. I would like to make a strip plot with Symbol as the x axis and sentiment as the y axis. I believe the problem I'm encountering is caused by the different list lengths; the actual error I'm getting is:
ValueError: setting an array element with a sequence.
The code I'm trying to use to create the strip plot is this:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x='Symbol', y='Sentiment', data=dataset.loc[:9])
    plt.show()
The other part of my issue, I would guess, has something to do with numpy trying to build a multidimensional array from lists of different lengths before the data is handed to seaborn, but I'm not 100% sure about that. If the solution is to plot one row at a time and then merge the plots, that would definitely work, but I'm not sure exactly what to call to do that, because trying it out with the following doesn't seem to work either:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x=dataset['Symbol'][0], y=dataset['Sentiment'][0], data=dataset.loc[:9])
    plt.show()

IIUC, explode 'Sentiment' first, then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph:
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
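One detail worth flagging with this approach: after `explode`, the exploded column keeps `object` dtype (each element is a boxed float), which can confuse seaborn's numeric axis handling in some versions. Casting to float before plotting is a cheap safeguard; a small sketch reusing the sample-data recipe above, with shorter lists for brevity:

```python
import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB'],
    'Mentions': [5, 4, 3]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)

df = df.explode('Sentiment')
# explode leaves the column as object dtype, one row per list element
df['Sentiment'] = df['Sentiment'].astype(float)
```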


seaborn: barplot of a dataframe by group

I am having difficulty with this. I have the results from my initial model (`Unfiltered`), which I plot like so:
df = pd.DataFrame(
    {'class': ['foot', 'bike', 'bus', 'car', 'metro'],
     'Precision': [0.7, 0.66, 0.41, 0.61, 0.11],
     'Recall': [0.58, 0.35, 0.13, 0.89, 0.02],
     'F1-score': [0.64, 0.45, 0.2, 0.72, 0.04]}
)
groups = df.melt(id_vars=['class'], var_name=['Metric'])
sns.barplot(data=groups, x='class', y='value', hue='Metric')
To produce this nice plot:
Now, I obtained a second set of results from my improved model (Filtered), so I added a column (status) to my df to indicate the results from each model, like this:
df2 = pd.DataFrame(
    {'class': ['foot', 'foot', 'bike', 'bike', 'bus', 'bus',
               'car', 'car', 'metro', 'metro'],
     'Precision': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
     'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
     'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
     'status': ['Unfiltered', 'Filtered', 'Unfiltered', 'Filtered', 'Unfiltered',
                'Filtered', 'Unfiltered', 'Filtered', 'Unfiltered', 'Filtered']}
)
df2.head()
  class  Precision  Recall  F1-score      status
0  foot       0.70    0.58      0.64  Unfiltered
1  foot       0.62    0.93      0.74    Filtered
2  bike       0.66    0.35      0.45  Unfiltered
3  bike       0.96    0.40      0.56    Filtered
4   bus       0.41    0.13      0.20  Unfiltered
And I want to plot this with the same grouping as above (i.e. foot, bike, bus, car, metro). However, for each of the metrics, I want to place the two values side by side. Take, for example, the foot group: I would have two bars for Precision [Unfiltered, Filtered], then two bars for Recall [Unfiltered, Filtered], and also two bars for F1-score [Unfiltered, Filtered]. Likewise for all other groups.
My attempt:
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue='Metric')
Totally not what I want.
You can pass any sequence to hue, as long as it has the same length as your data, and assign colours through it.
So you could try with
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue=group2[['Metric','status']].agg(tuple, axis=1))
plt.legend(fontsize=7)
But the result is a bit hard to read:
Seaborn grouped barplots don't allow for multiple grouping variables. One workaround is to recode the two grouping variables (Metric and status) as one variable with 6 levels. Another possibility is to use facets. If you are open to another plotting package, I might recommend plotnine, which allows multiple grouping variables as follows:
import plotnine as p9
fig = (
    p9.ggplot(group2)
    + p9.geom_col(
        p9.aes(x="class", y="value", fill="Metric", color="Metric", alpha="status"),
        position=p9.position_dodge(1),
        size=1,
        width=0.5,
    )
    + p9.scale_color_manual(("red", "blue", "green"))
    + p9.scale_fill_manual(("red", "blue", "green"))
)
fig.draw()
This generates the following image:
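If you'd rather stay inside seaborn, the facet route mentioned above can be sketched with `catplot`, putting status on the facet columns instead of encoding it inside the bars (a sketch, not the answerer's code):

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

df2 = pd.DataFrame(
    {'class': ['foot', 'foot', 'bike', 'bike', 'bus', 'bus',
               'car', 'car', 'metro', 'metro'],
     'Precision': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
     'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
     'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
     'status': ['Unfiltered', 'Filtered'] * 5}
)
group2 = df2.melt(id_vars=['class', 'status'], var_name='Metric')

# one grouped barplot per status, sharing the y axis
g = sns.catplot(data=group2, kind='bar', x='class', y='value',
                hue='Metric', col='status')
plt.show()
```

This trades the 6-colour legend for two smaller panels, which is often easier to read when comparing models.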

How to change the colours in plotly heatmap

I have the following sample dataframe matrix, which I generated using some functions I created:
Loan_ID Gender Married Dependents Education
Loan_ID 1.000 NaN NaN 0.000 0.000
Gender NaN 1.000 NaN NaN NaN
Married 0.638 0.638 1.000 0.638 0.638
Dependents 0.000 0.000 0.000 1.000 0.000
Education 0.502 0.502 0.502 0.502 1.000
I am trying to use plotly to plot a heatmap, but with specific colours based on the values. If the value is less than 0.05, I want the cell to be green; if the value is greater than 0.05 but less than 0.1, I want the colour to be yellow. The conditional statement looks something like this:
colorscales = []
data_mask = df_mask.to_numpy()
for row in data_mask:
    for value in row:
        if np.isnan(value):
            color = "#f8fffa"
        elif float(value) < 0.05:
            color = "#10c13b"
        elif (float(value) > 0.05) and (float(value) < 0.1):
            color = '#fac511'
        colorscales.append(color)
I want the colours shown in the plotly heatmap to reflect these values. I have tried using colorscales and also the bgcolor in the figure layout, but nothing works. Any suggestions would be highly appreciated.
I solved this problem by passing a piecewise scale over [0, 1] to the colorscale argument, like below:
colorscale = [[0, "#10c13b"],
              [0.05, '#10c13b'],
              [0.051, '#fac511'],
              [0.1, "#fac511"],
              [0.11, "#f71907"],
              [0.2, "#f71907"],
              [0.3, "#f71907"],
              [0.4, "#f71907"],
              [0.5, "#f71907"],
              [0.6, "#f71907"],
              [0.7, "#f71907"],
              [0.8, "#f71907"],
              [0.9, "#f71907"],
              [1.0, "#f71907"]]
This solved my problem efficiently, in case anyone ever needs to do this!

How to find averages for each bin of a column

I have a data frame below:
import pandas as pd
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
I want to find the mean of column B for each bin of column A.
For example, if I want to create bins of 0.02 starting at the minimum value in column A, then the bins will be like this (inclusive):
1) 0.96-0.98
2) 0.99-1.01
3) 1.02-1.04
4) 1.05-1.07
The average of each bin will be
1) (16+7)/2 = 11.5
2) (7+22)/2 = 14.5
3) 0
4) (2+15)/2 = 8.5
Thus, the outcome will look like:
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16], 'Avg':[8.5, 14.5, 14.5, 11.5, 8.5, 11.5]}
df = pd.DataFrame(df)
LOGIC
You can build the bins, use them to group the data, and then apply transform on the groupby, which returns the result for each row rather than one row per group.
PS: pd.cut is used for binning, assuming you have an approach to get the desired bins.
SOLUTION
import pandas as pd
df = {"A": [1.06, 1.01, 0.99, 0.98, 1.05, 0.96], "B": [2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
df["cat"] = pd.cut(df["A"], [0.95, 0.98, 1.01, 1.04, 1.07], right=True)
print(df)
OUTPUT with Bins
A B cat
0 1.06 2 (1.04, 1.07]
1 1.01 7 (0.98, 1.01]
2 0.99 22 (0.98, 1.01]
3 0.98 7 (0.95, 0.98]
4 1.05 15 (1.04, 1.07]
5 0.96 16 (0.95, 0.98]
df["Avg"] = df.groupby("cat")["B"].transform("mean")
# you can directly groupby pd.cut without making new column
print(df)
FINAL OUTPUT with average
A B cat Avg
0 1.06 2 (1.04, 1.07] 8.5
1 1.01 7 (0.98, 1.01] 14.5
2 0.99 22 (0.98, 1.01] 14.5
3 0.98 7 (0.95, 0.98] 11.5
4 1.05 15 (1.04, 1.07] 8.5
5 0.96 16 (0.95, 0.98] 11.5
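As the comment in the code above notes, the intermediate `cat` column isn't required; passing the pd.cut result straight to groupby produces the same averages without adding a column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.06, 1.01, 0.99, 0.98, 1.05, 0.96],
                   "B": [2, 7, 22, 7, 15, 16]})

# group directly on the binned values without storing them in df
bins = pd.cut(df["A"], [0.95, 0.98, 1.01, 1.04, 1.07], right=True)
df["Avg"] = df.groupby(bins)["B"].transform("mean")
print(df)
```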

How to set precision on column names made by np.arange()?

I made a dataframe and set its column names using np.arange(). However, instead of exact numbers it (sometimes) sets them to numbers like 0.30000000000000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1 + stepT, stepT),
                    index=np.around(np.arange(0, 1 + stepS, stepS), decimals=3)).round(3)
Is there any function that will allow me to have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
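Applying the same idea back to the question's return statement, with one digit after the decimal point as asked (f-strings instead of %-formatting, otherwise the same approach; `net` is a made-up stand-in for the question's array):

```python
import numpy as np
import pandas as pd

stepT = 0.1
cols = [f'{i:.1f}' for i in np.arange(0, 1 + stepT, stepT)]

# hypothetical stand-in for the question's `net` array
net = np.arange(len(cols))[None, :]
df = pd.DataFrame(net, columns=cols)

print(df.columns.tolist())   # clean one-decimal string labels
df['0.3']                    # exact string lookup, no float surprises
```

The trade-off is that the labels are strings, so any later numeric use of the column names needs a float() conversion.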

How to summarize segmented roadway data in Python using Pandas similar to a GIS dissolve operation?

I have segmented roadway data that looks like this:
import pandas as pd
input_df = pd.DataFrame({
    'ROUTE': ['US9', 'US9', 'US9', 'US9', 'US9'],
    'BMP': [0.0, 0.1, 0.2, 0.3, 0.4],
    'EMP': [0.1, 0.2, 0.3, 0.4, 0.5],
    'VALUE': [19, 19, 232, 232, 19]
})
>>> print(input_df)
BMP EMP ROUTE VALUE
0.0 0.1 US9 19
0.1 0.2 US9 19
0.2 0.3 US9 232
0.3 0.4 US9 232
0.4 0.5 US9 19
The BMP column represents the begin milepoint of this attribute along a linear referenced GIS representation of the road. The EMP is the associated end mileage. When the VALUE column is equal, I would like to combine adjacent segments.
There is a tool that does this operation in ArcGIS called Dissolve Route Events. I would like to use Pandas to complete this task. Here's the desired output:
output_df = pd.DataFrame({
    'ROUTE': ['US9', 'US9', 'US9'],
    'BMP': [0.0, 0.2, 0.4],
    'EMP': [0.2, 0.4, 0.5],
    'VALUE': [19, 232, 19]
})
>>> print(output_df)
BMP EMP ROUTE VALUE
0.0 0.2 US9 19
0.2 0.4 US9 232
0.4 0.5 US9 19
Try this!
input_df['trip'] = (input_df.VALUE.diff() != 0).cumsum()
output_df = input_df.groupby(['ROUTE', 'trip', 'VALUE']).agg({'BMP': 'first', 'EMP': 'last'})
output_df.reset_index()
  ROUTE  trip  VALUE  BMP  EMP
0   US9     1     19  0.0  0.2
1   US9     2    232  0.2  0.4
2   US9     3     19  0.4  0.5
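One caveat with the plain diff() trick: it compares VALUE across the whole frame, so with more than one route the first segment of a new route can be merged into the previous route's run whenever the values happen to match. Comparing against a per-route shift avoids that; a sketch with a second, made-up route added to the sample data:

```python
import pandas as pd

input_df = pd.DataFrame({
    'ROUTE': ['US9', 'US9', 'US9', 'I87', 'I87'],
    'BMP':   [0.0, 0.1, 0.2, 0.0, 0.1],
    'EMP':   [0.1, 0.2, 0.3, 0.1, 0.2],
    'VALUE': [19, 19, 19, 19, 19],
})

# start a new run whenever VALUE changes *within* a route
new_run = input_df['VALUE'] != input_df.groupby('ROUTE')['VALUE'].shift()
input_df['trip'] = new_run.cumsum()

output_df = (input_df.groupby(['ROUTE', 'trip', 'VALUE'], as_index=False)
             .agg({'BMP': 'first', 'EMP': 'last'}))
print(output_df)
```

Here the two routes stay separate (two output rows) even though every VALUE is 19, whereas a frame-wide diff() would have collapsed them into one run per value change only.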
