I made a DataFrame and set its column names using np.arange(). However, instead of exact numbers, it (sometimes) sets them to values like 0.300000004.
I tried both rounding the entire DataFrame and calling np.around() on the np.arange() output, but neither seems to work.
I also tried adding these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns = np.arange(0,1+stepT, stepT),
index = np.around(np.arange(0,1+stepS,stepS),decimals = 3)).round(3)
Is there any function that will let me have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup, which needs an exact match. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
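If you do need to keep float labels, a tolerance-based comparison is one way around the exact-match problem. The following is only a sketch, assuming the same arange as above; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

step = 0.1
cols = np.arange(0, 1 + step, step)

# Exact equality fails for the fourth element, as shown above.
exact_match = cols[3] == 0.3
# np.isclose compares within a small tolerance instead.
close_match = np.isclose(cols[3], 0.3)
print(exact_match, close_match)

# Same idea for picking a column: build a boolean mask over the labels.
df = pd.DataFrame(np.arange(11)[None, :], columns=cols)
mask = np.isclose(df.columns.to_numpy(), 0.3)
col = df.loc[:, mask]
print(col)
```

This avoids the KeyError, but it is more awkward than a plain label lookup, which is part of why string labels are suggested next.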
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
I am having difficulty with this. I have the results from my initial model (`Unfiltered`), which I plot like so:
df = pd.DataFrame(
{'class': ['foot', 'bike', 'bus', 'car', 'metro'],
'Precision': [0.7, 0.66, 0.41, 0.61, 0.11],
'Recall': [0.58, 0.35, 0.13, 0.89, 0.02],
'F1-score': [0.64, 0.45, 0.2, 0.72, 0.04]}
)
groups = df.melt(id_vars=['class'], var_name=['Metric'])
sns.barplot(data=groups, x='class', y='value', hue='Metric')
To produce this nice plot:
Now I have obtained a second set of results from my improved model (Filtered), so I added a column (status) to my df to indicate which model each row comes from, like this:
df2 = pd.DataFrame(
{'class': ['foot','foot','bike','bike','bus','bus',
'car','car','metro','metro'],
'Precision': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
'status': ['Unfiltered', 'Filtered', 'Unfiltered','Filtered','Unfiltered',
'Filtered','Unfiltered','Filtered','Unfiltered','Filtered']}
)
df2.head()
class Precision Recall F1-score status
0 foot 0.70 0.58 0.64 Unfiltered
1 foot 0.62 0.93 0.74 Filtered
2 bike 0.66 0.35 0.45 Unfiltered
3 bike 0.96 0.40 0.56 Filtered
4 bus 0.41 0.13 0.20 Unfiltered
I want to plot this with the same grouping as above (i.e. foot, bike, bus, car, metro). However, for each metric, I want to place the two values side by side. Take the foot group, for example: I would have two bars for Precision [Unfiltered, Filtered], then two bars for Recall [Unfiltered, Filtered], and two bars for F1-score [Unfiltered, Filtered]. Likewise for all the other groups.
My attempt:
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue='Metric')
Totally not what I want.
You can pass any sequence to hue, as long as it has the same length as your data, and colours are assigned from it.
So you could try with
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue=group2[['Metric','status']].agg(tuple, axis=1))
plt.legend(fontsize=7)
But the result is a bit hard to read:
Seaborn grouped barplots don't allow for multiple grouping variables. One workaround is to recode the two grouping variables (Metric and status) as one variable with 6 levels. Another possibility is to use facets. If you are open to another plotting package, I might recommend plotnine, which allows multiple grouping variables as follows:
import plotnine as p9
fig = (
p9.ggplot(group2)
+ p9.geom_col(
p9.aes(x="class", y="value", fill="Metric", color="Metric", alpha="status"),
position=p9.position_dodge(1),
size=1,
width=0.5,
)
+ p9.scale_color_manual(("red", "blue", "green"))
+ p9.scale_fill_manual(("red", "blue", "green"))
)
fig.draw()
This generates the following image:
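For the first workaround mentioned in this answer (recoding Metric and status into one variable), a minimal pandas sketch might look like the following. It uses a shortened, illustrative version of the question's data, and the combined column name Metric_status is my own invention.

```python
import pandas as pd

# Shortened, illustrative version of the question's df2.
df2 = pd.DataFrame({
    'class': ['foot', 'foot', 'bike', 'bike'],
    'Precision': [0.7, 0.62, 0.66, 0.96],
    'Recall': [0.58, 0.93, 0.35, 0.4],
    'status': ['Unfiltered', 'Filtered', 'Unfiltered', 'Filtered'],
})

# Long format: one row per (class, status, Metric) combination.
group2 = df2.melt(id_vars=['class', 'status'], var_name='Metric')

# Recode the two grouping variables as a single string column.
group2['Metric_status'] = group2['Metric'] + ' (' + group2['status'] + ')'
print(group2)
```

The combined column can then be passed as hue='Metric_status' to sns.barplot. For the facet option, something like sns.catplot(data=group2, x='class', y='value', hue='status', col='Metric', kind='bar') should give one panel per metric.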
I am fairly new to Python.
I am trying to convert a DataFrame to a list after changing the datatype of one column to integer. The odd thing is that, after converting to a list, that column still contains floats.
There are three columns in the DataFrame. The first two are floats and I want the last to be an integer, but it still comes out as a float.
If I change all of them to integer, then the list is created with integers.
0 1.53 3.13 0.0
1 0.58 2.83 0.0
2 0.28 2.69 0.0
3 1.14 2.14 0.0
4 1.46 3.39 0.0
... ... ... ...
495 2.37 0.93 1.0
496 2.85 0.52 1.0
497 2.35 0.39 1.0
498 2.96 1.68 1.0
499 2.56 0.16 1.0
Above is the Dataframe.
Below is the last column converted
#convert last column to integer datatype
data[6] = data[6].astype(dtype ='int64')
display(data.dtypes)
The below is converting the dataframe to list.
#Turn DF to list
data_to_List = data.values.tolist()
data_to_List
#below is what is shown now.
[[1.53, 3.13, 0.0],
[0.58, 2.83, 0.0],
[0.28, 2.69, 0.0],
[1.14, 2.14, 0.0],
[3.54, 0.75, 1.0],
[3.04, 0.15, 1.0],
[2.49, 0.15, 1.0],
[2.27, 0.39, 1.0],
[3.65, 1.5, 1.0],
I want the last column to be just 0 or 1 and not 0.0 or 1.0
Yes, you are correct: pandas converts the int column to float when you use data.values, because the result is a single NumPy array with one common dtype.
You can convert the floats back to int with the list comprehension below:
data_to_List = [[x[0],x[1],int(x[2])] for x in data.values.tolist()]
print(data_to_List)
[[1.53, 3.13, 0],
[0.58, 2.83, 0],
[0.28, 2.69, 0],
[1.14, 2.14, 0],
[1.46, 3.39, 0]]
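As an alternative that does not hard-code the column positions, iterating over rows with itertuples should keep each column's own type. This is a sketch on a small made-up frame, not the asker's actual data.

```python
import pandas as pd

# Small made-up frame with the same shape as in the question.
data = pd.DataFrame({0: [1.53, 0.58], 1: [3.13, 2.83], 2: [0.0, 1.0]})
data[2] = data[2].astype('int64')

# .values builds one array with a common dtype, so ints become floats.
common_dtype = data.values.dtype
print(common_dtype)

# itertuples yields one tuple per row, built column by column,
# so the integer column stays integer.
rows = [list(row) for row in data.itertuples(index=False, name=None)]
print(rows)
```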
I have a DataFrame that I would like to make a strip plot from. It contains the following:
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on. The number of elements in each 'Sentiment' list is the same as the number of mentions. I would like to make a strip plot with Symbol on the x axis and Sentiment on the y axis. I believe the problem I'm encountering is caused by the lists having different lengths. The actual error I'm getting is
ValueError: setting an array element with a sequence.
The code that I'm using to create the strip plot is this:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x='Symbol',y='Sentiment',data=dataset.loc[:9])
    plt.show()
The other part of my issue, I would guess, has to do with numpy trying to build a multidimensional array from lists of different lengths before the data reaches seaborn, but I'm not 100% sure about that. If the solution is to plot one row at a time and then merge the plots, that would definitely work, but I'm not sure what exactly to call to do that, because trying it with the following doesn't seem to work either.
def symbolSentimentVisualization(dataset):
    sns.stripplot(x=dataset['Symbol'][0],y=dataset['Sentiment'][0],data=dataset.loc[:9])
    plt.show()
IIUC, explode 'Sentiment' first, then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
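To make the effect of explode concrete, here is a small sketch on made-up values (not the asker's data). The astype(float) step is my own addition: explode leaves the column with object dtype, and casting it is a cautious way to make sure the values are treated as numbers.

```python
import pandas as pd

# Made-up lists of different lengths, one list per symbol.
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME'],
    'Sentiment': [[-0.38, 0.8, 0.13], [0.65, -0.91]],
})

# One row per list element; the Symbol value is repeated for each one.
long_df = df.explode('Sentiment').reset_index(drop=True)

# The exploded column has object dtype, so cast it to float.
long_df['Sentiment'] = long_df['Sentiment'].astype(float)
print(long_df)
```

After this step every row holds a single number, so the lists of different lengths no longer matter.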
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph:
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
I have segmented roadway data that looks like this:
import pandas as pd
input_df = pd.DataFrame({
'ROUTE': ['US9', 'US9', 'US9', 'US9', 'US9'],
'BMP': [0.0, 0.1, 0.2, 0.3, 0.4],
'EMP': [0.1, 0.2, 0.3, 0.4, 0.5],
'VALUE': [19, 19, 232, 232, 19]
})
>>> print(input_df)
BMP EMP ROUTE VALUE
0.0 0.1 US9 19
0.1 0.2 US9 19
0.2 0.3 US9 232
0.3 0.4 US9 232
0.4 0.5 US9 19
The BMP column represents the begin milepoint of this attribute along a linear referenced GIS representation of the road. The EMP is the associated end mileage. When the VALUE column is equal, I would like to combine adjacent segments.
There is a tool that does this operation in ArcGIS called Dissolve Route Events. I would like to use Pandas to complete this task. Here's the desired output:
output_df = pd.DataFrame({
'ROUTE': ['US9', 'US9', 'US9'],
'BMP': [0.0, 0.2, 0.4],
'EMP': [0.2, 0.4, 0.5],
'VALUE': [19, 232, 19]
})
>>> print(output_df)
BMP EMP ROUTE VALUE
0.0 0.2 US9 19
0.2 0.4 US9 232
0.4 0.5 US9 19
Try this!
input_df['trip'] = (input_df.VALUE.diff() != 0).cumsum()
output_df = input_df.groupby(['ROUTE','trip','VALUE']).agg({'BMP':'first','EMP':'last'})
output_df.reset_index()
#
ROUTE trip VALUE BMP EMP
0 US9 1 19 0.0 0.2
1 US9 2 232 0.2 0.4
2 US9 3 19 0.4 0.5
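To see why this works, it can help to look at the intermediate run labels. The sketch below repeats the same idea on the question's data and prints the trip column. The named-aggregation form and as_index=False are just my choice of spelling, not part of the answer above.

```python
import pandas as pd

input_df = pd.DataFrame({
    'ROUTE': ['US9', 'US9', 'US9', 'US9', 'US9'],
    'BMP': [0.0, 0.1, 0.2, 0.3, 0.4],
    'EMP': [0.1, 0.2, 0.3, 0.4, 0.5],
    'VALUE': [19, 19, 232, 232, 19],
})

# diff() is non-zero (or NaN on the first row) wherever VALUE changes,
# so the cumulative sum gives each run of equal values its own id.
input_df['trip'] = (input_df['VALUE'].diff() != 0).cumsum()
print(input_df['trip'].tolist())

# Within each run, keep the first begin milepoint and the last end milepoint.
output_df = (
    input_df.groupby(['ROUTE', 'trip', 'VALUE'], as_index=False)
    .agg(BMP=('BMP', 'first'), EMP=('EMP', 'last'))
)
print(output_df)
```

Note that this only looks at VALUE changes in row order. It assumes the rows are already sorted by milepoint and that adjacent rows are contiguous segments.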
I have a data frame with the below structure:
Ranges Relative_17-Aug Relative_17-Sep Relative_17-Oct
0 (0.0, 0.1] 1372 1583 1214
1 (0.1, 0.2] 440 337 648
2 (0.2, 0.3] 111 51 105
3 (0.3, 0.4] 33 10 19
4 (0.4, 0.5] 16 4 9
5 (0.5, 0.6] 7 7 1
6 (0.6, 0.7] 4 3 0
7 (0.7, 0.8] 5 1 0
8 (0.8, 0.9] 2 3 0
9 (0.9, 1.0] 2 0 1
10 (1.0, 2.0] 6 0 2
I am trying to replace the values in the Ranges column using a dictionary with the code below, but it is not working. Any hints as to what I am doing wrong?
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"].replace(mydict,inplace=True)
Thanks!
I think the best approach here is to use the labels parameter of cut when creating the Ranges column:
labels = ['<=10%','>10% and <20%', ...]
# replace with your own bins
bins = [0,0.1,0.2...]
t_df['Ranges'] = pd.cut(t_df['col'], bins=bins, labels=labels)
If that is not possible, casting to string should help, as @Dark suggested in the comments. For better performance, use map:
t_df["Ranges"] = t_df["Ranges"].astype(str).map(mydict)
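The likely reason the string-keyed dictionary does not match is that cut produces Interval objects, not strings. Here is a small sketch, assuming made-up values, bins and labels (none of them taken from the asker's real data).

```python
import pandas as pd

values = pd.Series([0.05, 0.15, 0.25, 1.5])    # made-up sample values
bins = [0, 0.1, 0.2, 0.3, 2.0]                 # hypothetical bin edges
labels = ['<=10%', '>10% and <20%', '>20% and <30%', '>30%']  # illustrative

# Without labels, each element is a pandas Interval, not a string,
# so a dictionary keyed on strings like "(0.0, 0.1]" finds nothing.
ranges = pd.cut(values, bins=bins)
first = ranges.iloc[0]
print(type(first).__name__)

# With labels, cut assigns the readable names directly.
labelled = pd.cut(values, bins=bins, labels=labels)
print(labelled.tolist())
```

There must be exactly one label per bin, i.e. one fewer label than bin edges.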
With the map function this can be done easily and in a straightforward manner, as shown below:
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"] = t_df["Ranges"].map(lambda x : mydict[str(x)])
Hope this helps..!!