issue when loading a data file with numpy - python

I want to train a classifier with scikit, but for doing this first I need to load the corresponding data. I am using the following data file available in:
https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/
When I open it in word it has the following contents:
ADT1_YEAST 0.58 0.61 0.47 0.13 0.50 0.00 0.48 0.22 MIT
ADT2_YEAST 0.43 0.67 0.48 0.27 0.50 0.00 0.53 0.22 MIT
ADT3_YEAST 0.64 0.62 0.49 0.15 0.50 0.00 0.53 0.22 MIT
AAR2_YEAST 0.58 0.44 0.57 0.13 0.50 0.00 0.54 0.22 NUC
Each file is separated by a double space and every line with a return carriage.
I want to read it with the following command:
f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")
and at the end I want to be able to use the following:
X = data[:,:-1] # select all columns except the last
y = data[:, -1] # select the last column
for using:
X_train, X_test, y_train, y_test = train_test_split(X, y)
but when I try to read it the following error appears:
ValueError: could not convert string to float: ADT1_YEAST
so how can I read this file in Python to use later the MLPClassifier?
Thanks

You can skip the f=open(...), and you can to use dtype='O' to make sure numpy reads it as an mix of numericals and strings. Because of some inconsistancies in the data structure in the file you linked, it's best to use genfromtxt instead of loadtxt:
data = np.genfromtxt('yeast.data',dtype='O')
>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
[b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
[b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
...,
[b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
[b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
[b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)
>>> data.shape
(1484, 10)
You can change the dtypes when you call genfromtxt (see documentation), or you can change them manually after like this:
data[:,0] = data[:,0].astype(str)
data[:,1:-1]= data[:,1:-1].astype(float)
data[:,-1] = data[:,-1].astype(str)
>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
...,
['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)

Related

seaborn: barplot of a dataframe by group

I am having difficulty with this. I have the results from my initial model (`Unfiltered´), that I plot like so:
df = pd.DataFrame(
{'class': ['foot', 'bike', 'bus', 'car', 'metro'],
'Precision': [0.7, 0.66, 0.41, 0.61, 0.11],
'Recall': [0.58, 0.35, 0.13, 0.89, 0.02],
'F1-score': [0.64, 0.45, 0.2, 0.72, 0.04]}
)
groups = df.melt(id_vars=['class'], var_name=['Metric'])
sns.barplot(data=groups, x='class', y='value', hue='Metric')
To produce this nice plot:
Now, I obtained a second results from my improved model (filtered), so I add a column (status) to my df to indicate the results from each model like this:
df2 = pd.DataFrame(
{'class': ['foot','foot','bike','bike','bus','bus',
'car','car','metro','metro'],
'Precison': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
'status': ['Unfiltered', 'Filtered', 'Unfiltered','Filtered','Unfiltered',
'Filtered','Unfiltered','Filtered','Unfiltered','Filtered']}
)
df2.head()
class Precison Recall F1-score status
0 foot 0.70 0.58 0.64 Unfiltered
1 foot 0.62 0.93 0.74 Filtered
2 bike 0.66 0.35 0.45 Unfiltered
3 bike 0.96 0.40 0.56 Filtered
4 bus 0.41 0.13 0.20 Unfiltered
And I want to plot this, in similar grouping as above (i.e. foot, bike, bus, car, metro). However, for each of the metrics, I want to place the two values side-by-side. Take for example, the foot group, I would have two bars Precision[Unfiltered, filtered], then 2 bars for Recall[Unfiltered, filtered] and also 2 bars for F1-score[Unfiltered, filtered]. Likewise all other groups.
My attempt:
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue='Metric')
Totally not what I want.
You can pass in hue any sequence as long as it has the same length as your data, and assign colours through it.
So you could try with
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue=group2[['Metric','status']].agg(tuple, axis=1))
plt.legend(fontsize=7)
But the result is a bit hard to read:
Seaborn grouped barplots don't allow for multiple grouping variables. One workaround is to recode the two grouping variables (Metric and status) as one variable with 6 levels. Another possibility is to use facets. If you are open to another plotting package, I might recommend plotnine, which allows multiple grouping variables as follows:
import plotnine as p9
fig = (
p9.ggplot(group2)
+ p9.geom_col(
p9.aes(x="class", y="value", fill="Metric", color="Metric", alpha="status"),
position=p9.position_dodge(1),
size=1,
width=0.5,
)
+ p9.scale_color_manual(("red", "blue", "green"))
+ p9.scale_fill_manual(("red", "blue", "green"))
)
fig.draw()
This generates the following image:

Compare current row with next row in a DataFrame with pandas

I have a DataFrame called "DataExample" and an ascending sorted list called "normalsizes".
import pandas as pd
if __name__ == "__main__":
DataExample = [[0.6, 0.36 ,0.00],
[0.6, 0.36 ,0.00],
[0.9, 0.81 ,0.85],
[0.8, 0.64 ,0.91],
[1.0, 1.00 ,0.92],
[1.0, 1.00 ,0.95],
[0.9, 0.81 ,0.97],
[1.2, 1.44 ,0.97],
[1.0, 1.00 ,0.97],
[1.0, 1.00 ,0.99],
[1.2, 1.44 ,0.99],
[1.1, 1.21 ,0.99]]
DataExample = pd.DataFrame(data = DataExample, columns = ['Lx', 'A', 'Ratio'])
normalsizes = [0, 0.75, 1, 1.25, 1.5, 1.75 ,2, 2.25, 2.4, 2.5, 2.75, 3,
3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]
# for i in example.index:
#
# numb = example['Lx'][i]
What I am looking for is that each “DataExample [‘ Lx ’]” is analyzed and located within a range of normalsizes, for example:
For DataExample [‘Lx’] [0] = 0.6 -----> then it is between the interval of [0, 0.75] -----> 0.6> 0 and 0.6 <= 0.75 -----> so I take the largest of that interval, that is, 0.75. This for each row.
With this I should have the following result:
Lx A Ratio
1 0.36 0
1 0.36 0
1 0.81 0.85
1 0.64 0.91
1.25 1 0.92
1.25 1 0.95
1 0.81 0.97
1.25 1.44 0.97
1.25 1 0.97
1.25 1 0.99
1.25 1.44 0.99
1.25 1.21 0.99
numpy.searchsorted will get you what you want
import numpy as np
normalsizes = np.array(normalsizes) # convert to numpy array
DataExample["Lx"] = normalsizes[np.searchsorted(normalsizes, DataExample["Lx"])]

Plotting arrays with different lengths in seaborn

I have a dataframe that I would like to make a strip plot out of, the array consists of the following
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on, the number of elements in the list of 'Sentiment' are the same as the number of mentions, I would like to make a strip plot with the Symbol as the x axis and sentiment as the y axis, I believe the problem that I'm encountering is because of the different lengths of list, the actual error reading I'm getting is
ValueError: setting an array element with a sequence.
the code that I'm trying to use to create the strip plot is this
def symbolSentimentVisualization(dataset):
sns.stripplot(x='Symbol',y='Sentiment',data=dataset.loc[:9])
plt.show()
the other part of my issue I would guess has something to do with numpy trying to set multidimensional arrays with different lengths before being put into a seaborn plot, but not 100% on that, if the solution is to plot one row at a time and then merge plots that would definitely work but I'm not sure what exactly I should call to do that because trying it out with the following doesn't seem to work either.
def symbolSentimentVisualization(dataset):
sns.stripplot(x=dataset['Symbol'][0],y=dataset['Sentiment'][0],data=dataset.loc[:9])
plt.show()
IIUC explode 'Sentiment' first then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph:
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()

Regrid numpy array based on cell area

import numpy as np
from skimage.measure import block_reduce
arr = np.random.random((6, 6))
area_cell = np.random.random((6, 6))
block_reduce(arr, block_size=(2, 2), func=np.ma.mean)
I would like to regrid a numpy array arr from 6 x 6 size to 3 x 3. Using the skimage function block_reduce for this.
However, block_reduce assumes each grid cell has same size. How can I solve this problem, when each grid cell has a different size? In this case size of each grid cell is given by the numpy array area_cell
-- EDIT:
An example:
arr
0.25 0.58 0.69 0.74
0.49 0.11 0.10 0.41
0.43 0.76 0.65 0.79
0.72 0.97 0.92 0.09
If all elements of area_cell were 1, and we were to convert 4 x 4 arr into 2 x 2, result would be:
0.36 0.48
0.72 0.61
However, if area_cell is as follows:
0.00 1.00 1.00 0.00
0.00 1.00 0.00 0.50
0.20 1.00 0.80 0.80
0.00 0.00 1.00 1.00
Then, result becomes:
0.17 0.22
0.21 0.54
It seems you are still reducing by blocks, but after scaling arr with area_cell. So, you just need to perform element-wise multiplication between these two arrays and use the same block_reduce code on that product array, like so -
block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Alternatively, we can simply use np.mean after reshaping to a 4D version of the product array, like so -
m,n = arr.shape
out = (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Sample run -
In [21]: arr
Out[21]:
array([[ 0.25, 0.58, 0.69, 0.74],
[ 0.49, 0.11, 0.1 , 0.41],
[ 0.43, 0.76, 0.65, 0.79],
[ 0.72, 0.97, 0.92, 0.09]])
In [22]: area_cell
Out[22]:
array([[ 0. , 1. , 1. , 0. ],
[ 0. , 1. , 0. , 0.5],
[ 0.2, 1. , 0.8, 0.8],
[ 0. , 0. , 1. , 1. ]])
In [23]: block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Out[23]:
array([[ 0.1725 , 0.22375],
[ 0.2115 , 0.5405 ]])
In [24]: m,n = arr.shape
In [25]: (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Out[25]:
array([[ 0.1725 , 0.22375],
[ 0.2115 , 0.5405 ]])

Should stats.norm.pdf gives same result as stats.gaussian_kde in Python?

I was trying to estimate PDF of 1-D using gaussian_kde. However, when I plot pdf using stats.norm.pdf, it gives me different result. Please correct me if I am wrong, I think they should give quite similar result. Here's my code.
npeaks = 9
mean = np.array([0.2, 0.3, 0.38, 0.55, 0.65,0.7,0.75,0.8,0.82]) #peak locations
support = np.arange(0,1.01,0.01)
std = 0.03
pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in range(0,npeaks))
df = pd.DataFrame(support)
X = df.iloc[:,0]
min_x, max_x = X.min(), X.max()
plt.figure(1)
plt.plot(support,pkfun)
kernel = stats.gaussian_kde(X)
grid = 100j
X= np.mgrid[min_x:max_x:grid]
Z = np.reshape(kernel(X), X.shape)
# plot KDE
plt.figure(2)
plt.plot(X, Z)
plt.show()
Also, when I get the first derivative of stats.gaussian_kde was far from the original signal. However, the result of first derivative of stats.norm.pdf does make sense. So, I am assuming I might have error in my code above.
Value of X= np.mgrid[min_x:max_x:grid]:
[
0. 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
0.06060606 0.07070707 0.08080808 0.09090909 0.1010101 0.11111111
0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
0.18181818 0.19191919 0.2020202 0.21212121 0.22222222 0.23232323
0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
0.3030303 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
0.36363636 0.37373737 0.38383838 0.39393939 0.4040404 0.41414141
0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
0.66666667 0.67676768 0.68686869 0.6969697 0.70707071 0.71717172
0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
0.78787879 0.7979798 0.80808081 0.81818182 0.82828283 0.83838384
0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
0.96969697 0.97979798 0.98989899 1. ]
Value of X = df.iloc[:,0]:
[ 0. 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11
0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23
0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35
0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47
0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71
0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83
0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95
0.96 0.97 0.98 0.99 1. ]
In the row below you make pdf calculations in every peak-point along 100 datapoints with the std = 0,03. So you get a matrix with array with 100 elements per row then you summerize it elementwise, result:
Thus you get a graph with 9 narrow -because of std = 0,03- U-shape.
Are you sure, that this was your purpose with this row?
This will never get the similar graph as the kernel estimate base of the original data, result:
pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in
range(0,npeaks))

Categories