How to avoid ValueError: could not convert string to float: '?' - python

This is ML code and I am a beginner.
X is the feature matrix and y holds the class labels.
print(X.shape)
X.dtypes
output:
Age int64
Sex int64
chest pain type int64
Trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca object
thal object
dtype: object
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, f_classif

# Using ANOVA to create the new dataset with only the best three selected features
X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X, y)  # <-------- get error
X_new_anova = pd.DataFrame(X_new_anova, columns=["Age", "Trestbps", "chol"])
print("The dataset with best three selected features after using ANOVA:")
print(X_new_anova.head())

kmeans_anova = KMeans(n_clusters=3).fit(X_new_anova)
labels_anova = kmeans_anova.labels_

# Counting the number of the labels in each cluster and saving the data into clustering_classes
clustering_classes_anova = {
    0: [0, 0, 0, 0, 0],
    1: [0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0],
}
for i in range(len(y)):
    clustering_classes_anova[labels_anova[i]][y[i]] += 1

# Finding the most frequent label in each cluster and computing the purity score
purity_score_anova = (max(clustering_classes_anova[0])
                      + max(clustering_classes_anova[1])
                      + max(clustering_classes_anova[2])) / len(y)
print(f"Purity score of the new data after using ANOVA: {round(purity_score_anova*100, 2)}%")
This is the error I got:
#Using ANOVA to create the new dataset with only best three selected features
----> 4 X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y)
5 X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
6 print("The dataset with best three selected features after using ANOVA:")
ValueError: could not convert string to float: '?'
I don't know what the "?" means here.
Could you please tell me how to avoid this error?

The '?' is a literal string somewhere in your data file that cannot be converted to a float (in your case it will be in the object-dtype columns ca and thal). Check the file to make sure everything is numeric; whoever created it most likely put a '?' wherever data could not be found.
You can delete a row using
DataFrame = DataFrame.drop(labels=3, axis=0)
Here 3 is a placeholder for whichever row holds the '?'; if row 40 has it, you would use 40.
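Building on the answer above, here is a minimal sketch of cleaning out the '?' placeholders before feature selection. The column names ca and thal come from the question; the sample values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's feature matrix; 'ca' and 'thal'
# are the object-dtype columns that hold the '?' placeholders.
X = pd.DataFrame({'Age': [63, 67, 37],
                  'ca': ['0', '?', '2'],
                  'thal': ['3', '6', '?']})

# Replace '?' with NaN, convert the affected columns to numeric,
# then drop (or impute) the incomplete rows.
X = X.replace('?', np.nan)
X[['ca', 'thal']] = X[['ca', 'thal']].apply(pd.to_numeric)
X = X.dropna()
print(X)
```

If you drop rows this way, remember to subset y with the same index (e.g. y = y[X.index]) so X and y stay aligned before calling fit_transform.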


Encoded target column shows only one category?

I am working on a multiclass classification problem. My target column has four classes: Low, Medium, High, and Very High. When I try to encode it, value_counts() reports only 0, and I am not sure why.
The value counts in the original data frame are:
High 18767
Very High 15856
Medium 9212
Low 5067
Name: physician_segment, dtype: int64
I have tried the methods below to encode my target column:
Using the replace() method:
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
df1['physician_segment'] = df1['physician_segment'].replace(target_enc)
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using the factorize() method:
from pandas.api.types import CategoricalDtype
df1['physician_segment'] = df1['physician_segment'].factorize()[0]
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df1['physician_segment'] = labelencoder.fit_transform(df1['physician_segment'])
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
With all three techniques I get only one class, 0; the length of the dataframe is 48902.
Can someone please point out what I am doing wrong?
I want my target column to have the values 0, 1, 2, 3.
target_enc = {'Low': 0, 'Medium': 1, 'High': 2, 'Very High': 3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
After that, define a function:
def func(val):
    if val in target_enc.keys():
        return target_enc[val]
and finally use the apply() method:
df1['physician_segment'] = df1['physician_segment'].apply(func)
Now if you print df1['physician_segment'].value_counts() you will get the correct output.
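For reference, a self-contained sketch of the same dictionary mapping on made-up data. Series.map (swapped in here for the apply approach) does the same lookup, and turns any value missing from target_enc, e.g. one with stray whitespace, into NaN, which makes such problems visible in value_counts():

```python
import pandas as pd

target_enc = {'Low': 0, 'Medium': 1, 'High': 2, 'Very High': 3}
s = pd.Series(['High', 'Very High', 'Medium', 'Low', 'High'])

# .map performs the same dict lookup as applying a function that
# indexes into target_enc; unmapped values would become NaN.
encoded = s.map(target_enc)
print(encoded.value_counts())
```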

if-else by column dtype in pandas

Formatting output from pandas
I'm trying to automate getting output from pandas into a format I can use with a minimum of messing about in a word processor. I'm using descriptive statistics as a practice case, so I'm working with the output of df[variable].describe(). My problem is that .describe() responds differently depending on the dtype of the column (if I'm understanding it properly).
In the case of a numerical column describe() produces this output:
count 306.000000
mean 36.823529
std 6.308587
min 10.000000
25% 33.000000
50% 37.000000
75% 41.000000
max 50.000000
Name: gses_tot, dtype: float64
However, for categorical columns, it produces:
count 306
unique 3
top Female
freq 166
Name: gender, dtype: object
Because of this difference, I need different code to capture the information I want; however, I can't get my code to work on the categorical columns.
What I've tried
I've tried a few different versions of:
for v in df.columns:
    if df[v].dtype.name == 'category':  # I've also tried 'object' here
        c, u, t, f = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')
    else:
        c, m, std, mi, tf, f, sf, ma = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
    print(f'\n')
The code in the else block works fine, but when I come to a categorical column I get the error below
******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
6 print(f'Percentage = {(f/c)*100}')
7 else:
----> 8 c, m, std, mi, tf, f, sf, ma, = df[v].describe()
9 print(f'******{str(v)}******')
10 print(f'M = {m}')
ValueError: not enough values to unpack (expected 8, got 4)
What I want to happen is something like
******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
******gender******
Largest category = female
Percentage = 52.2%
I believe the issue is how I'm setting up the if statement with the dtype. I've rooted around to try to find out how to access the dtype properly, but I can't seem to make it work.
Advice would be much appreciated.
You can check what fields are included in the output of describe and print the corresponding sections:
import pandas as pd
df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')

Pandas - Extracting value to basic python float

I'm trying to extract a cell from a pandas dataframe to a simple floating point number. I'm trying
prediction = pd.to_numeric(baseline.ix[(baseline['Weekday']==5) & (baseline['Hour'] == 8)]['SmsOut'])
However, this returns
128 -0.001405
Name: SmsOut, dtype: float64
I want it to just return a simple Python float: -0.001405
How can I do that?
The output is a Series with one value, so there are several possible solutions:
convert to a numpy array with to_numpy and select the first value by indexing
select by position with iloc or iat
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction.to_numpy()[0])
print (prediction.iloc[0])
print (prediction.iat[0])
Sample:
baseline = pd.DataFrame({'Weekday':[5,3],
'Hour':[8,4],
'SmsOut':[-0.001405,6]}, index=[128,130])
print (baseline)
Hour SmsOut Weekday
128 8 -0.001405 5
130 4 6.000000 3
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction)
128 -0.001405
Name: SmsOut, dtype: float64
print (prediction.to_numpy()[0])
-0.001405
print (prediction.iloc[0])
-0.001405
print (prediction.iat[0])
-0.001405

Pandas - Retrieve Value from df.loc

Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult comes back in a format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0], and aresult[1], all to no avail. Before I do something awful like converting it to a string and stripping it out, I thought I'd ask here.
A series requires .item() to retrieve its value.
print aresult.item()
1
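For completeness, here is a sketch on made-up data (ids borrowed from the question) showing .item() alongside the positional alternative .iloc[0]:

```python
import pandas as pd

predictionsfile = pd.DataFrame({'prediction': [1, 0],
                                'id': [593960337793155072, 991960332793155071]})

aresult = predictionsfile.loc[predictionsfile['id'] == 593960337793155072,
                              'prediction']

# .item() unwraps a single-element Series to a plain Python scalar;
# .iloc[0] takes the first element by position and works the same here.
print(aresult.item())   # -> 1
print(aresult.iloc[0])  # -> 1
```

Note that .item() raises a ValueError if the Series holds more (or fewer) than one element, which is a useful sanity check when you expect exactly one match.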

reading an ascii file with headers given in the first rows into a pandas dataframe

I have a huge set of catalogues with different columns and different header names for each column, where the description of each header name is given as a comment row at the beginning of my ASCII files. What is the best way to read them into a pandas.DataFrame so that the column names are set as well, without having to define them up front? The following is an example of my catalogues:
# 1 MAG_AUTO Kron-like elliptical aperture magnitude [mag]
# 2 rh half light radius (analyse) [pixel]
# 3 MU_MAX Peak surface brightness above background [mag * arcsec**(-2)]
# 4 FWHM_IMAGE FWHM assuming a gaussian core [pixel]
# 5 CLASS_STAR S/G classifier output
18.7462 4.81509 20.1348 6.67273 0.0286538
18.2440 7.17988 20.6454 21.6235 0.0286293
18.3102 3.11273 19.0960 8.26081 0.0430532
21.1751 2.92533 21.9931 5.52080 0.0290418
19.3998 1.86182 19.3166 3.42346 0.986598
20.0801 3.52828 21.3484 6.76799 0.0303842
21.9427 2.08458 22.0577 5.59344 0.981466
20.7726 1.86017 20.8130 3.69570 0.996121
23.0836 2.23427 23.3689 4.49985 0.706207
23.2443 1.62021 23.1089 3.54191 0.973419
20.6343 3.99555 21.9426 6.94700 0.0286164
23.4012 2.00408 23.3412 4.35926 0.946349
23.8427 1.54819 23.8241 3.83407 0.897079
20.3344 2.69910 20.9401 4.38988 0.0355277
21.7506 2.43451 22.2115 4.62045 0.0786921
This is a file in SExtractor format. The astropy.io.ascii reader understands this format natively, so it is a snap to read:
>>> from astropy.io import ascii
>>> dat = ascii.read('table.dat')
>>> dat
<Table masked=False length=3>
MAG_AUTO rh MU_MAX FWHM_IMAGE CLASS_STAR
mag mag / arcsec2 pix
float64 float64 float64 float64 float64
-------- ------- ------------- ---------- ----------
18.7462 4.81509 20.1348 6.67273 0.0286538
18.244 7.17988 20.6454 21.6235 0.0286293
18.3102 3.11273 19.096 8.26081 0.0430532
...
Note that using the astropy ASCII reader you get a table that also retains the unit meta data.
If you still want to convert this to a pandas dataframe that's easy as well with DataFrame(dat.as_array()). Version 1.1 of astropy (and the current master) will have methods to_pandas and from_pandas that make this conversion more robust (see http://astropy.readthedocs.org/en/latest/table/pandas.html).
OK, assuming all of your header info is encoded in exactly the same way, here's how I would do it:
import re
import pandas

COMMENT_CHAR = '#'
columns = []
with open('test.dat', 'r') as td:
    for line in td:
        # find the commented lines
        if line[0] == COMMENT_CHAR:
            info = re.split(' +', line)
            columns.append(info[2])
        # when we see the first line that doesn't start with
        # COMMENT_CHAR, we pass the remaining lines of the
        # file to pandas.read_table and break our loop
        else:
            _dfs = [
                pandas.DataFrame([line.split()], columns=columns, dtype=float),
                pandas.read_table(td, sep=r'\s+', header=None, names=columns),
            ]
            df = pandas.concat(_dfs, ignore_index=True)
            break
To break down the initial parsing a bit, re.split(' +', line) turns this:
# 1 MAG_AUTO Kron-like elliptical aperture magnitude [mag]
into
['#', '1', 'MAG_AUTO', 'Kron-like', 'elliptical', 'aperture', 'magnitude', '[mag]']
So we take the column name as the third element (index = 2).
All this produces a dataframe that looks like this:
print(df.head())
MAG_AUTO rh MU_MAX FWHM_IMAGE CLASS_STAR
0 18.7462 4.81509 20.1348 6.67273 0.0286538
1 18.2440 7.17988 20.6454 21.62350 0.028629
2 18.3102 3.11273 19.0960 8.26081 0.043053
3 21.1751 2.92533 21.9931 5.52080 0.029042
4 19.3998 1.86182 19.3166 3.42346 0.986598
And df.info() gives us:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 5 columns):
MAG_AUTO 15 non-null float64
rh 15 non-null float64
MU_MAX 15 non-null float64
FWHM_IMAGE 15 non-null float64
CLASS_STAR 15 non-null float64
dtypes: float64(5)
memory usage: 720.0 bytes
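A more compact variant of the same idea can be sketched with pandas alone: parse the names from the comment lines yourself, then let read_csv skip those lines with comment='#'. In-memory data via StringIO stands in for the file here:

```python
import re
from io import StringIO

import pandas as pd

# Two header comments and two data rows from the question's sample file.
raw = """\
# 1 MAG_AUTO Kron-like elliptical aperture magnitude [mag]
# 2 rh half light radius (analyse) [pixel]
18.7462 4.81509
18.2440 7.17988
"""

# The third whitespace-separated token of each comment line is the column name.
columns = [re.split(r' +', line)[2]
           for line in raw.splitlines() if line.startswith('#')]

# comment='#' drops the header lines; sep=r'\s+' handles runs of spaces.
df = pd.read_csv(StringIO(raw), comment='#', sep=r'\s+',
                 header=None, names=columns)
print(df)
```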
