I am working on a multiclass classification problem. My target column has 4 classes: Low, Medium, High and Very High. When I try to encode it, value_counts() shows only 0, and I am not sure why.
The value counts in the original data frame are:
High 18767
Very High 15856
Medium 9212
Low 5067
Name: physician_segment, dtype: int64
I have tried the methods below to encode my target column:
Using the replace() method:
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
df1['physician_segment'] = df1['physician_segment'].replace(target_enc)
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using the factorize() method:
from pandas.api.types import CategoricalDtype
df1['physician_segment'] = df1['physician_segment'].factorize()[0]
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df1['physician_segment'] = labelencoder.fit_transform(df1['physician_segment'])
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
With all three techniques I get only one class, 0; the length of the dataframe is 48902.
Can someone please point out what I am doing wrong?
I want my target column to have the values 0, 1, 2, 3.
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
After that, define a function:
def func(val):
    if val in target_enc.keys():
        return target_enc[val]
and finally use the apply() method:
df1['physician_segment'] = df1['physician_segment'].apply(func)
Now if you print df1['physician_segment'].value_counts(), you should get the correct output.
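The same dictionary lookup can also be written with Series.map, which is a minimal sketch of the idea above rather than the only way to do it. The small sample frame and the new column name physician_segment_enc are assumptions made for illustration. Writing to a new column keeps the original labels intact, which helps if an earlier attempt already overwrote the column (an assumption; the post does not say so).

```python
import pandas as pd

# Hypothetical sample standing in for the real df1
df1 = pd.DataFrame(
    {"physician_segment": ["High", "Very High", "Medium", "Low", "High"]}
)
target_enc = {"Low": 0, "Medium": 1, "High": 2, "Very High": 3}

# Inspect the raw labels first; stray spaces or different casing would not match the dict
print(df1["physician_segment"].unique())

# map() looks each label up in the dict; labels missing from the dict become NaN
encoded = df1["physician_segment"].map(target_enc)
if encoded.isna().any():
    raise ValueError("some labels are not covered by target_enc")

# Hypothetical new column, so the original labels are not overwritten
df1["physician_segment_enc"] = encoded.astype(int)
print(df1["physician_segment_enc"].value_counts().sort_index())
```

If the check raises, the unique() output shows which labels need to be added to the mapping.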
This is ML code and I am a beginner.
X is the feature matrix and y holds the class labels.
print(X.shape)
X.dtypes
output:
Age int64
Sex int64
chest pain type int64
Trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca object
thal object
dtype: object
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, f_classif
# Using ANOVA to create the new dataset with only the best three selected features
X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X, y) #<-------- get error
X_new_anova = pd.DataFrame(X_new_anova, columns=["Age", "Trestbps", "chol"])
print("The dataset with best three selected features after using ANOVA:")
print(X_new_anova.head())
kmeans_anova = KMeans(n_clusters=3).fit(X_new_anova)
labels_anova = kmeans_anova.labels_
# Counting the number of the labels in each cluster and saving the data into clustering_classes
clustering_classes_anova = {
    0: [0, 0, 0, 0, 0],
    1: [0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0]
}
for i in range(len(y)):
    clustering_classes_anova[labels_anova[i]][y[i]] += 1
# Finding the most frequent label in each cluster and computing the purity score
purity_score_anova = (max(clustering_classes_anova[0]) + max(clustering_classes_anova[1]) + max(clustering_classes_anova[2])) / len(y)
print(f"Purity score of the new data after using ANOVA {round(purity_score_anova*100, 2)}%")
This is the error I got:
#Using ANOVA to create the new dataset with only best three selected features
----> 4 X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y)
5 X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
6 print("The dataset with best three selected features after using ANOVA:")
ValueError: could not convert string to float: '?'
I don't know what the "?" means.
Could you please tell me how to avoid this error?
The '?' is a literal string somewhere in your data file that cannot be converted to a float. I would check the data file to make sure everything is in order. My guess is that whoever created it put a ? wherever a value could not be found.
You can delete a row using
df = df.drop(labels=3, axis=0)
# Here df stands for your DataFrame, and 3 is a placeholder for the index
# label of whichever row holds the '?'; if row 40 holds it, use labels=40.
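Instead of hunting for the row by hand, the '?' entries can be located and turned into missing values in code. This is only a sketch under assumptions: the tiny frame below is a hypothetical stand-in for X and y, and it assumes '?' shows up only in the two object columns, ca and thal, as the dtypes listing suggests.

```python
import pandas as pd

# Hypothetical stand-in for the real feature matrix X and labels y
X = pd.DataFrame({
    "Age": [63, 67, 41, 56],
    "ca": ["0", "3", "?", "1"],
    "thal": ["6", "?", "3", "7"],
})
y = pd.Series([0, 1, 0, 1])

# Count how many '?' entries each column contains
print((X == "?").sum())

# Convert the object columns to numbers; anything unparseable ('?') becomes NaN
X_clean = X.copy()
for col in ["ca", "thal"]:
    X_clean[col] = pd.to_numeric(X_clean[col], errors="coerce")

# Keep only complete rows, and drop the same rows from y so the two stay aligned
mask = X_clean.notna().all(axis=1)
X_clean = X_clean[mask]
y_clean = y[mask]
print(X_clean.dtypes)
```

Note that errors="coerce" turns every unparseable string into NaN, not only '?', so it is worth looking at the counts first. Filling the gaps with an imputed value instead of dropping rows is another option.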
I'm trying to extract a cell from a pandas dataframe to a simple floating point number. I'm trying
prediction = pd.to_numeric(baseline.ix[(baseline['Weekday']==5) & (baseline['Hour'] == 8)]['SmsOut'])
However, this returns
128 -0.001405
Name: SmsOut, dtype: float64
I want it to just return a simple Python float: -0.001405
How can I do that?
The output is a Series with one value, so there are several possible solutions:
convert it to a NumPy array with to_numpy and select the first value by indexing
select by position with iloc or iat
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction.to_numpy()[0])
print (prediction.iloc[0])
print (prediction.iat[0])
Sample:
baseline = pd.DataFrame({'Weekday':[5,3],
'Hour':[8,4],
'SmsOut':[-0.001405,6]}, index=[128,130])
print (baseline)
Hour SmsOut Weekday
128 8 -0.001405 5
130 4 6.000000 3
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction)
128 -0.001405
Name: SmsOut, dtype: float64
print (prediction.to_numpy()[0])
-0.001405
print (prediction.iloc[0])
-0.001405
print (prediction.iat[0])
-0.001405
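The selections above return a NumPy scalar (numpy.float64), which prints like a float. If a plain built-in float is needed, wrapping the result in float() is one way to get it. A small sketch reusing the sample frame; the names mask, value and as_float are introduced here for illustration only.

```python
import pandas as pd

baseline = pd.DataFrame(
    {"Weekday": [5, 3], "Hour": [8, 4], "SmsOut": [-0.001405, 6]},
    index=[128, 130],
)

mask = (baseline["Weekday"] == 5) & (baseline["Hour"] == 8)
value = baseline.loc[mask, "SmsOut"].iloc[0]
print(type(value))      # a NumPy scalar type

# Convert the NumPy scalar to a built-in Python float
as_float = float(value)
print(type(as_float), as_float)
```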
Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult comes back in a format that I cannot index by key:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1] all to no avail. Before I do something awful like converting it to a string and strip it out, I thought I'd ask here.
For a Series that holds exactly one value, .item() returns that value as a scalar.
print(aresult.item())
1
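One caveat: .item() raises a ValueError when the Series does not contain exactly one element, for example when the id is missing or appears twice. A hedged sketch of a guard around it follows; lookup_prediction is a hypothetical helper, and the frame is built inline from the two sample rows instead of reading predictions.csv.

```python
import pandas as pd

# Inline stand-in for predictions.csv, using the two sample rows from the question
predictionsfile = pd.DataFrame({
    "prediction": [1, 0],
    "id": [593960337793155072, 991960332793155071],
})


def lookup_prediction(frame, idtest):
    """Hypothetical helper: return the single prediction for idtest, or None."""
    matches = frame.loc[frame["id"] == idtest, "prediction"]
    if len(matches) != 1:
        # zero or several matches: .item() would raise ValueError here
        return None
    return matches.item()


print(lookup_prediction(predictionsfile, 593960337793155072))
print(lookup_prediction(predictionsfile, 42))
```

Returning None is just one design choice; raising a clearer error may suit a pipeline better.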
I have data streaming in the following format:
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data), sep=r'\s+', dtype='str')
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
print(df)
However, this results in:
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need the phone numbers. How do I get this output instead?
ANI/IP
0 5554447777
1 6665554444
2 3337775555
The regex \d{10} matches a run of exactly 10 digits, so this line
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
removes the phone numbers instead of keeping them!
Note: the astype('str') is not needed; the column already holds strings, which pandas reports as dtype object in the output below.
What you want is to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})', expand=False) # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Set this as another column and you're away:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})')
You could use Series.str.extract (pandas.core.strings.StringMethods.extract) to extract the numbers:
In [10]: df['ANI/IP'].str.extract(r"(\d{10})", expand=False)
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
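Put together as a self-contained script, the extraction could look like the sketch below. It builds the frame directly from the three sample strings rather than parsing the stream, and phone_number is the new column name used in the answer above. expand=False asks for a Series back; recent pandas versions return a one-column DataFrame by default.

```python
import pandas as pd

df = pd.DataFrame({
    "ANI/IP": [
        "sip:5554447777#10.94.2.15",
        "sip:10.66.7.34#6665554444",
        "sip:3337775555#10.94.2.11",
    ]
})

# Capture the first run of 10 consecutive digits in each string
df["phone_number"] = df["ANI/IP"].str.extract(r"(\d{10})", expand=False)
print(df)
```

Rows with no 10-digit run would get NaN in phone_number, so a quick isna() check afterwards is a cheap safeguard.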
I have pandas dataframe:
df.text[:3]
0 nena shot by me httptcodcrsfqyvh httpstcokxr...
1 full version of soulless httptcowfmcyyu
2 when youre having a good day but then get to w...
Name: text, dtype: object
Basically it is just a series of tweet texts, nothing more.
text = df.text
text.index
Int64Index([0, 1, 2, ...], dtype='int64')
Now I want to split the words in this series. It works just fine with this:
df.text.str.split('')
0 [nena shot by me httptcodcrsfqyvh httpstcokx...
1 [full version of soulless httptcowfmcyyu]
2 [when youre having a good day but then get to ...
But it does not work with the apply method:
df.text.apply(lambda x: x.split(' '))
It throws an exception: AttributeError: 'float' object has no attribute 'split'
What am I doing wrong, and why does the apply method take this int index as a parameter?
The same thing happens if I use df.text.map(lambda x: x.split(' '))
Update:
df[df.text == np.nan].shape
(0, 13)
And
df.text[:3]
0 nena shot by me httptcodcrsfqyvh httpstcokxr...
1 full version of soulless httptcowfmcyyu
2 when youre having a good day but then get to w...
Works just fine:
df.text[:3].map(lambda x: x.split())
0 [nena, shot, by, me, httptcodcrsfqyvh, httpstc...
1 [full, version, of, soulless, httptcowfmcyyu]
2 [when, youre, having, a, good, day, but, then,...
Name: text, dtype: object
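One possible explanation, assuming the column contains missing values further down (the excerpt only shows the first three rows): a missing entry is stored as NaN, which is a float, and that is what x.split would then be called on. The check df.text == np.nan cannot reveal this, because NaN never compares equal to anything, itself included; isna() is the usual test. A sketch on a hypothetical three-row sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing tweet
df = pd.DataFrame({"text": ["nena shot by me", np.nan, "full version of soulless"]})

# Comparing with == never matches NaN, so this count is always 0
print((df.text == np.nan).sum())

# isna() is the reliable way to count missing entries
print(df.text.isna().sum())

# Option 1: drop the missing rows before applying a Python function
words = df.text.dropna().apply(lambda x: x.split(" "))

# Option 2: the .str accessor skips missing entries and leaves them missing
words_all = df.text.str.split(" ")
print(words_all)
```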