Pandas - Extracting value to basic Python float

I'm trying to extract a cell from a pandas DataFrame as a simple floating point number. My attempt:
prediction = pd.to_numeric(baseline.ix[(baseline['Weekday']==5) & (baseline['Hour'] == 8)]['SmsOut'])
However, this returns
128 -0.001405
Name: SmsOut, dtype: float64
I want it to just return a simple Python float: -0.001405
How can I do that?

The output is a Series with one value, so there are several possible solutions:
convert to a NumPy array with to_numpy and select the first value by indexing
select by position with iloc or iat
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction.to_numpy()[0])
print (prediction.iloc[0])
print (prediction.iat[0])
Sample:
baseline = pd.DataFrame({'Weekday':[5,3],
                         'Hour':[8,4],
                         'SmsOut':[-0.001405,6]}, index=[128,130])
print (baseline)
Hour SmsOut Weekday
128 8 -0.001405 5
130 4 6.000000 3
prediction = pd.to_numeric(baseline.loc[(baseline['Weekday'] ==5 ) &
(baseline['Hour'] == 8), 'SmsOut'])
print (prediction)
128 -0.001405
Name: SmsOut, dtype: float64
print (prediction.to_numpy()[0])
-0.001405
print (prediction.iloc[0])
-0.001405
print (prediction.iat[0])
-0.001405
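If the selection is guaranteed to match exactly one row, Series.item() is another option (a minimal sketch reusing the baseline frame above; an addition, not part of the original answer):
prediction = baseline.loc[(baseline['Weekday'] == 5) &
                          (baseline['Hour'] == 8), 'SmsOut']
print (prediction.item())  # -0.001405; raises ValueError unless exactly one element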

Related

How to avoid ValueError: could not convert string to float: '?'

This is ML code and I am a beginner.
X and y are the feature matrix and the class labels.
print(X.shape)
X.dtypes
output:
Age int64
Sex int64
chest pain type int64
Trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca object
thal object
dtype: object
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import KMeans  # needed for KMeans below
#Using ANOVA to create the new dataset with only best three selected features
X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y) #<-------- get error
X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
print("The dataset with best three selected features after using ANOVA:")
print(X_new_anova.head())
kmeans_anova = KMeans(n_clusters = 3).fit(X_new_anova)
labels_anova = kmeans_anova.labels_
#Counting the number of the labels in each cluster and saving the data into clustering_classes
clustering_classes_anova = {
    0: [0,0,0,0,0],
    1: [0,0,0,0,0],
    2: [0,0,0,0,0]
}
for i in range(len(y)):
    clustering_classes_anova[labels_anova[i]][y[i]] += 1
###Finding the most appeared label in each cluster and computing the purity score
purity_score_anova = (max(clustering_classes_anova[0])+max(clustering_classes_anova[1])+max(clustering_classes_anova[2]))/len(y)
print(f"Purity score of the new data after using ANOVA {round(purity_score_anova*100, 2)}%")
This is the error I got:
#Using ANOVA to create the new dataset with only best three selected features
----> 4 X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y)
5 X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
6 print("The dataset with best three selected features after using ANOVA:")
ValueError: could not convert string to float: '?'
I don't know what the "?" means.
Could you please tell me how to avoid this error?
The meaning of the '?' is that this string appears somewhere in your data file and cannot be converted to a float. Check your data file to make sure everything checks out; my guess is that whoever made it put a '?' wherever data could not be found.
You can delete a row using
df = df.drop(labels=3, axis=0)
# 3 is a placeholder for whatever row holds the '?';
# if row 40 has the '?', you would use labels=40
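A more systematic fix (a sketch, assuming the '?' markers stand for missing values; the filename is hypothetical): tell read_csv to treat '?' as NaN, or replace it after loading, then convert and drop (or impute) before fitting:
import numpy as np
import pandas as pd
X = pd.read_csv('heart.csv', na_values='?')  # '?' becomes NaN on read
# or, on an already-loaded frame:
X = X.replace('?', np.nan)
X['ca'] = pd.to_numeric(X['ca'])      # the object columns become numeric
X['thal'] = pd.to_numeric(X['thal'])
X = X.dropna()                        # or impute; drop the matching rows of y too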

Encoded target column shows only one category?

I am working on a multiclass classification problem. My target column has 4 classes: Low, Medium, High and Very High. When I try to encode it, value_counts() shows only 0, and I am not sure why.
value count in original data frame is :
High 18767
Very High 15856
Medium 9212
Low 5067
Name: physician_segment, dtype: int64
I have tried the methods below to encode my target column.
Using the replace() method:
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
df1['physician_segment'] = df1['physician_segment'].replace(target_enc)
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using the factorize() method:
from pandas.api.types import CategoricalDtype
df1['physician_segment'] = df1['physician_segment'].factorize()[0]
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
Using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df1['physician_segment'] = labelencoder.fit_transform(df1['physician_segment'])
df1['physician_segment'].value_counts()
0 48902
Name: physician_segment, dtype: int64
With all three techniques I get only one class, 0; the length of the dataframe is 48902.
Can someone please point out what I am doing wrong?
I want my target column to have the values 0, 1, 2, 3.
target_enc = {'Low':0,'Medium':1,'High':2,'Very High':3}
df1['physician_segment'] = df1['physician_segment'].astype(object)
After that, create/define a function:
def func(val):
    if val in target_enc.keys():
        return target_enc[val]
and finally use the apply() method:
df1['physician_segment'] = df1['physician_segment'].apply(func)
Now if you print df1['physician_segment'].value_counts() you will get the correct output.
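If apply() with the same mapping still yields a single class, a likely cause (an assumption worth checking, not something the question confirms) is that the stored strings don't exactly match the mapping keys, e.g. stray whitespace or different casing. A quick diagnostic sketch:
print (df1['physician_segment'].unique())  # printed reprs expose hidden whitespace
cleaned = df1['physician_segment'].astype(str).str.strip().str.title()
df1['physician_segment'] = cleaned.map({'Low': 0, 'Medium': 1,
                                        'High': 2, 'Very High': 3})
print (df1['physician_segment'].value_counts())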

python : conversion a dataframe column with commas and $ into float

I'm trying to convert a column of prices in a dataframe into float and then calculate the mean of the first 5 rows.
First I did it successfully this way:
import pandas as pd
import numpy as np
paris_listing = pd.read_csv("C:../.../.../paris_airbnb.csv")
stripped_commas = paris_listing["price"].str.replace(",", "")
stripped_dollars = stripped_commas.str.replace("$", "")
paris_listing["price"] = stripped_dollars.astype("float")
mean_price = paris_listing.iloc[0:5]["price"].mean()
print (mean_price)
but when I tried to make a function and apply it to the dataframe, it didn't work:
def conversion_price(price_conv):
    price_conv = price_conv.str.replace(",", "")
    price_conv = price_conv.str.replace("$", "")
    price_conv = price_conv.astype("float")
    price_mean = price_conv.iloc[0:5].mean()

paris_listing["converted_price"] = paris_listing["price"].apply(conversion_price)
Could you please try the following instead of the second and third lines of the function? apply() passes each value in as a plain string, so use str.replace and float rather than the Series .str methods (and strip the commas too):
price_conv = float(price_conv.replace(",", "").replace("$", ""))
Your question is a bit confusing: do you want all the rows to have the mean of the first 5 prices, or the mean of the next five prices? Anyway, here's the code to calculate the mean of the next 5 prices. The get_mean function returns mean(present_index to present_index+4).
def get_mean(row):
    index = df[df == row].dropna().index
    if index + 4 in df.index:
        index_list = range(index, index + 5)
        price_mean = np.mean([df.loc[index, 'price'] for index in index_list])
        return price_mean
    return np.NaN

paris_listing['price'] = paris_listing['price'].str.replace(r'[$,]', '').astype('float')
paris_listing["converted_price"] = paris_listing.apply(get_mean, axis=1)
Following statement can be used to find the mean of just the first 5 rows
mean = df.price[0:5].mean()
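For what it's worth, a rolling window computes the next-5 mean more idiomatically (a sketch, assuming the price column has already been converted to float as above): rolling(5).mean() at row i averages rows i-4..i, and shift(-4) realigns that to the mean of rows i..i+4:
next5_mean = paris_listing['price'].rolling(5).mean().shift(-4)
paris_listing['converted_price'] = next5_mean  # NaN for the last 4 rows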
Thank you for your help :) I tried this function and it works well:
def convert_price(df):
    df = df.replace("$", "")
    df = df.replace(",", "")
    df = float(df)
    return df

converted_price = paris_listing["price"].apply(convert_price)
paris_listing["price"].head()
converted_price.head()
and i got this result :
1956 80.0
3735 67.0
6944 36.0
2094 120.0
2968 60.0
Name: price, dtype: float64
1956 80.0
3735 67.0
6944 36.0
2094 120.0
2968 60.0
Name: price, dtype: float64
Otherwise I'd like to calculate the mean of the series (the result), but when I use
mean_price = df.mean()
I get this error:
AttributeError: 'float' object has no attribute 'mean'
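The error suggests mean() was called on a single float (e.g. on df inside convert_price, where df is one cell value), not on the Series. apply() returns a Series of floats, so call mean() on that result; a minimal sketch using the names above:
converted_price = paris_listing["price"].apply(convert_price)
mean_price = converted_price.mean()             # mean of the whole column
first5_mean = converted_price.iloc[0:5].mean()  # mean of the first 5 rows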

Maximum Consecutive Ones/Trues per year that also considers the boundaries (Start-of-year and End-of-year)

The title says most of it, i.e. find the maximum consecutive ones/1s (or Trues) for each year, and if the consecutive 1s at the end of a year continue into the following year, merge them together.
I have tried to implement this, but it seems a bit of a 'hack', and I wonder if there is a better way to do it.
Reproducible Example Code:
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code to achieved wanted output
# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum()  # label each run of trues/falses with a number
    distinct = distinct[boolean_array]  # only consider trues from the distinct values
    consect = distinct.value_counts().max()  # count the length of each run of trues, then take the maximum
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
However, the output above is still not what we want, because the groupby cuts the data at each year boundary.
So the code below tries to 'fix' this by computing the consecutive ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31); if the boundary streak is larger than the original per-year maximum from the output above, we replace it.
# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]
# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change index to only contain years (rather than datetime index)
# Also for "start_of_year" array include -1 to the years when setting the index.
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Finally) Compute the consecutive 1s/trues at the boundaries
# for each matched year.
# Copy once, before the loop, so updates from earlier years are kept.
Modify_MaxConsecutive = MaxConsecutive.copy()
for year in matched_years:
    # Compute the number of consecutive 1s/trues at the start-of-year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum()  # label each run of trues/falses with a number
    distinct_masked = distinct[start]  # keep only the trues
    count_distincts = distinct_masked.value_counts()  # run label -> run length
    start_consecutive = count_distincts.loc[distinct_masked.min()]  # length of the run at the start of the year
    # Compute the number of consecutive 1s/trues at the previous end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum()
    distinct_masked = distinct[end]
    count_distincts = distinct_masked.value_counts()
    end_consecutive = count_distincts.loc[distinct_masked.max()]  # length of the run at the end of the year
    # Merge the streaks on both sides of the boundary
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Replace the original maximum if the boundary streak is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Now I've got the time. Here is my solution:
# Modules needed
import pandas as pd
import numpy as np
input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})
streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
streak_lengths = streak_ends - streak_starts
streak_df = df.iloc[streak_starts].copy()
streak_df["streak_length"] = streak_lengths
longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()
output:
dates
2000 9
2001 3
Name: streak_length, dtype: int64
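Briefly, the trick here (my gloss, not part of the original answer): a streak starts where the value is True and the previous value is False (shift(1)), and ends where the value is True and the next value is False (shift(-1)); since each streak is dated by the row where it starts, a run crossing New Year counts entirely toward the earlier year. The intermediate values for the sample input:
starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
print (list(ends - starts))  # [1, 9, 3]; the 9-run starts in 2000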
Not sure if this is the most efficient, but it's one solution:
arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index() # convert to df
df['adj_year'] = df['index'].dt.year # adj_year will be adjusted for streaks
mask = (df[0].eq(True)) & (df[0].shift().eq(True))
df.loc[mask, 'adj_year'] = np.NaN # we mark streaks as NaN and fill from above
df.adj_year = df.adj_year.fillna(method='ffill').astype('int')
df.groupby('adj_year').apply(lambda x: ((x[0] == x[0].shift()).cumsum() + 1).max())
# find max streak for each adjusted year
Output:
adj_year
2000 9
2001 3
dtype: int64
Note:
By convention, variable names in Python (except for classes) are lower case, so arr as opposed to InputArray.
1 and 0 are equivalent to True and False, so you can convert them to boolean without the explicit comparison.
The cumsum count is zero-based (as is usual in Python), so we add 1.
This solution doesn't answer the question exactly, so it won't be the final answer: it credits the max consecutive Trues at a boundary to both the current year and the following year.
boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))
distinct = boolean_array.ne(boolean_array.shift()).cumsum()
distinct_masked = distinct[boolean_array]
streak_sum = distinct_masked.value_counts()
streak_sum_series = pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()
Output:
max_consect
2000 9
2001 9
dtype: int64

Pandas - Retrieve Value from df.loc

Using pandas, I have a result (here aresult) from a df.loc lookup that Python tells me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
Code to retrieve one prediction:
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult comes back in a format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1], all to no avail. Before I do something awful like converting it to a string and stripping out the value, I thought I'd ask here.
Use .item() to retrieve the scalar value from a single-element Series:
print aresult.item()
1
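One caveat (an addition, not from the original answer): item() raises ValueError unless the Series holds exactly one element, so positional access is a safer fallback when the lookup may match several rows:
print(aresult.iloc[0])    # first match; works for one or many rows
print(aresult.values[0])  # equivalent via the underlying numpy array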
