Error could not convert string to float: '' - python

I have a column which does not want to change from object to float. The data from the xlsx file is always presented the same (as a number), but somehow only this column is seen as object.
The numbers in the column represent percentage using point(.) as a decimal placement.
xls3[' Vesturland'] = xls3[' Vesturland'].astype(float)
does not work. There are no special characters to replace (e.g. with str.replace()); I have tried that as well.
I dare not use
xls3[' Vesturland'] = pd.to_numeric(xls3[' Vesturland'])
because it changes all floats to NaN and the whole column is percentage values.
The only thing I can think of is that the number of decimals is not consistent, but that shouldn't really matter, or does it? I put a red arrow on the column I want to change to float.
I only get this error when I try to convert to float: could not convert string to float: ''. Searching for it in the context of my specific problem has not turned up any results yet.

You have empty strings in your pd.Series, which cannot be readily converted to a float data type. What you can do is check for them and remove them. An example script is:
import pandas as pd

a = pd.DataFrame([['a', 'b', 'c'], ['2.42', '', '3.285']]).T
a.columns = ['names', 'nums']
a['nums'] = a['nums'][a['nums'] != ''].astype(float)
Note: if you try to run a['nums'] = a['nums'].astype(float) before selecting the non-empty strings, the same error you mentioned will be thrown.
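If you would rather keep those rows, a minimal sketch (an alternative, not part of the answer above) is to coerce the blanks to NaN instead of dropping them:
import pandas as pd

a = pd.DataFrame({'names': ['a', 'b', 'c'], 'nums': ['2.42', '', '3.285']})
# errors='coerce' turns anything that cannot be parsed (including '') into NaN
a['nums'] = pd.to_numeric(a['nums'], errors='coerce')
print(a['nums'].dtype)  # float64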

First use this line to obtain the current dtypes:
col_dtypes = dict([(k, v.name) for k, v in dict(df.dtypes).items()])
Like so:
xls3 = pd.read_csv('path/to/file')
col_dtypes = dict([(k, v.name) for k, v in dict(xls3.dtypes).items()])
print(col_dtypes)
Copy the value that is printed.
It should be like this:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
Then, for the column whose datatype you know should not be object, change it to the required type ('int32', 'int64', 'float32' or 'float64').
Example:
The datatypes might be detected as:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
If we know Vesturland is supposed to be Float, then we can edit this to be:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64', ...
}
Now, with this snippet you can find the non-numeric values:
def clean_non_numeric_values(series, col_type):
    illegal_value_pos = []
    for i in range(len(series)):
        try:
            # just attempt the conversion; failure means the value is not numeric
            if col_type == 'int64' or col_type == 'int32':
                val = int(series[i])
            elif col_type == 'float32' or col_type == 'float64':
                val = float(series[i])
        except (ValueError, TypeError):
            illegal_value_pos.append(i)
            # series[i] = None  # We could set the illegal values to None
            #                   # to remove them later using xls3.dropna()
    return series, illegal_value_pos
# Now we will manually replace the dtype of the column Vesturland like so:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64'
}
for col in list(xls3.columns):
    if col_dtypes.get(col) in ['int32', 'int64', 'float32', 'float64']:
        series, illegal_value_pos = (
            clean_non_numeric_values(series=xls3[col], col_type=col_dtypes[col])
        )
        xls3[col] = series
        print(illegal_value_pos)
        if illegal_value_pos:
            illegal_rows = xls3.iloc[illegal_value_pos]
            # This will print all the illegal values.
            print(illegal_rows[col])
Now you can use this information to remove the non-numeric values from the dataframe.
Warning: Since this uses a for loop, it is slow but it will help you to remove the values you don't want.
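If the loop turns out to be too slow, a vectorized sketch of the same idea (my addition, not part of the answer above) is to let pd.to_numeric flag the bad rows:
# coerce the column; values that cannot be parsed become NaN
coerced = pd.to_numeric(xls3[' Vesturland'], errors='coerce')
# rows that had a value but failed to parse are the non-numeric ones
bad_rows = xls3[coerced.isna() & xls3[' Vesturland'].notna()]
print(bad_rows)
xls3[' Vesturland'] = coerced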

After much trial and error, I ended up opening the Excel sheet and deleting about 10 rows below the last data input. Then I unfroze the rows/columns, read it into Jupyter Notebook again, and now ALL OF THE DATA IS FLOAT. I don't know which change did the trick, but this is resolved now.
Thank you all that helped me here for your time and your attempts to solve this.

Sometimes a cell can simply be blank. Open your CSV file in Excel and use Ctrl+Shift+L to filter and check for blank cells. You can also count blank-looking values directly:
len([x for x in xls3[' Vesturland'] if x == ' '])
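A quick way to check for blank or whitespace-only cells from pandas itself (a sketch, assuming the column was read in as strings):
blank_mask = xls3[' Vesturland'].astype(str).str.strip() == ''
print(blank_mask.sum())      # how many blank-looking cells there are
print(xls3.loc[blank_mask])  # the offending rows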

Could it be that you do for i in range(0, len(tablename)) and you need len(tablename) - 1 because you start at 0?

Related

Can't convert string to float - Python Dash

I'm trying to convert a string into a float in a Dash callback, but when I run my code I get this error in my Dash app:
lati = float(lati[-1])
ValueError: could not convert string to float: 'float64'
I'm not getting this error in the terminal, though.
First, I need to extract the latitude (and longitude) number. I convert it to a string and split it, because I could not find a better way to get this number out of the CSV file using pandas.
Output:
# converting to string:
12 41.6796
Name: latitude, dtype: float64
# splitting:
['12', '', '', '', '41.6796']
# converting to float:
41.6796
This is the actual code:
@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    df = pd.read_csv("powerplant.csv")
    df = df[df.name == searchVar]
    # converting to string
    lati = str(df['latitude'])
    longi = str(df['longitude'])
    # splitting it
    lati = lati.split('\n', 1)
    lati = lati[0].split(' ', 4)
    longi = longi.split('\n', 1)
    longi = longi[0].split(' ', 4)
    # converting to float
    lati = float(lati[-1])
    longi = float(longi[-1])
I actually tested this code in another script and it worked just fine. Is there a better way to extract the latitude and longitude numbers?
The data can be downloaded from https://datasets.wri.org/dataset/globalpowerplantdatabase; here is an excerpt.
country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,estimated_generation_gwh
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017,,,,,,
AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017,,,,,,
ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305,Hydro,,,,1963.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021238,,,,,,,79.22851153039832
ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936,Hydro,,,,1958.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021241,,,,,,,82.52969951083159
The issue is the way you are accessing the values in a dataframe. Pandas allows you to access the data without having to parse the string representation.
You can access the row and the column in one call to .loc
If you know you will have a single value, you can call the squeeze method
>>> import pandas as pd
>>> from io import StringIO
>>> # data shortened for brevity
>>> df = pd.read_csv(StringIO("""country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude
... AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190
... AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787
... ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305
... ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936"""))
>>> searchVar = "Ulez"
>>> df.loc[df["name"] == searchVar, "latitude"] # here you have a pd.Series
3 41.6796
Name: latitude, dtype: float64
>>> df.loc[df["name"] == searchVar, "latitude"].squeeze() # here you have a scalar
41.6796
>>> df.loc[df["name"] == searchVar, "longitude"].squeeze()
19.8936
If for some reason you have several rows with the same name, you will get a Series back and not a scalar. But maybe it is a case where failure is what you want rather than passing ambiguous data.
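If you want to guard against that case explicitly, a small sketch (my addition, not part of the answer) is:
matches = df.loc[df["name"] == searchVar, ["latitude", "longitude"]]
if len(matches) == 1:
    lat, lon = matches.iloc[0]
else:
    # zero or several plants share this name; decide how you want to handle it
    raise ValueError(f"expected exactly one row for {searchVar!r}, got {len(matches)}")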
What you're looking at is a pandas.Series object, containing a single row of data, and you're trying to chop up its __repr__ to get at the value. There is no need for this. I'm not familiar with the Python version of plotly but I see that you have a callback so I've wrapped it up into a function (I'm not sure whether the case exists where the name can't be found):
import pandas as pd

def get_by_name(name):
    df = pd.read_csv('powerplants.csv')
    df = df[df['name'] == name]
    if not df.empty:
        return df[['latitude', 'longitude']].values.tolist()[0]
    return None, None

lat, lon = get_by_name('Kajaki Hydroelectric Power Plant Afghanistan')

Imputing the missing values string using a condition(pandas DataFrame)

Kaggle dataset (working on): New York Airbnb
Here is code using the raw data, to better explain the issue:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name","neighbourhood_group"]]
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entities.
like this:
if airbnb['host_name'].isnull():
    airbnb["neighbourhood_group"] == "Bronx"
    airbnb["host_name"] = "Vie"
elif:
    airbnb["neighbourhood_group"] == "Manhattan"
    airbnb["host_name"] = "Sonder (NYC)"
else:
    airbnb["host_name"] = "Michael"
(this is wrong, just to represent the output format I want)
I've tried using an if statement but I couldn't apply it correctly. Could you please help me solve this?
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict with values for "host_name" field using "neighbourhood_group" values as keys and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally, we just impute some default value into all remaining NA cells, in our case "Michael".
You can do it with:
import pandas as pd

ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

# apply row-wise and assign the result back to the column being filled
ornek["samp1"] = ornek.apply(filter_by_col, axis=1)

pandas astype python bool instead of numpy.bool_

I need to convert a pandas dataframe to a JSON object.
However
json.dumps(df.to_dict(orient='records'))
fails as the boolean columns are not JSON serializable since they are of type numpy.bool_. Now I've tried df['boolCol'] = df['boolCol'].astype(bool) but that still leaves the type of the fields as numpy.bool_ rather than the Python bool, which serializes to JSON with no problem.
Any suggestions on how to convert the columns without looping through every record and converting it?
Thanks
EDIT:
This is part of a whole sanitization of dataframes of varying content so they can be used as the JSON payload for an API. Hence we currently have something like this:
for cols in df.columns:
    if type(df[cols][0]) == pd._libs.tslibs.timestamps.Timestamp:
        df[cols] = df[cols].astype(str)
    elif type(df[cols][0]) == numpy.bool_:
        df[cols] = df[cols].astype(bool)  # still numpy bool afterwards!
Just tested it out, and the problem seems to be caused by the orient='records' parameter. It seems you have to set it to another option (e.g. 'list') and convert the result to your preferred format.
import json
import numpy as np
import pandas as pd

column_name = 'bool_col'
bool_df = pd.DataFrame(np.array([True, False, True]), columns=[column_name])
list_repres = bool_df.to_dict('list')
record_repres = [{column_name: values} for values in list_repres[column_name]]
json.dumps(record_repres)
You need to use .astype and set its field dtype to object
See example below:
df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, None]
})
df['revoked'] = False
df.revoked = df.revoked.astype(bool)
print('setting astype as bool:', type(df.iloc[0]['revoked']))
df.revoked = df.revoked.astype(object)
print('setting astype as object:', type(df.iloc[0]['revoked']))
>>> setting astype as bool: <class 'numpy.bool_'>
>>> setting astype as object: <class 'bool'>
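For the original json.dumps goal, a minimal sketch combining this with the records orientation (assuming a plain bool column):
import json
import pandas as pd

df = pd.DataFrame({'boolCol': [True, False, True]})
# casting to object first yields Python bools, which json.dumps can serialize
payload = json.dumps(df.astype(object).to_dict(orient='records'))
print(payload)  # [{"boolCol": true}, {"boolCol": false}, {"boolCol": true}]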

How to get value in dataframe column

I have a dataframe column where one row contains this data:
bloodtype
[{'id': 35,
'type': 'typeO'}]
The column contains different blood types, A, B, AB, and O, in other rows.
I ran:
type(df.blood_type)
it returned 'pandas.core.series.Series'
df['blood_type'].str.split(":")[0][2]
it returned " 'typeO'}]"
How can I just get typeO, typeA, typeAB, so I can convert them to different classes? Thanks.
from ast import literal_eval

df['type'] = df['blood_type'].apply(lambda x: literal_eval(x)[0]['type'])
Alternate solution:
# a more robust solution in case the series has unexpected values
def extract_type(x):
    try:
        parsed = literal_eval(x)
        if isinstance(parsed, list) and parsed and isinstance(parsed[0], dict):
            return parsed[0].get('type')
        return None
    except (ValueError, SyntaxError):
        return None

df['type'] = df['blood_type'].apply(extract_type)
edit: convert string representation of dict to dictionary using ast.literal_eval before extracting type
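For completeness, a small self-contained sketch (assuming the column really holds string representations of the list of dicts):
from ast import literal_eval
import pandas as pd

df = pd.DataFrame({'blood_type': ["[{'id': 35, 'type': 'typeO'}]",
                                  "[{'id': 12, 'type': 'typeA'}]"]})
# parse each string into a list, take the first dict, then pull out 'type'
df['type'] = df['blood_type'].apply(lambda x: literal_eval(x)[0]['type'])
print(df['type'].tolist())  # ['typeO', 'typeA']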

Python - Detect if list of values/strings are dates, times, datetime, or neither

Given a list of values or strings, how can I detect whether these are either dates, date and times, or neither?
I have used the pandas API to infer data types, but it doesn't work well with dates. See the example:
import pandas as pd
def get_redshift_dtype(values):
    dtype = pd.api.types.infer_dtype(values)
    return dtype
This is the result that I'm looking for. Any suggestions on better methods?
# Should return "date"
values_1 = ['2018-10-01', '2018-02-14', '2017-08-01']
# Should return "date"
values_2 = ['2018-10-01 00:00:00', '2018-02-14 00:00:00', '2017-08-01 00:00:00']
# Should return "datetime"
values_3 = ['2018-10-01 02:13:00', '2018-02-14 11:45:00', '2017-08-01 00:00:00']
# Should return "None"
values_4 = ['123098', '213408', '801231']
You can write a function to return values dependent on conditions you specify:
def return_date_type(s):
    s_dt = pd.to_datetime(s, errors='coerce')
    if s_dt.isnull().any():
        return 'None'
    elif s_dt.normalize().equals(s_dt):
        return 'date'
    return 'datetime'

return_date_type(values_1)  # 'date'
return_date_type(values_2)  # 'date'
return_date_type(values_3)  # 'datetime'
return_date_type(values_4)  # 'None'
You should be aware that Pandas datetime series always include time. Internally, they are stored as integers, and if a time is not specified it will be set to 00:00:00.
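A short illustration of that point (a sketch, not part of the answer above):
import pandas as pd

s = pd.to_datetime(['2018-10-01', '2018-02-14'])
print(s[0])                     # 2018-10-01 00:00:00 - midnight was filled in
print(s.normalize().equals(s))  # True: normalize() changes nothing here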
Here's something that'll give you exactly what you asked for using re
import re

classify_dict = {
    'date': r'^\d{4}(-\d{2}){2}$',
    'date_again': r'^\d{4}(-\d{2}){2} 00:00:00$',
    'datetime': r'^\d{4}(-\d{2}){2} \d{2}(:\d{2}){2}$',
}

def classify(mylist):
    key = 'None'
    for k, v in classify_dict.items():
        if all([bool(re.match(v, e)) for e in mylist]):
            key = k
            break
    if key == 'date_again':
        key = 'date'
    return key
classify(values_2)
>>> 'date'
The checking is done iteratively using regex, trying to match every item of the list. Only if all items match will the key be returned. This works for all of the example lists you've given.
For now, the regex strings do not check for numbers outside a valid range, e.g. 25:00:00, but that would be relatively straightforward to implement.
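Usage with the other example lists (a quick check under the same assumptions):
print(classify(values_1))  # 'date'
print(classify(values_3))  # 'datetime'
print(classify(values_4))  # 'None'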
