I'm trying to convert a string to a float in a Dash callback, but when I run my code the Dash app shows this error (I don't get the error in the terminal):
lati = float(lati[-1])
ValueError: could not convert string to float: 'float64'
First I need to extract the given latitude (and longitude) value. To do that I convert it to a string and split it, because I could not find a better way to get the number out of the CSV file using pandas.
Output:
# converting to string:
12 41.6796
Name: latitude, dtype: float64
# splitting:
['12', '', '', '', '41.6796']
# converting to float:
41.6796
This is the actual code:
@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    df = pd.read_csv("powerplant.csv")
    df = df[df.name == searchVar]
    # converting to string
    lati = str(df['latitude'])
    longi = str(df['longitude'])
    # splitting it
    lati = lati.split('\n', 1)
    lati = lati[0].split(' ', 4)
    longi = longi.split('\n', 1)
    longi = longi[0].split(' ', 4)
    # converting to float
    lati = float(lati[-1])
    longi = float(longi[-1])
I tested this code in another script and it worked just fine. Is there a better way to extract the latitude and longitude values?
The data can be downloaded from https://datasets.wri.org/dataset/globalpowerplantdatabase; here is an excerpt.
country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,estimated_generation_gwh
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017,,,,,,
AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017,,,,,,
ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305,Hydro,,,,1963.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021238,,,,,,,79.22851153039832
ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936,Hydro,,,,1958.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021241,,,,,,,82.52969951083159
The issue is the way you are accessing the values in the dataframe. pandas allows you to access the data without having to parse its string representation.
You can access the row and the column in one call to .loc.
If you know you will have a single value, you can call the squeeze method:
>>> import pandas as pd
>>> from io import StringIO
>>> # data shortened for brevity
>>> df = pd.read_csv(StringIO("""country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude
... AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190
... AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787
... ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305
... ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936"""))
>>> searchVar = "Ulez"
>>> df.loc[df["name"] == searchVar, "latitude"] # here you have a pd.Series
3 41.6796
Name: latitude, dtype: float64
>>> df.loc[df["name"] == searchVar, "latitude"].squeeze() # here you have a scalar
41.6796
>>> df.loc[df["name"] == searchVar, "longitude"].squeeze()
19.8936
If for some reason you have several rows with the same name, you will get a Series back rather than a scalar. But maybe that is a case where failing is better than passing ambiguous data along.
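If you would rather fail fast in that case, Series.item() raises unless the selection contains exactly one value, where squeeze would silently hand back the Series (a sketch, continuing the session above):
>>> df.loc[df["name"] == searchVar, "latitude"].item()
41.6796
>>> df.loc[df["name"] == "No Such Plant", "latitude"].item()
Traceback (most recent call last):
  ...
ValueError: can only convert an array of size 1 to a Python scalar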
What you're looking at is a pandas.Series object containing a single row of data, and you're trying to chop up its __repr__ to get at the value. There is no need for this. I'm not familiar with the Python version of plotly, but I see that you have a callback, so I've wrapped it up into a function (I'm not sure whether you need to handle the case where the name can't be found):
import pandas as pd

def get_by_name(name):
    df = pd.read_csv('powerplants.csv')
    df = df[df['name'] == name]
    if not df.empty:
        return df[['latitude', 'longitude']].values.tolist()[0]
    return None, None

lat, lon = get_by_name('Kajaki Hydroelectric Power Plant Afghanistan')
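For completeness, here is a hedged sketch of how get_by_name might plug into the Dash callback from the question (component ids copied from the question; the app object and layout are assumed to exist):
from dash.dependencies import Input, Output, State

@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    lat, lon = get_by_name(searchVar)
    if lat is None:
        return 'No power plant named "{}" was found.'.format(searchVar)
    return 'Latitude: {}, Longitude: {}'.format(lat, lon)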
I have a column that refuses to change from object to float. The data from the xlsx file is always presented the same way (as a number), but somehow only this column is read as object.
The numbers in the column represent percentages, using a point (.) as the decimal separator.
xls3[' Vesturland'] = xls3[' Vesturland'].astype(float)
does not work. There are no special characters to replace (e.g. with str.replace()); I have tried that as well.
I dare not use
xls3[' Vesturland'] = pd.to_numeric(xls3[' Vesturland'])
because it changes all floats to NaN, and the whole column is percentage values.
The only thing I can think of is that the number of decimals is not consistent, but that shouldn't really matter, or does it? (In the screenshot, a red arrow marks the column I want to change to float.)
I only get this error when I try to convert to float: could not convert string to float: '' — and searching on my specific problem has not given any results yet.
You have empty strings in your pd.Series, which cannot be readily converted to a float data type. What you can do is check for them and remove them. An example script is:
import pandas as pd

a = pd.DataFrame([['a', 'b', 'c'], ['2.42', '', '3.285']]).T
a.columns = ['names', 'nums']
a['nums'] = a['nums'][a['nums'] != ''].astype(float)
Note: if you run a['nums'] = a['nums'].astype(float) before filtering out the empty strings, the same error you mentioned will be thrown.
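Alternatively, assuming empty strings are the only offending values, you can map them to NaN first and cast the whole column in one go (a sketch of the same toy frame):
import numpy as np
import pandas as pd

a = pd.DataFrame([['a', 'b', 'c'], ['2.42', '', '3.285']]).T
a.columns = ['names', 'nums']
# Empty strings become NaN; everything else parses as float.
a['nums'] = a['nums'].replace('', np.nan).astype(float)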
First, use this line to obtain the current dtypes:
col_dtypes = dict([(k, v.name) for k, v in dict(df.dtypes).items()])
Like so:
xls3 = pd.read_csv('path/to/file')
col_dtypes = dict([(k, v.name) for k, v in dict(xls3.dtypes).items()])
print(col_dtypes)
Copy the value that is printed.
It should be like this:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
Then, for any column whose datatype you know should not be object, change its entry to the required type ('int32', 'int64', 'float32' or 'float64').
Example:
The datatypes might be detected as:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
If we know ' Vesturland' is supposed to be float, then we can edit this to be:
col_dtypes = {
'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
' Vesturland': 'float64', ...
}
Now, with this snippet you can find the non-numeric values:
def clean_non_numeric_values(series, col_type):
    illegal_value_pos = []
    for i in range(len(series)):
        try:
            if col_type == 'int64' or col_type == 'int32':
                val = int(series[i])
            elif col_type == 'float32' or col_type == 'float64':
                val = float(series[i])
        except (ValueError, TypeError):
            illegal_value_pos.append(i)
            # series[i] = None  # We could set the illegal values to None
            #                   # to remove them later using xls3.dropna()
    return series, illegal_value_pos
# Now we will manually replace the dtype of the column Vesturland like so:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64'
}
for col in list(xls3.columns):
    if col_dtypes[col] in ['int32', 'int64', 'float32', 'float64']:
        series, illegal_value_pos = (
            clean_non_numeric_values(series=xls3[col], col_type=col_dtypes[col])
        )
        xls3[col] = series
        print(illegal_value_pos)
        if illegal_value_pos:
            illegal_rows = xls3.iloc[illegal_value_pos]
            # This will print all the illegal values.
            print(illegal_rows[col])
Now you can use this information to remove the non-numeric values from the dataframe.
Warning: since this uses a Python-level for loop it is slow, but it will help you find and remove the values you don't want.
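For what it's worth, a vectorized sketch of the same check (assuming xls3 is already loaded and ' Vesturland' is the offending column) avoids the loop entirely:
import pandas as pd

# Values that fail to parse become NaN; the mask flags the originals.
parsed = pd.to_numeric(xls3[' Vesturland'], errors='coerce')
bad_mask = parsed.isna() & xls3[' Vesturland'].notna()
print(xls3.loc[bad_mask, ' Vesturland'])  # the values that refused to parse
xls3[' Vesturland'] = parsed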
After much trial and error, I ended up opening the Excel sheet and deleting about 10 rows below the last data input. Then I unfroze the rows/columns and read the file into Jupyter Notebook again, and now ALL OF THE DATA IS FLOAT. I don't know which change did the trick, but this is resolved now.
Thank you all that helped me here for your time and your attempts to solve this.
Sometimes a cell can simply be blank. You can count such values with:
len([x for x in xls3[' Vesturland'] if x == ' '])
You can also open the CSV in Excel, press Ctrl+Shift+L to enable filtering, and filter for blanks.
Could it be that you do for i in range(0, len(tablename)) when you need len(tablename) - 1, because you start at 0?
I have a function that extracts a number of variables from Zillow. I used a lambda function to append the returned values to a DataFrame. I am wondering if there is a faster way to return all the variables and append them to the DataFrame at once, instead of one at a time.
Here is my code:
from xml.dom.minidom import parse, parseString
import xml.dom.minidom
import requests
import sys
import pandas as pd
import numpy as np

l_zwsid = ''
df = pd.read_csv('data.csv')
def getElementValue(p_dom, p_element):
    if len(p_dom.getElementsByTagName(p_element)) > 0:
        l_value = p_dom.getElementsByTagName(p_element)[0]
        return l_value.firstChild.data
    else:
        l_value = 'NaN'
        return l_value
def getData(l_zwsid, a_addr, a_zip):
    try:
        l_url = 'http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=' + l_zwsid + '&address=' + a_addr + '&citystatezip=' + a_zip
        xml = requests.get(l_url)
        dom = parseString(xml.text)
        responses = dom.getElementsByTagName('response')
        zpid = getElementValue(dom, 'zpid')
        usecode = getElementValue(dom, 'useCode')
        taxyear = getElementValue(dom, 'taxAssessmentYear')
        tax = getElementValue(dom, 'taxAssessment')
        yearbuilt = getElementValue(dom, 'yearBuilt')
        sqft = getElementValue(dom, 'finishedSqFt')
        lotsize = getElementValue(dom, 'lotSizeSqFt')
        bathrooms = getElementValue(dom, 'bathrooms')
        bedrooms = getElementValue(dom, 'bedrooms')
        totalrooms = getElementValue(dom, 'totalRooms')
        lastSale = getElementValue(dom, 'lastSoldDate')
        lastPrice = getElementValue(dom, 'lastSoldPrice')
        latitude = getElementValue(dom, 'latitude')
        longitude = getElementValue(dom, 'longitude')
        for response in responses:
            addresses = response.getElementsByTagName('address')
            for addr in addresses:
                street = getElementValue(addr, 'street')
                zipcode = getElementValue(addr, 'zipcode')
            zestimates = response.getElementsByTagName('zestimate')
            for zest in zestimates:
                amt = getElementValue(zest, 'amount')
                lastupdate = getElementValue(zest, 'last-updated')
                valranges = zest.getElementsByTagName('valuationRange')
                for val in valranges:
                    low = getElementValue(val, 'low')
                    high = getElementValue(val, 'high')
        return longitude, latitude
    except AttributeError:
        return None
df['Longtitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
df['Latitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
This currently does not work because the new columns will contain both the longitude and latitude.
Your getData function returns a tuple, which is why both columns contain both the latitude and the longitude. One workaround is to parameterise the function as follows:
def getData(l_zwsid, a_addr, a_zip, axis='lat'):
    valid = ['lat', 'lon']
    if axis not in valid:
        raise ValueError(f'axis must be one of {valid}')
    ...
    if axis == 'lat':
        return latitude
    else:
        return longitude
This won't improve efficiency, though; it will actually make things slower, because every row now triggers two API calls. Your main overhead comes from making an API call for every row in the DataFrame, so you are constrained by network performance.
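If you want to keep a single API call per row instead, a hedged alternative (assuming getData is changed to return (None, None) on failure, so every row yields a 2-tuple) is to expand the returned tuple into both columns at once; result_type='expand' needs pandas 0.23 or newer:
# getData returns (longitude, latitude), so the column order below matches.
coords = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']),
                  axis=1, result_type='expand')
df[['Longtitude', 'Latitude']] = coords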
You can make your getData function return a single string containing comma-separated values of all the elements.
Append this CSV string as an ALL_TEXT column in the dataframe df.
Then split the ALL_TEXT column into multiple columns (lat, long, zipcode, street, etc.):
def split_into_columns(text):
    required_columns = ['Latitude', 'Longtitude', 'Zipcode']
    columns_value_list = text['ALL_TEXT'].split(',')
    for i in range(len(required_columns)):
        text[required_columns[i]] = columns_value_list[i]
    return text

df = pd.DataFrame([['11.49, 12.56, 9823A'], ['14.02, 15.29, 9674B']],
                  columns=['ALL_TEXT'])
updated_df = df.apply(split_into_columns, axis=1)
df
ALL_TEXT
0 11.49, 12.56, 9823A
1 14.02, 15.29, 9674B
updated_df
ALL_TEXT Latitude Longtitude Zipcode
0 11.49, 12.56, 9823A 11.49 12.56 9823A
1 14.02, 15.29, 9674B 14.02 15.29 9674B
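A vectorized alternative to the row-wise apply, for what it's worth, is str.split with expand=True (a sketch reusing the same hypothetical column names):
parts = df['ALL_TEXT'].str.split(',', expand=True)
parts.columns = ['Latitude', 'Longtitude', 'Zipcode']
# Strip the stray spaces left over from splitting on ','.
updated_df = df.join(parts.apply(lambda col: col.str.strip()))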
I am trying to understand how to convert the Azure ML String Feature data type into float using a Python script. My data set contains times in "HH:MM" format, which are recognized as a String Feature (as in the screenshot).
I want to convert them into floats by dividing the time of day by 86400 seconds (24 hours), so 17:30 becomes 0.729166666666667. I wrote a Python script to do the conversion:
import pandas as pd
import numpy as np

def timeToFloat(x):
    frt = [3600, 60]
    data = str(x)
    result = float(sum([a*b for a, b in zip(frt, map(int, data.split(':')))])) / 86400
    return result if isNotZero(x) else 0.0

def isNotZero(x):
    return (x is "0")

def azureml_main(dataframe1=None):
    df = pd.DataFrame(dataframe1)
    df["Departure Time"] = pd.to_numeric(df["Departure Time"]).apply(timeToFloat)
    print(df["Departure Time"])
    return df,
When I run the script, it fails. I then tried to check whether the value is a str or not, but that returns None.
Can we treat a String Feature as a string? Or how should I convert this data correctly?
The to_numeric conversion seems to be the problem, as there is no default parsing from a string like "17:30" to a number.
Does it work if you just use df["Departure Time"].apply(timeToFloat)?
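If pd.to_numeric is dropped, a sketch along these lines (assuming the column holds "HH:MM" strings, with the literal "0" standing in for missing values, as your isNotZero check suggests) should behave as intended:
def time_to_float(x):
    s = str(x)
    if ':' not in s:  # e.g. the assumed "0" placeholder for missing times
        return 0.0
    hours, minutes = map(int, s.split(':'))
    return (hours * 3600 + minutes * 60) / 86400.0

df["Departure Time"] = df["Departure Time"].apply(time_to_float)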
Roope - Microsoft Azure ML Team
As shown in the screenshot below, I have 2 columns in an Excel file. I'm trying to reduce the precision of the number fields, e.g. 100.54000000000001 to 100.540. The numbers are stored as strings, so when I convert them to float using
df['Unnamed: 5'] = pd.to_numeric(df['Unnamed: 5'], errors='coerce')
it turns the non-numeric strings into NaN. Can anyone help me with this issue? I'm trying to convert only the numbers to float; the words should remain strings.
EDIT: It would be acceptable to convert the numeric values back to string after rounding them. My code is as follows:
>>> import pandas as pd
>>> import numpy as np
>>> xl = pd.ExcelFile("WSJ_template.xls")
>>> xl.sheet_names
[u'losers', u'winners']
>>> dfw = xl.parse("winners")
>>> dfw.head()
<output>
>>> dfw = dfw.apply(pd.to_numeric, errors='coerce').combine_first(dfw)
>>> dfw = dfw.replace(np.nan, '', regex=True)
>>> dfw
<output>
As you already identified, we're best off using pd.DataFrame.apply. The only difference is that rather than using a built-in function, we'll define our own.
We'll start off by filling the DataFrame (this is a placeholder; you already have this covered):
df = pd.DataFrame(columns=['Unnamed: 5', 'Unnamed: 6'],
data=[['NaN', 'NaN'],
['Average', 'Weekly'],
['100.540000000001', '0.2399999999999999'],
['99.3299999999998', '0.1700000000000001'],
['95.4800000000004', 'change'],
['bid', '1.929999999999999']])
Now we define a function to convert the values. It should try to cast to float and, if that works, return the rounded value; if the cast fails, it returns the original value unchanged. Here's one possible route:
def round_only_nums(val):
    try:
        return '%s' % round(float(val), 3)
    except (ValueError, TypeError):
        return val
Next, let's apply that to the columns that need processing:
cols_to_process = ['Unnamed: 5', 'Unnamed: 6']
for col in cols_to_process:
    df[col] = df[col].apply(round_only_nums)
And our results:
>>> df
Unnamed: 5 Unnamed: 6
0 nan nan
1 Average Weekly
2 100.54 0.24
3 99.33 0.17
4 95.48 change
5 bid 1.93
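An equivalent vectorized sketch: parse what parses, round it, and fall back to the original strings where parsing failed. (Unlike the loop above, literal 'NaN' strings are kept as-is here rather than becoming 'nan'.)
import pandas as pd

for col in cols_to_process:
    nums = pd.to_numeric(df[col], errors='coerce').round(3)
    # Keep the rounded value where parsing succeeded, the original otherwise.
    df[col] = nums.astype(str).where(nums.notna(), df[col])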
I have this type of DataFrame I wish to use. But because the imported data uses the letter i for the imaginary part of the complex numbers, Python doesn't allow me to convert the values to a numeric type.
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
How can I proceed to change the i to j in each row of the DataFrame?
Thank you.
If you have a string like this: complexStr = "0.015291+0.0075383i", you could do:
complexFloat = complex(complexStr[:-1] + 'j')
If your data is a string like this: line = "5.0 0.01511+0.0035769i", you have to separate off the first part, like this:
number, complexStr = line.split()
complexFloat = complex(complexStr[:-1] + 'j')
>>> complexFloat
(0.015291+0.0075383j)
>>> type(complexFloat)
<type 'complex'>
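To apply the same idea across a whole DataFrame column (a sketch, assuming the complex strings live in a column named 'b'):
# Swap 'i' for 'j' in every row, then parse each string as a complex number.
df['b'] = df['b'].str.replace('i', 'j').apply(complex)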
I'm not sure how you obtained your dataframe, but if you're reading it from a text file with a suitable header, you can use a converter function to handle the 'i' -> 'j' substitution so that the dtype is created properly:
For file test.df:
a b
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
the code
import pandas as pd
df = pd.read_table('test.df', delimiter=r'\s+',
                   converters={'b': lambda v: complex(str(v.replace('i', 'j')))})
gives df as:
a b
0 5.0000 (0.01511+0.0035769j)
1 5.0298 (0.015291+0.0075383j)
2 5.0594 (0.015655+0.0094534j)
3 5.0874 (0.012456+0.011908j)
4 5.1156 (0.015332+0.011174j)
5 5.1458 (0.015758+0.0095832j)
with column dtypes:
a float64
b complex128