I am trying to validate columns in a dataframe against a particular regex. The limit of the number is (20,3), i.e. at most 20 digits in the integer part and up to 3 decimal places. But pandas converts the original numbers to seemingly random values, so my regex validation fails. I have checked that my regex is correct.
DataFrame:
FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123
Code:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("D:\\FTP\\LocalUser\\NCCLCOLL\\COLLATERALUPLOAD\\upld\\SplitFiles\\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
rslt_df2=df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1=rslt_df2[~rslt_df2.iloc[:,0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1",rslt_df1)
Output Result:
rslt_df1 Int64Index([8], dtype='int64')
Expected Result:
rslt_df1 Int64Index([], dtype='int64')
Use dtype=str as a parameter of pd.read_csv:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0]
                    .str.contains(NumberValidationRegexnegative, regex=True)].index
Output:
>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
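For context, row 8 fails because a 20-digit integer exceeds float64 precision (roughly 15-17 significant decimal digits), so parsing the column as numbers already mangles the value. A minimal sketch of the effect, using the data above:
import pandas as pd
from io import StringIO

# the column mixes decimals with a 20-digit integer, so pandas parses it as float64
df = pd.read_csv(StringIO("FirstColumn\n111900987654123.123\n11119009876541231111"))
print(f"{df['FirstColumn'].iloc[1]:.0f}")  # the digits no longer match the CSV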
I have a column that will not change from object to float. The data from the xlsx file is always presented the same way (as a number), but somehow only this column is read as object.
The numbers in the column represent percentages, using a point (.) as the decimal separator.
xls3[' Vesturland'] = xls3[' Vesturland'].astype(float)
does not work. There are no special characters to replace (e.g. with str.replace()); I have tried that as well.
I dare not use
xls3[' Vesturland'] = pd.to_numeric(xls3[' Vesturland'])
because it changes all the floats to NaN, and the whole column consists of percentage values.
The only thing I can think of is that the number of decimals is not consistent, but that shouldn't really matter, should it?
The only error I get when I try to convert to float is could not convert string to float: '', and searching for my specific problem has not given any results yet.
You have empty strings in your pd.Series, which cannot be readily converted to a float data type. What you can do is check for them and remove them. An example script is:
import pandas as pd

# build a small frame where one 'nums' entry is an empty string
a = pd.DataFrame([['a', 'b', 'c'], ['2.42', '', '3.285']]).T
a.columns = ['names', 'nums']
# convert only the non-empty strings; the empty cell becomes NaN on realignment
a['nums'] = a['nums'][a['nums'] != ''].astype(float)
Note: if you try to run a['nums'] = a['nums'].astype(float) before selecting the non-empty strings, the same error you mentioned will be thrown.
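A shorter route to the same result (a sketch of my own, not part of the script above) is to coerce the whole column at once, so empty strings become NaN instead of raising:
a['nums'] = pd.to_numeric(a['nums'], errors='coerce')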
First use this line to obtain the current dtypes:
col_dtypes = dict([(k, v.name) for k, v in dict(df.dtypes).items()])
Like so:
xls3 = pd.read_csv('path/to/file')
col_dtypes = dict([(k, v.name) for k, v in dict(xls3.dtypes).items()])
print(col_dtypes)
Copy the value that is printed.
It should be like this:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
Then, for each column whose datatype you know shouldn't be object, change it to the required type ('int32', 'int64', 'float32' or 'float64').
Example:
The datatypes might be detected as:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
If we know ' Vesturland' is supposed to be float, then we can edit this to be:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64', ...
}
Now, with this snippet you can find the non-numeric values:
def clean_non_numeric_values(series, col_type):
    illegal_value_pos = []
    for i in range(len(series)):
        try:
            if col_type in ('int64', 'int32'):
                val = int(series.iloc[i])
            elif col_type in ('float32', 'float64'):
                val = float(series.iloc[i])
        except (ValueError, TypeError):
            illegal_value_pos.append(i)
            # series.iloc[i] = None  # We can set the illegal values to None
            # to remove them later using xls3.dropna()
    return series, illegal_value_pos
# Now we will manually replace the dtype of the column ' Vesturland' like so:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64'
}

for col in list(xls3.columns):
    # .get avoids a KeyError for columns not listed in col_dtypes
    if col_dtypes.get(col) in ['int32', 'int64', 'float32', 'float64']:
        series, illegal_value_pos = (
            clean_non_numeric_values(series=xls3[col], col_type=col_dtypes[col])
        )
        xls3[col] = series
        print(illegal_value_pos)
        if illegal_value_pos:
            illegal_rows = xls3.iloc[illegal_value_pos]
            # This will print all the illegal values.
            print(illegal_rows[col])
Now you can use this information to remove the non-numeric values from the dataframe.
Warning: since this uses a for loop it is slow, but it will help you remove the values you don't want.
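If the loop is too slow, a vectorized sketch of the same check (my own addition, assuming the ' Vesturland' column from the question):
# rows that hold a value but fail numeric conversion
converted = pd.to_numeric(xls3[' Vesturland'], errors='coerce')
illegal = xls3[' Vesturland'][converted.isna() & xls3[' Vesturland'].notna()]
print(illegal)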
After much trial and error, I ended up opening the Excel sheet and deleting about 10 rows below the last data input. Then I unfroze the rows/columns, read it into Jupyter Notebook again, and now ALL OF THE DATA IS FLOAT. I don't know which change did the trick, but this is resolved now.
Thank you all that helped me here for your time and your attempts to solve this.
Sometimes a cell can be blank. Open your CSV file in Excel, apply a filter (Ctrl+Shift+L), and check for blank cells. You can count the cells that contain a single space like this:
len([x for x in xls3[' Vesturland'] if x == ' '])
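That check only catches cells holding a single space; a variant of my own that also counts empty and whitespace-only cells:
xls3[' Vesturland'].astype(str).str.strip().eq('').sum()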
Could it be that you do for i in range(0, len(tablename)) and you need len(tablename) - 1 because you start at 0?
I'm trying to convert a string into a float in a Dash callback, but when I run my code my Dash app shows this error (I don't get it in the terminal):
lati = float(lati[-1])
ValueError: could not convert string to float: 'float64'
First I need to extract a given latitude (and longitude) number. To do that I convert it to a string and split it, because I could not find a better way to get this number from the CSV file using pandas.
Output:
# converting to string:
12 41.6796
Name: latitude, dtype: float64
# splitting:
['12', '', '', '', '41.6796']
# converting to float:
41.6796
This is the actual code:
@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    df = pd.read_csv("powerplant.csv")
    df = df[df.name == searchVar]
    # converting to string
    lati = str(df['latitude'])
    longi = str(df['longitude'])
    # splitting it
    lati = lati.split('\n', 1)
    lati = lati[0].split(' ', 4)
    longi = longi.split('\n', 1)
    longi = longi[0].split(' ', 4)
    # converting to float
    lati = float(lati[-1])
    longi = float(longi[-1])
I actually tested this code in another script and it worked just fine. Is there a better way to extract the latitude and longitude numbers?
The data can be downloaded from https://datasets.wri.org/dataset/globalpowerplantdatabase; here is an excerpt.
country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,estimated_generation_gwh
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017,,,,,,
AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017,,,,,,
ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305,Hydro,,,,1963.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021238,,,,,,,79.22851153039832
ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936,Hydro,,,,1958.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021241,,,,,,,82.52969951083159
The issue is the way you are accessing the values in a dataframe. Pandas allows you to access the data without having to parse the string representation.
You can access the row and the column in one call to .loc
If you know you will have a single value, you can call the squeeze method
>>> import pandas as pd
>>> from io import StringIO
>>> # data shortened for brevity
>>> df = pd.read_csv(StringIO("""country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude
... AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190
... AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787
... ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305
... ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936"""))
>>> searchVar = "Ulez"
>>> df.loc[df["name"] == searchVar, "latitude"] # here you have a pd.Series
3 41.6796
Name: latitude, dtype: float64
>>> df.loc[df["name"] == searchVar, "latitude"].squeeze() # here you have a scalar
41.6796
>>> df.loc[df["name"] == searchVar, "longitude"].squeeze()
19.8936
If for some reason you have several rows with the same name, you will get a Series back rather than a scalar. But maybe that is a case where failing is better than passing ambiguous data along.
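If you would rather fail explicitly, Series.item() (instead of squeeze) raises a ValueError unless exactly one value matched; a quick sketch in the same session:
>>> df.loc[df["name"] == searchVar, "latitude"].item()  # raises if 0 or 2+ rows match
41.6796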
What you're looking at is a pandas.Series object containing a single row of data, and you're trying to chop up its __repr__ to get at the value. There is no need for this. I'm not familiar with the Python version of Plotly, but I see that you have a callback, so I've wrapped it up into a function (I'm not sure whether a case exists where the name can't be found):
import pandas as pd

def get_by_name(name):
    df = pd.read_csv('powerplants.csv')
    df = df[df['name'] == name]
    if not df.empty:
        return df[['latitude', 'longitude']].values.tolist()[0]
    return None, None

lat, lon = get_by_name('Kajaki Hydroelectric Power Plant Afghanistan')
I saved a set parameter using to_csv. The CSV file looks like this:
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951,
38928, 2 6088, 10258, 49235, 10326, 13176, 30450, 41787, 14084,
46149])",18,19.0,1 1,5.36363649368
Can I use read_csv and get the set type back rather than str?
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={
})
The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header = None)
will already return column 2 as a string as long as you have double quotes around set([...]).
Then use
users[2].apply(lambda x: eval(x))
to convert it back to a set.
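Putting both steps together, a minimal sketch assuming the file from the question:
import pandas as pd

users = pd.read_csv("users_match.csv", header=None)
users[2] = users[2].apply(eval)  # parses each "set([...])" string back into a set
print(type(users[2][0]))  # <class 'set'>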
To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.at[0, 2] = eval(df[2][0])  # set_value was removed in pandas 1.0; .at is its replacement
>>> type(df[2][0])
set
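One caveat (my own note): eval executes arbitrary code from the file. If you control the writing side, saving the column as a plain set literal such as "{17122, 196, 26405}" lets you read it back with the safer ast.literal_eval:
import ast

ast.literal_eval('{17122, 196, 26405}')  # returns the set without eval's code-execution risk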