Python: problem converting negative numbers to floats, issues with hyphen encoding

I have a Pandas dataframe that I've read from a file - pd.read_csv() - and I'm having trouble converting a column with string values to float.
Firstly, I'm not entirely sure why pandas is even reading the column as strings to begin with, since all the values are numeric. The problem seems to be with the hyphen-minus sign for the negative numbers. Other threads on this topic mention how an en dash or em dash can mess things up.
However, when I try converting the hyphen type, it still gives me an error. For example,
df['Verified_m'] = df['Verified_m'].str.replace("\U00002013", "-").astype(float)
doesn't change anything; all the values start with the '-' hyphen, so it's not actually replacing anything. It still gives me the error:
ValueError: could not convert string to float: '-'
I've tried replacing all of the hyphens with a numeric value to see if that would work, and I'm able to convert to float (example: df['Verified_m'] = df['Verified_m'].str.replace("-", "0").astype(float)). But I'd like to retain the negative values in the dataset. Does anyone know what's wrong with my hyphens?

Try this:
df['Verified_m'] = df['Verified_m'].str.replace("\U00002013", "-").str.replace(r'^-$', '0', regex=True).astype(float)
After converting the en dashes (U+2013) to hyphens, it converts a lone - to zero.
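If a lone - marks a missing value rather than a real zero, mapping it to NaN may be safer than mapping it to 0. The following is a minimal plain-Python sketch of that idea, not the answer's own method; the helper name to_float_or_nan and the DASH_VARIANTS table are made up for illustration, and the sample values are assumptions about what the column might contain.

```python
import math

# Dash-like characters that sometimes replace the ASCII hyphen-minus in exported data.
DASH_VARIANTS = {
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}


def to_float_or_nan(text):
    """Convert text to float; return NaN for blanks and for a lone dash."""
    cleaned = text.strip()
    for dash, hyphen in DASH_VARIANTS.items():
        cleaned = cleaned.replace(dash, hyphen)
    if cleaned in ("", "-"):
        return math.nan
    return float(cleaned)


# Hypothetical sample values: a normal negative, an en-dash negative, a lone dash, a positive.
values = ["-1.5", "\u20132.25", "-", "3"]
converted = [to_float_or_nan(v) for v in values]
print(converted)
```

Assuming the column holds only strings, the same helper could be applied with df['Verified_m'].map(to_float_or_nan).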

Related

ValueError: could not convert string to float: ' caing low enough for rotary follow on' | Python

I have a Python script that iterates over time-formatted values and returns just the hour.
Below is a similar script (the one I use for the iteration):
zaman = "06:00:00"  # hours:minutes:seconds
hm = zaman.split(":")
vaxt = [hm[1]]
saat = float(hm[0]) + float(float(hm[1])/60)
print(f"{saat:,.2f}")
In one of the files which has several rows I get the error:
ValueError: could not convert string to float: ' caing low enough for rotary follow on'
I have checked this row myself and it does not look different from the previous ones, but I still get the error on it.
Do you have suggestions on how to solve it? (Maybe by getting the hours from a DateTime in another way.)
The issue is that you're not correctly identifying the datetime in the string, so you end up trying to convert the wrong bit to a float.
One potential fix for this would be to not rely on splitting the string at the :s, but instead to use a regex to look for the part of the string with the appropriate format.
For example:
import re
test_string = 'this is a string with 06:00:00 in it somewhere'
matches = re.search(r'(\d{2}):(\d{2}):(\d{2})', test_string)
matches = [float(m) for m in matches.groups()]
print(matches)
# [6.0, 0.0, 0.0]
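Building on the regex idea, here is a small sketch that returns decimal hours and reports rows with no time in them instead of raising. The function name decimal_hours and the TIME_PATTERN constant are hypothetical, and the pattern assumes the time always appears as HH:MM:SS.

```python
import re

# Assumed format: one or two digits for hours, two each for minutes and seconds.
TIME_PATTERN = re.compile(r"(\d{1,2}):(\d{2}):(\d{2})")


def decimal_hours(text):
    """Return the first HH:MM:SS found in text as decimal hours, or None if absent."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours + minutes / 60 + seconds / 3600


print(decimal_hours("this is a string with 06:30:00 in it somewhere"))  # 6.5
print(decimal_hours(" caing low enough for rotary follow on"))  # None
```

Returning None for rows without a time makes it easy to log or skip the offending lines rather than stopping at the first one.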
I have tested the code you provided above and it works. However, after doing some research it appears:
The Python "ValueError: could not convert string to float" occurs when we pass a string that cannot be converted to a float (e.g. an empty string or one containing characters) to the float() class. To solve the error, remove all unnecessary characters from the string.
So check your file to make sure the input is clean for float() to work perfectly.

Can't convert string to float because of '.'?

I need to round the string values of a column in my dataframe to 2 decimal places, so I started by converting them to floats using astype(float) and then using round(2).
Ex:
df['col'] = df['col'].astype(float).round(2)
But I'm getting the following error:
ValueError: could not convert string to float: '.'
I thought the dots would be no problem, is there something I'm missing here?
Edit: It's a huge amount of data, so there could be unexpected values, but after filtering it to testing samples the error continues.
Edit2: Turns out I still had invalid data even after filtering the sheet, so sorry for the seemingly dumb question lol. mozways's solution worked fine.
To convert string to numeric without errors upon invalid data, use pandas.to_numeric:
df['col'] = pandas.to_numeric(df['col'], errors='coerce').round(2)
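Coercing hides the invalid entries, so it can help to list them first. This is a plain-Python sketch of one way to do that; find_unparseable is a hypothetical helper and the sample list is invented for illustration.

```python
def find_unparseable(values):
    """Return (position, value) pairs for entries that float() rejects."""
    bad = []
    for position, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((position, value))
    return bad


# Invented sample: two valid numbers, a lone dot, an empty string, another valid number.
sample = ["1.234", ".", "2.5", "", "7"]
print(find_unparseable(sample))  # [(1, '.'), (3, '')]
```

Passing a column of strings such as df['col'] instead of the sample list should work the same way, since the function only iterates over its argument.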

Python Matplotlib ValueError

Hi,
How do I plot the attached DataFrame in Python? I am looking for a multiple-series line graph.
Any help will be much appreciated.
Error:-ValueError: could not convert string to float
Thanks
Your problem here is that the % signs in your csv file are making Pandas read each value as a string object, rather than as a float.
The best option for resolving this would probably be to not have extraneous characters like %s everywhere in your csv file. Instead, it would probably make more sense to list units in your columns, or elsewhere in descriptions.
However, in this case, it can also be solved afterward by removing the extraneous characters and converting manually, eg, for a DataFrame a:
obj_cols = a.columns[a.dtypes == object]
a[obj_cols] = a[obj_cols].applymap(lambda x: float(x[:-1]))
This works for your specific case, where a single % at the end of each value is consistently the offending character.
The indexing here selects all columns that are of dtype 'object', which in this case are all strings with the last character %.
The lambda function that is applied to each element removes the last character from the string, and then converts it to a float.
It is then assigned to the same columns.

How to compare two percentages in python?

I am new to python and I am dealing with some csv files. To sort these files, I have to compare some percentages in string format, such as "5.265%" and "2.1545%". So how do I compare the actual values of these two strings? I have tried to convert them to float but it didn't work. Thanks in advance!
Still convert them to floats, but without the % sign:
float(value.strip(' \t\n\r%'))
The .strip() call removes any extra whitespace as well as the % sign, which you don't need in order to compare two values:
>>> float('5.265% '.strip(' \t\n\r%'))
5.265
>>> float('2.1545%'.strip(' \t\n\r%'))
2.1545
float() itself will normally strip away whitespace for you but by stripping it yourself you make sure that the % sign is also properly removed, making this a little more robust when handling data from files.
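As a sketch of how this could be used for the sorting mentioned in the question, the conversion can serve as a sort key. The helper name percent_to_float and the sample rows are assumptions made for the example.

```python
def percent_to_float(text):
    """Strip surrounding whitespace and % signs, then convert to float."""
    return float(text.strip(" \t\n\r%"))


# Invented sample values, including one with stray spaces.
rows = ["5.265%", "2.1545%", " 10% "]
ordered = sorted(rows, key=percent_to_float)
print(ordered)  # ['2.1545%', '5.265%', ' 10% ']
print(percent_to_float("5.265%") > percent_to_float("2.1545%"))  # True
```

Sorting with a key keeps the original strings intact while ordering them numerically; sorting the raw strings would compare them character by character instead.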

Python - Convert negative decimals from string to float

I need to read in a large number of .txt files, each of which contains a decimal (some are positive, some are negative), and append these into 2 arrays (genotypes and phenotypes). Subsequently, I wish to perform some mathematical operations on these arrays in scipy, however the negative ('-') symbol is causing problems. Specifically, I cannot convert the arrays to float, because the '-' is being read as a string, causing the following error:
ValueError: could not convert string to float:
Here is my code as it's currently written:
import linecache
gene_array=[]
phen_array=[]
for i in genotype:
    for j in phenotype:
        genotype='/path/g.txt'
        phenotype='/path/p.txt'
        g=linecache.getline(genotype,1)
        p=linecache.getline(phenotype,1)
        p=p.strip()
        g=g.strip()
        gene_array.append(g)
        phen_array.append(p)
gene_array=map(float,gene_array)
phen_array=map(float,phen_array)
I am fairly certain at this point that it is the negative sign that is causing the problem, but it is not clear to me why. Is my use of Linecache the problem here? Is there an alternative method that would be better?
The result of
print gene_array
is
['-0.0448022516321286', '-0.0236187263814157', '-0.150505384829925', '-0.00338459268479522', '0.0142429109897682', '0.0286253352284279', '-0.0462358095345649', '0.0286232317578776', '-0.00747425206137217', '0.0231790239373428', '-0.00266935581919541', '0.00825077426011094', '0.0272744527203547', '0.0394829854063242', '0.0233109171715023', '0.165841084392078', '0.00259693465334536', '-0.0342590874424289', '0.0124600520095644', '0.0713627590092807', '-0.0189374898081401', '-0.00112750710611284', '-0.0161387333242288', '0.0227226505624106', '0.0382173405035751', '0.0455518646388402', '-0.0453048799717046', '0.0168570746329513']
The issue seems to be an empty string or a space, as is evident from your error message:
ValueError: could not convert string to float:
To make it work, replace the map with a list comprehension that skips empty strings:
gene_array=[float(e) for e in gene_array if e]
phen_array=[float(e) for e in phen_array if e]
float(" ") and float("") both raise ValueError, so if any item in gene_array or phen_array is empty or contains only whitespace, the conversion to float fails.
An empty string can come from, for example:
an empty or blank line
a blank line at either the beginning or the end of the file
The issue is definitely not in the negative sign. Python converts strings with negative sign without a problem. I suggest you run each of your entries against a float RegEx and see if they all pass.
There is nothing in the error message to suggest that - is the problem. The most likely reason is that gene_array and/or phen_array contain an empty string ('').
As stated in the documentation, linecache.getline()
will return '' on errors (the terminating newline character will be included for lines that are found).
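Since linecache.getline() signals failure with an empty string rather than an exception, one option is to check for it before calling float(). The sketch below assumes each file holds a single number on its first line; read_first_float is a hypothetical helper, and the temporary files exist only to make the example runnable.

```python
import linecache
import os
import tempfile


def read_first_float(path):
    """Return the first line of path as a float, or None if it is missing or blank."""
    line = linecache.getline(path, 1).strip()
    if not line:
        return None
    return float(line)


# Demonstration with one real file and one path that does not exist.
with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "g.txt")
    with open(good, "w") as handle:
        handle.write("-0.0448022516321286\n")
    missing = os.path.join(folder, "missing.txt")
    results = [read_first_float(good), read_first_float(missing)]

print(results)
```

Collecting None for unreadable files, instead of silently dropping them, keeps the genotype and phenotype lists aligned so mismatched pairs can be filtered out together.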
