I need to read in a large number of .txt files, each of which contains a decimal (some are positive, some are negative), and append these to 2 arrays (genotypes and phenotypes). Subsequently, I wish to perform some mathematical operations on these arrays in scipy; however, the negative ('-') symbol is causing problems. Specifically, I cannot convert the arrays to float, because the '-' is being read as a string, causing the following error:
ValueError: could not convert string to float:
Here is my code as it's currently written:
import linecache
gene_array=[]
phen_array=[]
for i in genotype:
for j in phenotype:
genotype='/path/g.txt'
phenotype='/path/p.txt'
g=linecache.getline(genotype,1)
p=linecache.getline(phenotype,1)
p=p.strip()
g=g.strip()
gene_array.append(g)
phen_array.append(p)
gene_array=map(float,gene_array)
phen_array=map(float,phen_array)
I am fairly certain at this point that it is the negative sign that is causing the problem, but it is not clear to me why. Is my use of linecache the problem here? Is there an alternative method that would be better?
The result of
print gene_array
is
['-0.0448022516321286', '-0.0236187263814157', '-0.150505384829925', '-0.00338459268479522', '0.0142429109897682', '0.0286253352284279', '-0.0462358095345649', '0.0286232317578776', '-0.00747425206137217', '0.0231790239373428', '-0.00266935581919541', '0.00825077426011094', '0.0272744527203547', '0.0394829854063242', '0.0233109171715023', '0.165841084392078', '0.00259693465334536', '-0.0342590874424289', '0.0124600520095644', '0.0713627590092807', '-0.0189374898081401', '-0.00112750710611284', '-0.0161387333242288', '0.0227226505624106', '0.0382173405035751', '0.0455518646388402', '-0.0453048799717046', '0.0168570746329513']
The issue seems to be an empty string or a space, as is evident from your error message:
ValueError: could not convert string to float:
To make it work, replace the map call with a list comprehension that skips empty strings:
gene_array=[float(e) for e in gene_array if e]
phen_array=[float(e) for e in phen_array if e]
By empty string I mean that float(" ") or float("") raises a ValueError, so if any item in gene_array or phen_array is empty or contains only a space, the conversion to float fails.
There are several possible sources of an empty string, for example:
an empty or blank line in the file
a blank line at the beginning or end of the file
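A minimal sketch of the filtering idea above, assuming the values arrive as strings and may include blank or whitespace-only entries. The helper name to_floats and the sample list are hypothetical, not taken from the question:

```python
def to_floats(values):
    """Convert strings to floats, skipping blank or whitespace-only entries."""
    cleaned = (v.strip() for v in values)
    return [float(v) for v in cleaned if v]


# Made-up sample: two valid negatives, one positive, and two blank entries.
raw = ['-0.0448022516321286', '', '0.0142429109897682', '  ', '-0.150505384829925\n']
print(to_floats(raw))
```

Stripping before the truthiness check matters: a string holding only a space is truthy, so `if e` alone would not skip it.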
The issue is definitely not in the negative sign. Python converts strings with negative sign without a problem. I suggest you run each of your entries against a float RegEx and see if they all pass.
There is nothing in the error message to suggest that - is the problem. The most likely reason is that gene_array and/or phen_array contain an empty string ('').
As stated in the documentation, linecache.getline()
will return '' on errors (the terminating newline character will be included for lines that are found).
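As a rough illustration of that behaviour, the sketch below reads the first line of a file with linecache.getline() and treats an empty result as "no value". The helper name read_first_float and the temporary file names are assumptions made for the example:

```python
import linecache
import os
import tempfile


def read_first_float(path):
    """Return the first line of `path` as a float, or None if it is missing or blank."""
    line = linecache.getline(path, 1).strip()
    if not line:
        # linecache.getline() returns '' instead of raising when the file
        # cannot be read, so an empty result has to be handled explicitly.
        return None
    return float(line)


# Demonstration with a temporary file and a path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'g.txt')
    with open(good, 'w') as fh:
        fh.write('-0.0448022516321286\n')
    print(read_first_float(good))
    print(read_first_float(os.path.join(tmp, 'missing.txt')))
```

Returning None rather than silently skipping the file keeps the genotype and phenotype lists aligned, if the caller chooses to drop both entries of a pair together.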
I have a Pandas dataframe that I've read from a file - pd.read_csv() - and I'm having trouble converting a column with string values to float.
Firstly, I'm not entirely sure why pandas is even reading the column as strings to begin with, since all the values are numeric. The problem seems to be with the hyphen-minus sign for the negative numbers. There are other threads on this topic that mention how an em-dash can mess things up.
However, when I try converting the hyphen type, it still gives me an error. For example,
df['Verified_m'] = df['Verified_m'].str.replace("\U00002013", "-").astype(float)
doesn't change anything; all the values start with the '-' hyphen, so it's not actually replacing anything. It still gives me the error:
ValueError: could not convert string to float: '-'
I've tried replacing all of the hyphens with a numeric value to see if that would work, and I'm able to convert to float (example: df['Verified_m'] = df['Verified_m'].str.replace("-", "0").astype(float)). But I'd like to retain the negative values in the dataset. Does anyone know what's wrong with my hyphens?
Try this:
df['Verified_m'] = df['Verified_m'].str.replace("\U00002013", "-").str.replace(r'^-$', '0', regex=True).astype(float)
After converting the en dashes (U+2013) to hyphens, it converts a lone - to zero.
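The same two steps (normalise dash-like characters, then treat a lone dash as a placeholder) can be sketched in plain Python. This is only an illustration of the logic, not the pandas call itself; the helper name parse_signed, the set of dash characters, and the choice of 0.0 as the placeholder value are all assumptions:

```python
# Dash-like characters that sometimes stand in for a minus sign (assumed set).
DASHES = {
    '\u2013': '-',  # en dash
    '\u2014': '-',  # em dash
    '\u2212': '-',  # minus sign
}


def parse_signed(text, placeholder_value=0.0):
    """Parse a numeric string, treating a lone dash as `placeholder_value`."""
    cleaned = text.strip()
    for dash, hyphen in DASHES.items():
        cleaned = cleaned.replace(dash, hyphen)
    if cleaned == '-':
        # A bare dash usually marks "no value" rather than a negative number.
        return placeholder_value
    return float(cleaned)


print([parse_signed(v) for v in ['-12.5', '\u201312.5', '-', '3']])
```

Whether a lone dash should become zero or a missing value depends on the dataset; passing float('nan') as placeholder_value is one alternative.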
I have a Python script that iterates over time-formatted values and returns just the hour.
Below is a similar script (the one I use for the iteration):
zaman = "06:00:00"  # hours:minutes:seconds
hm = zaman.split(":")
vaxt = [hm[1]]
saat = float(hm[0]) + float(float(hm[1])/60)
print(f"{saat:,.2f}")
In one of the files, which has several rows, I get the error:
ValueError: could not convert string to float: ' caing low enough for rotary follow on'
I have checked that this row does not differ from the previous ones, but I still get an error on it.
Do you have any suggestions on how to solve this? (Maybe by getting the hours from the datetime in another way.)
The issue is that you're not correctly identifying the datetime in the string, so you end up trying to convert the wrong bit to a float.
One potential fix for this would be to not rely on splitting the string at the :s, but instead to use a regex to look for the part of the string with the appropriate format.
For example:
import re
test_string = 'this is a string with 06:00:00 in it somewhere'
# A raw string keeps \d from being treated as an (invalid) string escape.
matches = re.search(r'(\d{2}):(\d{2}):(\d{2})', test_string)
matches = [float(m) for m in matches.groups()]
print(matches)
# [6.0, 0.0, 0.0]
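Building on that idea, here is a hedged sketch that returns decimal hours and copes with rows containing no time at all, such as the one in the error message. The function name decimal_hours is hypothetical, and unlike the script in the question it also counts the seconds:

```python
import re

# One or two digits for the hour, two each for minutes and seconds.
TIME_RE = re.compile(r'(\d{1,2}):(\d{2}):(\d{2})')


def decimal_hours(text):
    """Return the first HH:MM:SS found in `text` as decimal hours, or None."""
    match = TIME_RE.search(text)
    if match is None:
        # No time in this row, so there is nothing to convert.
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours + minutes / 60 + seconds / 3600


print(decimal_hours('06:30:00'))
print(decimal_hours(' caing low enough for rotary follow on'))
```

Checking for None before using the result avoids the AttributeError that calling .groups() on a failed re.search() would raise.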
I have tested the code you provided above and it works. However, after doing some research it appears:
The Python "ValueError: could not convert string to float" occurs when we pass a string that cannot be converted to a float (e.g. an empty string or one containing characters) to the float() class. To solve the error, remove all unnecessary characters from the string.
So check your file to make sure the input is clean for float() to work perfectly.
Hi,
How do I plot the attached DataFrame in Python? I am looking for a multiple-series line graph.
Any help will be much appreciated.
Error: ValueError: could not convert string to float
Thanks
Your problem here is that the % signs in your csv file are making Pandas read each value as a string object, rather than as a float.
The best option for resolving this would probably be to not have extraneous characters like %s everywhere in your csv file. Instead, it would probably make more sense to list units in your columns, or elsewhere in descriptions.
However, in this case it can also be solved afterward by removing the extraneous characters and converting manually, e.g., for a DataFrame a:
# .ix was removed in pandas 1.0, so select the object-dtype columns by name
obj_cols = a.columns[a.dtypes == object]
a[obj_cols] = a[obj_cols].applymap(lambda x: float(x[:-1]))
This works for your specific case, where the offending character is consistently a single % at the end:
The indexing here selects all columns that are of dtype 'object', which in this case are all strings with the last character %.
The lambda function that is applied to each element removes the last character from the string, and then converts it to a float.
It is then assigned to the same columns.
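For readers who want to see the per-cell conversion in isolation, here is a small standard-library sketch. It does not use pandas; the helper name percent_to_float and the sample CSV contents are made up for illustration:

```python
import csv
import io


def percent_to_float(cell):
    """Turn '12.5%' into 12.5; leave cells that are not percentages unchanged."""
    text = cell.strip()
    if text.endswith('%'):
        return float(text[:-1])
    return cell


# Hypothetical sample data standing in for the CSV file from the question.
sample = io.StringIO('month,north,south\nJan,12.5%,8%\nFeb,-3.1%,10.25%\n')
rows = [
    {key: percent_to_float(value) for key, value in row.items()}
    for row in csv.DictReader(sample)
]
print(rows)
```

Checking for the trailing % before slicing is a little safer than x[:-1] alone, which would also chop the last digit off a value that has no % sign.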
I want to add and subtract data of this type: $12,587.30, and get the answer back in the same format. How can I do this?
Here is my code example:
print(int(col_ammount2.lstrip('$'))-int(col_ammount.lstrip('$')))
I removed the $ sign and converted the value to int, but it gives me an "invalid literal for int() with base 10" error.
You mentioned you want to do arithmetic operations on the numbers (addition/subtraction), so you probably want them as floats instead. The difference between an integer (int) and a float is that integers do not carry decimal points.
Additionally, as @officialaimm mentioned, you need to remove the commas too, for example
float('$3,333.33'.replace('$', '').replace(',', ''))
will give you
3333.33
So putting it into your code
print(float(col_ammount2.lstrip('$').replace(',', ''))
- float(col_ammount.lstrip('$').replace(',', '')))
An additional note: when you parse a floating-point number (the same applies to integers), watch out for empty values, i.e.
float('')
raises a ValueError. If col_ammount and col_ammount2 may be empty at some point, one thing you can do is default them to 0 when that happens:
float(col_ammount.lstrip(...).replace(...) or 0)
You may also want to read this to learn about workarounds for problems you may face with floating-point arithmetic: https://docs.python.org/3/tutorial/floatingpoint.html
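Since the question also asks for the result in the same $12,587.30 format, here is one possible sketch. It swaps float for decimal.Decimal to sidestep the rounding issues mentioned above, which is a substitution on my part rather than what the answer itself uses. The helper names parse_currency and format_currency are hypothetical, and placing the minus sign before the $ is an assumed convention:

```python
from decimal import Decimal


def parse_currency(text):
    """Parse a string such as '$12,587.30' into a Decimal; blank input becomes 0."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    return Decimal(cleaned or '0')


def format_currency(amount):
    """Format a Decimal back into the '$12,587.30' style."""
    sign = '-' if amount < 0 else ''
    return f'{sign}${abs(amount):,.2f}'


col_ammount = '$2,567.67'
col_ammount2 = '$1,587.30'
difference = parse_currency(col_ammount2) - parse_currency(col_ammount)
print(format_currency(difference))
print(format_currency(parse_currency('$12,587.30') + parse_currency('$0.70')))
```

The `cleaned or '0'` expression is the same default-to-zero trick described above, applied before the conversion so that an empty string never reaches Decimal().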
There are two things you are missing here. Firstly, Python's int(...) cannot parse numbers with commas, so you will need to remove the commas as well, using .replace(',',''). Secondly, int() cannot parse strings that contain a decimal point; you will have to use float(...) first and then, if needed, convert the result to an integer using int, math.ceil, or math.floor, as appropriate for your needs.
Maybe something like this will solve your problem:
col_ammount2='$1,587.30'
col_ammount = '$2,567.67'
print(int(float(col_ammount2.lstrip('$').replace(',','')))-int(float(col_ammount.lstrip('$').replace(',',''))))
If you do this sort of thing often in your code, a helper function like this might be handy:
integerify_currency = lambda x:int(float(x.lstrip('$').replace(',','')))
I have the following code to handle Chinese characters and other special characters in a PowerPoint file, because I would like to use the content of the PPT as the file name when saving.
If the content contains certain special characters, an exception is thrown, so I use the following code to handle it.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message but I don't understand how to resolve it. I know the code if not (char in '<>:"/\|?*'): is to convert the character to ASCII code number, right?
Is there any example to fix my problem in Python 3?
def rm_invalid_char(self,str):
    final=""
    dosnames=['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char)>31:
                final+=char
    if final in dosnames:
        #oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '')=='':
        print ('final string is all periods!')
        pass
    return final
Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
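A short demonstration of that difference, assuming the caller hands over bytes. The Latin-1 encoding and the sample name are assumptions chosen so that the first byte is 232, as in the error described above:

```python
# Bytes such as Python 2 code often passes around (sample value is made up).
data = '\u00e8.ppt'.encode('latin-1')

# Iterating over bytes in Python 3 yields integers, not one-character strings.
first = next(iter(data))
print(type(first), first)

try:
    first in '<>:"/\\|?*'
except TypeError as exc:
    # Same TypeError as in the question: the left operand is an int.
    print('TypeError:', exc)

# Decoding first gives a str, whose items are characters again.
text = data.decode('latin-1')
print([ch for ch in text if ch not in '<>:"/\\|?*'])
```

The right encoding to decode with depends on where the bytes came from, so treat 'latin-1' here as a placeholder.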
I'm curious why there is something in "str" that is acting like an integer - something strange is going on with the input.
However, I suspect if you:
Change the name of your str value to something else, e.g. char_string
Right after for char in char_string coerce whatever your input is to a string
then the problem you describe will be solved.
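A minimal sketch of those two suggestions, written as a standalone function rather than a method. The encoding parameter and its 'utf-8' default are assumptions, and the DOS-name check from the original code is left out for brevity:

```python
INVALID_CHARS = set('<>:"/\\|?*')


def rm_invalid_char(name, encoding='utf-8'):
    """Drop characters that are not allowed in Windows file names."""
    if isinstance(name, bytes):
        # Assumed encoding; adjust to whatever the caller actually produces.
        name = name.decode(encoding, errors='replace')
    char_string = str(name)
    # Keep printable characters that are not in the forbidden set.
    return ''.join(
        ch for ch in char_string
        if ch not in INVALID_CHARS and ord(ch) > 31
    )


print(rm_invalid_char('re:port?.ppt'))
print(rm_invalid_char('\u62a5\u544a*.ppt'.encode('utf-8')))
```

Decoding bytes explicitly is preferable to calling str() on them, which in Python 3 would produce the literal text "b'...'" instead of the characters.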
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.