ValueError: could not convert string to float: id - python

I'm running the following python script:
#!/usr/bin/python
import os,sys
from scipy import stats
import numpy as np
f=open('data2.txt', 'r').readlines()
N=len(f)-1
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    list1=[float(x) for x in l1]
    list2=[float(x) for x in l2]
    result=stats.ttest_ind(list1,list2)
    print result[1]
However, I get an error like:
ValueError: could not convert string to float: id
I'm confused by this.
When I try this for a single line in an interactive session, instead of the for loop in the script:
>>> from scipy import stats
>>> import numpy as np
>>> f=open('data2.txt','r').readlines()
>>> w=f[1].split()
>>> l1=w[1:8]
>>> l2=w[8:15]
>>> list1=[float(x) for x in l1]
>>> list1
[5.3209183842, 4.6422726719, 4.3788135547, 5.9299061614, 5.9331108706, 5.0287087832, 4.57...]
It works well.
Can anyone explain a little bit about this?
Thank you.

Obviously some of your lines don't have valid float data; specifically, some line has the text id, which can't be converted to float.
When you try it in the interactive prompt you are testing only a single line (f[1]), so the best way is to print the line where you get this error and you will know which line is wrong, e.g.
#!/usr/bin/python
import os,sys
from scipy import stats
import numpy as np
f=open('data2.txt', 'r').readlines()
N=len(f)-1
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    try:
        list1=[float(x) for x in l1]
        list2=[float(x) for x in l2]
    except ValueError as e:
        # report the offending line and skip it instead of running the t-test
        print("error %s on line %d" % (e, i))
        continue
    result=stats.ttest_ind(list1,list2)
    print(result[1])
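The same idea can be tried without SciPy. The following is a minimal sketch, assuming the file starts with a header row whose second field is the word id (the source does not show data2.txt, so the sample rows and the helper name find_bad_rows are hypothetical):

```python
def find_bad_rows(lines):
    """Return (row_index, token) pairs for the first token per row that float() rejects."""
    bad = []
    for index, line in enumerate(lines):
        # same slice range as the script: fields 1..14 of each row
        for token in line.split()[1:15]:
            try:
                float(token)
            except ValueError:
                bad.append((index, token))
                break  # one report per row is enough
    return bad


# Hypothetical sample: a header row followed by two data rows.
sample = [
    "probe id s1 s2",
    "g1 5.32 4.64 4.37",
    "g2 5.02 oops 4.57",
]
print(find_bad_rows(sample))
```

If the only offending row turns out to be the header, starting the loop at index 1 instead of 0 would likely be enough.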

My error was very simple: the text file containing the data had a trailing whitespace character (and so an invisible one) on the last line.
In the output of grep, I had '45 ' instead of just '45'.

This error is pretty verbose:
ValueError: could not convert string to float: id
Somewhere in your text file, a line has the word id in it, which can't really be converted to a number.
Your test code works because the word id isn't present in line 2.
If you want to catch that line, try this code. I cleaned your code up a tad:
#!/usr/bin/python
import os, sys
from scipy import stats
import numpy as np
with open('data2.txt', 'r') as f:
    for index, line in enumerate(f):
        # split() with no argument also drops the trailing newline and repeated spaces
        w = line.split()
        l1 = w[1:8]
        l2 = w[8:15]
        try:
            list1 = list(map(float, l1))
            list2 = list(map(float, l2))
        except ValueError:
            print('Line {i} is corrupt!'.format(i=index))
            break
        result = stats.ttest_ind(list1, list2)
        print(result[1])

For a Pandas dataframe with a column of numbers with commas, use this:
df["Numbers"] = [float(str(i).replace(",", "")) for i in df["Numbers"]]
So values like 4,200.42 would be converted to 4200.42 as a float.
Bonus 1: This is fast.
Bonus 2: Storing floats rather than strings is more space efficient when saving the dataframe in a format such as Apache Parquet.
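For plain strings outside pandas, the same replacement works on a single value. A minimal sketch, assuming the comma is a thousands separator and not a decimal comma (parse_number is a hypothetical helper name):

```python
def parse_number(text):
    """Convert a value such as '4,200.42' to a float by dropping commas."""
    return float(str(text).replace(",", ""))


print(parse_number("4,200.42"))
```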

Perhaps your numbers aren't actually numbers, but letters masquerading as numbers?
In my case, the font I was using meant that "l" and "1" looked very similar. I had a string like 'l1919' which I thought was '11919' and that messed things up.
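One way to spot such look-alike characters is to list every character that cannot be part of a plain decimal number. This is only a sketch; the allowed set below is an assumption and the helper name suspicious_chars is hypothetical:

```python
def suspicious_chars(token):
    """Return (position, character) pairs that are not digits, signs, '.', or an exponent marker."""
    allowed = set("0123456789+-.eE")
    return [(i, ch) for i, ch in enumerate(token) if ch not in allowed]


print(suspicious_chars("l1919"))  # the first character is a lowercase L
print(suspicious_chars("11919"))
```

Printing repr(token) is another quick check, since it makes stray whitespace and control characters visible.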

Your data may not be what you expect -- it seems you're expecting, but not getting, floats.
A simple solution to figuring out where this occurs would be to add a try/except to the for-loop:
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    try:
        list1=[float(x) for x in l1]
        list2=[float(x) for x in l2]
    except ValueError as e:
        # report the error in some way that is helpful -- maybe print out i
        print("line %d: %s" % (i, e))
        continue
    result=stats.ttest_ind(list1,list2)
    print(result[1])

Shortest way:
df["id"] = df['id'].str.replace(',', '').astype(float)  # if ',' is the problem
df["id"] = df['id'].str.replace(' ', '').astype(float)  # if blank space is the problem

Replace empty-string values with 0.0.
If you know which non-float values can occur, update them before converting:
df.loc[df['score'] == '', 'score'] = 0.0
df['score']=df['score'].astype(float)

I solved a similar situation with a basic technique using pandas. First load the csv or text file using pandas. It's pretty simple:
data=pd.read_csv('link to the file')
Then set the index of data to the respective column that needs to be changed. For example, if your data has ID as one attribute or column, then set the index to ID.
data = data.set_index("ID")
Then delete all the rows with "id" as the value instead of a number, using the following command.
data = data.drop("id", axis=0)
Hope this helps.

Related

How I automate my python script or get multiple entries in one run?

I am running the following python script:
import random
result_str = ''.join((random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!##$%^&*()') for i in range(8)))
with open('file_output.txt','a') as out:
    out.write(f'{result_str}\n')
Is there a way to run this script automatically, or to get multiple outputs at once?
Ex. Right now the output is stored in the file one entry at a time:
kmfd5s6s
I would like to get 1,000,000 entries in the file in one run, with no duplicates.
Same logic as given by PangolinPaws, but since you need 1,000,000 entries, which is quite large, using numpy could be more efficient. Also, random.choice() is replaced with random.choices() with k=8, in order to avoid the for loop when generating the string.
import random
import numpy as np
a = np.array([])
for i in range(1000000):
    # 'candidate' rather than 'str', which would shadow the built-in type
    candidate = ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!##$%^&*()', k = 8))
    if candidate not in a:
        a = np.append(a, candidate)
np.savetxt("generate_strings.csv", a, fmt='%s')
You need to nest your out.write() in a loop, something like this, to make it happen multiple times:
import random
with open('file_output.txt','a') as out:
    for x in range(1000): # the number of lines you want in the output file
        result_str = ''.join((random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!##$%^&*()') for i in range(8)))
        out.write(f'{result_str}\n')
However, while unlikely, it is possible that you could end up with duplicate rows. To avoid this, you can generate and store your random strings in a loop and check for duplicates as you go. Once you have enough, write them all to the file outside the loop:
import random
results = []
while len(results) < 1000: # the number of lines you want in the output file
    result_str = ''.join((random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!##$%^&*()') for i in range(8)))
    if result_str not in results: # check if the generated result_str is a duplicate
        results.append(result_str)
with open('file_output.txt','a') as out:
    out.write( '\n'.join(results) )
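Checking membership in a list scans the whole list each time, which becomes slow as the count approaches 1,000,000. A set is one common alternative. The following is a minimal sketch, assuming the same alphabet as in the question; the helper name unique_strings is hypothetical, and random.choices() needs Python 3.6 or later:

```python
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!##$%^&*()'


def unique_strings(count, length=8, alphabet=ALPHABET):
    """Generate `count` distinct random strings of the given length."""
    seen = set()
    # adding to a set silently ignores duplicates, so loop until it is full
    while len(seen) < count:
        seen.add(''.join(random.choices(alphabet, k=length)))
    return list(seen)


codes = unique_strings(1000)
print(len(codes))
```

The resulting list can then be written to the file with '\n'.join(codes), as in the answer above.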

How to print selected numbers from a list in python?

I want to print all atoms except H (hydrogen) from the pdb file. Here is the file
https://github.com/mahesh27dx/molecular_phys.git
Following code prints the objects of the file
import numpy as np
import mdtraj as md
coord = md.load('alanine-dipeptide-nowater.pdb')
atoms, bonds = coord.topology.to_dataframe()
atoms
The result looks like this
From this table I want to print all the elements except H. I think this can be done in Python using list slicing. Does anyone have an idea how to do it?
Thanks!
You probably should clarify that you want help with mdtraj or pandas specifically.
Anyway, it's as simple as atoms.loc[atoms.element != 'H'].
You can do the following (df here is the atoms DataFrame from the question):
print(*list(df[df.element!='H']['element']))
If you want the unique elements, you can do this:
print(*set(df[df.element!='H']['element']))

How to turn items from extracted data to numbers for plotting in python?

So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but the values are not numbers that I can use for anything. I want to plot them in a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters such as commas and n=, for instance?
Here is the code, and below it is the output of my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on mentioned post) and within a try/except block, so failed conversions are just ignored from your list.
To make it more consistent, any conversion error should discard the whole line affected, so you might want to use intermediate variables and a check for each field.
Btw. I noticed the argument to the extract function, it would seem logical to make the argument a string containing the file name from which to extract the data?
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
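The strtofloat helper mentioned above is not shown in the answer. A minimal sketch of what it might look like, assuming that failed conversions should yield None so the caller can discard them (the trailing-comma handling is also an assumption about the data):

```python
def strtofloat(text):
    """Hypothetical helper: return the text as a float, or None if it cannot be converted."""
    cleaned = text.strip().rstrip(",")
    try:
        return float(cleaned)
    except ValueError:
        return None


values = [strtofloat(word) for word in ["0.5,", "1e-3", "n=4"]]
kept = [v for v in values if v is not None]
print(kept)
```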
I would consider using regular expression:
import re
# raw string, and the exponent may carry an explicit sign, such as 1e+05
match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee][-+]?[0-9]+)?')
delta_x, abs_error, n = [], [], []
with open('approx_derivative_sine.txt') as infile:
    for line in infile:
        words = line.split()
        new_delta_x = float(re.search(match_number, words[1]).group())
        new_abs_error = float(re.search(match_number, words[7]).group())
        new_n = int(re.search(match_number, words[10]).group())
        delta_x.append(new_delta_x)
        abs_error.append(new_abs_error)
        n.append(new_n)
But it seems like your data is already in csv format. So try using pandas.
Then read data into dataframe without header (column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, If you would like to remove the 'n=' and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)

matplotlib.pyplot.plot, ValueError: could not convert string to float: f

I'm trying to use python to make plots. Below is a simplified version of my code that causes the error.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use("AGG")
distance = np.array('f')
depth = np.array('f')# make sure these two arrays store float type value
with open('Line671.txt','r') as datafile:
    for line in datafile:
        word_count = 0
        for word in line.split():
            word = float(word)#convert string type to float
            if word_count == 0:
                distance = np.append(distance, word)
            elif word_count == 1:
                depth = np.append(depth, word)
            else:
                print 'Error'
            word_count += 1
datafile.closed
print depth
print distance #outputs looks correct
# original data is like this: -5.3458000e+00
# output of the array is :['f' '-5.3458' '-5.3463' ..., '-5.4902' '-5.4912' '-5.4926']
plt.plot(depth, distance)# here comes the problem
The error message says that on the line plt.plot(depth, distance): ValueError: could not convert string to float: f
I don't understand this because it seems I converted all string values into float type. I searched for this problem on stackoverflow, but in those cases the problem was solved once all string values were cast to float or int. Can anyone give a suggestion on this problem? I would really appreciate any help.
You confused the value with the type. If you're trying to declare the type, you need to use "dtype=". What you actually did was to stick a single character into the array, which gives it a string dtype, so every float you append is converted back to a string.
To answer a later question, your line
word = float(word)
likely worked just fine. However, we can't tell because you didn't do anything with the resulting value. Are you expecting this to alter the original inside the variable "line"? Common variables don't work that way.
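One way to sidestep the problem is to collect the values in ordinary lists, which start out empty and hold real floats. This is a sketch under the assumption that each line holds exactly two whitespace-separated numbers; read_two_columns and the sample lines are hypothetical:

```python
def read_two_columns(lines):
    """Parse lines of 'distance depth' text into two lists of floats."""
    distance, depth = [], []
    for line in lines:
        words = line.split()
        if len(words) != 2:
            continue  # skip blank or malformed lines
        distance.append(float(words[0]))
        depth.append(float(words[1]))
    return distance, depth


sample = ["0.0000000e+00 -5.3458000e+00", "1.0000000e+00 -5.3463000e+00"]
distance, depth = read_two_columns(sample)
print(distance, depth)
```

The two lists can be passed to plt.plot directly, or converted with np.array(distance) if an array is needed.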

Python: Create coordinate list (convert string to int)

I want to import several coordinates (could add up to 20,000) from a text file.
These coordinates need to be added to a list that looks like the following:
coords = [[0,0],[1,0],[2,0],[0,1],[1,1],[2,1],[0,2],[1,2],[2,2]]
However, when I try to import the coordinates I get the following error:
invalid literal for int() with base 10
I can't figure out how to import the coordinates correctly.
Does anyone have any suggestions as to why this does not work?
I think there's some problem with creating the integers.
I use the following script:
Bronbestand = open("D:\\Documents\\SkyDrive\\afstuderen\\99 EEM - Abaqus 6.11.2\\scripting\\testuitlezen4.txt", "r")
headerLine = Bronbestand.readline()
valueList = headerLine.split(",")
xValueIndex = valueList.index("x")
#xValueIndex = int(xValueIndex)
yValueIndex = valueList.index("y")
#yValueIndex = int(yValueIndex)
coordList = []
for line in Bronbestand.readlines():
    segmentedLine = line.split(",")
    coordList.extend([segmentedLine[xValueIndex], segmentedLine[yValueIndex]])
coordList = [x.strip(' ') for x in coordList]
coordList = [x.strip('\n') for x in coordList]
coordList2 = []
#CoordList3 = [map(int, x) for x in coordList]
for i in coordList:
    coordList2 = [coordList[int(i)], coordList[int(i)]]
print "coordList = ", coordList
print "coordList2 = ", coordList2
#print "coordList3 = ", coordList3
The coordinates to be imported look like this (this is "Bronbestand" in the script):
id,x,y,
1, -1.24344945, 4.84291601
2, -2.40876842, 4.38153362
3, -3.42273545, 3.6448431
4, -4.22163963, 2.67913389
5, -4.7552824, 1.54508495
6, -4.99013376, -0.313952595
7, -4.7552824, -1.54508495
8, -4.22163963, -2.67913389
9, -3.42273545, -3.6448431
Thus the script should result in:
[[-1.24344945, 4.84291601],[-2.40876842, 4.38153362],[-3.42273545, 3.6448431],[-4.22163963, 2.67913389],[-4.7552824, 1.54508495],[-4.99013376,-0.313952595],[-4.7552824, -1.54508495],[-4.22163963, -2.67913389],[-3.42273545, -3.6448431]]
I also tried importing the coordinates with the native python csv parser but this didn't work either.
Thank you all in advance for the help!
Your numbers are not integers so the conversion to int fails.
Try using float(i) instead of int(i) to convert into floating point numbers instead.
>>> int('1.5')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    int('1.5')
ValueError: invalid literal for int() with base 10: '1.5'
>>> float('1.5')
1.5
Other answers have said why your script fails, however, there is another issue here - you are massively reinventing the wheel.
This whole thing can be done in a couple of lines using the csv module and a list comprehension:
import csv
with open("test.csv") as file:
    data = csv.reader(file)
    next(data)
    print([[float(x) for x in line[1:]] for line in data])
Gives us:
[[-1.24344945, 4.84291601], [-2.40876842, 4.38153362], [-3.42273545, 3.6448431], [-4.22163963, 2.67913389], [-4.7552824, 1.54508495], [-4.99013376, -0.313952595], [-4.7552824, -1.54508495], [-4.22163963, -2.67913389], [-3.42273545, -3.6448431]]
We open the file, make a csv.reader() to parse the csv file, skip the header row, then make a list of the numbers parsed as floats, ignoring the first column.
As pointed out in the comments, as you are dealing with a lot of data, you may wish to iterate over the data lazily. While making a list is good to test the output, in general, you probably want a generator rather than a list. E.g:
([float(x) for x in line[1:]] for line in data)
Note that the file will need to remain open while you utilize this generator (remain inside the with block).
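The lazy variant can be wrapped in a generator function. The following is a minimal sketch that uses an in-memory copy of the first rows from the question in place of the real file; the name iter_coords is hypothetical:

```python
import csv
import io

# First rows of the data shown in the question, held in memory for illustration.
SAMPLE = """id,x,y,
1, -1.24344945, 4.84291601
2, -2.40876842, 4.38153362
"""


def iter_coords(handle):
    """Yield [x, y] float pairs one row at a time, skipping the header."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for row in reader:
        if row:  # ignore empty lines
            yield [float(value) for value in row[1:3]]


with io.StringIO(SAMPLE) as handle:
    # the generator must be consumed while the handle is still open
    coords = list(iter_coords(handle))
print(coords)
```

With a real file, open(...) would replace io.StringIO(SAMPLE), and the loop over iter_coords(handle) would stay inside the with block.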
