Python reading file error - python

I'm trying to read a list of numbers from a text file, and I get this error when I run my code:
ValueError: invalid literal for float(): -4.4987000e-01 -2.0049000e-01 -4.8729000e-01 -6.1085000e-02 -5.1024000e-02 -2.1653000e-02
Here is my code:
def read_data_file(datafile, token):
    dataset = []
    with open(datafile, 'r') as file:
        for line in file:
            # split each word by token
            data = line[:-1].split(token)
            tmp = []
            for x in data:
                if x != "":
                    x = float(x)
                    tmp.append(x)
                else:
                    tmp.append(1e+99)
            dataset.append(tmp)
    return dataset
The program encounters the error at the line: x = float(x)

You have not provided the full code and data file needed to reproduce your problem exactly, but it's possible to guess the issue from the error message.
Your line[:-1].split(token) failed to break line up into single-number strings. Either you picked an incorrect separator (token is a poor choice of name for it), or omitting the last character of line broke things, or both.
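To make the failure mode concrete, here is a small sketch (assuming, hypothetically, that the file separates numbers with runs of spaces rather than single spaces):

```python
# split(' ') treats every single space as a separator, so a double space
# yields an empty string; the default split() collapses whitespace runs.
line = "-4.4987000e-01  -2.0049000e-01\n"   # note the double space
assert "" in line[:-1].split(' ')           # empty token between the numbers
assert line.split() == ['-4.4987000e-01', '-2.0049000e-01']
```

An empty token like this is harmless to `split` but fatal to `float("")`.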
Try this (it also removes many unnecessary lines):
def read_data_file(datafile):
    dataset = []
    # no need to assign open() to a name; the file is released when the loop finishes
    for line in open(datafile, 'r'):
        # the default separator (None) splits on any run of whitespace
        dataset.append([float(token) for token in line.split()])
        # dataset.append(1e+99)  # do you really need to append a fake number?
        # a list of lists might be a better way
    return dataset

This assumes that the data in your file is on one line, like below:
-4.4987000e-01 -2.0049000e-01 -4.8729000e-01 -6.1085000e-02 -5.1024000e-02 -2.1653000e-02
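The default whitespace split can be checked directly on that sample line:

```python
# Parse one line of space-separated scientific-notation numbers
line = "-4.4987000e-01 -2.0049000e-01 -4.8729000e-01"
row = [float(token) for token in line.split()]
# row is [-0.44987, -0.20049, -0.48729]
```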
I made a few adjustments and ran your code as shown below, and it worked. It may not be what you want; if so, you need to be more specific about what you want. If the data in your file doesn't look like the above, you also need to show us exactly how it appears in your file.
def read_data_file(datafile, token):
    dataset = []
    with open(datafile, 'r') as file:
        for line in file:
            # split each word by token
            data = line[:-1].split(token)
            tmp = []
            for x in data:
                if x != "":
                    x = float(x)
                    tmp.append(x)
                else:
                    tmp.append(1e+99)
            dataset.append(tmp)
    return dataset

dataset = read_data_file('test_data.txt', ' ')
print(dataset)
# output
'''
[[-0.44987, -0.20049, -0.48729, -0.061085, -0.051024, -0.021653]]
'''
As mentioned, using token as a parameter name is not a good choice, but it will work. If your data is always going to be space-separated and on a single line, drop the token parameter and do this: data = line[:-1].split(' ')

Use str.strip to remove any trailing whitespace, then split by the required token. You can use map to apply float to each element of the list.
Ex:
def read_data_file(datafile, token):
    dataset = []
    with open(datafile, 'r') as file:
        for line in file:
            # strip the newline, split by token, convert each field to float
            dataset.append(list(map(float, line.strip().split(token))))
    return dataset

You need to check each value before calling float to make sure it contains no other data that would cause float to raise this error. Often the culprit is the NUL character \x00. You will need to call something like:
filtered_value = x.replace("\x00", "")
filtered_value = float(filtered_value)
tmp.append(filtered_value)
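A minimal sketch of that filtering on a value polluted with a NUL byte:

```python
# Strip embedded NUL characters before converting to float
raw = "-4.4987000e-01\x00"
filtered_value = raw.replace("\x00", "")
value = float(filtered_value)
# value is -0.44987
```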


The function is not applied correctly, and the file is not created properly

from hanspell import spell_checker
import csv

with open('test 1st.txt', 'r', encoding='utf-8') as file:
    underinput = file.readlines()

input = []
for i in underinput:
    temp = i.replace('.', '.#').split('#')
    input.append(temp)

input_list = [""]
for i in input:
    print(i)
    if len(input_list[-1]) + len(i) < 500:
        input_list[-1] += i
    else:
        input_list.append(i)

result = spell_checker.check(input_list)
memo = [result[0].result]
file = open('hello.txt', 'w')
file.write(memo)
file.close()
There are many Korean characters in that file. I want to spell-check it and write the corrected text to an output file, but I get several errors.
The error message looks like this:
This seems to be because spell_checker.check returns a tuple of:
result (bool, True if the input has been checked)
original (str, the original string)
errors (int, the number of errors in the input)
words (dict, list of words with their error counts)
time (float, time taken to complete)
Your file should be written once you have parsed this data, or after casting its elements to strings.
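A minimal sketch of the cast: file.write() needs a str, so convert (or join) the parsed result first. Here memo is a hypothetical stand-in list, not real hanspell output, and io.StringIO stands in for open('hello.txt', 'w'):

```python
import io

memo = ["교정된 문장입니다."]   # stand-in for the parsed spell-check result
buf = io.StringIO()             # stands in for open('hello.txt', 'w')
buf.write("".join(memo))        # join the pieces instead of writing the list itself
```

Writing str(memo) would also avoid the error, but it stores the list's repr (brackets and quotes included), so joining is usually what you want.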

Python - Pulling data from a file that matches a parameter

I have a file that contains information about users and the number of times they have logged in. I am trying to pull all users with a login count >= 250 and save them to another file. I am new to Python and keep getting an "invalid literal with base 10" error when running this portion of my code. Can anyone help me out and explain why this happens, so I can prevent it in the future? TIA
def main():
    userInformation = readfile("info")
    suspicious = []
    for i in userInformation:
        if int(i[2]) >= 250:
            suspicious.append(i)
Full code below if needed:
#Reading the file function
def readFile(filename):
file = open(filename,'r')
lines = [x.split('\n')[0].split(';') for x in file.readlines()]
file.close()
return lines
def writeFile(suspicious):
file = open('suspicious.txt','w')
for i in suspicious:
file.write('{};{};{};{}\n'.format(i[0],i[1],i[2],i[3]))
file.close()
def main()
userInformation = readfile("info")
suspicious = []
for i in userInformation :
if(int(i[2])>=250):
suspicious.append(i)
writeFile(suspicious)
print('Suspicious users:')
for i in suspicious:
print('{} {}'.format(i[0],i[1]))
main()
Here are some lines from my file:
Jodey;Lamins;278
Chris;Taylors;113
David;Mann;442
etc
etc
"invalid literal for int() with base 10" occurs when you pass int() a string that is not a valid base-10 integer. In other words, i[2] is not a valid integer string (most likely it contains extra characters that you're incorrectly trying to convert). Also, it would be best to correctly format your main function.
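You can see what int() tolerates, and the exact error, in a few lines:

```python
# int() strips surrounding whitespace but rejects anything else
assert int("278") == 278
assert int(" 278\n") == 278
try:
    int("278 logins")   # non-digit text raises the error from the question
except ValueError as e:
    assert "invalid literal" in str(e)
```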
Ok, so I took your example file and played with it a little. The issues I faced were mostly spacing issues. So, here's the code you might like -
UsersInfoFileName = '/path/to/usersinfofile.txt'
MaxRetries = 250
usersWithExcessRetries = []

with open(UsersInfoFileName, 'r') as f:
    lines = f.readlines()

consecutiveLines = (line.strip() for line in lines if line.strip())
for line in consecutiveLines:
    if int(line.split(';')[-1]) > MaxRetries:
        usersWithExcessRetries.append(line)

for suspUsers in usersWithExcessRetries:
    print(suspUsers)
Here's what it does -
Reads all lines in the given file
Filters all lines by excluding lines which may be empty
Removes surrounding white spaces for the remaining lines
Reads last semi-colon separated value, and compares it with MaxRetries
Adds the original line to a list if the value exceeds MaxRetries
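Run against the sample lines from the question (with io.StringIO standing in for the real file, and using the question's >= 250 threshold), the same filtering logic gives:

```python
import io

sample = "Jodey;Lamins;278\n\nChris;Taylors;113\nDavid;Mann;442\n"
f = io.StringIO(sample)
# drop blank lines, strip whitespace, keep users at or above the threshold
lines = (line.strip() for line in f if line.strip())
suspicious = [line for line in lines if int(line.split(';')[-1]) >= 250]
# suspicious is ['Jodey;Lamins;278', 'David;Mann;442']
```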

Replacement for isAlpha() to include underscores?

I am processing data using Python3 and I need to read a results file that looks like this:
ENERGY_BOUNDS
1.964033E+07 1.733253E+07 1.491825E+07 1.384031E+07 1.161834E+07 1.000000E+07 8.187308E+06 6.703200E+06
6.065307E+06 5.488116E+06 4.493290E+06 3.678794E+06 3.011942E+06 2.465970E+06 2.231302E+06 2.018965E+06
GAMMA_INTERFACE
0
EIGENVALUE
1.219034E+00
I want to search the file for a specific identifier (in this case ENERGY_BOUNDS), begin reading the numeric values after this identifier but not the identifier itself, and stop when I reach the next identifier. However, my problem is that I was using isAlpha to find the next identifier, and some of them contain underscores. Here is my code:
def read_data_from_file(file_name, identifier):
    with open(file_name, 'r') as read_obj:
        list_of_results = []
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if it contains the string
            if identifier in line:
                # If yes, read the next line
                nextValue = next(read_obj)
                # Keep reading until the next identifier appears
                while not nextValue.strip().isalpha():
                    list_of_results.extend(nextValue.split())
                    nextValue = next(read_obj)
    return list_of_results
I think I need to use regex, but I am stuck regarding how to phrase it. Any help would be much appreciated!
take = False
with open('path/to/input') as infile:
    for line in infile:  # iterate over the file object
        if line.strip() == "ENERGY_BOUNDS":
            take = True
            continue  # we don't actually want this line
        if all(char.isalpha() or char == "_" for char in line.strip()):  # we've hit the next section
            take = False
        if take:
            print(line)  # or whatever else you want to do with this line
Here's an option for you.
Just iterate over the file until you hit the identifier.
Then iterate over it in another for loop until the next identifier causes a ValueError.
def read_data_from_file(file_name, identifier):
    with open(file_name, 'r') as f:
        list_of_results = []
        for line in f:
            if identifier in line:
                break
        for line in f:
            try:
                list_of_results.extend(map(float, line.split()))
            except ValueError:
                break
    return list_of_results
You can use this regex: ^[A-Z]+(?:_[A-Z]+)*$
Additionally, you can modify the regex to match identifiers of a custom length, like this: ^[A-Z]{2,10}(?:_[A-Z]+)*$, where {2,10} is {MIN,MAX} length:
You can find this demo here: https://regex101.com/r/9jESAH/35
See this answer for more details.
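A quick check of the first pattern against lines from the results file in the question:

```python
import re

# Matches all-caps identifiers with optional underscore-separated parts
ident = re.compile(r'^[A-Z]+(?:_[A-Z]+)*$')
assert ident.match("ENERGY_BOUNDS")
assert ident.match("EIGENVALUE")
assert not ident.match("1.964033E+07")       # data lines are rejected
assert not ident.match("GAMMA_INTERFACE_")   # trailing underscore rejected
```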
Here is a simple function that verifies a string contains only letters (upper- or lowercase) and underscores:
import re

RE_PY_VAR_NAME = "^[a-zA-Z_]+$"

def isAlphaUscore(s: str) -> bool:
    assert s is not None, "s cannot be None"
    return re.search(RE_PY_VAR_NAME, s) is not None

Python 2.7 mixing iteration and read methods would lose data

I have an issue with a bit of code that works in Python 3, but fail in 2.7. I have the following part of code:
def getDimensions(file, log):
    noStations = 0
    noSpanPts = 0
    dataSet = False
    if log:
        print("attempting to retrieve dimensions. Opening file", file)
    while not dataSet:
        try:  # read until error occurs
            string = file.readline().rstrip()  # to avoid breaking on an empty line
        except IOError:
            break
        # stations
        if "Ax dist hub" in string:  # parse out number of stations
            if log:
                print("found ax dist hub location")
            next(file)  # skip empty line
            eos = False  # end of stations
            while not eos:
                string = file.readline().rstrip()
                if string == "":
                    eos = True
                else:
                    noStations = int(string.split()[0])
This returns an error:
ValueError: Mixing iteration and read methods would lose data.
I understand that the issue is how I read my string in the while loop, or at least that is what I believe. Is there a quick way to fix this? Any help is appreciated. Thank you!
The problem is that you are using next and readline on the same file. As the docs say:
As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right.
The fix is trivial: replace next with readline.
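A short sketch of the readline()-only approach, with io.StringIO standing in for the file and made-up sample data:

```python
import io

f = io.StringIO("Ax dist hub\n\n12 0.5\n")
header = f.readline().rstrip()
f.readline()                            # skip the empty line (instead of next(f))
noStations = int(f.readline().split()[0])
# header is "Ax dist hub", noStations is 12
```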
If you want short code for that, try:
lines = []
with open(filename) as f:
    lines = [line for line in f if line.strip()]
Then you can run your checks against lines.

Extracting data from a file using regular expressions and storing in a list to be compiled into a dictionary- python

I've been trying to extract both the species name and sequence from a file as depicted below in order to compile a dictionary with the key corresponding to the species name (FOX2_MOUSE for example) and the value corresponding to the Amino Acid sequence.
Sample fasta file:
>sp|P58463|FOXP2_MOUSE
MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
>sp|Q8MJ98|FOXP2_PONPY
MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
I've tried using my code below:
import re

InFileName = "foxp2.fasta"
InFile = open(InFileName, 'r')

Species = []
Sequence = []

reg = re.compile('FOXP2_\w+')
for Line in InFile:
    Species += reg.findall(Line)
print Species

reg = re.compile('(^\w+)')
for Line in InFile:
    Sequence += reg.findall(Line)
print Sequence

dictionary = dict(zip(Species, Sequence))
InFile.close()
However, the output for my two lists is:
['FOXP2_MOUSE', 'FOXP2_PONPY']
[]
Why is my second list empty? Are you not allowed to use re.compile() twice? Any suggestions on how to circumvent my problem?
Thank you,
Christy
If you want to read a file twice, you have to seek back to the beginning.
InFile.seek(0)
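The effect is easy to reproduce with io.StringIO standing in for the file:

```python
import io

f = io.StringIO("line1\nline2\n")
first = [line for line in f]     # consumes the file
second = [line for line in f]    # pointer is at EOF, so this is empty
f.seek(0)                        # rewind
third = [line for line in f]     # same as the first pass
```

That is why the second findall loop collected nothing: the first loop had already exhausted the file.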
You can do it in a single pass, and without regular expressions:
def load_fasta(filename):
    data = {}
    species = ""
    sequence = []
    with open(filename) as inf:
        for line in inf:
            line = line.strip()
            if line.startswith(";"):  # is comment?
                pass  # skip it
            elif line.startswith(">"):  # start of new record?
                # save previous record (if any)
                if species and sequence:
                    data[species] = "".join(sequence)
                species = line.split("|")[2]
                sequence = []
            else:  # continuation of previous record
                sequence.append(line)
    # end of file - finish storing the last record
    if species and sequence:
        data[species] = "".join(sequence)
    return data

data = load_fasta("foxp2.fasta")
On your given file, this produces data ==
{
'FOXP2_PONPY': 'MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK',
'FOXP2_MOUSE': 'MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK'
}
You could also do this in a single pass with a multiline regex:
import re

reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta", 'r') as file:
    data = dict(reg.findall(file.read()))
The downside is that you have to read the whole file in at once. Whether this is a problem depends on likely file sizes.
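A quick run of that regex against a shortened two-record sample (note the captured sequences keep their internal newlines, so strip them afterwards):

```python
import re

text = (">sp|P58463|FOXP2_MOUSE\nMMQESA\nHLQQQQ\n"
        ">sp|Q8MJ98|FOXP2_PONPY\nMMQESV\n")
reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
# remove the newlines left inside each captured sequence
data = {name: seq.replace("\n", "") for name, seq in reg.findall(text)}
# data is {'FOXP2_MOUSE': 'MMQESAHLQQQQ', 'FOXP2_PONPY': 'MMQESV'}
```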
