I am using Jupyter with Python 3. I tried to import data from a .tsp file, but it keeps showing me this error. I saw that some people had the same problem and solved it with a conversion, but that did not work on my code.
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION : 52
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
8 525.0 1000.0
9 580.0 1175.0
10 650.0 1130.0
# Open input file
infile = open(r'C:\Users\13136\OneDrive\Desktop\AI\berlin52.tsp')
# Read instance header
Name = infile.readline().strip().split()[1] # NAME
FileType = infile.readline().strip().split()[1] # TYPE
Comment = infile.readline().strip().split()[1] # COMMENT
Dimension = infile.readline().strip().split()[1] # DIMENSION
EdgeWeightType = infile.readline().strip().split()[1] # EDGE_WEIGHT_TYPE
infile.readline()
# Read node list
nodelist = []
N = int(Dimension)
for i in range(0, int(Dimension)):
    x, y = infile.readline().strip().split()[1:]
    nodelist.append([float(x), float(y)])
# Close input file
infile.close()
ValueError Traceback (most recent call last)
<ipython-input-22-5e3fe725955a> in <module>
12 # Read node list
13 nodelist = []
---> 14 N = int(Dimension)
15 for i in range(0, int(Dimension)):
16 x,y = infile.readline().strip().split()[1:]
ValueError: invalid literal for int() with base 10: ':'
Name = infile.readline().strip().split(':')[1] # NAME
FileType = infile.readline().strip().split(':')[1] # TYPE
Comment = infile.readline().strip().split(':')[1] # COMMENT
Dimension = infile.readline().strip().split(':')[1] # DIMENSION
EdgeWeightType = infile.readline().strip().split(':')[1] # EDGE_WEIGHT_TYPE
The DIMENSION and EDGE_WEIGHT_TYPE lines in your file do not have the : immediately following the name, but have some extra space in between, so split() splits these lines at each space, into three parts, e.g.:
['DIMENSION', ':', '52']
You are selecting the second part, which cannot be interpreted as an int. You always want the second part of the line after splitting it by :, not by whitespace; that is what split(':') does for you, e.g.:
['DIMENSION ', ' 52']
The extra whitespace could be removed with a .strip() call after these lines, but int() also accepts it as is.
Dimension = infile.readline().split(':')[1].strip()
This will still cut off fields containing an extra :, but I suppose such special cases are not that important to you here.
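Putting it together, a minimal sketch of the whole read could look like this (my own illustration, assuming berlin52.tsp sits in the working directory and the header lines always appear in this exact order):

with open('berlin52.tsp') as infile:
    # Each header line looks like 'KEY : value'; split on ':' and strip spaces
    Name = infile.readline().split(':')[1].strip()
    FileType = infile.readline().split(':')[1].strip()
    Comment = infile.readline().split(':')[1].strip()
    Dimension = int(infile.readline().split(':')[1])
    EdgeWeightType = infile.readline().split(':')[1].strip()
    infile.readline()                          # skip NODE_COORD_SECTION
    nodelist = []
    for _ in range(Dimension):
        _, x, y = infile.readline().split()    # node index, x, y
        nodelist.append([float(x), float(y)])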
Related
I've tried literally everything to make this work. What I'm trying to do is take a file, assign each line to a variable, and then set the type of the variable. Instead, indexing the result gives me characters like [ and ' rather than a line, and I don't know what to do. I also have lists inside the file that I need to save.
My error is:
ValueError: invalid literal for int() with base 10: '['
My code is:
def load_data():
    f = open(name+".txt",'r')
    enter = str(f.readlines()).rstrip('\n')
    print(enter)
    y = enter[0]
    hp = enter[1]
    coins = enter[2]
    status = enter[3]
    y2 = enter[4]
    y3 = enter[5]
    energy = enter[6]
    stamina = enter[7]
    item1 = enter[8]
    item2 = enter[9]
    item3 = enter[10]
    equipped = enter[11]
    firstime = enter[12]
    armorpoint1 = enter[13]
    armorpoint2 = enter[14]
    armorpoints = enter[15]
    upgradepoint1 = enter[16]
    upgradepoint2 = enter[17]
    firstime3 = enter[18]
    firstime4 = enter[19]
    part2 = enter[20]
    receptionist = enter[21]
    unlocklist = enter[22]
    armorlist = enter[23]
    heal1 = enter[24]
    heal2 = enter[25]
    heal3 = enter[26]
    unlocked = enter[27]
    unlocked2 = enter[28]
    float(int(y))
    int(hp)
    int(coins)
    str(status)
    float(int(y2))
    float(int(y3))
    int(energy)
    int(stamina)
    str(item1)
    str(item2)
    str(item3)
    str(equipped)
    int(firstime)
    int(armorpoint1)
    int(armorpoint2)
    int(armorpoints)
    int(upgradepoint1)
    int(upgradepoint2)
    int(firstime3)
    int(firstime4)
    list(unlocklist)
    list(armorlist)
    int(heal1)
    int(heal2)
    int(heal3)
    f.close()
SAMPLE FILE:
35.0
110
140
Sharpshooter
31.5
33
11
13
Slimer Gun
empty
empty
Protective Clothes
0
3
15
0
3
15
0
1
False
False
['Slime Slicer', 'Slimer Gun']
['Casual Clothes', 'Protective clothes']
4
3
-1
{'Protective Clothes': True}
{'Slimer Gun': True}
The .readlines() function returns a list, each item containing a separate line. In order to strip the newline from each of the lines, you can use a list comprehension:
f = open("data.txt", "r")
lines = [line.strip() for line in f.readlines()]
You can then proceed to cast each item in the list separately, as you have, or try to somehow automatically infer the type in a loop. This would be easier if you formatted the example file more like a configuration file. This thread has some relevant answers:
Best way to retrieve variable values from a text file?
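If you do want to infer the types automatically rather than casting each item by hand, a rough sketch (my own illustration; data.txt is a placeholder filename) could try int, then float, then ast.literal_eval for the list and dict lines:

import ast

def parse_value(text):
    # Try int, then float, then a Python literal (lists, dicts, booleans);
    # fall back to the raw string if nothing matches.
    for convert in (int, float, ast.literal_eval):
        try:
            return convert(text)
        except (ValueError, SyntaxError):
            pass
    return text

with open("data.txt") as f:
    values = [parse_value(line.strip()) for line in f]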
I think it's better if you read the file this way: first read the whole file and strip trailing whitespace, then split it into lines. Then you can assign a variable to each line (note that you also need to assign the result of each type conversion back to the variable).
For lists, you may need a function to extract the list from the string. However, if you aren't expecting security breaches, then using eval() should be fine.
def load_data():
    f = open(name + ".txt", 'r')
    content = f.read().rstrip()      # read everything, drop trailing whitespace
    lines = content.split("\n")      # one string per line
    y = float(lines[0])
    hp = int(lines[1])
    coins = int(lines[2])
    status = lines[3]
    # (etc)
    unlocklist = eval(lines[22])
    armorlist = eval(lines[23])
    f.close()
I have this string of length 66:
RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000. Two letters (a keyword like RP) indicate that what follows is the value, up to the next keyword.
Now I am parsing the string assuming the number of bytes between two keywords is constant, i.e. 000729 between RP and SP. The following is the code to parse it that way.
msgStr = "RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000"
Ppm = msgStr[msgStr.find("RP")+2:msgStr.find("SP")]
Speed = msgStr[msgStr.find("SP")+2:msgStr.find("CT")]
Coolent_temp = msgStr[msgStr.find("CT")+2:msgStr.find("ET")]
ETime = msgStr[msgStr.find("ET")+2:msgStr.find("EL")]
E_load = msgStr[msgStr.find("EL")+2:msgStr.find("TP")]
Throttle_pos = msgStr[msgStr.find("TP")+2:msgStr.find("DS")]
Distance = msgStr[msgStr.find("DS")+2:msgStr.find("MF")]
MAF = msgStr[msgStr.find("MF")+2:msgStr.find("LT")]
Lat = msgStr[msgStr.find("LT")+2:msgStr.find("LN")]
Lon = msgStr[msgStr.find("LN")+2:]
print Ppm, Speed, Coolent_temp, ETime, E_load, Throttle_pos, Distance, MAF, Lat, Lon
Output:
000729 001 087 02367 048 020 042 0220 9.300000 4.500000
Now I want to handle the case where there can be any number of bytes between two keywords. Examples are given below.
Example No. 1:
Example1_msgStr= "RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000"
Expected Output 1:
729 14 087 2367 48 20 42 0220 0.000000 0.000000
Example No. 2:
Example2_msgStr = "RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00"
Expected Output 2:
72956 134 874 02367 458 20 042 0220 53.000 45.00
You should use a regular expression to find variable length matches between two strings:
import re
regex = r'RP(\d+)SP'
strings = ['RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000',
'RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00']
for string in strings:
    match = re.search(regex, string)
    print('Matched:', match.group(1))
In the regex, the brackets () specify a group to capture, and \d+ means one or more digit characters. So the whole regex RP(\d+)SP will find a variable-length string of digits between RP and SP.
This shows you how to do one case; you'll need to loop through your delimiters (RP, SP, CT, etc.) to capture all the information you want. If the delimiters always come in the same order, you can build one enormous regex to capture all the groups at once, as sketched below ...
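For instance, a single combined pattern could look like this sketch, assuming the keywords always appear in this fixed order and only the LT and LN fields may contain a decimal point:

import re

pattern = re.compile(
    r'RP(\d+)SP(\d+)CT(\d+)ET(\d+)EL(\d+)TP(\d+)DS(\d+)MF(\d+)'
    r'LT([\d.]+)LN([\d.]+)')

msg = "RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000"
m = pattern.match(msg)
if m:
    Ppm, Speed, Coolent_temp, ETime, E_load, Throttle_pos, Distance, MAF, Lat, Lon = m.groups()
    print(Ppm, Speed, Coolent_temp, ETime, E_load, Throttle_pos, Distance, MAF, Lat, Lon)
    # -> 729 14 087 2367 48 20 42 0220 0.000000 0.000000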
If you want to check for characters between your delimiters, you can use your code and then turn any of your variables into bool type. If the string is not empty it means that there is something in that string and thus it returns True. If the string is empty, it returns False:
msgStr = 'RP000729SP001CT087ET02367EL048TP020DS042MF0220LT9.300000LN4.500000'
rpm = msgStr[msgStr.find("RP")+2:msgStr.find("SP")] # Outputs '000729'
Speed = msgStr[msgStr.find("SP")+2:msgStr.find("CT")] # Outputs '001'
coolent_temp = msgStr[msgStr.find("CT")+2:msgStr.find("ET")] # Outputs '087'
ETime = msgStr[msgStr.find("ET")+2:msgStr.find("EL")] # '02367'
e_load = msgStr[msgStr.find("EL")+2:msgStr.find("TP")] # '048'
throttle_pos = msgStr[msgStr.find("TP")+2:msgStr.find("DS")] # '020'
Distance = msgStr[msgStr.find("DS")+2:msgStr.find("MF")] # Outputs '042'
MAF = msgStr[msgStr.find("MF")+2:msgStr.find("LT")] # Outputs '0220'
Lat = msgStr[msgStr.find("LT")+2:msgStr.find("LN")] # Outputs '9.300000'
Lon = msgStr[msgStr.find("LN")+2:] # Outputs '4.500000'
bool(rpm) # Outputs True
bool(Speed) # Outputs True
bool(coolent_temp) # Outputs True
bool(ETime) # Outputs True
bool(e_load) # Outputs True
bool(throttle_pos) # Outputs True
bool(Distance) # Outputs True
bool(MAF) # Outputs True
bool(Lat) # Outputs True
bool(Lon) # Outputs True
You could check several fields being not empty at the same time:
all_filled = bool(rpm) and bool(Speed) and bool(coolent_temp) and \
bool(ETime) and bool(e_load) and bool(throttle_pos) and bool(Distance) \
and bool(MAF) and bool(Lat) and bool(Lon)
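The same check can be written more compactly with the built-in all(), which is True only when every value in the list is truthy (here: every string is non-empty):

all_filled = all([rpm, Speed, coolent_temp, ETime, e_load,
                  throttle_pos, Distance, MAF, Lat, Lon])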
The way your code is set up, if you try it with Example1_msgStr you already get the separation you desired:
Example1_msgStr= "RP729SP14CT087ET2367EL48TP20DS42MF0220LT0.000000LN0.000000"
# ... your code ...
print (rpm, Speed, coolent_temp, ETime, e_load, throttle_pos, Distance, MAF, Lat, Lon)
#> 729 14 087 2367 48 20 42 0220 0.000000 0.000000
Example2_msgStr= "RP72956SP134CT874ET02367EL458TP20DS042MF0220LT53.000LN45.00"
# ... your code ...
print (rpm, Speed, coolent_temp, ETime, e_load, throttle_pos, Distance, MAF, Lat, Lon)
#> 72956 134 874 02367 458 20 042 0220 53.000 45.00
I am having the hardest time figuring out why the scientific notation string I am passing into the float() function will not work:
time = []
WatBalR = []
Area = np.empty([1,len(time)])
Volume = np.empty([1,len(time)])
searchfile = open("C:\GradSchool\Research\Caselton\Hydrus2d3d\H3D2_profile1v3\Balance.out", "r")
for line in searchfile:
    if "Time" in line:
        time.append(re.sub("[^0-9.]", "", line))
    elif "WatBalR" in line:
        WatBalR.append(re.sub("[^0-9.]", "", line))
    elif "Area" in line:
        Area0 = re.sub("[^0-9.\+]", "", line)
        print repr(Area0[:-10])
        Area0 = float(Area0[:-10].replace("'", ""))
        Area = numpy.append(Area, Area0)
    elif "Volume" in line:
        Volume0 = re.sub("[^0-9.\+]", "", line)
        Volume0 = float(Volume0[:-10].replace("'", ""))
        Volume = numpy.append(Volume, Volume0)
searchfile.close()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-80-341de12bbc94> in <module>()
13 Area0 = re.sub("[^0-9.\+]", "", line)
14 print repr(Area0[:-10])
---> 15 Area0 = float(Area0[:-10].replace("'", ""))
16 Area = numpy.append(Area, Area0)
17 elif "Volume" in line:
ValueError: invalid literal for float(): 0.55077+03
However, the following works:
float(0.55077+03)
3.55077
If I put quotes around the argument, the same invalid literal error comes up, and I have tried to remove the quotes from the string but cannot seem to do so.
0.55077+03 is 0.55077 added to 03. You need an e for scientific notation:
0.55077e+03
float(0.55077+03) adds 3 to .55077 and then converts it to a float (which it already is).
Note that this also only works on python2.x. On python3.x, 03 is an invalid token -- the correct way to write it there is 0o3...
float('0.55077+03') doesn't work (and raises the error that you're seeing) because that isn't a valid notation for a python float. You need: float('0.55077e03') if you're going for a sort of scientific notation. If you actually want to evaluate the expression, then things become a little bit trickier . . .
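If the Balance.out file always writes exponents in that e-less form, one workaround (my own sketch, not part of the original code; the helper name hydrus_float is made up) is to re-insert the e before calling float():

import re

def hydrus_float(text):
    # Insert an 'e' before a trailing signed exponent when it is missing,
    # e.g. '0.55077+03' -> '0.55077e+03'; strings that already contain an
    # 'e' are left untouched.
    return float(re.sub(r'(?<=\d)([+-]\d+)$', r'e\1', text))

print(hydrus_float('0.55077+03'))   # 550.77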
So I am having a problem extracting text from a large (over a gigabyte) text file. The file is structured as follows:
>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere
Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.
BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretch spans one or more newlines I get too few characters extracted.
Is there a way to solve this with another Python function than read(), maybe? I thought about just replacing all newlines with empty strings, but the file may be quite large (millions of lines).
I also tried to account for the newlines by adding extractLength // 80 to the length, but this is problematic in cases like the example: when 95 characters are spread over 3 lines (e.g. 2 + 80 + 13), I actually need 2 additional positions, but 95 // 80 is 1.
UPDATE:
I modified my code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"):
    # foundClusters stores the information for substrings I want extracted
    currentCluster = foundClusters.get(s.id)
    if(currentCluster is not None):
        for i in range(len(currentCluster)):
            outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")
            flanking = 25
            start = currentCluster[i][0]
            end = currentCluster[i][1]
            left = currentCluster[i][2]
            if(start - flanking < 0):
                start = 0
            else:
                start = start - flanking
            if(end + flanking > end + left):
                end = end + left
            else:
                end = end + flanking
            # for debugging only
            print(currentCluster)
            print(start)
            print(end)
            outputFile.write(s.seq[start, end+1])
But I get the following error:
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
File "findClaClusters.py", line 92, in <module>
outputFile.write(s.seq[start, end+1])
File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
UPDATE2:
Changed outputFile.write(s.seq[start, end+1]) to:
outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")
and it's working :)
With Biopython:
from Bio import SeqIO
X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if "header2" == s.id:
        print(s.seq[X: Y+1])
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Biopython lets you parse a fasta file and access its id, description and sequence easily. You then have a Seq object and can manipulate it conveniently without recoding everything yourself (reverse complement and so on).
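If you'd rather not add a dependency, the same idea also works in plain Python: collect the lines of the record you want, join them (which drops the newlines), and slice the joined string. A minimal sketch, assuming headers are unique and a single record fits in memory (the helper name extract is my own):

def extract(path, header, start, end):
    # start/end are 1-based, inclusive, and counted without newlines,
    # exactly as the positions are described in the question
    parts = []
    keep = False
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                keep = (line[1:] == header)
            elif keep:
                parts.append(line)
    return "".join(parts)[start - 1:end]

X = 66
Y = 130
print(extract("test.fst", "header2", X + 1, Y + 1))  # same slice as s.seq[X:Y+1] above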
I have a file below where I want to convert what is written on every fourth line into a number.
sample.fastq
@HISE
GGATCGCAATGGGTA
+
CC@!$%*&J#':AAA
@HISE
ATCGATCGATCGATA
+
()**D12EFHI@$;;
Each fourth line is a series of characters which each individually equate to a number (stored in a dictionary). I would like to convert each character into its corresponding number and then find the average of all those numbers on that line.
I have gotten as far as being able to display each of the characters individually, but I'm pretty stumped as to how to replace the characters with their numbers and then go on from there.
script.py
d = {
    '!':0, '"':1, '#':2, '$':3, '%':4, '&':5, '\'':6, '(':7, ')':8,
    '*':9, '+':10, ',':11, '-':12, '.':13, '/':14, '0':15, '1':16,
    '2':17, '3':18, '4':19, '5':20, '6':21, '7':22, '8':23, '9':24,
    ':':25, ';':26, '<':27, '=':28, '>':29, '?':30, '@':31, 'A':32, 'B':33,
    'C':34, 'D':35, 'E':36, 'F':37, 'G':38, 'H':39, 'I':40, 'J':41 }
with open('sample.fastq') as fin:
    for i in fin.readlines()[3::4]:
        for j in i:
            print j
The output should be as below and stored in a new file.
output.txt
@HISE
GGATCGCAATGGGTA
+
19 #From 34 34 31 0 3 4 9 5 41 2 6 25 32 32 32
@HISE
ATCGATCGATCGATA
+
23 #From 7 8 9 9 35 16 17 36 37 39 40 31 3 26 26
Is what i’m proposing possible?
You can do this with a for loop over the input file lines:
with open('sample.fastq') as fin, open('outfile.fastq', "w") as outf:
    for i, line in enumerate(fin):
        if i % 4 == 3:  # only change every fourth line
            # don't forget to do line[:-1] to get rid of newline
            qualities = [d[ch] for ch in line[:-1]]
            # take the average quality score. Note that as in your example,
            # this truncates it to an integer
            average = sum(qualities) / len(qualities)
            # new version; average with \n at end
            line = str(average) + "\n"
        # write line (or new version thereof)
        outf.write(line)
This produces the output you requested:
@HISE
GGATCGCAATGGGTA
+
19
@HISE
ATCGATCGATCGATA
+
22
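One small caveat to hedge: sum(qualities) / len(qualities) only truncates to an integer under Python 2; under Python 3 the same expression returns a float, so the truncating equivalent there would be floor division:

average = sum(qualities) // len(qualities)  # integer (floor) division under Python 3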
Assuming you read from stdin and write to stdout:
from sys import stdin  # the dictionary d from the question is assumed to be defined above

for i, line in enumerate(stdin, 1):
    line = line[:-1]  # Remove newline
    if i % 4 != 0:
        print(line)
        continue
    nums = [d[c] for c in line]
    print(sum(nums) / float(len(nums)))