I have a text file that contains data for classification. The first column is the class (0 or 1), and the other four columns contain the features. However, each feature value is prefixed with its index: 1: for feature 1, 2: for feature 2, etc. I tried to use a regex with numpy to split the lines, but I failed. How can I extract only the columns I need? Below is the file with the data.
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
1 1:1.451200e+02 2:2.088600e+02 3:-1.760859e-01 4:1.542257e+02
1 1:3.849699e+01 2:4.146600e+01 3:-1.886419e-01 4:1.239661e+02
1 1:2.927699e+01 2:1.072510e+02 3:1.149632e-01 4:1.077885e+02
1 1:2.886700e+01 2:1.090240e+02 3:-1.239433e-01 4:9.799130e+01
1 1:2.401300e+01 2:7.602000e+01 3:2.850990e-01 4:9.891692e+01
1 1:2.837900e+01 2:1.452160e+02 3:3.870011e-01 4:1.549975e+02
1 1:2.238140e+01 2:8.242810e+01 3:-2.814865e-01 4:8.998764e+01
1 1:1.232100e+02 2:4.561600e+02 3:-1.518468e-01 4:1.432996e+02
1 1:2.008405e+01 2:1.774510e+02 3:2.578101e-01 4:9.253101e+01
1 1:3.285699e+01 2:1.826750e+02 3:2.204406e-01 4:9.457175e+01
1 1:0.000000e+00 2:1.154780e+02 3:1.504970e-01 4:1.096315e+02
1 1:3.954504e+01 2:2.374420e+02 3:1.089429e-01 4:1.376333e+02
1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02
1 1:3.408200e+01 2:1.198280e+02 3:2.200156e-01 4:1.383639e+02
1 1:0.000000e+00 2:8.671080e+01 3:4.201880e-01 4:1.298851e+02
1 1:4.865997e+01 2:3.071500e+02 3:1.756066e-01 4:1.640174e+02
1 1:2.341090e+01 2:8.347140e+01 3:1.766868e-01 4:9.803250e+01
1 1:1.222390e+02 2:4.357930e+02 3:-1.812907e-01 4:1.687663e+02
1 1:1.624560e+01 2:4.830620e+01 3:5.508614e-01 4:2.632639e+01
1 1:4.389899e+01 2:2.421300e+02 3:2.006008e-01 4:1.331948e+02
1 1:6.143698e+01 2:2.338500e+02 3:2.758731e-01 4:1.612433e+02
1 1:5.952499e+01 2:2.176700e+02 3:-8.601014e-02 4:1.170831e+02
1 1:2.915850e+01 2:1.259875e+02 3:1.910455e-01 4:1.279927e+02
1 1:5.059702e+01 2:2.430620e+02 3:1.863443e-01 4:1.352273e+02
1 1:6.024097e+01 2:1.977340e+02 3:-1.319924e-01 4:1.320220e+02
1 1:2.620490e+01 2:6.270790e+01 3:-1.402450e-01 4:1.135866e+02
1 1:2.847198e+01 2:1.483760e+02 3:-1.868249e-01 4:1.672337e+02
1 1:2.707990e+01 2:7.770390e+01 3:-2.509235e-01 4:9.798032e+01
1 1:2.068600e+01 2:8.446800e+01 3:1.761782e-01 4:1.199423e+02
1 1:1.962450e+01 2:4.923090e+01 3:4.302725e-01 4:9.361318e+01
1 1:4.961401e+01 2:3.234850e+02 3:-1.963741e-01 4:1.622486e+02
1 1:7.982401e+01 2:2.017540e+02 3:-1.412161e-01 4:1.310716e+02
1 1:6.696402e+01 2:2.214030e+02 3:-1.187778e-01 4:1.416626e+02
1 1:5.842999e+01 2:1.348610e+02 3:2.876077e-01 4:1.286684e+02
1 1:6.982007e+01 2:3.693401e+02 3:-1.539849e-01 4:1.511659e+02
1 1:1.902200e+01 2:2.210120e+02 3:1.689450e-01 4:1.368066e+02
1 1:4.582898e+01 2:2.215950e+02 3:2.419124e-01 4:1.627100e+02
I do hate pandas but try these three lines:
import pandas as pd
# Use pandas read_csv; a multi-character sep is interpreted as a regex (python engine).
# header=None stops the first data row from being consumed as column names.
x = pd.read_csv('file.txt', sep='[: ]', header=None, engine='python').to_numpy()
# Now select the required columns (the odd columns hold the 1, 2, 3, 4 prefixes)
output = x[:, [2, 4, 6, 8]]
print(output)
"""
array([[ 2.617300e+01, 5.886700e+01, -1.894697e-01, 1.251225e+02],
[ 5.707397e+01, 2.214040e+02, 8.607959e-02, 1.229114e+02],
[ 1.725900e+01, 1.734360e+02, -1.298053e-01, 1.250318e+02],
[ 2.177940e+01, 1.249531e+02, 1.538853e-01, 1.527150e+02],
[ 9.133997e+01, 2.935699e+02, 1.423918e-01, 1.605402e+02],
[ 5.537500e+01, 1.792220e+02, 1.654953e-01, 1.112273e+02],
[ 2.956200e+01, 1.913570e+02, 9.901439e-02, 1.034076e+02]])
"""
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can also check out How to use regex as delimiter function while reading a file into numpy array or similar
I rediscovered the solution below independently but this answer follows the same strategy via sep:
https://stackoverflow.com/a/51195469/1021819
You wish to parse a line like this:
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
Your parser reads the input with the usual idiom of:
for line in input_file:
Start by distinguishing the class label from the features.
label, *raw_features = line.split()
label = int(label)
Now it just remains to strip the unhelpful N: prefix from each feature.
features = [float(feat.split(':')[1])
            for feat in raw_features]
Sure, you could solve this with a regex.
But that doesn't sound like the simplest solution, in this case.
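Put together, the steps above could look like the following sketch. The function name `parse_line` and the in-memory `sample` list are my own, purely for illustration; with a real file you would iterate over the open file object instead.

```python
def parse_line(line):
    """Split one 'label idx:value idx:value ...' line into (label, features)."""
    label, *raw_features = line.split()
    # Keep only the part after the colon and convert it to float.
    features = [float(feat.split(':')[1]) for feat in raw_features]
    return int(label), features


# Two lines copied from the question stand in for the real file.
sample = [
    "1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02\n",
    "1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02\n",
]

labels, rows = [], []
for line in sample:
    label, features = parse_line(line)
    labels.append(label)
    rows.append(features)

print(labels)   # [1, 1]
print(rows[0])  # [26.173, 58.867, -0.1894697, 125.1225]
```

If you need numpy arrays afterwards, `numpy.array(rows)` and `numpy.array(labels)` should do the conversion.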
I was bored :-), so I thought I'd write a snippet for you. See below: it is somewhat quick-and-dirty text processing that loads the result into a dataframe.
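import pandas as pd
# Read all lines; the with-block closes the file afterwards.
with open("predictions.txt", "r") as f:
    lines = f.readlines()
# For each non-blank line: keep the class, then the feature values ordered by feature index.
# split() without arguments also drops the trailing newline from the last value.
column_lines = [
    [fline[0]] + [feat[1] for feat in sorted([tuple(feature.split(":")) for feature in fline[1:]], key=lambda f: int(f[0]))]
    for fline in [line.split() for line in lines if line.strip()]
]
# The values are still strings here; table.astype(float) converts them if needed.
table = pd.DataFrame(column_lines, columns=["Class", "Feature1", "Feature2", "Feature3", "Feature4"])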
Instead of this, you could also transform the file to a CSV once, using similar text processing, and then load that directly into a dataframe, so you don't need to run this code every time.
I hope this is helpful.
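For the one-off conversion to CSV, a minimal sketch using only the standard library might look like this. The function name `to_csv` is hypothetical, the header assumes exactly four features as in the question, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def to_csv(lines, out):
    """Write 'label idx:value ...' lines to `out` as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["Class", "Feature1", "Feature2", "Feature3", "Feature4"])
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        label, *features = fields
        # Drop the "N:" prefix and keep the value text unchanged.
        writer.writerow([label] + [f.split(":", 1)[1] for f in features])


sample = ["1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02\n"]
buffer = io.StringIO()
to_csv(sample, buffer)
print(buffer.getvalue())
```

With real files you would pass the open input file as `lines` and an output file opened with `newline=""` as `out`; the result can then be read with `pd.read_csv`.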
If you want to use a regex to extract only your columns, you can apply this expression to each line:
import re
line = '1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02'
reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')
# Your columns:
reg.findall(line)
>>> ['1.067980e+02', '3.237560e+02', '-1.509505e-01', '1.754021e+02']
# Assuming you also want the numeric values:
list(map(float, reg.findall(line)))
>>> [106.798, 323.756, -0.1509505, 175.4021]
What it does:
(-?\d+\.\d+e[+-]\d+) The outer parentheses create a group. Inside the group, -? is the optional minus sign. Then comes at least one digit, \d+. The number has a decimal point followed by decimals, hence \.\d+. Finally there is an exponent: e, then a sign that is either + or -, [+-], followed by digits, \d+. Note that inside a character class | is a literal character, so [+|-] would also accept a pipe.
I am trying to use someone else's code, but it keeps failing at one specific point. The code needs to find a multiline string in an input file and split the file in two at that point. The code seems logical to me, but I keep getting the same error. The input file looks something like this:
Text
more text 1
more text 2
VECTT
0 0 0 0 0
more text 3
more text 4
more text 5
Here is a minimal working example:
vfile = open('inputfile','r').read()
vesta_end = vfile.split("VECTT\n 0 0 0 0 0")[1]
print(vesta_end)
I would expect to get the second part of the inputfile, so:
more text 3
more text 4
more text 5
Instead, I get following error:
Traceback (most recent call last):
File "min.py", line 2, in <module>
vesta_end = vfile.split("VECTT\n 0 0 0 0 0")[1]
IndexError: list index out of range
which, I believe, just means it did not recognize the intended string passed to the split function. Any ideas on how to make the function recognize the multiline string?
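One possible cause, assuming the real file's whitespace differs from the literal (leading spaces before the zeros, a different number of spaces between them, or Windows `\r\n` line endings), is that the exact string never occurs in the file. A whitespace-tolerant sketch with `re.split` follows; the sample text is made up to mimic the layout above and is not the actual input file.

```python
import re

# Made-up stand-in for the file contents, with indented zeros.
text = (
    "Text\nmore text 1\nmore text 2\n"
    "VECTT\n  0 0 0 0 0\n"
    "more text 3\nmore text 4\nmore text 5\n"
)

# \s+ matches any run of whitespace, including "\n" and "\r\n".
marker = re.compile(r"VECTT\s+0\s+0\s+0\s+0\s+0[ \t]*\r?\n")

parts = marker.split(text, maxsplit=1)
if len(parts) == 2:
    vesta_end = parts[1]
    print(vesta_end)
else:
    print("marker not found")
```

Printing `repr(vfile)` around the marker is a quick way to see which whitespace characters the file really contains.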
I am new to Python and want to use it to extract the text between two matching patterns in each line of my tab-delimited text file (mydata).
mydata.txt:
Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
Name tRNA # Begin End Type Codon Begin End Score
-------- ------ ---- ------ ---- ----- ----- ---- ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33 1 1 72 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
Code I tried:
lines = []  # Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)
I want to grab texts between [locus_tag= and ][db_xre and get these as my results:
SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127
If I'm understanding correctly, this should work for a given line of your data:
data = line.split("locus_tag=")[1].split("][db_xref")[0]
The idea is to split the string on locus_tag=, take the 2nd element, then split that string on ][db_xref and take the first element.
If you want help with the outer loop it could look like:
for line in open(file_path, 'r'):
    if "locus_tag" in line:
        data = line.split("locus_tag=")[1].split("][db_xref")[0]
        print(data)
You can use re.search with positive lookbehind and positive lookahead patterns:
import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
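As a self-contained variation, here is a sketch that uses a capture group rather than lookarounds and runs on a few lines taken from the question. The `sample` list stands in for the file, and I am assuming the columns are separated by tabs, as the question states.

```python
import re

sample = [
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore\n",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33\t1\t1\t71\tPseudo\t???\t0\t0\t-1\n",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33\t1\t1\t73\tPseudo\t???\t0\t0\t-1\n",
]

# Capture everything after "[locus_tag=" up to the next "]".
pattern = re.compile(r"\[locus_tag=([^\]]+)\]")

tags = []
for line in sample:
    match = pattern.search(line)
    if match:  # header lines have no locus_tag and are skipped
        tags.append(match.group(1))

print(tags)  # ['SS1G_20133', 'SS1G_20131']
```

Stopping at the first `]` keeps the match from running past the tag if a line ever contains more than one bracketed field.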
I am trying to make a very simple login script to learn about accessing files and lists but I'm a bit stuck.
newaccno = str(1)
with open("C:\\Python\\Test\\userpasstest.txt","r+") as loginfile:
    for line in loginfile.readlines():
        line = line.strip()
        logininfo = line.split(" ")
        print(newaccno in logininfo[0])
    while newaccno in logininfo[0]: #issue is here, also tried ==
        newaccno += 1
        print(newaccno)
    loginfile.write(newaccno)
My logic is that it will search logininfo[0] for newaccno and if it is true, increase newaccno by 1 and search again until it is false then write to file (so if the file has 1, 2 and 3 already then newaccno will end up as 4).
Edit: This is how the txt file looks, the first number represents newaccno before it gets split.
1 abc qwe
2 123 456
(adapted from comment)
Your while loop needs to be inside your for loop for it to work. If it is outside, logininfo[0] will always be the first field of the last line.
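Note also that `newaccno` is a string in the question, so `newaccno += 1` would raise a `TypeError`. A different way to get the same result, sketched below, is to collect the account numbers already in use as integers and then count up to the first free one. The function name `next_account_number` is hypothetical, and the two sample lines are the ones from the question's edit.

```python
def next_account_number(lines):
    """Return the smallest positive integer not used as a first field."""
    used = set()
    for line in lines:
        fields = line.split()
        if fields and fields[0].isdigit():
            used.add(int(fields[0]))
    number = 1
    while number in used:  # integer membership, so 1 never matches 10
        number += 1
    return number


existing = ["1 abc qwe\n", "2 123 456\n"]
print(next_account_number(existing))  # 3
```

With the real file you would pass `loginfile.readlines()` and write `str(number)` back, since `write` expects a string.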
I am trying to interpret a string that I have received from a socket. The first set of data is seen below:
2 -> 1
1 -> 2
2 -> 0
0 -> 2
0 -> 2
1 -> 2
2 -> 0
I am using the following code to get the numerical values:
for i in range(0,len(data)-1):
    if data[i] == "-":
        n1 = data[i-2]
        n2 = data[i+3]
        moves.append([int(n1),int(n2)])
But when a number greater than 9 appears in the data, the program only takes the second digit of that number (e.g. with 10 the program would get 0). How would I get both of the digits from the code while maintaining the ability to get single-digit numbers?
Well, you just grab one character on each side.
For the second value, take a slice instead (assuming data holds a single line): data[i+3:]
For the first one: data[:i-1]
Use the split() function
numlist = data[i].split('->')
moves.append([int(numlist[0]),int(numlist[1])])
I assume each line is available as a (byte) string in a variable named line. If it's a whole bunch of lines then you can split it into individual lines with
lines = data.splitlines()
and work on each line inside a for statement:
for line in lines:
    # do something with the line
If you are confident the lines will always be correctly formatted the easiest way to get the values you want uses the string split method. A full code starting from the data would then read like this.
lines = data.splitlines()
for line in lines:
    first, _, second = line.split()
    moves.append([int(first), int(second)])
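If you are less sure about the formatting (extra spaces, blank lines), a regex over the whole string is another option. This is only a sketch: the sample `data` is made up to include multi-digit numbers, and I am assuming it is a `str`; bytes received from a socket would need `data.decode()` first.

```python
import re

# Made-up sample containing both single- and multi-digit numbers.
data = "2 -> 1\n1 -> 2\n10 -> 0\n0 -> 12\n"

# Each match yields the digits on either side of "->".
pairs = re.findall(r"(\d+)\s*->\s*(\d+)", data)
moves = [[int(a), int(b)] for a, b in pairs]

print(moves)  # [[2, 1], [1, 2], [10, 0], [0, 12]]
```

Because `\d+` matches one or more digits, single-digit and multi-digit values are handled the same way.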