I would like to split a string like this
str = "$$Node_Type<block=begin>Blabla$$Node_Type<block=end>"
to something like this:
tab = ["$$Node_Type<block=begin>", "Blabla", "$$Node_Type<block=end>"]
but I can also have this:
str = "$$Node_Type1<block=begin>Blabla1$$Node_Type2<block=begin>Blabla2$$Node_Type2<block=end>$$Node_Type1<block=end>"
to something like this:
tab = ["$$Node_Type1<block=begin>", "Blabla1", "$$Node_Type2<block=begin>", "Blabla2", "$$Node_Type2<block=end>", "$$Node_Type1<block=end>"]
The idea at the end is to print it like that
$$Node_Type1<block=begin>
Blabla1
$$Node_Type2<block=begin>
Blabla2
$$Node_Type2<block=end>
$$Node_Type1<block=end>
Does anyone have an idea? Thanks.
You can take advantage of the fact that re.split retains the "splitter" in the results if it's a capturing group, and then track the nesting level while printing:
import re
example = "Hello$$Node_Type1<block=begin>Blabla1$$Node_Type2<block=begin>Blabla2$$Node_Type2<block=end>$$Node_Type1<block=end>"
level = 0
for bit in re.split(r'(\$\$[^>]+>)', example):
    if bit.startswith('$$') and bit.endswith('block=end>'):
        level -= 1
    if bit:
        print(' ' * level + bit)
    if bit.startswith('$$') and bit.endswith('block=begin>'):
        level += 1
This prints out
Hello
$$Node_Type1<block=begin>
 Blabla1
 $$Node_Type2<block=begin>
  Blabla2
 $$Node_Type2<block=end>
$$Node_Type1<block=end>
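If only the flat list from the question is needed, the same capturing-group split can be filtered for empty strings. This is a minimal sketch: split_blocks and TAG_PATTERN are hypothetical names, and it assumes every tag starts with $$ and ends at the first >.

```python
import re

# Same pattern as in the answer: a tag is "$$" followed by everything up to the first ">".
TAG_PATTERN = re.compile(r'(\$\$[^>]+>)')


def split_blocks(text):
    """Split text into tags and the plain text between them, dropping empty pieces."""
    return [bit for bit in TAG_PATTERN.split(text) if bit]


example = "$$Node_Type1<block=begin>Blabla1$$Node_Type2<block=begin>Blabla2$$Node_Type2<block=end>$$Node_Type1<block=end>"
tab = split_blocks(example)
# One element per line, as requested in the question.
print("\n".join(tab))
```

The filter is needed because re.split emits an empty string wherever two tags are adjacent or a tag sits at the very start or end of the input.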
I have a text file that contains data for classification purposes. The first column is the class, which is 0 or 1, and the other four columns contain the features. However, each feature value is prefixed with its index, that is 1: for feature 1, 2: for feature 2, etc. I tried to use a regex with numpy split but failed. How can I take only the columns I need? Below is the text file with the data.
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
1 1:1.451200e+02 2:2.088600e+02 3:-1.760859e-01 4:1.542257e+02
1 1:3.849699e+01 2:4.146600e+01 3:-1.886419e-01 4:1.239661e+02
1 1:2.927699e+01 2:1.072510e+02 3:1.149632e-01 4:1.077885e+02
1 1:2.886700e+01 2:1.090240e+02 3:-1.239433e-01 4:9.799130e+01
1 1:2.401300e+01 2:7.602000e+01 3:2.850990e-01 4:9.891692e+01
1 1:2.837900e+01 2:1.452160e+02 3:3.870011e-01 4:1.549975e+02
1 1:2.238140e+01 2:8.242810e+01 3:-2.814865e-01 4:8.998764e+01
1 1:1.232100e+02 2:4.561600e+02 3:-1.518468e-01 4:1.432996e+02
1 1:2.008405e+01 2:1.774510e+02 3:2.578101e-01 4:9.253101e+01
1 1:3.285699e+01 2:1.826750e+02 3:2.204406e-01 4:9.457175e+01
1 1:0.000000e+00 2:1.154780e+02 3:1.504970e-01 4:1.096315e+02
1 1:3.954504e+01 2:2.374420e+02 3:1.089429e-01 4:1.376333e+02
1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02
1 1:3.408200e+01 2:1.198280e+02 3:2.200156e-01 4:1.383639e+02
1 1:0.000000e+00 2:8.671080e+01 3:4.201880e-01 4:1.298851e+02
1 1:4.865997e+01 2:3.071500e+02 3:1.756066e-01 4:1.640174e+02
1 1:2.341090e+01 2:8.347140e+01 3:1.766868e-01 4:9.803250e+01
1 1:1.222390e+02 2:4.357930e+02 3:-1.812907e-01 4:1.687663e+02
1 1:1.624560e+01 2:4.830620e+01 3:5.508614e-01 4:2.632639e+01
1 1:4.389899e+01 2:2.421300e+02 3:2.006008e-01 4:1.331948e+02
1 1:6.143698e+01 2:2.338500e+02 3:2.758731e-01 4:1.612433e+02
1 1:5.952499e+01 2:2.176700e+02 3:-8.601014e-02 4:1.170831e+02
1 1:2.915850e+01 2:1.259875e+02 3:1.910455e-01 4:1.279927e+02
1 1:5.059702e+01 2:2.430620e+02 3:1.863443e-01 4:1.352273e+02
1 1:6.024097e+01 2:1.977340e+02 3:-1.319924e-01 4:1.320220e+02
1 1:2.620490e+01 2:6.270790e+01 3:-1.402450e-01 4:1.135866e+02
1 1:2.847198e+01 2:1.483760e+02 3:-1.868249e-01 4:1.672337e+02
1 1:2.707990e+01 2:7.770390e+01 3:-2.509235e-01 4:9.798032e+01
1 1:2.068600e+01 2:8.446800e+01 3:1.761782e-01 4:1.199423e+02
1 1:1.962450e+01 2:4.923090e+01 3:4.302725e-01 4:9.361318e+01
1 1:4.961401e+01 2:3.234850e+02 3:-1.963741e-01 4:1.622486e+02
1 1:7.982401e+01 2:2.017540e+02 3:-1.412161e-01 4:1.310716e+02
1 1:6.696402e+01 2:2.214030e+02 3:-1.187778e-01 4:1.416626e+02
1 1:5.842999e+01 2:1.348610e+02 3:2.876077e-01 4:1.286684e+02
1 1:6.982007e+01 2:3.693401e+02 3:-1.539849e-01 4:1.511659e+02
1 1:1.902200e+01 2:2.210120e+02 3:1.689450e-01 4:1.368066e+02
1 1:4.582898e+01 2:2.215950e+02 3:2.419124e-01 4:1.627100e+02
I do hate pandas but try these three lines:
import pandas as pd
# Use pandas read_csv; a multi-character sep is interpreted as a regex,
# which requires the python engine. header=None stops the first data row
# from being consumed as column names.
x=pd.read_csv('file.txt',sep='[: ]',header=None,engine='python').to_numpy()
# Now select the required columns
output=x[:,(2,4,6,8)]
print(output)
"""
array([[ 2.617300e+01, 5.886700e+01, -1.894697e-01, 1.251225e+02],
[ 5.707397e+01, 2.214040e+02, 8.607959e-02, 1.229114e+02],
[ 1.725900e+01, 1.734360e+02, -1.298053e-01, 1.250318e+02],
[ 2.177940e+01, 1.249531e+02, 1.538853e-01, 1.527150e+02],
[ 9.133997e+01, 2.935699e+02, 1.423918e-01, 1.605402e+02],
[ 5.537500e+01, 1.792220e+02, 1.654953e-01, 1.112273e+02],
[ 2.956200e+01, 1.913570e+02, 9.901439e-02, 1.034076e+02]])
"""
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can also check out How to use regex as delimiter function while reading a file into numpy array or similar
I rediscovered the solution below independently but this answer follows the same strategy via sep:
https://stackoverflow.com/a/51195469/1021819
You wish to parse a line like this:
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
Your parser reads the input with the usual idiom of:
for line in input_file:
Start by distinguishing the class label from the features.
label, *raw_features = line.split()
label = int(label)
Now it just remains to strip the unhelpful N: prefix from each feature.
features = [float(feat.split(':')[1])
            for feat in raw_features]
Sure, you could solve this with a regex.
But that doesn't sound like the simplest solution, in this case.
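Put together, the steps above might look like the following. This is only a sketch: parse_line and parse_lines are hypothetical helper names, and it assumes every non-blank line has the "label idx:value idx:value ..." shape shown in the question.

```python
def parse_line(line):
    """Return (label, features) for one 'label idx:value idx:value ...' line."""
    label, *raw_features = line.split()
    # Keep only the part after the colon and convert it to a float.
    features = [float(feat.split(':')[1]) for feat in raw_features]
    return int(label), features


def parse_lines(lines):
    """Parse an iterable of lines, skipping blank ones."""
    labels, rows = [], []
    for line in lines:
        if not line.strip():
            continue
        label, features = parse_line(line)
        labels.append(label)
        rows.append(features)
    return labels, rows


sample = [
    "1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02\n",
    "1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02\n",
]
labels, rows = parse_lines(sample)
print(labels)
print(rows[0])
```

Because iterating over an open file yields its lines, an open file object can be passed to parse_lines in place of the sample list.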
I was bored :-) so I thought I'd write a snippet for you. See below; it is somewhat dirty text processing that loads the result into a dataframe.
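# Read all lines; the with block closes the file afterwards
with open("predictions.txt", "r") as fh:
    lines = fh.readlines()
# split() without arguments also drops the trailing newline; blank lines are skipped
column_lines = [
    [fline[0]] + [feat[1] for feat in sorted([tuple(feature.split(":")) for feature in fline[1:]], key=lambda f: int(f[0]))]
    for fline in [line.split() for line in lines if line.strip()]
]
import pandas as pd
# Note: the values are still strings at this point
table = pd.DataFrame(column_lines, columns = ["Class", "Feature1","Feature2","Feature3","Feature4"])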
Instead of this, you could also transform the file to a CSV once, using similar text processing, and then load that directly into a dataframe, so you don't need to run this code every time.
I hope this is helpful.
If you want to use a regex to extract only your columns, you can apply this expression to each line:
import re
line = '1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02'
reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')
# Your columns:
reg.findall(line)
>>> ['1.067980e+02', '3.237560e+02', '-1.509505e-01', '1.754021e+02']
# Assuming you also want those values as numbers:
list(map(float, reg.findall(line)))
>>> [106.798, 323.756, -0.1509505, 175.4021]
What it does:
(-?\d+\.\d+e[+-]\d+): the outer parentheses create a capturing group. Inside the group, -? is the optional minus sign. Then comes at least one digit, \d+. The number has a decimal point followed by decimals, hence \.\d+. Finally there is the exponent e with either a + or a - sign, [+-], followed by its digits, \d+.
For school I have to parse the text that follows a keyword and is preceded by a lot of whitespace, but I just can't get it to work.
The file is a GenBank file.
So for example:
BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//
What I have tried is this.
if line.startswith("BLA"):
    start = line.find("BLA")
    end = line.find("//")
    line = line[:end]
    s_string = ""
    string = list()
    if s_string:
        string.append(line)
    else:
        line = line.strip()
        my_seq += line
But what I get is:
**output**
BLA
That is the only thing I get, whereas I want the output to look like this:
**output**
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
So I don't know what to do. I tried to produce that last output, but without success. My teacher told me to do it like this: once BLA has been seen, start iterating over the lines, and when you see "//", stop. But when I tried it with that kind of True flag I got nothing.
I searched online, and the suggestion was to use Biopython's SeqIO. But the teacher said we can't use that.
Here is my solution:
lines = """BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
lines = lines.strip().split("//")
lines = lines[0].split("BLA")
lines = [i.strip() for i in lines]
print("BLA", " ", lines[1])
Output:
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
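The teacher's hint from the question (start collecting once BLA has been seen, stop at "//") can also be written as a plain loop with a flag. The following is a sketch under assumptions: extract_block and its parameters are hypothetical names, and the input is taken to be an iterable of lines such as an open file.

```python
def extract_block(lines, start_word="BLA", end_marker="//"):
    """Collect the stripped lines between start_word and end_marker."""
    collected = []
    inside = False  # becomes True once the start word has been seen
    for line in lines:
        stripped = line.strip()
        if not inside:
            if stripped.startswith(start_word):
                inside = True
                # Keep any text that follows the start word on the same line.
                rest = stripped[len(start_word):].strip()
                if rest:
                    collected.append(rest)
            continue
        if stripped.startswith(end_marker):
            break  # end of the block, stop reading
        if stripped:
            collected.append(stripped)
    return collected


text = """BLA
     1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
     2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
     3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
block = extract_block(text.splitlines())
print("BLA", "\n".join(block))
```

Unlike splitting the whole text, this reads line by line, so it also works on large files without loading them into memory first.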
I am trying to interpret a string that I have received from a socket. The first set of data is seen below:
2 -> 1
1 -> 2
2 -> 0
0 -> 2
0 -> 2
1 -> 2
2 -> 0
I am using the following code to get the numerical values:
for i in range(0,len(data)-1):
    if data[i] == "-":
        n1 = data[i-2]
        n2 = data[i+3]
        moves.append([int(n1),int(n2)])
But when a number greater than 9 appears in the data, the program only takes one digit of that number (e.g. with 10 the program would get 0). How can I get both digits while still handling single-digit numbers?
Well, you are grabbing just one character on each side. Assuming data holds a single line, take slices instead:
for the second value: data[i+3:]
for the first one: data[:i-1]
Use the split() function (this assumes data[i] is one whole line, for example after data.splitlines()):
numlist = data[i].split('->')
moves.append([int(numlist[0]),int(numlist[1])])
I assume each line is available as a (byte) string in a variable named line. If it's a whole bunch of lines then you can split it into individual lines with
lines = data.splitlines()
and work on each line inside a for statement:
for line in lines:
    # do something with the line
If you are confident the lines will always be correctly formatted, the easiest way to get the values you want is the string split method. The full code, starting from the data, would then read like this.
lines = data.splitlines()
for line in lines:
    first, _, second = line.split()
    moves.append([int(first), int(second)])
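If the lines cannot be trusted to be perfectly formatted, a regular expression that matches whole runs of digits on both sides of the arrow is one alternative. This is a sketch, not part of the answers above: parse_moves and MOVE_PATTERN are hypothetical names, and it assumes socket data arriving as bytes is UTF-8 (or plain ASCII) text.

```python
import re

# One or more digits, the arrow with optional spaces around it, one or more digits.
MOVE_PATTERN = re.compile(r'(\d+)\s*->\s*(\d+)')


def parse_moves(data):
    """Return [[from, to], ...] for every 'N -> M' pair found in data."""
    if isinstance(data, bytes):
        # Assumption: the socket payload is UTF-8 compatible text.
        data = data.decode()
    return [[int(first), int(second)] for first, second in MOVE_PATTERN.findall(data)]


data = "2 -> 1\n10 -> 2\n0 -> 12\n"
moves = parse_moves(data)
print(moves)
```

Lines that do not contain a complete pair are simply skipped rather than raising an error, which may or may not be what you want.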
I have a function that reads a file which contains a name followed by a space, then multiple numbers, each separated by a space. I want to parse the name into one string and all the numbers into another, then put them in a dictionary (with the name as the key). I have written the following code:
def read_users (user_file):
    try:
        file_in = open(user_file)
    except:
        return None
    user_scores = {}
    for line in file_in:
        temp_lst = line.strip().split(' ', 1)
        user_scores = [temp_lst[0]] = temp_lst[1]
    return user_scores
This seems to do everything I need, but when it puts it into a dictionary it throws the exception "Too many values to unpack". I'm confused as to why this is thrown because I think I should be passing the dictionary a string with the name as the key, and a string with a bunch of numbers as the value.
If it's important the lines in the input file are formatted as follows:
Ben 1 0 2 3 4 -2 5 5 6 6 1
I have tried printing the list before I pass it to the dictionary and it appears as follows:
['Ben', '1 0 2 3 4 -1 5 5 6 6 1']
Anyone have any ideas? Thanks!
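I think the way you construct the dictionary is not quite right. The line user_scores = [temp_lst[0]] = temp_lst[1] is a chained assignment: it rebinds user_scores to the string and then tries to unpack that string into the one-element target list [temp_lst[0]], which raises "too many values to unpack". Index the dictionary instead: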
user_scores[temp_lst[0]] = temp_lst[1]
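For reference, here is one way the whole function could look with that fix applied. It is a sketch that keeps the question's behaviour of returning None when the file cannot be opened; the demo file name and the skipping of lines without a space are assumptions, not part of the original code.

```python
import os
import tempfile


def read_users(user_file):
    """Map each name to the rest of its line; return None if the file can't be opened."""
    try:
        file_in = open(user_file)
    except OSError:
        return None
    user_scores = {}
    with file_in:  # closes the file when the loop is done
        for line in file_in:
            parts = line.strip().split(' ', 1)
            if len(parts) != 2:
                continue  # assumption: skip blank lines and lines without scores
            name, scores = parts
            user_scores[name] = scores
    return user_scores


# Small demonstration with a temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "users.txt")
    with open(path, "w") as handle:
        handle.write("Ben 1 0 2 3 4 -2 5 5 6 6 1\n\n")
    scores = read_users(path)
    missing = read_users(os.path.join(tmp_dir, "no_such_file.txt"))
print(scores)
```

Catching OSError rather than using a bare except keeps unrelated errors, such as a typo in a variable name, from being silently turned into None.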