I have a text file in this format:
000000.png 712,143,810,307,0
000001.png 599,156,629,189,3 387,181,423,203,1 676,163,688,193,5
000002.png 657,190,700,223,1
000003.png 614,181,727,284,1
000004.png 280,185,344,215,1 365,184,406,205,1
I want to remove the lines that don't contain any [number1,number2,number3,number4,1] or [number1,number2,number3,number4,5] block, and from the remaining lines strip the [blocks] -> [number1,number2,number3,number4,number5] that don't fulfill this condition.
The above text file should look like this in the end:
000001.png 387,181,423,203,1 676,163,688,193,5
000002.png 657,190,700,223,1
000003.png 614,181,727,284,1
000004.png 280,185,344,215,1 365,184,406,205,1
My code:
import os

with open("data.txt", "r") as input:
    with open("newdata.txt", "w") as output:
        # iterate all lines from file
        for line in input:
            # if substring contain in a line then don't write it
            if ",0" or ",2" or ",3" or ",4" or ",6" not in line.strip("\n"):
                output.write(line)
I have tried something like this, but it obviously didn't work.
No need for Regex, this might help you:
with open("data.txt", "r") as input: # Read all data lines.
data = input.readlines()
with open("newdata.txt", "w") as output: # Create output file.
for line in data: # Iterate over data lines.
line_elements = line.split() # Split line by spaces.
line_updated = [line_elements[0]] # Initialize fixed line (without undesired patterns) with image's name.
for i in line_elements[1:]: # Iterate over groups of numbers in current line.
tmp = i.split(',') # Split current group by commas.
if len(tmp) == 5 and (tmp[-1] == '1' or tmp[-1] == '5'):
line_updated.append(i) # If the pattern is Ok, append group to fixed line.
if len(line_updated) > 1: # If the fixed line is valid, write it to output file.
output.write(f"{' '.join(line_updated)}\n")
Related
I have a text file in this format:
0.jpg 12,13,14,15,16
0.jpg 13,14,15,16,17
1.jpg 1,2,3,4,5
1.jpg 2,3,4,5,6
I want to check if the image name is the same and then concatenate those lines into one line with the following format:
0.jpg 12,13,14,15,16 13,14,15,16,17
1.jpg 1,2,3,4,5 2,3,4,5,6
I have tried something like this, but I don't know how to do the actual comparison. I also don't quite know what logic to apply, since the first line_elements[0] has to be taken and compared with every other line's line_elements[0]:
with open("file.txt", "r") as input: # Read all data lines.
data = input.readlines()
with open("out_file.txt", "w") as output: # Create output file.
for line in data: # Iterate over data lines.
line_elements = line.split() # Split line by spaces.
line_updated = [line_elements[0]] # Initialize fixed line (without undesired patterns) with image's name.
if line_elements[0] = (next line's line_elements[0])???:
for i in line_elements[1:]: # Iterate over groups of numbers in current line.
tmp = i.split(',') # Split current group by commas.
if len(tmp) == 5:
line_updated.append(','.join(tmp))
if len(line_updated) > 1: # If the fixed line is valid, write it to output file.
output.write(f"{' '.join(line_updated)}\n")
Could be something like:
for i in range(len(data)):
    if line_elements[0] in line[i] == line_elements[0] in line[i+1]:
        line_updated = [line_elements[0]]
        for i in line_elements[1:]:  # Iterate over groups of numbers in current line.
            tmp = i.split(',')  # Split current group by commas.
            if len(tmp) == 5:
                line_updated.append(','.join(tmp))
        if len(line_updated) > 1:  # If the fixed line is valid, write it to output file.
            output.write(f"{' '.join(line_updated)}\n")
Save the first field of the line in a variable. Then check whether the first field of the current line is equal to the saved value. If it is, append the new values to it; otherwise write out the saved line and start a new output line.
current_name = None
with open("out_file.txt", "w") as output:
    for line in data:
        name, values = line.split()
        if name == current_name:
            current_values += ' ' + values
            continue
        if current_name:
            output.write(f'{current_name} {current_values}\n')
        current_name, current_values = name, values
    # write the last block
    if current_name:
        output.write(f'{current_name} {current_values}\n')
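An alternative sketch using itertools.groupby (my suggestion, not part of the original answer), which groups consecutive lines that share the same first field; it assumes lines with the same image name are adjacent, as in the sample, and reuses the file.txt/out_file.txt names from the question:

from itertools import groupby

with open("file.txt") as input_file, open("out_file.txt", "w") as output:
    rows = [line.split() for line in input_file if line.strip()]
    for name, group in groupby(rows, key=lambda parts: parts[0]):
        # flatten the value groups of all lines that share this name
        values = [v for parts in group for v in parts[1:]]
        output.write(f"{name} {' '.join(values)}\n")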
I have an input file like the one below (please note that there may or may not be blank lines in the file):
11111*Author Name
22222*Date
11111 01 Var-1
11111 02 Var-2
11111 02 Var-3
Rules to be used:
If an asterisk (*) is present at position 6 of a record, skip the record.
The first 6 bytes are a sequence number, which can also be spaces. Either way, the first six bytes can be ignored.
Only combine the records where no asterisk is present at position 6.
Only consider data starting from position 7 of the input file, up to position 72.
Add commas as shown below.
Expected Output
01,Var-1,02,Var-2,02,Var-3
Below is the code I was using to try to print the records. However, I was not able to get a comma (,) after each text field, and some fields were prefixed with spaces. Can someone please help?
with open("D:/Desktop/Files/Myfile.txt","r") as file_in:
for lines in file_in:
if "*" not in lines:
lines_new = " ".join(lines.split())
lines_fin = lines_new.replace(' ',',')
print(lines_fin,end=' ')
Assuming you just want to print them one after another (they will still be on separate lines):
with open("D:/Desktop/Files/Myfile.txt","r") as file_in:
for line in file_in:
if line == "\n": # skip empty lines
continue
if line[5] == "*": #skip if asterix at 6th position
continue
line = line.strip() # remove trailing and starting whitespace
line = line.replace(' ', ',') # replace remaining spaces with commas
print(line, ',')
If you just want them all combined then a better way to do it would be:
with open("D:/Desktop/Files/Myfile.txt","r") as f:
all_lines = f.readlines()
all_lines = [line.strip().replace(" ",",") for line in all_lines if line != "\n" and line[5] != "*"]
all_lines = ",".join(all_lines)
I haven't tested this, so there may be typos!
I think a regex solution would be elegant here. You would need to handle the limit of 72 characters for the length of the data, but that should not be a problem.
import re

pattern = r'[\s\d]{6}(.+)'
out = []

with open('combinestrings.txt', 'r') as infile:
    for line in infile:
        result = re.findall(pattern, line)
        if result:
            out.append(','.join(result[0].split(' ')))

print(','.join(out))
output:
01,Var-1,02,Var-2,02,Var-3
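If you also need the 72-character limit from the rules, one minimal tweak (my addition, not part of the original answer) is to slice each line before matching:

for line in infile:
    result = re.findall(pattern, line[:72])  # only consider positions 1-72 of the record
    if result:
        out.append(','.join(result[0].split(' ')))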
I would use Python's pathlib as it has some useful capabilities for handling paths and reading text files.
To join items together it is useful to put the items you want to join in a Python list and then use the join method on the list.
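For example, a quick illustration of join using values from the expected output:

parts = ['01', 'Var-1', '02', 'Var-2']
print(','.join(parts))  # -> 01,Var-1,02,Var-2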
I have also changed the logic of how the data is split. When a line is kept, the first 6 characters are always removed, so these can simply be sliced off. Doing that first makes the split on whitespace cleaner, as you get exactly the two items you are after.
There also seemed to be a requirement to truncate the data if it was longer than 72 characters, so I limited each line of data to 72 characters.
This is what my test code looked like:
from pathlib import Path

data_file = Path("D:/Desktop/Files/Myfile.txt")
field_size = 72


def combine_file_contents(filename):
    combined_data = []
    for line in filename.read_text().splitlines():
        if line and line[5] != "*":
            combined_data.extend(line[6:field_size].split())
    return ','.join(combined_data)


if __name__ == '__main__':
    expected_output = "01,Var-1,02,Var-2,02,Var-3"
    output_data = combine_file_contents(data_file)
    print("New Output:     ", output_data)
    print("Expected Output:", expected_output)
    assert output_data == expected_output
This gave the following output when I ran with the test data from the question:
New Output: 01,Var-1,02,Var-2,02,Var-3
Expected Output: 01,Var-1,02,Var-2,02,Var-3
I have a fasta file like this:
>XP1987651-apple1
ACCTTCCAAGTAG
>XP1235689-lemon2
TTGGAGTCCTGAG
>XP1254115-pear1
ATGCCGTAGTCAA
I would like to create a file selecting the header that ends with '1', for example:
>XP1987651-apple1
ACCTTCCAAGTAG
>XP1254115-pear1
ATGCCGTAGTCAA
So far I have created this:
fasta = open('x.fasta')
output = open('x1.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    for n in header:
        n = header[-1]
        if '1' in n:
            output.write(header + seq)
    header = line
    seq = ''
if "1" in header:
    output.write(header + seq)
output.close()
However, it doesn't produce any output in the new file created. Can you please spot the error?
Thank you
One option would be to read the entire file into a string, and then use re.findall with the following regex pattern:
>[A-Z0-9]+-\w+1\r?\n[ACGT]+
Sample script:
import re

fasta = open('x.fasta')
text = fasta.read()
matches = re.findall(r'>[A-Z0-9]+-\w+1\r?\n[ACGT]+', text)
print(matches)
For the sample data you gave above, this prints:
['>XP1987651-apple1\nACCTTCCAAGTAG', '>XP1254115-pear1\nATGCCGTAGTCAA']
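If you then want those records in a new file (assuming the x1.fasta name from the question), you could simply write the matches back out, for example:

with open('x1.fasta', 'w') as output:
    output.write('\n'.join(matches) + '\n')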
You can start by getting a list of your individual records, which are delimited by '>', and extracting the header and body with a single split on the first newline, .split('\n', 1):
records = [
    line.split('\n', 1)
    for line in fasta.read().split('>')[1:]
]
Then you can simply filter out the records whose header does not end with 1:
for header, body in records:
    if header.endswith('1'):
        output.write('>' + header + '\n')
        output.write(body)
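For completeness, a sketch putting those two pieces together, with the x.fasta/x1.fasta filenames taken from the question:

with open('x.fasta') as fasta, open('x1.fasta', 'w') as output:
    records = [
        record.split('\n', 1)
        for record in fasta.read().split('>')[1:]
    ]
    for header, body in records:
        if header.endswith('1'):
            output.write('>' + header + '\n')
            output.write(body)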
You can quite simply set a flag when you see a matching header line.
with open('x.fasta') as fasta, open('x1.fasta', 'w') as output:
    for line in fasta:
        if line.startswith('>'):
            select = line.endswith('1\n')
        if select:
            output.write(line)
This avoids reading the entire file into memory; you are only examining one line at a time.
Notice that line will contain the newline at the end of the line. I opted to simply keep it; sometimes things are easier if you trim it with line = line.rstrip('\n') and add it back on output if necessary.
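For reference, a sketch of that newline-stripping variant:

with open('x.fasta') as fasta, open('x1.fasta', 'w') as output:
    select = False
    for line in fasta:
        line = line.rstrip('\n')         # drop the trailing newline
        if line.startswith('>'):
            select = line.endswith('1')  # the header alone decides whether to keep the record
        if select:
            output.write(line + '\n')    # add the newline back on output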
I am trying to extract data from a .txt file in Python. My goal is to capture the last occurrence of a certain word and show the next line, so I reverse the text and read it from the end. In this case I search for the word 'MEC' and show the next line, but I capture all occurrences of the word, not just the first one from the end.
Any idea what I need to do?
Thanks!
This is what my code looks like:
import re
from file_read_backwards import FileReadBackwards

with FileReadBackwards("camdex.txt", encoding="utf-8") as file:
    for l in file:
        lines = l
        while line:
            if re.match('MEC', line):
                x = (file.readline())
                x2 = (x.strip('\n'))
                print(x2)
                break
            line = file.readline()
The txt file contains this:
MEC
29/35
MEC
28,29/35
And with my code I get this output:
28,29/35
29/35
My objective is to print only this:
28,29/35
This will give you the result as well. Loop through the lines, adding the matching lines to a list, then print the last element.
import re

with open("data\camdex.txt", encoding="utf-8") as file:
    result = []
    for line in file:
        if re.match('MEC', line):
            x = file.readline()
            result.append(x.strip('\n'))

print(result[-1])
Get rid of the extra imports and overhead. Read your file normally, remembering the last line that qualifies.
with ("camdex.txt", encoding="utf-8") as file:
for line in file:
if line.startswith("MEC"):
last = line
print(last[4:-1]) # "4" gets rid of "MEC "; "-1" stops just before the line feed.
If the file is very large, then reading backwards makes sense -- seeking to the end and backing up will be faster than reading to the end.
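A sketch of that backwards approach using the file_read_backwards package from the question (as far as I know it yields lines last-to-first with the newline already stripped), in case the file really is too large to read forwards:

from file_read_backwards import FileReadBackwards

previous = None  # in forward order, this is the line *after* the current one
with FileReadBackwards("camdex.txt", encoding="utf-8") as frb:
    for line in frb:
        if line.startswith("MEC") and previous is not None:
            print(previous)  # first match from the end == last "MEC" in the file
            break
        previous = line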
I am reading a file, getting the first element from the start of each line, and comparing it to my list. If it is found, I append the whole line to the new output file, which is supposed to have exactly the same structure as the input file.
my_id_list = [
    4985439,
    5605471,
    6144703
]
input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
my attempt:
import numpy as np

output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
Question is:
It is currently adding an extra empty line after each line written to the output file. How can I fix it? Or is there another way to do this more efficiently?
update:
The output file for this example should be:
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Try something like this:
# read lines and strip trailing newline characters
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f.readlines()]

# collect all the lines whose first field matches your id list
# (int() because my_id_list holds integers, while split() gives strings)
output_file = [line for line in input_lines if int(line.split()[0]) in my_id_list]

# write to output file
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
my_id_list = {'4985439', '5605471', '6144703'}  # a set is faster for membership testing
                                                # (strings, because split() gives string tokens)

with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
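A sketch of the binary-mode variant mentioned in the comments above, which passes the original line endings through untouched:

my_id_list = {b'4985439', b'5605471', b'6144703'}  # byte strings, to match bytes tokens

with open('input_file', 'rb') as input_file, open('output_file', 'wb') as output_file:
    for line in input_file:  # binary iteration keeps the original line endings intact
        if line.strip() and line.split()[0] in my_id_list:  # bytes.split() yields bytes tokens
            output_file.write(line)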
Getting the first word of each line this way is wasteful, since this:
first_word = line.split()[0]
creates a list of all "words" in the line when we just need the first one.
If you know that the columns are separated by spaces you can make it more efficient by only splitting on the first space:
first_word = line.split(' ', 1)[0]
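Another option along the same lines (my suggestion, not from the original answer) is str.partition, which avoids building a list at all:

first_word = line.partition(' ')[0]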