Delete Range of Lines in Iterative Process - python

I have a KML data file (~160,000 lines). Within a Python script, I need to search for the keyword 'Unmatched' in the <name> tag and, if found, remove everything from the <Placemark> to the </Placemark> associated with that named entry.
I have tried suggestions from the forums here, and for a one-time deletion they work, but I have not succeeded when I need to perform this operation hundreds of times within the same file. There are 34 lines that need to be removed each time: 'prev' holds the line where the delete needs to start and 'end' is where it stops... so I need to delete [prev:end] and then write those changes.
#!/usr/bin/python
lookup = 'Unmatched '
with open('doc.kml') as myFile:
    for num, line in enumerate(myFile, 1):
        if lookup in line:
            print 'Found in Line:', num
            prev = num - 1
            print 'Placemark Starts at line:', prev
            end = prev + 33
            print '/Placemark Ends at line:', end

I would forget about line numbers and focus on the contents of each line:
store the kept lines in another list of lines;
when the start pattern is found, drop the remaining 33 (34?) lines by manually iterating on the file with next();
the "drop the previous line" problem can be solved by popping the last line that we stored in the output list of lines.
like this:
lookup = 'Unmatched '
filtered = []  # output list of lines
with open('doc.kml') as myFile:
    for line in myFile:
        if lookup in line:
            filtered.pop()  # drop the <Placemark> line we just added
            for _ in range(34):  # not sure of how many, you'll see
                next(myFile, None)  # drop a line
        else:
            filtered.append(line)

# in the end write back the filtered file if needed
# (one could write directly instead of appending to a list)
with open('newdoc.kml', 'w') as f:
    f.writelines(filtered)
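If the number of lines inside each Placemark is not guaranteed to be exactly 34, a variant of the same idea is to buffer each <Placemark> ... </Placemark> block and keep it only when the keyword never appears inside it. This is just a rough sketch under that assumption (tag spellings and file names taken from the question):
lookup = 'Unmatched '
kept = []      # lines to write out
block = None   # buffer for the current <Placemark> block, or None outside a block

with open('doc.kml') as src:
    for line in src:
        if '<Placemark' in line:
            block = [line]                 # start buffering a new block
        elif block is not None:
            block.append(line)
            if '</Placemark>' in line:
                # keep the whole block only if the keyword never appeared in it
                if not any(lookup in l for l in block):
                    kept.extend(block)
                block = None
        else:
            kept.append(line)              # lines outside any Placemark are kept as-is

with open('newdoc.kml', 'w') as dst:
    dst.writelines(kept)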

Python: Access "field" in line

I have the following .txt file (modified bash emboss-dreg report; the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "Sequence" only, to compare them with some variables and delete the whole line if the comparison does not give the desired result (using Levenshtein distance for the comparison).
But I can't even get started .... :(
I am searching for something like the linux -f option, to directly get to the right "field" in the line to do my comparison.
I came across re.split:
import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "split my lines into elements". I feel like I'm going totally the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with it as .txt. Maybe there is another approach better suited to it?
Python is the main language I am learning; I am not so firm in Bash, but Bash answers for dealing with the issue would be OK for me, too.
I am thankful for any hint/link/help :)
The format itself seems to be using blank lines as record delimiters, and your r'\t' is not doing anything useful: it tells Python to split on tab characters, but based on what you've pasted the data is not tab-delimited anyway; the table is padded with a varying number of spaces.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check whether any data is left and, if there is, split it further on whitespace to get to your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
This prints:
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason: you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file. Something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace lines, to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace line for later use

shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file minus the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into the working memory and then just overwrite it, but that would work only for files that can fit your working memory while the above approach can work with files as large as your free storage space.
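A minimal sketch of that in-memory variant, under the same assumptions as above (a "your_data" file with a Sequence column and a compare_func() that decides which rows to keep):
SOURCE_FILE = "your_data"  # assumed path, as above

def compare_func(seq):  # placeholder comparison; swap in Levenshtein distance or similar
    return not seq.endswith("TC")

with open(SOURCE_FILE, "r") as f:
    lines = f.readlines()  # the whole file is now in memory

header = {v: i for i, v in enumerate(lines[0].split())}
kept = [lines[0]]  # always keep the header line
for line in lines[1:]:
    elements = line.split()
    # keep blank lines (formatting) and any row whose Sequence passes the check
    if not elements or compare_func(elements[header["Sequence"]]):
        kept.append(line)

with open(SOURCE_FILE, "w") as f:  # overwrite the original in place
    f.writelines(kept)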

Python Delete last row if has ['\x1a'] mark on it

I am currently working on a project and I need to test whether the last row (line) of the input contains this byte: '\x1a'. If the last row has this marker, I want to delete the entire row.
I have this code so far, but I don't know how to make it test for that byte on the last row and delete it.
Thank you!
readFile1 = open("sdn.csv")
lines1 = readFile1.readlines()
readFile1.close()
w1 = open("sdn.csv", 'w')
w1.writelines([item for item in lines1[:-1]])
w1.close()
readFile2 = open("add.csv")
lines2 = readFile2.readlines()
readFile2.close()
w2 = open("add.csv",'w')
w2.writelines([item for item in lines2[:-1]])
w2.close()
readFile3 = open("alt.csv")
lines3 = readFile3.readlines()
readFile3.close()
w = open("alt.csv",'w')
w.writelines([item for item in lines3[:-1]])
w.close()
In each of your code blocks, you read your file's contents into a variable with a line like:
lines1 = readFile1.readlines()
If you want to see if the \x1a byte exists anywhere in the last line of the text, then you can do this:
if '\x1a' in lines1[-1]:
    # whatever you need to do
If you want to find the byte and then actually delete the row from the list altogether:
if '\x1a' in lines1[-1]:
    # \x1a byte was found, remove the last item from the list
    del lines1[-1]
And if I may offer a suggestion, all your code blocks repeat. You could create a function which captures all the functionality and then pass file names to it.
def process_csv(file_name):
    # Open the file for both reading and writing.
    # This will also automatically close the file handle
    # after you're done with it.
    with open(file_name, 'r+') as csv_file:
        data = csv_file.readlines()
        if '\x1a' in data[-1]:
            # erase the file, then write the data without the last row to it
            csv_file.seek(0)
            csv_file.truncate()
            csv_file.writelines(data[:-1])
        else:
            # Just making this explicit:
            # don't do anything to the file if the \x1a byte wasn't found
            pass

for f in ('sdn.csv', 'add.csv', 'alt.csv'):
    process_csv(f)

File operation starts again from first while looping through the file

I'm trying to find a certain word in a file and want to print the next line when a condition is met.
f = open('/path/to/file.txt', 'r')
lines = f.readlines()
for line in lines:
    if 'P/E' in line:
        n = lines.index(line)  # get index of current line
        print(lines[n + 1])    # print the next line
f.close()
The string 'P/E' will be present 4 times in the file, each time in a different line.
When executed, the code prints the line following each of the first 2 occurrences of 'P/E' normally, then goes back, prints those same 2 lines again and exits. The loop does not proceed past those first 2 occurrences; it just repeats them and exits.
I checked the data file to see whether my output matches the actual content, but the lines following each 'P/E' are all different.
How can I resolve this? Thanks.
list.index() with just one argument only ever finds the first occurrence. To find elements past the previous match you'd have to give it a starting point: list.index() takes a second argument that tells it where to start searching from.
However, you don't need to use lines.index() at all; it is very inefficient, because it requires a full scan through the list, testing each line until a match is found.
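For completeness, a minimal sketch of that two-argument form applied to the question's loop (it still performs a scan per match, and the IndexError caveat below applies here as well):
n = -1  # index of the previous match; start the search at the beginning
for line in lines:
    if 'P/E' in line:
        n = lines.index(line, n + 1)  # search only after the previous match
        print(lines[n + 1])           # print the next line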
Just use the enumerate() function to add indices as you loop:
for index, line in enumerate(lines):
    if 'P/E' in line:
        print(lines[index + 1])
Be careful: there is a chance index + 1 is not a valid index; if you find 'P/E' in the very last line of the lines list, you'll get an IndexError. You may have to add an index + 1 < len(lines) test.
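A minimal sketch of that guard added to the enumerate() loop above:
for index, line in enumerate(lines):
    # only look ahead when a following line actually exists
    if 'P/E' in line and index + 1 < len(lines):
        print(lines[index + 1])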
Note that using file.readlines() reads all of the file into memory in one go. Try to avoid this; you could loop directly over the file, and remember the previous line instead:
with open('/path/to/file.txt', 'r') as f:
    previous = ''
    for line in f:
        if 'P/E' in previous:
            print(line)      # print this line
        previous = line      # remember for the next iteration

Python: Appending string constructed out of multiple lines to list

I'm trying to parse a txt file and put sentences in a list that fit my criteria.
The text file consists of several thousand lines and I'm looking for lines that start with a specific string; let's call this string 'start'.
The lines in this text file can belong together and are separated with \n at seemingly random places.
This means I have to look for any string that starts with 'start', put it in an empty string 'complete' and then continue scanning each line after that to see if it also starts with 'start'.
If not, then I need to append it to 'complete', because then it is part of the entire sentence. If it does, I need to append 'complete' to a list, create a new, empty 'complete' string and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence consists of.
My code thus far:
import sys, string

lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''

with open(sys.argv[1]) as f:
    data = f.read()
    for line in data:
        if line.lower().startswith(startswith):
            completeline = line
        else:
            completeline += line
    lines_1.append(completeline)

# check some stuff in output
for l in lines_1:
    print "______"
    print l
print len(lines_1)
However, this puts the entire content into 1 item in the list, whereas I'd like everything to be separated.
Keep in mind that one sentence can span 1, 2, 10 or 1,000 lines, so the code needs to spot the next startswith value, append the existing completeline to the list, and then fill completeline with the next sentence.
Much obliged!
Two issues:
Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data: goes character by character through the string. Split your input on newlines instead, giving a list of lines, and iterate over that, e.g. for line in data.split('\n'):
Overwriting the completeline inside the loop
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if in the loop to something like this:
if line.lower().startswith(startswith):
    if completeline:
        lines_1.append(completeline)
    completeline = line
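Putting both fixes together, here is a minimal sketch of the corrected loop, keeping the question's variable names (the final append after the loop is needed so the last sentence is not lost):
import sys

lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''

with open(sys.argv[1]) as f:
    data = f.read()

for line in data.split('\n'):              # iterate over lines, not characters
    if line.lower().startswith(startswith):
        if completeline:
            lines_1.append(completeline)   # finish the previous sentence
        completeline = line                # start a new one
    else:
        completeline += line               # continuation of the current sentence

if completeline:
    lines_1.append(completeline)           # don't lose the last sentence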
For a task like this:
"I'm trying to parse a txt file and put sentences in a list that fit my criteria"
I usually prefer using a dictionary, for example:
from collections import defaultdict

def satisfiesCriteria(criteria, sentence):
    if sentence.lower().startswith(criteria):
        return True
    return False

seperatedItems = defaultdict(list)
# fileDataAsAList is assumed to hold the file's lines, e.g. f.read().splitlines()
for sentence in fileDataAsAList:
    if satisfiesCriteria("start", sentence):
        seperatedItems["start"].append(sentence)
Something like this should suffice; the code is just to give you an idea of what you might like to do. You can have a list of criteria and loop over them, which will add the sentences matching the different criteria to the dictionary, something like this:
mycriterias = ['start', 'begin', 'whatever']
for criteria in mycriterias:
    for sentence in fileDataAsAList:
        if satisfiesCriteria(criteria, sentence):
            seperatedItems[criteria].append(sentence)
mind the spellings :p

Delete chunks of duplicate files in python from a very large file

My lab generates very large files of mass spec data. With an updated program from the manufacturer, some of the data is written out duplicated and looks like this:
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more numbers of this format--
END IONS
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more duplicate numbers---
END IONS
All chunks are of this format. I've tried writing a program that reads in the whole file (1-2 million lines), puts the lines in a set, and compares every new line to the set to see whether it has been duplicated or not. The resulting array of lines would then be printed to a new file. Duplicate chunks are supposed to be skipped over in the conditional statement, but when I run the program that branch is never entered, and all of the input lines are printed out instead.
print('Enter file name to be cleaned (including extension, must be in same folder)')
fileinput = raw_input()
print('Enter output file name including extension')
fileoutput = raw_input()

with open(fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray = []
        j = 0
        linecount = 0
        # read file over, append array
        for line in f:
            largearray.append(line)
            linecount += 1
        while j < linecount:
            # initialize set
            seen = set()
            if largearray[j] not in seen:
                seen.add(largearray[j])
                # if the first line of the next chunk is a duplicate:
                if 'BEGIN' in largearray[j] and largearray[j + 5] in seen:
                    while 'END IONS' not in largearray[j]:
                        j += 1  # skip through all lines in the array until the next chunk is reached
            print('writing: ', largearray[j])
            fo.write(largearray[j])
            j += 1
Any help would be greatly appreciated.
So just to clarify:
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
this header block is repeated before the duplicated numbers etc., right?
So you could just check whether these initial parts are duplicated and, if so, skip until the next END IONS.
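A rough sketch of that idea, assuming each chunk starts with 'BEGIN IONS', ends with 'END IONS', and the five header lines shown above identify a chunk; the file names are placeholders:
seen_headers = set()

with open('input.mgf') as f_in, open('output.mgf', 'w') as f_out:
    chunk = []
    for line in f_in:
        chunk.append(line)
        if line.startswith('END IONS'):
            # the first 6 lines are 'BEGIN IONS' plus the 5 header lines
            header = ''.join(chunk[:6])
            if header not in seen_headers:
                seen_headers.add(header)
                f_out.writelines(chunk)  # keep only the first copy of each chunk
            chunk = []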
If the file is huge, you should read it line by line and save only the data you are interested in. So here's a line-by-line approach:
end_chunk = 'END IONS'
already_read_chunks = set()

with open(fileinput) as f_in:
    current_chunk = []
    for line in f_in:  # read iteratively, save only the data you need
        line = line.strip()  # remove trailing and leading whitespace
        if line:  # skip empty lines
            current_chunk.append(line)
            if line == end_chunk:
                entire_chunk = '\n'.join(current_chunk)  # rebuild the chunk as a string
                if entire_chunk not in already_read_chunks:  # check its existence
                    already_read_chunks.add(entire_chunk)  # add it if we haven't read it before
                current_chunk = []  # reset current_chunk, to restart the process

with open(fileoutput, 'w') as f_out:
    for chunk in already_read_chunks:
        f_out.write(chunk)
        f_out.write('\n')
        f_out.write('\n')
The reason it doesn't skip over duplicates is the line:
seen = set()
It is in the wrong place. If it is moved outside the loop, then the code will work as intended:
with open(fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray = list(f)  # read file
        seen = set()          # initialize set before loop
        j = 0
        while j < len(largearray):
            if largearray[j] not in seen:
                seen.add(largearray[j])
            # if the first line of the next chunk is a duplicate:
            if 'BEGIN' in largearray[j] and largearray[j + 5] in seen:
                while 'END IONS' not in largearray[j]:
                    j += 1  # skip through all lines in the array until the next chunk is reached
                j += 1  # skip over `END IONS`
            else:
                print('writing: ', largearray[j])
                fo.write(largearray[j])
                j += 1
I made two other adjustments:
Looping over input lines of f to save them in a list is unnecessary. This was replaced with:
largearray=list(f)
Ideally, to handle large files, we wouldn't read in the whole file at once, but only one BEGIN/END block at a time; I will leave that as an exercise for the reader (a rough sketch follows below).
The code would print out END IONS even for duplicate sections. This was avoided by (a) incrementing j once more, and (b) using an else clause to print only the non-duplicate sections.
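For reference, a rough sketch of that block-at-a-time idea (not part of the answer above): it reads one BEGIN/END chunk at a time and keeps only an md5 fingerprint of each chunk in memory, reusing the fileinput and fileoutput names from the question:
import hashlib

seen = set()
with open(fileinput) as f_in, open(fileoutput, 'w') as f_out:
    chunk = []
    for line in f_in:
        chunk.append(line)
        if line.startswith('END IONS'):
            # fingerprint the chunk so only a small digest is kept per chunk
            digest = hashlib.md5(''.join(chunk).encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                f_out.writelines(chunk)  # write the first copy, drop repeats
            chunk = []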
Alternative solution using awk
The same problem can be solved in awk in a single line:
awk -F'\n' -v RS="BEGIN IONS\n" '$5 in seen || NF==0 {next;} {seen[$5]++;print RS,$0}' infile >outfile
Explanation:
-F'\n' -v RS="BEGIN IONS\n"
awk reads in a record at a time. Here, a record is defined as any text that begins with BEGIN IONS and a newline. awk takes each record and divides it into fields. Here we define the field separator as a newline character. Each line becomes a field.
$5 in seen || NF==0 {next;}
If the fifth line in this record has already been seen, we skip over the rest of the commands and jump to the next record. We do the same on any empty records that contain no lines.
seen[$5]++; print RS,$0
If we get to this command, that means that the record has not been seen before. We add the fifth line to the array seen and print this record.
