Delete duplicate chunks from a very large file in Python

My lab generates very large files of mass spec data. With an updated program from the manufacturer, some of the data is written out in duplicate and looks like this:
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more numbers of this format--
END IONS
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more duplicate numbers---
END IONS
All chunks are of this format. I've tried writing a program that reads in the whole file (1-2 million lines), puts the lines in a set, and compares every new line against the set to see whether it has been duplicated. The resulting array of lines would then be printed to a new file. Duplicate chunks are supposed to be skipped over in the conditional statement, but when I run the program that branch is never entered, and all lines are printed out instead:
print('Enter file name to be cleaned (including extension, must be in same folder)')
fileinput = raw_input()
print('Enter output file name including extension')
fileoutput = raw_input()
with open(fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray = []
        j = 0
        linecount = 0
        # read file over, append array
        for line in f:
            largearray.append(line)
            linecount += 1
        while j < linecount:
            # initialize set
            seen = set()
            if largearray[j] not in seen:
                seen.add(largearray[j])
                # if the first line of the next chunk is a duplicate:
                if 'BEGIN' in largearray[j] and largearray[j+5] in seen:
                    while 'END IONS' not in largearray[j]:
                        j += 1  # skip through all lines in the array until the next chunk is reached
                print('writing: ', largearray[j])
                fo.write(largearray[j])
                j += 1
Any help would be greatly appreciated.

So just to clarify: the header block
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
is repeated along with the duplicated numbers etc., right?
If so, you could just check whether these initial lines have been seen before, and if they have, skip until the next END IONS.
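A minimal sketch of that idea (a sketch, not code from this thread), assuming each chunk has exactly the four header lines shown above after BEGIN IONS, ends with END IONS, and that fileinput/fileoutput are the filenames from the question:
seen_headers = set()
with open(fileinput) as f, open(fileoutput, 'w') as fo:
    buffered = []
    skipping = False
    for line in f:
        if line.startswith('BEGIN IONS'):
            buffered = [line]          # start collecting the header block
            skipping = False
        elif buffered and len(buffered) < 5:
            buffered.append(line)      # TITLE, RTINSECONDS, PEPMASS, CHARGE
            if len(buffered) == 5:
                header = ''.join(buffered)
                if header in seen_headers:
                    skipping = True    # duplicate chunk: drop everything up to END IONS
                else:
                    seen_headers.add(header)
                    fo.writelines(buffered)
                buffered = []
        elif not skipping:
            fo.write(line)             # data lines and END IONS of a new chunk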

If the file is huge, you should read it line by line and keep only the data you are interested in. Here is a line-by-line iteration approach:
end_chunk = 'END IONS'
already_read_chunks = set()
with open(fileinput) as f_in:
    current_chunk = []
    for line in f_in:                  # read iteratively, keep only the data you need
        line = line.strip()            # remove leading/trailing whitespace
        if line:                       # skip empty lines
            current_chunk.append(line)
            if line == end_chunk:
                entire_chunk = '\n'.join(current_chunk)        # rebuild the chunk as a string
                if entire_chunk not in already_read_chunks:    # check whether it already exists
                    already_read_chunks.add(entire_chunk)      # add it if we haven't read it before
                current_chunk = []     # reset current_chunk to restart the process

with open(fileoutput, 'w') as f_out:
    for chunk in already_read_chunks:
        f_out.write(chunk)
        f_out.write('\n')
        f_out.write('\n')

The reason it doesn't skip over duplicates is the line:
seen = set()
It is in the wrong place. If it is moved outside the loop, then the code will work as intended:
with open(fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray = list(f)  # read file
        seen = set()          # initialize set before loop
        j = 0
        while j < len(largearray):
            if largearray[j] not in seen:
                seen.add(largearray[j])
            # if the first line of the next chunk is a duplicate:
            if 'BEGIN' in largearray[j] and largearray[j+5] in seen:
                while 'END IONS' not in largearray[j]:
                    j += 1  # skip through all lines in the array until the next chunk is reached
                j += 1      # skip over `END IONS`
            else:
                print('writing: ', largearray[j])
                fo.write(largearray[j])
                j += 1
I made two other adjustments:
Looping over input lines of f to save them in a list is unnecessary. This was replaced with:
largearray=list(f)
Ideally, to handle large files, we wouldn't read the whole file in at once but would process only one BEGIN/END block at a time. I will leave that as an exercise for the reader (a rough sketch of that approach follows these notes).
The code would print out END IONS even for duplicate sections. This was avoided by (a) incrementing j once more, and (b) using an else clause so that only the non-duplicate sections are printed.
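A rough sketch of that block-at-a-time idea (a sketch, not part of the answer above): it keys the set on the whole chunk, but writes each new chunk as soon as it is complete, so the original chunk order is preserved.
seen = set()
with open(fileinput) as f, open(fileoutput, 'w') as fo:
    chunk = []
    for line in f:
        if not line.strip():
            continue                      # ignore blank lines between chunks
        chunk.append(line)
        if line.startswith('END IONS'):
            key = ''.join(chunk)          # the whole chunk identifies a duplicate
            if key not in seen:
                seen.add(key)
                fo.writelines(chunk)
            chunk = []                    # start collecting the next chunk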
Alternative solution using awk
The same problem can be solved in awk in a single line:
awk -F'\n' -v RS="BEGIN IONS\n" '$5 in seen || NF==0 {next;} {seen[$5]++;print RS,$0}' infile >outfile
Explanation:
-F'\n' -v RS="BEGIN IONS\n"
awk reads the input one record at a time. Here, a record is the block of text delimited by BEGIN IONS plus a newline. awk divides each record into fields; because the field separator is defined as a newline character, each line becomes a field.
$5 in seen || NF==0 {next;}
If the fifth line in this record has already been seen, we skip over the rest of the commands and jump to the next record. We do the same on any empty records that contain no lines.
seen[$5]++; print RS,$0
If we get to this command, that means that the record has not been seen before. We add the fifth line to the array seen and print this record.

Related

Delete Range of Lines in Iterative Process

I have a KML data file (~160,000 lines). Within a Python script, I need to search for the keyword 'Unmatched' in the <name> tag and, if it is found, remove everything from the <Placemark> to the </Placemark> associated with that named entry.
I have tried the approaches suggested in the forums here, and they work for a single occurrence, but I have not succeeded when the operation needs to be performed hundreds of times within the same file. There are 34 lines that need to be removed each time: 'prev' gets the line where the deletion needs to start and 'end' is where it stops, so I need to delete [prev:end] and then write those changes.
#!/usr/bin/python
lookup = 'Unmatched '
with open('doc.kml') as myFile:
    for num, line in enumerate(myFile, 1):
        if lookup in line:
            print 'Found in Line:', num
            prev = num - 1
            print 'Placemark Starts at line:', prev
            end = prev + 33
            print '/Placemark Ends at line:', end
I would forget about line numbers. Focus on the contents of the line.
store the lines in another list of lines
when the start pattern is found, drop the remaining 33 (34?) lines by manually iterating on the file with next
the "drop the previous line" problem can be solved by popping the last line that we stored in the output list of lines
like this:
lookup = 'Unmatched '
filtered = []  # output list of lines
with open('doc.kml') as myFile:
    for line in myFile:
        if lookup in line:
            filtered.pop()  # drop the line we just added
            for _ in range(34):  # not sure of how many, you'll see
                next(myFile, None)  # drop a line
        else:
            filtered.append(line)

# in the end, write back the filtered file if needed
# (one could write directly instead of appending to a list)
with open('newdoc.kml', 'w') as f:
    f.writelines(filtered)

Python: Access "field" in line

I have the following .txt file (a modified bash emboss-dreg report; the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access only the elements under "Sequence", to compare them with some variables and delete the whole line if the comparison does not give the desired result (using Levenshtein distance for the comparison).
But I can't even get started .... :(
I am searching for something like the -f option of Linux command-line tools, to get directly to the right "field" in the line and do my comparison.
I came across re.split:
import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to splitting my lines into elements. I feel like I'm going totally the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to treat the file as .txt. Maybe there is a better approach for dealing with it?
Python is the main language I am learning; I am not so firm in Bash, but Bash answers for dealing with the issue would be OK for me, too.
I am thankful for any hint/link/help :)
The format itself seems to use blank lines between the rows, and your r'\t' split is not doing anything useful: based on what you've pasted, the data is not tab-delimited anyway, but padded into a table with a variable number of whitespace characters.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check whether there is any data left, and if there is, split it further on whitespace to get your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want: create a temporary file, loop through your original file, compare your values, write the lines you want to keep into the temporary file, and finally replace the original file with the temporary one. Something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)        # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line by line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or if it is an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)       # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use

shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file minus the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into working memory and then just overwrite it, but that would work only for files that fit into your working memory, while the above approach can work with files as large as your free storage space.
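A minimal sketch of that in-memory variant (a sketch, assuming the same compare_func and header layout as above):
with open(SOURCE_FILE, "r") as f:
    lines = f.readlines()

header = {v: i for i, v in enumerate(lines[0].split())}
kept = [lines[0]]  # always keep the header line
for line in lines[1:]:
    elements = line.split()
    # keep blank spacer lines, and data rows whose Sequence passes the comparison
    if not elements or compare_func(elements[header["Sequence"]]):
        kept.append(line)

with open(SOURCE_FILE, "w") as f:
    f.writelines(kept)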

Find and remove elements from list, while retaining the location for later insertion

Using the following in Python 2.7:
dfile = 'new_data.txt' # Depth file no. 1
d_row = [line.strip() for line in open(dfile)]
I have loaded a data file into a list without the newline characters. Now I want to find the index of every element in d_row whose string does not begin with a number or is empty. Next, I need to:
remove all of the non-numeric instances detailed above, and
save those strings and their indexes for later insertion into an updated file.
Example of data:
Thu Mar 14 18:17:05 2013
Fri Mar 15 01:40:25 2013
FT
DepthChange: 0.000000,2895.336,0.000
1363285025.250000,9498.970
1363285025.300000,9498.970
1363285026.050000,9498.970
1363287840.450042,9458.010
1363287840.500042,9458.010
1363287840.850042,9458.010
1363287840.900042,9458.010
DepthChange: 0.000000,2882.810,9457.200
1363287840.950042,9458.010
DepthChange: 0.000000,2882.810,0.000
1363287841.000042,9457.170
1363287841.050042,9457.170
1363287841.100042,9457.170
1363287841.150042,9457.170
1363287841.200042,9457.170
1363287841.250042,9457.170
1363287841.300042,9457.170
1363291902.750102,9149.937
1363291902.800102,9149.822
1363291902.850102,9149.822
1363291902.900102,9149.822
1363291902.950102,9149.822
1363291903.000102,9149.822
1363291903.050102,9149.708
1363291903.100102,9149.708
1363291903.150102,9149.708
1363291903.200102,9149.708
1363291903.250102,9149.708
1363291903.300102,9149.592
1363291903.350102,9149.592
1363291903.400102,9149.592
1363291903.450102,9149.592
1363291903.500102,9149.592
DepthChange: 0.000000,2788.770,2788.709
1363291903.550102,9149.479
1363291903.600102,9149.379
I have been doing the removal step manually, which is time consuming because the file contains over half a million rows. Currently I am unable to rewrite the file so that it contains all of the original elements with some modifications.
Any tips would be much appreciated.
dfile = 'new_data.txt'
with open(dfile) as infile:
    numericLines = set()  # line numbers of lines that start with digits
    emptyLines = set()    # line numbers of lines that are empty
    charLines = []        # stripped lines that start with a letter
    for lineno, line in enumerate(infile):
        if line[0].isalpha():
            charLines.append(line.strip())
        elif line[0].isdigit():
            numericLines.add(lineno)
        elif not line.strip():
            emptyLines.add(lineno)
The easiest way to do this is in two passes: first collect the matching rows with their line numbers, then collect the non-matching ones (here is_good_row() stands for whatever test you use):
d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]
This does mean making two passes over the list, but who cares? If the list is small enough to read the whole thing into memory as you're already doing, the extra cost is probably negligible.
Alternatively, if you need to avoid the cost of building two lists in two passes, you probably also need to avoid reading the whole file at once in the first place, so you'll have to do things a little more cleverly:
d_rows = (line.strip() for line in open(dfile))  # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))
If you can push things even farther back to the point where you don't even need explicit good_rows and bad_rows lists, you can keep everything in an iterator all the way through, and waste no memory or up-front reading time at all:
d_rows = (line.strip() for line in open(dfile))  # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)
Thanks to all who replied to my question. Using a part of each reply I was able to attain the desired result. What finally worked is as follows:
import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []
infile = open(ifile)
d_rows = (line for line in infile)
with open(ofile, 'w') as f:
    for i, row in enumerate(d_rows):
        if row[0].isdigit():
            f.write(row)
            goodrow_ind.append(i)
        else:
            badrow_ind.append(i)
            badrows.append(row)
infile.close()

data = np.loadtxt(open(ofile, 'rb'), delimiter=',')
The result is "good" and "bad" rows separated with an index for each.

writing lines group by group in different files

I've got a little script which is not working nicely for me; I hope you can help me find the problem.
I have two starting files:
traveltimes: contains the lines I need; it's a single-column file (every row has just one number). The groups of lines I need are separated by a line which starts with 11 whitespace characters
header lines: contains three header lines
output files: I want to get 29 files (STA%s). What's inside? Every file should contain the same header lines, followed by one group of lines from the traveltimes file (a different group for every file). Every group is made up of 74307 rows (one column)
So far this script creates 29 files with the same header lines, but then it mixes everything up: it writes something, but it's not what I want.
Any ideas?
def make_station_files(traveltimes, header_lines):
    """Gives the STAxx.tgrid files required by loc3d"""
    sta_counter = 1
    with open(header_lines, 'r') as file_in:
        data = file_in.readlines()
        for i in range(29):
            with open('STA%s' % (sta_counter), 'w') as output_files:
                sta_counter += 1
                for i in data[0:3]:
                    values = i.strip()
                    output_files.write("%s\n\t1\n" % (values))
                with open(traveltimes, 'r') as times_file:
                    #collector = []
                    for line in times_file:
                        if line.startswith(" "):
                            break
                        output_files.write("%s" % (line))
Suggestion:
Read the header rows first. Make sure this works before proceeding. None of the rest of the code needs to be indented under this.
Consider writing a separate function to group the traveltimes file into a list of lists (a sketch of such a grouper follows these suggestions).
Once you have a working traveltimes reader and grouper, only then create a new STA file, print the headers to it, and then write the timegroups to it.
Build your program up step-by-step, making sure it does what you expect at each step. Don't try to do it all at once because then you won't easily be able to track down where the issue lies.
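A possible sketch of that grouper (a sketch under assumptions, not the code below), treating any line that starts with whitespace as a group separator, as described in the question:
def group_traveltimes(traveltimes):
    'Group the traveltimes file into a list of lists, split on separator lines.'
    groups, current = [], []
    with open(traveltimes) as f:
        for line in f:
            if line.startswith(' '):   # separator line: close the current group
                if current:
                    groups.append(current)
                    current = []
            else:
                current.append(line)
    if current:                        # don't forget the last group
        groups.append(current)
    return groups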
My quick edit of your script uses itertools.groupby() as the grouper. It is a little advanced because the grouping function is stateful and tracks its state in a mutable list:
from itertools import groupby

def make_station_files(traveltimes, header_lines):
    'Gives the STAxx.tgrid files required by loc3d'
    with open(header_lines, 'r') as f:
        headers = f.readlines()

    def station_counter(line, cnt=[1]):
        'Stateful station counter -- keeps the count in a mutable list'
        if line.strip() == '':
            cnt[0] += 1
        return cnt[0]

    with open(traveltimes, 'r') as times_file:
        for station, group in groupby(times_file, station_counter):
            with open('STA%s' % (station), 'w') as output_file:
                for header in headers[:3]:
                    output_file.write('%s\n\t1\n' % (header.strip()))
                for line in group:
                    if not line.startswith(' '):
                        output_file.write('%s' % (line))
This code is untested because I don't have sample data. Hopefully, you'll get the gist of it.

Two simple questions about Python

I have 2 simple questions about Python:
1. How do I get the number of lines in a file?
2. How do I easily seek a file object to the last line?
Lines are just data delimited by the newline char '\n'.
1) Since lines are of variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines there are:
count = 0
for line in open('myfile'):
    count += 1
print count, line  # it will be the last line
2) Reading a chunk from the end of the file is the fastest method of finding the last newline char.
import os

def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
    if not file_obj.tell():
        return  # already at the beginning of the file
    # All lines end with \n, including the last one, so assume we are just
    # after an end-of-line char
    file_obj.seek(-1, os.SEEK_CUR)
    while file_obj.tell():
        amount = min(buffer_size, file_obj.tell())
        file_obj.seek(-amount, os.SEEK_CUR)
        data = file_obj.read(amount)
        eol_pos = data.rfind(eol_char)
        if eol_pos != -1:
            file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
            break
        file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file into memory as a list. Keep that in mind if the file is very large.
The easiest way is simply to read the file into memory, e.g.:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However, for big files this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line, e.g.:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory but only looks at one line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers with:
f = open('filename.txt')
num_lines = 0
last_line = None
for line in f:
    num_lines += 1
    last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement, if you need only the last line, is to start at the end of the file and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need the line count as well though, there's no alternative except to iterate through all the lines in the file.
For small files that fit in memory, how about using str.count() to get the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')
I'd like to add to the other solutions that some of them (those that look for \n) will not work with files that use OS 9-style line endings (\r only). Also, files may contain an extra blank line at the end, because lots of text editors append one for some curious reason, so you might or might not want to add a check for it.
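A minimal sketch of a line-ending-agnostic count (a sketch, not one of the answers): str.splitlines() understands \n, \r and \r\n, and it does not produce a trailing empty entry when the file ends with a newline.
with open('myfile.txt', 'rb') as f:          # binary mode, so no newline translation
    line_count = len(f.read().splitlines())  # handles \n, \r and \r\n endings
print(line_count)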
The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.
For the first question there are already a few good answers; I'd suggest Brian's as the best (most Pythonic, line-ending proof, and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like nosklo's answer, but modified to be more general it would be:
import os

f = open('myfile')
f.seek(0, os.SEEK_END)
to = f.tell()
found = -1
while found == -1 and to > 0:
    fro = max(0, to - 1024)
    f.seek(fro)
    chunk = f.read(to - fro)
    found = chunk.rfind("\n")
    to -= 1024
if found != -1:
    found += fro
It searches in chunks of 1 KB from the end of the file until it finds a newline character or reaches the beginning of the file. At the end of the code, found is the index of the last newline character.
Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes, I do understand that this only suits small files and simple programs. I don't think I will delete this answer, however useless it may seem for real use cases.
Answer 1:
x = open("file.txt")
opens the file, so x is associated with file.txt
y = x.readlines()
returns all the lines as a list
length = len(y)
assigns the length of the list to length
Or in one line:
length = len(open("file.txt").readlines())
Answer 2:
last = y[-1]
returns the last element of the list
Approach:
Open the file in read mode and assign the resulting file object to a variable named "file".
Assign 0 to a counter variable.
Read the content of the file using the read function and assign it to a variable named "Content".
Split the content into a list wherever an "\n" is encountered.
Traverse the list with a for loop and increment the counter for every non-empty element.
Finally, the value held in the counter variable, which is the number of lines, is displayed.
Python program to count the number of lines in a text file:
# Opening a file
file = open("filename", "file mode")  # file mode like r, w, a...
Counter = 0

# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
    if i:
        Counter += 1

print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the name of your file (including its extension) and the file mode with read mode, 'r'.
