How might I remove all lines whose first field is duplicated elsewhere in the file?
Example:
input file : include
line 1 : Messi , 1
line 2 : Messi , 2
line 3 : CR7 , 2
I want the output file to be:
line 1: CR7 , 2
Just CR7 , 2; I want to delete the lines that have duplicate first fields (e.g., Messi). The file is not sorted.
The deletion depends on the first column. If the first column of a line matches the first column of any other line in the file, then I want to delete the line.
How can I do this in Python? Here is my code so far:
lines_seen = set()  # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
This sample has the large original and the known duplicates.
There are a few ways.
You might want to read How do I find the duplicates in a list and create another list with them?
One answer from that, using your code:
import collections
with open(infilename, 'r') as inp:
    lines = inp.readlines()
# lines that occur more than once in the file
output_lines = [line for line, count in collections.Counter(lines).items() if count > 1]
with open(outfilename, "w") as out:
    out.writelines(output_lines)  # each line already carries its own newline
Now that you have provided a sample, it's a slightly different question. Here is your solution:
import collections
from typing import List
def remove_duplicate_first_columns(lines: List[str]) -> List[str]:
    first_col = [line.split(',')[0] for line in lines]
    # a set makes the membership test below fast
    dups = {col for col, count in collections.Counter(first_col).items() if count > 1}
    non_dups = [line for line in lines if line.split(',')[0] not in dups]
    return non_dups
with open('input.csv') as inp:
    lines = inp.readlines()
non_dups = remove_duplicate_first_columns(lines)
with open('nondups.csv', 'w') as out:
    out.writelines(non_dups)  # lines keep their own newlines
print(f"There were {len(lines) - len(non_dups)} lines removed.")
print("This program is gratified to be of use")
I hope this completely answers your question.
You need to be able to remove something that was added earlier, so you can't write each line straight to the output with outfile.write(line). Instead, use an accumulator to hold the data, and write the output only once the whole input has been processed.
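keys_seen = set()  # holds first fields already seen
accumulator = {}   # first field -> line, for fields seen exactly once so far
with open(infilename, "r") as f:
    for line in f:
        key = line.split(",")[0].strip()
        if key not in keys_seen:  # not a duplicate (yet)
            keys_seen.add(key)
            accumulator[key] = line
        else:  # duplicate: also drop the earlier line, if it is still held
            accumulator.pop(key, None)
with open(outfilename, "w") as outfile:
    outfile.writelines(accumulator.values())  # dicts keep insertion order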
Here is another solution you might want to check out.
lines_seen = set()
outfile = open(outfilename, "w")
with open(infilename, "r") as f:
    lines = f.readlines()
# keeps the first line for each first field and drops the later repeats
outfile.writelines([line for line in lines if not (line.split(",")[0] in lines_seen or lines_seen.add(line.split(",")[0]))])
outfile.close()
You can get some more info here! How do you remove duplicates from a list whilst preserving order?
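The answers above mix two different behaviours that are easy to confuse: the seen-set idiom keeps the first line for each key, whereas the question asks to drop every line whose key repeats. Here is a minimal sketch of both on the sample data, assuming comma-separated lines. The function names are made up for illustration.

```python
from collections import Counter


def first_field(line):
    """Return the text before the first comma, without surrounding spaces."""
    return line.split(",", 1)[0].strip()


def keep_first_occurrence(lines):
    """Keep only the first line seen for each first field (order preserved)."""
    seen = set()
    kept = []
    for line in lines:
        key = first_field(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


def drop_all_duplicated(lines):
    """Drop every line whose first field occurs more than once."""
    counts = Counter(first_field(line) for line in lines)
    return [line for line in lines if counts[first_field(line)] == 1]


sample = ["Messi , 1\n", "Messi , 2\n", "CR7 , 2\n"]
print(keep_first_occurrence(sample))  # ['Messi , 1\n', 'CR7 , 2\n']
print(drop_all_duplicated(sample))    # ['CR7 , 2\n']
```

Only the second function gives the output requested in the question (just the CR7 line).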
Related
How can I delete the lines in a text document up to a certain line?
I find the line number using the code:
#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file:
    for num, line in enumerate(text_file, 1):
        if lookup in line:
            print(num)
print(num) outputs the number of the matching line, for example 66.
How do I delete all the lines up to line 66, i.e. up to the line where the word was found?
As proposed here with a small modification to your case:
read all lines of the file.
iterate the lines list until you reach the keyword.
write all remaining lines
with open("yourfile.txt", "r") as f:
    lines = iter(f.readlines())
with open("yourfile.txt", "w") as f:
    for line in lines:
        if lookup in line:
            f.write(line)
            break
    for line in lines:
        f.write(line)
That's easy.
filename = "test.txt"
lookup = '00:00:00'
with open(filename, 'r') as text_file:
    lines = text_file.readlines()
res = []
for i in range(0, len(lines), 1):
    if lookup in lines[i]:
        res = lines[i:]
        break
with open(filename, 'w') as text_file:
    text_file.writelines(res)
Do you know what lines you want to delete?
#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file, open('okfile.txt', 'w') as ok:
    lines = text_file.readlines()
    ok.writelines(lines[4:])
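This skips the first 4 lines and writes the remaining lines to a different file, in case you want to keep the original.
Files opened in a with statement are closed automatically when the block ends; if you open them without with, remember to close them when you're done :)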
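Because the snippet above opens both files in a with statement, no explicit close() call is needed there. A small sketch to confirm this, using a throwaway temporary file (the path and file name are arbitrary choices for the demo):

```python
import os
import tempfile

# A throwaway file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as out:
    out.write("hello\n")
    inside = out.closed   # the file is still open inside the block

print(inside)       # False
print(out.closed)   # True: leaving the with block closed the file
```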
Providing three alternate solutions. All begin with the same first part - reading:
filename = "test.txt"
lookup = '00:00:00'
with open(filename) as text_file:
    lines = text_file.readlines()
The variations for the second parts are:
Using itertools.dropwhile, which discards items from the iterator until the predicate (condition) returns False (i.e. it discards while the predicate is True). From that point on, it yields all the remaining items without re-checking the predicate:
import itertools
with open(filename, 'w') as text_file:
    text_file.writelines(itertools.dropwhile(lambda line: lookup not in line, lines))
Note that the predicate says not in, so all the lines before the one where lookup is found are discarded.
Bonus: If you wanted to do the opposite - write lines until you find the lookup and then stop, replace itertools.dropwhile with itertools.takewhile.
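To see how the two functions split the input, here is a small sketch on an in-memory list instead of a file. The sample lines and the helper name before_match are invented for the demo.

```python
import itertools

lookup = "00:00:00"
lines = ["header\n", "noise\n", "00:00:00 start\n", "data\n", "00:00:00 again\n"]


def before_match(line):
    """Predicate: True for every line that does not contain lookup."""
    return lookup not in line


# dropwhile discards the leading run of non-matching lines, then yields the rest.
tail = list(itertools.dropwhile(before_match, lines))
# takewhile yields the leading run of non-matching lines, then stops.
head = list(itertools.takewhile(before_match, lines))

print(tail)  # ['00:00:00 start\n', 'data\n', '00:00:00 again\n']
print(head)  # ['header\n', 'noise\n']
```

Note that "data" survives in tail even though it does not contain lookup, because the predicate is no longer checked after the first match.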
Using a flag-value (found) to determine when to start writing the file:
with open(filename, 'w') as text_file:
    found = False
    for line in lines:
        if not found and lookup in line:  # 2nd expression not checked once `found` is True
            found = True  # value remains True for all remaining iterations
        if found:
            text_file.write(line)
Similar to #c yj's answer, with some refinements - use enumerate instead of range, and then use the last index (idx) to write the lines from that point on; with no other intermediate variables needed:
for idx, line in enumerate(lines):
    if lookup in line:
        break
with open(filename, 'w') as text_file:
    text_file.writelines(lines[idx:])
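One thing to watch in the enumerate version: if lookup never appears, the loop ends without break and idx stays at the last index, so only the last line is written (and idx is undefined for an empty file). Here is a hedged sketch that makes the not-found case explicit. lines_from_match is a hypothetical helper name, and returning the input unchanged is just one possible policy.

```python
def lines_from_match(lines, lookup):
    """Return lines starting at the first one containing lookup.

    If lookup is never found, return the lines unchanged (an assumed policy).
    """
    for idx, line in enumerate(lines):
        if lookup in line:
            return lines[idx:]
    return list(lines)


sample = ["a\n", "b 00:00:00\n", "c\n"]
print(lines_from_match(sample, "00:00:00"))  # ['b 00:00:00\n', 'c\n']
print(lines_from_match(sample, "missing"))   # ['a\n', 'b 00:00:00\n', 'c\n']
```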
I am trying to code something where I first look for some string in a line in a txt file and when it is found I want to skip that row and the row below to get a new txt file without those rows. I really didn't get any solution from other questions here so maybe this will work
My code looks like this now:
with open("bla.txt", "r+") as f:
    new_f = f.readlines()
    f.seek(0)
    for line in new_f:
        if "abc" not in line:
            f.write(line)
        else:
            pass
            pass
    f.truncate()
I tried it with next(f) as well, but it didn't work for me. Thanks in advance.
This code creates a new file that skips the current and the next row if the current row contains the string ABC:
with open('bla.txt', 'r') as f:
    text = f.read()
lines = text.splitlines()  # avoids a trailing empty item when the file ends with a newline
with open('new_file.txt', 'w') as nf:
    l = 0
    while l < len(lines):
        if 'ABC' in lines[l]:
            l = l + 2
        else:
            nf.write(lines[l] + '\n')
            l = l + 1
Try something simple like this:
import os
search_for = 'abc'
with open('input.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        if search_for in line:
            next(f, None)  # we need to skip the next line
                           # since we are already processing
                           # the line with the string
                           # in effect, it skips two lines;
                           # the None default avoids StopIteration at end of file
        else:
            o.write(line)
os.replace('output.txt', 'input.txt')  # overwrites input.txt, also on Windows
Here is a repl with sample code.
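The same skip-two-lines idea can be tried on an in-memory list, which makes it easy to test before touching any files. This is only a sketch; skip_match_and_next is an invented name and the sample lines are made up.

```python
def skip_match_and_next(lines, search_for):
    """Return lines, omitting each line containing search_for and the one after it."""
    kept = []
    it = iter(lines)
    for line in it:
        if search_for in line:
            next(it, None)  # consume the following line; None if there is none
        else:
            kept.append(line)
    return kept


sample = ["one\n", "abc here\n", "two\n", "three\n", "abc last\n"]
print(skip_match_and_next(sample, "abc"))  # ['one\n', 'three\n']
```

As in the file-based version, the line after a match is dropped without being checked, so a second match directly after the first does not trigger another skip.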
I am reading a file, taking the first element of each line, and comparing it to my list. If it is found, I append the line to a new output file, which should have exactly the same structure as the input file.
my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]
input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
my attempt:
import numpy as np
output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
Question is:
It is currently adding an extra empty line after each line written to the output file. How can I fix it? Or is there a more efficient way to do this?
update:
output file should be for this example:
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Try something like this:
# read lines and strip trailing newline characters
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f.readlines()]
# collect all the lines that match your id list
output_file = [line for line in input_lines if line.split()[0] in my_id_list]
# write to output file
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
# a set is faster for membership testing; the ids are strings because
# line.split() returns strings
my_id_list = {'4985439', '5605471', '6144703'}
with open('input_file') as input_file:
    # Your problem is most likely related to line-endings, so here
    # we read the inputfile into a list of lines with intact line endings.
    # To preserve the input, exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)
with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
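A detail that is easy to miss with this kind of membership test: line.split() always returns strings, so an ID stored as an integer never matches. A quick sketch (the variable names int_ids and str_ids are only for illustration):

```python
int_ids = {4985439, 5605471, 6144703}
str_ids = {"4985439", "5605471", "6144703"}

line = "4985439 16:0.0719814\n"
first_word = line.split()[0]

print(first_word in int_ids)       # False: '4985439' is a str, not an int
print(first_word in str_ids)       # True
print(int(first_word) in int_ids)  # True, provided every first word is numeric
```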
Getting the first word of each line this way is wasteful, since this:
first_word = line.split()[0]
creates a list of all "words" in the line when we just need the first one.
If you know that the columns are separated by spaces you can make it more efficient by only splitting on the first space:
first_word = line.split(' ', 1)[0]
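If the separator might also be a tab or several spaces, passing None as the separator together with a maxsplit of 1 keeps the single-split efficiency while splitting on any whitespace. A short sketch comparing the two calls on made-up lines:

```python
line = "6144703 5:0.193785 19:0.0492507\n"

# Split only once on a single space; the rest of the line stays in one piece.
first_by_space = line.split(" ", 1)[0]

# Passing None splits on any run of whitespace (tabs too) and skips leading blanks.
first_by_whitespace = line.split(None, 1)[0]

print(first_by_space)       # 6144703
print(first_by_whitespace)  # 6144703

# The two differ when the separator is not a plain space:
tabbed = "6144703\t5:0.193785\n"
print(tabbed.split(" ", 1)[0] == tabbed)  # True: no space, so nothing was split
print(tabbed.split(None, 1)[0])           # 6144703
```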
I asked a similar question, How to delete one line after the specific word with Python. However, I want to add one more condition, so the first line should be deleted only if both of these hold:
It comes directly after the word "COMPDAT".
It contains "1" for items 4 and 5.
For example:
COMPDAT
'9850' 125 57 1 1 OPEN /
The code suggested in my previous question works only for condition 1:
input_file = open("input.txt", 'r')
prev_line = False
lines = []
for line in input_file:
    if not prev_line:
        lines.append(line)
    prev_line = False
    if "COMPDAT" in line:
        prev_line = True
input_file.close()
input_file = open("input.txt", 'w')
for line in lines:
    input_file.write(line)
input_file.close()
How to change this code in order to satisfy also the second condition?
Thank you!
This is based on my answer to your other question
def line_and_line_before(file):
    prev_line = None
    for line in file:
        yield (prev_line, line)
        prev_line = line
def has_ones(line):
    # split() without arguments tolerates repeated spaces and the trailing newline
    splitted_line = line.split()
    return len(splitted_line) > 4 and splitted_line[3] == '1' and splitted_line[4] == '1'
input_file = open("input.txt", 'r')
lines = []
for prev_line, line in line_and_line_before(input_file):
    # keep the line unless it follows COMPDAT and has '1' for items 4 and 5
    if (not prev_line or "COMPDAT" not in prev_line) or not has_ones(line):
        lines.append(line)
input_file.close()
input_file = open("input.txt", 'w')
for line in lines:
    input_file.write(line)
input_file.close()
You need to think in terms of when to keep line instead of when to remove line.
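The "when to keep" framing can be written as a single condition and checked on an in-memory list before rewriting any file. This is a sketch under the assumption that items are whitespace-separated; lines_to_keep and the sample data are invented for illustration.

```python
def has_ones(line):
    """True if whitespace-separated items 4 and 5 of the line are both '1'."""
    items = line.split()
    return len(items) > 4 and items[3] == "1" and items[4] == "1"


def lines_to_keep(lines):
    """Drop a line only if it directly follows COMPDAT and has_ones is true."""
    kept = []
    prev_line = None
    for line in lines:
        follows_compdat = prev_line is not None and "COMPDAT" in prev_line
        if not (follows_compdat and has_ones(line)):
            kept.append(line)
        prev_line = line
    return kept


sample = [
    "COMPDAT\n",
    "'9850' 125 57 1 1 OPEN /\n",
    "'9851' 125 57 2 3 OPEN /\n",
    "COMPDAT\n",
    "'9852' 10 20 2 2 OPEN /\n",
]
print(lines_to_keep(sample))
```

Only the '9850' line is dropped here: it is the only one that both follows COMPDAT and has 1 for items 4 and 5.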
I'm new to python and the way it handles variables and arrays of variables in lists is quite alien to me. I would normally read a text file into a vector and then copy the last three into a new array/vector by determining the size of the vector and then looping with a for loop a copy function for the last size-three into a new array.
I don't understand how for loops work in python so I can't do that.
so far I have:
#read text file into line list
numberOfLinesInChat = 3
text_file = open("Output.txt", "r")
lines = text_file.readlines()
text_file.close()
writeLines = []
if len(lines) > numberOfLinesInChat:
    i = 0
    while ((numberOfLinesInChat-i) >= 0):
        writeLine[i] = lines[(len(lines)-(numberOfLinesInChat-i))]
        i+= 1
#write what people say to text file
text_file = open("Output.txt", "w")
text_file.write(writeLines)
text_file.close()
To get the last three lines of a file efficiently, use deque:
from collections import deque
with open('somefile') as fin:
    last3 = deque(fin, 3)
This saves reading the whole file into memory to slice off what you didn't actually want.
To reflect your comment - your complete code would be:
from collections import deque
with open('somefile') as fin, open('outputfile', 'w') as fout:
    fout.writelines(deque(fin, 3))
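To see what the bounded deque holds without creating a file, here is a small sketch in which io.StringIO stands in for the open file. The sample text is made up.

```python
from collections import deque
import io

# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO("line 1\nline 2\nline 3\nline 4\nline 5\n")

# Iterating a file yields lines; maxlen=3 discards older lines as new ones arrive.
last3 = deque(fake_file, maxlen=3)
print(list(last3))  # ['line 3\n', 'line 4\n', 'line 5\n']

# With fewer than three lines, the deque simply holds what exists.
short = deque(io.StringIO("only\n"), maxlen=3)
print(list(short))  # ['only\n']
```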
As long as you're ok to hold all of the file lines in memory, you can slice the list of lines to get the last x items. See http://docs.python.org/2/tutorial/introduction.html and search for 'slice notation'.
def get_chat_lines(file_path, num_chat_lines):
    with open(file_path) as src:
        lines = src.readlines()
    return lines[-num_chat_lines:]
>>> lines = get_chat_lines('Output.txt', 3)
>>> print(lines)
... ['line n-3\n', 'line n-2\n', 'line n-1']
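For readers new to slice notation, a few cases worth trying on a plain list (the sample data is invented):

```python
lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]

last_three = lines[-3:]   # from the third-from-last item to the end
print(last_three)         # ['c\n', 'd\n', 'e\n']

# Slicing never raises IndexError: a short list just yields what it has.
short = ["only\n"]
print(short[-3:])         # ['only\n']

# An end index of 0 means "stop before the first item", so this is empty.
print(lines[-3:0])        # []
```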
First, to answer your question: my guess is that you got an index error, because writeLine[i] assigns to a position that does not exist yet in an empty list. You should replace writeLine[i] = ... with writeLine.append(...). After that, you should also loop over the list to write the output:
text_file = open("Output.txt", "w")
for row in writeLine:
    text_file.write(row)
text_file.close()
May I suggest a more Pythonic way to write this? It would be as follows:
with open("Input.txt") as f_in, open("Output.txt", "w") as f_out:
    for row in f_in.readlines()[-3:]:
        f_out.write(row)
A possible solution:
with open("Output.txt") as file:
    lines = [l for l in file]
file = open('Output.txt', 'w')
file.writelines(lines[-3:])  # [-3:] is the last three lines; write() would need a single string
file.close()
This might be a little clearer if you do not know python syntax.
lst_lines = text.splitlines()
This will create a list containing all the lines in the text file, assuming text holds the whole file content as one string (for example from text_file.read()).
Then for the last line you can do:
last = lst_lines[-1]
secondLAst = lst_lines[-2]
and so on. List and string indexes can count from the end by using a negative sign.
Or you can loop through them and pick specific ones, where start is the first line index, stop is where to end, and step is what to increment by:
for i in range(start, stop-1, step):
    string = lst_lines[i]
Then just write them to a file.