Not all duplicates are deleted from a text file in Python - python

I am new to Python. I am trying to delete duplicates from my text file by doing the following:
line_seen = set()
f = open('a.txt', 'r')
w = open('out.txt', 'w')
for i in f:
    if i not in line_seen:
        w.write(i)
        line_seen.add(i)
f.close()
w.close()
In the initial file I had
hello
world
python
world
hello
And in output file I got
hello
world
python
hello
So it did not remove the last duplicate. Can anyone help me understand why this happened and how I could fix it?

The first line probably contains 'hello\n' - the last line contains only 'hello' - they are not the same.
Use
line_seen = set()
with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    for i in f:
        i = i.strip()  # remove the \n from line
        if i not in line_seen:
            w.write(i + "\n")
            line_seen.add(i)
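To see the mismatch concretely, here is a small sketch that works on an in-memory list instead of a file. The helper name `dedupe_lines` is made up for illustration, and it strips only the trailing newline rather than all surrounding whitespace, which is an assumption about what you want to count as "the same line".

```python
def dedupe_lines(lines):
    """Yield each line once, in first-seen order, ignoring the trailing newline."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n")  # 'hello\n' and 'hello' now compare equal
        if key not in seen:
            seen.add(key)
            yield key + "\n"


# The last line has no trailing newline, just like the file in the question.
sample = ["hello\n", "world\n", "python\n", "world\n", "hello"]
print("hello\n" == "hello")        # False: this is why the duplicate survived
print(list(dedupe_lines(sample)))  # ['hello\n', 'world\n', 'python\n']
```

Since a file object is itself an iterable of lines, you could probably pass the open file straight to a helper like this and hand the result to `writelines()`.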

The main problem is the newline character ("\n"), which appears at the end of every line except the last. You can use a combination of set, map and join, as follows (note that a set does not preserve the original line order):
f = open('a.txt', 'r')
w = open('out.txt', 'w')
w.write("\n".join(set(map(str.strip, f.readlines()))))
f.close()
w.close()
out.txt (the order may vary):
python
world
hello
If you want to stick to your previous approach you can use:
line_seen = set()
f = open('a.txt', 'r')
w = open('out.txt', 'w')
for i in f:
    i = i.strip()
    if i not in line_seen:
        w.write(i + "\n")  # strip() removed the newline, so add it back
        line_seen.add(i)
f.close()
w.close()

Most likely you didn't end the last line with a newline. The line already seen is 'hello\n'; the last one is just 'hello'.
Fix the input, or strip() each line as you read it.

# Since we check whether the line already exists in lines, we can use a list
# instead of a set to preserve order
lines = []
infile = open('a.txt', 'r')
outfile = open('out.txt', 'w')
# Use the readlines method
for line in infile.readlines():
    # Strip whitespace first, so 'hello\n' and 'hello' compare equal
    line = line.strip()
    if line not in lines:
        lines.append(line)
for line in lines:
    # Add the newline back
    outfile.write("{}\n".format(line))
infile.close()
outfile.close()
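If the list lookup gets slow on big files, one alternative (not from the answers above, just a sketch) is to lean on the fact that dict keys are unique and keep insertion order in Python 3.7+. The function name `unique_in_order` is hypothetical.

```python
def unique_in_order(lines):
    """Return stripped lines with duplicates removed, keeping first-seen order."""
    stripped = (line.strip() for line in lines)
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(stripped))


print(unique_in_order(["hello\n", "world\n", "python\n", "world\n", "hello"]))
# ['hello', 'world', 'python']
```

This keeps the order-preserving behaviour of the list version while membership checks stay roughly constant-time.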

Related

How to get first letter of each line in python?

Here is how I open and read my txt file:
f = open('data.txt', 'r')
print(f.read())
The output shows:
['Cat\n','Dog\n','Cat\n','Dog\n'........]
But I would like to get this:
['C\n','D\n','C\n','D\n'........]
First you'll want to open the file in read mode (r flag in open), then you can iterate through the file object with a for loop to read each line one at a time. Lastly, you want to access the first element of each line at index 0 to get the first letter.
first_letters = []
with open('data.txt', 'r') as f:
    for line in f:
        first_letters.append(line[0])
print(first_letters)
If you want to have the newline character still present in the string you can modify the append line above to:
first_letters.append(line[0] + '\n')
Or, to simply print the first letter of each line:
f = open("data.txt", "r")
for x in f:
    print(x[0])
f.close()
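One thing the snippets above don't cover is blank lines in the file. A rough sketch that skips them, using a made-up helper name `first_letters_of` and an in-memory list in place of the file:

```python
def first_letters_of(lines):
    """Return the first character of every non-blank line."""
    # line[:1] is safe even for an empty string, unlike line[0]
    return [line[:1] for line in lines if line.strip()]


print(first_letters_of(["Cat\n", "Dog\n", "\n", "Cat\n"]))  # ['C', 'D', 'C']
```

Whether blank lines should be skipped or kept is an assumption here; drop the `if line.strip()` filter if you want an entry for every line.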

Multiple str edits to a single .txt file python

I've scraped some comments from a webpage using selenium and saved them to a text file. Now I would like to perform multiple edits to the text file and save it again. I've tried to group the following into one smooth flow but I'm fairly new to python so I just couldn't get it right. Examples of what happened to me at the bottom. The only way I could get it to work is to open and close the file over and over.
These are the actions I want to perform, in the order they need to happen:
import re
with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    for line in lines:
        f.write(line.replace("a sample text line", ' '))
with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    pattern = r'\d in \d example text'
    for line in lines:
        f.write(re.sub(pattern, "", line))
with open('results.txt', 'r') as f:
    lines = f.readlines()
with open('results.txt','w') as file:
    for line in lines:
        if not line.isspace():
            file.write(line)
with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    for line in lines:
        f.write(line.replace(" ", '-'))
I've tried to loop them into one but I get doubled lines, words, or extra spaces.
Any help is appreciated, thank you.
If you want to do these in one smooth pass, you had better open another file to write the desired results to, i.e.
import re
pattern = r"\d in \d example text"
# Open your results file for reading and another one for writing
with open("results.txt", "r") as fh_in, open("output.txt", "w") as fh_out:
    for line in fh_in:
        # Process the line
        line = line.replace("a sample text line", " ")
        line = re.sub(pattern, "", line)
        if line.isspace():
            continue
        line = line.replace(" ", "-")
        # Write out
        fh_out.write(line)
We process each line in the order you described, and the resulting line goes to the output file.
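If you'd like to check the per-line logic without touching any files, the same steps can be pulled into small functions. This is only a sketch: `clean_line` and `clean_lines` are hypothetical names, and the sample strings are invented for the demo.

```python
import re

PATTERN = re.compile(r"\d in \d example text")


def clean_line(line):
    """Apply the edits in order; return None if the line ends up blank."""
    line = line.replace("a sample text line", " ")
    line = PATTERN.sub("", line)
    if not line.strip():  # also catches a completely empty string
        return None
    return line.replace(" ", "-")


def clean_lines(lines):
    """Yield only the lines that survive clean_line()."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned is not None:
            yield cleaned


sample = ["good comment\n", "a sample text line\n", "3 in 5 example text\n"]
print(list(clean_lines(sample)))  # ['good-comment\n']
```

Note the blank check uses `not line.strip()` rather than `line.isspace()`, because `"".isspace()` is False and an empty string would otherwise slip through.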

Write the first word/letter of each line to a new file

I have a file 'master.sql' that contains:
a.b.c
d.e.f
g.h.i
and I want to write on 'databases.sql' just the first letters, like this:
a
d
g
Here is my code, but it writes just the last letter, the 'g'.
with open ('master.sql', 'r') as f:
    for line in f:
        x=(line.split('.')[0])
with open('databases.sql', 'w') as f:
    f.write(str(x))
How can I fix this?
You'll need to write your data as you read it, otherwise you're not going to be able to do what you want. Fortunately, with allows you to open multiple files concurrently. This should work for you.
with open ('master.sql', 'r') as f1, open('databases.sql', 'w') as f2:
    for line in f1:
        f2.write(line.split('.')[0] + '\n')
Don't forget to write a newline, because file.write doesn't add one automatically.
Using list:
x = []
with open('master.sql', 'r') as f:
    for line in f.readlines():
        x.append(line.split('.')[0])
with open('databases.sql', 'w') as f:
    for word in x:
        f.write(str(word)+'\n')
The variable x receives every value, but each iteration of the loop overwrites the previous one. Hence, the result is 'g'.
To save all the values you can do it like this:
lst = []
with open ('master.sql', 'r') as f:
    for line in f:
        lst.append(line.split('.')[0])
x = '\n'.join(lst)
with open('databases.sql', 'w') as f:
    f.write(x)
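The collect-then-write idea above can also be written as a single list comprehension. A minimal sketch, assuming you want to ignore blank lines and using the invented name `first_fields`:

```python
def first_fields(lines, sep="."):
    """Return the text before the first separator on each non-blank line."""
    return [line.strip().split(sep)[0] for line in lines if line.strip()]


print(first_fields(["a.b.c\n", "d.e.f\n", "g.h.i\n"]))  # ['a', 'd', 'g']
```

Stripping before splitting matters for lines with no separator at all: without it, the trailing newline would end up in the result.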

Counting to 100,000 and writing that to a file

I haven't used Python for a while but I decided to create a program today to help me with some work I am trying to do. I am trying to create a program that writes the numbers 1-100,000 with the symbol | after each but can't seem to strip the file after I create it so it shows like this: 1|2|3|4.
My Code:
a = 0
b = "|"
while a < 100000:
    a += 1 # Same as a = a + 1
    new = (a,b)
    f = open("export.txt","a") #opens file with name of "export.txt"
    f.write(str(new))
    f.close()
infile = "export.txt"
outfile = "newfile.txt"
delete_list = ["(","," "'"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
export.txt:
newfile.txt:
It looks like you're doing a lot of work unnecessarily.
If all you want is a file that has the numbers 0-99999 with | after each, you could do:
delim = "|"
with open('export.txt', 'w') as f:
    for a in range(100000):  # use xrange on Python 2
        f.write("%d%s" % (a, delim))
I'm not sure what the purpose of the second file is, but, in general, to open one file to read from and a second to write to, you could do:
with open('export.txt', 'r') as fi:
    with open('newfile.txt', 'w') as fo:
        for line in fi:
            for word in line.split('|'):
                print(word)
                fo.write(word)
Note that there are no newlines in the original file, so for line in fi is actually reading the entire contents of "export.txt" -- this could cause issues.
Try this for writing your file:
numbers = []
for x in range(1,100001):
    numbers.append(str(x))
f = open('export.txt', 'w')
f.write('|'.join(numbers))
f.close()
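The join approach can be shrunk a little further with a generator expression. This is just a sketch with a made-up function name, `pipe_joined`, shown on a tiny range so the output is easy to check.

```python
def pipe_joined(n, delim="|"):
    """Return the numbers 1..n as one string, separated by delim."""
    return delim.join(str(i) for i in range(1, n + 1))


print(pipe_joined(4))  # 1|2|3|4
```

Writing the real file would then be a single `f.write(pipe_joined(100000))` inside a `with open('export.txt', 'w') as f:` block. Because join puts the delimiter between items only, there is no trailing | to clean up afterwards.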

Read lines from a text file, reverse and save in a new text file

So far I have this code:
f = open("text.txt", "rb")
s = f.read()
f.close()
f = open("newtext.txt", "wb")
f.write(s[::-1])
f.close()
The text in the original file is:
This is Line 1
This is Line 2
This is Line 3
This is Line 4
And when it reverses it and saves it the new file looks like this:
4 eniL si sihT 3 eniL si sihT 2 eniL si sihT 1 eniL si sihT
When I want it to look like this:
This is line 4
This is line 3
This is line 2
This is line 1
How can I do this?
You can do something like:
with open('test.txt') as f, open('output.txt', 'w') as fout:
    fout.writelines(reversed(f.readlines()))
read() returns the whole file as a single string. That's why reversing it reverses the characters within each line too, not just the order of the lines. To reverse only the order of the lines, use readlines() to get a list of them (to a first approximation, it is equivalent to s = f.read().split('\n'), except that readlines() keeps the trailing newline on each line):
s = f.readlines()
...
f.writelines(s[::-1])
# or f.writelines(reversed(s))
f = open("text.txt", "r")
s = f.readlines()
f.close()
f = open("newtext.txt", "w")
s.reverse()
for item in s:
    f.write(item)  # each item already ends with its own newline
f.close()
The method file.read() returns a string of the whole file, not the lines.
And since s is a string of the whole file, you're reversing the letters, not the lines!
First, you'll have to split it to lines:
s = f.read()
lines = s.split('\n')
Or:
lines = f.readlines()
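Your reversing approach is then already correct. With readlines(), the lines keep their newlines, so you can write them directly:
f.writelines(lines[::-1])
With split('\n'), join them back into one string first:
f.write('\n'.join(lines[::-1]))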
Hope this helps!
There are a couple of steps here. First we want to get all the lines from the first file, and then we want to write them in reversed order to the new file. The code for doing this is as follows
lines = []
with open('text.txt') as f:
    lines = f.readlines()
with open('newtext.txt', 'w') as f:
    for line in reversed(lines):
        f.write(line)
Firstly, we initialize a variable to hold our lines. Then we read all the lines from the 'text.txt' file.
Secondly, we open our output file. Here we loop through the lines in reversed order, writing them to the output file as we go.
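One catch with reversing the output of readlines(): if the last line of the input has no trailing newline (as is likely with the sample file), it gets glued to the line that follows it in the reversed output. A sketch that sidesteps this by dropping and re-adding the newlines; `reverse_lines` is a hypothetical helper working on a string rather than on a file:

```python
def reverse_lines(text):
    """Return text with its lines in reverse order, each ending in a newline."""
    lines = text.splitlines()  # drops the line endings, whatever they were
    return "\n".join(reversed(lines)) + ("\n" if lines else "")


print(repr(reverse_lines("a\nb\nc")))  # 'c\nb\na\n'
```

You would feed it `f.read()` and write the result with `f.write()`. Always ending the output with a newline is a design choice on my part, not something the question asks for.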
A sample using a list, so it is easier to follow:
I'm sure there are more elegant answers, but this way is clear to understand.
f = open(r"c:\test.txt", "r")
s = f.read()
f.close()
rowList = []
# Iterate over the lines, not over the individual characters of s
for value in s.splitlines():
    rowList.append(value + "\n")
rowList.reverse()
f = open(r"c:\test.txt", "w")
for value in rowList:
    f.write(value)
f.close()
You have to work line by line.
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
lines = s.split('\n')
f.write('\n'.join(lines[::-1]))
f.close()
Use it like this if your OS uses \n to break lines
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
f.write("\n".join(reversed(s.split("\n"))))
f.close()
The main thing here is "\n".join(reversed(s.split("\n"))).
It does the following:
Splits your string by line breaks (\n), resulting in a list
Reverses the list
Merges the list back into a string, with line breaks (\n) between the items
Here are the states:
string: line1 \n line2 \n line3
list: ["line1", "line2", "line3"]
list: ["line3", "line2", "line1"]
string: line3 \n line2 \n line1
If your input file is too big to fit in memory, here is an efficient way to reverse it:
Split input file into partial files (still in original order).
Read each partial file from last to first, reverse it and append to output file.
Implementation:
import os
from itertools import islice
input_path = "mylog.txt"
output_path = input_path + ".rev"
with open(input_path) as fi:
    for i, sli in enumerate(iter(lambda: list(islice(fi, 100000)), []), 1):
        with open(f"{output_path}.{i:05}", "w") as fo:
            fo.writelines(sli)
with open(output_path, "w") as fo:
    for file_index in range(i, 0, -1):
        path = f"{output_path}.{file_index:05}"
        with open(path) as fi:
            lines = fi.readlines()
        os.remove(path)
        for line in reversed(lines):
            fo.write(line)
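The same two-pass idea can be wrapped in a function, which also makes it easy to try with a small chunk size. This is a sketch, not a drop-in replacement: the name `reverse_large_file` and the `chunk_lines` parameter are made up, and putting the partial files in a temporary directory is my own choice. It also copes with an empty input, where the loop version above would hit a NameError because `i` is never assigned.

```python
import os
import tempfile
from itertools import islice


def reverse_large_file(input_path, output_path, chunk_lines=100_000):
    """Reverse the line order of input_path into output_path, chunk by chunk."""
    chunk_paths = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Pass 1: split the input into chunk files, still in original order.
        with open(input_path) as fi:
            while True:
                chunk = list(islice(fi, chunk_lines))
                if not chunk:
                    break
                path = os.path.join(tmp_dir, f"chunk{len(chunk_paths):05}")
                with open(path, "w") as fo:
                    fo.writelines(chunk)
                chunk_paths.append(path)
        # Pass 2: walk the chunks from last to first, reversing each one.
        with open(output_path, "w") as fo:
            for path in reversed(chunk_paths):
                with open(path) as fi:
                    fo.writelines(reversed(fi.readlines()))
```

Only one chunk is held in memory at a time, and the temporary directory is removed automatically when the `with` block exits.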
