Removing duplicate characters from a file - Python


I have a file as below,
this is Rdaaaa
thissss Is Sethaaa
hiii
I want to remove all the duplicate characters from this file.
I tried two approaches.
The first removes every duplicate character, but it does not seem to be efficient code:
with open("test.txt", "r") as f1:
    with open("test1.txt", "w") as f2:
        #content = f1.readlines()
        char_set = set()
        while True:
            char = f1.read(1)
            if char not in char_set:
                char_set.add(char)
                f2.write(char)
            if not char:
                break
        print(char_set)
I also tried a regex, following a Stack Overflow post:
import re
with open("test.txt", "r") as f1:
    with open("test1.txt", "w") as f2:
        content = f1.read()
        f2.write(re.sub(r'([a-z])\1+',r'\1',content))
But this turns thiish into thish, not into this.
Any suggestions for code with improved efficiency?

For "medium" sized files that fit into memory, this approach is a bit faster and takes fewer lines. You can load the whole file into memory and then create a dictionary from it, where the dictionary's keys are the individual characters in the file. This keeps the output characters in the order in which they were first seen (dicts preserve insertion order in Python 3.7+).
This ran in about 100ms for a 2 MB file with 11501 distinct characters. Your use case may make another approach better.
# replace in_file and out_file with actual paths or file names
with open(in_file, "r") as f1, open(out_file, "w") as f2:
    txt = f1.read()
    ordered_set = ''.join(dict.fromkeys(txt))
    f2.write(ordered_set)

If you have a big file and you don't want to load it into memory, you can read it line by line instead of character by character, which is considerably faster:
memory = set()
with open("old_file.txt", "r") as file_input, open("new_file.txt", "w") as file_output:
    for line in file_input:
        new_line = ""
        for char in line:
            if char == " ":
                new_line += char  # spaces are always kept
                continue
            if char not in memory:
                memory.add(char)
                new_line += char
        file_output.write(new_line)
But if the file is small, you can read it once and apply the same logic.

It looks like you want to delete letters that appear repeatedly in succession. Try using itertools.groupby:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> s = '''this is Rdaaaa
... thissss Is Sethaaa
... hiii'''
>>> print(''.join(map(itemgetter(0), groupby(s))))
this is Rda
this Is Setha
hi
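The two readings of "duplicate" in this thread give different results. A minimal sketch comparing them on the question's sample text; the helper names dedupe_all and collapse_runs are made up for this illustration:

```python
from itertools import groupby


def dedupe_all(text):
    """Keep only the first occurrence of each character in the whole text."""
    return "".join(dict.fromkeys(text))


def collapse_runs(text):
    """Collapse each run of identical adjacent characters to one character."""
    return "".join(key for key, _group in groupby(text))


sample = "this is Rdaaaa\nthissss Is Sethaaa\nhiii"
print(repr(dedupe_all(sample)))
print(repr(collapse_runs(sample)))
```

Only collapse_runs keeps words readable, since dedupe_all also drops every later space, newline and repeated letter anywhere in the file.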

Try this:
import re
# read the file
with open('file.txt', 'r') as f:
    data = f.read()
# remove duplicate characters
data = re.sub(r'(.)\1+', r'\1', data)
# write the file
with open('file.txt', 'w') as f:
    f.write(data)
The output is:
this is Rda
this Is Setha
hi

Related

Completely deleting duplicates words in a text file

I have some words in a text file like:
joynal
abedin
rahim
mohammad
joynal
abedin
mohammad
kudds
I want to delete the duplicated names entirely: every name that appears more than once should be removed from the text file completely.
The output should be like:
rahim
kudds
I have tried some code, but it only collapses each duplicate into a single entry, e.g. one joynal and one abedin.
Edited: This is the code I tried:
content = open('file.txt' , 'r').readlines()
content_set = set(content)
cleandata = open('data.txt' , 'w')
for line in content_set:
    cleandata.write(line)

Use a Counter:
from collections import Counter
with open(fn) as f:
    cntr=Counter(w.strip() for w in f)
Then just print the words with a count of 1:
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
rahim
kudds
Or do it the old-fashioned way with a dict as a counter:
cntr={}
with open(fn) as f:
    for line in f:
        k=line.strip()
        cntr[k]=cntr.get(k, 0)+1
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
# same
If you want to output to a new file:
with open(new_file, 'w') as f_out:
    f_out.write('\n'.join(w for w,cnt in cntr.items() if cnt==1))

You can build a list that appends a name when it is not yet in the list and removes it when it occurs a second time.
with open("file1.txt", "r") as f, open("output_file.txt", "w") as g:
    output_list = []
    for line in f:
        word = line.strip()
        if word not in output_list:
            output_list.append(word)
        else:
            output_list.remove(word)
    g.write("\n".join(output_list))
print(output_list)
['rahim', 'kudds']
# in the output file there is one name per row, like this:
rahim
kudds
The solution with Counter is still the more elegant way, in my opinion.
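One difference between the two approaches shows up when a name occurs an odd number of times: the append/remove list puts it back on the third occurrence, while counting does not. A small sketch of that behaviour, with hypothetical helper names and a made-up word list:

```python
from collections import Counter


def unique_by_toggle(words):
    """Append on first sight, remove on the next sight (the list approach)."""
    kept = []
    for word in words:
        if word not in kept:
            kept.append(word)
        else:
            kept.remove(word)
    return kept


def unique_by_count(words):
    """Keep only words whose total count is exactly one, in first-seen order."""
    counts = Counter(words)
    return [word for word in counts if counts[word] == 1]


names = ["joynal", "rahim", "joynal", "kudds", "joynal"]
print(unique_by_toggle(names))  # joynal reappears after its third occurrence
print(unique_by_count(names))   # joynal is dropped for good
```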

For completeness, if you don't care about order:
with open(fn) as f:
    words = set(x.strip() for x in f)
with open(new_fn, "w") as f:
    f.write("\n".join(words))
Where fn is the file you want to read from, and new_fn the file you want to write to.
In general, for uniqueness think set, remembering that order is not guaranteed.

file = open("yourFile.txt") # open file
text = file.read() # returns content of the file
file.close()
wordList = text.split() # creates list of every word
wordList = list(dict.fromkeys(wordList)) # removes duplicate elements
str = ""
for word in wordList:
    str += word
    str += " " # creates a string that contains every word
file = open("yourFile.txt", "w")
file.write(str) # writes the new string in the file
file.close()

Python: reading a file and excluding lines with certain characters

I am trying to figure out how to write a function that opens a file and reads it, however I need it to ignore any lines that contain the character '-'
This is what I have so far:
def read_from_file(filename):
    with open('filename', 'r') as file:
        content = file.readlines()
Any help would be appreciated.

Filter out character '-'-containing lines from your read-in lines:
filtered_lines = [x for x in content if '-' not in x]

I'd filter out while reading the file, not collect the unwanted lines in the first place.
def read_from_file(filename):
    with open(filename) as file:
        content = [line for line in file if '-' not in line]
    return content
Also note that the 'filename' in your open('filename', 'r') is wrong (it is a string literal, not the parameter) and that the 'r' is unnecessary, so I fixed/removed that.

Gwang-Jin Kim's and Heap Overflow's answers are both 100% right, but I always feel that using the tools Python gives you is a plus, so here is a solution using the built-in filter() function:
list(filter(lambda line: "-" not in line, file.read().splitlines()))
def read_from_file(filename):
    with open(filename, "r") as file:
        content = filter(lambda line: "-" not in line, file.readlines())
        return list(content)
Here is a more verbose, yet more efficient solution:
def read_from_file(filename):
    content = []
    with open(filename, "r") as file:
        for line in file:
            if "-" not in line:
                content.append(line)
    return content
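If the file is large and the lines are only consumed once, the same filter can also be written as a generator so that no list is built. This is only a sketch; the function name iter_lines_without and the temporary demo file are assumptions made for the example:

```python
import os
import tempfile


def iter_lines_without(filename, char="-"):
    """Yield the lines of filename that do not contain char, one at a time."""
    with open(filename) as file:
        for line in file:
            if char not in line:
                yield line


# Small self-contained demo using a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("keep me\nskip-me\nkeep me too\n")

kept = list(iter_lines_without(tmp.name))
os.remove(tmp.name)
print(kept)
```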

python: Open file, edit one line, save it as the same file

I want to open a file, search for a specific word, change the word and save the file again. Sounds really easy - but I just can't get it working... I know that I have to overwrite the whole file but only change this one word!
My Code:
f = open('./myfile', 'r')
linelist = f.readlines()
f.close
for line in linelist:
    i =0;
    if 'word' in line:
        for number in arange(0,1,0.1)):
            myNumber = 2 - number
            myNumberasString = str(myNumber)
            myChangedLine = line.replace('word', myNumberasString)
            f2 = open('./myfile', 'w')
            f2.write(line)
            f2.close
            #here I have to do some stuff with these files so there is a reason
            #why everything is in this for loop. And I know that it will
            #overwrite the file every loop and that is good so. I want that :)
If I make it like this, the 'new' myfile file contains only the changed line. But I want the whole file with the changed line... Can anyone help me?
****EDIT*****
I fixed it! I just turned the loops around and now it works perfectly like this:
f=open('myfile','r')
text = f.readlines()
f.close()
i =0;
for number in arange(0,1,0.1):
    fw=open('mynewfile', 'w')
    myNumber = 2 - number
    myNumberasString = str(myNumber)
    for line in text:
        if 'word' in line:
            line = line.replace('word', myNumberasString)
        fw.write(line)
    fw.close()
    #do my stuff here where I need all these input files

You just need to write out all the other lines as you go. As I said in my comment, I don't know what you are really trying to do with your replace, but here's a slightly simplified version in which we're just replacing all occurrences of 'word' with 'new':
f = open('./myfile', 'r')
linelist = f.readlines()
f.close()
# Re-open file here
f2 = open('./myfile', 'w')
for line in linelist:
    line = line.replace('word', 'new')
    f2.write(line)
f2.close()
Or using context managers:
with open('./myfile', 'r') as f:
    lines = f.readlines()
with open('./myfile', 'w') as f:
    for line in lines:
        line = line.replace('word', 'new')
        f.write(line)

Use fileinput, passing in whatever you want to replace:
import fileinput
for line in fileinput.input("in.txt", inplace=True):
    print(line.replace("whatever", "foo"), end="")
You don't seem to be doing anything special in your loop that cannot be calculated first outside the loop, so create the string you want to replace the word with and pass it to replace.
inplace=True means the original file is changed. If you want to verify that everything looks OK, remove inplace=True for the first run and you will see the replaced output printed instead of the lines being written to the file.
If you want to write to a temporary file, you can use a NamedTemporaryFile with shutil.move:
from tempfile import NamedTemporaryFile
from shutil import move
# mode "w" so that str lines can be written (the default mode is binary)
with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        out.write(line.replace("whatever", "foo"))
# replace the original file with the rewritten temporary file
move(out.name, "in.txt")
One problem you may encounter is that replace also matches substrings. If you know the word always appears in the middle of a sentence, surrounded by whitespace, you could include that whitespace in the search string; if not, you will need to split each line and check every word:
from tempfile import NamedTemporaryFile
from shutil import move
from string import punctuation
with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        # split() drops the newline, so add it back after joining
        out.write(" ".join(word if word.strip(punctuation) != "whatever" else "foo"
                           for word in line.split()) + "\n")
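Another way to avoid matching substrings is a regular expression with word boundaries, which also leaves surrounding punctuation and spacing untouched. This is a sketch rather than part of the answer above; replace_whole_word is a hypothetical helper, and it assumes the search word starts and ends with a letter, digit or underscore so that \b behaves as expected:

```python
import re


def replace_whole_word(line, old, new):
    """Replace old with new only where old stands as a whole word."""
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function is used as the replacement so that backslashes in `new`
    # are inserted literally instead of being read as escape sequences.
    return re.sub(pattern, lambda _match: new, line)


print(replace_whole_word("whatever, whatevers and whatever.\n", "whatever", "foo"), end="")
```

Here "whatevers" is left alone because the match is not followed by a word boundary.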

There are three issues with your current code. First, create the f2 file handle before starting the loop, otherwise you'll overwrite the file in each iteration. Second, you are writing an unmodified line in f2.write(line). I guess you meant f2.write(myChangedLine)? Third, you should add an else statement that writes unmodified lines to the file. So:
from numpy import arange  # assumption: arange comes from NumPy
f = open('./myfile', 'r')
linelist = f.readlines()
f.close()
f2 = open('./myfile', 'w')
for line in linelist:
    i =0;
    if 'word' in line:
        for number in arange(0,1,0.1):
            myNumber = 2 - number
            myNumberasString = str(myNumber)
            myChangedLine = line.replace('word', myNumberasString)
        f2.write(myChangedLine)
    else:
        f2.write(line)
f2.close()

Read lines from a text file, reverse and save in a new text file

So far I have this code:
f = open("text.txt", "rb")
s = f.read()
f.close()
f = open("newtext.txt", "wb")
f.write(s[::-1])
f.close()
The text in the original file is:
This is Line 1
This is Line 2
This is Line 3
This is Line 4
And when it reverses it and saves it the new file looks like this:
4 eniL si sihT 3 eniL si sihT 2 eniL si sihT 1 eniL si sihT
When I want it to look like this:
This is line 4
This is line 3
This is line 2
This is line 1
How can I do this?

You can do something like:
with open('test.txt') as f, open('output.txt', 'w') as fout:
    fout.writelines(reversed(f.readlines()))

read() returns the whole file as a single string. That is why reversing it reverses the lines themselves too, not just their order. To reverse only the order of the lines, use readlines() to get a list of them (as a first approximation it is equivalent to s = f.read().split('\n'), except that readlines() keeps the trailing newline on each line):
s = f.readlines()
...
f.writelines(s[::-1])
# or f.writelines(reversed(s))
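One detail to watch when reversing with writelines(): if the last line of the input has no trailing newline, it ends up glued to the line that follows it in the reversed output. A minimal sketch of one way to guard against that, working on a string so it can be run without any files (reversed_lines is a hypothetical helper name):

```python
def reversed_lines(text):
    """Return the lines of text in reverse order, each ending in a newline."""
    lines = text.splitlines(keepends=True)
    # If the last line has no trailing newline, add one so that it does not
    # get glued to the following line once the order is reversed.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    return lines[::-1]


text = "This is Line 1\nThis is Line 2\nThis is Line 3"
print("".join(reversed_lines(text)), end="")
```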

f = open("text.txt", "r")
s = f.readlines()
f.close()
f = open("newtext.txt", "w")
s.reverse()
for item in s:
    f.write(item)  # each item already ends with its own newline
f.close()

The method file.read() returns the whole file as one string, not a list of lines.
And since s is a string of the whole file, you're reversing the letters, not the lines!
First, you'll have to split it into lines:
s = f.read()
lines = s.split('\n')
Or:
lines = f.readlines()
Your slicing approach is then already correct, as long as you join the reversed lines back into a string before writing (or pass the readlines() result to writelines() instead):
f.write('\n'.join(lines[::-1]))
Hope this helps!

There are a couple of steps here. First we want to get all the lines from the first file, and then we want to write them in reversed order to the new file. The code for doing this is as follows
lines = []
with open('text.txt') as f:
    lines = f.readlines()
with open('newtext.txt', 'w') as f:
    for line in reversed(lines):
        f.write(line)
Firstly, we initialize a variable to hold our lines. Then we read all the lines from the 'text.txt' file.
Secondly, we open our output file. Here we loop through the lines in reversed order, writing them to the output file as we go.

A sample using a list, which may be easier to follow:
I'm sure there are more elegant answers, but this way is easy to understand.
f = open(r"c:\test.txt", "r")
s = f.read()
f.close()
rowList = []
for value in s.splitlines():  # iterate over lines, not single characters
    rowList.append(value + "\n")
rowList.reverse()
f = open(r"c:\test.txt", "w")
for value in rowList:
    f.write(value)
f.close()

You have to work line by line.
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
lines = s.split('\n')
f.write('\n'.join(lines[::-1]))
f.close()

Use it like this if your OS uses \n to break lines:
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
f.write("\n".join(reversed(s.split("\n"))))
f.close()
The main thing here is "\n".join(reversed(s.split("\n"))).
It does the following:
Splits your string at the line breaks (\n), resulting in a list
Reverses the list
Merges the list back into a string, with \n between the items
Here are the states:
string: line1 \n line2 \n line3
list: ["line1", "line2", "line3"]
list: ["line3", "line2", "line1"]
string: line3 \n line2 \n line1

If your input file is too big to fit in memory, here is an efficient way to reverse it:
Split input file into partial files (still in original order).
Read each partial file from last to first, reverse it and append to output file.
Implementation:
import os
from itertools import islice
input_path = "mylog.txt"
output_path = input_path + ".rev"
with open(input_path) as fi:
    for i, sli in enumerate(iter(lambda: list(islice(fi, 100000)), []), 1):
        with open(f"{output_path}.{i:05}", "w") as fo:
            fo.writelines(sli)
with open(output_path, "w") as fo:
    for file_index in range(i, 0, -1):
        path = f"{output_path}.{file_index:05}"
        with open(path) as fi:
            lines = fi.readlines()
        os.remove(path)
        for line in reversed(lines):
            fo.write(line)

Copy the last three lines of a text file in python?

I'm new to python and the way it handles variables and arrays of variables in lists is quite alien to me. I would normally read a text file into a vector and then copy the last three into a new array/vector by determining the size of the vector and then looping with a for loop a copy function for the last size-three into a new array.
I don't understand how for loops work in python so I can't do that.
so far I have:
#read text file into line list
numberOfLinesInChat = 3
text_file = open("Output.txt", "r")
lines = text_file.readlines()
text_file.close()
writeLines = []
if len(lines) > numberOfLinesInChat:
    i = 0
    while ((numberOfLinesInChat-i) >= 0):
        writeLine[i] = lines[(len(lines)-(numberOfLinesInChat-i))]
        i+= 1
#write what people say to text file
text_file = open("Output.txt", "w")
text_file.write(writeLines)
text_file.close()

To get the last three lines of a file efficiently, use deque:
from collections import deque
with open('somefile') as fin:
    last3 = deque(fin, 3)
This saves reading the whole file into memory to slice off what you didn't actually want.
To reflect your comment - your complete code would be:
from collections import deque
with open('somefile') as fin, open('outputfile', 'w') as fout:
    fout.writelines(deque(fin, 3))
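To see why a bounded deque ends up holding only the tail of the file, here is a small illustration that uses io.StringIO as a stand-in for an open text file (the sample lines are made up):

```python
from collections import deque
import io

# io.StringIO stands in for an open text file in this illustration.
fake_file = io.StringIO("line 1\nline 2\nline 3\nline 4\nline 5\n")

# With maxlen=3 the deque discards its oldest item whenever a fourth arrives,
# so only the final three lines are left once the input is exhausted.
last3 = deque(fake_file, maxlen=3)
print(list(last3))
```

The file is still read from start to finish, but at most three lines are held in memory at any time.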

As long as you're ok to hold all of the file lines in memory, you can slice the list of lines to get the last x items. See http://docs.python.org/2/tutorial/introduction.html and search for 'slice notation'.
def get_chat_lines(file_path, num_chat_lines):
    with open(file_path) as src:
        lines = src.readlines()
    return lines[-num_chat_lines:]
>>> lines = get_chat_lines('Output.txt', 3)
>>> print(lines)
... ['line n-3\n', 'line n-2\n', 'line n-1']

First, to answer your question: my guess is that you had an index error. You should replace the assignment to writeLine[i] with writeLines.append( ). After that, you should also loop to write the output:
text_file = open("Output.txt", "w")
for row in writeLines:
    text_file.write(row)
text_file.close()
May I suggest a more Pythonic way to write this? It would be as follows:
with open("Input.txt") as f_in, open("Output.txt", "w") as f_out:
    for row in f_in.readlines()[-3:]:
        f_out.write(row)

A possible solution:
lines = [l for l in open("Output.txt")]
file = open('Output.txt', 'w')
file.writelines(lines[-3:])  # the last three lines
file.close()

This might be a little clearer if you do not know Python syntax.
lst_lines = lines.splitlines()
This will create a list containing all the lines in the text file (assuming lines holds the file's contents as a single string).
Then for the last line you can do:
last = lst_lines[-1]
secondLast = lst_lines[-2]
and so on. List and string indexes can be counted from the end using '-'.
Or you can loop through them and pick specific ones using range, where start = first line index, stop = where to end (exclusive), step = what to increment by:
for i in range(start, stop, step):
    string = lst_lines[i]
Then just write them to a file.
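A short sketch of negative indexing and slicing on a made-up list of lines (all names and values here are illustrative only):

```python
lst_lines = ["line 1", "line 2", "line 3", "line 4", "line 5"]

last = lst_lines[-1]          # last item
second_last = lst_lines[-2]   # item before the last
last_three = lst_lines[-3:]   # slice from the third-from-last item to the end

# Unlike indexing, a slice does not fail when the list is shorter than asked:
short = ["only line"]
print(last, second_last, last_three, short[-3:])
```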
