The task is to write the unique_file function, which takes an input filename and an output filename as parameters. Your function should read contents from the input file and create a list of unique words - basically, no word may be written to the output file more than once. The code I used is:
def unique_file(input_filename, output_filename):
    file = open(input_filename, "r")
    contents = file.read()
    word_list = contents.split()
    output_file = open(output_filename, 'w+')
    for word in word_list:
        if word not in output_file:
            output_file.write(word + '\n')
    file.close()
    output_file.close()
    print('Done')
But this function just copies everything from the input file to the output file, so I get words like 'and' and 'I' that occur more than once in the output file.
Please help.
You can't really check if word not in output_file: like that. I would suggest you use a set to get unique words:
def unique_file(input_filename, output_filename):
    with open(input_filename) as file:
        contents = file.read()
        word_set = set(contents.split())
    with open(output_filename, "w+") as output_file:
        for word in word_set:
            output_file.write(word + '\n')
    print("Done")
Note the use of with to handle files - see the last paragraph of the docs.
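As a minimal illustration (the file name is just a placeholder), with is roughly equivalent to a try/finally that always closes the file, even if an exception is raised:

# Manual version: the file is closed even if read() raises.
f = open("words.txt")
try:
    contents = f.read()
finally:
    f.close()

# The with statement does the same cleanup automatically:
with open("words.txt") as f:
    contents = f.read()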
That's because you cannot ask whether a file contains a word like that. You'll have to keep track of the words you're adding. EDIT: You should actually make seen a set(); membership checking is less costly than with a list.
def unique_file(input_filename, output_filename):
    file = open(input_filename, "r")
    contents = file.read()
    word_list = contents.split()
    output_file = open(output_filename, 'w+')
    seen = set()
    for word in word_list:
        if word not in seen:
            output_file.write(word + '\n')
            seen.add(word)
    file.close()
    output_file.close()
    print('Done')
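To make the EDIT above concrete, here is a rough sketch comparing membership tests; a set lookup is average O(1), while a list has to be scanned element by element (the exact numbers depend on your machine):

import timeit

words_list = [str(i) for i in range(100000)]
words_set = set(words_list)

# Looking up a value near the end forces the list to scan almost everything,
# while the set hashes straight to it.
print(timeit.timeit(lambda: "99999" in words_list, number=1000))
print(timeit.timeit(lambda: "99999" in words_set, number=1000))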
If you don't need to worry about the order of the words, you can just use the built-in set(), a container that does not allow duplicates. Something like this should work:
def unique_file(input_filename, output_filename):
    with open(input_filename, "r") as inp, open(output_filename, "w") as out:
        out.writelines(set(inp.readlines()))
Related
I have a file as below,
this is Rdaaaa
thissss Is Sethaaa
hiii
I want to remove all the duplicate characters from this file.
I tried two approaches.
This completely removes duplicate chars but does not seem to be efficient code.
with open("test.txt", "r") as f1:
with open("test1.txt", "w") as f2:
#content = f1.readlines()
char_set = set()
while True:
char = f1.read(1)
if char not in char_set:
char_set.add(char)
f2.write(char)
if not char:
break
print(char_set)
I also tried using regex, following a Stack Overflow post:
import re

with open("test.txt", "r") as f1:
    with open("test1.txt", "w") as f2:
        content = f1.read()
        f2.write(re.sub(r'([a-z])\1+', r'\1', content))
But this turns thiish into thish rather than this.
Any suggestions on the code with improved efficiency?
For "medium" sized files that can fit into memory, this approach is a bit faster and fewer lines. You can load the whole file into memory, and then create a dictionary from it, where the dictionary's keys are the individual characters in the file. This keeps the output chars in the same order as when they were first seen (property of dict).
This ran in about 100ms for a 2 MB file with 11501 distinct characters. Your use case may make another approach better.
# replace in_file and out_file with actual paths or file names
with open(in_file, "r") as f1, open(out_file, "w") as f2:
    txt = f1.read()
    ordered_set = ''.join(dict.fromkeys(txt).keys())
    f2.write(ordered_set)
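As a quick interactive check of that ordering property (using the first line from the question), dict.fromkeys keeps each character the first time it is seen and drops repeats:

>>> txt = "this is Rdaaaa"
>>> dict.fromkeys(txt)
{'t': None, 'h': None, 'i': None, 's': None, ' ': None, 'R': None, 'd': None, 'a': None}
>>> ''.join(dict.fromkeys(txt))
'this Rda'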
If you have a big file and you don't want to load it into memory, you can read it line by line instead of character by character, which is much better and faster:
file_input = open("old_file.txt", "r")
file_output = open("new_file.txt", "w")

memory = set()
while True:
    line = file_input.readline()
    if not line:
        break
    new_line = ""
    for char in line:
        if char == " ":
            new_line += char
            continue
        if char not in memory:
            memory.add(char)
            new_line += char
    file_output.writelines(new_line)
But if the file is small, you can read it once and apply the same logic.
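A sketch of that small-file version (file names are placeholders; as above, spaces are always kept and every other character is written only once):

with open("old_file.txt", "r") as file_input, open("new_file.txt", "w") as file_output:
    text = file_input.read()
    memory = set()
    new_text = ""
    for char in text:
        if char == " ":
            new_text += char
        elif char not in memory:
            memory.add(char)
            new_text += char
    file_output.write(new_text)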
It looks like you want to delete letters that appear repeatedly in succession. Try using itertools.groupby:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> s = '''this is Rdaaaa
... thissss Is Sethaaa
... hiii'''
>>> print(''.join(map(itemgetter(0), groupby(s))))
this is Rda
this Is Setha
hi
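For anyone unfamiliar with groupby: it groups consecutive identical items, so taking the first element of each group collapses every run of repeats down to a single character. A quick look at what it produces:

>>> from itertools import groupby
>>> [(key, len(list(group))) for key, group in groupby("thissss")]
[('t', 1), ('h', 1), ('i', 1), ('s', 4)]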
Try this
import re

# read the file
with open('file.txt', 'r') as f:
    data = f.read()

# remove duplicate characters
data = re.sub(r'(.)\1+', r'\1', data)

# write the file
with open('file.txt', 'w') as f:
    f.write(data)
The output is:
this is Rda
this Is Setha
hi
I have some words in a text file like:
joynal
abedin
rahim
mohammad
joynal
abedin
mohammad
kudds
I want to delete the duplicate names: any name that occurs more than once should be removed from the text file entirely.
The output should be like:
rahim
kuddus
I have tried some code, but it only collapses the duplicates into a single occurrence (e.g. one joynal and one abedin) instead of removing them entirely.
Edited: This is the code I tried:
content = open('file.txt', 'r').readlines()
content_set = set(content)
cleandata = open('data.txt', 'w')
for line in content_set:
    cleandata.write(line)
Use a Counter:
from collections import Counter

with open(fn) as f:
    cntr = Counter(w.strip() for w in f)
Then just print the words with a count of 1:
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
rahim
kudds
Or do it the 'old-fashioned way' with a dict as a counter:
cntr = {}
with open(fn) as f:
    for line in f:
        k = line.strip()
        cntr[k] = cntr.get(k, 0) + 1
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
# same
If you want to output to a new file:
with open(new_file, 'w') as f_out:
    f_out.write('\n'.join(w for w, cnt in cntr.items() if cnt == 1))
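Putting those pieces together, a small end-to-end sketch (fn and new_file are placeholder names, e.g. the file.txt and data.txt from the question):

from collections import Counter

fn, new_file = "file.txt", "data.txt"  # placeholder paths

with open(fn) as f:
    cntr = Counter(w.strip() for w in f)

# keep only the names that appear exactly once
with open(new_file, 'w') as f_out:
    f_out.write('\n'.join(w for w, cnt in cntr.items() if cnt == 1))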
You can just build a list: append a name the first time it appears and remove it when it appears a second time.
with open("file1.txt", "r") as f, open("output_file.txt", "w") as g:
output_list = []
for line in f:
word = line.strip()
if not word in output_list:
output_list.append(word)
else:
output_list.remove(word)
g.write("\n".join(output_list))
print(output_list)
['rahim', 'kudds']
# the input text file has one name per row, like this:
rahim
kudds
The solution with Counter is still the more elegant way, imo.
For completeness, if you don't care about order:
with open(fn) as f:
    words = set(x.strip() for x in f)

with open(new_fn, "w") as f:
    f.write("\n".join(words))
Where fn is the file you want to read from, and new_fn the file you want to write to.
In general, for uniqueness think set, remembering that order is not guaranteed.
file = open("yourFile.txt") # open file
text = file.read() # returns content of the file
file.close()
wordList = text.split() # creates list of every word
wordList = list(dict.fromkeys(wordList)) # removes duplicate elements
str = ""
for word in wordList:
str += word
str += " " # creates a string that contains every word
file = open("yourFile.txt", "w")
file.write(str) # writes the new string in the file
file.close()
f = open(FilePath, "r")
#print f

with open(FilePath, "r") as f:
    lines = f.readlines()
    #print lines
    for iterms in lines:
        new_file = iterms[::-1]
        print new_file
It gives me a result like this:
7340.12,8796.4871825,0529.710635,751803.0,fit.69-81-63-40tuo
The original line is like this:
out04-32-45-95.tif,0.330693,536043.5237,5281852.0362,20.2260
It is supposed to be like this:
20.2260, ...........out04-32-45-95.tif
You should write your for loop like this:
for iterms in lines:
    new_file = ','.join(iterms.split(',')[::-1])
    print new_file
Explanation:
In your current code, the line iterms[::-1] reverses the entire string in the line, but you only want to reverse the fields separated by ,.
Hence, you need to follow the steps below (a combined example follows the list):
Split the line on , to get a list of fields:
word_list = iterms.split(',')
Reverse the list of fields:
reversed_word_list = word_list[::-1]
Join the reversed list back together with ,:
new_line = ','.join(reversed_word_list)
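Putting the three steps together on the sample line from the question (a quick interactive check):

>>> line = "out04-32-45-95.tif,0.330693,536043.5237,5281852.0362,20.2260"
>>> ','.join(line.split(',')[::-1])
'20.2260,5281852.0362,536043.5237,0.330693,out04-32-45-95.tif'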
I have some pretty large text files (>2g) that I would like to process word by word. The files are space-delimited text files with no line breaks (all words are in a single line). I want to take each word, test if it is a dictionary word (using enchant), and if so, write it to a new file.
This is my code right now:
with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')
        for word in words:
            if d.check(word) == True:
                out_file.write("%s " % word)
I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that using chunks of predetermined size will split words in the middle. Basically, I want chunks to be as close to the specified size while splitting only on spaces. Any suggestions?
Combine the last word of one chunk with the first of the next:
def read_words(filename):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(10240)
            if not buf:
                break
            words = (last + buf).split()
            last = words.pop()
            for word in words:
                yield word
        yield last

with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):  # d is the enchant dictionary from the question
            output.write("%s " % word)
You might be able to get away with something similar to an answer on the question you've linked to, but combining re and mmap, eg:
import mmap
import re

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(r'\w+', mf):
        pass  # do something with each match
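A hedged completion of that idea: on Python 3 an mmap is bytes-like, so the pattern has to be bytes and each match decoded; the dictionary d is assumed to be the pyenchant dictionary the question mentions, and the file names are the question's placeholders.

import mmap
import re

import enchant

d = enchant.Dict("en_US")  # assuming an English dictionary, as in the question

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        if d.check(word):
            out_file.write("%s " % word)
    mf.close()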
Fortunately, Petr Viktorin has already written code for us. The following code reads a chunk from a file and yields each contained word; if a word spans two chunks, that's handled as well. Since it uses yield, it has to live inside a generator function, so it is wrapped in one below.
def words_from_file(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
https://stackoverflow.com/a/7745406/143880
We have a file named wordlist, which contains 1,876 KB worth of alphabetized words, all of which are longer than 4 letters and contain one carriage return between each new two-letter construction (ab, ac, ad, etc., words all contain returns between them):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the following words: "abuser", "abused", "abusers", "abuse", "abuses", etc. The new file that is created should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print base
            base = word
    result.append(base)
    return result

def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py
real 0m0.233s
user 0m0.180s
sys 0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
One possible improvement is to use a database to load the words and avoid loading the full input file in RAM. Another option is to process the words as you read them from the file and write the results without loading everything in memory.
The following example processes the file as it is read, without pre-loading anything into memory.
def root_words(f, out):
    base = f.readline().strip()
    for line in f:
        word = line.strip()  # drop the trailing newline so startswith works
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")
def main():
    wfile = open("wordlist.txt", "r+")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()
The memory complexity of this solution is O(1), since the variable base is the only thing you need in order to process the file. This works because the file is alphabetically sorted.
Since the list is alphabetized, this does the trick (it takes 0.4 seconds with 5 MB of data, so it should not be a problem with 1.8 MB):
res = [" "]
with open("wordlist.txt","r") as f:
for line in f:
tmp = line.strip()
if tmp.startswith(res[-1]):
pass
else:
res.append(tmp)
with open("newlist.txt","w") as f:
f.write('\n'.join(res[1:]))