Python replace string in text file with value from list

My problem is to replace strings in a text file with other strings. The key strings are in a dict called word_list. I've tried the following, but nothing seems to work; it prints out the sentence from document.txt as it appears, with no replacement:
word_list = {'hi': 'test', 'how': 'teddy'}

with open("document.txt") as main:
    words = main.read().split()

replaced = []
for y in words:
    replacement = word_list.get(y, y)
    replaced.append(replacement)

text = ' '.join(word_list.get(y, y) for y in words)
print text

new_main = open("done.txt", 'w')
new_main.write(text)
new_main.close()
Content of document.txt:
hi you, how is he?
Current output is the same as document.txt when it should be:
test you, teddy is he?
Any solutions/help would be appreciated :)

As you seem to want to replace words, this will use a more natural definition of 'word':
import re

word_list = {'hi': 'test', 'how': 'teddy'}

with open('document.txt') as main, open('done.txt', 'w') as done:
    text = main.read()
    done.write(re.sub(r'\b\w+\b', lambda x: word_list.get(x.group(), x.group()), text))
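With the sample document.txt above, done.txt ends up containing: test you, teddy is he? The \b word boundaries ensure that a key like hi is only replaced as a whole word, never inside a longer word such as which.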

word_list = {'hi': 'test', 'how': 'teddy'}

with open("document.txt") as main:
    with open('done.txt', 'w') as new_main:
        input_data = main.read()
        for key, value in word_list.iteritems():
            input_data = input_data.replace(key, value)
        new_main.write(input_data)
This will read the entire contents of the file (not the most efficient if it's a large file), then iterate over the search-and-replace items in your dictionary, calling replace on the input text. Once complete, it writes the data out to your new file.

Some things to remember with this approach:
- if your input file is large, it will be slow
- your search pattern can also match word fragments, i.e. hi will match which, so you should cater for that too (see the example below)
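A quick illustration of the fragment problem with plain str.replace() (a hypothetical snippet, not from the answer):

print 'which one?'.replace('hi', 'test')
# wtestch one? -- 'hi' matched inside 'which'

The re.sub approach in the first answer avoids this, because \b only matches at word boundaries.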

Related

Sort words based on first letter in text file

This is the code to read words from a text file, sort them, and then store them in another text file.
# file contains words
file = open('/content/gdrive/MyDrive/Post_OCR_Classifictaion/Dict_try.txt').read().split()

# sorting order based on letters
letters = "abcçdefgğhıijklmnoöprsştuüvyz"
d = {i: letters.index(i) for i in letters}

# sort function
sorted_list = sorted(file, key=d.get)

# store after sorting in new file
textfile = open("/content/gdrive/MyDrive/Post_OCR_Classifictaion/Dict_try_sort.txt", "w")
for element in sorted_list:
    textfile.write(element + "\n")
textfile.close()
These are the words in the text file:
aço
çzb
ogğ
beg
zğe
öge
ğg
gaço
ogğ
But it gives an error.
Here

sorted_list = sorted(file, key=d.get)

file is a list of words, while d is a dict whose keys are single letters, so d.get(word) returns None for every whole word (and sorting on all-None keys fails). You first need to retrieve the first letter of the word and then look it up in the dict, for example using a lambda, i.e.

sorted_list = sorted(file, key=lambda word: d[word[0]])
You can convert your words to indexes:

def key(word):
    return tuple(d[x] for x in word)

sorted_list = sorted(file, key=key)
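As a quick check of the tuple key on the sample words (reusing the letters and d definitions from the question):

letters = "abcçdefgğhıijklmnoöprsştuüvyz"
d = {i: letters.index(i) for i in letters}

def key(word):
    return tuple(d[x] for x in word)

words = ['aço', 'çzb', 'ogğ', 'beg', 'zğe', 'öge', 'ğg', 'gaço', 'ogğ']
print(sorted(words, key=key))
# ['aço', 'beg', 'çzb', 'gaço', 'ğg', 'ogğ', 'ogğ', 'öge', 'zğe']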
You should probably try to account for both upper and lower case, and also for any circumstance where the word either has a length of zero or starts with a character that is not in your list. Consider this possible solution:
letters = "abcçdefgğhıijklmnoöprsştuüvyz"
letters = letters.upper() + letters

def myfunc(s):
    try:
        return letters.index(s[0])
    except Exception:
        pass
    return -1

ml = ['ğc', 'öp', 'Rr', 'Po', 'aw', 'tp', 'çd']
mls = sorted(ml, key=myfunc)
print(mls)
The output of this would be:
['Po', 'Rr', 'aw', 'çd', 'ğc', 'öp', 'tp']
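Edge cases fall back to -1 and therefore sort first, e.g. (hypothetical inputs, not from the question):

print(myfunc(''))     # -1: s[0] raises IndexError on an empty string
print(myfunc('1ab'))  # -1: '1' is not in letters, so .index() raises ValueError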

Converting a file in list format into a dictionary with multiple conditions

Disclaimer: sorry if I have not expressed my issue clearly; the terminology is still new to me. Thank you in advance for reading.

Alright, I have a function named

def pluralize(word)

The aim is to pluralize all nouns within a file. The output I desire is: {'plural': word_in_plural, 'status': x}

where word_in_plural is the pluralized version of the input argument (word), and x is a string which can have one of the following values: 'empty_string', 'proper_noun', 'already_in_plural', 'success'.

My code so far looks like this:
filepath = '/proper_noun.txt'

def pluralize(word):
    proper_nouns = [line.strip() for line in open(filepath)]  ### reads in file as list when function is called
    dictionary = {'plural': word_in_plural, 'status': x}  ### defined dictionary
    if word == '':  ### if word is an empty string, return word_in_plural = '' and x = 'empty_string'
        dictionary['plural'] = ''
        dictionary['status'] = 'empty_string'
        return dictionary
What you can see above is my attempt at writing a condition that returns the specified values if the word is an empty string.

The next goal is to create a condition so that if word is already in plural (assuming it ends with 's', 'es', 'ies', etc.), the function returns a dictionary with the values word_in_plural = word and x = 'already_in_plural', so the input word remains untouched, e.g. (input: apartments, output: apartments).

if word ### is already in plural (ends with a plural suffix), return word_in_plural = word and x = 'already_in_plural'

Any ideas on how to read the last characters of the string to implement the rules? I also very much doubt the logic.

Thank you for your input, SOF community.
You can index the word by -1 to get its last character. You can slice a string to get the last two ([-2:]) or last three ([-3:]) characters:
last_char = word[-1]
last_three_char = word[-3:]
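For suffix checks like the 'already_in_plural' rule, str.endswith() accepts a tuple of suffixes, which can be simpler than manual slicing. A minimal sketch (the suffix list here is a naive assumption, not from the question):

def looks_plural(word):
    # very rough heuristic: common English plural endings
    return word.endswith(('ies', 'es', 's'))

print(looks_plural('apartments'))  # True
print(looks_plural('apartment'))   # False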

Print a list of unique words from a text file after removing punctuation, and find longest word

Goal is to a) print a list of unique words from a text file and also b) find the longest word.
I cannot use imports in this challenge.
File handling and the main functionality work as I want; however, the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation, and therefore maxLength is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique)
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
    reader.close()

maxLength = len(max(regex, key=len))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate() to throw away all bad characters (+ newline) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
Notes:
- Since the input is already 100% cleaned and split, there's no need to append to a list / insert into a set as we go and then make another cleaning pass later. We can just create unique_words directly (using set() to keep only the uniques), and while we're at it, use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only need to sort once, and there's no iterative append to lists.
- Hence we guarantee that max_length == len(unique_words[0]).
- This approach is also more performant than nested loops (for line in <lines>: for word in line.split(): ... with an iterative append() to a word list).
- There's no need for explicit .open()/.close() on the reader/writer; that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
- You could also merge the printing of the max_length words into the writer loop, but it's cleaner code to keep them separate.
- Note we use f-string formatting, f'{word}\n', to add the newline back when we write() an output line.
- In Python we use lower_case_with_underscores for variable names, hence max_length, not maxLength. See PEP 8.
- In fact here we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
- str.maketrans() is a builtin, but if your teacher objects to the class reference, you can also call it on a bound string, e.g. ' '.maketrans(...).
- str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; a regex on Unicode is easier, since you can use entire character classes (see the sketch below).
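For instance, outside this challenge's no-imports constraint, the usual regex equivalent of the cleaning step might look like this (a sketch, not part of the original answer):

import re

# replace every character that is neither a word character nor whitespace with a space
cleaned_input = re.sub(r'[^\w\s]', ' ', open('doc.txt').read())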
Alternative solution if you don't yet know str.translate()
bad = "[],.\n"
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input

# If you can't use either re.sub() or str.translate(), you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, got an IOError, or unique_words wasn't a list, etc.:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is another solution that uses neither str.translate() nor str.isalpha(), just a plain loop over the characters. Here a is the input text read from the file:
a = open('doc.txt').read()

bad = '`~##$%^&*()-_=+[]{}\\|;\':".>?<,/?'
clean = ' '
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '

cleans = [i for i in clean.split(' ') if len(i)]
clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)
print(clean_uniq)
print(len(clean_uniq[-1]))
Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.
with open("unique.txt", "w") as writer:
with open("doc.txt") as reader:
cleaned_words = []
for line in reader.readlines():
for word in line.split():
cleaned_word = ''.join([c for c in word if c.isalpha()])
if len(cleaned_word):
cleaned_words.append(cleaned_word)
# print unique words
unique_words = set(cleaned_words)
print(unique_words)
# write words to file? depends what you need here
for word in unique_words:
writer.write(str(word))
writer.write('\n')
# print length of longest
print(len(sorted(unique_words, key=len, reverse=True)[0]))
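One caveat with the .isalpha() filter (an observation, not from the original answer): it also drops digits and apostrophes, so 50 disappears entirely and a word like don't becomes dont, which may or may not be what you want.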

Making multiple search and replace more precise in Python for lemmatizer

I am trying to make my own lemmatizer for Spanish in Python 2.7, using a lemmatization dictionary.
I would like to replace all of the words in a certain text with their lemma form. This is the code that I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is an example dictionary file containing the lemmatized forms used to replace the words in the input (my_text_lower). The example dictionary is a tab-separated, two-column file in which column 1 represents the values and column 2 represents the keys to match.
ExampleDictionary
flojo floja
flojo flojas
flojo flojos
cargamento cargamentos
cargante cargantes
decepción decepciones
decepcionante decepcionantes
decentar decenté
decentar decentéis
decentar decentemos
decentar decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase listed in my_text within the code), my actual output currently is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong with the code. It seems that it is replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and replacing that instead.

The garbled result above is what I get when I use the entire dictionary (more than 50,000 entries). This problem does not happen with my small example dictionary, only with the complete one, which makes me think that perhaps it is double-"replacing" at some point?

Is there a pythonic technique that I am missing and can incorporate into this code to make my search and replace more precise, so it identifies full words for replacement rather than chunks and/or does not make any double replacements?
Because you use text.replace, there's a chance that you'll still be matching a sub-string, and the text will get processed again. It's better to process one input word at a time and build the output string word by word.

I've switched your key-value pairs the other way around (because you want to look up the right-hand column and find the word on the left), and I mainly changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
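To see how the re.findall tokenizer behaves, a quick check (a hypothetical snippet: punctuation becomes its own token, so the dictionary lookups see bare words; note that in Python 2, \w in a plain byte string only matches ASCII, so accented words like decenté need unicode strings plus re.UNICODE):

import re
print re.findall(r"[\w']+|[.,!?;]", 'flojo y cargantes. decepcionantes.')
# ['flojo', 'y', 'cargantes', '.', 'decepcionantes', '.']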
I see two problems with your code:
- it will also replace words if they appear as part of a bigger word
- by replacing words one after the other, you could replace (parts of) words that have already been replaced

Instead of that loop, I suggest using re.sub with word boundaries \b to make sure that you replace complete words only. This way, you can also pass a callable as the replacement:
import re

def replace_all(text, dic):
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)
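Used on the sample phrase, this gives (a quick check; in Python 2 you would want unicode strings and the re.UNICODE flag so that \w and \b treat accented characters such as é as word characters):

lemmas = {'cargantes': 'cargante', 'decepcionantes': 'decepcionante'}
print replace_all('flojo y cargantes. decepcionantes.', lemmas)
# flojo y cargante. decepcionante.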

Searching and writing

I need to write a program which looks for words with the same three middle characters (each word is 5 characters long) in a list, then writes them into a file like this:

wasdy
casde
tasdf

gsadk
csade
hsadi

Between the similar words I need to leave an empty line. I am kinda stuck. Is there a way to do this? I use Python 3.2.

Thanks for your help.
I would use the itertools.groupby function for this. Assuming wordlist is a list of the words you want to group, this code does the trick:

import itertools

for k, v in itertools.groupby(wordlist, lambda word: word[1:4]):
    # here, k is the key the words are grouped by, i.e. word[1:4]
    # and v is an iterable of the words in the group
    for word in v:
        print(word)
    print()

itertools.groupby(wordlist, lambda word: word[1:4]) basically takes all the words and groups them by word[1:4], i.e. the three middle characters. Note that groupby only groups consecutive items with equal keys, so sort the list by the same key first if it isn't already sorted. Here's the output of the above code with your sample data:
wasdy
casde
tasdf

gsadk
csade
hsadi
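Since the question asks for the groups to be written to a file rather than printed, here is a minimal sketch combining the sort and the groupby (the file name grouped.txt is an assumption):

import itertools

wordlist = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']
wordlist.sort(key=lambda word: word[1:4])  # groupby needs equal keys to be adjacent

with open('grouped.txt', 'w') as f:
    for k, group in itertools.groupby(wordlist, lambda word: word[1:4]):
        for word in group:
            f.write(word + '\n')
        f.write('\n')  # empty line between groups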
 
To get you started: try using the builtin sorted function on the list of words, and for the key experiment with a slice of each word, x[1:4].
For example:
some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']
sorted(some_list, key=lambda x: sorted(x[1:4]))
# outputs ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi', 'other']

Edit: it was unclear to me whether you wanted "same three middle characters, in order" or just "same three middle characters". If the former, then you should look at sorted(some_list, key=lambda x: x[1:4]) instead.
Try:

from collections import defaultdict

dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
then, to write to an output file:
with open('outfile.txt', 'w') as f:
for key in dict_of_words:
f.write('\n'.join(dict_of_words[key])
f.write('\n' )
word_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']

def test_word(word):
    return all([x in word[1:4] for x in ['a', 's', 'd']])

f = open('yourfile.txt', 'w')
f.write('\n'.join([word for word in word_list if test_word(word)]))
f.close()
returns:
wasdy
casde
tasdf
gsadk
csade
hsadi
