Replace words with word-substitutions from another file - python
The words in my text file (mytext.txt) need to be replaced by other words provided in a second text file (replace.txt).
cat mytext.txt
this is here. and it should be there.
me is this will become you is that.
cat replace.txt
this that
here there
me you
The following code does not work as expected.
with open('mytext.txt', 'r') as myf:
    with open('replace.txt', 'r') as myr:
        for line in myf.readlines():
            for l2 in myr.readlines():
                original, replace = l2.split()
                print line.replace(original, replace)
Expected output:
that is there. and it should be there.
you is that will become you is that.
Edit: I stand corrected; the OP is asking for word-by-word replacement rather than a simple string replace ('become' -> 'become' rather than 'becoyou'). I guess a dict version might look like this, using the regex split method found in the comments on the accepted answer to Splitting a string into words and punctuation:
import re

def clean_split(string_input):
    """
    Split a string into its component tokens and return them as a list.
    Treat spaces and punctuation, including in-word apostrophes, as separate tokens.
    >>> clean_split("it's a good day today!")
    ["it", "'", "s", " ", "a", " ", "good", " ", "day", " ", "today", "!"]
    """
    return re.findall(r"\w+|\W", string_input)

with open('replace.txt', 'r') as myr:
    replacements = dict(tuple(line.split()) for line in myr)

with open('mytext.txt', 'r') as myf:
    for line in myf:
        print ''.join(replacements.get(word, word) for word in clean_split(line)),
I am not competent to reason well about re efficiency; if someone points out glaring inefficiencies I would be most grateful.
Edit 2: OK, I was inserting spaces between words and punctuation; that's now fixed by treating spaces as tokens and doing a ''.join() instead of a ' '.join().
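For readers on Python 3, the same tokenizer-plus-dict idea condenses to the sketch below. The sample line and mappings are inlined here for illustration, rather than read from the files:

```python
import re

def clean_split(s):
    # Same tokenizer as above: a run of word characters, or any
    # single non-word character (spaces and punctuation included).
    return re.findall(r"\w+|\W", s)

replacements = {'this': 'that', 'here': 'there', 'me': 'you'}
line = "me is this will become you is that."

# Replace whole tokens only, leaving unknown tokens untouched.
result = ''.join(replacements.get(tok, tok) for tok in clean_split(line))
print(result)  # -> you is that will become you is that.
```

Because 'become' is a single token, it never matches the key 'me', which is exactly the behavior the edit above is after.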
It looks like you want your inner loop to read the contents of 'replace.txt' for each line of 'mytext.txt'. That's very inefficient, and it won't actually work as written because once you've read all the lines of 'replace.txt' the file pointer is left at the end of the file, so when you try to process the 2nd line of 'mytext.txt' there won't be any lines left to read in 'replace.txt'.
You could send the myr file pointer back to the start of the file using myr.seek(0), but as I said, that's not very efficient. A much better strategy is to read 'replace.txt' into an appropriate data structure, and then use that data to do your replacements on each line of 'mytext.txt'.
A good data structure to use for this would be a dict. Eg,
replacements = {'this': 'that', 'here': 'there', 'me': 'you'}
Can you figure out how to build such a dict from 'replace.txt'?
I see that gman and 7stud have covered the issue of saving the results of your replacements so that they accumulate, so I won't bother discussing that. :)
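In case it helps, one possible sketch of that dict-building step (Python 3 syntax; the sample replace.txt contents are written out first so the snippet is self-contained):

```python
# Recreate the sample replace.txt from the question.
with open('replace.txt', 'w') as f:
    f.write("this that\nhere there\nme you\n")

# Each line holds "original replacement"; line.split() yields that pair,
# and dict() consumes the resulting two-item sequences directly.
with open('replace.txt') as myr:
    replacements = dict(line.split() for line in myr if line.strip())

print(replacements)  # -> {'this': 'that', 'here': 'there', 'me': 'you'}
```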
Here you go, using re.sub:

>>> import re
>>> with open('mytext.txt') as f1, open('replace.txt') as f2:
...     my_text = f1.read()
...     for x in f2:
...         x = x.strip().split()
...         my_text = re.sub(r"\b%s\b" % x[0], x[1], my_text)
...     print my_text
...
that is there. and it should be there.
you is that will become you is that.

\b%s\b defines the word boundaries.
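One caveat worth adding: interpolating a raw word into the pattern can break if that word ever contains regex metacharacters; re.escape guards against that. A small self-contained variant (the sample text and pairs are inlined here instead of read from the files):

```python
import re

my_text = ("this is here. and it should be there.\n"
           "me is this will become you is that.")
pairs = [("this", "that"), ("here", "there"), ("me", "you")]

for old, new in pairs:
    # re.escape makes the word safe to embed in the pattern;
    # \b...\b still restricts the match to whole words.
    my_text = re.sub(r"\b%s\b" % re.escape(old), new, my_text)

print(my_text)
```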
The following will solve your problem. The problem with your code is that you print after every single replacement instead of once per line.
A better solution is:
myr = open("replace.txt")
replacement = dict()
for i in myr.readlines():
    original, replace = i.split()
    replacement[original] = replace

myf = open("mytext.txt")
for i in myf.readlines():
    for j in i.split():
        if j in replacement:
            i = i.replace(j, replacement[j])
    print i
As an alternative, we may use string's Template to achieve this; it works, but is VERY ugly and inefficient:
from string import Template

with open('replace.txt', 'r') as myr:
    # read the replacement pairs first and build a dictionary from them
    d = {k: v for k, v in (line.strip().split(" ") for line in myr)}

d
{'here': 'there', 'me': 'you', 'this': 'that'}

with open('mytext.txt', 'r') as myf:
    for line in myf:
        print Template('$' + ' $'.join(line.strip().replace('$', '_____').\
            split(' '))).safe_substitute(**d).\
            replace('$', '').replace('_____', '')
Results:
that is there. and it should be there.
you is that will become you is that.
You are printing the line after one replacement, then printing the line again after the next replacement. You want to print the line after doing all the replacements.
str.replace(old, new[, count])
Return a copy of the string...
You are discarding the copy every time because you don't save it in a variable. In other words, replace() does not change line.
Next, the word there contains the substring here (which is replaced by there), so the result ends up being tthere.
You can fix those problems like this:
import re

with open('replace.txt', 'r') as f:
    repl_dict = {}
    for line in f:
        key, val = line.split()
        repl_dict[key] = val

with open('mytext.txt', 'r') as f:
    for line in f:
        for key, val in repl_dict.items():
            line = re.sub(r"\b" + key + r"\b", val, line)
        print line.rstrip()
--output:--
that is there. and it should be there.
you is that will become you is that.
Or, like this:
import re

# Create a dict that returns the key itself
# if the key is not found in the dict:
class ReplacementDict(dict):
    def __missing__(self, key):
        self[key] = key
        return key

# Create a replacement dict:
with open('replace.txt') as f:
    repl_dict = ReplacementDict()
    for line in f:
        key, val = line.split()
        repl_dict[key] = val

# Create the necessary inputs for re.sub():
def repl_func(match_obj):
    return repl_dict[match_obj.group(0)]

pattern = r"""
    \w+  # Match a 'word' character, one or more times
"""
regex = re.compile(pattern, flags=re.X)

# Replace the words in each line with the
# entries in the replacement dict:
with open('mytext.txt') as f:
    for line in f:
        line = re.sub(regex, repl_func, line)
        print line.rstrip()
With replace.txt like this:
this that
here there
me you
there dog
...the output is:
that is there. and it should be dog.
you is that will become you is that.
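A possible alternative to the custom __missing__ class: dict.get with the key itself as the default gives the same "unknown word maps to itself" fallback inside the re.sub callback. A sketch in Python 3 syntax, with the sample mappings inlined:

```python
import re

repl_dict = {'this': 'that', 'here': 'there', 'me': 'you'}
line = "this is here. and it should be there."

# repl_dict.get(word, word) leaves unknown words unchanged, which is
# exactly what ReplacementDict.__missing__ does, minus the subclass.
result = re.sub(r"\w+", lambda m: repl_dict.get(m.group(0), m.group(0)), line)
print(result)  # -> that is there. and it should be there.
```

The trade-off is that __missing__ also caches the identity mapping in the dict, which can help if the same unknown words recur many times.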
Related
Get words from first line until first space and without first character in python
I have a textfile where I want to extract the first word, but without the first character, and put it into a list. Is there a way in Python to do this without using regex? A text example of what I have looks like:

#blabla sjhdiod jszncoied

where I want the first word, in this case blabla, without the #. If regex is the only choice, then how would the regex look?
This should do the trick:

l = []
for line in open('file'):
    l.append(line.split()[0][1:])

Edit: If you have empty lines, this will throw an error. You will have to check for empty lines. Here is a possible solution:

l = []
for line in open('file'):
    if line.strip():
        l.append(line.split()[0][1:])
Pythonic way:

my_list = [line.split(' ', 1)[0][1:] for line in open('file') if line.startswith('#')]
a textfile where I want to extract the first word, but without the first character and put it into a list

result = []
with open('file.txt', 'r') as f:
    l = next(f).strip()  # getting the 1st line
    result.append(l[1:l.find(' ')])
print(result)

The output:

['blabla']
Simple enough if your input is so regular:

s = "#blabla sjhdiod jszncoied"
s.split()[0].strip('#')
blabla

split splits on whitespace by default. Take the first token and strip away '#'.
Python: how to write a variable number of lists to a line of a new file
I have an output of which each line contains one list, each list holding the words of a sentence after hyphenation. It looks something like this:

['I']
['am']
['a']
['man.']
['I']
['would']
['like']
['to']
['find']
['a']
['so','lu','tion.']

(let's say it's hyphenated like this, I'm not a native English speaker) etc. Now, what I'd like to do is write this output to a new .txt file, but each sentence (a sentence ends when an item in the list contains a point) has to be written to a new line. I'd like to have the following result written to this .txt file:

I am a man.
I would like to find a so,lu,tion.

etc. The code that precedes all this is the following:

with open('file.txt', 'r') as f:
    for line in f:
        for word in line.split():
            if h_en.syllables(word) != []:
                h_en.syllables(word)
            else:
                print ([word])

The result I want is a file which contains a sentence on each line, where each word of a sentence is represented by its hyphenated version. Any suggestions? Thank you a lot.
Something basic like this seems to answer your need:

def write_sentences(filename, *word_lists):
    with open(filename, "w") as f:
        sentence = []
        for word_list in word_lists:
            word = ",".join(word_list)  ## last edit
            sentence.append(word)
            if word.endswith("."):
                f.write(" ".join(sentence))
                f.write("\n")
                sentence = []

Feed the write_sentences function with the output filename, then each of your word lists as arguments. If you have a list of word lists (e.g. [['I'], ['am'], ...]), you can use * when calling the function to pass everything.
EDIT: changed to make it work with the latest edit of the question (with multiple words in the word lists)
This short regex does what you want when it is compiled in MULTILINE mode:

>>> regex = re.compile("\[([a-zA-Z\s]*\.?)\]$", re.MULTILINE)
>>> a = regex.findall(string)
>>> a
[u'I', u'am', u'a man.', u'I', u'would like', u'to find', u'a solution.']

Now you just manipulate the list until you get your wanted result. An example follows, but there are more ways to do it:

>>> b = ' '.join(a)
>>> b
'I am a real man. I want a solution.'
>>> c = re.sub('\.', '.\n', b)
>>> print(c)
'I am a real man.'
' I want a solution.'
>>> with open("result.txt", "wt") as f: f.write(c)
words = [['I'],['am'],['a'],['man.'],['I'],['would'],['like'],['to'],['find'],['a'],['so','lu','tion.']]

text = "".join("".join(item) + ("\n" if item[-1].endswith(".") else " ")
               for item in words)

with open("out.txt", "wt") as f:
    f.write(text)
Read a multi-line string as one line in Python
I am writing a program that analyzes a large directory text file line by line. In doing so, I am trying to extract different parts of the file and categorize them as 'Name', 'Address', etc. However, due to the format of the file, I am running into a problem. Some of the text I have is split into two lines, such as:

'123 ABCDEF ST
APT 456'

How can I make it so that even with line-by-line analysis, Python returns this as a single-line string in the form of '123 ABCDEF ST APT 456'?
If you want to remove newlines:

"".join(my_string.splitlines())
Assuming you are using Windows, if you print the file to your screen you will see

'123 ABCDEF ST\nAPT 456\n'

The \n represent the line breaks. There are a number of ways to get rid of the newlines in the file. One easy way is to split the string on the newline characters and then rejoin the items of the list that the split creates:

myList = [item for item in myFile.split('\n')]
newString = ' '.join(myList)
To replace the newlines with a space:

address = '123 ABCDEF ST\nAPT 456\n'
address.replace("\n", " ")
import re

def mergeline(c, l):
    if c:
        return c.rstrip() + " " + l
    else:
        return l

def getline(fname):
    qstart = re.compile(r'^\'[^\']*$')
    qend = re.compile(r'.*\'$')
    with open(fname) as f:
        linecache, halfline = ("", False)
        for line in f:
            if not halfline:
                linecache = ""
            linecache = mergeline(linecache, line)
            if halfline:
                halfline = not re.match(qend, line)
            else:
                halfline = re.match(qstart, line)
            if not halfline:
                yield linecache
        if halfline:
            yield linecache

for line in getline('input'):
    print line.rstrip()
Assuming you're iterating through your file with something like this:

with open('myfile.txt') as fh:
    for line in fh:
        # Code here

And also assuming strings in your text file are delimited with single quotes, I would do this:

while not line.endswith("'"):
    line += next(fh)

That's a lot of assuming though.
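A runnable sketch of that pattern, using io.StringIO to stand in for the file (the sample addresses and the single-quote delimiters are assumptions carried over from the answer above):

```python
import io

# Stand-in for open('myfile.txt'): two records, the first wrapped over two lines.
fh = io.StringIO("'123 ABCDEF ST\nAPT 456'\n'789 XYZ RD'\n")

records = []
for line in fh:
    line = line.rstrip('\n')
    # Keep pulling lines until the record's closing quote appears.
    while not line.endswith("'"):
        line += ' ' + next(fh).rstrip('\n')
    records.append(line.strip("'"))

print(records)  # -> ['123 ABCDEF ST APT 456', '789 XYZ RD']
```

Note that calling next(fh) inside the for loop advances the same iterator, so continuation lines are consumed and not revisited.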
I think I might have found an easy solution: just append .replace('\n', " ") to whatever string you want to convert. For example, if you have

my_string = "hi i am an programmer\nand i like to code in python"

and you want to convert it, you can just do

my_string.replace('\n', " ")

Hope it helps.
Splitting lines in python based on some character
Input:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

'!' is the starting character and +0013 should be the ending of each line (if present).

The problem is that the output I actually get looks like:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W

Any help would be highly appreciated!

My code:

file_open = open('sample.txt', 'r')
file_read = file_open.read()
file_open2 = open('output.txt', 'w+')
counter = 0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter = counter - 1
        counter = counter + 1
    file_open2.write(i)
You can try something like this:

with open("abc.txt") as f:
    data = f.read().replace("\r\n", "")  # replace the newlines with ""
    # the newline can be "\n" on your system instead of "\r\n"
    ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty parts
    for x in ans:
        print "!" + x  # or write to some other file

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?

lines = file_read.split('!')

Now lines is a list which holds the split data. This is almost the lines you want to write -- the only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:

file_open2.writelines('!{0}\n'.format(line) for line in lines)

You might need:

file_open2.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)

if you find that you're getting more newlines than you wanted in the output. One other point: when opening files, it's nice to use a context manager -- this makes sure that the file is closed properly:

with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile', 'w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:

In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:

import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()

It will open the output file, get the contents of the input file, and loop through all the matches of the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation and what it matches can be found here: http://regex101.com/r/aK6aV4 After we have a match, we strip out the newlines from the match and write it to the file.
Let's try to add a \n before every "!", then let Python splitlines :-) :

file_read.replace("!", "\n!").splitlines()
I would actually implement this as a generator, so that you can work on the data stream rather than the entire content of the file. This is quite memory-friendly when working with huge files:

>>> def split_on_stream(it, sep="!"):
...     prev = ""
...     for line in it:
...         line = (prev + line.strip()).split(sep)
...         for parts in line[:-1]:
...             yield parts
...         prev = line[-1]
...     yield prev
>>> with open("test.txt") as fin:
...     for parts in split_on_stream(fin):
...         print parts

,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Iterate through words of a file in Python
I need to iterate through the words of a large file, which consists of a single, very long line. I am aware of methods for iterating through the file line by line; however, they are not applicable in my case because of its single-line structure. Any alternatives?
It really depends on your definition of word. But try this:

f = file("your-filename-here").read()
for word in f.split():
    # do something with word
    print word

This will use whitespace characters as word boundaries. Of course, remember to properly open and close the file; this is just a quick example.
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering. First of all, this is a bad format; if you have any kind of control over the file, make it one word per line. If not, use something like:

line = ''
while True:
    word, space, line = line.partition(' ')
    if space:
        # A word was found
        yield word
    else:
        # A word was not found; read a chunk of data from file
        next_chunk = input_file.read(1000)
        if next_chunk:
            # Add the chunk to our line
            line = word + next_chunk
        else:
            # No more data; yield the last word and return
            yield word.rstrip('\n')
            return
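Wrapped into a complete generator, the buffering idea above might look like this sketch (the function name, the chunk_size parameter, and the io.StringIO stand-in for the file are all illustrative choices, not part of the original answer):

```python
import io

def words_from_stream(input_file, chunk_size=1000):
    # Buffered variant of the partition loop: accumulate chunks,
    # emit a word whenever a space shows up in the buffer.
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            yield word
        else:
            next_chunk = input_file.read(chunk_size)
            if next_chunk:
                line = word + next_chunk
            else:
                # End of stream: flush the final word, if any.
                if word.strip():
                    yield word.rstrip('\n')
                return

stream = io.StringIO("one two three four")
print(list(words_from_stream(stream, chunk_size=4)))  # -> ['one', 'two', 'three', 'four']
```

A small chunk_size is used here only to exercise the word-split-across-chunks case; in practice a larger buffer is cheaper.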
You really should consider using a generator:

def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print word
There are more efficient ways of doing this, but syntactically, this might be the shortest:

words = open('myfile').read().split()

If memory is a concern, you aren't going to want to do this, because it will load the entire thing into memory instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer; here is the updated version (copied from a recent answer):

Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace itertools.imap with map

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            itertools.imap(mfile.read,
                itertools.repeat(1))), str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
Read in the line as normal, then split it on whitespace to break it down into words. Something like:

word_list = loaded_string.split()
After reading the line you could do:

l = len(pattern)
i = 0
while True:
    i = str.find(pattern, i)
    if i == -1:
        break
    print str[i:i+l]  # or do whatever
    i += l

Alex.
What Donald Miner suggested looks good. Simple and short. I used the below in some code I wrote a while ago:

l = []
f = open("filename.txt", "rU")
for line in f:
    for word in line.split():
        l.append(word)

A longer version of what Donald Miner suggested.