Compare 2 files in Python and print non-matching text

I have two files:
Resp.txt:
vrf XXX
address-family ipv4 unicast
import route-target
123:45
212:43
!
export route-policy ABCDE
export route-target
9:43
!
maximum prefix 12 34
spanning tree enable
bandwidth 10
!
!
and sample.txt
vrf
address-family ipv4 unicast
import route-target
export route-target
maximum prefix
I want to compare Resp.txt against sample.txt and get the lines of Resp.txt that have no counterpart in sample.txt. The output should be like:
spanning tree enable
bandwidth 10
I am using:
t2 = open('sample.txt', 'r')
abc = open('resp.txt', 'r')
for x in t2:
    for line in abc:
        if x.strip() in line.strip():
            print 'yes'
        else:
            print line
But it's matching every line in both the text files and hence, not showing the correct result.

So the simplest solution to get all the strings not in sample.txt is to use set difference:
file_1 = set()
file_2 = set()
with open('Resp.txt', 'r') as f:
    for line in f:
        file_1.add(line.strip())
with open('Sample.txt', 'r') as f:
    for line in f:
        file_2.add(line.strip())
print(file_1 - file_2)
Which returns:
{'export route-policy ABCDE', 'vrf XXX', 'spanning tree enable', '!', '212:43', 'bandwidth 10', 'maximum prefix 12 34', '9:43', '123:45'}
However, this doesn't include certain rules applied to Resp.txt, for example:
If line is "maximum prefix" ignore the numbers.
These rules can be applied while reading Resp.txt:
import re
file_1 = set()
file_2 = set()
with open('Resp.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line == "!":
            continue
        elif re.match(r'\d+:\d+', line):  # Matches route-target values such as 123:45.
            continue
        elif line.startswith("vrf"):
            line = "vrf"
        elif line.startswith("maximum prefix"):
            line = "maximum prefix"
        file_1.add(line)
with open('Sample.txt', 'r') as f:
    for line in f:
        file_2.add(line.strip())
print(file_1 - file_2)
Which returns:
{'export route-policy ABCDE', 'bandwidth 10', 'spanning tree enable'}
Which is correct because sample.txt does not contain route-policy.
These rules could be made more robust, but they should be illustrative enough.
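One way to make such rules easier to extend is to keep them in a table instead of an if/elif chain. The following is only a sketch: the `RULES` list and the `normalize` helper are hypothetical names, and the patterns are assumptions based on the sample Resp.txt shown in the question.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement).
# A replacement of None means "drop the line entirely".
RULES = [
    (re.compile(r"^!$"), None),                                # block terminators
    (re.compile(r"^\d+:\d+$"), None),                          # route-target values such as 123:45
    (re.compile(r"^vrf\b.*$"), "vrf"),                         # ignore the VRF name
    (re.compile(r"^maximum prefix\b.*$"), "maximum prefix"),   # ignore the numbers
]

def normalize(line):
    """Return the normalized line, or None if the line should be skipped."""
    line = line.strip()
    for pattern, replacement in RULES:
        if pattern.match(line):
            return replacement
    return line

resp_lines = ["vrf XXX", "123:45", "!", "maximum prefix 12 34", "bandwidth 10"]
normalized = [n for n in map(normalize, resp_lines) if n is not None]
print(normalized)  # ['vrf', 'maximum prefix', 'bandwidth 10']
```

Adding a new rule then only means appending one tuple to the table.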
Keep in mind that a set only finds unique differences, not all of them (say you have multiple 'spanning tree enable' lines and would like to know how many times they are seen). In that case, you could do something more in line with your original code:
import re
file_1 = []
file_2 = []
with open('Resp.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line == "!":
            continue
        elif re.match(r'\d+:\d+', line):
            continue
        elif line.startswith("vrf"):
            line = "vrf"
        elif line.startswith("maximum prefix"):
            line = "maximum prefix"
        file_1.append(line)
with open('Sample.txt', 'r') as f:
    for line in f:
        file_2.append(line.strip())
diff = []
for line in file_1:
    if line not in file_2:
        diff.append(line)
print(diff)
Result:
['export route-policy ABCDE', 'spanning tree enable', 'bandwidth 10']
While this method is slower (although you probably won't notice), it can find duplicate lines and maintains the order of the lines found.
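If the goal is to know how many times each extra line occurs, the list-based approach can be combined with `collections.Counter`. This is a sketch with made-up in-memory data (the `resp` and `sample` lists are assumptions standing in for the already normalized file contents); turning the sample lines into a set also avoids the repeated list scans.

```python
from collections import Counter

# Assumed, already normalized contents of Resp.txt and sample.txt
resp = ["vrf", "spanning tree enable", "bandwidth 10", "spanning tree enable"]
sample = ["vrf", "address-family ipv4 unicast"]

sample_set = set(sample)  # membership tests on a set are fast
extra = [line for line in resp if line not in sample_set]  # keeps order and duplicates
counts = Counter(extra)

for line, n in counts.items():
    print(n, line)
```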

Related

Lines missing in python

I am writing Python code that removes all the text after a specific word, but lines are missing from the output. I have a Unicode text file which has 3 lines:
my name is test1
my name is
my name is test 2
What I want is to remove text after word "test" so I could get the output as below
my name is test
my name is
my name is test
I have written code that does the task, but it also removes the second line "my name is".
My code is below:
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index > 0:
            txt += line[:index + len(splitStr)] + "\n"
with open(r"test.txt", "w") as fp:
    fp.write(txt)
When the keyword is not found, str.find returns -1, so index becomes -1 and your if skips every line without the keyword.
I would modify your if by adding a branch for that case (and use >= 0 so a keyword at the very start of a line is handled too):
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index >= 0:
            txt += line[:index + len(splitStr)] + "\n"
        elif index < 0:
            txt += line
with open(r"test.txt", "w") as fp:
    fp.write(txt)
No need to add \n because the line already contains it.
Your code does not append the line if splitStr is not found in it.
txt = ""
with open(r"test.txt", 'r') as fp:
    for line in fp.readlines():
        splitStr = "test"
        index = line.find(splitStr)
        if index != -1:
            txt += line[:index + len(splitStr)] + "\n"
        else:
            txt += line
with open(r"test.txt", "w") as fp:
    fp.write(txt)
In my solution I simulate the input file via io.StringIO. Compared to your code, my solution removes the else branch and uses only one += operator. Also, splitStr is set only once and not on each iteration. This makes the code clearer and reduces possible sources of error.
import io
# simulates a file for this example
the_file = io.StringIO("""my name is test1
my name is
my name is test 2""")
txt = ""
splitStr = "test"
with the_file as fp:
    # each line
    for line in fp.readlines():
        # cut something?
        if splitStr in line:
            # find index
            index = line.find(splitStr)
            # cut after 'splitStr' and add newline
            line = line[:index + len(splitStr)] + "\n"
        # append line to output
        txt += line
print(txt)
When working with files in Python 3, pathlib is a convenient option, for example:
import pathlib
file_path = pathlib.Path("test.txt")
# read from the file
with file_path.open('r') as fp:
    pass  # do something
# write back to the file
with file_path.open('w') as fp:
    pass  # do something
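For small files, pathlib also offers `read_text()` and `write_text()`, which open and close the file for you. A minimal sketch, assuming a temporary directory and a hypothetical test.txt created only for this example:

```python
import pathlib
import tempfile

keyword = "test"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file created only for this example
    file_path = pathlib.Path(tmp_dir) / "test.txt"
    file_path.write_text("my name is test1\nmy name is\nmy name is test 2\n", encoding="utf-8")

    # Read the whole file, shorten each line that contains the keyword, write it back
    lines = file_path.read_text(encoding="utf-8").splitlines()
    shortened = [
        line[:line.find(keyword) + len(keyword)] if keyword in line else line
        for line in lines
    ]
    file_path.write_text("\n".join(shortened) + "\n", encoding="utf-8")
    result = file_path.read_text(encoding="utf-8")

print(result)
```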
Suggestion:
for line in fp.readlines():
    i = line.find('test')
    if i != -1:
        # keep everything up to and including 'test'
        line = line[:i + len('test')] + '\n'
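An alternative that avoids index arithmetic is `str.partition`, which splits a string around the first occurrence of a separator. This is a sketch only; `cut_after` is a hypothetical helper name and the sample lines are taken from the question.

```python
def cut_after(line, keyword):
    """Keep everything up to and including keyword; leave other lines unchanged."""
    head, found, _tail = line.partition(keyword)
    if not found:
        # partition returns an empty separator when keyword is absent
        return line
    return head + found

lines = ["my name is test1", "my name is", "my name is test 2"]
result = [cut_after(line, "test") for line in lines]
print("\n".join(result))
```

Because lines without the keyword are returned unchanged, no line is lost.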

Appending two text lines if the second line starts with a particular word

Consider a .txt file with the following content:
Pinus ponderosa P. & C. Lawson
var. scopulorum Engelm.
[5,800] - [7,800] 9,200 ft. [May] - [Jun]. Needleleaf
evergreen tree, mesophanerophyte; nanophyll, sclerophyll.
I would like to append any line starting with var. to the previous line.
Here's my code:
with open('myfile.txt', 'r') as f:
    txt = ''
    for line in f:
        line = line.replace('\n', '')
        if next(f)[:4] == 'var.':
            txt = '{}\n{} {}'.format(txt, line, next(f))
This throws the following error:
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
StopIteration
The expected output is:
Pinus ponderosa P. & C. Lawson var. scopulorum Engelm.
[5,800] - [7,800] 9,200 ft. [May] - [Jun]. Needleleaf
evergreen tree, mesophanerophyte; nanophyll, sclerophyll.
You can do it in one pass instead of iterating over the lines. This also works if you want to edit the file in place:
with open('myfile.txt', 'r') as f:
    txt = f.read()
txt = txt.replace('\nvar.', ' var.')
with open('myfile.txt', 'w') as f:
    f.write(txt)
This is one approach.
Ex:
with open(filename, 'r') as f:
    txt = ''
    for line in f:
        line = line.strip()
        if line.startswith('var.'):  # Use str.startswith
            txt += " " + line
        else:
            txt += "\n" + line
print(txt.strip())
Output:
Pinus ponderosa P. & C. Lawson var. scopulorum Engelm.
[5,800] - [7,800] 9,200 ft. [May] - [Jun]. Needleleaf
evergreen tree, mesophanerophyte; nanophyll, sclerophyll.
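The StopIteration in the question comes from calling next(f) inside the for loop: each call consumes a line from the same iterator, and at the end of the file there is nothing left to return. A sketch that avoids look-ahead entirely by appending to the previously stored line (the helper name `join_continuations` and its `prefix` parameter are hypothetical):

```python
def join_continuations(lines, prefix="var."):
    """Append every line starting with prefix to the line before it."""
    merged = []
    for line in lines:
        line = line.strip()
        if line.startswith(prefix) and merged:
            merged[-1] += " " + line  # glue onto the previous line
        else:
            merged.append(line)
    return merged

lines = [
    "Pinus ponderosa P. & C. Lawson\n",
    "var. scopulorum Engelm.\n",
    "[5,800] - [7,800] 9,200 ft. [May] - [Jun]. Needleleaf\n",
    "evergreen tree, mesophanerophyte; nanophyll, sclerophyll.\n",
]
merged = join_continuations(lines)
print("\n".join(merged))
```

Since a file object yields lines when iterated, the same function could be called as `join_continuations(f)` inside a `with open(...)` block.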

How do I match two plain text files line by line using Python

I need to match two text files line by line in Python on Windows. For example, I have the following text files:
File1:
My name is xxx
command completed successfully.
My mother's name is yyy
My mobile number is 12345
the heavy lorry crashed into the building at midnight
lorry eat in the faculty a red apple
File2:
My name is xxx
command . successfully.
The name of my mother is
what a heavy lorry it is that crashed into the building
lorry eat an apple in the faculty
I apologize for not being clear enough. My problem is how to align a movie script with its subtitles. I wrote the following code in Python, but it is not enough to get the alignment between the two text files:
# Open file for reading in text mode (default mode)
f1 = open('F:/CONTRIBUTION 2017/SCRIPT-SUBTITLES CODES/Script Alignement Papers/f1.txt','r')
f2 = open('F:/CONTRIBUTION 2017/SCRIPT-SUBTITLES CODES/Script Alignement Papers/f2.txt','r')
#Print confirmation
# print("-----------------------------------")
#print("Comparing files ", " > " + fname1, " < " +fname2, sep='\n')
# print("-----------------------------------")
# Read the first line from the files
f1_line = f1.readline()
f2_line = f2.readline()
# Initialize counter for line number
line_no = 1
# Loop if either file1 or file2 has not reached EOF
while f1_line != '' or f2_line != '':
    # Strip trailing whitespace
    f1_line = f1_line.rstrip()
    f2_line = f2_line.rstrip()
    # Compare the lines from both files
    if f1_line != f2_line:
        # If a line does not exist on file2 then mark the output with + sign
        if f2_line == '' and f1_line != '':
            print("=================================================================")
            print("=================================================================")
            print("line does not exist on File 2 ====================")
            print("=================================================================")
            print(">+", "Line-%d" % line_no, f1_line)
        # otherwise output the line on file1 and mark it with > sign
        elif f1_line != '':
            print("=================================================================")
            print("=================================================================")
            print("otherwise output the line on file1 ====================")
            print("=================================================================")
            print(">", "Line-%d" % line_no, f1_line)
        # If a line does not exist on file1 then mark the output with + sign
        if f1_line == '' and f2_line != '':
            print("=================================================================")
            print("=================================================================")
            print("=line does not exist on File 1 ====================")
            print("=================================================================")
            print("<+", "Line-%d" % line_no, f2_line)
        # otherwise output the line on file2 and mark it with < sign
        elif f2_line != '':
            print("=================================================================")
            print("=================================================================")
            print("otherwise output the line on file2 ====================")
            print("=================================================================")
            print("<", "Line-%d" % line_no, f2_line)
        # Print a blank line
        print()
    # Read the next line from the file
    f1_line = f1.readline()
    f2_line = f2.readline()
    # Increment line counter
    line_no += 1
# Close the files
f1.close()
f2.close()
If anyone can help with this matching, I would be very grateful.
It would be good to post the code you tried writing. This feels like we are doing your homework and makes you look lazy. That being said, take a look at the following:
with open(file1, 'r') as f1, open(file2, 'r') as f2:
    if f1.readlines() == f2.readlines():
        print('Files {} & {} are identical!'.format(file1, file2))
PS: This checks whether the files are identical. If you want something like a logical comparison you have to do some research first.
One possible way is to store the lines of the file in a list and then compare the lists.
lines_of_file1 = []
file = open("file1.txt", "r")
line = file.readline()
while line != '':
    lines_of_file1.append(line)
    line = file.readline()
file.close()
lines_of_file2 = []
file = open("file2.txt", "r")
line = file.readline()
while line != '':
    lines_of_file2.append(line)
    line = file.readline()
file.close()
# The files match only if they have the same number of lines
# and every pair of corresponding lines is equal
same = len(lines_of_file1) == len(lines_of_file2)
if same:
    for line1, line2 in zip(lines_of_file1, lines_of_file2):
        if line1 != line2:
            same = False
            break
if same:
    print("Files are same")
else:
    print("Files are not same")
Hope that helps.
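For the script and subtitle alignment described in the question, exact equality is too strict because the two texts are worded differently. One possible starting point is a similarity score from the standard library's `difflib.SequenceMatcher`. This is only a sketch: `best_match` is a hypothetical helper, the sample lines are copied from the question, and picking the highest ratio is a naive strategy, not a full alignment algorithm.

```python
import difflib

script = [
    "My name is xxx",
    "command completed successfully.",
    "the heavy lorry crashed into the building at midnight",
]
subtitles = [
    "My name is xxx",
    "command . successfully.",
    "what a heavy lorry it is that crashed into the building",
    "lorry eat an apple in the faculty",
]

def best_match(line, candidates):
    """Return (candidate, ratio) for the candidate most similar to line."""
    scored = [
        (difflib.SequenceMatcher(None, line.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    ratio, candidate = max(scored)
    return candidate, ratio

for line in script:
    candidate, ratio = best_match(line, subtitles)
    print("%.2f  %r -> %r" % (ratio, line, candidate))
```

A ratio of 1.0 means the two lines are identical (ignoring case here); in practice you would probably also reject matches below some threshold chosen for your data.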

Python: Add a new line after the first word in a sentence if the first word is all caps

I'm trying to modify a txt file. The file is a movie script in the format:
BEN We’ve discussed this before.
LUKE I can be a Jedi. I’m ready.
I'd like to insert a new line after the character name:
BEN
We’ve discussed this before.
LUKE
I can be a Jedi. I’m ready.
How do I do this in python? I currently have:
def modify_file(file_name):
    fh=fileinput.input(file_name,inplace=True)
    for line in fh:
        split_line = line.split()
        if(len(split_line)>0):
            first_word = split_line[0]
            replacement = first_word+'\n'
            first_word=first_word.replace(first_word,replacement)
            sys.stdout.write(first_word)
            fh.close()
As suggested in one of the comments, this can be done using split and isupper. An example is provided below:
source_path = 'source_path.txt'
f = open(source_path)
lines = f.readlines()
f.close()
temp = ''
for line in lines:
    words = line.split(' ')
    if words[0].isupper():
        temp += words[0] + '\n' + ' '.join(words[1:])
    else:
        temp += line
f = open(source_path, 'w')
f.write(temp)
f.close()
There are multiple problems with your code: sys is never imported, the rest of the line is thrown away, and the file is closed inside the loop.
import fileinput
import sys

def modify_file(file_name):
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        split_line = line.split()
        if len(split_line) > 0:
            # first word, a newline, then the rest of the line
            x = split_line[0] + "\n" + " ".join(split_line[1:]) + "\n"
            sys.stdout.write(x)
    fh.close()  # this cannot be inside the if block; it has to come after the for loop
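The same transformation can also be sketched with a regular expression. The pattern below is an assumption about the script format: a character name is taken to be two or more capital ASCII letters at the start of a line, so that a dialogue line beginning with "I " is left alone. `SPEAKER` and `split_speaker` are hypothetical names.

```python
import re

# Two or more capital letters at the start of the line, followed by whitespace
SPEAKER = re.compile(r"^([A-Z]{2,})\s+")

def split_speaker(line):
    """Put the leading all-caps name on its own line; leave other lines unchanged."""
    return SPEAKER.sub(r"\1\n", line, count=1)

script = "BEN We’ve discussed this before.\nLUKE I can be a Jedi. I’m ready.\n"
result = "".join(split_speaker(line) for line in script.splitlines(keepends=True))
print(result)
```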

Read Lines, Process to List, and Write to File in PYTHON

I am very new in Python and I am processing the tweets below:
#PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!!
#Msdebramaye I heard about that contest! Congrats girl!!
UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3
Do you Share More #jokes #quotes #music #photos or #news #articles on #Facebook or #Twitter?
Good night #Twitter and #TheLegionoftheFallen. 5:45am cimes awfully early!
I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount
Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh
no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
Just had some bloodwork done. My arm hurts
The expected output is a feature list like the following:
featureList = ['hey', 'cici', 'luv', 'mixtape', 'drop', 'soon', 'fantasy', 'ride', 'heard',
'congrats', 'ncaa', 'franklin', 'wild', 'share', 'jokes', 'quotes', 'music', 'photos', 'news',
'articles', 'facebook', 'twitter', 'night', 'twitter', 'thelegionofthefallen', 'cimes', 'awfully',
'finished', 'mi', 'run', 'pace', 'gps', 'nikeplus', 'makeitcount', 'disappointing', 'day', 'attended',
'car', 'boot', 'sale', 'raise', 'funds', 'sanctuary', 'total', 'entry', 'fee', 'sigh', 'taking',
'irish', 'car', 'bombs', 'strange', 'australian', 'women', 'drink', 'head', 'hurts', 'bloodwork',
'arm', 'hurts']
However, the only output I currently get is
hey, cici, luv, mixtape, drop, soon, fantasy, ride
which comes from the first tweet only. The program keeps looping on that one tweet without moving on to the next line. I tried to use nextLine, but apparently that does not exist in Python. My code is as follows:
#import regex
import re
import csv
import pprint
import nltk.classify

#start replaceTwoOrMore
def replaceTwoOrMore(s):
    #look for 2 or more repetitions of character
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)
#end

#start process_tweet
def processTweet(tweet):
    # process the tweets
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert #username to AT_USER
    tweet = re.sub('#[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end

#start getStopWordList
def getStopWordList(stopWordListFileName):
    #read the stopwords
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('URL')
    fp = open(stopWordListFileName, 'r')
    line = fp.readline()
    while line:
        word = line.strip()
        stopWords.append(word)
        line = fp.readline()
    fp.close()
    return stopWords
#end

#start getfeatureVector
def getFeatureVector(tweet):
    featureVector = []
    #split tweet into words
    words = tweet.split()
    for w in words:
        #replace two or more with two occurrences
        w = replaceTwoOrMore(w)
        #strip punctuation
        w = w.strip('\'"?,.')
        #check if the word starts with a letter
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        #ignore if it is a stop word
        if(w in stopWords or val is None):
            continue
        else:
            featureVector.append(w.lower())
    return featureVector
#end

#Read the tweets one by one and process it
fp = open('data/sampleTweets.txt', 'r')
line = fp.readline()
st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
while line:
    processedTweet = processTweet(line)
    featureVector = getFeatureVector(processedTweet)
    with open('data/niek_corpus_feature_vector.txt', 'w') as f:
        f.write(', '.join(featureVector))
#end loop
fp.close()
UPDATE:
After trying to change the loop as suggested below:
st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        with open('data/niek_corpus_feature_vector.txt', 'w') as f:
            f.write(', '.join(featureVector))
fp.close()
I got the following output, which is only the words from the last line of the tweets.
bloodwork, arm, hurts
I am still trying to figure it out.
If you do not want to use readlines(), iterate over the file object directly, and open the output file in append mode so that each tweet's features are added rather than overwriting the previous ones:
stopWords = getStopWordList('data/feature_list/stopwords.txt')
with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        # 'a' appends to the file instead of truncating it on every iteration
        with open('data/niek_corpus_feature_vector.txt', 'a') as f:
            f.write(', '.join(featureVector))
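The "only the last tweet" result in the update comes from reopening the output file with mode 'w' on every iteration, which truncates it each time. Another way around this is to collect the features from all tweets first and write once at the end. The sketch below uses io.StringIO objects as stand-ins for the input and output files, and `extract_words` is a hypothetical, much simplified replacement for processTweet plus getFeatureVector.

```python
import io

def extract_words(tweet):
    """Hypothetical stand-in for processTweet + getFeatureVector."""
    words = (w.strip("!?.,").lower() for w in tweet.split())
    return [w for w in words if w.isalpha()]

# Stand-ins for data/sampleTweets.txt and the output file
tweets = io.StringIO("Just had some bloodwork done.\nMy arm hurts\n")
output = io.StringIO()

feature_list = []
for line in tweets:                       # iterating a file yields one line at a time
    feature_list.extend(extract_words(line))

output.write(", ".join(feature_list))     # written once, after all tweets are processed
print(output.getvalue())
```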
line = fp.readline()
only reads a single line from the file. Because line is never updated inside the while loop, the loop keeps processing that same line. You need to read every line in the file and then process each one as you already do:
lines = fp.readlines()
# Now process each line
for line in lines:
    # Process the line as you do in your original code
    processedTweet = processTweet(line)
Python File readlines() Method
The method readlines() reads until EOF using readline() and returns a list containing the lines. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
Following is the syntax for readlines() method:
fileObject.readlines( sizehint )
Parameters
sizehint -- This is the number of bytes to be read from the file.
Return Value
This method returns a list containing the lines.
Example
The following example shows the usage of readlines() method.
#!/usr/bin/python
# Open a file
fo = open("foo.txt", "r")
print("Name of the file:", fo.name)
# Assuming file has following 5 lines
# This is 1st line
# This is 2nd line
# This is 3rd line
# This is 4th line
# This is 5th line
line = fo.readlines()
print("Read Line: %s" % (line))
line = fo.readlines(2)
print("Read Line: %s" % (line))
# Close opened file
fo.close()
Running the above program produces the following result:
Name of the file: foo.txt
Read Line: ['This is 1st line\n', 'This is 2nd line\n', 'This is 3rd line\n', 'This is 4th line\n', 'This is 5th line\n']
Read Line: []
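The empty list on the second call is expected: the first readlines() already consumed the file up to EOF. A small sketch of that behaviour, using io.StringIO as a stand-in for foo.txt (an assumption so the example runs without a file on disk):

```python
import io

# Stand-in for foo.txt
fo = io.StringIO("This is 1st line\nThis is 2nd line\nThis is 3rd line\n")

first = fo.readlines()     # reads everything up to EOF
second = fo.readlines()    # nothing left to read, so this is an empty list

fo.seek(0)                 # rewind to the start
hinted = fo.readlines(2)   # stops once the lines read so far exceed the hint

print(first)
print(second)
print(hinted)
```

With the hint, whole lines are still returned; reading just stops early once the total size passes the given value.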
