Compare two text files in ruby

I have two text files, file1.txt and file2.txt. I want to find the differences between the files, marking which text is equal, inserted, or deleted. The final goal is to create an HTML file in which the equal, inserted, and deleted text is highlighted with different colors and styles.
file1.txt
I am testing this ruby code for printing the file diff.
file2.txt
I am testing this code for printing the file diff.
I am using this code
require 'diffy'
doc1 = File.read('file1.txt')
doc2 = File.read('file2.txt')
final_doc = Diffy::Diff.new(doc1, doc2).each_chunk.to_a
The output is :
-I am testing this ruby code for printing the file diff.
+I am testing this code for printing the file diff.
However, I need the output in a format similar to the one below.
equal:
I am testing this
insertion:
ruby
equal:
code for printing the file diff.
Python has difflib, which can do this, but I have not found equivalent functionality in Ruby.
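For reference, the difflib behaviour mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming a word-level comparison (splitting on whitespace); the function name word_chunks and the chunk labels are made up for this example. Note that difflib reports the word "ruby" as a deletion here, because it is present in the first text and absent from the second.

```python
import difflib

def word_chunks(old, new):
    """Return (label, text) chunks describing how to turn old into new."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            chunks.append(("equal", " ".join(a[i1:i2])))
        if tag in ("delete", "replace"):
            # words present only in the old text
            chunks.append(("deletion", " ".join(a[i1:i2])))
        if tag in ("insert", "replace"):
            # words present only in the new text
            chunks.append(("insertion", " ".join(b[j1:j2])))
    return chunks

old = "I am testing this ruby code for printing the file diff."
new = "I am testing this code for printing the file diff."
for label, text in word_chunks(old, new):
    print(label + ":")
    print(text)
```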

I've found a few different libraries in Ruby for doing diffs, but they focus on comparing line by line. I wrote some code to compare a couple of relatively short strings and show the differences. It is a quick hack: it works well as long as you don't need removed sections highlighted at the position they were removed from (they are appended at the end instead); doing that would require a bit more thought about the algorithm. For a small amount of text at a time, though, this code works well.
The key is, like with any language processing, getting your tokenization right. You can't just process a string word by word. Really the best way would be to first loop through, recursively, and associate each token with a position in the text and use that to do the analysis, but this method below is fast and easy.
def self.change_differences(text1, text2) # oldtext, newtext
  remaining = text1.dup # Work on a copy so the caller's string is not mutated.
  result = ""
  tokens = text2.split(/(?<=[?.!,])/) # Positive lookbehind: split after punctuation.
  tokens.each do |token|
    if remaining.sub!(token, "") # Yes, the old text contained it.
      result += "<span class='diffsame'>" + token + "</span>"
    else
      result += "<span class='diffadd'>" + token + "</span>"
    end
  end
  # Whatever is left of the old text was removed.
  remaining.split(/(?<=[?.!,])/).each do |token|
    result += "<span class='diffremove'>" + token + "</span>"
  end
  result
end
Source: me!
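If Python is an option, a similar HTML rendering can be sketched with the standard difflib module. This is only an illustration under stated assumptions: it compares word by word (not by punctuation-delimited token as the Ruby code does), the function names html_word_diff and span are invented for the example, and the CSS class names are borrowed from the Ruby snippet. Unlike the quick hack above, removed words appear at the position they were removed from.

```python
import difflib
from html import escape

def span(css_class, words):
    """Wrap words in a span, escaping any HTML special characters."""
    return "<span class='%s'>%s</span>" % (css_class, escape(" ".join(words)))

def html_word_diff(old, new):
    """Render a word-level diff of old -> new as a string of HTML spans."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            spans.append(span("diffsame", a[i1:i2]))
        if tag in ("delete", "replace"):
            spans.append(span("diffremove", a[i1:i2]))
        if tag in ("insert", "replace"):
            spans.append(span("diffadd", b[j1:j2]))
    return " ".join(spans)

print(html_word_diff("I am testing this ruby code", "I am testing this code"))
```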

Related

Pyparsing for Paragraphs

I have run into a slight problem with pyparsing that I can't seem to solve. I'd like to write a rule that will parse a multiline paragraph for me. The end goal is to end up with a recursive grammar that will parse something like:
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
into something like this HTML (of course, once I have a parse tree, I can transform it into whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted and then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
Progress
I have managed to get to the stage where I can parse the heading row, and an indented block using pyparsing. But I can't:
Define a paragraph as multiple lines that should be joined
Allow a paragraph to be indented
An Example
Following from here, I can get the paragraphs to output to a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe a paragraph should be:
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But this doesn't seem to work for me. Any ideas would be awesome :)

So I managed to solve this, for anybody who stumbles upon this in the future. You can define the paragraph like this, although it is certainly not ideal and doesn't exactly match the grammar that I described. The relevant code is:
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)
Where join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
That should point you in the right direction if this matches your needs :) I hope that helps!
A Better Empty Line
The definition of empty line given above is definitely not ideal, and it can be improved dramatically. The best way I've found is the following:
empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows you to have empty lines that are filled with spaces, without breaking the match.
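To make the intended behaviour concrete (lines joined into one paragraph, with blank or space-only lines acting as separators), here is a minimal sketch in plain Python. It deliberately does not use pyparsing and ignores indentation levels; the function name join_paragraphs is made up for this illustration.

```python
import re

def join_paragraphs(text):
    """Split text on blank (or whitespace-only) lines and join each paragraph's lines."""
    paragraphs = []
    for chunk in re.split(r"\n[ \t]*\n", text.strip("\n")):
        # strip each line, as join_lines does, and drop leftover empty lines
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs

sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "   \n"
    "but this is also a different line\n"
    "with more lines attached\n"
)
for paragraph in join_paragraphs(sample):
    print(paragraph)
```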

Caveat emptor: I can spell p-y-t-h-o-n and that's pretty much all there is to my knowledge. I tried to take some online classes, but after about 20 lectures without learning much, I gave up a long time ago. So what I am going to ask is very simple, but I need help:
I have a file with the following structure:
object_name_here:
  object_owner:
    - me#my.email.com
    - user#another.email.com
  object_id: some_string_here
  identification: some_other_string_here
This block repeats hundreds of times in the same file.
Apart from object_name_here, which is unique and required, every other line may or may not be present, and there can be anywhere from zero to 10+ email addresses.
What I want to do is export this information into a flat file, along the lines of /etc/passwd, with a twist.
For instance, I want the block above to yield a line like this:
object_name_here:object_owner=me#my_email.com,user#another.email.com:objectid=some_string_here:identification=some_other_string_here
Again, neither the number of fields nor the length of their contents is fixed. I am sure this is a pretty easy task in Python, but I don't know how to do it or even where to start.
Final Edit: Okay, I am able to write a shell script (bash, ksh etc.) to parse the information, but, when I asked this question originally, I was under the impression that, python had a simpler way of handling uniform or semi-uniform data structures as this one. My understanding was proven to be not very accurate. Sorry for wasting your time.

As jaypb points out, regular expressions are a good idea here. If you're interested in some python 101, I'll give you some simple code to get you started on your own solution.
The following code is a quick and dirty way to lump every six lines of a file into one line of a new file:
# open some files to read and write
oldfile = open("oldfilename", "r")
newfile = open("newfilename", "w")
# initiate variables and iterate over the input file
count = 0
outputLine = ""
for line in oldfile:
    # we're going to append lines in the file to the variable outputLine
    # iterating over a file yields one line of the file at a time as a string
    # str.strip() will remove whitespace at the beginning and end of a string
    outputLine = outputLine + line.strip()
    # increment the counter
    count = count + 1
    # you know your interesting stuff is six lines long, so
    # write the output string to file and reset it every six lines
    if count % 6 == 0:
        newfile.write(outputLine + "\n")
        outputLine = ""
# clean up
oldfile.close()
newfile.close()
This isn't exactly what you want to do but it gets you close. For instance, if you want to get rid of " - " from the beginning of the email addresses and replace it with "=", instead of just appending to outputLine you'd do something like
if some condition:
    outputLine = outputLine + '=' + line[3:]
That last bit is a Python slice: [3:] means "give me everything from index 3 onward" (that is, skip the first three characters), and it works for things like strings or lists.
That'll get you started. Use google and the python docs (for instance, googling "python strip" takes you to the built-in types page for python 2.7.10) to understand every line above, then change things around to get what you need.
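Because the blocks in the question do not have a fixed length, counting six lines will not be enough in general. One possible way to handle that is sketched below. It rests on assumptions that go beyond the question: block names start in the first column, everything else is indented, and list items ("- value") always follow a "key:" line. The function name flatten_blocks is invented for this example.

```python
def flatten_blocks(text):
    """Turn indented 'name: / key: value / - item' blocks into one line per block."""
    blocks = []  # each block is [name, [key, [values...]], [key, [values...]], ...]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw[0].isspace():
            # an unindented line starts a new block, e.g. "object_name_here:"
            blocks.append([line.rstrip(":")])
        elif line.startswith("- "):
            # a list item belongs to the most recently seen key
            blocks[-1][-1][1].append(line[2:].strip())
        else:
            # "key: value" or a bare "key:" that is followed by list items
            key, _, value = line.partition(":")
            values = [value.strip()] if value.strip() else []
            blocks[-1].append([key.strip(), values])
    lines = []
    for name, *fields in blocks:
        parts = [name] + [key + "=" + ",".join(values) for key, values in fields]
        lines.append(":".join(parts))
    return lines

sample = """object_name_here:
  object_owner:
    - me#my.email.com
    - user#another.email.com
  object_id: some_string_here
  identification: some_other_string_here
"""
for flat in flatten_blocks(sample):
    print(flat)
```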

Since you are replacing text substrings with different text substrings, this is a pretty natural place to use regular expressions.
Python, fortunately, has an excellent regular expressions library called re.
You will probably want to heavily utilize
re.sub(pattern, repl, string)
Look at the documentation here:
https://docs.python.org/3/library/re.html
Update: Here's an example of how to use the regular expression library:
#!/usr/bin/env python
import re
with open("sample.txt") as f:
    body = f.read()
# Replace emails followed by other emails
body = re.sub(" * - ([a-zA-Z.#]*)\n * -", r"\1,", body)
# Replace declarations of object properties
body = re.sub(" +([a-zA-Z_]*): *[\n]*", r"\1=", body)
# Strip newlines
body = re.sub(":?\n", ":", body)
print(body)
Example output:
$ python example.py
object_name_here:object_owner=me#my.email.com, user#another.email.com:object_id=some_string_here:identification=some_other_string_here

Searching a string for an exact match from a list in Python

I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, except when the string appears inside a longer word (for instance, if the desired string is "man" but they wrote "manager", the tweet gets retweeted anyway). I'm still pretty new to Python; my hunch is that regex is the way to go, but my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
    for phrase in list:
        if phrase in tweet["text"].lower():
            print tweet
            api.retweet(tweet["id"])
            return True

Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
    tweet_words = set(tweet["text"].lower().split())
    for phrase in list:
        if phrase in tweet_words:
            print tweet
            api.retweet(tweet["id"])
            return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
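The punctuation-purging function suggested above might look something like this. It is a rough sketch: the name tweet_word_set is made up, and it only strips ASCII punctuation (string.punctuation) from the ends of each word, so apostrophes inside words such as "she's" are kept.

```python
import string

def tweet_word_set(text):
    """Lowercase the text, split on whitespace, and trim punctuation from each word."""
    words = {word.strip(string.punctuation) for word in text.lower().split()}
    words.discard("")  # tokens that were nothing but punctuation
    return words

print("man" in tweet_word_set("He is the man!"))      # whole word, so it is found
print("man" in tweet_word_set("Ask the manager..."))  # only part of "manager"
```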
Given that optimization it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speeds rather than Python speeds). So in the code that follows let's suppose that you have during initialization executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
    "This vacation just isn't worth the bother",
    "Goodness me she's a great Perl programmer",
    "This one slides by under the radar",
    "I used to program COBOL but I'm all right now",
    "A visit to the doctor is not reported"
]
tweet_words = set(w.lower() for w in list)
for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)

If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
import re
pattern = re.compile(r"\bman\b")
if pattern.search(tweet["text"].lower()):
    pass  # do your thing
\b matches a word boundary in a regex, so prefixing and suffixing your pattern with it makes it match only whole-word occurrences. Hope it helps.
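To apply the same idea to a whole list of phrases rather than the single word "man", one option is to build a single alternation pattern. This is a sketch under assumptions: the helper name build_phrase_pattern is invented here, the phrase list is assumed to be non-empty, and each phrase is assumed to start and end with a word character (otherwise \b will not behave as expected).

```python
import re

def build_phrase_pattern(phrases):
    """Compile one case-insensitive pattern matching any phrase as a whole word."""
    # re.escape keeps characters such as "+" or "." in a phrase from acting as regex syntax
    alternatives = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_phrase_pattern(["man", "cobol"])
for text in ["He is the Man", "Ask the manager", "I used to program COBOL"]:
    print(text, "->", bool(pattern.search(text)))
```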

Spell check program in python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I wasn't given solutions to the problem, so can somebody tell me how I did and what the correct answer should be?
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?

There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right

I don't know what you are trying to iterate over, but why don't you just first iterate over your words (which are in the variable a I guess?) and then for every word in a you iterate over the wordlist and check whether or not that word is in the wordslist.
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.

It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
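For readers who are not doing this as homework, the set-difference idea can be sketched as below. The function name misspelled_words is made up, and the tokenization (runs of lowercase ASCII letters and apostrophes) is an assumption that the exercise itself does not specify.

```python
import re

def misspelled_words(text, word_list):
    """Return the unique words in text that are not in word_list, sorted."""
    known = {word.lower() for word in word_list}
    # a set removes duplicates, so each unknown word is reported once
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words - known)

word_list = ["the", "quick", "brown", "fox", "jumps"]
print(misspelled_words("The quick brwn fox. The fox jumps!", word_list))
```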

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search for a set of keywords. I have to extract the exact line where each keyword was found. I first used xpdf's pdftotext to convert the files to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirements.)
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is cases where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword with the code presented above. What is the most efficient way of handling this without making two copies of the text file (one for finding the keyword and its line number, and another with line breaks removed to catch the case where the keyword spans two lines)?
Much appreciated guys!

Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This way, if the match is found at the end of a line, you can take the next line into consideration, and you do line concatenation only if a match is found in the first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
import re

def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handle and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main problem with the linked code, for which I didn't code a workaround, for both performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the python wiki suggest that string concatenation is not that efficient in python (I don't know how current that information is).
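One way the (line_number, word) stream could be matched against a multi-word search string is sketched below. This is not the implementation from the gist; it is a separate illustration that keeps a sliding window of the most recent words, with the function name search_phrase invented for the example. It uses str.split() instead of re.split so that blank lines do not yield empty words, and it compares words exactly (case-insensitively), so punctuation attached to a word counts as part of it.

```python
import io
from collections import deque

def iterwords(fh):
    """Yield (line_number, word) for every word in the file, zero-based lines."""
    for number, line in enumerate(fh):
        for word in line.split():
            yield number, word

def search_phrase(fh, phrase):
    """Return 1-based line numbers on which a match of phrase starts."""
    target = phrase.lower().split()
    if not target:
        return []
    window = deque(maxlen=len(target))  # the last len(target) words seen
    hits = []
    for number, word in iterwords(fh):
        window.append((number, word.lower()))
        if [w for _, w in window] == target:
            hits.append(window[0][0] + 1)  # line of the first word in the window
    return hits

text = (
    "If you are going to index large binary files, remember to change the\n"
    "size limits. String\n"
    "Extraction is a common problem\n"
)
print(search_phrase(io.StringIO(text), "String Extraction"))
```

Because the window slides one word at a time, a search string whose first word is repeated in the file does not need special handling.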

There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
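The two-line approach described above could look roughly like this. It is a sketch with some assumptions: the function name find_phrase is made up, the input is a list of lines already read from the file, whitespace is normalized and case is ignored, and a phrase spanning more than two lines is not detected.

```python
def find_phrase(lines, phrase):
    """Return 1-based numbers of lines where phrase occurs or begins before wrapping."""
    needle = " ".join(phrase.lower().split())
    normalized = [" ".join(line.lower().split()) for line in lines]
    hits = []
    for i, line1 in enumerate(normalized):
        line2 = normalized[i + 1] if i + 1 < len(normalized) else ""
        if needle in line1:
            hits.append(i + 1)  # the whole phrase sits on this line
        elif needle in (line1 + " " + line2).strip() and needle not in line2:
            hits.append(i + 1)  # the phrase starts here and wraps onto the next line
    return hits

lines = [
    "If you are going to index large binary files, remember to change the",
    "size limits. String",
    "Extraction is a common problem",
]
print(find_phrase(lines, "String Extraction"))
```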

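Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all whitespace (including newlines). Note that \s matches newlines whether or not the flag is set; re.MULTILINE only changes how ^ and $ behave.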
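A sketch of the \s approach, assuming the whole file fits in memory as one string: the words of the search string are joined with \s+ so that any run of whitespace, including a line break, may separate them, and the line number is recovered by counting the newlines before each match. The function name find_with_regex is invented for this example.

```python
import re

def find_with_regex(text, phrase):
    """Return 1-based line numbers where phrase starts, allowing line breaks between words."""
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    # the number of newlines before the match start gives the zero-based line index
    return [text.count("\n", 0, match.start()) + 1 for match in pattern.finditer(text)]

text = (
    "If you are going to index large binary files, remember to change the\n"
    "size limits. String\n"
    "Extraction is a common problem\n"
)
print(find_with_regex(text, "String Extraction"))
```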
