diff for single lines - python

All diff tools I've found are just comparing line by line instead of char by char. Is there any library that gives details on single line strings? Maybe also a percentage difference, though I guess there are separate functions for that?

This algorithm diffs word-by-word:
http://github.com/paulgb/simplediff
available in Python and PHP. It can even spit out HTML formatted output using the <ins> and <del> tags.

I was looking for something similar recently, and came across wdiff. It operates on words, not characters, but is this close to what you're looking for?

What you could try is to split both strings up character by character into lines and then you can use diff on that. It's a dirty hack, but atleast it should work and is quite easy to implement.
Alternately you can split the string up into a list of chars in Python and use difflib. Check Python difflib reference

You can implement a simple Needleman–Wunsch algorithm. The pseudo code is available on Wikipedia: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

Related

How to use re.search to find multiple strings?

I'm trying to check if a certain line in many different text documents equals one of many different strings. To goal here is to classify those documents and then parse them according to that classification.
In my text editor I can use regex to search for:
(?:^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)
But if I try to use this in an if statement in python I keep getting syntax errors for any way I tried it.
My current version is this:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower()replace('´','')):
no_author = True
I saw that a possible solution is using a for loop and putting the different strings into a list, but as this would require some extra steps I'd prefer to do it as I tried if possible.
You should include what the error is. Your problem is probably just a typo:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower().replace('´','')):
no_author = True

Python3 - Generate string matching multiple regexes, without modifying them

I would like to generate string matching my regexes using Python 3. For this I am using handy library called rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way, how to create string that would match both my regexes.
What I cannot do:
Modify both regexes or join them in any way. This I consider as ineffective solution, especially in the case if incompatible regexes:
import re
import rstr
regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')
for index in range(0, 1000):
generated_string = rstr.xeger(regex1)
if re.fullmatch(regex2, generated_string):
break;
else:
raise Exception('Regexes are probably incompatibile.')
print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
Asker already has the string, which he just want to check against multiple regexes in the most elegant way. In my case we need to generate string in a smart way that would match regexes.
If you want really generic way, you can't really use brute force approach.
What you look for is create some kind of representation of regexp (as rstr does through call of sre_parse.py) and then calling some SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex which uses Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite a bullet and call it as external program from your python program.
I don't know if there is something that can fulfill your needs much smother.
But I would do it something like (as you've done it already):
Create a Regex object with the re.compile() function.
Generate String based on 1st regex.
Pass the string you've got into the 2nd regex object using search() method.
If that passes... your done, string passed both regexs.
Maybe you can create a function and pass both regexes as parameters and test "2 by 2" using the same logic.
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
I solved this using a little alternative approach. Notice second regex is basically insurance so only lowercase letters are generated in our new string.
I used Google's python package sre_yield which allows charset limitation. Package is also available on PyPi. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
You can use something like that:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.

Is there a python way of serially searching a string with regexes, akin to java's re.appendReplacement and appendTail

I cannot find a way to serially search a string and append replacements. Let's say I am implementing a templating language. A simplified template looks something like this:
Hello words on #DATE# in #COUNTRY# on this beautiful day.
Imagine a very long template, with many #SOMETHING# tags. Now I want to use regex to parse through this, and every time I found #SOMETHING#, do some python logic, replace it with some string, append it, and continue. All I found is that I can break the string up into tokens and matches and then reassemble it. Is there something better, without generating all those string chunks? Maybe I am trying to optimize too early, but in Java, we have the
appendReplacement(StringBuffer,String) and appendTail(StringBuffer)
methods and I was wondering if something similar can be done in Python.
See http://docs.oracle.com/javase/tutorial/essential/regex/matcher.html
You can use a function as the "replacement" in re.sub. Then re.sub will invoke your function for every match in the string, and the return value of the function will be the replacement in the string.

regular expression help with converting exp1^exp2 to pow(exp1, exp2)

I am converting some matlab code to C, currently I have some lines that have powers using the ^, which is rather easy to do with something along the lines \(?(\w*)\)?\^\(?(\w*)\)?
works fine for converting (glambda)^(galpha),using the sub routine in python pattern.sub(pow(\g<1>,\g<2>),'(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha));
Unfortunately, regular expressions aren't the right tool for handling nested structures. There are some regular expressions engines (such as .NET) which have some support for recursion, but most — including the Python engine — do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.
I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.
Nested parenthesis cannot be described by a regexp and require a full parser (able to understand a grammar, which is something more powerful than a regexp). I do not think there is a solution.
See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.
An alternative would be to iterate until all ^ have been exhausted. no?.
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
str!.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end

Categories