Compare lines ignoring certain characters with difflib [duplicate] - python

According to the documentation, you can provide a linejunk function to ignore certain lines. However, I can't get it to work. Here is some sample code for discussion:
from re import search
from difflib import ndiff
t1 = 'one 1\ntwo 2\nthree 3'
t2 = 'one 1\ntwo 29\nthree 3'
diff = ndiff(t1.splitlines(), t2.splitlines(), lambda x: search('2', x))
My intention is to ignore the second line, so that diff is a generator that doesn't show any differences.
Thanks for the help.

I've recently met with the same problem.
Here's what I've found out:
cf. http://bugs.python.org/issue14332
The main intent of the *junk parameters is to speed up matching to
find differences, not to mask differences.
c.f.
http://hg.python.org/cpython/rev/0a69b1e8b7fe/
The patch provides a better explanation of the "junk" and "ignore" concepts in difflib docs
These junk-filtering functions speed up matching to find differences
and do not cause any differing lines or characters to be ignored.

Your example has a problem: the first two arguments to ndiff should each be a list of strings; you have a single string which is treated just like a list of characters. See the docs. Use e.g. t1 = 'one 1\ntwo 2\nthree 3'.splitlines()
However as the following example shows, difflib.ndiff doesn't call the linejunk function for all lines. This is longstanding behaviour -- verified with Python 2.2 to 2.6 inclusive, and 3.1.
Example script:
from difflib import ndiff
t1 = 'one 1\ntwo 2\nthree 3'.splitlines()
t2 = 'one 1\ntwo 29\nthree 3'.splitlines()
def lj(line):
    rval = '2' in line
    print("lj: line=%r, rval=%s" % (line, rval))
    return rval
d = list(ndiff(t1, t2 )); print("%d %r\n" % (1, d))
d = list(ndiff(t1, t2, lj)); print("%d %r\n" % (2, d))
d = list(ndiff(t2, t1, lj)); print("%d %r\n" % (3, d))
Output from running with Python 2.6:
1 [' one 1', '- two 2', '+ two 29', '? +\n', ' three 3']
lj: line='one 1', rval=False
lj: line='two 29', rval=True
lj: line='three 3', rval=False
2 [' one 1', '- two 2', '+ two 29', '? +\n', ' three 3']
lj: line='one 1', rval=False
lj: line='two 2', rval=True
lj: line='three 3', rval=False
3 [' one 1', '- two 29', '? -\n', '+ two 2', ' three 3']
You may wish to report this as a bug. However note that the docs don't say explicitly what meaning is attached to lines that are "junk". What output were you expecting?
Further puzzlement: adding these lines to the script:
t3 = 'one 1\n \ntwo 2\n'.splitlines()
t4 = 'one 1\n\n#\n\ntwo 2\n'.splitlines()
d = list(ndiff(t3, t4 )); print("%d %r\n" % (4, d))
d = list(ndiff(t4, t3 )); print("%d %r\n" % (5, d))
d = list(ndiff(t3, t4, None)); print("%d %r\n" % (6, d))
d = list(ndiff(t4, t3, None)); print("%d %r\n" % (7, d))
produces this output:
4 [' one 1', '- ', '+ ', '+ #', '+ ', ' two 2']
5 [' one 1', '+ ', '- ', '- #', '- ', ' two 2']
6 [' one 1', '- ', '+ ', '+ #', '+ ', ' two 2']
7 [' one 1', '+ ', '- ', '- #', '- ', ' two 2']
In other words the result when using the default linejunk function is the same as not using a linejunk function, in a case containing different "junk" lines (whitespace except for an initial hash).
Perhaps if you could tell us what you are trying to achieve, we might be able to suggest an alternative approach.
Edit after further info
If your intention is, in general, to ignore all lines containing '2', meaning to pretend that they don't exist for ndiff purposes, all you have to do is turn the pretence into reality:
t1f = [line for line in t1 if '2' not in line]
t2f = [line for line in t2 if '2' not in line]
diff = ndiff(t1f, t2f)
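The filtering idea above can be wrapped in a small helper. This is only a sketch: ndiff_ignoring and its ignore parameter are hypothetical names introduced here, not part of difflib.

```python
from difflib import ndiff

def ndiff_ignoring(a, b, ignore):
    """Diff two lists of lines after dropping every line for which ignore(line) is true.

    Hypothetical helper, not part of difflib.
    """
    a_kept = [line for line in a if not ignore(line)]
    b_kept = [line for line in b if not ignore(line)]
    return ndiff(a_kept, b_kept)

t1 = 'one 1\ntwo 2\nthree 3'.splitlines()
t2 = 'one 1\ntwo 29\nthree 3'.splitlines()

diff = list(ndiff_ignoring(t1, t2, lambda line: '2' in line))
# The surviving lines are common to both inputs, so each entry
# carries ndiff's two-space "unchanged" prefix.
print(diff)
```

Because the ignored lines are removed before ndiff sees them, they cannot show up in the result at all, which is different from what the linejunk argument does.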

Related

Delete spaces in dictionary values python

I'm reading the data from an external source. The data has "Name" and "Value with warnings" fields, so I put those in a dictionary like this:
d[data[i:i+6]] = data[i+8:i+17], data[i+25:i+36]
So in the end my dict looks like this:
{'GPT-P ': ('169 ', 'H '), 'GOT-P ': ('47 ', ' '), .....
As seen above, both keys and values might have unnecessary spaces.
I was able to get rid of the spaces in the keys with:
d = {x.replace(' ',''): v
     for x, v in d.items()}
but I can't seem to manage the same for the values. I tried using d.values(), but it trims the key name and also works for only one of the values.
Can you help me understand how I can remove the spaces from several values (2 values in this particular case) and end up with something like:
{'GPT-P': ('169', 'H'), 'GOT-P': ('47', ''), .....
Thanks. Stay safe and healthy
You will need to do the space replacement in your v values also but
it seems that in your case the values in your dictionary are tuples.
I guess you will want to remove spaces in all elements of each tuple so you will need a second iteration here. You can do something like this:
d = {'GPT-P ': ('169 ', 'H '), 'GOT-P ': ('47 ', ' ')}
{x.replace(' ', ''): tuple(w.replace(' ', '') for w in v) for x, v in d.items()}
Which returns:
{'GPT-P': ('169', 'H'), 'GOT-P': ('47', '')}
Notice that there is a generator expression passed to tuple(), tuple(w.replace(' ', '') for w in v), nested within the dictionary comprehension.
Given:
DoT={'GPT-P ': ('169 ', 'H '), 'GOT-P ': ('47 ', ' ')}
Since you have tuples of strings as your values, you need to apply .strip() to each string in the tuple:
>>> tuple(e.strip() for e in ('47 ', ' '))
('47', '')
Apply that to each key, value in a dict comprehension and there you are:
>>> {k.strip():tuple(e.strip() for e in t) for k,t in DoT.items()}
{'GPT-P': ('169', 'H'), 'GOT-P': ('47', '')}
You use .replace(' ','') in your attempt. That will replace ALL spaces:
>>> ' 1 2 3 '.replace(' ','')
'123'
It is more typical to use one of the .strips():
>>> ' 1 2 3 '.strip()
'1 2 3'
>>> ' 1 2 3 '.lstrip()
'1 2 3 '
>>> ' 1 2 3 '.rstrip()
' 1 2 3'
You can use .replace or any of the .strips() in the comprehensions that I used above.
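To make the difference concrete inside the dict comprehension, here is a small sketch. The 'TOTAL PROTEIN ' entry is made-up sample data (not from the question), added only to show what happens when a key contains an inner space.

```python
# Sample data; 'TOTAL PROTEIN ' is a hypothetical entry with an inner space.
raw = {'GPT-P ': ('169    ', 'H  '), 'TOTAL PROTEIN ': ('7.1  ', ' ')}

# strip() removes only leading and trailing whitespace.
stripped = {k.strip(): tuple(e.strip() for e in t) for k, t in raw.items()}

# replace(' ', '') removes every space, including the ones inside a name.
replaced = {k.replace(' ', ''): tuple(e.replace(' ', '') for e in t)
            for k, t in raw.items()}

print(stripped)
print(replaced)
```

With strip() the key stays 'TOTAL PROTEIN'; with replace() it becomes 'TOTALPROTEIN', which may or may not be what you want.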

Edabit task doesn't show correct result

I'm doing a simple task that requires sorting a list of expressions by their results, and I'm running this code:
def sort_by_answer(lst):
    ans = []
    dict = {}
    for i in lst:
        if 'x' in i:
            i = i.replace('x', '*')
        dict.update({i: eval(i)})
    dict = {k: v for k, v in sorted(dict.items(), key=lambda item: item[1])}
    res = list(dict.keys())
    for i in res:
        if '*' in i:
            i = i.replace('*', 'x')
            ans.append(i)
        else:
            ans.append(i)
    return ans
It checks out, but the site I'm doing this test for (here's a link to the task: https://edabit.com/challenge/9zf6scCreprSaQAPq) tells me that my list is not correctly sorted, which it is. Can someone help me improve this code so that it works in every case?
P.S.
if 'x' in i:
    i = i.replace('x', '*')
This is there so I can use the eval function, because the site's input has 'x' instead of '*' in its lists.
You can try this. But using eval is dangerous on untrusted strings.
In [63]: a=['1 + 1', '1 + 7', '1 + 5', '1 + 4']
In [69]: def evaluate(_str):
    ...:     return eval(_str.replace('x','*'))
output
In [70]: sorted(a,key=evaluate)
Out[70]: ['1 + 1', '1 + 4', '1 + 5', '1 + 7']
In [71]: sorted(['4 - 4', '2 - 2', '5 - 5', '10 - 10'],key=evaluate)
Out[71]: ['4 - 4', '2 - 2', '5 - 5', '10 - 10']
In [72]: sorted(['2 + 2', '2 - 2', '2 x 1'],key=evaluate)
Out[72]: ['2 - 2', '2 x 1', '2 + 2']
I don't think it is an issue with your code; they are probably using something older than 3.6, where a dict does not keep insertion order, and that is messing up the order of the dict. Tuples would be safer.
def sort_by_answer(lst):
    string = ','.join(lst).replace('x','*')
    l = string.split(',')
    d = [(k.replace('*','x'), eval(k)) for k in l]
    ans = [expr for expr, value in sorted(d, key = lambda x: x[1])]
    return ans
EDIT:
@Ch3steR's answer is more Pythonic:
def sort_by_answer(lst):
    return sorted(lst, key= lambda x: eval(x.replace('x','*')))
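Since eval is risky on untrusted strings, here is a sketch of an eval-free key function. It assumes every expression has the form "integer operator integer" separated by single spaces, as in the examples above; the OPS table and the evaluate name are illustrative choices, and the '/' entry is an assumption that does not appear in the examples.

```python
import operator

# Map the operator symbols to functions.
# 'x' is the multiplication sign used by the task input; '/' is an assumption.
OPS = {
    '+': operator.add,
    '-': operator.sub,
    'x': operator.mul,
    '/': operator.truediv,
}

def evaluate(expr):
    """Compute the value of a string such as '2 x 1' without eval."""
    left, op, right = expr.split()
    return OPS[op](int(left), int(right))

def sort_by_answer(lst):
    # sorted() is stable, so expressions with equal results keep their input order.
    return sorted(lst, key=evaluate)

print(sort_by_answer(['2 + 2', '2 - 2', '2 x 1']))
```

An expression that does not fit the assumed shape raises ValueError or KeyError instead of executing arbitrary code.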

Sort list of strings by very special key [closed]

I must implement sorting of a list of strings in a way that is very similar to the sorted function, but with one important difference. As you know, sorted orders the space character before digit characters, so sorted(['1 ', ' 9']) gives [' 9', '1 ']. I need a sort that orders digits before spaces, so in this example the result would be ['1 ', ' 9'].
Update
As I understand it, the default sorted behaviour relies on the order of chars in the ASCII 'alphabet' (i.e. ''.join([chr(i) for i in range(59, 127)])), so I decided to implement my own ASCII 'alphabet' in the my_ord function.
I planned to use this function together with a simple my_sort function as a key for sorted,
import string
def my_ord(c):
    punctuation1 = ''.join([chr(i) for i in range(32, 48)])
    other_stuff = ''.join([chr(i) for i in range(59, 127)])
    my_alphabet = string.digits + punctuation1 + other_stuff
    return my_alphabet.find(c)
def my_sort(w):
    return sorted(w, key=my_ord)
like this: sorted([' 1 ', 'abc', ' zz zz', '9 '], key=my_sort).
What I'm expecting in this case is ['9 ', ' 1 ', ' zz zz', 'abc']. Unfortunately, the result not only fails to match the expected one; it also differs from time to time.
You can use lstrip as the key function to ignore the whitespace at the front (left side) of the string.
r = sorted(['1 ', ' 9' , ' 4', '2 '], key=str.lstrip)
# r == ['1 ', '2 ', ' 4', ' 9']
From the docs: key specifies a function of one argument that is used to extract a comparison key from each list element.
Try this
import string
MY_ALPHABET = (
    string.digits
    + ''.join([chr(i) for i in range(32, 127) if chr(i) not in string.digits])
)
inp = [' 1 ', 'abc', ' zz zz', '9 ', 'a 1', 'a ']
print(inp, '-->', sorted(inp, key=lambda w: [MY_ALPHABET.index(c) for c in w]))
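A variation on the custom-alphabet idea, as a sketch: the key maps each string to a list of per-character ranks (rather than sorting the characters of each string, as my_sort in the question does), and the ranks are precomputed in a dict so each lookup is cheap. RANK and digits_first_key are hypothetical names, and placing characters outside printable ASCII after everything else is an assumption made here.

```python
import string

# Digits first, then the remaining printable ASCII characters in their usual order.
_printable = [chr(i) for i in range(32, 127)]
_alphabet = list(string.digits) + [c for c in _printable if c not in string.digits]
RANK = {c: i for i, c in enumerate(_alphabet)}

def digits_first_key(word):
    """Return one rank per character; lists compare element by element."""
    # Assumption: characters outside printable ASCII sort after everything else.
    return [RANK.get(c, len(RANK) + ord(c)) for c in word]

print(sorted(['1 ', ' 9'], key=digits_first_key))
print(sorted([' 1 ', 'abc', ' zz zz', '9 '], key=digits_first_key))
```

The dict lookup avoids the linear scan that str.index or str.find performs for every character.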
You want a combination of lexical and numerical sorting. You can do that by chopping up the string into a tuple and converting the digits to int. The tuple comparison will then consider each element by its own comparison rules.
I've used a regex to split the string into (beginning text, whitespace, the digits, everything else), created an int from the digits, and used the result as the key. If the string doesn't match the pattern, the function just returns the original string in a one-element tuple so that it can still be used for comparison.
I moved the whitespace that precedes the digits (group(2)) to after the number in the key, but it may make more sense to leave it out of the comparison completely.
import re
test = ['1 ', ' 9']
wanted = ['1 ', ' 9']
def sort_key(val):
    """Return tuple of (text, int, spaces, remainder) or just
    (text) suitable for sorting text lexicographically but embedded
    number numerically"""
    m = re.match(r"(.*?)(\s*)(\d+)(.*)", val)
    if m:
        return (m.group(1), int(m.group(3)), m.group(2), m.group(4))
    else:
        return (val,)
result = sorted(test, key=sort_key)
print(test, '-->', result)
assert result == wanted, "results compare"
For completeness and maybe efficiency in extreme cases, here is a solution using numpy argsort:
import numpy as np
lst = ['1 ', ' 9' , ' 4', '2 ']
order = np.argsort(np.array([s.lstrip() for s in lst]))
result = list(np.array(lst)[order])
Overall, I think that using sorted(..., key=...) is generally superior and that this solution makes more sense if the input is already a numpy array. On the other hand, it calls lstrip() only once per item and makes use of numpy, so it is possible that for large enough lists it could be faster. Additionally, it produces order, which shows where each sorted element was in the original list.
As a last comment: from the code you provide, as opposed to the example you give, I am not sure whether you just want to strip the leading whitespace or do more, e.g. strip punctuation (see the question "Best way to strip punctuation from a string in Python"), or first order on the string without punctuation and then, if those are equal, order on the rest (the solution by tdelaney). In any case it might not be a bad idea to compile a pattern, e.g.
import numpy as np
import re
pattern = re.compile(r'[^\w]')
lst = ['1 ', ' 9' , ' 4', '2 ']
order = np.argsort(np.array([pattern.sub('',s) for s in lst]))
result = list(np.array(lst)[order])
or:
import re
pattern = re.compile(r'[^\w]')
r = sorted(['1 ', ' 9' , ' 4', '2 '], key= lambda s: pattern.sub('',s))

Align numbers in sublist

I have a set of numbers that I want to align considering the comma:
   10       3
  200    4000,222  3    1,5
  200,21           0,3  2
30000       4,5    1
mylist = [['10', '3', '', ''],
          ['200', '4000,222', '3', '1,5'],
          ['200,21', '', '0,3', '2'],
          ['30000', '4,5', '1', '']]
What I want is to align this list considering the comma:
expected result:
mylist = [['   10   ', '   3    ', '   ', '   '],
          ['  200   ', '4000,222', '3  ', '1,5'],
          ['  200,21', '        ', '0,3', '2  '],
          ['30000   ', '   4,5  ', '1  ', '   ']]
I tried to turn the list:
mynewlist = list(zip(*mylist))
and to find the longest part after the comma in every sublist:
for m in mynewlist:
    max([x[::-1].find(',') for x in m])
and to use rjust and ljust but I don't know how to ljust after a comma and rjust before the comma, both in the same string.
How can I resolve this without using format()?
(I want to align with ljust and rjust)
Here's another approach that currently does the trick. Unfortunately, I can't see any simple way to make this work, maybe due to the time :-)
Either way, I'll explain it. r is the result list, created beforehand.
r = [[] for i in range(4)]
Then we loop through the values and also grab an index with enumerate:
for ind1, vals in enumerate(zip(*mylist)):
Inside the loop we grab the max length of the decimal digits present and the max length of the word (the word w/o the decimal digits):
    l = max(len(v.partition(',')[2]) for v in vals) + 1
    mw = max(len(v if ',' not in v else v.split(',')[0]) for v in vals)
Now we go through the values inside the tuple vals and build our results (yup, can't currently think of a way to avoid this nesting).
    for ind2, v in enumerate(vals):
If it contains a comma, it should be formatted differently. Specifically, we rjust it based on the max length of a word mw and then add the decimal digits and any white-space needed:
        if ',' in v:
            n, d = v.split(',')
            v = "".join((n.rjust(mw),',', d, " " * (l - 1 - len(d))))
In the opposite case, we simply .rjust and then add whitespace:
        else:
            v = "".join((v.rjust(mw) + " " * l))
finally, we append to r.
        r[ind1].append(v)
All together:
r = [[] for i in range(4)]
for ind1, vals in enumerate(zip(*mylist)):
    l = max(len(v.partition(',')[2]) for v in vals) + 1
    mw = max(len(v if ',' not in v else v.split(',')[0]) for v in vals)
    for ind2, v in enumerate(vals):
        if ',' in v:
            n, d = v.split(',')
            v = "".join((n.rjust(mw),',', d, " " * (l - 1 - len(d))))
        else:
            v = "".join((v.rjust(mw) + " " * l))
        r[ind1].append(v)
Now, we can print it out:
>>> print(*map(list,zip(*r)), sep='\n')
['   10   ', '   3    ', '   ', '   ']
['  200   ', '4000,222', '3  ', '1,5']
['  200,21', '        ', '0,3', '2  ']
['30000   ', '   4,5  ', '1  ', '   ']
Here's a slightly different solution that doesn't transpose mylist but instead iterates over it twice. On the first pass it generates a list of tuples, one for each column. Each tuple is a pair of numbers, where the first number is the length before the comma and the second number is the length of the comma plus everything following it. For example, '4000,222' results in (4, 4). On the second pass it formats the data based on the formatting info generated in the first pass.
from functools import reduce
mylist = [['10', '3', '', ''],
          ['200', '4000,222', '3', '1,5'],
          ['200,21', '', '0,3', '2'],
          ['30000', '4,5', '1', '']]
# Return tuple (left part length, right part length) for given string
def part_lengths(s):
    left, sep, right = s.partition(',')
    return len(left), len(sep) + len(right)
# Return string formatted based on part lengths
def format(s, first, second):
    left, sep, right = s.partition(',')
    return left.rjust(first) + sep + right.ljust(second - len(sep))
# Generator yielding part lengths row by row
parts = ((part_lengths(c) for c in r) for r in mylist)
# Combine part lengths to find maximum for each column
# For example data it looks like this: [[5, 3], [4, 4], [1, 2], [1, 2]]
sizes = reduce(lambda x, y: [[max(z) for z in zip(a, b)] for a, b in zip(x, y)], parts)
# Format result based on part lengths
res = [[format(c, *p) for c, p in zip(r, sizes)] for r in mylist]
print(*res, sep='\n')
Output:
['   10   ', '   3    ', '   ', '   ']
['  200   ', '4000,222', '3  ', '1,5']
['  200,21', '        ', '0,3', '2  ']
['30000   ', '   4,5  ', '1  ', '   ']
This works for Python 2 and 3. I didn't use ljust or rjust, though; I just added as many spaces before and after each number as it is missing relative to the widest number in the column:
mylist = [['10', '3', '', ''],
          ['200', '4000,222', '3', '1,5'],
          ['200,21', '', '0,3', '2'],
          ['30000', '4,5', '1', '']]
transposed = list(zip(*mylist))
sizes = [[(x.index(",") if "," in x else len(x), len(x) - x.index(",") if "," in x else 0)
          for x in l] for l in transposed]
maxima = [(max([x[0] for x in l]), max([x[1] for x in l])) for l in sizes]
withspaces = [
    [' ' * (maxima[i][0] - sizes[i][j][0]) + number + ' ' * (maxima[i][1] - sizes[i][j][1])
     for j, number in enumerate(l)] for i, l in enumerate(transposed)]
result = list(zip(*withspaces))
Printing the result in python3:
>>> print(*result, sep='\n')
('   10   ', '   3    ', '   ', '   ')
('  200   ', '4000,222', '3  ', '1,5')
('  200,21', '        ', '0,3', '2  ')
('30000   ', '   4,5  ', '1  ', '   ')
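For the specific question of how to rjust before the comma and ljust after it within the same string, str.partition splits each cell into those two parts. The following is a minimal sketch along the lines of the answers above; align_on_comma is a hypothetical name.

```python
def align_on_comma(rows):
    """Pad every cell so that the commas in each column line up."""
    # First pass: per column, the widest part before the comma and the
    # widest part from the comma onwards (comma included).
    widths = []
    for col in zip(*rows):
        parts = [cell.partition(',') for cell in col]
        left = max(len(head) for head, sep, tail in parts)
        right = max(len(sep) + len(tail) for head, sep, tail in parts)
        widths.append((left, right))

    # Second pass: rjust the integer part, ljust the comma plus decimals.
    aligned = []
    for row in rows:
        new_row = []
        for cell, (left, right) in zip(row, widths):
            head, sep, tail = cell.partition(',')
            new_row.append(head.rjust(left) + (sep + tail).ljust(right))
        aligned.append(new_row)
    return aligned

mylist = [['10', '3', '', ''],
          ['200', '4000,222', '3', '1,5'],
          ['200,21', '', '0,3', '2'],
          ['30000', '4,5', '1', '']]

for row in align_on_comma(mylist):
    print(row)
```

Cells without a comma get an empty right part, so ljust simply pads them with spaces up to the column's decimal width.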

comparing values in two dictionaries in python

I am stuck in the middle of my coding because of this:
I have two dictionaries as follows:
a = {0:['1'],1:['0','-3']}
b = {'box 4': ['0 and 2', '0 and -3', ' 0 and -1', ' 2 and 3'], 'box 0': [' 1 ', ' 1 and 4 ', ' 3 and 4']}
I want to find out whether the values in the first dictionary match values in the second and, if they do, return the matched key and values from dictionary b.
For example, the comparison should return 'box 4', ['0','-3'], because ['0','-3'] is an item in a and it is also found in b as '0 and -3'. However, if only '3' is found, I don't want anything returned, because no value in a matches it. The result should also include 'box 0', ['1'], because ['1'] is an item in a and it is also found in b.
Any ideas? I appreciate any help.
You say, "the result of the comparison will return box4 here as ['0','-3'] is an item in a and it has been found also in b ['0 and -3'],". I do not see '0 and -3' in b.
Also, your question is not clear enough. Your code snippets are not complete and you have presented just one case here.
Nevertheless, I will make the mistake of assuming that you want something like this:
normalized_values = set([" and ".join(tokens) for tokens in a.values()])
for k in b:
    if normalized_values.intersection(set(b[k])):
        print(k)
Here you go, simply coded:
>>> a_values = a.values()
>>> for x,y in b.items():
...     for i in y:
...         i = i.strip()
...         if len(i)>1:
...             i = i.split()[::2]
...             if i in a_values:
...                 print(x,i)
...         else:
...             if list(i) in a_values:
...                 print(x,list(i))
box 4 ['0', '-3']
box 0 ['1']
A more Pythonic way:
>>> import re
>>> [ [x,i] for x,y in b.items() for i in y if re.findall(r'-?\d',i) in a_values ]
[['box 4', ' 0 and -3'], ['box 0', ' 1 ']]
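For Python 3, the same matching can be sketched as a small function. matching_entries is a hypothetical name, and the pattern uses \d+ instead of \d so that multi-digit numbers are kept whole, which is an adjustment to the one-liner above.

```python
import re

a = {0: ['1'], 1: ['0', '-3']}
b = {'box 4': ['0 and 2', '0 and -3', ' 0 and -1', ' 2 and 3'],
     'box 0': [' 1 ', ' 1 and 4 ', ' 3 and 4']}

def matching_entries(a, b):
    """Return (key, numbers) pairs from b whose numbers equal a value list in a."""
    wanted = list(a.values())
    matches = []
    for box, entries in b.items():
        for entry in entries:
            # Pull out the (possibly negative) integers, ignoring 'and' and spaces.
            numbers = re.findall(r'-?\d+', entry)
            if numbers in wanted:
                matches.append((box, numbers))
    return matches

print(matching_entries(a, b))
```

An entry matches only when its full list of numbers equals one of a's values, so ' 2 and 3' is not reported just because it contains a 3.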
