Programmatically figuring out if translated names are equivalent - python

I'm trying to see if two translated names are equivalent. Sometimes the translation will have the names ordered differently. For example:
>>> import difflib
>>> a = 'Yuk-shing Au'
>>> b = 'Au Yuk Sing'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.6086956521739131
'Yuk-Shing Au' and 'Au Yuk Sing' are the same person. Is there a way to detect something like this, such that the ratio for names like this will be much higher? Similar to the result for:
>>> a = 'Yuk-shing Au'
>>> b = 'Yuk Sing Au'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.8181818181818182

You can normalize the ordering of names before comparing:
def normalize(name):
    # Lowercase first so capitalization does not affect the sort order
    name_parts = name.lower().replace("-", " ").split()
    return " ".join(sorted(name_parts))
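As a rough sketch of how such a normalization could plug into the comparison from the question: the helper name name_similarity is made up for illustration, and lowercasing before sorting is an assumption so that capitalization cannot change the order of the parts.

```python
import difflib

def normalize(name):
    # Lowercase, treat hyphens as spaces, then sort the parts
    # so that word order no longer matters
    parts = name.lower().replace("-", " ").split()
    return " ".join(sorted(parts))

def name_similarity(a, b):
    # Hypothetical helper: compare the normalized forms of both names
    return difflib.SequenceMatcher(a=normalize(a), b=normalize(b)).ratio()

score = name_similarity('Yuk-shing Au', 'Au Yuk Sing')
print(round(score, 2))
```

With the ordering normalized, the two spellings from the question differ only by the "h" in "shing", so the ratio should end up well above the 0.61 reported there.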

Related

How to simply align colons in strings in Python?

I need to align the colons in a list of strings. For example:
Input:
1-1: abc
1-2-1: defghi
1-2-1a: jklmnopqr
1-2-1a-1-1-1a: stuvwxyz
Output:
1-1          : abc
1-2-1        : defghi
1-2-1a       : jklmnopqr
1-2-1a-1-1-1a: stuvwxyz
And the following is my solution.
strs = ['1-1: abc', '2-2-2: defghi', '3-3-3b: jklmnopqr', '1-2-1a-1-1-1a: stuvwxyz']
lengths = [s.find(':') for s in strs]
new_strs = []
for i, s in enumerate(strs):
    if lengths[i] == -1:
        new_strs.append(s)
    else:
        new_strs.append(s[:lengths[i]] + ' ' * (max(lengths) - lengths[i]) + s[lengths[i]:])
Is there a simpler way to implement this? Thank you.
Functions like ljust and rjust are your friends when trying to align output:
>>> aligned = [f"{s.split(':')[0].ljust(max(s.index(':') for s in strs))}:{s.split(':')[1]}" for s in strs]
>>> print("\n".join(aligned))
1-1          : abc
2-2-2        : defghi
3-3-3b       : jklmnopqr
1-2-1a-1-1-1a: stuvwxyz
or a little less compactly:
>>> i = max(s.index(":") for s in strs)
>>> cols = [s.split(":") for s in strs]
>>> aligned = [f"{c[0].ljust(i)}:{c[1]}" for c in cols]
Python f-strings let you reference variables in the current namespace, and the format specification can itself take values such as the field width from variables. After splitting each string into its two parts, find the longest value to the left of the colon; its length is the padding width for all of them.
>>> strs = ['1-1: abc', '2-2-2: defghi', '3-3-3b: jklmnopqr', '1-2-1a-1-1-1a: stuvwxyz']
>>> parts = [s.split(":", 1) for s in strs]
>>> field_len = max(len(p[0]) for p in parts)
>>> for p0, p1 in parts:
...     print(f"{p0:<{field_len}s}:{p1}")
...
1-1          : abc
2-2-2        : defghi
3-3-3b       : jklmnopqr
1-2-1a-1-1-1a: stuvwxyz
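The answers above assume every string contains a colon, while the code in the question also passes through lines that have none. A minimal sketch that keeps that behaviour, using str.partition (the function name align_colons is hypothetical):

```python
def align_colons(lines):
    # Split each line at its first colon; sep is '' when there is no colon
    parts = [line.partition(":") for line in lines]
    # Pad to the longest prefix among the lines that do contain a colon
    width = max((len(head) for head, sep, _ in parts if sep), default=0)
    return [
        f"{head:<{width}}{sep}{tail}" if sep else head
        for head, sep, tail in parts
    ]

aligned = align_colons(['1-1: abc', '2-2-2: defghi', 'no colon here'])
print("\n".join(aligned))
```

Lines without a colon are returned unchanged and do not influence the padding width.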

Perform mathematical expression written as a string using variables values from dictionary (python)

Let's say I have this operation as a string variable:
formula = '{a} + {b}'
And I have a dictionary such as
data = {'a': 3, 'b': 4}
Is there such a functionality in some library where:
evaluate(operation = formula, variables = data)
gives:
7
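If you are using Python 3 you can do something like this with string formatting. Note that ast.literal_eval only evaluates an expression such as '3 + 4' on Python 3.6 and earlier; since Python 3.7 it rejects arithmetic between arbitrary numbers and raises ValueError: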
>>> import ast
>>> data = {'a': 3, 'b': 4}
>>> formula = '{a} + {b}'
>>> res_string = formula.format(a=data['a'], b=data['b'])
>>> res = ast.literal_eval(res_string)
>>> print(res)
7
Or even better, as pointed out by Steven in the comments:
res_string = formula.format(**data)
Or, if you are using Python 3.6 or later, you can even do this with an f-string (this sums the values directly rather than evaluating the formula):
>>> f"{sum(data.values())}"
'7'
Although not recommended, you can use eval(). Check out:
>>> data = {'a': 3, 'b': 4}
>>> eval('{a} + {b}'.format(**data))
7
eval() will try to execute the given string as Python code, so never pass it input you do not trust.
For more information about python format you can take a look at the really nice pyformat site.
First you need to parse your string; then you need a dictionary that maps the operators you find to their equivalent functions, for which you can use the operator module:
In [53]: import re
In [54]: from operator import add
In [55]: operators = {'+': add}  # an example; add further operators to this dictionary as needed
In [56]: def evaluate(formula, data):
   ....:     a, op, b = re.match(r'{(\w)} (\W) {(\w)}', formula).groups()
   ....:     op = operators[op]
   ....:     a, b = data[a], data[b]
   ....:     return op(a, b)
   ....:
In [57]: evaluate(formula, data)
Out[57]: 7
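The operator-dictionary idea can be extended to longer formulas by letting the ast module do the parsing and then evaluating only the node types you allow. The following is a sketch under assumptions, not a complete expression evaluator: it assumes Python 3.8 or later (where numbers parse as ast.Constant), the _BIN_OPS table is an illustrative choice, and anything outside that table is rejected.

```python
import ast
import operator

# Operators this sketch is willing to evaluate; extend as needed
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(formula, variables):
    # Substitute the values, then walk the parsed expression tree
    tree = ast.parse(formula.format(**variables), mode="eval")

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Names, calls, attribute access, etc. are refused
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    return _eval(tree)

result = evaluate('{a} + {b}', {'a': 3, 'b': 4})
print(result)  # 7
```

Unlike eval(), this refuses function calls and name lookups, at the cost of supporting only the arithmetic listed in the table.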

python find digits with leading '_v'

Is there a better way of finding digits in a string which starts with '_v' which stands for version number? What I want is just '001'
filename = 'greatv02_v001_jam.mb'
parts = re.split('_v|\_',filename)
>>['greatv02', '001', 'jam.mb']
b = re.findall(r'\d+', filename)
>>['02', '001']
Is there a way to split a string with something along these lines?
parts = re.split('_v###_',filename)
or
parts = re.split('_v*_',filename)
You could use lookarounds:
>>> filename = 'greatv02_v001_jam.mb'
>>> import re
>>> re.findall(r'(?<=_v)\d+', filename)
['001']
>>>
>>> filename = 'greatv02_v001_av456jam.mb'
>>> re.findall(r'(?<=_v)\d+', filename)
['001']
>>> filename = 'greatv02_v001_v456jam.mb'
>>> re.findall(r'(?<=_v)\d+', filename)
['001', '456']
>>>
Ugly, but you could partition the file name twice
>>> filename.partition('_v')[2].partition('_')[0]
'001'
Use regex's grouping like this:
.*_v(\d+).*
Demo:
>>> filename = 'greatv02_v001_jam.mb'
>>> pattern = re.compile(r'.*_v(\d+).*')
>>> re.search(pattern, filename).group(1)
'001'
How about the regex _v(?P<version>\d+).*:
>>> string = 'greatv02_v001_jam.mb'  # the file name from the question
>>> regex = re.compile(r"_v(?P<version>\d+).*")
>>> r = regex.search(string)
# List the groups found
>>> r.groups()
(u'001',)
# List the named dictionary objects found
>>> r.groupdict()
{u'version': u'001'}
# Run findall
>>> regex.findall(string)
[u'001']
# Run timeit test
>>> setup = ur"import re; regex =re.compile("_v(?P<version>\d+).*");string="""greatv02_v00 ...
>>> t = timeit.Timer('regex.search(string)',setup)
>>> t.timeit(10000)
0.005126953125
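All of the regex answers above fail with an AttributeError or return an empty list when the file name has no version tag. A small sketch that wraps the capture-group pattern and returns None in that case (the name extract_version is hypothetical):

```python
import re

# '_v' followed by one or more digits; the digits are captured
_VERSION_RE = re.compile(r"_v(\d+)")

def extract_version(filename):
    # Return the digits after the first '_v<digits>' tag, or None if absent
    match = _VERSION_RE.search(filename)
    return match.group(1) if match else None

print(extract_version('greatv02_v001_jam.mb'))  # 001
print(extract_version('no_version.mb'))         # None
```

The version is returned as a string so that leading zeros such as '001' are preserved.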

Extract some interested part of list value

From this list:
List = ['/asd/dfg/ert.py','/wer/cde/xcv.img']
I want to get this:
List = ['ert.py','xcv.img']
There's a low-level split-based approach:
>>> a = ['/asd/dfg/ert.py','/wer/cde/xcv.img']
>>> b = [elem.split("/")[-1] for elem in a]
>>> b
['ert.py', 'xcv.img']
Or a higher-level, more descriptive approach, which is probably more robust:
>>> import os
>>> b = [os.path.basename(filename) for filename in a]
>>> b
['ert.py', 'xcv.img']
Of course this assumes that I've guessed right about what you wanted; your example is somewhat underspecified.
$List = array('/asd/dfg/ert.py','/wer/cde/xcv.img');
$pattern = "#/.*/#";
foreach ($List AS $key => $str)
    $List[$key] = preg_replace($pattern, '', $str);
print_r($List);
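On Python 3, pathlib offers another way to take the last path component. This sketch assumes the entries are POSIX-style paths, as in the question, and uses PurePosixPath so the result does not depend on the operating system running the code:

```python
from pathlib import PurePosixPath

paths = ['/asd/dfg/ert.py', '/wer/cde/xcv.img']
# .name is the final component of the path, i.e. the file name
names = [PurePosixPath(p).name for p in paths]
print(names)  # ['ert.py', 'xcv.img']
```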

How to search and export value with python and re?

I'm trying to export some value from the text to a txt file.
my text has this form:
"a='one' b='2' c='3' a='two' b='8' c='3'"
I want to export all the value of the key "a"
The result must be like
one
two
The other answers are correct for your particular case, but I think a regex with lookbehind/lookahead is a more general solution, i.e.:
import re
text = "a='one' b='2' c='3' a='two' b='8' c='3'"
expr = r"(?<=a=')[^']*(?=')"
matches = re.findall(expr,text)
for m in matches:
    print(m)  # or whatever
This matches any text between single quotes that is preceded by a=, so a='xyz', a='my#1.abcd' and a='a=5%' will all match.
This regex is very easy to understand:
pattern = r"a='(.*?)'"
It doesn't use lookarounds (like (?<=a=')[^']*(?=') ), so it's very simple.
Whole program:
#!/usr/bin/python
import re
text = "a='one' b='2' c='3' a='two' b='8' c='3'"
pattern = r"a='(.*?)'"
for m in re.findall(pattern, text):
    print(m)
You can use something like this:
import re
r = re.compile(r"'([a-z]+)'")
# Read the whole input file; the with block closes it afterwards
with open('input') as f:
    text = f.read()
for mm in r.finditer(text):
    print(mm.group(1))
Thought I would give a solution without re:
>>> text = "a='one' b='2' c='3' a='two' b='8' c='3'"
>>> step1 = text.split(" ")
>>> step1
["a='one'", "b='2'", "c='3'", "a='two'", "b='8'", "c='3'"]
>>> step2 = []
>>> for pair in step1:
...     split_pair = pair.split("=")
...     step2.append([split_pair[0], split_pair[1]])
...
>>> print(step2)
[['a', "'one'"], ['b', "'2'"], ['c', "'3'"], ['a', "'two'"], ['b', "'8'"], ['c', "'3'"]]
>>> results = []
>>> for split_pair in step2:
...     if split_pair[0] == "a":
...         results.append(split_pair[1])
...
>>> results
["'one'", "'two'"]
Not the most elegant method, but it works (note that the values keep their surrounding quotes).
Another non-regex solution: you could use the shlex module and the .partition method (or .split() with maxsplit=1):
>>> import shlex
>>> s = "a='one' b='2' c='3' a='two' b='8' c='3'"
>>> shlex.split(s)
['a=one', 'b=2', 'c=3', 'a=two', 'b=8', 'c=3']
>>> shlex.split(s)[0].partition("=")
('a', '=', 'one')
and so it's simply
>>> for group in shlex.split(s):
...     key, eq, val = group.partition("=")
...     if key == 'a':
...         print(val)
...
one
two
with lots of variations of the same.
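None of the answers above shows the final step from the question, writing the values to a text file. A minimal sketch that combines the capture-group regex with a file write; the function name export_values and the temporary output path are illustrative assumptions:

```python
import os
import re
import tempfile

def export_values(text, key, path):
    # Collect every value assigned to `key` in key='value' form
    values = re.findall(r"\b" + re.escape(key) + r"='([^']*)'", text)
    # Write one value per line
    with open(path, "w") as out:
        out.write("\n".join(values) + "\n")
    return values

text = "a='one' b='2' c='3' a='two' b='8' c='3'"
# Hypothetical output location; use any writable path instead
out_path = os.path.join(tempfile.mkdtemp(), "a_values.txt")
exported = export_values(text, "a", out_path)
print(exported)  # ['one', 'two']
```

The \b in the pattern keeps a key such as a from also matching the tail of a longer key like data.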
