Regex match words and end of string - python

Two regex questions:
How can I match one or two words in a subpattern ()?
How can I match one or two words that are followed either by a specific word like "with" or by the end of the string ($)?
I tried
(\w+\W*\w*\b)(\W*\bwith\b|$)
but it's definitely not working
edit:
I'm thinking of matching both "go to mall" and "go to", in a way that lets me group "go to" in Python.

Perhaps something like this?
>>> import re
>>> r = re.compile(r'(\w+(\W+\w+)?)(\W+with\b|\Z)')
>>> r.search('bar baz baf bag').group(1)
'baf bag'
>>> r.search('bar baz baf with bag').group(1)
'baz baf'
>>> r.search('bar baz baf without bag').group(1)
'without bag'
>>> r.search('bar with bag').group(1)
'bar'
>>> r.search('bar with baz baf with bag').group(1)
'bar'
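A possible variation on the pattern above, offered as a sketch rather than a definitive answer: the inner group is made non-capturing and the "with"/end-of-string part becomes a lookahead, so only the one or two words are consumed by the match. The helper name words_before_with is made up for illustration.

```python
import re

# One or two words, followed (without consuming it) by "with" or the end of the string.
PATTERN = re.compile(r'(\w+(?:\W+\w+)?)(?=\W+with\b|\Z)')

def words_before_with(text):
    """Return the captured word(s), or None when nothing matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(words_before_with('bar baz baf with bag'))  # baz baf
print(words_before_with('bar with bag'))          # bar
print(words_before_with('bar baz baf bag'))       # baf bag
```

Because re.search returns the leftmost match, the result is the first position in the string where one or two words are immediately followed by "with" or by the end of the string.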

Here's what I came up with:
import re
class Bunch(object):
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
match = re.compile(
    flags = re.VERBOSE,
    pattern = r"""
        ( (?!with) (?P<first> [a-zA-Z_]+ ) )
        ( \s+ (?!with) (?P<second> [a-zA-Z_]+ ) )?
        ( \s+ (?P<awith> with ) )?
        (?![a-zA-Z_\s]+)
        | (?P<error> .* )
    """
).match
s = 'john doe with'
b = Bunch(**match(s).groupdict())
print('s:', s)
if b.error:
    print('error:', b.error)
else:
    print('first:', b.first)
    print('second:', b.second)
    print('with:', b.awith)
Output:
s: john doe with
first: john
second: doe
with: with
Tried it also with:
s: john
first: john
second: None
with: None
s: john doe
first: john
second: doe
with: None
s: john with
first: john
second: None
with: with
s: john doe width
error: john doe width
s: with
error: with
BTW: re.VERBOSE and re.DEBUG are your friends.
Regards,
Mick.
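As a rough Python 3 sketch of the same idea (not the author's exact code): re.VERBOSE allows comments inside the pattern, and fullmatch can stand in for the trailing lookahead and the error alternative. The function name parse and the use of \b after "with" are assumptions added here for illustration.

```python
import re

PATTERN = re.compile(
    r"""
    (?!with\b) (?P<first> [a-zA-Z_]+ )                # first word, must not be the word with
    (?: \s+ (?!with\b) (?P<second> [a-zA-Z_]+ ) )?    # optional second word, same restriction
    (?: \s+ (?P<awith> with ) )?                      # optional trailing keyword
    """,
    re.VERBOSE,
)

def parse(s):
    """Return the named groups as a dict, or None if the whole string does not fit."""
    m = PATTERN.fullmatch(s)
    return m.groupdict() if m else None

print(parse('john doe with'))   # {'first': 'john', 'second': 'doe', 'awith': 'with'}
print(parse('john doe width'))  # None
```

Returning None instead of filling an error group keeps the pattern shorter; which of the two is preferable depends on how the caller wants to report bad input.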

Related

Getting rid of white space between name, number and height

I have txt file like this;
name lastname     17     189cm
How do I get it to be like this?
name lastname, 17, 189cm
Using str.strip and str.split:
>>> my_string = 'name lastname 17 189cm'
>>> s = list(map(str.strip, my_string.split()))
>>> ', '.join([' '.join(s[:2]), *s[2:] ])
'name lastname, 17, 189cm'
You can use a regex to replace runs of two or more whitespace characters (or a tab) with a comma and a space:
import re
text = 'name lastname     17     189cm'
re.sub(r'\s\s+|\t', ', ', text)  # 'name lastname, 17, 189cm'
text = 'name lastname 17 189cm'
out = ', '.join(text.rsplit(maxsplit=2)) # if sep is not provided then any consecutive whitespace is a separator
print(out) # name lastname, 17, 189cm
You could use re.sub to replace runs of two or more spaces:
import re
s = "name lastname     17     189cm"
re.sub("[ ]{2,}", ", ", s)  # 'name lastname, 17, 189cm'
PS: for the first problem you proposed, I had the following solution:
s = "name lastname 17 189cm"
s[::-1].replace(" ",",", 2)[::-1]
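Since the question mentions a text file, here is a minimal per-line sketch, assuming that fields are separated by two or more whitespace characters (or a tab) and that single spaces belong inside a field. The function name normalize_line is hypothetical.

```python
import re

def normalize_line(line):
    """Join fields separated by 2+ whitespace characters (or a tab) with ', '."""
    fields = re.split(r'\s{2,}|\t', line.strip())
    return ', '.join(fields)

print(normalize_line('name lastname     17     189cm'))  # name lastname, 17, 189cm
```

Applied to a file, this could be called on each line in turn, for example inside a loop over an open file object.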

Combined regex pattern to match beginning and end of string and remove a separator character

I have the following strings:
"LP, bar, company LLP, foo, LLP"
"LLP, bar, company LLP, foo, LP"
"LLP,bar, company LLP, foo,LP" # note the absence of a space after/before comma to be removed
I am looking for a regex that takes those inputs and returns the following:
"LP bar, company LLP, foo LLP"
"LLP bar, company LLP, foo LP"
"LLP bar, company LLP, foo LP"
What I have so far is this:
import re
def fix_broken_entity_names(name):
    """
    LLP, NAME -> LLP NAME
    NAME, LP -> NAME LP
    """
    pattern_end = r'^(LL?P),'
    pattern_beg_1 = r', (LL?P)$'
    pattern_beg_2 = r',(LL?P)$'
    combined = r'|'.join((pattern_beg_1, pattern_beg_2, pattern_end))
    return re.sub(combined, r' \1', name)
When I run it tho:
>>> fix_broken_entity_names("LP, bar, company LLP, foo,LP")
Out[1]: ' bar, company LLP, foo '
I'd be very thankful for any tips or solutions :)
You can use
import re
texts = ["LP, bar, company LLP, foo, LLP","LLP, bar, company LLP, foo, LP","LLP,bar, company LLP, foo,LP"]
for text in texts:
    result = ' '.join(re.sub(r"^(LL?P)\s*,|,\s*(LL?P)$", r" \1\2 ", text).split())
    print("'{}' -> '{}'".format(text, result))
Output:
'LP, bar, company LLP, foo, LLP' -> 'LP bar, company LLP, foo LLP'
'LLP, bar, company LLP, foo, LP' -> 'LLP bar, company LLP, foo LP'
'LLP,bar, company LLP, foo,LP' -> 'LLP bar, company LLP, foo LP'
See a Python demo. The regex is ^(LL?P)\s*,|,\s*(LL?P)$:
^(LL?P)\s*, - start of string, LLP or LP (Group 1), zero or more whitespaces, comma
| - or
,\s*(LL?P)$ - a comma, zero or more whitespaces, LP or LLP (Group 2) and then the end of the string.
Note the replacement is a concatenation of Group 1 and 2 values enclosed within single spaces, and a post-process step is to remove all leading/trailing whitespace and shrink whitespace inside the string to single spaces.
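The same regex idea can be sketched with a replacement function instead of the split/join post-processing step. This is only one possible variant: the names fix_entity_name and _EDGE are made up, and the pattern is widened slightly to swallow the whitespace around the comma. It also sidesteps what appears to be the problem in the question's code, where \1 refers only to the first group of the combined pattern and is empty whenever a different alternative matched.

```python
import re

# LP/LLP at the very start followed by a comma, or a comma followed by LP/LLP at the very end.
_EDGE = re.compile(r"^(LL?P)\s*,\s*|\s*,\s*(LL?P)$")

def fix_entity_name(name):
    def repl(m):
        if m.group(1):               # leading "LLP," -> "LLP "
            return m.group(1) + " "
        return " " + m.group(2)      # trailing ",LP" -> " LP"
    return _EDGE.sub(repl, name)

print(fix_entity_name("LLP,bar, company LLP, foo,LP"))  # LLP bar, company LLP, foo LP
```

Commas in the middle of the string are left alone because both alternatives are anchored with ^ or $.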
Make use of capture groups and reformat it how you wish:
regex:
([^,\r\n]+) *, *([^,\r\n]+) *, *([^,\r\n]+) *, *([^,\r\n]+) *, *([^,\r\n]+)
replacement
\1 \2, \3, \4 \5
https://regex101.com/r/jcEzzy/1/

Check if first letters of consecutive words in string match acronym of another string

Say I have a list and a string:
l=['hello my name is michael',
'hello michael is my name',
'hello michaela is my name',
'hello my name is michelle',
"hello i'm Michael",
'hello my lastname is michael',
'hello michael',
'hello my name is michael brown']
s="hello my name is michael"
Internally, I want to search for each word in the string and count how many times each word from this string appears in each list element.
hello my name is michael: 5
hello michael is my name: 5 (all words are present)
hello michaela is my name: 5 (extra characters at end of word are Ok)
hello my name is michelle: 4
hello i'm Michael: 2
hello my lastname is michael: 4 (extra characters at the start of a word are not Ok)
hello michael: 2
hello my name is michael brown: 5
Finally, I wish to return all matches in the order of the highest count items first. So the output would be:
hello my name is michael: 5
hello michael is my name: 5
hello michaela is my name: 5
hello my name is michael brown: 5
hello my name is michelle: 4
hello my lastname is michael: 4
hello i'm Michael: 2
hello michael: 2
This is essentially a regex matching and sorting problem, but I am in over my head on this one. Any advice on how to proceed with any or all of the steps?
I don't understand your expected output. Do you mean like this:
import re
l = ['hello my name is michael',
'hello michael is my names',
'hello michaela is my name',
'hello my name is michelle',
'hello i am Michael',
'hello my lastname is michael',
'hello michael',
'hello my name is michael brown']
s = "Hello my name is Michael"
s = s.lower().split()
for item in l:
    d = item.lower().split()
    count = 0
    for ss in s:
        # count the word if it appears as-is, or as the start of a longer word in the item
        m = re.search(ss + r"\w+", item.lower())
        if ss in d or (m is not None and m.group() in d):
            count += 1
    print(item, count)
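The sorting step asked about in the question is not covered above. A minimal sketch without regex, assuming that a query word counts when some word in the candidate starts with it (so "michaela" matches "michael" but "lastname" does not match "name"); the names score and rank are hypothetical.

```python
def score(query, candidate):
    """Count the query words that are a prefix of at least one word in candidate."""
    words = candidate.lower().split()
    return sum(any(w.startswith(q) for w in words) for q in query.lower().split())

def rank(query, candidates):
    """Return (candidate, score) pairs, highest score first; ties keep their input order."""
    scored = [(c, score(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

items = ['hello my name is michael',
         'hello my name is michelle',
         'hello michael',
         'hello michaela is my name']
for text, count in rank('hello my name is michael', items):
    print(text, count)
```

Python's sort is stable, so entries with equal counts stay in their original relative order, which matches the ordering shown in the question.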

how to find a particular string in an element of array in python

I have a list of strings in python and if an element of the list contains the word "parthipan" I should print a message. But the below script is not working
import re
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
my_regex = r"(?mis){0}".format(re.escape(last_name))
if my_regex in a:
    print("matched")
The first element of the list contains the word "parthipan", so it should print the message.
If you want to do this with a regexp, you can't use the in operator. Use re.search() instead. It works on a single string, not on a whole list, so loop over the elements:
for elt in a:
    if re.search(my_regex, elt):
        print("Matched")
        break # stop looking
Or in a more functional style:
if any(re.search(my_regex, elt) for elt in a):
    print("Matched")
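Wrapped up as a small Python 3 helper, the same approach might look like the sketch below. The function name contains_name is made up, and re.IGNORECASE is used here in place of the inline (?mis) flags from the question.

```python
import re

def contains_name(items, name):
    """Return True if any string in items contains name, ignoring case."""
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    return any(pattern.search(item) for item in items)

a = ["paul Parthipan", "paul", "sdds", "sdsdd"]
if contains_name(a, "parthipan"):
    print("Matched")
```

re.escape keeps any special characters in the name from being read as regex syntax.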
You don't need a regex for this; simply use any().
>>> a = ["paul Parthipan","paul","sdds","sdsdd"]
>>> last_name = "Parthipan".lower()
>>> if any(last_name in name.lower() for name in a):
...     print("Matched")
...
Matched
Why not:
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
if any(last_name in ai for ai in a):
    print("matched")
Also, what is this part for?
...
import re
my_regex = r"(?mis){0}".format(re.escape(last_name))
...
EDIT:
I can't see why you need a regex here. It would help if you gave some real input and the expected output. Here is a small example that handles it without one:
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[ai[1] for ai in full_names]
if any(last_name in ai for ai in last_names):
    print("matched")
But if the regex part is really needed, I can't see how you would find '(?mis)Parthipan' in 'Parthipan'. The simplest approach is the reverse direction: check whether 'Parthipan' is in '(?mis)Parthipan', like here:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[r"(?mis){0}".format(re.escape(ai[1])) for ai in full_names]
print(last_names)
if any(last_name in ai for ai in last_names):
    print("matched")
EDIT:
With a regex you have a few possibilities:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'jony-Parthipan','koala_Parthipan','Parthipan']
lastName = "Parthipan"
myRegex = r"(?mis){0}".format(re.escape(lastName))
strA=';'.join(a)
se = re.search(myRegex, strA)
ma = re.match(myRegex, strA)
fa = re.findall(myRegex, strA)
fi=[i.group() for i in re.finditer(myRegex, strA, flags=0)]
se = '' if se is None else se.group()
ma = '' if ma is None else ma.group()
print(se, 'match' if any(se) else 'no match')
print(ma, 'match' if any(ma) else 'no match')
print(fa, 'match' if any(fa) else 'no match')
print(fi, 'match' if any(fi) else 'no match')
Output (re.match finds nothing because it only matches at the start of the string; re.search is the simplest way to test whether the name occurs at all):
Parthipan match
no match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
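The difference between these calls can be shown on a shorter string. This is a small illustrative sketch; the variable names are made up and re.IGNORECASE replaces the inline flags used above.

```python
import re

joined = 'paul Parthipan;paul;sdds'
pattern = re.compile(re.escape('Parthipan'), re.IGNORECASE)

at_start = pattern.match(joined)    # None: match only tries position 0, where "paul" is
anywhere = pattern.search(joined)   # match object: search scans the whole string
all_hits = pattern.findall(joined)  # list of every non-overlapping occurrence

print(at_start)            # None
print(anywhere.start())    # 5
print(all_hits)            # ['Parthipan']
```

For a simple yes/no test, re.search is usually enough; findall and finditer are useful when every occurrence is needed.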

Regex with unicode and str

I have a list of regex and a replace function.
regex function
replacement_patterns = [(ur'\\u20ac', ur' euros'),(ur'\xe2\x82\xac', r' euros'),(ur'\b[eE]?[uU]?[rR]\b', r' euros'), (ur'\b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b',ur' \1 euros')]
class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.UNICODE | re.IGNORECASE), repl) for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
If I write the string as below:
string='730\u20ac.\r\n\n ropa surf ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac, reparaci\u00f3n. \r\n\n'
replacer = RegexpReplacer()
texto= replacer.replace(string)
I get perfect results.
But if I call the function while iterating over a JSON file I have just loaded, it does not work (no error, but no replacement).
What seems to happen is that when I call the function on the typed variable it receives a str, and when I call it from the JSON iteration it receives a unicode object.
My question is: why is my regex not working on the unicode input? Shouldn't it?
Maybe you need something like this:
import re
regex = re.compile("^http://.+", re.UNICODE)
And if you need more than one flag, you can combine them like this:
regex = re.compile("^http://.+", re.UNICODE | re.IGNORECASE)
For example:
>>> r = re.compile("^http://.+", re.UNICODE | re.IGNORECASE)
>>> r.match('HTTP://ыыы')
<_sre.SRE_Match object at 0x7f572455d648>
Is this the correct result?
>>> class RegexpReplacer(object):
...     def __init__(self, patterns=replacement_patterns):
...         self.patterns = [(re.compile(regex, re.UNICODE | re.IGNORECASE), repl) for (regex, repl) in patterns]
...     def replace(self, text):
...         s = text
...         for (pattern, repl) in self.patterns:
...             (s, count) = re.subn(pattern, repl, s)
...         return s
...
>>> string='730\u20ac.\r\n\n ropa surf ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac, reparaci\u00f3n. \r\n\n'
>>> replacer = RegexpReplacer()
>>> texto= replacer.replace(string)
>>> texto
u'730 euros.\r\n\n ropa surf ... 5,10 muy buen estado..... 170 euros\r\n\nPack 850 euros, reparaci\\u00f3n. \r\n\n'
If you want Unicode replacement patterns, you also need to be operating on Unicode strings. JSON should be returning Unicode as well.
Change the patterns as follows: remove the extra backslash in \\u20ac and drop the UTF-8 byte pattern (those bytes won't appear in a Unicode string). Also, you compile with re.IGNORECASE, so there is no need for [eE], etc.:
replacement_patterns = [(ur'\u20ac', ur' euros'),(ur'\be?u?r\b', r' euros'), (ur'\b([0-9]+)eu?r?o?s?\b',ur' \1 euros')]
Make the following a Unicode string (add u):
string = u'730\u20ac.\r\n\n ropa surf ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac, reparaci\u00f3n. \r\n\n'
Then it should operate on Unicode JSON as well.
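For comparison, in Python 3 every str is already Unicode and json.loads returns str, so the str/unicode mismatch described above does not come up. A minimal sketch under that assumption follows; the names EURO_SIGN, EURO_ABBREV and replace_euros are hypothetical, and the two patterns are simplified stand-ins for the list in the question rather than exact equivalents.

```python
import json
import re

EURO_SIGN = re.compile('\u20ac')                                     # the euro sign itself
EURO_ABBREV = re.compile(r'\b([0-9]+)eur(?:os?)?\b', re.IGNORECASE)  # e.g. 20eur, 20EUROS

def replace_euros(text):
    """Spell out euro amounts written with the euro sign or glued to 'eur'/'euro(s)'."""
    text = EURO_SIGN.sub(' euros', text)
    return EURO_ABBREV.sub(r'\1 euros', text)

record = json.loads('{"price": "730\\u20ac"}')   # JSON escape decoded to the euro sign
print(replace_euros(record['price']))            # 730 euros
```

The same function works on string literals and on values loaded from JSON, since both are plain str objects.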
