Match word if not followed or preceded by < or > - python

I am trying to not match words that are followed or preceded by an XML tag.
import re
strTest = "<random xml>hello this was successful price<random xml>"
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
c1 = c.group(1)
c2 = c.group(2)
if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
print c1
Result is:
xml
this
was
successful
xml
Wanted Result:
this
was
successful
I have been trying negative lookahead and negative lookbehind assertions. I'm not sure if this is the right approach, I would appreciate any help.

First, to answer your question directly:
I do it by examining each 'word' consisting of a sequence of characters containing (mainly) alphabetics or '<' or '>'. When the regex offers them to some_only I look for one of the latter two characters. If neither appears I print the 'word'.
>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
... if '<' in matchobj.group() or '>' in matchobj.group():
... pass
... else:
... print (matchobj.group())
... pass
...
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful
This works for your test string; however, as others have already mentioned, using a regex on xml will usually lead to many woes.
To use a more conventional approach I had to tidy away a couple of errors in that xml string, namely to change random xml to random_xml and to using a proper closing tag.
I prefer to use the lxml library.
>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')[1:-1]
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']

I'll give it a shot. Since we are already doing more than just a regex, put it into a list and drop the first/last items:
import re
strTest = "<random xml>hello this was successful price<random xml>"
thelist = []
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
c1 = c.group(1)
c2 = c.group(2)
if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
thelist.append(c1)
thelist = thelist[1:-1]
print (thelist)
result:
['this', 'was', 'successful']
I would personally try to parse the XML instead, but since you have this code already up this slight modification could do the trick.

A simple way to do it, with a list, but I am supposing the followed or preceded word by an XML tag and the proper tag are not separated by an space:
test = "<random xml>hello this was successful price<random xml>"
test = test.split()
new_test = []
for val in test:
if "<" not in val and ">" not in val:
new_test.append(val)
print(new_test)
The result will be:
['this', 'was', 'successful']

My soultion...
I don't see the need to use regex at all, you could solve it in a one-line list comprehension:
words = [w for w in test.split() if "<" not in w and ">" not in w]

Related

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
... i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
for name in names:
while not name[-1].isdigit():
name = name[:-1]
print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010

How to regex search strings in different order in python

I have a script that takes in an argument and tries to find a match using regex. On single values, I don't have any issues, but when I pass multiple words, the order matters. What can I do so that the regex returns no matter what the order of the supplied words are? Here is my example script:
import re
from sys import argv
data = 'some things other stuff extra words'
pattern = re.compile(argv[1])
search = re.search(pattern, data)
print search
if search:
print search.group(0)
print data
So based on my example, if I pass "some things" as an arg, then it matches, but if i pass "things some", it doesn't matches, and I would like it to. Optionally, I would like it to also return if either "some" or "things" match.
The argument passed could possibly be a regex
I think you want something like this:
search = filter(None, (re.search(arg, data) for arg in argv[1].split()))
Or
search = re.search('|'.join(argv[1].split()), data)
You can then check the search results, if len(search) == len(argv[1].split()), then it means all patterns matched, and if search is truthy, then it means at least one of them matched.
Ok, I think I got it, you can use a lookahead assertion like this:
>>> re.search('(?=.*thing)(?=.*same)', data)
You can obviously programatically build such regex:
re.search(''.join('(?=.*{})'.format(arg) for arg in argv[1].split()), data)
I think it would be better to just create several regexes and match each of them against the string. If any of them matches, you return True.
If you are just trying to match constant strings, the in operator is enough:
'some' in data or 'things' in data
You could also just split the data text into sublists, and check if the ordering/reverse ordering of search exists in it:
import re
data = 'some, things other stuff extra words blah.'
search = "things, some"
def search_text(data, search):
data_words = re.compile('\w+').findall(data)
# ['some', 'things', 'other', 'stuff', 'extra', 'words', 'blah']
search_words = re.compile('\w+').findall(search)
# ['things', 'some']
len_search = len(search_words)
candidates = [data_words[i:i+len_search] for i in range(0, len(data_words)-1, len_search-1)]
# [['some', 'things'], ['things', 'other'], ['other', 'stuff'], ['stuff', 'extra'], ['extra', 'words'], ['words', 'blah']]
return search_words in candidates or search_words[::-1] in candidates
print(search_text(data, search))
Which Outputs:
True

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is for any character in (in)(like) it is splitting the string s at that character and giving me first slice.(as after "cou" it finds one matching char i:e n). It's happening for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char as 'e' after p.
So i want some advice how to treat (in)(like) as words not as characters , when splitting occurs/matters.
please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
You code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
if fieldObj:
print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

python split string on multiple delimeters without regex

I have a string that I need to split on multiple characters without the use of regular expressions. for example, I would need something like the following:
>>>string="hello there[my]friend"
>>>string.split(' []')
['hello','there','my','friend']
is there anything in python like this?
If you need multiple delimiters, re.split is the way to go.
Without using a regex, it's not possible unless you write a custom function for it.
Here's such a function - it might or might not do what you want (consecutive delimiters cause empty elements):
>>> def multisplit(s, delims):
... pos = 0
... for i, c in enumerate(s):
... if c in delims:
... yield s[pos:i]
... pos = i + 1
... yield s[pos:]
...
>>> list(multisplit('hello there[my]friend', ' []'))
['hello', 'there', 'my', 'friend']
Solution without regexp:
from itertools import groupby
sep = ' []'
s = 'hello there[my]friend'
print [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
I've just posted an explanation here https://stackoverflow.com/a/19211729/2468006
A recursive solution without use of regex. Uses only base python in contrast to the other answers.
def split_on_multiple_chars(string_to_split, set_of_chars_as_string):
# Recursive splitting
# Returns a list of strings
s = string_to_split
chars = set_of_chars_as_string
# If no more characters to split on, return input
if len(chars) == 0:
return([s])
# Split on the first of the delimiter characters
ss = s.split(chars[0])
# Recursive call without the first splitting character
bb = []
for e in ss:
aa = split_on_multiple_chars(e, chars[1:])
bb.extend(aa)
return(bb)
Works very similarly to pythons regular string.split(...), but accepts several delimiters.
Example use:
print(split_on_multiple_chars('my"example_string.with:funny?delimiters', '_.:;'))
Output:
['my"example', 'string', 'with', 'funny?delimiters']
If you're not worried about long strings, you could force all delimiters to be the same using string.replace(). The following splits a string by both - and ,
x.replace('-', ',').split(',')
If you have many delimiters you could do the following:
def split(x, delimiters):
for d in delimiters:
x = x.replace(d, delimiters[0])
return x.split(delimiters[0])
re.split is the right tool here.
>>> string="hello there[my]friend"
>>> import re
>>> re.split('[] []', string)
['hello', 'there', 'my', 'friend']
In regex, [...] defines a character class. Any characters inside the brackets will match. The way I've spaced the brackets avoids needing to escape them, but the pattern [\[\] ] also works.
>>> re.split('[\[\] ]', string)
['hello', 'there', 'my', 'friend']
The re.DEBUG flag to re.compile is also useful, as it prints out what the pattern will match:
>>> re.compile('[] []', re.DEBUG)
in
literal 93
literal 32
literal 91
<_sre.SRE_Pattern object at 0x16b0850>
(Where 32, 91, 93, are the ascii values assigned to , [, ])

How can I get part of regex match as a variable in python?

In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parenthesis in a string like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
import re
data = "some input data"
m = re.search("some (input) data", data)
if m: # "if match was successful" / "if matched"
print m.group(1)
Check the docs for more.
there's no need for regex. think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
>>>
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print match.group(1)
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
print m.group(1)
In trying to convert a Perl program to Python that parses function names out of modules, I ran into this problem, I received an error saying "group" was undefined. I soon realized that the exception was being thrown because p.match / p.search returns 0 if there is not a matching string.
Thus, the group operator cannot function on it. So, to avoid an exception, check if a match has been stored and then apply the group operator.
import re
filename = './file_to_parse.py'
p = re.compile('def (\w*)') # \w* greedily matches [a-zA-Z0-9_] character set
for each_line in open(filename,'r'):
m = p.match(each_line) # tries to match regex rule in p
if m:
m = m.group(1)
print m

Categories