Splitting string with lookahead/lookbehind assertions for empty string match [duplicate]

This question already has an answer here:
python re.split lookahead pattern
(1 answer)
Closed 6 years ago.
I'm trying to split and rename some ugly-looking variable names (as an example):
In[1]: import re
ugly_names = ['some-Ugly-Name', 'ugly:Case:Style', 'uglyNamedFunction']
new_names = []
In[2]: patt = re.compile(r'(?<=[a-z])[\-:]?(?=[A-Z])')
In[3]: for name in ugly_names:
           loc_name = patt.split(name)
           new_names.append("_".join(s.lower() for s in loc_name))
       print(new_names)
Out[3]: ['some_ugly_name', 'ugly_case_style', 'uglynamedfunction']
What's wrong with my pattern? Why doesn't it match on an empty string, or am I missing something?
P.S.: Is it possible to split on empty strings with Python's regex, or should I use some other function and .groups()?

Not a direct answer to the question, but just an alternative way - use the inflection library (have to handle : separately though):
>>> import inflection
>>>
>>> [inflection.underscore(name.replace(":", "_")) for name in ugly_names]
['some_ugly_name', 'ugly_case_style', 'ugly_named_function']
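As for the original question: before Python 3.7, re.split did not split on empty matches, which is why 'uglyNamedFunction' came back in one piece. A minimal sketch, assuming the same pattern as in the question, that avoids the problem by using re.sub instead of split (the names BOUNDARY and to_snake_case are made up for illustration):

```python
import re

# Same pattern as in the question: an optional '-' or ':' between a
# lowercase letter and an uppercase letter (possibly an empty match).
BOUNDARY = re.compile(r'(?<=[a-z])[\-:]?(?=[A-Z])')


def to_snake_case(name):
    # re.sub replaces empty matches too, so no split is needed.
    return BOUNDARY.sub('_', name).lower()


ugly_names = ['some-Ugly-Name', 'ugly:Case:Style', 'uglyNamedFunction']
new_names = [to_snake_case(name) for name in ugly_names]
print(new_names)
```

On Python 3.7 and later, the patt.split(name) approach from the question should also split at the empty matches.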

Related

how to remove parentheses and string from a string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I've been trying to find a way to solve this, but it seems that I'm poor at regex ;)
I need to remove "(WindowsPath" and ")" from the strings in a list:
x= ["(WindowsPath('D:/test/1_birds_bp.png'),WindowsPath('D:/test/1_eagle_mp.png'))", "(WindowsPath('D:/test/2_reptiles_bp.png'),WindowsPath('D:/test/2_crocodile_mp.png'))"]
So I tried
import re
cleaned_x = [re.sub("(?<=WindowsPath\(').*?(?='\))",'',a) for a in x]
outputs
["(WindowsPath(''),WindowsPath(''))", "(WindowsPath(''),WindowsPath(''))"]
What I need is:
cleaned_x= [('D:/test/1_birds_bp.png','D:/test/1_eagle_mp.png'), ('D:/test/2_reptiles_bp.png','D:/test/2_crocodile_mp.png')]
Basically, tuples in a list.

You can accomplish this by using re.findall like this:
>>> cleaned_x = [tuple(re.findall(r"[A-Z]:/[^']+", a)) for a in x]
>>> cleaned_x
[('D:/test/1_birds_bp.png', 'D:/test/1_eagle_mp.png'), ('D:/test/2_reptiles_bp.png',
'D:/test/2_crocodile_mp.png')]
>>>
Hope it helps.

Perhaps you could use capturing groups? For instance:
import re
re_winpath = re.compile(r'^\(WindowsPath\(\'(.*)\'\)\,WindowsPath\(\'(.*)\'\)\)$')
def extract_pair(s):
    m = re_winpath.match(s)
    if m is None:
        raise ValueError(f"cannot extract pair from string: {s}")
    return m.groups()
pairs = list(map(extract_pair, x))

Here's my take.
It's not pretty, and I did it in two steps so as not to make regexp spaghetti. You could turn it into a list comprehension if you like, but it should work:
for a in x:
    a = re.sub(r'(\()?WindowsPath', '', a)
    a = re.sub(r'\)$', '', a)
    print(a)
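The list comprehension mentioned above could look roughly like the sketch below. It is a variation rather than the exact two-step code: it only strips the WindowsPath name and then lets ast.literal_eval turn the remaining text into a real tuple. This assumes the paths contain no quotes or backslashes and that "WindowsPath" never appears inside a path.

```python
import ast
import re

x = [
    "(WindowsPath('D:/test/1_birds_bp.png'),WindowsPath('D:/test/1_eagle_mp.png'))",
    "(WindowsPath('D:/test/2_reptiles_bp.png'),WindowsPath('D:/test/2_crocodile_mp.png'))",
]

# Dropping the name leaves text such as "(('a'),('b'))", which is a valid
# Python literal for a tuple of two strings.
cleaned_x = [ast.literal_eval(re.sub(r"WindowsPath", "", a)) for a in x]
print(cleaned_x)
```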

how to find more than one match with a regular expression? [duplicate]

This question already has answers here:
regexes: How to access multiple matches of a group? [duplicate]
(2 answers)
Closed 3 years ago.
I have a string like this:
to_search = "example <a>first</a> asdqwe <a>second</a>"
and I want to find both matches between the tags, like this:
list = ["first","second"]
I know that when searching for one match I should use this code:
import re
if to_search.find("<a>") > -1:
    result = re.search('<a>(.*?)</a>', to_search)
    s = result.group(1)
    print(s)
but that only prints:
first
I tried result.group(2) and result.group(0), but I still don't get the second match.
How can I make a list of all matches?

Just use:
import re
to_search = "example <a>first</a> asdqwe <a>second</a>"
matches = re.findall(r'<a>(.*?)</a>', to_search)
print(matches)
OUTPUT
['first', 'second']

It is better to use an HTML parser than a regex, but otherwise change re.search to re.findall.
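For the HTML parser route, a minimal sketch using the standard library's html.parser could look like this. The class name AnchorTextCollector is made up for illustration, and the sketch assumes the <a> tags are not nested.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self._inside_a = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_a = True
            self.texts.append("")  # start a new entry for this element

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_a = False

    def handle_data(self, data):
        if self._inside_a:
            self.texts[-1] += data


to_search = "example <a>first</a> asdqwe <a>second</a>"
collector = AnchorTextCollector()
collector.feed(to_search)
collector.close()
print(collector.texts)
```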

import re
to_search = "example <a>first</a> asdqwe <a>second</a>"
for match in re.finditer(r"<a>(.*?)</a>", to_search):
    captured_group = match.group(1)
    # do something with captured group

Python - an extremely odd behavior of function lstrip [duplicate]

This question already has answers here:
Python string.strip stripping too many characters [duplicate]
(3 answers)
Closed 6 years ago.
I have encountered a very odd behavior of built-in function lstrip.
I will explain with a few examples:
print('BT_NAME_PREFIX=MUV'.lstrip('BT_NAME_PREFIX=')) # UV
print('BT_NAME_PREFIX=NUV'.lstrip('BT_NAME_PREFIX=')) # UV
print('BT_NAME_PREFIX=PUV'.lstrip('BT_NAME_PREFIX=')) # UV
print('BT_NAME_PREFIX=SUV'.lstrip('BT_NAME_PREFIX=')) # SUV
print('BT_NAME_PREFIX=mUV'.lstrip('BT_NAME_PREFIX=')) # mUV
As you can see, the function trims one additional character sometimes.
I tried to model the problem, and noticed that it persisted if I:
Changed BT_NAME_PREFIX to BT_NAME_PREFIY
Changed BT_NAME_PREFIX to BT_NAME_PREFIZ
Changed BT_NAME_PREFIX to BT_NAME_PREF
Further attempts have made it even more weird:
print('BT_NAME=MUV'.lstrip('BT_NAME=')) # UV
print('BT_NAME=NUV'.lstrip('BT_NAME=')) # UV
print('BT_NAME=PUV'.lstrip('BT_NAME=')) # PUV - different than before!!!
print('BT_NAME=SUV'.lstrip('BT_NAME=')) # SUV
print('BT_NAME=mUV'.lstrip('BT_NAME=')) # mUV
Could someone please explain what on earth is going on here?
I know I might as well just use array-slicing, but I would still like to understand this.
Thanks

You're misunderstanding how lstrip works. It treats the characters you pass in as a bag and it strips characters that are in the bag until it finds a character that isn't in the bag.
Consider:
'abc'.lstrip('ba') # 'c'
It is not removing a substring from the start of the string. To do that, you need something like:
if s.startswith(prefix):
    s = s[len(prefix):]
e.g.:
>>> s = 'foobar'
>>> prefix = 'foo'
>>> if s.startswith(prefix):
...     s = s[len(prefix):]
...
>>> s
'bar'
Or, I suppose you could use a regular expression:
>>> s = 'foobar'
>>> import re
>>> re.sub('^foo', '', s)
'bar'
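To see the two behaviours side by side, here is a small sketch using the strings from the question. The helper name strip_prefix is made up for illustration; on Python 3.9 and later, the built-in str.removeprefix should do the same job.

```python
def strip_prefix(s, prefix):
    """Remove prefix from the start of s only if s really starts with it."""
    if s.startswith(prefix):
        return s[len(prefix):]
    return s


prefix = "BT_NAME_PREFIX="
for value in ("BT_NAME_PREFIX=MUV", "BT_NAME_PREFIX=SUV"):
    bag_result = value.lstrip(prefix)            # strips any leading chars found in prefix
    prefix_result = strip_prefix(value, prefix)  # strips the exact prefix once
    print(bag_result, prefix_result)
```

The first line prints UV MUV because 'M' is one of the characters in the prefix; the second prints SUV SUV because 'S' is not.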

The argument given to lstrip is a set of characters to remove from the left of the string, one character at a time. The argument is not treated as a phrase; only its individual characters matter.
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed. If
chars is given and not None, remove characters in chars instead. If
chars is unicode, S will be converted to unicode before stripping
You could solve this in a flexible way using regular expressions (the re module):
>>> import re
>>> re.sub('^BT_NAME_PREFIX=', '', 'BT_NAME_PREFIX=MUV')
'MUV'

Python - Most elegant way to extract a substring, being given left and right borders [duplicate]

This question already has answers here:
How to extract the substring between two markers?
(22 answers)
Closed 4 years ago.
I have a string - Python :
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
Expected output is :
"Atlantis-GPS-coordinates"
I know that the expected output is ALWAYS surrounded by "/bar/" on the left and "/" on the right :
"/bar/Atlantis-GPS-coordinates/"
Proposed solution would look like :
a = string.find("/bar/")
b = string.find("/",a+5)
output=string[a+5:b]
This works, but I don't like it.
Does anyone know a more elegant function or trick?

You can use split:
>>> string.split("/bar/")[1].split("/")[0]
'Atlantis-GPS-coordinates'
Some efficiency from adding a max split of 1 I suppose:
>>> string.split("/bar/", 1)[1].split("/", 1)[0]
'Atlantis-GPS-coordinates'
Or use partition:
>>> string.partition("/bar/")[2].partition("/")[0]
'Atlantis-GPS-coordinates'
Or a regex:
>>> re.search(r'/bar/([^/]+)', string).group(1)
'Atlantis-GPS-coordinates'
Depends on what speaks to you and your data.
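One difference worth knowing when choosing: if "/bar/" is missing, the split version raises IndexError, the regex version raises AttributeError (re.search returns None), and the partition version quietly returns an empty string. A minimal sketch, assuming you would rather get None back in that case (the function name between is made up for illustration):

```python
def between(text, left="/bar/", right="/"):
    """Return the text between left and the next right, or None if either is missing."""
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None
    value, found_right, _ = rest.partition(right)
    return value if found_right else None


string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
print(between(string))
print(between("/no/marker/here/"))
```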

What you have isn't all that bad. I'd write it as:
start = string.find('/bar/') + 5
end = string.find('/', start)
output = string[start:end]
as long as you know that /bar/WHAT-YOU-WANT/ is always going to be present. Otherwise, I would reach for the regular expression knife:
>>> import re
>>> PATTERN = re.compile('^.*/bar/([^/]*)/.*$')
>>> s = '/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/'
>>> match = PATTERN.match(s)
>>> match.group(1)
'Atlantis-GPS-coordinates'

import re
pattern = '(?<=/bar/).+?/'
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = re.search(pattern, string)
print string[result.start():result.end() - 1]
# "Atlantis-GPS-coordinates"
That is a Python 2.x example. Here is what the pattern does:
1. (?<=/bar/) is a lookbehind: the rest of the pattern matches only where /bar/ immediately precedes it
2. '.+?/' matches as few characters as possible up to and including the next '/' char (the code drops that trailing '/' with result.end() - 1)
Hope that helps some.
If you need to run this kind of search many times, compiling the pattern with re.compile can help performance, but if you only need to do it once don't bother.

Using re (slower than other solutions):
>>> import re
>>> string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
>>> re.search(r'(?<=/bar/)[^/]+(?=/)', string).group()
'Atlantis-GPS-coordinates'

python re.sub, only replace part of match [duplicate]

This question already has answers here:
Why does re.sub replace the entire pattern, not just a capturing group within it?
(4 answers)
Closed 2 years ago.
I am very new to Python.
I need to match all cases with one regular expression and do a replacement. This is a sample substring --> desired result:
<cross_sell id="123" sell_type="456"> --> <cross_sell>
I am trying to do this in my code:
myString = re.sub(r'\<[A-Za-z0-9_]+(\s[A-Za-z0-9_="\s]+)', "", myString)
Instead of replacing everything after <cross_sell, it replaces everything and just returns '>'.
Is there a way for re.sub to replace only the capturing group instead of the entire pattern?

You can use substitution groups:
>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(\<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'
Notice that I put the first part (the one you want to keep) in parentheses to make it a group, and then kept it in the output by using the "\1" backreference (first group) in the replacement string.
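The same idea can also be written with a replacement function instead of a "\1" string, which may be easier to extend later. A minimal sketch, assuming the pattern from the answer above (the names TAG_WITH_ATTRS and strip_attributes are made up for illustration):

```python
import re

# Group 1 is the part to keep ('<' plus the tag name);
# group 2 is the attribute text to drop.
TAG_WITH_ATTRS = re.compile(r'(<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)')


def strip_attributes(text):
    # The callable receives each match object and returns its replacement.
    return TAG_WITH_ATTRS.sub(lambda m: m.group(1), text)


result = strip_attributes('<cross_sell id="123" sell_type="456">')
print(result)
```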

You can use a group reference to match the first word and a negated character class to match the rest of the string between <> :
>>> s='<cross_sell id="123" sell_type="456">'
>>> re.sub(r'(\w+)[^>]+',r'\1',s)
'<cross_sell>'
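For ASCII text, \w is equivalent to [A-Za-z0-9_]. With str patterns on Python 3 it also matches other Unicode word characters unless the re.ASCII flag is used.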

Since the input data is XML, you'd better parse it with an XML parser.
Built-in xml.etree.ElementTree is one option:
>>> import xml.etree.ElementTree as ET
>>> data = '<cross_sell id="123" sell_type="456"></cross_sell>'
>>> cross_sell = ET.fromstring(data)
>>> cross_sell.attrib = {}
>>> ET.tostring(cross_sell)
'<cross_sell />'
lxml.etree is another option.

The code below, for Python 3.6, does not use a group. The leading \s+ removes the whitespace before each attribute as well:
import re
test = '<cross_sell id="123" sell_type="456">'
resp = re.sub(r'\s+\w+="\w+"', '', test)
print(resp)
<cross_sell>
