Regular expression to match a specific pattern - python

I have the following string:
s = "<X> First <Y> Second"
and I can match any text right after <X> and <Y> (in this case "First" and "Second"). This is how I already did it:
import re
s = "<X> First <Y> Second"
pattern = r'\<([XxYy])\>([^\<]+)' # lower and upper case X/Y will be matched
items = re.findall(pattern, s)
print(items)
>>> [('X', ' First '), ('Y', ' Second')]
What I am now trying to match is the case without <>:
s = "X First Y Second"
I tried this:
pattern = r'([XxYy]) ([^\<]+)'
>>> [('X', ' First Y Second')]
Unfortunately it's not producing the right result. What am I doing wrong? I want to match X or x or Y or y PLUS one whitespace (for instance "X "). How can I do that?
EDIT: this is a possible string too:
s = "<X> First one <Y> Second <X> More <Y> Text"
Output should be:
>>> [('X', ' First one '), ('Y', ' Second '), ('X', ' More '), ('Y', ' Text')]
EDIT2:
pattern = r'([XxYy]) ([^ ]+)'
s = "X First text Y Second"
produces:
[('X', 'First'), ('Y', 'Second')]
but it should be:
[('X', 'First text'), ('Y', 'Second')]

How about something like: <?[XY]>? ([^<>XY$ ]+)
Example in javascript:
const re = /<?[XY]>? ([^<>XY$ ]+)/ig
console.info('<X> First <Y> Second'.match(re))
console.info('X First Y Second'.match(re))

If you know which whitespace char to match, you can just add it to your expression.
If you want any whitespace to match, you can use \s
pattern = r'\<([XxYy])\>([^\<]+)'
would then be
pattern = r'\<([XxYy])\>\s([^\<]+)'
Always keep in mind that the expression within the () is what will be returned as your result.
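To see the effect of the added \s, here is a small sketch that runs both patterns on the sample string from the question. The variable names without_ws and with_ws are only illustrative.

```python
import re

s = "<X> First <Y> Second"

# Original pattern: the whitespace after the tag is part of the capture.
without_ws = re.findall(r'\<([XxYy])\>([^\<]+)', s)

# With \s after the tag: one whitespace character is consumed outside
# the group, so the captured text no longer starts with it.
with_ws = re.findall(r'\<([XxYy])\>\s([^\<]+)', s)

print(without_ws)  # [('X', ' First '), ('Y', ' Second')]
print(with_ws)     # [('X', 'First '), ('Y', 'Second')]
```

Note that the trailing space before the next tag is still captured, since [^\<]+ runs up to the next <.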

Assuming that the whitespace token to match is a single space character, the pattern is:
pattern = r'([XxYy]) ([^ ]+)'

So I came up with this solution:
pattern = r"([XxYy]) (.*?)(?= [XxYy] |$)"
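A quick check of this pattern against the strings from the question (without the angle brackets) might look like the sketch below. The names single and multi are just for illustration.

```python
import re

# Marker letter, one space, then lazily everything up to the next
# " X " / " Y " marker (checked with a lookahead) or the end of the string.
pattern = re.compile(r"([XxYy]) (.*?)(?= [XxYy] |$)")

single = pattern.findall("X First text Y Second")
multi = pattern.findall("X First one Y Second X More Y Text")

print(single)  # [('X', 'First text'), ('Y', 'Second')]
print(multi)   # [('X', 'First one'), ('Y', 'Second'), ('X', 'More'), ('Y', 'Text')]
```

One caveat: the pattern does not require the marker to be a standalone word, so a word ending in x or y followed by a space (for instance "Say X hello") would also be treated as a marker.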

Related

selecting group of numbers in a string in python

I have a list of strings:
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
And I basically want another list:
output_list = ['789', '000', '934', '464']
It is always going to be the third group of numbers, which will always be followed by _A or _B.
so far I have:
import re
m = re.search('_(.+?)_A', text)
if m:
    found = m.group(1)
But I keep getting something like: 456_789
Just use a simple list comprehension for this:
ans = [i.split("_")[-2] for i in str_list]
If you only want to match digits followed by an underscore and an uppercase char, you can match the digits and assert the underscore and uppercase char directly to the right.
To match only A or B, use [AB] else use [A-Z] to match that range.
\d+(?=_[AB])
You can use re.search to find the first occurrence in the string.
import re
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
str_list = [re.search(r'\d+(?=_[AB])', s).group() for s in str_list]
print(str_list)
Output
['789', '000', '934', '464']
Or use a capturing group instead, which also matches the leading _ (as your own pattern did) and is therefore a bit more precise:
str_list = [re.search(r'_(\d+)_[AB]', s).group(1) for s in str_list]

Get string of either side of a desired letter

How do I get the characters on either side of a given letter, e.g. in email#gmail.com?
If the desired letter was "." it would print "l" and "c" from "gmail" and "com"
I do not think that using [] to separate the letter would work as I think the algorithm is much more complicated
Use index().
def getNeighbors(mystring, desired):
    index = mystring.index(desired)
    return mystring[index-1], mystring[index+1]
mystring = 'email#gmail.com'
desired = '.'
print(getNeighbors(mystring, desired)) # >>> ('l', 'c')
A couple notes:
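This will return the characters around the first instance of '.'. It also does not perform bounds checking: a match at index 0 wraps around to the last character, and a match at the very end raises IndexError. Finally, it does not check that the character actually exists in the string; index() raises ValueError if it does not.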
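If those edge cases matter, a variant along the following lines could cover them. This is only a sketch: the name get_neighbors and the choice to return None for a missing side or a missing character are assumptions, not part of the answer above.

```python
def get_neighbors(text, desired):
    """Return (left, right) characters around the first occurrence of `desired`.

    A side is None when `desired` sits at that edge of `text`.
    The function returns None when `desired` does not occur at all.
    """
    index = text.find(desired)  # find() returns -1 instead of raising
    if index == -1:
        return None
    end = index + len(desired)
    left = text[index - 1] if index > 0 else None
    right = text[end] if end < len(text) else None
    return left, right


print(get_neighbors('email#gmail.com', '.'))  # ('l', 'c')
print(get_neighbors('email#gmail.com', 'e'))  # (None, 'm')
print(get_neighbors('email#gmail.com', '?'))  # None
```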
One possible solution using re module:
s = 'email#gmail.com'
l = '.'
import re
print( re.findall(r'(.)?{}(.)?'.format(re.escape(l)), s))
Prints:
[('l', 'c')]
EDIT: To get only first match you can use re.search:
s = 'email#gmail.com'
l = 'l'
import re
print( re.search(r'(.)?{}(.)?'.format(re.escape(l)), s).groups() )
Prints:
('i', '#')

Finding element in a string and smarter manipulation

I have a list of characters
a = ["s", "a"]
I have some words.
b = "asp"
c= "lat"
d = "kasst"
I know that the characters in the list can appear only once or in linear order(or at most on small set can appear in the bigger one).
I would like to split my words by putting the elements in a in the middle, and the rest on the left or on the right (and put a "=" if there is nothing)
so b = ["*", "as", "p"]
If a word contains a longer run of these characters:
d = ["k", "ass", "t"]
I know that the combinations can be at most of length 4.
So I have divided the possible combinations depending on the length:
import itertools
c4 = [''.join(i) for i in itertools.product(a, repeat = 4)]
c3 = [''.join(i) for i in itertools.product(a, repeat = 3)]
c2 = [''.join(i) for i in itertools.product(a, repeat = 2)]
c1 = [''.join(i) for i in itertools.product(a, repeat = 1)]
For each c, starting with the greater
For simplicity, let's say I start with c3 in this case and not with length 4.
I have to do this with a lot of data.
Is there a way to simplify the code ?
You can do something similar using a regular expression:
>>> import re
>>> p = re.compile(r'([sa]{1,4})')
p matches the characters 's' or 'a' repeated between 1 and 4 times.
To split a given string at this pattern, use p.split. The use of capturing parentheses in the pattern leads to the pattern itself being included in the result.
>>> p.split('asp')
['', 'as', 'p']
>>> p.split('lat')
['l', 'a', 't']
>>> p.split('kasst')
['k', 'ass', 't']
Use regex ?
import re
a = ["s", "a"]
text = "kasst"
pattern = re.compile("[" + "".join(a) + "]{1,4}")
match = pattern.search(text)
parts = [text[:match.start()], text[match.start():match.end()], text[match.end():]]
parts = [part if part else "*" for part in parts]
However, note that this won't handle the case when there is no match on the elements in a
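One way to cover the no-match case is to check the result of search() before slicing. The sketch below wraps the same idea in a function. The name split_around, the filler parameter, and returning None when nothing matches are assumptions made for illustration, and chars is assumed to be non-empty.

```python
import re


def split_around(text, chars, filler="*"):
    """Split `text` into [left, run, right] around the first run of `chars`.

    Empty parts are replaced by `filler`. Returns None when no character
    from `chars` occurs in `text`.
    """
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    char_class = "[" + "".join(re.escape(c) for c in chars) + "]"
    match = re.search(char_class + "{1,4}", text)
    if match is None:
        return None
    parts = [text[:match.start()], match.group(), text[match.end():]]
    return [part if part else filler for part in parts]


a = ["s", "a"]
print(split_around("asp", a))    # ['*', 'as', 'p']
print(split_around("kasst", a))  # ['k', 'ass', 't']
print(split_around("xyz", a))    # None
```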
I would do a regular expression to simplify the matching.
import re
splitters = ''.join(a)
pattern = re.compile("([^%s]*)([%s]+)([^%s]*)" % (splitters, splitters, splitters))
words = [v if v else '=' for v in pattern.match(s).groups() ]
This doesn't allow the splitter characters in the first or last group, so not all strings will match correctly (and a failed match will throw an exception, because pattern.match returns None). You can allow them if you want. Feel free to modify the regular expression to better match what you want it to do.
Also you only need to run the re.compile once, not for every string you are trying to match.

Simplifying the extraction of particular string patterns with a multiple if-else and split()

Given a string like this:
>>> s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
First I want to split the string by underscores, i.e.:
>>> s.split('_')
['X/NOUN/dobj>',
'hold/VERB/ROOT',
'<membership/NOUN/dobj',
'<with/ADP/prep',
'<Y/PROPN/pobj',
'>,/PUNCT/punct']
We assume that the underscore is solely used as the delimiter and never exist as part of the substring we want to extract.
Then I need to check whether each of these "nodes" in my split list above starts or ends with a '>' or '<', then remove it and put the appropriate bracket at the end of the sublist, something like:
result = []
nodes = s.split('_')
for node in nodes:
    if node.endswith('>'):
        result.append( node[:-1].split('/') + ['>'] )
    elif node.startswith('>'):
        result.append( node[1:].split('/') + ['>'] )
    elif node.startswith('<'):
        result.append( node[1:].split('/') + ['<'] )
    elif node.endswith('<'):
        result.append( node[:-1].split('/') + ['<'] )
    else:
        result.append( node.split('/') + ['-'] )
And if it doesn't start or end with an angle bracket, then we append - to the end of the sublist.
[out]:
[['X', 'NOUN', 'dobj', '>'],
['hold', 'VERB', 'ROOT', '-'],
['membership', 'NOUN', 'dobj', '<'],
['with', 'ADP', 'prep', '<'],
['Y', 'PROPN', 'pobj', '<'],
[',', 'PUNCT', 'punct', '>']]
Given the original input string, is there a less verbose way to get to the result? Maybe with regex and groups?
s = 'X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct'
def get_sentinal(node):
    if not node:
        return '-'
    # Assuming the node won't contain both '<' and '>' at a same time
    for index in [0, -1]:
        if node[index] in '<>':
            return node[index]
    return '-'
results = [
    node.strip('<>').split('/') + [get_sentinal(node)]
    for node in s.split('_')
]
print(results)
This does not make it significantly shorter, but personally I'd think it's somehow a little bit cleaner.
Use this:
import re
s_split = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct".split('_')
for i, text in enumerate(s_split):
    Left, Mid, Right = re.search('^([<>]?)(.*?)([<>]?)$', text).groups()
    s_split[i] = Mid.split('/') + [Left+Right or '-']
print(s_split)
I can't find a possible answer for a shorter one.
Use the short-circuiting "or" to shorten the code. Example: print(None or "a") will print a. Also use a regex to detect the occurrence of < and > easily.
Yes, although it's not pretty:
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
import re
out = []
for part in s.split('_'):
    Left, Mid, Right = re.search('^([<>]|)(.*?)([<>]|)$', part).groups()
    tail = ['-'] if not Left+Right else [Left+Right]
    out.append(Mid.split('/') + tail)
print(out)
Try online: https://repl.it/Civg
It relies on two main things:
a regex pattern which always makes three groups ()()() where the edge groups only look for characters <, > or nothing ([<>]|), and the middle matches everything (non-greedy) (.*?). The whole thing is anchored at the start (^) and end ($) of the string so it consumes the whole input string.
Assuming that you will never have angles on both ends of the string, then the combined string Left+Right will either be an empty string plus the character to put at the end, one way or the other, or a completely empty string indicating a dash is required.
Instead of my other answer with regexes, you can drop a lot of lines and a lot of slicing, if you know that string.strip('<>') will strip either character from both ends of the string, in one move.
This code is about halfway between your original and my regex answer in linecount, but is more readable for it.
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
result = []
for node in s.split('_'):
    if node.startswith('>') or node.startswith('<'):
        tail = node[0]
    elif node.endswith('>') or node.endswith('<'):
        tail = node[-1]
    else:
        tail = '-'
    result.append( node.strip('<>').split('/') + [tail])
print(result)
Try online: https://repl.it/Civr
Edit: how much less verbose do you want to get?
result = [node.strip('<>').split('/') + [(''.join(char for char in node if char in '<>') + '-')[0]] for node in s.split('_')]
print(result)
This is quite neat, you don't have to check which side the <> is on, or whether it's there at all. One step strip()s either angle bracket whichever side it's on, the next step filters only the angle brackets out of the string (whichever side they're on) and adds the dash character. This is either a string starting with any angle bracket from either side or a single dash. Take char 0 to get the right one.
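For readability, the same steps can be unpacked into a small helper. This is only a sketch of the one-liner above. The name parse_node is made up for illustration, and it still assumes a node never has angle brackets on both ends.

```python
def parse_node(node):
    # Keep only the angle brackets found anywhere in the node.
    brackets = ''.join(char for char in node if char in '<>')
    # First bracket if there is one, otherwise the appended dash.
    tail = (brackets + '-')[0]
    # strip('<>') removes either bracket from both ends in one call.
    return node.strip('<>').split('/') + [tail]


s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
result = [parse_node(node) for node in s.split('_')]
print(result)
```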
Even shorter with a list comprehension and some regex magic:
import re
s = "X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct"
rx = re.compile(r'([<>])|/')
items = [list(filter(None, match)) \
for item in s.split('_') \
for match in [rx.split(item)]]
print(items)
# [['X', 'NOUN', 'dobj', '>'], ['hold', 'VERB', 'ROOT'], ['<', 'membership', 'NOUN', 'dobj'], ['<', 'with', 'ADP', 'prep'], ['<', 'Y', 'PROPN', 'pobj'], ['>', ',', 'PUNCT', 'punct']]
Explanation:
The code splits the items by _, splits it again with the help of the regular expression rx and filters out empty elements in the end.
See a demo on ideone.com.
I did not use regex and groups, but this can be a shorter solution.
>>> result=[]
>>> nodes=['X/NOUN/dobj>','hold/VERB/ROOT','<membership/NOUN/dobj',
'<with/ADP/prep','<Y/PROPN/pobj','>,/PUNCT/punct']
>>> for node in nodes:
...     nd=node.replace(">",("/>" if node.endswith(">") else ">/"))
...     nc=nd.replace("<",("/<" if nd.endswith("<") else "</"))
...     result.append(nc.split("/"))
>>> nres=[inner for outer in result for inner in outer] #nres used to join all result at single array. If you dont need single array you can use result.

Regular expression that takes <...> as one item in "foo bar <hello world> and so on" (Goal: Simple music/lilypond parsing)

I am using the re module in Python(3) and want to substitute (re.sub(regex, replace, string)) a string in the following format
"foo <bar e word> f ga <foo b>"
to
"#foo <bar e word> #f #ga <foo b>"
or even
"#foo #<bar e word> #f #ga #<foo b>"
But I can't isolate single words from word boundaries within a <...> construct.
Help would be nice!
P.S 1
The whole story is a musical one:
I have strings in the Lilypond format (or better, a subset of the very simple core format, just notes and durations) and want to convert them to python pairs int(duration),list(of pitch strings). Performance is not important so I can convert them back and forth, iterate with python lists, split strings and join them again etc.
But for the above problem I did not find an answer.
Source String
"c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
should result in
[
(4, ["c'"]),
(8, ["d"]),
(16, ["e'", "g'"]),
(4, ["fis'"]),
(0, ["a,,"]),
(0, ["g", "b'"]),
(1, ["c''"]),
]
the basic format is String+Number like so : e4 bes16
the string can consist of multiple, at least one, [a-zA-Z] chars
the string is followed by zero or more digits: e bes g4 c16
the string is followed by zero or more ' or , (not combined): e' bes, f'''2 g,,4
the string can be substituted by a list of strings, list limiters are <>; the number comes behind the >, no space allowed
P.S. 2
The goal is NOT to create a Lilypond parser. It is really just for very short snippets with no additional functionality, no extensions to insert notes. If this does not work I would go for another (simplified) format like ABC. So anything that has to do with Lilypond ("Run it through lilypond, let it give out the music data in Scheme, parse that") or its toolchain is certainly NOT the answer to this question. The package is not even installed.
I know you are not looking for a general parser, but pyparsing makes this process very simple. Your format seemed very similar to the chemical formula parser that I wrote as one of the earliest pyparsing examples.
Here is your problem implemented using pyparsing:
from pyparsing import (Suppress,Word,alphas,nums,Combine,Optional,Regex,Group,
OneOrMore)
"""
List item
-the string can consist of multiple, at least one, [a-zA-Z] chars
-the string is followed by zero or more digits: e bes g4 c16
-the string is followed by zero or more ' or , (not combined):
e' bes, f'''2 g,,4
-the string can be substituted by a list of strings, list limiters are <>;
the number comes behind the >, no space allowed
"""
LT,GT = map(Suppress,"<>")
integer = Word(nums).setParseAction(lambda t:int(t[0]))
note = Combine(Word(alphas) + Optional(Word(',') | Word("'")))
# or equivalent using Regex class
# note = Regex(r"[a-zA-Z]+('+|,+)?")
# define the list format of one or more notes within '<>'s
note_list = Group(LT + OneOrMore(note) + GT)
# each item is a note_list or a note, optionally followed by an integer; if
# no integer is given, default to 0
item = (note_list | Group(note)) + Optional(integer, default=0)
# reformat the parsed data as a (number, note_or_note_list) tuple
item.setParseAction(lambda t: (t[1],t[0].asList()) )
source = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
print(OneOrMore(item).parseString(source))
With this output:
[(4, ["c'"]), (8, ['d']), (16, ["e'", "g'"]), (4, ["fis'"]), (0, ['a,,']),
(0, ['g,', "b'"]), (1, ["c''"])]
Your first question can be answered in this way:
>>> import re
>>> t = "foo <bar e word> f ga <foo b>"
>>> t2 = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()
>>> t2
'#foo #<bar e word> #f #ga #<foo b>'
I added lstrip() to remove the single space that occurs before the result of this pattern. If you want to go with your first option, you could simply replace #< with <.
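Put together, both variants from the question can be produced as in the sketch below. The names marked_all and marked_words are only illustrative.

```python
import re

t = "foo <bar e word> f ga <foo b>"

# Replace the start of the string and every whitespace run that is NOT
# followed by a closing '>' before the next '<' (i.e. not inside <...>).
marked_all = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()

# First variant from the question: leave the <...> groups unmarked.
marked_words = marked_all.replace("#<", "<")

print(marked_all)    # #foo #<bar e word> #f #ga #<foo b>
print(marked_words)  # #foo <bar e word> #f #ga <foo b>
```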
Your second question can be solved in the following manner, although you might need to think about the , in a list like ['g,', "b'"]. Should the comma from your string be there or not? There may be a faster way. The following is merely proof of concept. A list comprehension might take the place of the final element, although it would be fairly complicated.
>>> s = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
>>> q2 = re.compile(r"(?:<)\s*[^>]*\s*(?:>)\d*|(?<!<)[^\d\s<>]+\d+|(?<!<)[^\d\s<>]+")
>>> s2 = q2.findall(s)
>>> s3 = [re.sub(r"\s*[><]\s*", '', x) for x in s2]
>>> s4 = [y.split() if ' ' in y else y for y in s3]
>>> s4
["c'4", 'd8', ["e'", "g'16"], "fis'4", 'a,,', ['g,', "b'"], "c''1"]
>>> q3 = re.compile(r"([^\d]+)(\d*)")
>>> s = []
>>> for item in s4:
...     if type(item) == list:
...         lis = []
...         for elem in item:
...             lis.append(q3.search(elem).group(1))
...             if q3.search(elem).group(2) != '':
...                 num = q3.search(elem).group(2)
...         if q3.search(elem).group(2) != '':
...             s.append((num, lis))
...         else:
...             s.append((0, lis))
...     else:
...         if q3.search(item).group(2) != '':
...             s.append((q3.search(item).group(2), [q3.search(item).group(1)]))
...         else:
...             s.append((0, [q3.search(item).group(1)]))
>>> s
[('4', ["c'"]), ('8', ['d']), ('16', ["e'", "g'"]), ('4', ["fis'"]), (0, ['a,,']), (0, ['g,', "b'"]), ('1', ["c''"])]
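As a more compact alternative, the whole conversion can be sketched with a single tokenizing regex and finditer. This is not the approach used above, just one possible shape under the format rules stated in the question. The names TOKEN and parse_notes are made up, durations are returned as int with 0 as the default, and any text the regex does not recognize is silently skipped.

```python
import re

TOKEN = re.compile(
    r"<([^<>]*)>(\d*)"              # chord: <note note ...> plus optional duration
    r"|([a-zA-Z]+(?:'+|,+)?)(\d*)"  # single note plus optional duration
)


def parse_notes(source):
    """Convert a simple Lilypond-like string to (duration, pitches) tuples."""
    result = []
    for m in TOKEN.finditer(source):
        if m.group(1) is not None:
            # Chord: the pitches are whitespace-separated inside <...>.
            pitches, duration = m.group(1).split(), m.group(2)
        else:
            pitches, duration = [m.group(3)], m.group(4)
        result.append((int(duration) if duration else 0, pitches))
    return result


source = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
parsed = parse_notes(source)
print(parsed)
```

As in the pyparsing answer, the comma in g, is kept as part of the pitch.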
