Python - match all repetition of a char between groups of chars - python

I've a little problem with a regex best described below:
Original string is:
{reply_to={message_type=login}|login_id=pippo|user_description=pippo=pluto|version=2013.2.1|utc_offset=7200|login_date=2014-07-03|login_time=09:43:02|error=0}
This is what I would like to obtain:
{reply_to:{message_type:login}|login_id:pippo|user_description:pippo=pluto|version:2013.2.1|utc_offset:7200|login_date:2014-07-03|login_time:09:43:02|error:0}
It happens that if there is an "=" also in the value of the key I cannot substitute it.
What I've tried to do is to match and substitute grouping a set of chars:
re.sub(r'([\{\}\|])=([\{\}\|])',r'\1":"\2',modOutput)
Obviously it doesn't work! Any idea?

This works at least with the given example:
re.sub(r'=([^{|}]*)', r':\1', s)
We're looking for a =, then capturing up to the next delimiter (one of {|}) in order to skip over subsequent = signs.
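As a quick check of the substitution above, here is a minimal sketch run on the string from the question. The names `s` and `converted` are only illustrative.

```python
import re

# Input string taken from the question.
s = ("{reply_to={message_type=login}|login_id=pippo|"
     "user_description=pippo=pluto|version=2013.2.1|utc_offset=7200|"
     "login_date=2014-07-03|login_time=09:43:02|error=0}")

# Replace each '=' that starts a value. The capture group swallows everything
# up to the next delimiter ({, | or }), including any further '=' signs,
# so only the first '=' of each key/value pair is rewritten.
converted = re.sub(r'=([^{|}]*)', r':\1', s)
print(converted)
```

This relies on the format shown in the question: a value that itself contains one of the delimiter characters would end the capture early.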

Related

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the alternation, but I want the regex to output a single group (to keep the code independent of the regex).
I could also run a second regex to extract the timestamp from the group that includes the quotation marks, but again that would break the code abstraction.
I could omit the "" in the regex, but then it would also match a timestamp in the middle of the message. I want to capture the timestamp only when it is the value of a key, or when it is at the beginning of the document followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (which I didn't figure out), probably with the same solution!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
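To see the lookaround pattern at work, here is a minimal sketch that runs it on the two sample messages from the question. The variable names are only illustrative.

```python
import re

# Lookarounds assert the surrounding characters without consuming them,
# so the whole match (group 0) is just the ten digits.
rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

msg_plain = "1545994641 INFO: ..."
msg_json = '{"deliveryDate":"1545994641","error"..."}'

plain_ts = rx.search(msg_plain).group(0)
json_ts = rx.search(msg_json).group(0)
print(plain_ts, json_ts)
```

In both cases the match contains neither the trailing space nor the quotation marks.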
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
If an opening " must always be paired with a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))
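For comparison, here is a rough sketch of the first, group-based pattern from this answer next to its final lookaround version. The message in `stray` is a made-up example (it is not from the question), added only to show where the two differ.

```python
import re

rx_group = re.compile(r'"?(\d{10})(?:\"|\s)')                # read group(1)
rx_final = re.compile(r'(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))')  # read group(0)

msg_plain = "1545994641 INFO: ..."
msg_json = '{"deliveryDate":"1545994641","error"..."}'

group_results = [rx_group.search(m).group(1) for m in (msg_plain, msg_json)]
final_results = [rx_final.search(m).group(0) for m in (msg_plain, msg_json)]

# Hypothetical message: the timestamp sits in the middle of the text.
stray = "retry 1545994641 later"
stray_group = rx_group.search(stray)   # matches, which the asker wanted to avoid
stray_final = rx_final.search(stray)   # no match: not at the start, not quoted

print(group_results, final_results, stray_group, stray_final)
```

Both patterns return the bare timestamp for the question's inputs, but only the anchored lookaround version rejects a timestamp in the middle of a message.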

returning info from a string

return poker_hand(list_of_five_cards) returns a string similar to this:
**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)
I have converted it to a string, and I want the information inside the brackets. To that end I have tried:
s = str(poker_hand(one_man))
print s
the_search = re.search(r"\((\w+)\)", s)
This prints None when I run print the_search. I have also tried
s[s.find("(")+1:s.find(')')]
print s
which returns the whole string. Does anyone know what I am doing wrong?
EDIT: Sorry for the confusion, I should have been clearer.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
Re the assigning: I'm trying to assign it now and will post the results.
The pattern you are using to find the item in brackets is not right: \w+ matches neither the space nor the period inside the brackets, so the search fails.
You can test your regex at http://regexr.com/
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
pattern = r'\(.+\.\)'
for item in re.findall(pattern, s):
    print(item.strip('().'))
output:
One pair
IIUC, your string always ends with the closing bracket. Then try this:
'**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'.split('(')[1][:-1]
Out[1]: 'One pair.'
The idea is to split by the opening brackets, taking what's after, and deleting the closing brackets.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
You can use something like:
import re
string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
result = re.findall(r"\((.*?)\.?\)", string )
print(result[0])
Regex Explanation:
\((.*?)\.?\)
Match the character “(” literally «\(»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “.” literally «\.?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “)” literally «\)»
Use the groups:
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
print (s)
m = re.search(r'\(([\s\S]+)\.\)', s)
print(m.group(1))
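To tie the answers back to the two attempts in the question, here is a small sketch showing why each attempt misbehaved. The pattern `\(([^)]*)\)` is just one more possible alternative (it is not taken from the answers above), and the variable names are illustrative.

```python
import re

s = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"

# \w+ matches only letters, digits and underscores, so it cannot cross the
# space or the period inside the brackets: the search finds nothing.
original = re.search(r"\((\w+)\)", s)

# Capturing everything up to the closing bracket works.
m = re.search(r"\(([^)]*)\)", s)
inside = m.group(1)
label = inside.rstrip(".")   # drop the trailing period

# The slicing attempt was fine too; its result just has to be assigned,
# otherwise printing s shows the unchanged original string.
sliced = s[s.find("(") + 1:s.find(")")]

print(original, repr(label), repr(sliced))
```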

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in unicode in the format C7H19N3
I wish to isolate the position of the first digit after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.
My first couple of attempts had me looping through the string, trying to isolate the position of only the first digits, but to no avail.
I think that regular expressions can accomplish this, though I'm quite lost with them.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.
How about this?
>>> re.sub('(\d+)', 'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent but perhaps more expressive of the intent? \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2').
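As a side-by-side check, the sketch below runs the three substitutions from the answers on the formula from the question. The second formula, `Na2SO4`, is a hypothetical input that does not appear in the question. It is included only to show that the letter-based pattern assumes the digit follows an uppercase letter.

```python
import re

formula = "C7H19N3"

by_digits = re.sub(r'(\d+)', r'sub\g<1>', formula)         # whole digit runs
by_lookaround = re.sub(r'(?<!\d)(?=\d)', 'sub', formula)   # zero-width positions
by_letter = re.sub(r'([A-Z])(\d)', r'\1sub\2', formula)    # uppercase letter + digit

# Hypothetical input with a two-letter element symbol.
other = "Na2SO4"
other_digits = re.sub(r'(\d+)', r'sub\g<1>', other)
other_letter = re.sub(r'([A-Z])(\d)', r'\1sub\2', other)

print(by_digits, by_lookaround, by_letter)
print(other_digits, other_letter)
```

All three agree on the question's input. On the hypothetical `Na2SO4`, the `[A-Z]` variant skips the digit that follows the lowercase `a`, so the choice depends on which formulas are expected.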

Why is there an extra result handed back to me during this Python regex example?

Code:
re.findall('(/\d\d\d\d)?','/2000')
Result:
['/2000', '']
Code:
re.findall('/\d\d\d\d?','/2000')
Result:
['/2000']
Why is the extra '' returned in the first example?
I am using the first example for a Django URL configuration. Is there a way I can prevent the matching of ''?
Because using the brackets you define a group, and then with ? you ask for 0 to 1 repetitions of the group. Thus the empty string and /2000 both match.
The ? operator matches 0 or 1 repetitions of the preceding expression. In the first case the preceding expression is (/\d\d\d\d), while in the second it is the last \d.
Therefore, in the first case the empty string "" is matched, as it contains zero repetitions of the expression (/\d\d\d\d).
Here is what is happening: The regex engine starts off with its pointer before the first char in the target string. It greedily consumes the whole string and places the match result in the first list element. This leaves the internal pointer at the end of the string. But since the regex pattern can match nothingness, it successfully matches at the position at the end of the string too. Thus, there are two elements in the list.
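The two matches described above can be made visible with `re.finditer`, which reports where each match starts and ends. This is a minimal sketch; filtering out empty strings is just one possible way to drop the extra result, not necessarily the right fix for a URL configuration.

```python
import re

text = "/2000"

# The whole group is optional, so the pattern can also match "nothing".
optional_group = re.findall(r"(/\d\d\d\d)?", text)
spans = [m.span() for m in re.finditer(r"(/\d\d\d\d)?", text)]

# Here only the last digit is optional, so an empty match is impossible.
optional_digit = re.findall(r"/\d\d\d\d?", text)

# One way to ignore the empty match.
non_empty = [match for match in optional_group if match]

print(optional_group, spans, optional_digit, non_empty)
```

The second span is `(5, 5)`: a zero-length match at the very end of the string, which is where the `''` comes from.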

Regex match even number of letters

I need to match an expression in Python with regular expressions that only matches even number of letter occurrences. For example:
AAA # no match
AA # match
fsfaAAasdf # match
sAfA # match
sdAAewAsA # match
AeAiA # no match
An even number of As SHOULD match.
Try this regular expression:
^[^A]*((AA)+[^A]*)*$
And if the As don’t need to be consecutive:
^[^A]*(A[^A]*A[^A]*)*$
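A quick way to check the second pattern is to run it against the examples listed in the question. This is a minimal sketch, and the names `even_as`, `cases` and `results` are illustrative.

```python
import re

# Non-A text, then any number of blocks that each contain exactly two As.
even_as = re.compile(r'^[^A]*(A[^A]*A[^A]*)*$')

# Expected outcomes as given in the question.
cases = {
    "AAA": False,
    "AA": True,
    "fsfaAAasdf": True,
    "sAfA": True,
    "sdAAewAsA": True,
    "AeAiA": False,
}

results = {text: bool(even_as.match(text)) for text in cases}
print(results)

# Zero is an even number too, so a string without any A also matches.
no_as = bool(even_as.match("xyz"))
```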
This searches for a block with an odd number of consecutive A's. If it finds one, the string should be rejected:
(?<!A)A(AA)*(?!A)
If I understand correctly, the Python code should look like:
if re.search("(?<!A)A(AA)*(?!A)", "AeAAi"):
    print("fail")
Why work so hard coming up with a hard to read pattern? Just search for all occurrences of the pattern and count how many you find.
len(re.findall("A", "AbcAbcAbcA")) % 2 == 0
That should be instantly understandable by all experienced programmers, whereas a pattern like "(?
Simple is better.
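The counting approach can be wrapped in a small helper and checked against the examples from the question. The function name `has_even_count` and its `letter` parameter are made up for this sketch.

```python
import re

def has_even_count(text, letter="A"):
    """Return True if `letter` occurs an even number of times in `text`."""
    # re.escape keeps the helper safe for letters that are regex metacharacters.
    return len(re.findall(re.escape(letter), text)) % 2 == 0

# Expected outcomes as given in the question.
cases = {
    "AAA": False,
    "AA": True,
    "fsfaAAasdf": True,
    "sAfA": True,
    "sdAAewAsA": True,
    "AeAiA": False,
}

results = {text: has_even_count(text) for text in cases}
print(results)
```

For a single literal character, `text.count(letter) % 2 == 0` gives the same answer without a regex. Note that a string with no occurrence at all counts as even here.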
'A*' means match any number of A's. Even 0.
Here's how to match a string with an even number of a's, upper or lower:
re.compile(r'''
^
[^a]*
(
(
a[^a]*
){2}
# if there must be at least 2 (not just 0), change the
# '*' on the following line to '+'
)*
$
''',re.IGNORECASE|re.VERBOSE)
You probably are using a as an example. If you want to match a specific character other than a, replace a with %s and then insert
[...]
$
'''%( other_char, other_char, other_char )
[...]
'*' means 0 or more occurrences
'AA' should do the trick.
The question is if you want the thing to match 'AAA'. In that case you would have to do something like:
r = re.compile('(^|[^A])(AA)+(?!A)')
r.search(p)
That would work for matching an even (and only even) number of 'A'.
Now if you want to match 'if there is any even number of subsequent letters', this would do the trick:
re.compile(r'(.)\1')
However, this wouldn't exclude the 'odd' occurrences. But it is not clear from your question if you really want that.
Update:
This works for those of your test cases in which the As come in adjacent pairs (it does not match sAfA or sdAAewAsA):
re.compile('^([^A]*)AA([^A]|AA)*$')
First of all, note that /A*/ matches the empty string.
Secondly, there are some things that you just can't do with regular expressions. This'll be a lot easier if you just walk through the string and count up all occurrences of the letter you're looking for.
A* means match "A" zero or more times.
For an even number of "A", try: (AA)+
It's impossible to count arbitrarily using regular expressions, for example to make sure that you have matching parentheses. To count you need 'memory', which requires something at least as strong as a pushdown automaton, although in this case you can use the regular expression that @Gumbo provided.
The suggestion to use finditer is the best workaround for the general case.
