How to group inside "or" matching in a regex? - python

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or,
but I just want it to output a single group (to abstract the code
from the regex).
I could also use a 2nd step of regex to capture the timestamp from
the group that has the quotation marks, but again that would break
the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message. I want to capture the timestamp only when it is the value of a key or when it is at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (which I haven't figured out), probably with the same solution!

You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
See the regex demo.
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
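As a quick check of the lookaround pattern above, here is a minimal sketch that runs it against the two sample documents from the question. The names samples and timestamps are illustrative and not part of the original answer.

```python
import re

# Whole-match pattern from the answer above: no capture groups are needed.
rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

# The two sample documents, copied from the question as-is.
samples = [
    '1545994641 INFO: ...',
    '{"deliveryDate":"1545994641","error"..."}',
]

# group(0) is the entire match; the lookarounds keep the trailing
# space and the surrounding quotes out of it.
timestamps = [rx.search(s).group(0) for s in samples]
print(timestamps)
```

Because the lookarounds consume nothing, the calling code can keep reading group(0) for both document kinds.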

You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
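A small sketch of how this group-based variant might be used, assuming the calling code is allowed to read group(1) instead of group(0). The list name captured is illustrative.

```python
import re

# Pattern from the answer above: the digits sit alone in capture group 1.
regex = re.compile(r'"?(\d{10})(?:"|\s)')

captured = []
for msg in ('1545994641 INFO: ...',
            '{"deliveryDate":"1545994641","error"..."}'):
    match = regex.search(msg)
    # group(0) still includes the quote or space; group(1) is digits only.
    captured.append(match.group(1))

print(captured)
```

Note that this version is not anchored, so ten digits followed by whitespace anywhere in a message would also match.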
EDIT:
If an opening " must always be paired with a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
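The same idea can be generalised to any maximum run length. The sketch below is one possible way to do that; the helper name cap_repeats and its max_repeats parameter are hypothetical and not from the answer.

```python
import re

def cap_repeats(text, max_repeats=2):
    """Trim runs of a repeated word character to at most max_repeats.

    Hypothetical helper: builds (\\w)\\1{n,} for the chosen n and
    replaces each over-long run with n copies of the character.
    """
    pattern = r'(\w)\1{%d,}' % max_repeats
    return re.sub(pattern, lambda m: m.group(1) * max_repeats, text)

print(cap_repeats("aaaaaaaaare yooooooooou okkkkkk"))
```

With max_repeats=2 this builds exactly the pattern from the answer, (\w)\1{2,}.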
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
Demo
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
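The breakdown above maps directly onto Python's re.VERBOSE mode, which allows the comments to live inside the pattern. This is a sketch of the same lookahead approach; the variable names are illustrative.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part is commented.
rgx = re.compile(r"""
    (\w)        # one word character, saved to capture group 1
    \1*         # zero or more further copies of that character
    (?=\1\1)    # lookahead: two more copies must follow the match
""", re.VERBOSE)

string = "aaaaaaaaare yooooooooou okkkkkk"

# Deleting every match leaves only the last two characters of each long run.
result = rgx.sub('', string)
print(result)
```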

Using Regex to move some letter of a string to a new location in the same string in a Series of strings in python

I have a list of 4000 strings. The naming convention needs to be changed for each string and I do not want to go through and edit each one individually.
The list looks like this:
data = list()
data = ['V2-FG2110-EMA-COMPRESSION',
'V2-FG2110-SA-COMPRESSION',
'V2-FG2110-UMA-COMPRESSION',
'V2-FG2120-EMA-DISTRIBUTION',
'V2-FG2120-SA-DISTRIBUTION',
'V2-FG2120-UMA-DISTRIBUTION',
'V2-FG2140-EMA-HEATING',
'V2-FG2140-SA-HEATING',
'V2-FG2140-UMA-HEATING',
'V2-FG2150-EMA-COOLING',
'V2-FG2150-SA-COOLING',
'V2-FG2150-UMA-COOLING',
'V2-FG2160-EMA-TEMPERATURE CONTROL']
I need each 'SA', 'UMA' and 'EMA' to be moved to before the -FG.
Desired output is:
V2-EMA-FG2110-Compression
V2-SA-FG2110-Compression
V2-UMA-FG2110-Compression
...
The V2-FG2 does not change throughout the list so I have started there and I tried re.sub and re.search but I am pretty new to python so I have gotten a mess of different results. Any help is appreciated.
You can rearrange the strings.
new_list = []
for word in data:
    arr = word.split('-')
    new_word = '%s-%s-%s-%s' % (arr[0], arr[2], arr[1], arr[3])
    new_list.append(new_word)
You can replace matches of the following regular expression with the contents of capture group 1:
(?<=^[A-Z]\d)(?=.*(-(?:EMA|SA|UMA))(?=-))|-(?:EMA|SA|UMA)(?=-)
Demo
The regular expression can be broken down as follows.
(?<=^[A-Z]\d) # current string position must be preceded by a capital
# letter followed by a digit at the start of the string
(?= # begin a positive lookahead
.* # match >= 0 chars other than a line terminator
(-(?:EMA|SA|UMA)) # match a hyphen followed by one of the three strings
# and save to capture group 1
(?=-) # the next char must be a hyphen
) # end positive lookahead
| # or
-(?:EMA|SA|UMA) # match a hyphen followed by one of the three strings
(?=-) # the next character must be a hyphen
(?=-) is a positive lookahead.
Evidently this may not work for versions of Python prior to 3.5, because the match in the second part of the alternation does not assign a value to capture group 1: "Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string." This quote is from @WiktorStribiżew's answer at the link. For what it's worth, I confirmed that Ruby has the same behaviour ("V2-FG2110-EMA-COMPRESSION".gsub(rgx,'\1') #=> "V2-EMA-FG2110-COMPRESSION").
One could of course instead replace matches of (?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-) with $2 + $1 (written \2\1 in Python's re.sub). That's probably more sensible even if it's less interesting.
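A minimal sketch of that simpler two-group swap in Python, where the replacement "$2 + $1" is spelled \2\1. The list below is a subset of the question's data; the names pattern and renamed are illustrative.

```python
import re

data = ['V2-FG2110-EMA-COMPRESSION',
        'V2-FG2120-SA-DISTRIBUTION',
        'V2-FG2160-EMA-TEMPERATURE CONTROL']

# Group 1 is the -FGxxxx part, group 2 is the -EMA/-SA/-UMA part.
pattern = re.compile(r'(?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-)')

# Emit the two groups in swapped order; the rest of the string is untouched.
renamed = [pattern.sub(r'\2\1', s) for s in data]
print(renamed)
```

Both groups always participate in a match here, so the pre-3.5 behaviour for unmatched groups mentioned above does not come into play.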

Matching character "/" in a string

This is my first post and I am a newbie to Python. I am trying to get this to work.
string 1 = [1/0/1, 1/0/2]
string 2 = [1/1, 1/2]
I am trying to check each string: if I see two /, I just need to replace the 0 with 1 so it becomes 1/1/1 and 1/1/2.
If I don't have two /, I need to add one along with a 1 and change it to the format 1/1/1 and 1/1/2, so string 2 becomes [1/1/1,1/1/2].
The ultimate goal is to get all strings to match the pattern x/1/x. Thanks for all the input on this. I tried this and it seems to work:
for a in Port:
    if re.search(r'././', a):
        z.append(a.replace('/0/', '/1/'))
    else:
        t1 = a.split('/')
        if len(t1) > 1:
            t2 = t1[0] + "/1/" + t1[1]
            z.append(t2)
A few lines are there to take care of some exceptions, but it seems to do the job.
The regex pattern for identifying a / is just \/ (in Python a plain / also works, since / is not a special character in a regex).
This could be solved rather simply using the built in string functions without having to add all of the overhead and additional computational time caused by using the RegEx engine.
For example:
# The string to test:
sTest = '1/0/2'
# Test the string:
if(sTest.count('/') == 2):
    # There are two forward slashes in the string
    # If the middle number is a 0, we'll replace it with a one:
    sTest = sTest.replace('/0/', '/1/')
elif(sTest.count('/') == 1):
    # One forward slash in string
    # Insert a 1 between first portion and the last portion:
    sTest = sTest.replace('/', '/1/')
else:
    print('Error: Test string is of an unknown format.')
# End If
If you really want to use RegEx, though, you could simply match the string against these two patterns: \d+/0/\d+ and \d+/\d+(?!/). If matching against the first pattern fails, then attempt to match against the second pattern. Then, you can use either grouping, splitting, or simply calling .replace() (like I'm doing above) to format the string as you need.
EDIT: for clarification, I'll explain the two patterns:
Pattern 1: \d+/0/\d+ could essentially be read as "match any number (consisting of one (1) or more digits) followed by a forward slash, a zero (0), another forward slash and then followed by any number (consisting of one (1) or more digits)."
Pattern 2: \d+/\d+(?!/) could be read as "match any number (consisting of one (1) or more digits) followed by a forward slash and any other number (consisting of one (1) or more digits) which is then NOT followed by another forward slash." The last part in this pattern could be a little confusing because it uses the negative lookahead abilities of the RegEx engine.
If you wanted to add stricter rules to these patterns to make sure there are not any leading or trailing non-digit characters, you could add ^ to the start of the patterns and $ to the end, to signify the start of the string and the end of the string respectively. This would also allow you to remove the lookahead expression from the second pattern ((?!/)). As such, you would end up with the following patterns: ^\d+/0/\d+$ and ^\d+/\d+$.
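One way the two anchored patterns could be combined into a single helper is sketched below. The function name normalize_port is hypothetical, the capture groups are added for convenience, and strings in any other format are simply returned unchanged, which is an assumption rather than something the question specifies.

```python
import re

def normalize_port(port):
    """Hypothetical helper: bring 'x/0/y' and 'x/y' into the form 'x/1/y'."""
    # First pattern: three parts with a 0 in the middle.
    m = re.match(r'^(\d+)/0/(\d+)$', port)
    if m:
        return '%s/1/%s' % m.groups()
    # Second pattern: only two parts, so insert the middle 1.
    m = re.match(r'^(\d+)/(\d+)$', port)
    if m:
        return '%s/1/%s' % m.groups()
    # Anything else is left as it is (assumption).
    return port

print([normalize_port(p) for p in ['1/0/1', '1/0/2', '1/1', '1/2']])
```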
https://regex101.com/r/rE6oN2/1
Click code generator on the left side. You get:
import re
p = re.compile(r'\d/1/\d')
test_str = "1/1/2"
re.search(p, test_str)

regex conditional matching

I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
however, some cases have shortened to 01-234-5 instead of 01-234-0005 when the last four digits are 3 zeros followed by a non-zero digit.
Since there doesn't seem to be any uniformity in formatting I had to account for a few different separator characters or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma; they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
#      ^ exactly nine digits
#         ^ or
#                            ^ sep not optional
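To see how this stricter pattern behaves, here is a short sketch run against a made-up sample string (the text and the name found are illustrative, not from the question's data).

```python
import re

# Pattern from the answer above: nine plain digits, or separated groups
# whose last group has one or four digits.
pattern = re.compile(
    r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b')

# Made-up sample: full form, shortened form, no separator, and a
# six-digit run with no separator that should be rejected.
text = "01-234-5678, 23:456:7, 012345678 and 012345"

# group(1) is the whole number; sep is None when the \d{9} branch matched.
found = [m.group(1) for m in pattern.finditer(text)]
print(found)
```

The trailing 012345 is skipped because the shortened form is only accepted when a separator is present.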
See this demo.
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire thing to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.
If there is no separator, a string such as "012340008" will still match the regex above, because [-:\s]? matches 0 or 1 times.
HTH

Referencing previous group possible within the same regex?

I am trying to perform a regex in Python. I want to match on a file path that does not have a domain extension and additionally, I only want to get those file paths that have 20 characters max after the last '\' of the file path. For example, given the data:
c:\users\docs\cmd.exe
c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
c:\users\docs\files\target
I would want to match on 'target', and not the other two lines. It should be noted that in my current situation, using the re module or python operations is not an option, as this regex is fed into the program (which uses re.match() ), so I have to do this within a regex string.
I have two regexes:
^([^.]+)$ will match the last 2 lines
([^\\]{,20}$) will match 'cmd.exe' and 'target'
How can I combine these two into one regex? I tried backreferencing (?P=, etc), but couldn't get it to work. Is this even possible?
How about \\([^\\.]{1,20})(?:$|\n)? It seems to work for me.
\\ is escaped literal backslash.
( start of capture group.
[^\\.] match anything except literal backslash or literal dot character
{1,20} match class 1-20 times, as many times as possible (greedy).
) end the capture group.
(?: starts a non-capturing group
$ match the end of the string.
| is the 'or' operator for this group
\n matches a line-feed or newline character (ASCII 10)
) end of non-capturing group
To create this, I used https://regex101.com/#python which is a very good resource in my opinion because it explains every part of the regex and neatly shows the captured groups in real time.
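The pattern \\([^\\.]{1,20})(?:$|\n) starts with a literal backslash, so it appears to rely on search-style matching (as on regex101). Since the question says the program calls re.match(), the sketch below compares the two; the variant with a leading .* is an adaptation added here, not part of the answer.

```python
import re

lines = [r'c:\users\docs\cmd.exe',
         r'c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf',
         r'c:\users\docs\files\target']

pattern = r'\\([^\\.]{1,20})(?:$|\n)'
anchored = r'.*' + pattern  # assumption: prefix so re.match can skip the path

def first_group(func, pat):
    # Return group(1) for each line, or None where there is no match.
    return [m.group(1) if m else None for m in (func(pat, x) for x in lines)]

with_search = first_group(re.search, pattern)
with_match = first_group(re.match, pattern)
with_match_anchored = first_group(re.match, anchored)
print(with_search, with_match, with_match_anchored)
```

re.match only tries the start of the string, which is why the unprefixed pattern finds nothing there.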
>>> s = r"""c:\users\docs\cmd.exe
... c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
... c:\users\docs\files\target""".split('\n')
>>> [re.match(r'.*\\([^.]{,20})$', x) for x in s]
[None, None, <_sre.SRE_Match object at 0x7f6ad9631558>]
also
>>> [re.findall(r'.*\\([^.]{,20})$', x) for x in s]
[[], [], ['target']]
This means:
.*\\ - grab everything up to and including the last \
([^.]{,20}) - make sure there are no . in the remaining up to 20 characters
$ - end of line
The () around the middle group indicate that it should be the group returned as the match
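A brief sketch that checks the 20-character limit at its boundary, using re.match as the question requires. The c:\dir\ paths are made up for illustration.

```python
import re

pattern = re.compile(r'.*\\([^.]{,20})$')

# Hypothetical paths: the last component is exactly 20 and 21 characters.
ok_path = 'c:\\dir\\' + 'a' * 20
too_long = 'c:\\dir\\' + 'a' * 21

ok_match = pattern.match(ok_path)
long_match = pattern.match(too_long)

# Only the 20-character name should match, with the name in group 1.
print(ok_match.group(1) if ok_match else None, long_match)
```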
